Tuesday, October 16, 2012

How to demultiplex fastq files with a dedicated, separate barcode file

Oh, Barcodes, How do I obtain thee?  Let me count the ways:
1) Internal to one of the reads
2) Printed in the header:
3) Encoded in the header
4) In a separate file

There are many tools out there to deal with multiplexed read data.
fastx_barcode_splitter.pl (from FASTX_Toolkit)
fastq-multx (from ea-utils)
split_libraries_fastq.py (from QIIME)

As far as I know, none of the above adequately deal with case (4) above.  The QIIME tools does, but appears to turn the fastq into a fasta file in the process (?).

Here is a one-liner homebrew solution to this situation:  just tack the index reads onto the front of read1, and then use an off-the-shelf tool (I like the FASTX_Toolkit one above) to demultiplex.  Like this:

paste -d '' <(echo; sed -n '1,${n;p;}' R2.fq | sed G) R1.fq | sed '/^$/d' | fastx_barcode_splitter.pl --bol --bcfile barcodes.txt --prefix debarcoded

I am assuming that R2.fq is the barcode fastq file, and R1.fq is the is the Read 1 file.  The commands above take the 2nd and 4th lines from the barcode fastq file, spaces them out correctly, pastes them together (with no delimiter - note: '' is a double single-quote, not a single double-quote), and then uses another paste command to pre-pend THAT result to the R1 file.  Also there is a final sed command to remove the trailing blank lines.  It ain't pretty, but it works.

The fancy <(  ) constructs is a bash thing called "process substitution".  If you're not using bash, it won't work.

Hope this helps someone.

(edit:  I simplified the initial sed command on 2/1/2013, so let me know if it somehow broke, but it should still work fine.)

3 comments:

  1. This worked great and saved me a bunch of regex hair pulling. I actually had dual indexed reads I wanted to plop into qiime, so rather than piping into fastx splitter, I wrote that output to a new file, then repeated the command with the second index and the new, prepended reads. Concatenated my indices in the mapping file, and used the old split_libraries.py to demultiplex (after converting from fastq to fasta with fastx toolkit).

    The downside is that I don't get the benefit of the quality filter as in split_libraries_fastq.py, but you can either do that first in fastx-toolkit, or if you only care about one of your sequence reads, you could instead just concatenate the indices together and use that as input for splitting as fastqs. However, I would like to be able to use pandaseq or fastq-join in a simple fashion to utilize both reads, but as you may know, by removing unjoined reads from your fastq, your sequences are no longer in the same order as your index reads, and qiime hates this. Now if you quality filter your reads with fastx, then prepend read1 with your index (or indices), then perform fastq-join, then you can still input quality-filtered data into qiime, now with joined PE reads.

    ReplyDelete
  2. Thank you - that was just what I needed.
    I had to replace [paste -d '' ] with [paste -d '\0'] to stop bash complaining about missing delimiters - just in case somebody else has that problem.

    ReplyDelete
  3. Hi, I'm having a problem with situation 3) or maybe 2) (not sure what the encoded is).
    I have the barcode in the header of the read. We used 6 bases barcodes, but the sequence is 8 bases long (probably to accomodate for longer barcodes in other experiments). Our barcode is the first 6 bases of those 8.
    I tried the ea-utils toolkit, but 88% of the reads are umatched. . The log of ea-utils says "End used: end", so I believe it was looking for the barcode at the 3' of the read.
    I haven't found any option to specify "look in the header". Moreover, I am afraid the extra 2 bases in the header might mess up with the ea-utils software.

    Anyone with similar issue?
    Cheers

    ReplyDelete