@Rumia 2017-03-10T08:57:39.000000Z

Research of Bioinformatics Data Compression Methods



1. Compression for .fastq

(1) .fastq format

A FASTQ file normally uses four lines per sequence.

A FASTQ file containing a single sequence might look like this:

  @SEQ_ID
  GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
  +
  !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
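
The four-line layout can be read with a very small parser. The following is a minimal sketch, assuming every record uses exactly four lines (no wrapped sequences); the names `FastqRecord` and `read_fastq` are illustrative and do not come from any of the tools discussed here.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header line without the leading '@'
    sequence: str  # base calls
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# The single-sequence example from the text above.
example = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(io.StringIO(example)))
```

The length check reflects the one structural rule the format guarantees: line 4 carries exactly one quality character per base on line 2.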

The character '!' represents the lowest quality while '~' is the highest. Here are the quality value characters in left-to-right increasing order of quality (ASCII):

  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
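
To make the ordering concrete, a quality character can be turned into a numeric score by subtracting an ASCII offset. The sketch below assumes the Phred+33 encoding (offset 33, so '!' maps to 0) and the usual Phred relation between score and error probability; older encodings use a different offset, and the helper names are illustrative.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Phred+33 encoding, where '!' (ASCII 33) means score 0


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(score: int) -> float:
    """Phred relation: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


scores = phred_scores("!5I~")
probabilities = [error_probability(q) for q in scores]
```

Under this assumption '!' is score 0 and '~' is score 93, which matches the lowest-to-highest ordering listed above.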

Example:

  @HISEQ:210:H9L41ADXX:1:1215:6810:37655
  CTCCAGCACCAAAAAAGAAAAAAAAAAAAAGAAAAGAAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAG
  +
  ?@?BDEADFDFAFGEGAEGG9FFE@@0;55',6>>3;5;;;A>A<55<?B><??C?<?AABA322(3<@??BBBAB?BC?@8CB@AC<(2>ABBC22898

(2) Corresponding tools

i. fqzcomp

  fqz_comp v4.6. Author James Bonfield, 2011-2013
  The range coder is derived from Eugene Shelwien.
  To compress:
    fqz_comp [options] [input_file [output_file]]
      -Q <num>    Perform lossy compression with all quality values
                  being within 'num' distance from their original value.
                  Default is lossless, i.e. "-q 0"
      -s <level>  Sequence compression level. 1-9 [Def. 3]
                  Specifying '+' on the end (eg -s5+) will use
                  models of multiple sizes for improved compression.
      -b          Use both strands in sequence hash table.
      -e          Extra seq compression: 16-bit vs 8-bit counters.
      -q <level>  Quality compression level. 1-3 [Def. 2]
      -n <level>  Name compression level. 1-2 [Def. 2]
      -P          Disable multi-threading
      -X          Disable generation/verification of check sums
      -S          SOLiD format
  To decompress:
    fqz_comp -d < foo.fqz > foo.fastq
    or fqz_comp -d foo.fqz foo.fastq

The -s, -q, and -n options presumably affect the compression ratio, but the help text does not say by how much. A series of tests was run to find out, and the results are shown in the figures below:

Result of a simple test:

  [root@localhost fastq]# time fqz_comp -s6 -n1 -q3 -e -b ENCFF002EXT.fastq out.fqz
  Names 71134666 -> 3367344 (0.047)
  Bases 117600900 -> 12974412 (0.110)
  Quals 117600900 -> 28828099 (0.245)
  real 0m9.370s
  user 0m16.791s
  sys 0m0.478s
  [root@localhost fastq]# time fqz_comp -d out.fqz out.revert.fq
  real 0m12.860s
  user 0m20.844s
  sys 0m0.502s

ENCFF002EXT.fastq is 299MB and the compressed file out.fqz is about 44MB, so the compression ratio is roughly 1/7. Compression takes about 9.37sec of wall-clock time, which works out to roughly 32MB/sec. Decompression is slower: it takes about 12.86sec, or roughly 23MB/sec.
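
The figures above can be rechecked directly from the numbers fqz_comp printed. This is only arithmetic on the logged values; the 299MB file size is the one quoted in the text, and the variable names are illustrative.

```python
# Byte counts copied from the fqz_comp log above.
original_bytes = {"names": 71134666, "bases": 117600900, "quals": 117600900}
compressed_bytes = {"names": 3367344, "bases": 12974412, "quals": 28828099}

# Per-stream ratio (compressed / original), as fqz_comp reports in parentheses.
stream_ratio = {
    key: compressed_bytes[key] / original_bytes[key] for key in original_bytes
}

total_compressed = sum(compressed_bytes.values())

# Share of the compressed output taken by each stream.
stream_share = {
    key: compressed_bytes[key] / total_compressed for key in compressed_bytes
}

# Average cost in bits per base call and per quality value.
bits_per_base = compressed_bytes["bases"] * 8 / original_bytes["bases"]
bits_per_qual = compressed_bytes["quals"] * 8 / original_bytes["quals"]

# Throughput relative to the uncompressed file size quoted in the text.
file_size_mb = 299.0
compress_speed = file_size_mb / 9.370     # MB per second of wall-clock time
decompress_speed = file_size_mb / 12.860  # MB per second of wall-clock time

for key in original_bytes:
    print(f"{key}: ratio {stream_ratio[key]:.3f}, share {stream_share[key]:.1%}")
print(f"bits/base {bits_per_base:.2f}, bits/quality {bits_per_qual:.2f}")
print(f"compress {compress_speed:.1f} MB/s, decompress {decompress_speed:.1f} MB/s")
```

The quality stream accounts for well over half of the compressed output in this run, which is why quality handling dominates the achievable ratio.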

ii. scalce

Like fqzcomp, scalce splits a .fastq file into three streams (names, sequences, qualities) and processes each one separately. Unlike fqzcomp, scalce focuses on lossy compression of the qualities.
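
The three-stream idea can be illustrated independently of either tool. The sketch below splits four-line FASTQ records into name, sequence, and quality streams and compresses each one on its own. It uses `zlib` purely as a stand-in general-purpose compressor; it is not the modelling that fqzcomp or scalce apply to each stream, and the function names are hypothetical.

```python
import zlib
from typing import Dict, Iterable, List


def split_streams(lines: Iterable[str]) -> Dict[str, bytes]:
    """Separate four-line FASTQ records into three newline-joined byte streams."""
    parts: Dict[str, List[str]] = {"names": [], "sequences": [], "qualities": []}
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        position = index % 4
        if position == 0:
            parts["names"].append(line[1:])  # drop the leading '@'
        elif position == 1:
            parts["sequences"].append(line)
        elif position == 3:
            parts["qualities"].append(line)
        # position == 2 is the '+' separator and carries no information here
    return {key: "\n".join(values).encode("ascii") for key, values in parts.items()}


def compress_streams(streams: Dict[str, bytes]) -> Dict[str, bytes]:
    """Compress each stream independently (zlib is only a placeholder codec)."""
    return {key: zlib.compress(data, 9) for key, data in streams.items()}


sample = [
    "@read1", "ACGTACGT", "+", "IIIIHHHH",
    "@read2", "ACGTTTGT", "+", "IIIHHHGG",
]
streams = split_streams(sample)
packed = compress_streams(streams)
restored = {key: zlib.decompress(blob) for key, blob in packed.items()}
```

Keeping the streams apart lets each one be handled by a method suited to its statistics: names are highly repetitive, sequences use a four-letter alphabet, and qualities are the noisiest of the three.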

Result of a simple test:

  [root@localhost fastq]# scalce ENCFF002EXT.fastq -o out
  SCALCE 2.8 [pthreads; available cores=5]
  Buffer size: 131072K, bucket storage size: 4194304K
  Allocating pool... OK!
  Preprocessing FASTQ files ...
  Paired end #1, quality offset: 33
  read length: 100
  OK
  Using 4 threads...
  Done with file ENCFF002EXT.fastq, 1176009 reads found
  Created temp file 0
  Merging results ... 1
  Generating 1 scalce file(s) from temp #0 compression: pigz
  shrink factor for 0: 1
  Written 234255 cores
  Read bit size: paired end 0 = 0.77
  Quality bit size: paired end 0 = 2.32
  Cleaning ...
  Statistics:
  Total number of reads: 1176009
  Read length: first end 100
  Unbucketed reads count: 35, bucketed percentage 100.00
  Lossy percentage: 0
  Time elapsed: 00:00:12
  Compression time: 00:00:04
  Original size: 298.76M, new size: 52.63M, compression factor: 5.68
  Done!

scalce compresses the original file into three files, out_1.scalcen, out_1.scalcer, and out_1.scalceq (names, reads, and qualities respectively):

  -rw-------. 1 root root 9.3M Mar 10 18:38 out_1.scalcen
  -rw-------. 1 root root 11M Mar 10 18:38 out_1.scalcer
  -rw-r--r--. 1 root root 33M Mar 10 18:38 out_1.scalceq