@Rumia 2017-03-10T08:57:39.000000Z Word count 4026 Views 814

Research of Bioinformatics Data Compression Methods

- 1. Compression for .fastq
  - (1) .fastq format
  - (2) Corresponding tools
    - i. fqzcomp
    - ii. scalce

1. Compression for .fastq

(1) .fastq format

A FASTQ file normally uses four lines per sequence.

Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

A FASTQ file containing a single sequence might look like this:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
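
The four-line layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes every record occupies exactly four lines (no wrapped sequences), and the names `FastqRecord` and `read_fastq` are made up for illustration rather than taken from any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # identifier line without the leading '@'
    sequence: str  # raw sequence letters
    quality: str   # one quality character per sequence letter


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a strictly four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of input
            return
        header = header.rstrip("\n")
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' separator line, got {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# The single-record example from the text.
EXAMPLE = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(io.StringIO(EXAMPLE)))
print(records[0].name, len(records[0].sequence))
```

The length check mirrors the rule that Line 4 must contain the same number of symbols as Line 2.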

The character '!' represents the lowest quality while '~' is the highest. Here are the quality value characters in left-to-right increasing order of quality (ASCII):

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
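
To turn these characters into numbers, a small sketch follows. It assumes the common Sanger convention (Phred+33), where the score is the ASCII code minus 33 and a score Q corresponds to an error probability of 10^(-Q/10). Neither the offset nor the formula is stated in the text above, so treat both as assumptions; the function names are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Map each quality character to an integer score (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(score: int) -> float:
    """Estimated probability that a base call is wrong (assumed Phred scale)."""
    return 10 ** (-score / 10)


# '!' is the lowest character in the range and '~' the highest.
lowest, highest = phred_scores("!~")
print(lowest, highest)
```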

Example:

@HISEQ:210:H9L41ADXX:1:1215:6810:37655
CTCCAGCACCAAAAAAGAAAAAAAAAAAAAGAAAAGAAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAG
+
?@?BDEADFDFAFGEGAEGG9FFE@@0;55',6>>3;5;;;A>A<55<?B><??C?<?AABA322(3<@??BBBAB?BC?@8CB@AC<(2>ABBC22898

(2) Corresponding tools

i. fqzcomp

The full text about this tool can be downloaded here.
A brief manual is shown below:

fqz_comp v4.6. Author James Bonfield, 2011-2013
The range coder is derived from Eugene Shelwien.
To compress:
  fqz_comp [options] [input_file [output_file]]
    -Q <num>       Perform lossy compression with all quality values
                   being within 'num' distance from their original value.
                   Default is lossless, i.e. "-q 0"
    -s <level>     Sequence compression level. 1-9 [Def. 3]
                   Specifying '+' on the end (eg -s5+) will use
                   models of multiple sizes for improved compression.
    -b             Use both strands in sequence hash table.
    -e             Extra seq compression: 16-bit vs 8-bit counters.
    -q <level>     Quality compression level.  1-3 [Def. 2]
    -n <level>     Name compression level.  1-2 [Def. 2]
    -P             Disable multi-threading
    -X             Disable generation/verification of check sums
    -S             SOLiD format
To decompress:
   fqz_comp -d < foo.fqz > foo.fastq
or fqz_comp -d foo.fqz foo.fastq
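
The -Q option promises that every quality value stays within 'num' of its original. One simple way to meet such a bound is uniform binning, sketched below. This is not taken from the fqzcomp source and may differ from what the tool really does; `quantize_qualities` and its parameters are hypothetical names, and the scores are assumed to be integers in the range 0..93.

```python
from typing import List


def quantize_qualities(scores: List[int], max_dist: int, max_score: int = 93) -> List[int]:
    """Replace each score by the centre of its bin so the change is at most max_dist.

    Bins have width 2 * max_dist + 1, so every value in a bin lies within
    max_dist of the bin centre. max_dist == 0 leaves the scores unchanged.
    """
    if max_dist < 0:
        raise ValueError("max_dist must be non-negative")
    width = 2 * max_dist + 1
    quantized = []
    for q in scores:
        centre = (q // width) * width + max_dist
        # Clamp so the result stays a valid score; since q <= max_score,
        # clamping only moves the centre closer to q.
        quantized.append(min(centre, max_score))
    return quantized


original = [0, 6, 7, 9, 20, 34, 35, 36, 40]
lossy = quantize_qualities(original, max_dist=2)
print(lossy)
```

Fewer distinct values means a more predictable quality stream, which is why a bounded-error option can improve the compression ratio.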

The -s, -q, and -n options presumably affect the compression ratio, but it is not obvious how. After a series of tests, the results were plotted and are shown in the figures below:

Result of a simple test:

Compress

[root@localhost fastq]# time fqz_comp -s6 -n1 -q3 -e -b ENCFF002EXT.fastq out.fqz
Names   71134666 ->    3367344 (0.047)
Bases  117600900 ->   12974412 (0.110)
Quals  117600900 ->   28828099 (0.245)
real 0m9.370s
user 0m16.791s
sys 0m0.478s

Decompress

[root@localhost fastq]# time fqz_comp -d out.fqz out.revert.fq
real 0m12.860s
user 0m20.844s
sys 0m0.502s
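
Since the test above is lossless (no -Q given), the decompressed file should be byte-identical to the input. A quick way to check this is to compare checksums; the sketch below is an independent check written for illustration and is unrelated to fqz_comp's own internal check sums. The helper names are hypothetical.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """True when both files have identical bytes (judged by SHA-256)."""
    return file_sha256(path_a) == file_sha256(path_b)
```

With the file names from the logs above, the call would be `same_content("ENCFF002EXT.fastq", "out.revert.fq")`.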

ENCFF002EXT.fastq is 299 MB and the compressed file out.fqz is about 44 MB, so the compressed file is roughly 1/7 the size of the original. Compression took about 9.37 s, a throughput of roughly 32 MB/s of input. Decompression was slower: about 12.86 s, or roughly 23 MB/s.
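
The arithmetic behind these figures can be reproduced from the log. In the sketch below the per-stream byte counts are copied from the fqz_comp output and the 299 MB file size from the text; the three stream totals are not expected to add up exactly to the on-disk size.

```python
# (bytes before, bytes after) for each stream, copied from the fqz_comp log.
streams = {
    "names": (71134666, 3367344),
    "bases": (117600900, 12974412),
    "quals": (117600900, 28828099),
}

for label, (before, after) in streams.items():
    print(f"{label}: {after / before:.3f}")

total_before = sum(before for before, _ in streams.values())
total_after = sum(after for _, after in streams.values())
overall_ratio = total_after / total_before
print(f"overall: {overall_ratio:.3f} (about 1/{1 / overall_ratio:.1f})")

# Throughput relative to the uncompressed file size stated in the text.
FILE_SIZE_MB = 299.0
COMPRESS_SECONDS = 9.370     # 'real' time of the compress run
DECOMPRESS_SECONDS = 12.860  # 'real' time of the decompress run
compress_speed = FILE_SIZE_MB / COMPRESS_SECONDS
decompress_speed = FILE_SIZE_MB / DECOMPRESS_SECONDS
print(f"compress: {compress_speed:.1f} MB/s, decompress: {decompress_speed:.1f} MB/s")
```

The quality stream accounts for most of the compressed output, which is why lossy quality options matter so much for the overall ratio.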

ii. scalce

The main page of scalce is here.
The full text paper can be downloaded here.
Github repository is here.

Like fqzcomp, scalce splits a .fastq file into three streams (names, sequences, and qualities) and processes each one separately. Unlike fqzcomp, scalce focuses on lossy compression of the qualities.
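
The idea of splitting a file into three streams can be illustrated with a short sketch. It uses zlib purely as a stand-in compressor. Neither fqzcomp nor scalce works this way internally (both use their own specialised models), the sample reads are made up, and the function names are hypothetical. The sketch also assumes four lines per record and a bare '+' separator line.

```python
import zlib
from typing import Dict, List, Tuple


def split_streams(fastq_text: str) -> Tuple[List[str], List[str], List[str]]:
    """Separate names, sequences, and qualities (assumes 4 lines per record)."""
    lines = fastq_text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("line count is not a multiple of four")
    return lines[0::4], lines[1::4], lines[3::4]


def join_streams(names: List[str], seqs: List[str], quals: List[str]) -> str:
    """Rebuild the FASTQ text; the '+' line is regenerated, not stored."""
    return "".join(f"{n}\n{s}\n+\n{q}\n" for n, s, q in zip(names, seqs, quals))


def compress_streams(fastq_text: str) -> Dict[str, bytes]:
    """Compress each stream on its own with zlib (a stand-in compressor)."""
    names, seqs, quals = split_streams(fastq_text)
    streams = {"names": names, "sequences": seqs, "qualities": quals}
    return {
        label: zlib.compress("\n".join(items).encode("ascii"), 9)
        for label, items in streams.items()
    }


# Made-up sample reads for illustration only.
SAMPLE = (
    "@read1\nACGTACGTAC\n+\nIIIIIHHHGG\n"
    "@read2\nACGTACGTTT\n+\nIIIIHHHGGF\n"
)
names, seqs, quals = split_streams(SAMPLE)
packed = compress_streams(SAMPLE)
print({label: len(blob) for label, blob in packed.items()})
```

Keeping the streams apart lets each one be handled by a method suited to its statistics: names are highly repetitive, sequences use a four-letter alphabet, and qualities are the noisiest of the three.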