@Rumia
2017-03-10T08:57:39.000000Z
A FASTQ file normally uses four lines per sequence.
A FASTQ file containing a single sequence might look like this:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
The character '!' represents the lowest quality while '~' is the highest. Here are the quality value characters in left-to-right increasing order of quality (ASCII):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
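The mapping from quality characters to numbers can be sketched in a few lines. This is a minimal example that assumes the common Phred+33 (Sanger) encoding, in which '!' (ASCII 33) stands for score 0; the function names are illustrative and do not come from any tool discussed here.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to integer Phred scores.

    offset=33 is an assumption (Phred+33 / Sanger encoding).
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(score):
    """Phred definition: Q = -10 * log10(p), hence p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# Quality line of the single-sequence example above.
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred_scores(quality)

print(scores[:8])                # [0, 6, 6, 9, 7, 7, 7, 7]
print(min(scores), max(scores))  # 0 37
print(error_probability(30))     # probability of a wrong base call at Q30
```

Under this encoding, a higher character in the ASCII table means a lower estimated error probability for that base.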
Example:
@HISEQ:210:H9L41ADXX:1:1215:6810:37655
CTCCAGCACCAAAAAAGAAAAAAAAAAAAAGAAAAGAAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAG
+
?@?BDEADFDFAFGEGAEGG9FFE@@0;55',6>>3;5;;;A>A<55<?B><??C?<?AABA322(3<@??BBBAB?BC?@8CB@AC<(2>ABBC22898
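The four-line layout described above can be read with a small generator. The sketch below assumes strict four-line records (no wrapped sequence or quality lines); `read_fastq` is a hypothetical helper, not part of fqzcomp or scalce, and the input is the single-sequence example shown earlier.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, qual


example = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

records = list(read_fastq(io.StringIO(example)))
for name, seq, qual in records:
    print(name, len(seq), len(qual))
```

The length check matters because every base must have exactly one quality character.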
fqz_comp v4.6. Author James Bonfield, 2011-2013
The range coder is derived from Eugene Shelwien.
To compress:
fqz_comp [options] [input_file [output_file]]
-Q <num> Perform lossy compression with all quality values
being within 'num' distance from their original value.
Default is lossless, i.e. "-q 0"
-s <level> Sequence compression level. 1-9 [Def. 3]
Specifying '+' on the end (eg -s5+) will use
models of multiple sizes for improved compression.
-b Use both strands in sequence hash table.
-e Extra seq compression: 16-bit vs 8-bit counters.
-q <level> Quality compression level. 1-3 [Def. 2]
-n <level> Name compression level. 1-2 [Def. 2]
-P Disable multi-threading
-X Disable generation/verification of check sums
-S SOLiD format
To decompress:
fqz_comp -d < foo.fqz > foo.fastq
or fqz_comp -d foo.fqz foo.fastq
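The -Q description above states a constraint: every stored quality value must stay within 'num' of its original. The sketch below shows one simple way to satisfy such a bound, by snapping each score to the centre of a fixed-width bin. It is only an illustration of the constraint and is not claimed to be how fqz_comp implements -Q; the function name and the input scores are hypothetical.

```python
def quantize_qualities(scores, max_dist):
    """Snap each score to the centre of a bin of width 2 * max_dist + 1.

    Every output differs from its input by at most max_dist;
    max_dist = 0 leaves the scores unchanged (lossless).
    """
    if max_dist < 0:
        raise ValueError("max_dist must be non-negative")
    width = 2 * max_dist + 1
    return [(q // width) * width + max_dist for q in scores]


original = [0, 6, 6, 9, 7, 7, 7, 7]  # hypothetical Phred scores
lossy = quantize_qualities(original, 2)

print(lossy)  # [2, 7, 7, 7, 7, 7, 7, 7]
print(max(abs(a - b) for a, b in zip(original, lossy)))  # 2
```

Fewer distinct values means a more predictable quality stream, which is why a bounded loss of this kind tends to compress better.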
The -s, -q and -n options presumably influence the compression ratio, but it is not obvious how. After a series of tests, the results were plotted and are shown in the figures below:
Result of a simple test:
[root@localhost fastq]# time fqz_comp -s6 -n1 -q3 -e -b ENCFF002EXT.fastq out.fqz
Names 71134666 -> 3367344 (0.047)
Bases 117600900 -> 12974412 (0.110)
Quals 117600900 -> 28828099 (0.245)
real 0m9.370s
user 0m16.791s
sys 0m0.478s
[root@localhost fastq]# time fqz_comp -d out.fqz out.revert.fq
real 0m12.860s
user 0m20.844s
sys 0m0.502s
ENCFF002EXT.fastq is 299MB and the compressed file out.fqz is about 44MB, so the compression ratio is roughly 1/7. Compressing the test file took about 9.37s, which gives a compression speed of about 32MB/s. Decompression was slower: it took about 12.86s, or about 23MB/s.
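The arithmetic behind these figures can be reproduced from the transcript. The sketch below parses the three per-stream lines printed by fqz_comp and recomputes the ratios and throughput; the regular expression and the variable names are assumptions made for this example, and the 299MB input size is the value quoted in the text.

```python
import re

# Per-stream lines copied from the fqz_comp run above.
STATS = """\
Names 71134666 -> 3367344 (0.047)
Bases 117600900 -> 12974412 (0.110)
Quals 117600900 -> 28828099 (0.245)
"""
INPUT_MB = 299.0            # input size quoted in the text
COMPRESS_SECONDS = 9.370    # "real" time of the compression run
DECOMPRESS_SECONDS = 12.860 # "real" time of the decompression run


def parse_stats(text):
    """Return {stream: (raw_bytes, compressed_bytes)} from fqz_comp output."""
    pattern = re.compile(r"^(\w+)\s+(\d+)\s*->\s*(\d+)", re.MULTILINE)
    return {name: (int(raw), int(packed))
            for name, raw, packed in pattern.findall(text)}


stats = parse_stats(STATS)
raw_total = sum(raw for raw, _ in stats.values())
packed_total = sum(packed for _, packed in stats.values())

for name, (raw, packed) in stats.items():
    print(f"{name}: {packed / raw:.3f}")
print(f"all streams: {packed_total / raw_total:.3f}")
print(f"compress:   {INPUT_MB / COMPRESS_SECONDS:.1f} MB/s")
print(f"decompress: {INPUT_MB / DECOMPRESS_SECONDS:.1f} MB/s")
```

The per-stream ratios show that the quality stream dominates the compressed output: it accounts for well over half of the roughly 45 million compressed bytes.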
Like fqzcomp, scalce divides a .fastq file into 3 streams (names/sequences/qualities) and processes each one separately. Unlike fqzcomp, scalce focuses on the lossy compression of qualities.
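The idea of splitting a FASTQ file into three streams can be illustrated as follows. This is a toy sketch: it assumes strict four-line records, uses two made-up reads, and uses zlib as a stand-in for the specialised coders in fqzcomp and scalce, so the sizes it prints say nothing about either tool.

```python
import zlib


def split_streams(fastq_text):
    """Split four-line FASTQ text into (names, sequences, qualities) lists."""
    lines = fastq_text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    return lines[0::4], lines[1::4], lines[3::4]


def compress_streams(fastq_text, level=9):
    """Compress each stream on its own; return {label: (raw, compressed)} sizes."""
    names, seqs, quals = split_streams(fastq_text)
    sizes = {}
    for label, stream in (("names", names), ("sequences", seqs), ("qualities", quals)):
        raw = "\n".join(stream).encode("ascii")
        sizes[label] = (len(raw), len(zlib.compress(raw, level)))
    return sizes


# Two made-up reads, for illustration only.
toy = (
    "@read1\nACGTACGTAC\n+\nIIIIIHHHGG\n"
    "@read2\nACGTTCGTAC\n+\nIIIIHHHGGF\n"
)

names, seqs, quals = split_streams(toy)
print(names, seqs, quals)
for label, (raw, packed) in compress_streams(toy).items():
    print(label, raw, packed)
```

Separating the streams lets each one be modelled on its own terms: names are highly repetitive text, sequences use a four-letter alphabet, and qualities are noisy numeric data.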
Result of a simple test:
[root@localhost fastq]# scalce ENCFF002EXT.fastq -o out
SCALCE 2.8 [pthreads; available cores=5]
Buffer size: 131072K, bucket storage size: 4194304K
Allocating pool... OK!
Preprocessing FASTQ files ...
Paired end #1, quality offset: 33
read length: 100
OK
Using 4 threads...
Done with file ENCFF002EXT.fastq, 1176009 reads found
Created temp file 0
Merging results ... 1
Generating 1 scalce file(s) from temp #0 compression: pigz
shrink factor for 0: 1
Written 234255 cores
Read bit size: paired end 0 = 0.77
Quality bit size: paired end 0 = 2.32
Cleaning ...
Statistics:
Total number of reads: 1176009
Read length: first end 100
Unbucketed reads count: 35, bucketed percentage 100.00
Lossy percentage: 0
Time elapsed: 00:00:12
Compression time: 00:00:04
Original size: 298.76M, new size: 52.63M, compression factor: 5.68
Done!
scalce compresses the original file into 3 files, namely out_1.scalcen, out_1.scalceq and out_1.scalcer (names/qualities/reads):
-rw-------. 1 root root 9.3M Mar 10 18:38 out_1.scalcen
-rw-------. 1 root root 11M Mar 10 18:38 out_1.scalcer
-rw-r--r--. 1 root root 33M Mar 10 18:38 out_1.scalceq
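As a sanity check, the bit sizes and the compression factor reported by scalce can be recomputed from the listing. The sketch below assumes that the sizes shown by ls are in MiB (units of 1024 * 1024 bytes) and are rounded, so only approximate agreement with the reported 0.77 and 2.32 bits is expected.

```python
MIB = 1024 * 1024

reads = 1176009       # "Total number of reads" in the scalce output
read_length = 100     # "Read length" in the scalce output
bases = reads * read_length


def bits_per_base(size_mib, n_bases):
    """Average number of stored bits per base for a file of size_mib MiB."""
    return size_mib * MIB * 8 / n_bases


reads_bits = bits_per_base(11, bases)  # out_1.scalcer, 11M in the listing
quals_bits = bits_per_base(33, bases)  # out_1.scalceq, 33M in the listing
factor = 298.76 / 52.63                # original size / new size

print(f"reads:     {reads_bits:.2f} bits per base (reported 0.77)")
print(f"qualities: {quals_bits:.2f} bits per quality value (reported 2.32)")
print(f"compression factor: {factor:.2f}")
```

Both recomputed values land slightly above the reported ones, which is consistent with the file sizes in the listing being rounded.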