@lakesea
2020-10-08T06:00:31.000000Z
字数 8721
阅读 504
项目
The RNA-seq libs were selected:
SRR8692168.sraSRR8692152.sraSRR8633200.sraSRR7126367.sraSRR6378268.sraSRR6378275.sraSRR6378282.sraSRR6116784.sraSRR4929869.sraSRR3176315.sraSRR1531072.sraSRR1531044.sraSRR5786000.sraSRR6884529.sraSRR6911324.sraSRR7072169.sraSRR9137587.sraSRR9109599.sraSRR5228682.sraSRR2079878.sraSRR2079857.sraSRR8216284.sraSRR9179250.sraSRR8902636.sraSRR8216279.sraSRR7072170.sraSRR5822167.sraSRR3970455.sraSRR1299039.sraSRR7184316.sraSRR8985031.sraSRR8985046.sraSRR7779699.sraSRR7503967.sraSRR5190400.sraSRR4030880.sraSRR5828096.sraSRR5828100.sraSRR5839742.sraSRR5839746.sraSRR6830245.sraSRR6830234.sraSRR6830225.sraSRR6830201.sraSRR6830216.sraSRR4933174.sraSRR5190502.sraSRR5190512.sraSRR7072163.sraSRR7009040.sraSRR5340698.sraSRR1299037.sraSRR3965980.sraSRR6380258.sraSRR5823222.sraSRR5823204.sraSRR5823188.sraSRR2032312.sraSRR2032322.sraSRR2032333.sraSRR5119412.sraSRR7900202.sraSRR7900205.sraSRR7900209.sraSRR7127919.sraSRR7127901.sraSRR6076932.sraSRR4045466.sraSRR7127910.sraSRR5824317.sraSRR5824320.sraSRR5824325.sraSRR9179219.sraSRR9179228.sraSRR9179247.sraSRR1289906.sraSRR1290657.sraSRR1290680.sraSRR5004517SRR5004746SRR5007601SRR5007738ERR569571ERR569564SRR6357659.sraSRR6357662.sraSRR6116750.sraSRR6116762.sraSRR6116778.sraERR597330.sraERR597353.sraSRR5190404.sraSRR8927005.sraSRR8927016.sraSRR7963328.sraSRR7963310.sraSRR7904023.sraSRR2005777.sraSRR4101252.sraSRR9211413.sraSRR8490105.sraSRR7779717.sraSRR7110440.sraSRR5089019.sraSRR1769980.sraSRR6076921.sraSRR4478735.sraSRR4478689.sraSRR4478643.sraSRR4478667.sraSRR8356628.sraSRR7779731.sraSRR5190474.sraSRR4096123.sraSRR2516426.sraSRR1821184.sraSRR1821188.sra
RNA-seq align to pangenome and run stringtie:
fastp -i SRR9179250_1.fastq -I SRR9179250_2.fastq -o SRR9179250.clean_1.fastq -O SRR9179250.clean_2.fastq -w 4hisat2 -x /scratch/pawsey0149/hhu/mainwork/chi_pan/ref/pan_nonref.36.gt500.round2.fa -1 SRR9179250.clean_1.fastq -2 SRR9179250.clean_2.fastq -p 4 |samtools view -@ 3 -Sb - | samtools sort -@ 3 -o /scratch/pawsey0149/hhu/mainwork/chi_pan/rna_seq/summary_srr_id/select/fastq/pair/bam/SRR9179250.sort.bamsamtools sort -@ 28 all.merge.bam -o all.sort.bamsamtools index -@ 28 all.sort.bam/scratch/pawsey0149/hhu/tool/stringtie-2.1.4.Linux_x86_64/stringtie -p 16 -o merged.gtf /scratch/pawsey0149/hhu/mainwork/chi_pan/rna_seq/summary_srr_id/select/fastq/pair/bam/all.sort.bam
assembly the transcript and convert it into transcript cds fasta:
util/cufflinks_gtf_genome_to_cdna_fasta.pl merged.gtf input/chi_masked.fa > transcripts.fastautil/cufflinks_gtf_to_alignment_gff3.pl merged.gtf > transcripts.gff3TransDecoder.LongOrfs -t transcripts.fastaTransDecoder.Predict -t transcripts.fastautil/cdna_alignment_orf_to_genome_orf.pl \transcripts.fasta.transdecoder.gff3 \transcripts.gff3 \transcripts.fasta > transcripts.fasta.transdecoder.genome.gff3
remove the highly similar sequences from the protein sequences.
cat Aves_Nchicken.fasta Chicken.fasta human.fasta > subseq.fasta~/biosoft/cd-hit-v4.6.8-2017-0621/cd-hit -i subseq.fasta -o subseq.dedup -M 6000
Generate based on the previous notes, which resulted from the first round of the MAKER annotation.
BRAKER can used to generate the gene prdiction gene model.This step will genereate the augustus and genemark gene prediction model, which can be used for maker annotation.
/group/pawsey0149/hhu/bioconda/miniconda2/bin/perl /group/pawsey0149/groupEes=chi_pan_36 --bam=all.sort.bam --cores 16
All the evidence is intergrated by Maker.
#-----Genome (these are always required)genome=/ws/hhu/soybean_data/braker/chi_maker/nonref.36.gt500.round2.fa #genome sequence (fasta file or fasta embeded in GFF3 file)organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic#-----Re-annotation Using MAKER Derived GFF3maker_gff= #MAKER derived GFF3 fileest_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = noaltest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = noprotein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = norm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = nomodel_pass=0 #use gene models in maker_gff: 1 = yes, 0 = nopred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = noother_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no#-----EST Evidence (for best results provide a file for at least one)est=/ws/hhu/soybean_data/braker/chi_maker/final.transcript1s.fasta.transdecoder.cds.red.fa #set of ESTs or assembled mRNA-seq in fasta formataltest= #EST/cDNA sequence file in fasta format from an alternate organismest_gff= #aligned ESTs or mRNA-seq from an external GFF3 filealtest_gff= #aligned ESTs from a closly relate species in GFF3 format#-----Protein Homology Evidence (for best results provide a file for at least one)protein=/ws/hhu/soybean_data/braker/chi_maker/subseq.dedup.fasta #protein sequence file in fasta format (i.e. from mutiple oransisms)protein_gff= #aligned protein homology evidence from an external GFF3 file#-----Repeat Masking (leave values blank to skip repeat masking)model_org=simple #select a model organism for RepBase masking in RepeatMaskerrmlib= #provide an organism specific repeat library in fasta format for RepeatMaskerrepeat_protein=/ws/groupEnv/appbio/app/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunnerrm_gff=/ws/hhu/soybean_data/braker/chi_maker/full_mask.complex.reformat.gff3#pre-identified repeat elements from an external GFF3 fileprok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = nosoftmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)#-----Gene Predictionsnaphmm=/ws/hhu/soybean_data/braker/chi_maker/Pan_rnd1.zff.length50_aed0.25.hmm #SNAP HMM filegmhmm= #GeneMark HMM fileaugustus_species=chi_pan_36 #Augustus gene prediction species modelfgenesh_par_file= #FGENESH parameter filepred_gff= #ab-initio predictions from an external GFF3 filemodel_gff= #annotated gene models from an external GFF3 file (annotation pass-through)est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = noprotein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = notrna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = nosnoscan_rrna= #rRNA file to have Snoscan find snoRNAsunmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no#-----Other Annotation Feature Types (features MAKER doesn't recognize)other_gff= #extra features to pass-through to final MAKER generated GFF3 file#-----External Application Behavior Optionsalt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databasescpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)#-----MAKER Behavior Optionsmax_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)min_contig=1 #skip genome contigs below this length (under 10kb are often useless)pred_flank=200 #flank for extending evidence clusters sent to gene predictorspred_stats=0 #report AED and QI statistics for all predictions as well as modelsAED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)min_protein=0 #require at least this many amino acids in predicted proteinsalt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = noalways_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = nomap_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = nokeep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = nosingle_length=250 #min length required for single exon ESTs if 'single_exon is enabled'correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genestries=2 #number of times to try a contig if there is a failure for some reasonclean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = noclean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = noTMP=/ws/hhu/soybean_data/braker/chi_maker/tmp/ #specify a directory other than the system default temporary directory for temporary files
First extract thpse proteins with AA length longer than 50AA:
perl remove_small.pl 49 chi_pan.all.maker.augustus_masked.proteins.fasta > chi_pan.all.maker.proteins_50aa.fasta
script:
#!/usr/bin/perl#remove small sequences from fasta fileuse strict;use warnings;my $minlen = shift or die "Error: `minlen` parameter not provided\n";{local $/=">";while(<>) {chomp;next unless /\w/;s/>$//gs;my @chunk = split /\n/;my $header = shift @chunk;my $seqlen = length join "", @chunk;print ">$_" if($seqlen >= $minlen);}local $/="\n";}
Using cd-hits with 90% cut-off:
cd-hit-2d -i Gallus_gallus.GRCg6a.pep.all.fa -i2 chi_pan.all.maker.proteins_50aa.fasta -o chi_pan.all.maker.proteins_50aa_rd_0.9.fa -c 0.9
Based on the filtered set of protein file, removed those filtred genes from the gff file.
# create naming table (there are additional options for naming beyond defaults)maker_map_ids --prefix PanGallus_Gene --justify 5 chi_pan.all.gff > rename.map# replace names in GFF filesmap_gff_ids rename.map chi_pan.all.gffgrep "maker" chi_pan.all.gff >chi.pan.gff# replace names in FASTA headersmap_fasta_ids rename.map chi_pan.all.maker.proteins_50aa_rd_0.9.fa