This pipeline is used to analyze RNA sequencing data obtained from different biological contexts with a reference genome and annotation. It takes sample summary and runinfo files derived from NCBI SRAas inputs, performs quality control and alignment steps, and finally produces the gene/transcript reads count matrices.
1. SRAtoolkit (ver. 2.9.6.1)
2. Trimmomatic (ver. 0.39)
3. HISAT2 (ver. 2.2.1)
4. STAR
5. RSeQC (ver. 2.6.4)
6. FeatureCounts
7. Kallisto
8. StringTie (ver. 2.0.6)
The sra files are downloaded with the following command:
wget -c -t 0 download_url -P output_path
The sra files are converted into fastq files by fasterq-dump with the following command:
fasterq-dump -p -e 32 --split-3 sra_file -O output_dir
The runs from the same sample are merged with the following command:
For paired-end sequencing data
cat sample/*_1.fastq >sample/sample_1.fastqcat sample/*_2.fastq >sample/sample_2.fastq cat sample/*.fastq >sample/sample.fastq
Library preparation protocols for RNA-Seq can be stranded or unstranded. The alignment and quantification tools depend on the strandedness to improve the specificity of alignment and reads allocation, separately. Manually curating this information from GEO or SRA is time-consuming. Fortunately, the strandedness can be auto inferred by infer_experiment.py script of RSeQC package with the following command:
python2 infer_experiment.py -i sample_Alignment-unsorted.sort.bam -r hg38BED > infer_experiment_log
-i sample_Alignment-unsorted.sort.bam specifies the input BAM file.
Trimmomatic is used to filter out low-quality reads and sequencing adapters with the following command:
For paired-end sequencing data
java -jar Trimmomatic PE -threads threads -phred33 sample_1.fastq sample_2.fastq sample_paired_3.fastq sample_unpaired_3.fastq sample_paired_4.fastq sample_unpaired_4.fastq ILLUMINACLIP:%s:2:30:10:2:TRUE HEADCROP:3 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:15java -jar Trimmomatic SE -threads threads -phred33 sample.fastq sample_paired.fastq ILLUMINACLIP:%s:2:30:10:2:TRUE HEADCROP:3 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:15
Kallisto, a fast, accurate and alignment-free program, has been widely used for transcript quantification. Here, Kallisto is used to quantify transcripts with the following command:
For paired-end sequencing data
Kallisto quant -i kallisto_index -t threadsinferred_strandednesssample_paired_3.fastq sample_paired_4.fastq -o output_dirKallisto quant -i kallisto_index --single -l 150 -s 10 -t threadsinferred_strandednesssample_paired.fastq -o output_dirkallisto index -i kallisto_index lncbook.transcript.falncbook.transcript.fa contains all transcript sequences.inferred_strandednessis inferred from RSeQC.–singleindicates that the input sequencing data is single-end rather than paired-end. -l 150 specifies the average fragment length of the sequencing reads.-s 10 specifies the standard deviation of the fragment length. -t specifies the number of threads.
STAR is used to align reads to reference genome with the following command:
For paired-end sequencing data
STAR --runThreadN 4 --genomeDir star_reference --out FilterType BySJout --out FilterMultimapNmax 20 --out FilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --align SJDBoverhangMin 1 --sjdbScore 1 --out SAMtype BAM SortedByCoordinate --out BAMsortingThreadN 16 –readFilesIn sample_paired_3.fastq sample_paired_4.fastq --out FileNamePrefix sampleSTAR --runThreadN 4 --genomeDir star_reference --out FilterTypeBySJout --out FilterMultimapNmax 20 --out FilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --align SJDBoverhangMin 1 --sjdbScore 1 --out SAMtype BAM SortedByCoordinate --out BAMsortingThreadN 16 --readFilesInsample_paired.fastq --out FileNamePrefixsampleIn LncExpDB 1.0, gene expression was quantified using featureCounts.
featureCounts is used to quantify the reads on each gene with the following command:
For paired-end sequencing data
featureCounts -T threads -s inferred_strandedness -p -t exon -g gene_id -a lncbook_gtf -o sample_gene_quant.txt sample.Aligned.sortedByCoord.out.bamfeatureCounts -T threads -s inferred_strandedness-t exon -g gene_id -a lncbook_gtf -o sample_gene_quant.txt sample.Aligned.sortedByCoord.out.bam -TIn LncExpDB 2.0, gene expression was quantified using RSEM for greater accuracy.
RSEM is used to quantify gene and isoform expression from RNA-seq data with the following command:
For paired-end sequencing data
frsem-calculate-expression --paired-end --bam --estimate-rspd --strandedness none -p 4 sampleAligned.sortedByCoord.out.bam rsem_reference/sample sample_RSEMrsem-calculate-expression --single-cell-stranded --bam --estimate-rspd --strandedness none -p 4 sampleAligned.sortedByCoord.out.bam rsem_reference/sample sample_RSEM
The pipeline can be performed with the following command:
python RNA-seq.py -a sra_result.csv -b sra_runinfo.csv -o output_path -t threads
The structure of the output folder is as follows:
Normalization of RNA-Seq data has been proven essential to ensure accurate inferences and replication of biological findings. The major goal of normalization is to remove biology-unrelevant variances introducing by sequencing platform, library protocol, sequencing depth, gene/transcript length, or GC content. While comparing the expression difference from different datasets, it is necessary to remove the variances introducing by sequenicng platform and library protocol. Here, we only compare the expression difference between samples from the same dataset, ensuring the consistence of sequencing platform and library protocol. TMM (trimmed mean of M values), an effective between-sample normalization method to remove variances under the assumption that the most genes show little variation across different samples [A scaling normalization method for differential expression analysis of RNA-seq data], has been proven effective in identifying differentially expressed genes [Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions]. We perform gene/transcript abundance normalization with the calc NormFactors function in edgeR package. Then convert normalized abundance matrices into CPM, FPKM and TPM matrices.