This pipeline is used to analyze RNA sequencing data obtained from different biological contexts with a reference genome and annotation. It takes sample summary and runinfo files derived from NCBI SRA as inputs, performs quality control and alignment steps, and finally produces the gene/transcript reads count matrices.
1. SRAtoolkit (ver. 2.9.6.1)
2. Trimmomatic (ver. 0.39)
3. HISAT2 (ver. 2.2.1)
4. STAR
5. RSeQC (ver. 2.6.4)
6. FeatureCounts
7. Kallisto
8. StringTie (ver. 2.0.6)
The sra files are downloaded with the following command:
wget -c -t 0 download_url -P output_path
The sra files are converted into fastq files by fasterq-dump with the following command:
fasterq-dump -p -e 32 --split-3 sra_file -O output_dir
The runs from the same sample are merged with the following command:
For paired-end sequencing data
cat sample/*_1.fastq >sample/sample_1.fastq
cat sample/*_2.fastq >sample/sample_2.fastq
cat sample/*.fastq >sample/sample.fastq
Library preparation protocols for RNA-Seq can be stranded or unstranded. The alignment and quantification tools depend on the strandedness to improve the specificity of alignment and reads allocation, separately. Manually curating this information from GEO or SRA is time-consuming. Fortunately, the strandedness can be auto inferred by infer_experiment.py script of RSeQC package with the following command:
python2 infer_experiment.py -i sample_Alignment-unsorted.sort.bam -r hg38BED > infer_experiment_log
-i sample_Alignment-unsorted.sort.bam specifies the input BAM file.
Trimmomatic is used to filter out low-quality reads and sequencing adapters with the following command:
For paired-end sequencing data
java -jar Trimmomatic PE -threads threads -phred33 sample_1.fastq sample_2.fastq sample_paired_3.fastq sample_unpaired_3.fastq sample_paired_4.fastq sample_unpaired_4.fastq ILLUMINACLIP:%s:2:30:10:2:TRUE HEADCROP:3 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:15
java -jar Trimmomatic SE -threads threads -phred33 sample.fastq sample_paired.fastq ILLUMINACLIP:%s:2:30:10:2:TRUE HEADCROP:3 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:15
Kallisto, a fast, accurate and alignment-free program, has been widely used for transcript quantification. Here, Kallisto is used to quantify transcripts with the following command:
For paired-end sequencing data
Kallisto quant -i kallisto_index -t threadsinferred_strandednesssample_paired_3.fastq sample_paired_4.fastq -o output_dir
Kallisto quant -i kallisto_index --single -l 150 -s 10 -t threads inferred_strandednesssample_paired.fastq -o output_dir
kallisto index -i kallisto_index lncbook.transcript.fa
STAR is used to align reads to reference genome with the following command:
For paired-end sequencing data
STAR --runThreadN 4 --genomeDir star_reference --outFilterType BySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 16 –readFilesIn sample_paired_3.fastq sample_paired_4.fastq --outFileNamePrefix sample
STAR --runThreadN 4 --genomeDirstar_reference --outFilterTypeBySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 16 --readFilesInsample_paired.fastq --outFileNamePrefixsample
featureCounts is used to quantify the reads on each gene with the following command:
For paired-end sequencing data
featureCounts -T threads -s inferred_strandedness -p -t exon -g gene_id -a lncbook_gtf -o sample_gene_quant.txt sample.Aligned.sortedByCoord.out.bam
featureCounts -T threads -s inferred_strandedness-t exon -g gene_id -a lncbook_gtf -o sample_gene_quant.txt sample.Aligned.sortedByCoord.out.bam -T
The pipeline can be performed with the following command:
python RNA-seq.py -a sra_result.csv -b sra_runinfo.csv -o output_path -t threads
The structure of the output folder is as follows:
Normalization of RNA-Seq data has been proven essential to ensure accurate inferences and replication of biological findings. The major goal of normalization is to remove biology-unrelevant variances introducing by sequencing platform, library protocol, sequencing depth, gene/transcript length, or GC content. While comparing the expression difference from different datasets, it is necessary to remove the variances introducing by sequenicng platform and library protocol. Here, we only compare the expression difference between samples from the same dataset, ensuring the consistence of sequencing platform and library protocol. TMM (trimmed mean of M values), an effective between-sample normalization method to remove variances under the assumption that the most genes show little variation across different samples [A scaling normalization method for differential expression analysis of RNA-seq data], has been proven effective in identifying differentially expressed genes[Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions]. We perform gene/transcript abundance normalization with the calcNormFactorsfunction in edgeR package. Then convert normalized abundance matrices into CPM, FPKM and TPM matrices.