LncExpDB

Expression Database of Human Long non-coding RNAs

Introduction

1 RNA-seq data processing pipeline

This pipeline is used to analyze RNA sequencing data obtained from different biological contexts with a reference genome and annotation. It takes sample summary and runinfo files derived from NCBI SRA as inputs, performs quality control and alignment steps, and finally produces the gene/transcript reads count matrices.

Environment requirement

1. SRAtoolkit (ver. 2.9.6.1)
2. Trimmomatic (ver. 0.39)
3. HISAT2 (ver. 2.2.1)
4. STAR
5. RSeQC (ver. 2.6.4)
6. FeatureCounts
7. Kallisto
8. StringTie (ver. 2.0.6)

2.1 Download sra files

The sra files are downloaded with the following command:

wget -c -t 0 download_url -P output_path

The -c parameter enables resume capability. -t 0 sets the retry attempts to infinite. -P specifies that the downloaded file will be saved in the output_path directory.

2.2 Convert sra files into fastq files

The sra files are converted into fastq files by fasterq-dump with the following command:

fasterq-dump -p -e 32 --split-3 sra_file -O output_dir

-p enables multi-threading to speed up the data extraction process. -e 32 specifies the use of 32 threads for processing the data. --split-3 splits the sra file into R1 and R2 fastq files for paired-end sequencing data or single fastq file for single-end sequencing data. -O specifies the output directory where the extracted fastq files will be saved.

2.3 Merge multiple fastq files

The runs from the same sample are merged with the following command:
For paired-end sequencing data

cat sample/*_1.fastq >sample/sample_1.fastq

cat sample/*_2.fastq >sample/sample_2.fastq

For single-end sequencing data
 cat sample/*.fastq >sample/sample.fastq

2.4 Infer library strandedness from fastq files

Library preparation protocols for RNA-Seq can be stranded or unstranded. The alignment and quantification tools depend on the strandedness to improve the specificity of alignment and reads allocation, separately. Manually curating this information from GEO or SRA is time-consuming. Fortunately, the strandedness can be auto inferred by infer_experiment.py script of RSeQC package with the following command:

python2 infer_experiment.py -i sample_Alignment-unsorted.sort.bam -r hg38BED > infer_experiment_log 
-i sample_Alignment-unsorted.sort.bam specifies the input BAM file.
Here, this file is generated from randomly selecting a sample to align by HISAT2. -r provides a BED file containing genomic regions of all genes. infer_experiment_log contains the fraction of reads mapped to forward strand and reverse strand. This fraction is used to infer the strandedness.

2.5 Filter low-quality reads and sequencing adapters with Trimmomatic

Trimmomatic is used to filter out low-quality reads and sequencing adapters with the following command:
For paired-end sequencing data

java -jar Trimmomatic PE -threads threads -phred33 sample_1.fastq sample_2.fastq sample_paired_3.fastq sample_unpaired_3.fastq sample_paired_4.fastq sample_unpaired_4.fastq ILLUMINACLIP:%s:2:30:10:2:TRUE HEADCROP:3 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:15

For single-end sequencing data
java -jar Trimmomatic SE -threads threads -phred33 sample.fastq sample_paired.fastq ILLUMINACLIP:%s:2:30:10:2:TRUE HEADCROP:3 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:15

SE and PE indicate that the input is single-end and paired-end sequencing data, separately. -threads specifies the number of threads. -phred33 indicates that the quality scores in the fastq files are encoded using the Phred+33 format. sample_1.fastq and sample_2.fastq represent the forward and reverse reads of paired-end sequencing data. sample_paired_3.fastq and sample_paired_4.fastq are the paired reads that pass the filtering. sample_unpaired_3.fastq and sample_unpaired_4.fastq record the unpaired reads after filtering. ILLUMINACLIP cuts adapter and other illumina-specific sequences from the read. HEADCROP:3 indicates that the first 3 bases from the start of the read will be removed. LEADING:3 will remove low-quality bases from the start of the read, with a quality threshold of 3. Meanwhile, TRAILING:3 removes low-quality bases from the end of each sequencewith a quality threshold of 3. SLIDINGWINDOW:4:15 uses a sliding window approach to trim the sequence based on quality, with a window size of 4 bases. If the average quality within the window falls below 15, the sequence is trimmed from the end of that window. MINLEN:15 represents thatreads shorter than 15 bases after trimming will be discards.

Kallisto, a fast, accurate and alignment-free program, has been widely used for transcript quantification. Here, Kallisto is used to quantify transcripts with the following command:
For paired-end sequencing data

Kallisto quant -i kallisto_index -t threadsinferred_strandednesssample_paired_3.fastq sample_paired_4.fastq -o output_dir

For single-end sequencing data
Kallisto quant -i kallisto_index --single -l 150 -s 10 -t threads inferred_strandednesssample_paired.fastq -o output_dir

-i specifies the index file to be used by Kallisto, where kallisto_index is the pre-built transcriptome index. It can be constructed with the following command:
kallisto index -i kallisto_index lncbook.transcript.fa

lncbook.transcript.fa contains all transcript sequences.
inferred_strandedness is inferred from RSeQC. –single indicates that the input sequencing data is single-end rather than paired-end. -l 150 specifies the average fragment length of the sequencing reads.-s 10 specifies the standard deviation of the fragment length. -t specifies the number of threads.

2.7 Align reads to genome with STAR

STAR is used to align reads to reference genome with the following command:
For paired-end sequencing data

STAR --runThreadN 4 --genomeDir star_reference --outFilterType BySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 16 –readFilesIn sample_paired_3.fastq sample_paired_4.fastq --outFileNamePrefix sample

For single-end sequencing data
STAR --runThreadN 4 --genomeDirstar_reference --outFilterTypeBySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 16 --readFilesInsample_paired.fastq --outFileNamePrefixsample

This parameter combination was used by ENCODE project to quantify abundance of deep-sequencing samples. Specifically, --runThreadN specifies the number of threads. --genomeDir specifies the directory containing the STAR genome index files. --outFilterTypeBySJout specifies the default filtering mode where alignments are filtered based on splice junctions. --outFilterMultimapNmax 20 allows up to 20 multiple mappings. --outFilterMismatch Nmax indicates the max allowed mismatches in the alignment. --outFilterMismatchNoverLmax0.04 limits the maximum number of mismatches to 4% of the read length. --alignIntronMin specifies the minimum intron length to be considered for alignment. --alignIntronMax specifies the maximum intron length to be considered for alignment. --alignMatesGapMax specifies the maximum gap size between paired-end reads. --alignSJoverhangMin specifies the minimum overhang length of the splice junctions to be considered for alignment. --sjdbScore sets the score for each splice junction. A value of 1 is the minimum score threshold for splice junctions, used to filter out low-quality splice junctions and enhance the identification of high-quality ones. --outSAMtype BAM SortedByCoordinate specifies the output file format and sorting. BAM indicates that the output will be in BAM format, and SortedByCoordinate means that the alignments will be sorted by coordinate in the output file. --outBAMsortingThreadN specifies the number of threads to be used for sorting the BAM file. –readFilesIn specifies the input fastq files. –outFileNamePrefix sets the prefix for the output files.

2.8 Gene quantification with featureCounts

featureCounts is used to quantify the reads on each gene with the following command:
For paired-end sequencing data

featureCounts -T threads -s inferred_strandedness -p -t exon -g gene_id -a lncbook_gtf -o sample_gene_quant.txt sample.Aligned.sortedByCoord.out.bam

For single-end sequencing data
featureCounts -T threads -s inferred_strandedness-t exon -g gene_id -a lncbook_gtf -o sample_gene_quant.txt sample.Aligned.sortedByCoord.out.bam -T

specifies the number of threads. -sdefines the strand specificity of the reads. -p indicates that the input data consists of paired-end reads. -t exon specifies that the counting should be performed on exon features. -g gene_id specifies that the counts should be summarized by gene_id, using the gene identifier field from the annotation file.-aprovides the annotation file. -osets the output file name.

Database Usage

The pipeline can be performed with the following command:

python RNA-seq.py -a sra_result.csv -b sra_runinfo.csv -o output_path -t threads

The sra_result.csvand sra_runinfo.csvcan be obtained from NCBI SRA. The sra_result.csvcontains the basic sample information such as the accession, organism name, sequencing instrument, library strategy,file size and etc. The example that includes the complete columns is shown below:
The sra_runinfo.csvcontains the basic information for each sequencing run. The basic information contains the total base number, average read length, file size, download link and etc. The example that includes the complete columns is shown below:
The output_path specifies the directory where all output files will be stored. The -t parameterspecifies the number of threads used for Trimmomatic, Kallisto and featureCounts.

Output

The structure of the output folder is as follows:

SRP219013
SRX6748536
  • abundance.tsv
  • SRX6748536_gene_quant.txt
  • logs
    • download.log
    • fasterq.log
    • FeatureCounts_Log.txt
    • Kallisto_Log.txt
    • QC_Rate_Log.txt
    • STAR_Log.txt
The projectID of each dataset isusedas the name of the output directory. For each sample under this project, the read count matrices produced by featureCounts(sample_gene_quant.txt) and Kallisto (abundance.tsv) are stored in subfolders named with the sample IDs. The output logs are stored in the corresponding logs folder.

Abundance normalization pipeline

Normalization of RNA-Seq data has been proven essential to ensure accurate inferences and replication of biological findings. The major goal of normalization is to remove biology-unrelevant variances introducing by sequencing platform, library protocol, sequencing depth, gene/transcript length, or GC content. While comparing the expression difference from different datasets, it is necessary to remove the variances introducing by sequenicng platform and library protocol. Here, we only compare the expression difference between samples from the same dataset, ensuring the consistence of sequencing platform and library protocol. TMM (trimmed mean of M values), an effective between-sample normalization method to remove variances under the assumption that the most genes show little variation across different samples [A scaling normalization method for differential expression analysis of RNA-seq data], has been proven effective in identifying differentially expressed genes[Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions]. We perform gene/transcript abundance normalization with the calcNormFactorsfunction in edgeR package. Then convert normalized abundance matrices into CPM, FPKM and TPM matrices.