CIRIquant Comprehensive toolkit for circRNA quantification and differential expression analysis
Manual
Installation
NOTES: Only python2 is supported
Prerequisites
Softwares:
-
bwa
-
hisat2
-
stringtie
-
samtools >= 1.9 (older version of samtools may use deprecated parameters in
sort
andindex
commands)
Python packages:
-
PyYAML
-
argparse
-
pysam
-
numpy
-
scipy
-
scikit-learn
Install CIRIquant from source code
Please use the latest released version from GitHub or SourceForge
Use the setup.py for CIRIquant installation (clean install using virutalenv
is highly recommended).
# create and activate virtual env
pip install virtualenv
virtualenv -p /path/to/your/python2/executable venv
source ./venv/bin/activate
# Install CIRIquant and its requirement automatically
tar zxvf CIRIquant.tar.gz
cd CIRIquant
python setup.py install
# Manual installation of required pacakges is also supported
pip install -r requirements.txt
The package should take approximately 40 seconds to install on a normal computer.
Install CIRIquant using pip
pip install CIRIquant
Usage 1: circRNA quantifcation
Basic options
Usage:
CIRIquant [options] --config <config> -1 <m1> -2 <m2>
<config> Config file
<m1> Input mate1 reads (for paired-end data)
<m2> Input mate2 reads (for paired-end data)
Options (defaults in parentheses):
-v Run in verbose mode
-o, --out Output directory (default: current directory)
-e, --log Specific log file (default: sample_prefix.log)
-p, --prefix Output sample prefix (default: input sample name)
-t, --threads Number of CPU threads to use (defualt: 4)
-a, --anchor Minimum anchor length for junction alignment (default: 5)
-l, --library-type Library type, 0: unstranded, 1: read1 match the sense strand, 2: read1 match the antisense strand (default: 0)
--bed User provided Back-Spliced Junction Site in BED format
--circ circRNA prediction results from other tools
--tool Tool name, required when --circ is specified ([CIRI2/CIRCexplorer2/DCC/KNIFE/MapSplice/UROBORUS/circRNA_finder/find_circ])
--RNaseR CIRIquant output file of RNase R data (required for RNase R correction)
--bam Specific hisat2 alignment bam file against reference genome
--no-gene Skip StringTie estimation of gene abundance
NOTE:
-
For now, –circ and –tool options support results from
CIRI2
/CIRCexplorer2
/DCC
/KNIFE
/MapSplice
/UROBORUS
/circRNA_finder
/find_circ
-
For tools like
DCC
andcircRNA_finder
, please manually remove duplicated circRNAs with same junction postion but have opposite strands. -
Gene expression values are needed for normalization, do not use
--no-gene
if you need to run DE analysis afterwards.
Example config
A YAML-formated config file is needed for CIRIquant to find software and reference needed. A valid example of config file is demonstrated below.
// Example of config file
name: hg19
tools:
bwa: /home/zhangjy/bin/bwa
hisat2: /home/zhangjy/bin/hisat2
stringtie: /home/zhangjy/bin/stringtie
samtools: /home/zhangjy/bin/samtools
reference:
fasta: /home/zhangjy/Data/database/hg19.fa
gtf: /home/zhangjy/Data/database/gencode.v19.annotation.gtf
bwa_index: /home/zhangjy/Data/database/hg19/_BWAtmp/hg19
hisat_index: /home/zhangjy/Data/database/hg19/_HISATtmp/hg19
Key | Description |
---|---|
name | the name of config file |
bwa | the path of bwa |
hisat2 | the path of hisat2 |
stringtie | the path of stringite |
samtools | the path of samtools , samtools version below 1.3.1 is not supported |
fasta | reference genome fasta, a fai index by samtools faidx is also needed under the same directory |
gtf | annotation file of reference genome in GTF/GFF3 format |
bwa_index | prefix of BWA index for reference genome |
hisat_index | prefix of HISAT2 index for reference genome |
For quantification of user-provided circRNAs, a list of junction sites in bed format is required, the 4th column must be in “chrom:start|end” format. For example:
chr1 10000 10099 chr1:10000|10099 . +
chr1 31000 31200 chr1:31000|31200 . -
Example Usage
Recommended: Predict circRNAs using CIRI2 (packaged in CIRIquant)
CIRIquant -t 4 \
-1 ./test_1.fq.gz \
-2 ./test_2.fq.gz \
--config ./chr1.yml \
-o ./test \
-p test
Quantify circRNAs using provided BED format input
CIRIquant -t 4 \
-1 ./test_1.fq.gz \
-2 ./test_2.fq.gz \
--config ./chr1.yml \
-o ./test \
-p test \
--bed your_circRNAs.bed
Quantify circRNAs using results from other tools
For example, if you have find_circ
results of predicted circRNAs.
CIRIquant -t 4 \
-1 ./test_1.fq.gz \
-2 ./test_2.fq.gz \
--config ./chr1.yml \
-o ./test \
-p test \
--circ find_circ_results.txt \
--tool find_circ
Output format
The main output of CIRIquant is a GTF file, that contains detailed information of BSJ and FSJ reads of circRNAs and annotation of circRNA back-spliced regions in the attribute columns
Description of each columns’s value
column | name | description |
---|---|---|
1 | chrom | chromosome / contig name |
2 | source | CIRIquant |
3 | type | circRNA |
4 | start | 5' back-spliced junction site |
5 | end | 3' back-spliced junction site |
6 | score | CPM of circRNAs (#BSJ / #Mapped reads) |
7 | strand | strand information |
8 | . | . |
9 | attributes | attributes seperated by semicolon |
The attributes containing several pre-defined keys and values:
key | description |
---|---|
circ_id | name of circRNA |
circ_type | circRNA types: exon / intron / intergenic |
bsj | number of bsj reads |
fsj | number of fsj reads |
junc_ratio | circular to linear ratio: 2 * bsj / ( 2 * bsj + fsj) |
rnaser_bsj | number of bsj reads in RNase R data (only when --RNaseR is specificed) |
rnaser_fsj | number of fsj reads in RNase R data (only when --RNaseR is specificed) |
gene_id | ensemble id of host gene |
gene_name | HGNC symbol of host gene |
gene_type | type of host gene in gtf file |
Usage 2: RNase R effect correction
When you have both RNase R treated and untreated samples, CIRIquant can estimate the before-treatment expression levels of circRNAs detected in RNase R data.
In order to remove RNase R treatment effect, two steps are needed:
-
Run CIRIquant with RNase R treated sample.
-
Run CIRIquant with untreaded total RNA sample, specific
--RNaseR
option using the output gtf file in Step1
Then, CIRIquant will output estimated expression levels of circRNAs detected in RNaseR data, and the header lines will include additional information of RNase R treatment effciency.
Example usage
# Step1. Run CIRIquant with RNase R treated data
CIRIquant --config ./hg19.yml \
-1 ./RNaseR_treated_1.fq.gz \
-2 ./RNaseR_treated_2.fq.gz \
--no-gene \
-o ./RNaseR_treated \
-p RNaseR_treated \
-t 6
# Step2. Run CIRIquant with untreated total RNA
CIRIquant --config ./hg19.yml \
-1 ./TotalRNA_1.fq.gz \
-2 ./TotalRNA_2.fq.gz \
-o ./TotalRNA \
-p TotalRNA \
-t 6 \
--RNaseR ./RNaseR_treated/RNaseR_treated.gtf
Usage 3: Differential expression analysis
Study without biological replicate
For sample without replicate, the differential expression & differential splicing analysis is performed using CIRI_DE
Usage:
CIRI_DE [options] -n <control> -c <case> -o <out>
<control> CIRIquant result of control sample
<case> CIRIquant result of treatment cases
<out> Output file
Options (defaults in parentheses):
-p p value threshold for DE and DS score calculation (default: 0.05)
-t numer of threads (default: 4)
Example usage:
CIRI_DE -n control.gtf -c case.gtf -o CIRI_DE.tsv
The output format CIRI_DE
is in the format below:
column | name | description |
---|---|---|
1 | circRNA_ID | circRNA identifier |
2 | Case_BSJ | number of BSJ reads in case |
3 | Case_FSJ | number of FSJ reads in case |
4 | Case_Ratio | junction ratio in case |
5 | Ctrl_BSJ | number of BSJ reads in control |
6 | Ctrl_FSJ | number of FSJ reads in control |
7 | Ctrl_Ratio | junction ratio in control |
8 | DE_score | differential expression score |
9 | DS_score | differential splicing score |
Study with biological replicates
For study with biological replicates, a customed analysis pipeline of edgeR is recommended and we provide prep_CIRIquant
to generate matrix of circRNA expression level / junction ratio and CIRI_DE_replicate
for DE analysis
Step1: Prepare CIRIquant output files
One should provide a text file listing sample information and path to CIRIquant output GTF files
CONTROL1 ./c1/c1.gtf C 1
CONTROL2 ./c2/c2.gtf C 2
CONTROL3 ./c3/c3.gtf C 3
CASE1 ./t1/t1.gtf T 1
CASE2 ./t2/t2.gtf T 2
CASE3 ./t3/t3.gtf T 3
The first three columns is required by default. For paired samples, you could also add a column of subject name.
column | description |
---|---|
1 | sample name |
2 | path to CIRIquant output gtf |
3 | group ("C" for control, "T" for treatment) |
4 | subject (optional, only for paired samples) |
Note: If you are planning to use CIRI_DE for differential expression, then group name in column 3 must be either “C” or “T”.
Then, run prep_CIRIquant
to summarize the circRNA expression profile in all samples
Usage:
prep_CIRIquant [options]
-i the file of sample list
--lib where to output library information
--circ where to output circRNA annotation information
--bsj where to output the circRNA expression matrix
--ratio where to output the circRNA junction ratio matrix
Example:
prep_CIRIquant -i sample.lst \
--lib library_info.csv \
--circ circRNA_info.csv \
--bsj circRNA_bsj.csv \
--ratio circRNA_ratio.csv
These count matrices (CSV files) can then be imported into R for use by DESeq2 and edgeR (using the DESeqDataSetFromMatrix and DGEList functions, respectively).
Step2: Prepare StringTie output
The output of StringTie should locate under output_dir/gene/prefix_out.gtf
. You need to use prepDE.py from stringTie to generate the gene count matrix for normalization.
For example, one can provide a text file sample_gene.lst
containing sample IDs and path to StringTie outputs:
CONTROL1 ./c1/gene/c1_out.gtf
CONTROL2 ./c2/gene/c2_out.gtf
CONTROL3 ./c3/gene/c3_out.gtf
CASE1 ./t1/gene/t1_out.gtf
CASE2 ./t2/gene/t2_out.gtf
CASE3 ./t3/gene/t3_out.gtf
Then, run prepDE.py -i sample_gene.lst
and use gene_count_matrix.csv
generated under current working directory for further analysis.
Step3: Differential expression analysis
For differential analysis using CIRI_DE_replicate
, you need to install a R environment and edgeR
package from Bioconductor.
Usage:
CIRI_DE_replicate [options]
--lib library information by CIRIquant
--bsj circRNA expression matrix
--gene gene expression matrix
--out output differential expression result
Example:
CIRI_DE_replicate --lib library_info.csv \
--bsj circRNA_bsj.csv \
--gene gene_count_matrix.csv \
--out circRNA_de.tsv
Please be noted that the output results is unfiltered, and you could apply a more stringent filter on expression values to get a more convincing result.
Test data
Download test dataset
Test dataset can be downloaded from Github.
wget https://github.com/Kevinzjy/CIRIquant/releases/download/v0.2.0/test_data.tar.gz
tar zxvf test_data.tar.gz
circRNA quantification
Folder quant
contain the test dataset for circRNA quantification.
1. Generate hisat2 and bwa index
cd ./test_data/quant
bwa index -a bwtsw -p chr1.fa chr1.fa
hisat2-build ./chr1.fa ./chr1.fa
2. Customize the configuration
Replace the path of bwa
/hisat2
/stringtie
/samtools
in chr1.yml
with your own version.
3. Run test dataset
Test data set can be retrived under test_data/quant
folder, you can replace the path of required software in the chr1.yml
with your own version
CIRIquant -t 4 \
-1 ./test_1.fq.gz \
-2 ./test_2.fq.gz \
--config ./chr1.yml \
--no-gene \
-o ./test \
-p test
The demo dataset should take approximately 5 minutes on a personal computer. It has been tested on my PC with Intel i7-8700 processor and 16G of memory, running Ubuntu 18.04 LTS.
The structure of output directory ./test
should be like this:
test
├── align
│ ├── test.bam
│ ├── test.sorted.bam
│ └── test.sorted.bam.bai
├── circ
│ ├── test.ciri
│ ├── test.ciri.bed
│ ├── test_denovo.bam
│ ├── test_denovo.sorted.bam
│ ├── test_denovo.sorted.bam.bai
│ ├── test_index.1.ht2
│ ├── test_index.2.ht2
│ ├── test_index.3.ht2
│ ├── test_index.4.ht2
│ ├── test_index.5.ht2
│ ├── test_index.6.ht2
│ ├── test_index.7.ht2
│ ├── test_index.8.ht2
│ ├── test_index.fa
│ └── test_unmapped.sam
├── CIRIerror.log
├── test.bed
├── test.gtf
└── test.log
Then, you can check the main output in ./test/test.gtf
.
Differential expression analysis
Folder DE
contain the test dataset for differential expression analysis
cd ./test_data/DE
# Test for DE-score and DS-score calculation
CIRI_DE -n ctrl.gtf \
-c case.gtf \
-o CIRI_DE.tsv
# Test for RNase R correction
CIRI_DE -n ctrl_corrected.gtf \
-c case_corrected.gtf \
-o CIRI_DE_corrected.tsv