CIRIquant Comprehensive toolkit for circRNA quantification and differential expression analysis

Manual

Installation

NOTES: Only python2 is supported

Prerequisites

Softwares:

  • bwa

  • hisat2

  • stringtie

  • samtools >= 1.9 (older version of samtools may use deprecated parameters in sort and index commands)

Python packages:

  • PyYAML

  • argparse

  • pysam

  • numpy

  • scipy

  • scikit-learn

Install CIRIquant from source code

Please use the latest released version from GitHub or SourceForge

Use the setup.py for CIRIquant installation (clean install using virutalenv is highly recommended).

# create and activate virtual env
pip install virtualenv
virtualenv -p /path/to/your/python2/executable venv
source ./venv/bin/activate

# Install CIRIquant and its requirement automatically
tar zxvf CIRIquant.tar.gz
cd CIRIquant
python setup.py install

# Manual installation of required pacakges is also supported
pip install -r requirements.txt

The package should take approximately 40 seconds to install on a normal computer.

Install CIRIquant using pip

pip install CIRIquant

Usage 1: circRNA quantifcation

Basic options

Usage:
  CIRIquant [options] --config <config> -1 <m1> -2 <m2>

  <config>            Config file
  <m1>                Input mate1 reads (for paired-end data)
  <m2>                Input mate2 reads (for paired-end data)

Options (defaults in parentheses):

  -v                  Run in verbose mode
  -o, --out           Output directory (default: current directory)
  -e, --log           Specific log file (default: sample_prefix.log)
  -p, --prefix        Output sample prefix (default: input sample name)
  -t, --threads       Number of CPU threads to use (defualt: 4)
  -a, --anchor        Minimum anchor length for junction alignment (default: 5)
  -l, --library-type  Library type, 0: unstranded, 1: read1 match the sense strand, 2: read1 match the antisense strand (default: 0)

  --bed               User provided Back-Spliced Junction Site in BED format
  --circ              circRNA prediction results from other tools
  --tool              Tool name, required when --circ is specified ([CIRI2/CIRCexplorer2/DCC/KNIFE/MapSplice/UROBORUS/circRNA_finder/find_circ])

  --RNaseR            CIRIquant output file of RNase R data (required for RNase R correction)
  --bam               Specific hisat2 alignment bam file against reference genome
  --no-gene           Skip StringTie estimation of gene abundance

NOTE:

  • For now, –circ and –tool options support results from CIRI2 / CIRCexplorer2 / DCC / KNIFE / MapSplice / UROBORUS / circRNA_finder / find_circ

  • For tools like DCC and circRNA_finder, please manually remove duplicated circRNAs with same junction postion but have opposite strands.

  • Gene expression values are needed for normalization, do not use --no-gene if you need to run DE analysis afterwards.

Example config

A YAML-formated config file is needed for CIRIquant to find software and reference needed. A valid example of config file is demonstrated below.

// Example of config file
name: hg19
tools:
  bwa: /home/zhangjy/bin/bwa
  hisat2: /home/zhangjy/bin/hisat2
  stringtie: /home/zhangjy/bin/stringtie
  samtools: /home/zhangjy/bin/samtools

reference:
  fasta: /home/zhangjy/Data/database/hg19.fa
  gtf: /home/zhangjy/Data/database/gencode.v19.annotation.gtf
  bwa_index: /home/zhangjy/Data/database/hg19/_BWAtmp/hg19
  hisat_index: /home/zhangjy/Data/database/hg19/_HISATtmp/hg19
Key Description
name the name of config file
bwa the path of bwa
hisat2 the path of hisat2
stringtie the path of stringite
samtools the path of samtools, samtools version below 1.3.1 is not supported
fasta reference genome fasta, a fai index by samtools faidx is also needed under the same directory
gtf annotation file of reference genome in GTF/GFF3 format
bwa_index prefix of BWA index for reference genome
hisat_index prefix of HISAT2 index for reference genome

For quantification of user-provided circRNAs, a list of junction sites in bed format is required, the 4th column must be in “chrom:start|end” format. For example:

chr1    10000   10099   chr1:10000|10099    .   +
chr1    31000   31200   chr1:31000|31200    .   -

Example Usage

Quantify circRNAs using provided BED format input

CIRIquant -t 4 \
          -1 ./test_1.fq.gz \
          -2 ./test_2.fq.gz \
          --config ./chr1.yml \
          -o ./test \
          -p test \
          --bed your_circRNAs.bed

Quantify circRNAs using results from other tools

For example, if you have find_circ results of predicted circRNAs.

CIRIquant -t 4 \
          -1 ./test_1.fq.gz \
          -2 ./test_2.fq.gz \
          --config ./chr1.yml \
          -o ./test \
          -p test \
          --circ find_circ_results.txt \
          --tool find_circ

Output format

The main output of CIRIquant is a GTF file, that contains detailed information of BSJ and FSJ reads of circRNAs and annotation of circRNA back-spliced regions in the attribute columns

Description of each columns’s value

column name description
1 chrom chromosome / contig name
2 source CIRIquant
3 type circRNA
4 start 5' back-spliced junction site
5 end 3' back-spliced junction site
6 score CPM of circRNAs (#BSJ / #Mapped reads)
7 strand strand information
8 . .
9 attributes attributes seperated by semicolon

The attributes containing several pre-defined keys and values:

key description
circ_id name of circRNA
circ_type circRNA types: exon / intron / intergenic
bsj number of bsj reads
fsj number of fsj reads
junc_ratio circular to linear ratio: 2 * bsj / ( 2 * bsj + fsj)
rnaser_bsj number of bsj reads in RNase R data (only when --RNaseR is specificed)
rnaser_fsj number of fsj reads in RNase R data (only when --RNaseR is specificed)
gene_id ensemble id of host gene
gene_name HGNC symbol of host gene
gene_type type of host gene in gtf file

Usage 2: RNase R effect correction

When you have both RNase R treated and untreated samples, CIRIquant can estimate the before-treatment expression levels of circRNAs detected in RNase R data.

In order to remove RNase R treatment effect, two steps are needed:

  1. Run CIRIquant with RNase R treated sample.

  2. Run CIRIquant with untreaded total RNA sample, specific --RNaseR option using the output gtf file in Step1

Then, CIRIquant will output estimated expression levels of circRNAs detected in RNaseR data, and the header lines will include additional information of RNase R treatment effciency.

Example usage

# Step1. Run CIRIquant with RNase R treated data
CIRIquant --config ./hg19.yml \
          -1 ./RNaseR_treated_1.fq.gz \
          -2 ./RNaseR_treated_2.fq.gz \
          --no-gene \
          -o ./RNaseR_treated \
          -p RNaseR_treated \
          -t 6

# Step2. Run CIRIquant with untreated total RNA
CIRIquant --config ./hg19.yml \
          -1 ./TotalRNA_1.fq.gz \
          -2 ./TotalRNA_2.fq.gz \
          -o ./TotalRNA \
          -p TotalRNA \
          -t 6 \
          --RNaseR ./RNaseR_treated/RNaseR_treated.gtf

Usage 3: Differential expression analysis

Study without biological replicate

For sample without replicate, the differential expression & differential splicing analysis is performed using CIRI_DE

Usage:
  CIRI_DE [options] -n <control> -c <case> -o <out>

  <control>         CIRIquant result of control sample
  <case>            CIRIquant result of treatment cases
  <out>             Output file

Options (defaults in parentheses):

  -p                p value threshold for DE and DS score calculation (default: 0.05)
  -t                numer of threads (default: 4)

Example usage:
  CIRI_DE -n control.gtf -c case.gtf -o CIRI_DE.tsv

The output format CIRI_DE is in the format below:

column name description
1 circRNA_ID circRNA identifier
2 Case_BSJ number of BSJ reads in case
3 Case_FSJ number of FSJ reads in case
4 Case_Ratio junction ratio in case
5 Ctrl_BSJ number of BSJ reads in control
6 Ctrl_FSJ number of FSJ reads in control
7 Ctrl_Ratio junction ratio in control
8 DE_score differential expression score
9 DS_score differential splicing score

Study with biological replicates

For study with biological replicates, a customed analysis pipeline of edgeR is recommended and we provide prep_CIRIquant to generate matrix of circRNA expression level / junction ratio and CIRI_DE_replicate for DE analysis

Step1: Prepare CIRIquant output files

One should provide a text file listing sample information and path to CIRIquant output GTF files

CONTROL1 ./c1/c1.gtf C 1
CONTROL2 ./c2/c2.gtf C 2
CONTROL3 ./c3/c3.gtf C 3
CASE1 ./t1/t1.gtf T 1
CASE2 ./t2/t2.gtf T 2
CASE3 ./t3/t3.gtf T 3

The first three columns is required by default. For paired samples, you could also add a column of subject name.

column description
1 sample name
2 path to CIRIquant output gtf
3 group ("C" for control, "T" for treatment)
4 subject (optional, only for paired samples)

Note: If you are planning to use CIRI_DE for differential expression, then group name in column 3 must be either “C” or “T”.

Then, run prep_CIRIquant to summarize the circRNA expression profile in all samples

Usage:
  prep_CIRIquant [options]

  -i                the file of sample list
  --lib             where to output library information
  --circ            where to output circRNA annotation information
  --bsj             where to output the circRNA expression matrix
  --ratio           where to output the circRNA junction ratio matrix

Example:
  prep_CIRIquant -i sample.lst \
                 --lib library_info.csv \
                 --circ circRNA_info.csv \
                 --bsj circRNA_bsj.csv \
                 --ratio circRNA_ratio.csv

These count matrices (CSV files) can then be imported into R for use by DESeq2 and edgeR (using the DESeqDataSetFromMatrix and DGEList functions, respectively).

Step2: Prepare StringTie output

The output of StringTie should locate under output_dir/gene/prefix_out.gtf. You need to use prepDE.py from stringTie to generate the gene count matrix for normalization.

For example, one can provide a text file sample_gene.lst containing sample IDs and path to StringTie outputs:

CONTROL1 ./c1/gene/c1_out.gtf
CONTROL2 ./c2/gene/c2_out.gtf
CONTROL3 ./c3/gene/c3_out.gtf
CASE1 ./t1/gene/t1_out.gtf
CASE2 ./t2/gene/t2_out.gtf
CASE3 ./t3/gene/t3_out.gtf

Then, run prepDE.py -i sample_gene.lst and use gene_count_matrix.csv generated under current working directory for further analysis.

Step3: Differential expression analysis

For differential analysis using CIRI_DE_replicate, you need to install a R environment and edgeR package from Bioconductor.

Usage:
  CIRI_DE_replicate [options]

  --lib             library information by CIRIquant
  --bsj             circRNA expression matrix
  --gene            gene expression matrix
  --out             output differential expression result

Example:
  CIRI_DE_replicate --lib  library_info.csv \
            --bsj  circRNA_bsj.csv \
            --gene gene_count_matrix.csv \
            --out  circRNA_de.tsv

Please be noted that the output results is unfiltered, and you could apply a more stringent filter on expression values to get a more convincing result.

Test data

Download test dataset

Test dataset can be downloaded from Github.

wget https://github.com/Kevinzjy/CIRIquant/releases/download/v0.2.0/test_data.tar.gz
tar zxvf test_data.tar.gz

circRNA quantification

Folder quant contain the test dataset for circRNA quantification.

1. Generate hisat2 and bwa index

cd ./test_data/quant
bwa index -a bwtsw -p chr1.fa chr1.fa
hisat2-build ./chr1.fa ./chr1.fa

2. Customize the configuration

Replace the path of bwa/hisat2/stringtie/samtools in chr1.yml with your own version.

3. Run test dataset

Test data set can be retrived under test_data/quant folder, you can replace the path of required software in the chr1.yml with your own version

CIRIquant -t 4 \
          -1 ./test_1.fq.gz \
          -2 ./test_2.fq.gz \
          --config ./chr1.yml \
          --no-gene \
          -o ./test \
          -p test

The demo dataset should take approximately 5 minutes on a personal computer. It has been tested on my PC with Intel i7-8700 processor and 16G of memory, running Ubuntu 18.04 LTS.

The structure of output directory ./test should be like this:

test
├── align
│   ├── test.bam
│   ├── test.sorted.bam
│   └── test.sorted.bam.bai
├── circ
│   ├── test.ciri
│   ├── test.ciri.bed
│   ├── test_denovo.bam
│   ├── test_denovo.sorted.bam
│   ├── test_denovo.sorted.bam.bai
│   ├── test_index.1.ht2
│   ├── test_index.2.ht2
│   ├── test_index.3.ht2
│   ├── test_index.4.ht2
│   ├── test_index.5.ht2
│   ├── test_index.6.ht2
│   ├── test_index.7.ht2
│   ├── test_index.8.ht2
│   ├── test_index.fa
│   └── test_unmapped.sam
├── CIRIerror.log
├── test.bed
├── test.gtf
└── test.log

Then, you can check the main output in ./test/test.gtf.

Differential expression analysis

Folder DE contain the test dataset for differential expression analysis

cd ./test_data/DE

# Test for DE-score and DS-score calculation
CIRI_DE -n ctrl.gtf \
        -c case.gtf \
        -o CIRI_DE.tsv

# Test for RNase R correction
CIRI_DE -n ctrl_corrected.gtf \
        -c case_corrected.gtf \
        -o CIRI_DE_corrected.tsv