- Data processing: screening, preprocessing, mapping, quantification and saturation analysis.
All RNA-seq data from normal tissues or cell lines of humans, rats, pigs and mice were downloaded from the SRA. Low-quality reads were filtered out in preprocessing steps implemented as Perl scripts. RNA-seq reads were mapped to the reference genome of the corresponding species with TopHat v2.0.9. The reference genomes for humans, rats, pigs and mice are hg19 (UCSC), rn4 (UCSC), Sus scrofa 10.2 (NCBI) and mm10 (UCSC), respectively. Gene/isoform assembly and quantification were performed using Cufflinks v2.1.1 with default parameters, and the RPKM (reads per kilobase per million mapped reads) values of genes, isoforms and exons were calculated.
To obtain complete information, the data sources for each tissue or cell line were screened manually. For a tissue or cell line with multiple experiments, we selected the best one as the representative experiment, considering sequencing quality, degree of saturation and experimental design; the other data are also shown in the MTD as a supplement. For an experiment with technical replicates, we ranked the replicates by sequencing quality and degree of saturation, chose the best one as the main source for describing the expression results of that experiment, and kept the others as supplements.
The saturation analysis showed that the following RNA-seq data were unsaturated: in humans, B cell, bronchial epithelial cells, endometrial stromal, epithelial, foreskin fibroblast, macrophage and T cell; in mice, bone marrow, caudal epididymis, olfactory, olfactory epithelium, placenta, skin, uterus, vas deferens, MEF cells, sperm and T cell; and in pigs, blood, heart, liver, muscle, testis and sperm.
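For readers unfamiliar with the unit, the RPKM value mentioned above can be illustrated with the textbook definition. This is only a minimal sketch: Cufflinks estimates abundances with its own statistical model, so the function below is not the MTD pipeline. The function name `rpkm` and the numbers in the example are hypothetical.

```python
def rpkm(read_count: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """Textbook RPKM: reads per kilobase of feature per million mapped reads.

    Illustration only; not the abundance estimation performed by Cufflinks.
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    # reads / (length in kb) / (mapped reads in millions)
    # = reads * 1e9 / (length in bp * mapped reads)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads on a 2,000 bp gene,
# in a library with 20 million mapped reads.
example_value = rpkm(500, 2_000, 20_000_000)
print(example_value)
```

Dividing by both the feature length and the library size is what makes values comparable across genes of different lengths and across experiments of different sequencing depth.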
- What does the nomenclature of locus-specific transcripts mean?
Example: C10PT0163001-SPIP-LHXXX01
The nomenclature of locus-specific transcripts consists of three parts. Part 1, corresponding to positions 1-12, describes the transcript itself. Position 1: C means a coding transcript, and N means a noncoding transcript. Positions 2-3 denote the chromosome on which the transcript is located. Specifically, chr1 to chr9 are labelled 01 to 09, and chrX and chrY are labelled 0X and 0Y. Position 4 is the chromosomal location of the transcript: P or Q indicates the p-arm or q-arm, and 0 means that the transcript resides in the centromere or telomere region. Position 5 is the transcriptional direction: T indicates a direction towards the telomere, C a direction towards the centromere, and 0 means that the transcript resides in the centromere region. Positions 6-9 give the bin in which the transcript resides, with bins of 100 kb counted from the centromere to the telomere. Positions 10-12 number the transcripts within a bin, ordered by location from the centromere to the telomere according to the first-come-first-name rule.
Parts 2 and 3 give information on the gene in which the transcript resides. In position 13, S means the sense strand, and A means the antisense strand. If the transcript overlaps a protein-coding gene and lies on the same strand, position 13 is S; if it lies on the opposite strand, position 13 is A. If the transcript does not overlap any protein-coding gene, the nearest protein-coding gene is used to decide whether it is on the sense or antisense strand. Positions 14-16 hold the first three characters of the gene symbol of the transcript's gene. If the full gene symbol has fewer than three characters, the remaining positions are filled with 0.
In position 17, L means a long transcript (>500 bp), M means a medium transcript (100 bp to 500 bp), and S means a short transcript (20 bp to 100 bp). In position 18, H indicates a highly expressed transcript (RPKM >100), M a moderately expressed transcript (RPKM from 10 to 100), and L a transcript with low expression (RPKM ≤10). In position 19, P means an alternative promoter, E an alternative exon, and A an alternative poly(A). Positions 20-21 indicate the number of transcriptional variants.
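The position scheme above can be sketched as a small decoder. This is an illustrative sketch rather than MTD code: the class and field names are hypothetical, and positions 10-12 are read as a serial number within the bin, which is an assumption. Only the first two characters of part 3 (length and expression class) are decoded; the rest is kept as raw text, because the example ID carries more characters there than the position description accounts for.

```python
from dataclasses import dataclass

# Code tables taken from the description above.
TRANSCRIPT_TYPE = {"C": "coding", "N": "noncoding"}
LOCATION = {"P": "p-arm", "Q": "q-arm", "0": "centromere or telomere region"}
DIRECTION = {"T": "towards telomere", "C": "towards centromere", "0": "centromere region"}
STRAND = {"S": "sense", "A": "antisense"}
LENGTH_CLASS = {"L": "long", "M": "medium", "S": "short"}
EXPRESSION_CLASS = {"H": "high", "M": "moderate", "L": "low"}


@dataclass(frozen=True)
class TranscriptName:  # hypothetical container, not an MTD data structure
    transcript_type: str
    chromosome: str
    location: str
    direction: str
    bin_index: int        # positions 6-9 (100 kb bins)
    serial_in_bin: int    # positions 10-12 (assumed: serial number within the bin)
    strand: str
    gene_prefix: str      # positions 14-16, padding zeros left untouched
    length_class: str
    expression_class: str
    part3_rest: str       # undecoded remainder of part 3


def _decode(table: dict, code: str, field: str) -> str:
    try:
        return table[code]
    except KeyError:
        raise ValueError(f"unexpected {field} code {code!r}") from None


def parse_transcript_name(name: str) -> TranscriptName:
    # Fixed offsets are used instead of splitting on "-" so that a hyphen
    # inside the gene-symbol characters cannot shift the fields.
    if len(name) < 20 or name[12] != "-" or name[17] != "-":
        raise ValueError(f"not a locus-specific transcript name: {name!r}")
    part1, part2, part3 = name[:12], name[13:17], name[18:]
    chromosome_code = part1[1:3]
    return TranscriptName(
        transcript_type=_decode(TRANSCRIPT_TYPE, part1[0], "transcript type"),
        chromosome="chr" + (chromosome_code.lstrip("0") or "0"),
        location=_decode(LOCATION, part1[3], "location"),
        direction=_decode(DIRECTION, part1[4], "direction"),
        bin_index=int(part1[5:9]),
        serial_in_bin=int(part1[9:12]),
        strand=_decode(STRAND, part2[0], "strand"),
        gene_prefix=part2[1:4],
        length_class=_decode(LENGTH_CLASS, part3[0], "length class"),
        expression_class=_decode(EXPRESSION_CLASS, part3[1], "expression class"),
        part3_rest=part3[2:],
    )


parsed = parse_transcript_name("C10PT0163001-SPIP-LHXXX01")
print(parsed)
```

For the example ID, the decoder reports a coding transcript on chr10, p-arm, transcribed towards the telomere, in bin 163, on the sense strand of a gene whose symbol starts with PIP, and classed as long and highly expressed.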
- How to browse for information in the MTD?
You can browse by chromosome, by region or by pathway. On the 'browse by chromosome' page, click the human or mouse chromosome of interest and drag the mouse over the chromosome image; a rectangle will mark the selected region. Then select the tissues/cell lines and choose the corresponding experiment of interest. If the selected region contains fewer than 800 items, the gene expression levels and read coverage images of the genes in the experiments of the selected tissues/cell lines are shown. If the selected region contains more than 800 and fewer than 1800 items, a summary table of all gene expression information is shown. If the selected region contains more than 1800 items, the page warns that the queried region contains too many items to load. This function is not offered for pigs and rats because their genomes are poorly annotated and assembled. On the 'browse by region' page, enter a chromosomal region, such as chr2:2043960..2045540, and then choose a data source to browse the structures and read coverage of the genes in the queried region. On the 'browse by pathway' page, you can browse the expression levels of genes that share a KEGG pathway in the experiments of the selected tissues/cell lines. Each resulting table is sortable, which makes it easy to find genes with high or low expression levels in the pathway of interest.
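The region format and the item thresholds described above can be sketched in a few lines. This is an illustration under assumptions, not the MTD implementation: the function names and returned labels are hypothetical, and because the text does not say what happens at exactly 800 or 1800 items, the sketch assigns both boundary values to the summary table.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int


# Matches inputs such as "chr2:2043960..2045540".
REGION_PATTERN = re.compile(r"^(chr[0-9A-Za-z]+):(\d+)\.\.(\d+)$")


def parse_region(text: str) -> Region:
    match = REGION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"expected a region like chr2:2043960..2045540, got {text!r}")
    chromosome = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("region start must not exceed region end")
    return Region(chromosome, start, end)


def display_mode(item_count: int) -> str:
    """Pick what the 'browse by chromosome' page would show for a selection."""
    if item_count < 800:
        return "expression levels and read coverage images"
    if item_count <= 1800:  # assumption: exactly 800 and exactly 1800 fall here
        return "summary table"
    return "warning: too many items to load"


print(parse_region("chr2:2043960..2045540"))
print(display_mode(500), "|", display_mode(1200), "|", display_mode(2500))
```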
- How to search for information in the MTD?
The 'search' and 'analysis' interfaces both provide search functions. On the 'search' interfaces, you can set different filters for genes or isoforms to obtain different types of results. On the 'analysis' interfaces, you can search for gene expression signatures within a species or across species. On the 'intraspecies' interface, enter the gene symbol or RefSeq ID of the gene of interest, then choose the tissues/cell lines of interest and the corresponding experiments to compare expression patterns across tissues/cell lines within the species. On the 'interspecies' interface, enter the gene symbol or RefSeq ID of the gene of interest and click the 'find homologue' button to produce a bar graph describing the expression patterns of the orthologous genes in the other three species. You can also click the 'show details' button to obtain further information. If you are unsure which gene symbols or RefSeq IDs to use in a search, you can look them up in the 'Gene Information' files in the 'Other data' section of the 'download' page. Note that genes in unclear chromosomal regions were excluded from the database.
- How to download information in the MTD?
Use the 'download' page, or click the 'text result' link below each table to export the corresponding result. In addition, read coverage plots and structure plots of genes/isoforms can be exported on the 'browse by region' page.
- How to cite the MTD?
Xin Sheng, Jiayan Wu, Qianqian Sun, Xue Li, Feng Xian, Manman Sun, Wan Fang, Meili Chen, Jun Yu, and Jingfa Xiao
MTD: a mammalian transcriptomic database to explore gene expression and regulation
Brief Bioinform 2016: bbv117v1-bbv117.