LncRNAs (Long non-coding RNAs) are closely associated with human health and diseases. While the human genome transcribes hundreds of thousands of lncRNAs, only a small part of them have been experimentally studied. The comprehensive annotation of human lncRNAs is of great significance in navigating the functional landscape of the human genome and deepening the understanding of the multi-featured RNA world.
To facilitate the discovery of lncRNAs’ biological functions, we developed LncBook, which is devoted to the integration and multi-omics annotation of human lncRNAs. It provides a comprehensive and high-quality list of human lncRNAs, enriches these lncRNAs with essential multi-omics signatures, and identifies featured lncRNAs in diseases and diverse biological contexts.
The first version was published in the NAR 2019 Database Issue. Over the past several years, we have significantly updated, expanded and enriched LncBook. The updated version, LncBook 2.0, integrates more human lncRNAs, characterizes diverse molecular signatures of these lncRNAs with more abundant data and more stringent criteria, and identifies a list of high-confidence lncRNAs that are most likely related to human health and diseases.
Compared with the first version, LncBook 2.0 has significant changes and improvements as follows:
Building on LncBook v1 (which integrated 4 resources), LncBook 2.0 integrated lncRNAs from another 5 resources, including RefLnc, GENCODE v33, CHESS v2.2, FANTOM-CAT (lv4_stringent) and BIGTranscriptome.
To obtain a high-confidence lncRNA dataset, a set of strict criteria was adopted by considering redundancy, mapping error, pre-mRNA, small RNA fragment, miRNA precursor, polymerase run-on, incomplete transcript, length, boundary, strand and coding potential.
LncRNA transcripts are assigned into the same gene if they share exonic sequences in the same strand.
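The grouping rule above can be sketched as a small clustering routine. This is a minimal illustration, not the LncBook pipeline itself: the transcript dictionary layout (`id`, `chrom`, `strand`, `exons`) is hypothetical, coordinates are assumed to be half-open, and the sketch assumes grouping is transitive (if A shares an exon with B, and B with C, all three go into one gene).

```python
from collections import defaultdict


def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def share_exon(t1, t2):
    """True if two transcripts lie on the same chromosome and strand
    and at least one pair of their exons overlaps."""
    if (t1["chrom"], t1["strand"]) != (t2["chrom"], t2["strand"]):
        return False
    return any(overlaps(e1, e2) for e1 in t1["exons"] for e2 in t2["exons"])


def group_transcripts(transcripts):
    """Cluster transcripts into genes with union-find.

    Returns a sorted list of sorted transcript-ID lists, one list per gene.
    """
    parent = {t["id"]: t["id"] for t in transcripts}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Pairwise comparison is quadratic; fine for a sketch, not for a genome.
    for i, t1 in enumerate(transcripts):
        for t2 in transcripts[i + 1:]:
            if share_exon(t1, t2):
                parent[find(t1["id"])] = find(t2["id"])

    genes = defaultdict(list)
    for t in transcripts:
        genes[find(t["id"])].append(t["id"])
    return sorted(sorted(ids) for ids in genes.values())
```

For genome-scale input, one would sort exons by coordinate and sweep once per chromosome and strand instead of comparing all pairs.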
Based on their genomic locations with respect to protein-coding genes, we classified lncRNAs into seven groups: Intergenic, Intronic (S), Intronic (AS), Overlapping (S), Overlapping (AS), Sense, and Antisense. "S" in brackets indicates that the lncRNA is on the same strand as the protein-coding RNA, and "AS" indicates that the lncRNA is on the antisense strand.
We characterized conservation features of human lncRNA genes across 40 evolutionarily related animals based on the UCSC genome alignment results (LiftOver files). We analyzed the alignment segments between human and the 40 species (as distant as zebrafish) to obtain the evolutionary information of human lncRNAs, determine their ages, and identify their homologous protein-coding/non-coding genes.
We measured sequence conservation primarily by the sequence similarity of the alignments. To exclude the influence of genetic relationship and the alignment length, we used the alignment of lncRNA introns as control, and assessed conservation levels by fitting quantiles.
Q50 represents medium level of conservation; similarity of the alignment region is higher than the median level of lncRNA introns. Q99 means high level of conservation; similarity of the alignment region is higher than 99% of the aligned segments of lncRNA introns (such as TUG1, MALAT1).
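The Q50/Q99 labels can be illustrated with a simple empirical-quantile comparison. This is a simplified sketch rather than the fitting procedure used by LncBook: it assumes the control list already contains intron alignment similarities (in percent) from the same species and of comparable alignment length, and the function names are hypothetical.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an ascending list, with q in [0, 1]."""
    if not sorted_values:
        raise ValueError("control set is empty")
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def conservation_level(similarity, intron_similarities):
    """Label an alignment by comparing its similarity with the intron control.

    Returns "Q99" if it exceeds the 99th percentile of the control,
    "Q50" if it exceeds the median, and "below Q50" otherwise.
    """
    control = sorted(intron_similarities)
    if similarity > quantile(control, 0.99):
        return "Q99"
    if similarity > quantile(control, 0.50):
        return "Q50"
    return "below Q50"
```

In practice the control distribution would differ per species and per alignment length, so the thresholds would be looked up for the matching stratum rather than computed from one pooled list.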
We used the following criteria to identify whether a lncRNA has homologous sequence/gene in an animal:
The age of a lncRNA gene represents the earliest occurrence time of the gene sequence. There are 17 time nodes, including "Homo" (human specific), "Hominini", "Homininae", "Hominidae", "Hominoidea", "Catarrhini", "Simiiformes", "Haplorrhini", "Primates", "Euarchontoglires", "Boreoeutheria", "Eutheria", "Theria", "Mammalia", "Amniota", "Tetrapoda" and "Euteleostomi".
Note that age determination is based on the parsimony rule. It therefore does not rule out that a gene is conserved in species we have not included, in which case its true age would be older than our estimate.
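One way to read the parsimony rule is: a gene is as old as the deepest node whose species still carries a homolog. The sketch below follows that reading. The node list comes from the text above; the species-to-node mapping is an illustrative assumption, not LncBook's actual species table.

```python
# Time nodes ordered from youngest (human specific) to oldest.
AGE_NODES = [
    "Homo", "Hominini", "Homininae", "Hominidae", "Hominoidea",
    "Catarrhini", "Simiiformes", "Haplorrhini", "Primates",
    "Euarchontoglires", "Boreoeutheria", "Eutheria", "Theria",
    "Mammalia", "Amniota", "Tetrapoda", "Euteleostomi",
]

# Hypothetical mapping: node of the last common ancestor shared with human.
SPECIES_TO_NODE = {
    "chimpanzee": "Hominini",
    "mouse": "Euarchontoglires",
    "chicken": "Amniota",
    "zebrafish": "Euteleostomi",
}


def gene_age(species_with_homolog, species_to_node=SPECIES_TO_NODE):
    """Return the oldest node among species that have a homologous sequence.

    A gene with no homolog in any sampled species is reported as "Homo".
    """
    oldest = 0
    for species in species_with_homolog:
        oldest = max(oldest, AGE_NODES.index(species_to_node[species]))
    return AGE_NODES[oldest]
```

Because only sampled species contribute, adding a more distant species with a homolog can only push the estimated age older, which matches the caveat above.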
LncBook collected variants from COSMIC, ClinVar and GWAS Catalog, identified disease/trait-associated variants, annotated corresponding disease/trait information and mapped them to the lncRNA loci.
LncBook curated high-quality variants from COSMIC, ClinVar and GWAS Catalog.
Disease-associated variants were derived from COSMIC and ClinVar and trait-associated variants were derived from GWAS Catalog. For COSMIC, we defined variants with a FATHMM-MKL score > 0.7 as disease-associated (pathogenic) variants. Meanwhile, variants in ClinVar tagged as "Pathogenic", "Affects" or "Risk factor" were considered as disease-associated variants.
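The two filtering rules can be written as a single predicate. This is a sketch under stated assumptions: the record fields (`source`, `fathmm_mkl_score`, `clinical_significance`) are hypothetical names, and ClinVar tags are matched as whole, case-insensitive strings, which may be stricter than the matching actually used.

```python
# ClinVar tags treated as disease-associated, per the text above.
CLINVAR_DISEASE_TAGS = {"pathogenic", "affects", "risk factor"}

# COSMIC variants must score strictly above this FATHMM-MKL cutoff.
FATHMM_MKL_CUTOFF = 0.7


def is_disease_associated(variant):
    """Apply the source-specific rule to one variant record (a dict)."""
    source = variant["source"]
    if source == "COSMIC":
        return variant["fathmm_mkl_score"] > FATHMM_MKL_CUTOFF
    if source == "ClinVar":
        tag = variant["clinical_significance"].strip().lower()
        return tag in CLINVAR_DISEASE_TAGS
    # Other sources (e.g. GWAS Catalog) supply trait associations instead.
    return False
```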
To unify disease names and traits, we mapped ClinVar disease names and GWAS traits to Human Phenotype Ontology and Experimental Factor Ontology respectively.
Variants were assigned to lncRNAs using the BEDTools intersect function.
LncBook collected 16 publicly accessible bisulfite-seq datasets from TCGA and GEO, covering 16 diseases (14 cancers and 2 neurodevelopmental disorders) with both case and control samples. Disease-associated lncRNA genes were identified based on DNA methylation level differentiation. The following table details the datasets used.
Biological Context | Source | Project ID | Disease Name (Short Name) | Sample Number |
---|---|---|---|---|
Neurodevelopmental Disorder | GEO | GSE119980 | Rett syndrome (RTT) | 12 (6 cases, 6 controls) |
Neurodevelopmental Disorder | GEO | GSE109875 | Autism spectrum disorders | 16 (10 cases, 6 controls) |
Cancer | GEO | GSE116229 | Acute Lymphoblastic Leukemia (ALL) | 38 (31 cases, 7 controls) |
Cancer | GEO | GSE135869 | Acute Myeloid Leukemia (AML) | 15 (9 cases, 6 controls) |
Cancer | GEO | GSE113336 | Chronic Lymphocytic Leukemia (CLL) | 18 (11 cases, 7 controls) |
Cancer | GEO | GSE149608 | Esophageal Squamous Cell Carcinoma (ESCC) | 19 (10 cases, 9 controls) |
Cancer | GEO | GSE142241 | Medulloblastoma (MB) | 12 (8 cases, 4 controls) |
Cancer | GEO | GSE79799 | Liver cancer | 6 (3 cases, 3 controls) |
Cancer | TCGA | TCGA-BLCA | Bladder Urothelial Carcinoma (BLCA) | 7 (6 cases, 1 control) |
Cancer | TCGA | TCGA-BRCA | Breast Invasive Carcinoma (BRCA) | 6 (5 cases, 1 control) |
Cancer | TCGA | TCGA-COAD | Colon Adenocarcinoma (COAD) | 3 (2 cases, 1 control) |
Cancer | TCGA | TCGA-LUAD | Lung Adenocarcinoma (LUAD) | 6 (5 cases, 1 control) |
Cancer | TCGA | TCGA-LUSC | Lung Squamous Cell Carcinoma (LUSC) | 5 (4 cases, 1 control) |
Cancer | TCGA | TCGA-READ | Rectum Adenocarcinoma (READ) | 3 (2 cases, 1 control) |
Cancer | TCGA | TCGA-STAD | Stomach Adenocarcinoma (STAD) | 5 (4 cases, 1 control) |
Cancer | TCGA | TCGA-UCEC | Uterine Corpus Endometrial Carcinoma (UCEC) | 6 (5 cases, 1 control) |
Accumulating evidence has shown that lncRNAs are closely associated with a variety of diseases, including cancers and brain diseases. LncBook annotated the DNA methylation profiles of lncRNA genes in both promoter and body regions, with case and control samples, across the 16 diseases. Here, we defined the region from −1500 bp relative to the transcription start site as the promoter region and calculated the average methylation level of all CpG sites in the promoter or body region.
DNA methylation levels in both promoter and body regions were compared between case and control samples.
Considering the small sample sizes of some datasets, different criteria were used.
LncBook collected lncRNAs’ expression capacities across 9 biological contexts, expression level distributions across 337 biological conditions, and featured genes from LncExpDB.
All expressed genes, defined as those with a maximum expression value higher than 1.0 TPM, are ranked within each biological condition (time point/stage/tissue/cell/component/processing). Specifically, genes with expression values greater than the upper quantile are classified as “H” (high expression level), those less than the lower quantile as “L” (low expression level), and the remaining as “M” (medium expression level). High-capacity lncRNAs (HCL) are genes classified as “H” in at least one condition, low-capacity lncRNAs (LCL) are those classified as “L” in all conditions, and the remaining genes are medium-capacity lncRNAs (MCL). Note that as more biological conditions are covered, an LCL may be reclassified as MCL or HCL, and an MCL as HCL.
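The two-step classification (per-condition level, then per-gene capacity) can be sketched as follows. The lower and upper cutoffs are passed in as parameters because the text does not state which quantiles are used; the function names are hypothetical.

```python
def expression_level(tpm, lower_cut, upper_cut):
    """Classify one expression value within one biological condition."""
    if tpm > upper_cut:
        return "H"
    if tpm < lower_cut:
        return "L"
    return "M"


def expression_capacity(levels):
    """Summarize a gene's per-condition levels into HCL, MCL or LCL.

    HCL: "H" in at least one condition.
    LCL: "L" in every condition.
    MCL: everything else.
    """
    levels = list(levels)
    if not levels:
        raise ValueError("at least one condition is required")
    if "H" in levels:
        return "HCL"
    if all(level == "L" for level in levels):
        return "LCL"
    return "MCL"
```

The rule is monotonic in the number of conditions: adding a condition can move a gene from LCL to MCL or HCL, or from MCL to HCL, but never in the other direction.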
Featured genes are specifically expressed in a certain cell line/tissue, differentially expressed in the context of cancer or virus infection, enriched in a subcellular compartment, dynamically expressed during cell differentiation or embryo/organ development, or periodically expressed with circadian rhythm. See more details in LncExpDB.
LncBook integrated small proteins identified by Ribo-seq and mass spectrometry data from the SmProt database. Small proteins were mapped to lncRNAs according to their genomic locations using the BEDTools intersect function. A small protein is associated with a lncRNA in our database only if all of its genomic location blocks fall entirely within the exons of that lncRNA's transcripts.
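The containment rule can be illustrated in plain Python. This is not the BEDTools command used by LncBook, only a sketch of the same condition: intervals are assumed to be `(start, end)` pairs on the same chromosome, and strand is not checked because the text does not say whether it is.

```python
def block_in_exons(block, exons):
    """True if a genomic block lies entirely inside at least one exon."""
    start, end = block
    return any(ex_start <= start and end <= ex_end for ex_start, ex_end in exons)


def protein_maps_to_transcript(blocks, exons):
    """True if every block of a small protein falls within the transcript's exons.

    A protein with no blocks is never mapped.
    """
    return bool(blocks) and all(block_in_exons(b, exons) for b in blocks)
```

A block that spans an exon boundary, even by one base, fails the test, so partially overlapping small proteins are left unassigned.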
The lncRNA-miRNA interactions were predicted by miRanda, TargetScan and RNAhybrid, with the miRNA and lncRNA sequences downloaded from miRBase and LncBook, respectively. Only the high-confidence interactions predicted by all three tools (miRanda and TargetScan: default parameters; RNAhybrid: -b 1 -e 20 -f 8,12 -u 1 -v 1 -s 3utr_human) were collected in LncBook.
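The consensus step amounts to a three-way set intersection. The sketch below assumes each tool's output has already been parsed into `(lncRNA ID, miRNA ID)` pairs and that agreement is judged at the pair level rather than per binding site; both are assumptions, and the IDs in the usage example are made up.

```python
def high_confidence_interactions(miranda, targetscan, rnahybrid):
    """Keep only lncRNA-miRNA pairs predicted by all three tools.

    Each argument is an iterable of (lncrna_id, mirna_id) tuples.
    """
    return set(miranda) & set(targetscan) & set(rnahybrid)
```

For example, with hypothetical identifiers:

```python
consensus = high_confidence_interactions(
    [("LNC_A", "miR-1"), ("LNC_A", "miR-2")],
    [("LNC_A", "miR-1"), ("LNC_B", "miR-3")],
    [("LNC_A", "miR-1"), ("LNC_A", "miR-2")],
)
# consensus contains only ("LNC_A", "miR-1")
```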
LncBook 2.0 assessed the confidence of lncRNA genes and classified all lncRNAs into 4 levels, considering expression capacity and the functional features inferred from expression and methylation profiling.
The Genes section provides confidence levels, various molecular signatures, disease/trait associations of lncRNA genes, as well as the basic information including gene symbol, length, genomic location, etc.
You can search for high-confidence lncRNA genes, or obtain all disease-associated lncRNAs or highly expressed genes.
The Conservation section provides conservation features of human lncRNA genes across 40 animals, as well as alignment details in each species. You can trace the original gene/sequence of a certain human lncRNA gene, obtain the homologous protein-coding/ncRNA genes in different species, and explore the most conserved or human-specific lncRNA genes.
In the "Variation" page, you can browse each noncoding variant's genomic location, associated lncRNA, functional effect, clinical significance, and associated disease and trait information. You can search for entries of interest by gene ID, dbSNP ID, functional effect, disease name or trait, and download these entries by clicking the download button.
You can access the DNA methylation information in the “Methylation” page, which includes the lncRNA genes whose promoter regions show differential DNA methylation relative to normal samples in at least one disease. The “Methylation filter” tab allows you to narrow down the results by lncRNA, disease, or hyper/hypomethylation status. Additionally, for each lncRNA, the methylation profiles of promoter and body regions in case and control samples across all 16 diseases are shown in the gene page and can be downloaded there.
In the "Expression Capacity" page, you can browse each lncRNA's expression capacity in various biological contexts. You can explore high-capacity lncRNAs in one or multiple contexts using the categories in the “Expression Filter”. All entries can be downloaded by clicking the download button. Furthermore, the “Chart” visualizes the expression level distribution across all biological conditions. Clicking a gene ID directs you to the LncExpDB page, where you can view the expression profiles across different biological contexts.
In the "Small Protein" page, you can browse basic information on small proteins, including small protein ID, genomic locus, amino acid sequence and experimental evidence. You can search for entries of interest by gene ID, small protein ID or experimental evidence, and download these entries by clicking the download button.
You can view all the lncRNA-miRNA interactions in the “Interaction” page, which lists the binding start, binding end, sequence complementarity score and minimum free energy of each RNA duplex. The “Interaction filter” tab allows you to narrow down the results by the lncRNA or miRNA of interest. In addition, all interactions involving a given lncRNA are shown in its gene page and can be downloaded there.
LncBook 2.0: human long non-coding RNA integration and multi-omics annotation (In preparation)
LncBook: a curated knowledgebase of human long non-coding RNAs. Nucleic Acids Res 2019. [PMID=30329098]