LncBook 2.0: Integration and multi-omics annotation of human long non-coding RNAs

1. Introduction

1.1 What is LncBook?

Long non-coding RNAs (lncRNAs) are closely associated with human health and diseases. While the human genome transcribes hundreds of thousands of lncRNAs, only a small fraction of them have been experimentally studied. The comprehensive annotation of human lncRNAs is of great significance in navigating the functional landscape of the human genome and deepening the understanding of the multi-featured RNA world.

To facilitate the discovery of lncRNAs’ biological functions, we developed LncBook, which is devoted to the integration and multi-omics annotation of human lncRNAs. It provides a comprehensive and high-quality list of human lncRNAs, enriches these lncRNAs with essential multi-omics signatures, and identifies featured lncRNAs in diseases and diverse biological contexts.

The first version was published in the NAR 2019 Database Issue. Over the past several years, we have significantly updated, expanded and enriched LncBook. The updated version, LncBook 2.0, integrates more human lncRNAs, characterizes diverse molecular signatures of these lncRNAs with more abundant data and more stringent criteria, and identifies a list of high-confidence lncRNAs that are most likely related to human health and diseases.

1.2 Where does LncBook 2.0 improve?

Compared with the first version, LncBook 2.0 has significant changes and improvements as follows:

  • First, LncBook 2.0 integrates lncRNAs from 5 additional well-known resources on top of the 4 resources used in version 1. In total, it incorporates 119,722 new transcripts and 9,632 new genes, updates 21,305 genes, and provides a high-quality collection of 323,950 lncRNA transcripts and 95,243 genes by adopting strict criteria and quality control to discard questionable lncRNAs.
  • Second, its multi-omics annotations are significantly improved in both quantity and quality. In particular, it characterizes conservation features of human lncRNA genes across 40 evolutionarily related animals, integrates lncRNA-encoded small proteins from the SmProt database, enriches the expression and methylation annotations with more biological contexts, and predicts lncRNA-miRNA binding sites with more stringent criteria.
  • Last but not least, based on the above changes, LncBook 2.0 assesses the confidence of lncRNA genes and classifies all the lncRNAs into 4 levels considering expression capacity and the functional features inferred from expression and methylation profiling. It identifies 17,039 high-confidence lncRNA genes, which are highly expressed and most likely have important biological functions in human health and diseases.

    2. LncRNA integration and curation

    Building on LncBook v1 (which integrated 4 resources), LncBook 2.0 integrated lncRNAs from another 5 resources, including RefLnc, GENCODE v33, CHESS v2.2, FANTOM-CAT (lv4_stringent) and BIGTranscriptome.

    To obtain a high-confidence lncRNA dataset, a set of strict criteria was adopted by considering redundancy, mapping error, pre-mRNA, small RNA fragment, miRNA precursor, polymerase run-on, incomplete transcript, length, boundary, strand and coding potential.

    LncRNA transcripts are assigned into the same gene if they share exonic sequences in the same strand.
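
    The grouping rule above can be sketched as a clustering problem. This is a minimal illustration, not the LncBook pipeline: it assumes that "sharing exonic sequences" means an overlap of at least one base between exons on the same chromosome and strand, and the function names and the half-open (start, end) coordinate convention are hypothetical.

```python
from collections import defaultdict


def exons_overlap(exons_a, exons_b):
    """Return True if any exon in exons_a shares at least one base with exons_b.

    Exons are (start, end) tuples in half-open coordinates (an assumption).
    """
    return any(s1 < e2 and s2 < e1 for s1, e1 in exons_a for s2, e2 in exons_b)


def group_transcripts(transcripts):
    """Cluster transcripts into genes.

    transcripts: dict mapping transcript id -> (chrom, strand, [(start, end), ...])
    Returns a sorted list of sorted transcript-id lists, one list per gene.
    """
    parent = {t: t for t in transcripts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(transcripts)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            chrom_a, strand_a, exons_a = transcripts[a]
            chrom_b, strand_b, exons_b = transcripts[b]
            if chrom_a == chrom_b and strand_a == strand_b and exons_overlap(exons_a, exons_b):
                parent[find(a)] = find(b)

    genes = defaultdict(set)
    for t in ids:
        genes[find(t)].add(t)
    return sorted(sorted(members) for members in genes.values())
```

    Note that grouping is transitive in this sketch: two transcripts that do not overlap each other still land in the same gene if a third transcript overlaps both.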

    3. LncRNA classification

    Based on their genomic locations with respect to protein-coding genes, we classified lncRNAs into seven groups: Intergenic, Intronic (S), Intronic (AS), Overlapping (S), Overlapping (AS), Sense, and Antisense. "S" in brackets indicates that the lncRNAs are on the same strand as the protein-coding RNAs, and "AS" indicates that the lncRNAs are on the antisense strand of the protein-coding RNAs.

  • Intergenic: lncRNAs are transcribed from intergenic regions;
  • Intronic (S): lncRNAs are transcribed entirely from introns of protein-coding genes;
  • Intronic (AS): lncRNAs are transcribed from the antisense strand of protein-coding genes and their entire sequences are covered by introns of protein-coding genes;
  • Overlapping (S): lncRNAs that contain coding genes within an intron on the sense strand;
  • Overlapping (AS): lncRNAs that contain coding genes within an intron on the antisense strand;
  • Sense: lncRNAs are transcribed from the sense strand of protein-coding genes, and either the entire sequence of the lncRNA is covered by a protein-coding gene (Intronic lncRNAs are not included), or the entire sequence of a protein-coding gene is covered by the lncRNA (Overlapping lncRNAs are not included), or the lncRNA and the protein-coding gene partially intersect;
  • Antisense: lncRNAs are transcribed from the antisense strand of protein-coding genes, and either the entire sequence of the lncRNA is covered by a protein-coding gene (Intronic lncRNAs are not included), or the entire sequence of a protein-coding gene is covered by the lncRNA (Overlapping lncRNAs are not included), or the lncRNA and the protein-coding gene partially intersect.
  • See more details in LncRNAWiki
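
    The seven groups can be illustrated with a simplified pairwise classifier. This is a sketch under stated assumptions, not the LncBook implementation: it compares one lncRNA against a single protein-coding gene on the same chromosome, uses half-open (start, end) exon coordinates, and applies the checks in a fixed order (Intergenic, Intronic, Overlapping, then Sense/Antisense). How multiple neighbouring genes are prioritised is not described in the text and is not modelled here; all function names are hypothetical.

```python
def span(exons):
    """Genomic span (start, end) covered by a list of exons."""
    return min(s for s, _ in exons), max(e for _, e in exons)


def introns(exons):
    """Introns are the gaps between consecutive exons, sorted by position."""
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def contains(outer, inner):
    """True if interval inner lies entirely within interval outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def classify(lnc_strand, lnc_exons, gene_strand, gene_exons):
    """Classify a lncRNA relative to one protein-coding gene."""
    lnc_span, gene_span = span(lnc_exons), span(gene_exons)
    # No shared bases between the two loci.
    if lnc_span[1] <= gene_span[0] or gene_span[1] <= lnc_span[0]:
        return "Intergenic"
    tag = "S" if lnc_strand == gene_strand else "AS"
    # The whole lncRNA sits inside one intron of the coding gene.
    if any(contains(intron, lnc_span) for intron in introns(gene_exons)):
        return f"Intronic ({tag})"
    # The whole coding gene sits inside one intron of the lncRNA.
    if any(contains(intron, gene_span) for intron in introns(lnc_exons)):
        return f"Overlapping ({tag})"
    return "Sense" if tag == "S" else "Antisense"
```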

    4. Conservation annotation

    We characterized conservation features of human lncRNA genes across 40 evolutionarily related animals based on the UCSC genome alignment results (LiftOver files). We analyzed the alignment segments between human and the 40 species (as distant as zebrafish) to obtain the evolutionary information of human lncRNAs, determine their ages, and identify their homologous protein-coding/non-coding genes.

    4.1 Sequence conservation

    We measured sequence conservation primarily by the sequence similarity of the alignments. To exclude the influence of genetic relationship and the alignment length, we used the alignment of lncRNA introns as control, and assessed conservation levels by fitting quantiles.

    Q50 represents a medium level of conservation: the similarity of the alignment region is higher than the median level of lncRNA introns. Q99 represents a high level of conservation: the similarity of the alignment region is higher than 99% of the aligned segments of lncRNA introns (examples include TUG1 and MALAT1).
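
    The Q50 and Q99 levels can be illustrated with empirical quantiles. The sketch below is a simplification: it ranks one alignment against a single list of control similarities from lncRNA introns, whereas the quantile fitting described above also accounts for genetic relationship and alignment length, which is not reproduced here. The function name and the "below Q50" label are hypothetical.

```python
import bisect


def conservation_level(similarity, intron_similarities):
    """Assign a conservation level from the rank of an alignment similarity.

    intron_similarities: control similarities of aligned lncRNA introns
    (assumed to come from the same species comparison).
    """
    ranked = sorted(intron_similarities)
    # Fraction of control alignments with strictly lower similarity.
    below = bisect.bisect_left(ranked, similarity) / len(ranked)
    if below >= 0.99:
        return "Q99"
    if below >= 0.50:
        return "Q50"
    return "below Q50"
```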

    4.2 Homologous sequence/gene

    We used the following criteria to identify whether a lncRNA has homologous sequence/gene in an animal:

  • Homologous sequence: alignment length ≥50 nt or ≥20% transcript length;
  • Homologous gene: homologous transcript alignment length ≥50 nt or ≥20% transcript length.
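
    As a small worked check of the length criterion, assuming lengths are given in nucleotides and that either condition alone is sufficient (the function name is hypothetical):

```python
def meets_homology_criterion(alignment_length, transcript_length):
    """True if the alignment is >= 50 nt or covers >= 20% of the transcript."""
    return alignment_length >= 50 or alignment_length >= 0.2 * transcript_length
```

    For example, a 30 nt alignment qualifies for a 100 nt transcript (30% of its length) but not for a 1,000 nt transcript (3% of its length and shorter than 50 nt).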

    4.3 LncRNA age

    The age of a lncRNA gene represents the earliest occurrence time of the gene sequence. There are 17 time nodes, including "Homo" (human specific), "Hominini", "Homininae", "Hominidae", "Hominoidea", "Catarrhini", "Simiiformes", "Haplorrhini", "Primates", "Euarchontoglires", "Boreoeutheria", "Eutheria", "Theria", "Mammalia", "Amniota", "Tetrapoda" and "Euteleostomi".

    Note that age determination is based on the parsimony rule. It therefore cannot rule out that a gene is conserved in species that we have not included, in which case its true age would be older than our estimate.
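
    The parsimony rule can be sketched as follows: each species is mapped to the time node of its last common ancestor with human, and the age of a gene is the oldest node among the species in which a homologous sequence is found. The species-to-node table below is an illustrative subset chosen for this example, not the list of 40 animals used by LncBook, and the function name is hypothetical.

```python
# The 17 time nodes, ordered from youngest to oldest.
AGE_NODES = [
    "Homo", "Hominini", "Homininae", "Hominidae", "Hominoidea", "Catarrhini",
    "Simiiformes", "Haplorrhini", "Primates", "Euarchontoglires",
    "Boreoeutheria", "Eutheria", "Theria", "Mammalia", "Amniota",
    "Tetrapoda", "Euteleostomi",
]

# Illustrative mapping: species -> node of its last common ancestor with human.
SPECIES_NODE = {
    "chimpanzee": "Hominini",
    "gorilla": "Homininae",
    "orangutan": "Hominidae",
    "rhesus macaque": "Catarrhini",
    "mouse": "Euarchontoglires",
    "dog": "Boreoeutheria",
    "opossum": "Theria",
    "chicken": "Amniota",
    "zebrafish": "Euteleostomi",
}


def lncrna_age(species_with_homolog):
    """Return the oldest time node among species carrying a homologous sequence.

    With no homolog in any species, the gene is treated as human specific.
    """
    oldest = 0  # index of "Homo"
    for species in species_with_homolog:
        oldest = max(oldest, AGE_NODES.index(SPECIES_NODE[species]))
    return AGE_NODES[oldest]
```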

    5. Variation annotation

    LncBook collected variants from COSMIC, ClinVar and GWAS Catalog, identified disease/trait-associated variants, annotated corresponding disease/trait information and mapped them to the lncRNA loci.

    5.1 Variants collection and curation

    LncBook curated high-quality variants from COSMIC, ClinVar and GWAS Catalog.

  • COSMIC: Only variants tagged as "Confirmed somatic mutation" in the COSMIC mutation data were retained.
  • ClinVar: Variants containing ambiguous functional terms such as "Likely benign" or "Uncertain significance" were removed.
  • GWAS Catalog: Only significant variants (p-value < 5 × 10^-8) were retained.
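
    The three curation rules can be sketched as a single filter. The record layout (a dict with "source", "status", "significance" and "p_value" keys) is a hypothetical one chosen for this example, and only the two ambiguous ClinVar terms named above are encoded; the full list of excluded terms is not given in the text.

```python
AMBIGUOUS_CLINVAR_TERMS = ("likely benign", "uncertain significance")
GWAS_P_THRESHOLD = 5e-8


def keep_variant(variant):
    """Return True if a variant record passes the curation rule of its source."""
    source = variant["source"]
    if source == "COSMIC":
        return variant["status"] == "Confirmed somatic mutation"
    if source == "ClinVar":
        significance = variant["significance"].lower()
        return not any(term in significance for term in AMBIGUOUS_CLINVAR_TERMS)
    if source == "GWAS Catalog":
        return variant["p_value"] < GWAS_P_THRESHOLD
    raise ValueError(f"unknown source: {source}")
```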

    5.2 Disease and trait-associated variants

    Disease-associated variants were derived from COSMIC and ClinVar and trait-associated variants were derived from GWAS Catalog. For COSMIC, we defined variants with a FATHMM-MKL score > 0.7 as disease-associated (pathogenic) variants. Meanwhile, variants in ClinVar tagged as "Pathogenic", "Affects" or "Risk factor" were considered as disease-associated variants.

    To unify disease names and traits, we mapped ClinVar disease names and GWAS traits to Human Phenotype Ontology and Experimental Factor Ontology respectively.

    5.3 Variants allocation

    Variants were allocated to lncRNAs using the BEDTools intersect function.
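
    Conceptually, the allocation step reports every variant/lncRNA pair whose intervals overlap on the same chromosome. The sketch below is a plain Python stand-in for that idea and does not call BEDTools; it assumes BED-style half-open coordinates and uses a naive all-against-all comparison that is only suitable for small inputs. The tuple layout and function name are hypothetical.

```python
def allocate_variants(variants, lncrnas):
    """Pair each variant with every lncRNA locus it overlaps.

    variants, lncrnas: lists of (chrom, start, end, identifier) tuples
    in half-open coordinates.
    Returns a list of (variant_id, lncrna_id) pairs.
    """
    hits = []
    for v_chrom, v_start, v_end, v_id in variants:
        for l_chrom, l_start, l_end, l_id in lncrnas:
            # Two half-open intervals overlap if each starts before the other ends.
            if v_chrom == l_chrom and v_start < l_end and l_start < v_end:
                hits.append((v_id, l_id))
    return hits
```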

    6. Methylation profile

    6.1 DNA methylation data collection

    LncBook collected 16 publicly accessible bisulfite-seq datasets from TCGA and GEO, covering 16 diseases (14 cancers and 2 neurodevelopmental disorders) with both case and control samples. Disease-associated lncRNA genes were identified based on DNA methylation level differentiation. The following table details the datasets used.

    Biological Context          | Source | Project ID | Disease Name (Short Name)                   | Sample Number
    ----------------------------+--------+------------+---------------------------------------------+--------------------------
    Neurodevelopmental Disorder | GEO    | GSE119980  | Rett syndrome (RTT)                         | 12 (6 cases, 6 controls)
    Neurodevelopmental Disorder | GEO    | GSE109875  | Autism spectrum disorders                   | 16 (10 cases, 6 controls)
    Cancer                      | GEO    | GSE116229  | Acute Lymphoblastic Leukemia (ALL)          | 38 (31 cases, 7 controls)
    Cancer                      | GEO    | GSE135869  | Acute Myeloid Leukemia (AML)                | 15 (9 cases, 6 controls)
    Cancer                      | GEO    | GSE113336  | Chronic Lymphocytic Leukemia (CLL)          | 18 (11 cases, 7 controls)
    Cancer                      | GEO    | GSE149608  | Esophageal Squamous Cell Carcinoma (ESCC)   | 19 (10 cases, 9 controls)
    Cancer                      | GEO    | GSE142241  | Medulloblastoma (MB)                        | 12 (8 cases, 4 controls)
    Cancer                      | GEO    | GSE79799   | Liver cancer                                | 6 (3 cases, 3 controls)
    Cancer                      | TCGA   | TCGA-BLCA  | Bladder Urothelial Carcinoma (BLCA)         | 7 (6 cases, 1 control)
    Cancer                      | TCGA   | TCGA-BRCA  | Breast Invasive Carcinoma (BRCA)            | 6 (5 cases, 1 control)
    Cancer                      | TCGA   | TCGA-COAD  | Colon Adenocarcinoma (COAD)                 | 3 (2 cases, 1 control)
    Cancer                      | TCGA   | TCGA-LUAD  | Lung Adenocarcinoma (LUAD)                  | 6 (5 cases, 1 control)
    Cancer                      | TCGA   | TCGA-LUSC  | Lung Squamous Cell Carcinoma (LUSC)         | 5 (4 cases, 1 control)
    Cancer                      | TCGA   | TCGA-READ  | Rectum Adenocarcinoma (READ)                | 3 (2 cases, 1 control)
    Cancer                      | TCGA   | TCGA-STAD  | Stomach Adenocarcinoma (STAD)               | 5 (4 cases, 1 control)
    Cancer                      | TCGA   | TCGA-UCEC  | Uterine Corpus Endometrial Carcinoma (UCEC) | 6 (5 cases, 1 control)

    6.2 Identification of disease-associated lncRNA genes based on DNA methylation

    Accumulating evidence has shown that lncRNAs are closely associated with a variety of diseases, including cancers and brain diseases. LncBook annotated the DNA methylation profiles of lncRNA genes on both promoter and body regions with case and control samples across the 16 diseases. Here, we defined the regions from −1500 bp relative to the transcription start site as the promoter region and calculated the average methylation level of all CpG sites in the promoter or body region.

    DNA methylation levels on both promoter and body regions were compared between case and control samples. Because some datasets have small sample sizes, different criteria were used for different datasets.

  • GSE116229 (ALL), GSE135869 (AML), GSE113336 (CLL) and GSE149608 (ESCC): the maximum value across samples is higher than 0.2, the fold change of median values is >= 2 or <= 1/2, and the adjusted p-value is <= 0.05 (Wilcoxon test);
  • GSE142241 (MB) and GSE119980 (RTT): the maximum value across samples is higher than 0.2 and the p-value is <= 0.02 (Wilcoxon test);
  • TCGA datasets (BLCA, BRCA, COAD, LUAD, LUSC, READ, STAD and UCEC): the methylation level increases/decreases in all case samples relative to the control sample, the maximum value across samples is higher than 0.2, and the methylation level of the control sample is two-fold higher/lower than the maximum/minimum value of the case samples;
  • GSE79799 (liver cancer): the minimum methylation level of the case/control samples is higher than the maximum methylation level of the control/case samples, the maximum value across samples is higher than 0.2, and the fold change of median values is >= 4 or <= 1/4.
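
    As one worked instance of these rules, the TCGA criterion (several case samples, a single control) can be sketched as below. This is an interpretation rather than the LncBook code: "two-fold higher/lower" is read here as "at least 2x" and "at most 1/2x" respectively, methylation levels are assumed to be values between 0 and 1, and the function name and return labels are hypothetical.

```python
def tcga_methylation_change(case_levels, control_level):
    """Apply the TCGA-style rule to one lncRNA region.

    Returns "hypo" (cases lower than control), "hyper" (cases higher than
    control), or None if the region does not meet the criteria.
    """
    # The maximum value over all samples must exceed 0.2.
    if max(max(case_levels), control_level) <= 0.2:
        return None
    # Decrease in every case sample, with control >= 2x the highest case.
    if all(c < control_level for c in case_levels) and control_level >= 2 * max(case_levels):
        return "hypo"
    # Increase in every case sample, with control <= 1/2 of the lowest case.
    if all(c > control_level for c in case_levels) and control_level <= min(case_levels) / 2:
        return "hyper"
    return None
```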

    7. Expression profile

    LncBook collected lncRNAs’ expression capacities across 9 biological contexts, expression level distributions across 337 biological conditions, and featured genes from LncExpDB.

    All expressed genes, defined as those with a maximum expression value higher than 1.0 TPM, are ranked within each biological condition (time point/stage/tissue/cell/component/processing). Specifically, genes with expression values greater than the upper quantile are classified as "H" (high expression level), those less than the lower quantile as "L" (low expression level), and the remaining as "M" (medium expression level). High-capacity lncRNAs (HCL) are genes classified as "H" in at least one condition, low-capacity lncRNAs (LCL) are those classified as "L" in all conditions, and the remaining are medium-capacity lncRNAs (MCL). Note that as more biological conditions are covered, an LCL or MCL gene may be reclassified as MCL or HCL.
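
    The two-step classification can be sketched as follows. The quantile cutoffs are passed in as parameters because the text does not state which quantiles are used; the function names are hypothetical, and the sketch assumes at least one condition per gene.

```python
def expression_level(value, lower_cutoff, upper_cutoff):
    """Classify one expression value within a single biological condition."""
    if value > upper_cutoff:
        return "H"
    if value < lower_cutoff:
        return "L"
    return "M"


def expression_capacity(levels):
    """Summarise the per-condition levels of one gene into HCL, MCL or LCL."""
    if not levels:
        raise ValueError("at least one condition is required")
    if "H" in levels:
        return "HCL"
    if all(level == "L" for level in levels):
        return "LCL"
    return "MCL"
```

    Because a single "H" is enough for HCL, adding conditions can only keep or raise the capacity of a gene, which matches the note that LCL or MCL may change to MCL or HCL.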

    Featured genes are specifically expressed in a certain cell line/tissue, differentially expressed in the context of cancer or virus infection, enriched in a subcellular compartment, dynamically expressed during cell differentiation or embryo/organ development, or periodically expressed with circadian rhythm. See more details in LncExpDB.

    8. Small protein

    LncBook integrated small proteins identified by Ribo-seq and mass spectrometry data from the SmProt database. Small proteins were mapped to lncRNAs according to their genomic locations using the BEDTools intersect function. A small protein is associated with a lncRNA in our database only if all of its genomic location blocks fall entirely within the exons of the lncRNA transcript.
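
    The containment condition is stricter than a simple overlap: every block of the small protein must lie inside some exon. A minimal sketch, assuming half-open (start, end) coordinates on the same chromosome (the function name is hypothetical, and strand handling is not described in the text and is omitted):

```python
def blocks_within_exons(blocks, exons):
    """True if every genomic block of a small protein lies inside some exon."""
    return all(
        any(exon_start <= start and end <= exon_end for exon_start, exon_end in exons)
        for start, end in blocks
    )
```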

    9. LncRNA-miRNA interaction

    The lncRNA-miRNA interactions were predicted by miRanda, TargetScan and RNAhybrid, using miRNA and lncRNA sequences downloaded from miRBase and LncBook, respectively. Only high-confidence interactions predicted by all three tools (miRanda and TargetScan: default parameters; RNAhybrid: -b 1 -e 20 -f 8,12 -u 1 -v 1 -s 3utr_human) were collected in LncBook.
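
    The consensus step amounts to a set intersection. The sketch below assumes that each tool's output has already been parsed into a set of (lncRNA, miRNA) pairs; parsing the tool-specific output formats and comparing binding-site coordinates are outside its scope, and the function name and identifiers are hypothetical.

```python
def consensus_interactions(miranda_pairs, targetscan_pairs, rnahybrid_pairs):
    """Keep only the (lncRNA, miRNA) pairs predicted by all three tools."""
    return set(miranda_pairs) & set(targetscan_pairs) & set(rnahybrid_pairs)
```

    For instance, a pair reported by miRanda and TargetScan but missing from the RNAhybrid set is dropped.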

    10. Confidence level

    LncBook 2.0 assessed the confidence of lncRNA genes and classified all the lncRNAs into 4 levels considering expression capacity and the functional features inferred from expression and methylation profiling.

  • Level 1: high expression and featured
  • Level 2: high expression but not featured; medium expression and featured
  • Level 3: medium expression but not featured; low expression and featured
  • Level 4: low expression and not featured
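
    The four levels follow a regular pattern: each step down in expression capacity, and the absence of a functional feature, each lower the confidence by one level. A compact sketch of that mapping, using the HCL/MCL/LCL labels from the expression section for high/medium/low expression (the function name is hypothetical):

```python
BASE_LEVEL = {"HCL": 1, "MCL": 2, "LCL": 3}


def confidence_level(capacity, featured):
    """Map expression capacity and featured status to a confidence level (1-4)."""
    return BASE_LEVEL[capacity] + (0 if featured else 1)
```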

    11. Database usage

    11.1 Browse genes

    The Genes section provides confidence levels, various molecular signatures, disease/trait associations of lncRNA genes, as well as the basic information including gene symbol, length, genomic location, etc.

    You can search for high-confidence lncRNA genes, or obtain all disease-associated lncRNAs or highly expressed genes.

    11.2 Browse conservation

    The Conservation section provides conservation features of human lncRNA genes across 40 animals, as well as alignment details in each species. You can trace the original gene/sequence of a certain human lncRNA gene, obtain the homologous protein-coding/ncRNA genes in different species, and explore the most conserved or human-specific lncRNA genes.

    11.3 Browse variations

    On the "Variation" page, you can browse the genomic location, associated lncRNA, functional effect, clinical significance, and associated disease and trait information of noncoding variants. You can search for entries of interest by gene ID, dbSNP ID, functional effect, disease name or trait, and download these entries by clicking the download button.

    11.4 Browse DNA methylation profile

    You can access the DNA methylation information on the "Methylation" page, which includes the lncRNA genes that exhibit differential DNA methylation on promoter regions relative to normal samples in at least one disease. The "Methylation filter" tab allows you to narrow down the results by lncRNA, disease, or hyper/hypo methylation status. Additionally, for each lncRNA, the methylation profiles of promoter and body regions in case and control samples across all 16 diseases are shown on the gene page and can be downloaded there.

    11.5 Browse expression capacities

    On the "Expression Capacity" page, you can browse the expression capacities of lncRNAs in various biological contexts. You can explore high-capacity lncRNAs in one or multiple contexts using the categories in the "Expression Filter". All entries can be downloaded by clicking the download button. Furthermore, the "Chart" enables visualization of the expression level distribution across all the biological conditions. Clicking a gene ID will direct you to the LncExpDB page, where you can view the expression profiles across different biological contexts.

    11.6 Browse small proteins

    On the "Small Protein" page, you can browse basic information on small proteins, including small protein ID, genomic loci, amino acid sequence and experimental evidence. You can search for entries of interest by gene ID, small protein ID or experimental evidence, and download these entries by clicking the download button.

    11.7 Browse lncRNA-miRNA interactions

    You can view all the lncRNA-miRNA interactions on the "Interaction" page, which lists the binding start, binding end, sequence complementarity score and minimum free energy of each RNA duplex. The "Interaction filter" tab allows you to narrow down the results by lncRNA or miRNA. In addition, all interactions involving a given lncRNA are shown on its gene page and can be downloaded there.

    11.8 Cite LncBook

    LncBook 2.0: human long non-coding RNA integration and multi-omics annotation (In preparation)

    LncBook: a curated knowledgebase of human long non-coding RNAs. Nucleic Acids Res 2019. [PMID=30329098]