#updated in SEPTEMBER 2024#

The LncBook database is a comprehensive repository for long non-coding RNA (lncRNA) transcripts. In LncBook v2.1, the database has been enriched by integrating data from two new lncRNA resources (RNAAtlas and TAiC) and updated versions of five resources (GENCODE, CHESS, NONCODE, LNCipedia, and the Robust version of FANTOM-CAT), validated and supplemented by 94 sets of PacBio long-read sequencing data from public databases. Combined with an optimized annotation pipeline, this approach resulted in the curation of 526,318 lncRNA transcripts.

By comparing the structures of lncRNA transcripts identified from long-read RNA-seq data with those of predicted lncRNA transcripts, LncBook identified 69,517 high-confidence full-length transcripts whose exon structures are consistent with predicted lncRNA transcripts, and 4,496 identified transcripts that corrected the exon boundaries of predicted transcripts. Additionally, LncBook recognized 74,340 novel lncRNA transcripts supported by long-read data. LncBook has further refined its transcript classification system by incorporating specific labels, including FSM, ISM, NIC, NNC, ESM, etc., which makes transcript feature analysis clearer and more detailed.

All the results and data are downloadable. Detailed information on each file is outlined below.

####### 1. lncRNA_LncBookv2.1_GRCh38.gtf.gz ################

This annotation file contains all lncRNA transcript annotations curated by LncBook.
Each line of the GTF file contains tab-separated columns; the last column holds semicolon-separated entries:

#reference genome: hg38/GRCh38

column 1: chromosome name {reference chromosomes, scaffolds, assembly patches, alternate loci}
column 2: sources {database sources or assembled}
column 3: feature type {transcript, exon}
column 4: genomic start location
column 5: genomic end location
column 6: score
column 7: genomic strand {-,.,+} (. means unknown)
column 8: genomic phase (not used)
column 9: additional information (see below)

* gene_id: The unique accession number assigned to the gene by LncBook.
* transcript_id: The unique accession number assigned to the transcript by LncBook.
* transcript_alias: Transcript IDs from other sources (e.g., GENCODE, NONCODE, LNCipedia, MiTranscriptome beta, RefLnc, CHESS, FANTOM-CAT, BIGTranscriptome, RNAAtlas, TAiC).
* type: LncRNA classification {Validated, Modified, FSM, ISM, NIC, NNC, ESM, Fusion, GenicGenomic, GenicIntron, Antisense, Intergenic, Unmodified, Predicted}

    The LncBook database categorizes lncRNA transcripts into two primary groups: "Predicted Transcripts", derived from database integration, and "Identified Transcripts", assembled from long-read sequencing data. These transcripts are further classified by their structural relationships. The classification criteria are:

    Validated: The identified transcript matches the predicted transcript exons.
    Corrected: The identified transcript matches the predicted transcript splice junctions (SJs).
    ISM (Incomplete Splice Match): The identified transcript SJs partially and consecutively match the predicted transcript, or the single-exon identified transcript overlaps with the multi-exon predicted transcript.
    NIC (Novel In Catalog): A new transcript formed by a new combination of known splice sites.
    NNC (Novel Not in Catalog): A new transcript with at least one novel splice site.
    ESM (Extended Splice Match): The identified transcript SJs match and extend the predicted transcript splice junctions.
    Fusion: The identified transcript spans multiple genes.
    GenicGenomic: The identified transcript overlaps with both introns and exons of the predicted transcript.
    GenicIntron: The identified transcript is completely contained within the intron of the predicted transcript.
    Antisense: The identified transcript does not overlap with the predicted transcript on the same strand but overlaps on the antisense strand.
    Intergenic: The identified transcript is located in the intergenic region with no overlap.
    Uncorrected: The predicted transcript before correction by identified transcripts.
    Predicted: The predicted transcript awaiting validation.

* level: Transcript quality grading {level1, level2, level3}

    Level 1: High-confidence lncRNAs validated or corrected by long-read RNA-seq data-identified transcripts.
    Level 2: High-quality lncRNAs obtained by quality filtering of predicted or identified transcripts.
    Level 3: Comprehensive lncRNAs, including uncorrected predicted transcripts.
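
The column layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official LncBook tool: it assumes column 9 uses the common GTF form key "value"; (quotes optional), and the helper names and example records (GENE_A, TX_A1, etc.) are made up for illustration. It tallies transcript records by their type and level attributes.

```python
import gzip
from collections import Counter


def parse_attributes(field):
    """Split column 9 into a dict. Assumes entries look like: key "value";"""
    attrs = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def iter_transcripts(lines):
    """Yield one dict per 'transcript' feature; exon lines and comments are skipped."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "transcript":
            continue
        record = {
            "chrom": cols[0],
            "source": cols[1],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
        }
        record.update(parse_attributes(cols[8]))
        yield record


def count_by_type_and_level(lines):
    """Count transcripts per (type, level) pair."""
    counts = Counter()
    for transcript in iter_transcripts(lines):
        counts[(transcript.get("type"), transcript.get("level"))] += 1
    return counts


def count_file(path):
    """Same tally, read directly from a gzipped GTF file."""
    with gzip.open(path, "rt") as handle:
        return count_by_type_and_level(handle)


# Made-up records for illustration only; real IDs and values will differ.
EXAMPLE_LINES = [
    'chr1\tassembled\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "GENE_A"; transcript_id "TX_A1"; type "Validated"; level "level1";\n',
    'chr1\tassembled\texon\t100\t400\t.\t+\t.\t'
    'gene_id "GENE_A"; transcript_id "TX_A1"; type "Validated"; level "level1";\n',
    'chr2\tNONCODE\ttranscript\t5000\t7000\t.\t-\t.\t'
    'gene_id "GENE_B"; transcript_id "TX_B1"; type "Predicted"; level "level2";\n',
]

if __name__ == "__main__":
    for (kind, level), n in sorted(count_by_type_and_level(EXAMPLE_LINES).items()):
        print(kind, level, n)
```

To run the same tally on the downloaded file, call count_file("lncRNA_LncBookv2.1_GRCh38.gtf.gz"). If the attribute syntax in the file differs from the assumed form, only parse_attributes needs adjusting.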