a catalog of biological databases
|Full name:||Ensembl Genome Annotation System|
|Description:||An Ensembl-style portal for the genomes of non-vertebrate species.|
|University/Institution:||European Bioinformatics Institute|
|Address:||The Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK|
|Contact name (PI/Team):||Paul Julian Kersey|
|Contact email (PI/Helpdesk):||email@example.com|
Ensembl Genomes 2020-enabling non-vertebrate genomic research. [PMID: 31598706]
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of interfaces to genomic data across the tree of life, including reference genome sequence, gene models, transcriptional data, genetic variation and comparative analysis. Data may be accessed via our website, online tools platform and programmatic interfaces, with updates made four times per year (in synchrony with Ensembl). Here, we provide an overview of Ensembl Genomes, with a focus on recent developments. These include the continued growth, more robust and reproducible sets of orthologues and paralogues, and enriched views of gene expression and gene function in plants. Finally, we report on our continued deeper integration with the Ensembl project, which forms a key part of our future strategy for dealing with the increasing quantity of available genome-scale data across the tree of life.
Ensembl Genomes 2018: an integrated omics infrastructure for non-vertebrate species. [PMID: 29092050]
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of programmatic and interactive interfaces to a rich range of data including genome sequence, gene models, transcript sequence, genetic variation, and comparative analysis. This paper provides an update to the previous publications about the resource, with a focus on recent developments and expansions. These include the incorporation of almost 20 000 additional genome sequences and over 35 000 tracks of RNA-Seq data, which have been aligned to genomic sequence and made available for visualization. Other advances since 2015 include the release of the database in Resource Description Framework (RDF) format, a large increase in community-derived curation, a new high-performance protein sequence search, additional cross-references, improved annotation of non-protein-coding genes, and the launch of pre-release and archival sites. Collectively, these changes are part of a continuing response to the increasing quantity of publicly-available genome-scale data, and the consequent need to archive, integrate, annotate and disseminate these using automated, scalable methods.
Ensembl Genomes 2016: more genomes, more complexity. [PMID: 26578574]
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of programmatic and interactive interfaces to a rich range of data including reference sequence, gene models, transcriptional data, genetic variation and comparative analysis. This paper provides an update to the previous publications about the resource, with a focus on recent developments. These include the development of new analyses and views to represent polyploid genomes (of which bread wheat is the primary exemplar); and the continued up-scaling of the resource, which now includes over 23 000 bacterial genomes, 400 fungal genomes and 100 protist genomes, in addition to 55 genomes from invertebrate metazoa and 39 genomes from plants. This dramatic increase in the number of included genomes is one part of a broader effort to automate the integration of archival data (genome sequence, but also associated RNA sequence data and variant calls) within the context of reference genomes and make it available through the Ensembl user interfaces. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Ensembl Genomes 2013: scaling up access to genome-wide data. [PMID: 24163254]
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species. The project exploits and extends technologies for genome annotation, analysis and dissemination, developed in the context of the vertebrate-focused Ensembl project, and provides a complementary set of resources for non-vertebrate species through a consistent set of programmatic and interactive interfaces. These provide access to data including reference sequence, gene models, transcriptional data, polymorphisms and comparative analysis. This article provides an update to the previous publications about the resource, with a focus on recent developments. These include the addition of important new genomes (and related data sets) including crop plants, vectors of human disease and eukaryotic pathogens. In addition, the resource has scaled up its representation of bacterial genomes, and now includes the genomes of over 9000 bacteria. Specific extensions to the web and programmatic interfaces have been developed to support users in navigating these large data sets. Looking forward, analytic tools to allow targeted selection of data for visualization and download are likely to become increasingly important in future as the number of available genomes increases within all domains of life, and some of the challenges faced in representing bacterial data are likely to become commonplace for eukaryotes in future.
Gramene 2013: comparative plant genomics resources. [PMID: 24217918]
Gramene (http://www.gramene.org) is a curated online resource for comparative functional genomics in crops and model plant species, currently hosting 27 fully and 10 partially sequenced reference genomes in its build number 38. Its strength derives from the application of a phylogenetic framework for genome comparison and the use of ontologies to integrate structural and functional annotation data. Whole-genome alignments complemented by phylogenetic gene family trees help infer syntenic and orthologous relationships. Genetic variation data, sequences and genome mappings available for 10 species, including Arabidopsis, rice and maize, help infer putative variant effects on genes and transcripts. The pathways section also hosts 10 species-specific metabolic pathways databases developed in-house or by our collaborators using Pathway Tools software, which facilitates searches for pathway, reaction and metabolite annotations, and allows analyses of user-defined expression datasets. Recently, we released a Plant Reactome portal featuring 133 curated rice pathways. This portal will be expanded for Arabidopsis, maize and other plant species. We continue to provide genetic and QTL maps and marker datasets developed by crop researchers. The project provides a unique community platform to support scientific research in plant genomics including studies in evolution, genetics, plant breeding, molecular biology, biochemistry and systems biology.
WormBase 2014: new views of curated biology. [PMID: 24194605]
WormBase (http://www.wormbase.org/) is a highly curated resource dedicated to supporting research using the model organism Caenorhabditis elegans. With an electronic history predating the World Wide Web, WormBase contains information ranging from the sequence and phenotype of individual alleles to genome-wide studies generated using next-generation sequencing technologies. In recent years, we have expanded the contents to include data on additional nematodes of agricultural and medical significance, bringing the knowledge of C. elegans to bear on these systems and providing support for underserved research communities. Manual curation of the primary literature remains a central focus of the WormBase project, providing users with reliable, up-to-date and highly cross-linked information. In this update, we describe efforts to organize the original atomized and highly contextualized curated data into integrated syntheses of discrete biological topics. Next, we discuss our experiences coping with the vast increase in available genome sequences made possible through next-generation sequencing platforms. Finally, we describe some of the features and tools of the new WormBase Web site that help users better find and explore data of interest.
Studying Culicoides vectors of BTV in the post-genomic era: resources, bottlenecks to progress and future directions. [PMID: 24355835]
Culicoides biting midges (Diptera: Ceratopogonidae) are a major vector group responsible for the biological transmission of a wide variety of globally significant arboviruses, including bluetongue virus (BTV). In this review we examine current biological resources for the study of this genus, with an emphasis on detailing the history of extant colonies and cell lines derived from C. sonorensis, the major vector of BTV in the USA. We then discuss the rapidly developing area of genomic and transcriptomic analyses of biological processes in vectors and introduce the newly formed Culicoides Genomics and Transcriptomics Alliance. Preliminary results from these fields are detailed and finally likely areas of future research are discussed from an entomological perspective describing limitations in our understanding of Culicoides biology that may impede progress in these areas.
Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. [PMID: 23870653]
The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly.In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies.Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.
Ensembl Genomes: an integrative resource for genome-scale data from non-vertebrate species. [PMID: 22067447]
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrative resource for genome-scale data from non-vertebrate species. The project exploits and extends technology (for genome annotation, analysis and dissemination) developed in the context of the (vertebrate-focused) Ensembl project and provides a complementary set of resources for non-vertebrate species through a consistent set of programmatic and interactive interfaces. These provide access to data including reference sequence, gene models, transcriptional data, polymorphisms and comparative analysis. Since its launch in 2009, Ensembl Genomes has undergone rapid expansion, with the goal of providing coverage of all major experimental organisms, and additionally including taxonomic reference points to provide the evolutionary context in which genes can be understood. Against the backdrop of a continuing increase in genome sequencing activities in all parts of the tree of life, we seek to work, wherever possible, with the communities actively generating and using data, and are participants in a growing range of collaborations involved in the annotation and analysis of genomes.
Analysis of the bread wheat genome using whole-genome shotgun sequencing. [PMID: 23192148]
Bread wheat (Triticum aestivum) is a globally important crop, accounting for 20 per cent of the calories consumed by humans. Major efforts are underway worldwide to increase wheat production by extending genetic diversity and analysing key traits, and genomic resources can accelerate progress. But so far the very large size and polyploid complexity of the bread wheat genome have been substantial barriers to genome analysis. Here we report the sequencing of its large, 17-gigabase-pair, hexaploid genome using 454 pyrosequencing, and comparison of this with the sequences of diploid ancestral and progenitor genomes. We identified between 94,000 and 96,000 genes, and assigned two-thirds to the three component genomes (A, B and D) of hexaploid wheat. High-resolution synteny maps identified many small disruptions to conserved gene order. We show that the hexaploid genome is highly dynamic, with significant loss of gene family members on polyploidization and domestication, and an abundance of gene fragments. Several classes of genes involved in energy harvesting, metabolism and growth are among expanded gene families that could be associated with crop productivity. Our analyses, coupled with the identification of extensive genetic variation, provide a resource for accelerating gene discovery and improving this major crop.
Taking the next step: building an Arabidopsis information portal. [PMID: 22751211]
The Arabidopsis information portal (AIP), a resource expected to provide access to all community data and combine outputs into a single user-friendly interface, has emerged from community discussions over the last 23 months. These discussions began during two closely linked workshops in early 2010 that established the International Arabidopsis Informatics Consortium (IAIC). The design of the AIP will provide core functionality while remaining flexible to encourage multiple contributors and constant innovation. An IAIC-hosted Design Workshop in December 2011 proposed a structure for the AIP to provide a framework for the minimal components of a functional community portal while retaining flexibility to rapidly extend the resource to other species. We now invite broader participation in the AIP development process so that the resource can be implemented in a timely manner.
WormBase: Annotating many nematode genomes. [PMID: 24058818]
WormBase (www.wormbase.org) has been serving the scientific community for over 11 years as the central repository for genomic and genetic information for the soil nematode Caenorhabditis elegans. The resource has evolved from its beginnings as a database housing the genomic sequence and genetic and physical maps of a single species, and now represents the breadth and diversity of nematode research, currently serving genome sequence and annotation for around 20 nematodes. In this article, we focus on WormBase's role of genome sequence annotation, describing how we annotate and integrate data from a growing collection of nematode species and strains. We also review our approaches to sequence curation, and discuss the impact on annotation quality of large functional genomics projects such as modENCODE.
WormBase 2012: more genomes, more data, new website. [PMID: 22067452]
Since its release in 2000, WormBase (http://www.wormbase.org) has grown from a small resource focusing on a single species and serving a dedicated research community, to one now spanning 15 species essential to the broader biomedical and agricultural research fields. To enhance the rate of curation, we have automated the identification of key data in the scientific literature and use similar methodology for data extraction. To ease access to the data, we are collaborating with journals to link entities in research publications to their report pages at WormBase. To facilitate discovery, we have added new views of the data, integrated large-scale datasets and expanded descriptions of models for human disease. Finally, we have introduced a dramatic overhaul of the WormBase website for public beta testing. Designed to balance complexity and usability, the new site is species-agnostic, highly customizable, and interactive. Casual users and developers alike will be able to leverage the public RESTful application programming interface (API) to generate custom data mining solutions and extensions to the site. We report on the growth of our database and on our work in keeping pace with the growing demand for data, efforts to anticipate the requirements of users and new collaborations with the larger science community.
PomBase: a comprehensive online resource for fission yeast. [PMID: 22039153]
PomBase (www.pombase.org) is a new model organism database established to provide access to comprehensive, accurate, and up-to-date molecular data and biological information for the fission yeast Schizosaccharomyces pombe to effectively support both exploratory and hypothesis-driven research. PomBase encompasses annotation of genomic sequence and features, comprehensive manual literature curation and genome-wide data sets, and supports sophisticated user-defined queries. The implementation of PomBase integrates a Chado relational database that houses manually curated data with Ensembl software that supports sequence-based annotation and web access. PomBase will provide user-friendly tools to promote curation by experts within the fission yeast community. This will make a key contribution to shaping its content and ensuring its comprehensiveness and long-term relevance.
Multiple reference genomes and transcriptomes for Arabidopsis thaliana. [PMID: 21874022]
Genetic differences between Arabidopsis thaliana accessions underlie the plant's extensive phenotypic variation, and until now these have been interpreted largely in the context of the annotated reference accession Col-0. Here we report the sequencing, assembly and annotation of the genomes of 18 natural A. thaliana accessions, and their transcriptomes. When assessed on the basis of the reference annotation, one-third of protein-coding genes are predicted to be disrupted in at least one accession. However, re-annotation of each genome revealed that alternative gene models often restore coding potential. Gene expression in seedlings differed for nearly half of expressed genes and was frequently associated with cis variants within 5 kilobases, as were intron retention alternative splicing events. Sequence and expression variation is most pronounced in genes that respond to the biotic environment. Our data further promote evolutionary and functional studies in A. thaliana, especially the MAGIC genetic reference population descended from these accessions.
Ensembl BioMarts: a hub for data retrieval across taxonomic space. [PMID: 21785142]
For a number of years the BioMart data warehousing system has proven to be a valuable resource for scientists seeking a fast and versatile means of accessing the growing volume of genomic data provided by the Ensembl project. The launch of the Ensembl Genomes project in 2009 complemented the Ensembl project by utilizing the same visualization, interactive and programming tools to provide users with a means for accessing genome data from a further five domains: protists, bacteria, metazoa, plants and fungi. The Ensembl and Ensembl Genomes BioMarts provide a point of access to the high-quality gene annotation, variation data, functional and regulatory annotation and evolutionary relationships from genomes spanning the taxonomic space. This article aims to give a comprehensive overview of the Ensembl and Ensembl Genomes BioMarts as well as some useful examples and a description of current data content and future objectives. Database URLs: http://www.ensembl.org/biomart/martview/; http://metazoa.ensembl.org/biomart/martview/; http://plants.ensembl.org/biomart/martview/; http://protists.ensembl.org/biomart/martview/; http://fungi.ensembl.org/biomart/martview/; http://bacteria.ensembl.org/biomart/martview/.
Gramene database in 2010: updates and extensions. [PMID: 21076153]
Now in its 10th year, the Gramene database (http://www.gramene.org) has grown from its primary focus on rice, the first fully-sequenced grass genome, to become a resource for major model and crop plants including Arabidopsis, Brachypodium, maize, sorghum, poplar and grape in addition to several species of rice. Gramene began with the addition of an Ensembl genome browser and has expanded in the last decade to become a robust resource for plant genomics hosting a wide array of data sets including quantitative trait loci (QTL), metabolic pathways, genetic diversity, genes, proteins, germplasm, literature, ontologies and a fully-structured markers and sequences database integrated with genome browsers and maps from various published studies (genetic, physical, bin, etc.). In addition, Gramene now hosts a variety of web services including a Distributed Annotation Server (DAS), BLAST and a public MySQL database. Twice a year, Gramene releases a major build of the database and makes interim releases to correct errors or to make important updates to software and/or data.
Ensembl Genomes: extending Ensembl across the taxonomic space. [PMID: 19884133]
Ensembl Genomes (http://www.ensemblgenomes.org) is a new portal offering integrated access to genome-scale data from non-vertebrate species of scientific interest, developed using the Ensembl genome annotation and visualisation platform. Ensembl Genomes consists of five sub-portals (for bacteria, protists, fungi, plants and invertebrate metazoa) designed to complement the availability of vertebrate genomes in Ensembl. Many of the databases supporting the portal have been built in close collaboration with the scientific community, which we consider as essential for maintaining the accuracy and usefulness of the resource. A common set of user interfaces (which include a graphical genome browser, FTP, BLAST search, a query optimised data warehouse, programmatic access, and a Perl API) is provided for all domains. Data types incorporated include annotation of (protein and non-protein coding) genes, cross references to external resources, and high throughput experimental data (e.g. data from large scale studies of gene expression and polymorphism visualised in their genomic context). Additionally, extensive comparative analysis has been performed, both within defined clades and across the wider taxonomy, and sequence alignments and gene trees resulting from this can be accessed through the site.
An international bioinformatics infrastructure to underpin the Arabidopsis community. [PMID: 20807877]
The future bioinformatics needs of the Arabidopsis community as well as those of other scientific communities that depend on Arabidopsis resources were discussed at a pair of recent meetings held by the Multinational Arabidopsis Steering Committee and the North American Arabidopsis Steering Committee. There are extensive tools and resources for information storage, curation, and retrieval of Arabidopsis data that have been developed over recent years primarily through the activities of The Arabidopsis Information Resource, the Nottingham Arabidopsis Stock Centre, and the Arabidopsis Biological Resource Center, among others. However, the rapid expansion in many data types, the international basis of the Arabidopsis community, and changing priorities of the funding agencies all suggest the need for changes in the way informatics infrastructure is developed and maintained. We propose that there is a need for a single core resource that is integrated into a larger international consortium of investigators. We envision this to consist of a distributed system of data, tools, and resources, accessed via a single information portal and funded by a variety of sources, under shared international management of an International Arabidopsis Informatics Consortium (IAIC). This article outlines the proposal for the development, management, operations, and continued funding for the IAIC.