The Protist 10,000 Genomes (P10K) Project was launched in 2019 with the goal of decoding genome sequences and constructing a comprehensive database resource covering more than 10,000 species of protists, with representatives from every major clade. Samples were collected from diverse habitats, and genome information was acquired through de novo sequencing, genome reannotation, and integration of publicly available data. As the centralized data portal for the project, the P10K database focuses on high-quality curation and efficient retrieval of protist genome data.
The P10K database has integrated a total of 2,959 protist genomes and transcriptomes: 1,601 genomes and 1,358 transcriptomes. Of these, 1,101 datasets were sequenced by the P10K team, primarily from samples collected in China, and 1,858 publicly available datasets were gathered from various regions across the globe.
For the data sequenced by the P10K team, samples were collected from various regions of China, such as the Qinghai-Tibet Plateau area, the Qinghai Lake area, and the central and eastern regions, and from a variety of habitats, including lakes, rivers, wetlands, oceans, and hot springs. Bulk DNA/RNA sequencing was used for protists that were successfully cultured or enriched, while 1,078 single-cell isolates were sequenced because they were difficult to culture or enrich in the laboratory. A standardized pipeline comprising assembly, decontamination, and annotation was used to analyze the protist sequencing data.
For the public data, 1,193 genomes were downloaded from the NCBI (https://www.ncbi.nlm.nih.gov/assembly/) and 1 genome was downloaded from Ciliate.org (https://ciliates.org/), while 659 transcriptomes were generated by the MMETSP (Marine Microbial Eukaryote Transcriptome Sequencing Project, http://marinemicroeukaryotes.org/) and 5 transcriptomes of Euplotes were derived from the study of Gaydukova et al. The samples behind these public datasets were collected from various countries and regions worldwide. Of the 1,194 public genomes, both assembly sequences and gene annotations were obtained for 428, while only assembly sequences were obtained for the remaining 766. The 659 MMETSP transcriptomes, which were available only as assembled sequences, were re-annotated with our analysis pipeline, which includes decontamination and gene annotation steps.
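The dataset counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below only restates the figures from the text; the variable names are illustrative and are not part of any P10K tool.

```python
# Dataset counts exactly as stated in the text (variable names are illustrative).
p10k_team_datasets = 1_101
public_genomes = {"NCBI": 1_193, "Ciliate.org": 1}
public_transcriptomes = {"MMETSP": 659, "Gaydukova et al.": 5}

public_genome_total = sum(public_genomes.values())
public_total = public_genome_total + sum(public_transcriptomes.values())
grand_total = p10k_team_datasets + public_total

# Public genomes split by what could be obtained for them.
with_annotation, assembly_only = 428, 766

print(public_total)  # 1858
print(grand_total)   # 2959
print(with_annotation + assembly_only == public_genome_total)  # True
```

The sums agree with the totals reported for the database: 1,858 public datasets and 2,959 datasets overall.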
In addition to these sources, the meta information incorporates specific attributes unique to the
P10K project. These attributes are organized into six categories: Taxonomy, Sample, DNA & RNA
Extraction, Sequencing, Genome|Transcriptome Assembly, and Annotation. Together, they document the
project's data and methodologies so that researchers can use and analyze the data efficiently.
The Taxonomy item records Supergroup, Phylum, Class, Order, Family, Genus, and Species/Strain. According to previous studies, protists are divided into 10 supergroups: Alveolata, Stramenopiles, Rhizaria, Amoebozoa, Obazoa, Metamonada, Discoba, Archaeplastida, Picozoa, and Haptista.
The Sample item provides key index information for each sample. Samples owned by the P10K consortium carry 22 items of detailed information, including physiological characteristics, location, and habitat. Samples from public databases retain their original data records.
The DNA & RNA Extraction item records the nucleic acid extraction method and, in particular, whether the sample is single-cell or bulk.
The Sequencing item records detailed sequencing information, including the platforms used, the sequencing strategies employed, and the data sizes generated. It also documents whether the data have been published or are accessible as open data, and collects as much of the associated meta information as possible.
The Genome|Transcriptome Assembly item records the assembly methodology, along with statistics about the assembled data, including genome|transcriptome size, N50, and completeness.
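As an illustration of the N50 statistic, here is a minimal sketch that follows the definition used in the metadata table (the sequence length at which half of the assembly's cumulative length is contained in sequences of equal or greater length). The function name and the scaffold lengths are hypothetical; this is not the P10K pipeline's own implementation.

```python
def n50(lengths):
    """Return the N50 (bp) of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical scaffold lengths in bp (total 300, so the halfway point is 150).
scaffolds = [100, 80, 60, 40, 20]
print(n50(scaffolds))  # 80
```

Sorting from longest to shortest and stopping at the halfway point means the returned length is the shortest sequence needed to cover 50% of the assembly.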
The Annotation item provides detailed gene prediction information, including gene number,
gene length, and CDS length. Ciliates use various alternative genetic codes. To address
this, the P10K project calculated codon usage across whole-genome data. This
estimation of the codon table has substantial implications for improving the
precision of ciliate genome structure annotation. Moreover, because genome and chromosome
sizes vary widely among protists, the P10K project has combined these
attributes with the established BUSCO evaluation approach to formulate a genome quality
assessment index tailored specifically to protists. To assess genome integrity, the project
uses the Genome Quality (GQ) metric, which takes into account both genome completeness and
coding sequence (CDS) completeness.
$$Genome\ Quality\left( \% \right) = Genome\ completeness \left( \% \right) \times 0.5 +
\dfrac{Number\ of\ CDS\ with\ complete\ structure}{Total\ number\ of\ CDS}\left( \%
\right)\times 0.5 $$
According to the GQ value, genome integrity is divided into three levels: high
(GQ >= 80%), medium (50% <= GQ < 80%), and low (GQ < 50%).
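The GQ calculation and its three levels can be sketched as follows. The code applies the formula above and the rule from the metadata table that a CDS with complete structure starts with ATG and ends with TGA/TAA/TAG. The function names and the example sequences are hypothetical, and the BUSCO completeness value is taken as a given input rather than computed.

```python
STOP_CODONS = ("TGA", "TAA", "TAG")


def has_complete_structure(cds):
    """True if the CDS starts with ATG and ends with TGA, TAA, or TAG."""
    cds = cds.upper()
    return cds.startswith("ATG") and cds.endswith(STOP_CODONS)


def cds_completeness(cds_list):
    """Percentage of CDS with complete structure."""
    if not cds_list:
        raise ValueError("at least one CDS is required")
    complete = sum(has_complete_structure(cds) for cds in cds_list)
    return 100.0 * complete / len(cds_list)


def genome_quality(genome_completeness, cds_completeness_pct):
    """GQ (%) = genome completeness (%) x 0.5 + CDS completeness (%) x 0.5."""
    return 0.5 * genome_completeness + 0.5 * cds_completeness_pct


def gq_level(gq):
    """Map a GQ value (%) to the high / medium / low level."""
    if gq >= 80:
        return "high"
    if gq >= 50:
        return "medium"
    return "low"


# Hypothetical toy input: two of the four CDS have a complete structure.
toy_cds = ["ATGAAATAA", "ATGCCCTGA", "CCCAAATAG", "ATGAAACCC"]
gq = genome_quality(90.0, cds_completeness(toy_cds))  # 90.0 = assumed BUSCO completeness
print(gq, gq_level(gq))  # 70.0 medium
```

With a genome completeness of 90% and a CDS completeness of 50%, the two equally weighted terms give a GQ of 70%, which falls in the medium level.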
Item | Description |
---|---|
Sample Metadata | |
Sample ID | Example: P10K-MW-000001. P10K: project; MW: provider; 000001: six-digit number. |
Taxonomy | Phylum, Class, Order, Family, Genus, Species/Strain |
Project phase | There are 5 levels: 1. Sampled: sample-related meta-information has been submitted; 2. Sequenced: the genome/transcriptome has been sequenced; 3. Assembled: the genome/transcriptome has been assembled; 4. Gene model annotated: the genome/transcriptome has been annotated with gene models; 5. Gene function annotated: gene function has been annotated. |
Biosample | The accession number of the BioSample database, which is used to link data to BioSample that describes the research. It needs to be established in advance in the BioSample database. |
Bioproject | The accession number of the BioProject(s) to which the BioSample belongs, which is used to link data to the BioProject that describes the research. It needs to be established in advance in the BioProject database. |
Source | The database or project from which the sample originates. |
Latitude | Latitude geographic coordinates of the location where the sample was collected as decimal fractions. |
Longitude | Longitude geographic coordinates of the location where the sample was collected as decimal fractions. |
Elevation | The vertical distance of the sampling site above sea level. If the sampling site is below sea level, the elevation is the vertical distance below sea level, expressed as a negative number. |
Collection method | The experimental methods or techniques used for the collection or isolation of samples. |
Collection date | The date on which the sample was collected, in the format "YYYY-MM-DD". |
Storage method | The experimental methods or techniques used to store samples. |
Originating lab | Name of the institute that collected the sample, e.g., the name of the laboratory or company. |
Sample provider | Name of the person(s) who collected the sample, e.g., the name of the laboratory PI. |
Contact information (Email) | Email of the person, project or institute who collected the sample. |
Geographic location | The geographical origin of the sample; the nation, province/state, and city are separated by ";". For the nation, use the appropriate name from the "Country list" in the P10K MetaInformation Template Excel file or from this list: http://www.insdc.org/documents/country-qualifier-vocabulary. |
Biotic relationship | It refers to the interaction or relationship between living organisms in an ecosystem. |
Extreme environments | It refers to habitats or ecosystems that exhibit harsh and challenging conditions, often characterized by extreme temperatures, high or low pH levels, high salinity, low oxygen levels, high pressure, or other factors that make them inhospitable for most forms of life. |
Cell arrangement | It refers to the specific organization or pattern in which cells are structured or positioned in a tissue, organ, or organism. |
Cell shape | It refers to the geometric form or morphology of an individual cell. |
Colony color | It refers to the color of a microbial colony. |
Energy source | It refers to the substance or molecule from which an organism derives its energy to carry out vital cellular processes. |
Habitat | Natural environment of an organism or biosample; the place that is natural for the life and growth of an organism or a general description of the place where a biosample was collected from. |
Metabolism | It refers to the main mode of metabolism that occurs within an organism. It encompasses the processes by which living organisms obtain, utilize, and transform energy and nutrients to maintain their essential functions, grow, and reproduce. |
Oxygen | It refers to the oxygen requirement or mode of respiration of an organism. For example, aerobic organisms require oxygen to carry out their metabolic processes, whereas anaerobic organisms do not. |
Host name | The natural (as opposed to laboratory) host to the organism from which the sample was obtained. |
Parasitic site | The host tissue or organ from which the sample was obtained. |
Disease | Health or disease status of specific host at time of collection. |
Sequencing Metadata | |
Nucleic ID | Example: P10K-MW-000001-1. P10K: project; MW: provider; 000001: six-digit number; 1: data type. The Nucleic ID adds one digit to the Sample ID to indicate the data type, separated by "-". The data type is numbered 1-6: 1: Bulk DNA; 2: Bulk RNA; 3: Single cell DNA; 4: Single cell RNA; 5: Metagenomics/Metatranscriptomics; 6: Public database. |
Nucleic acid type | It refers to the specific type of nucleic acid that is being sequenced in a sequencing experiment. |
Sequencing ID | Example: P10K-MW-000001-12. P10K: project; MW: provider; 000001: six-digit number; 12: data type and sequencing experiment repeat order. The Sequencing ID adds one digit to the Nucleic ID to indicate the sequencing experiment repeat order. If a sample is sequenced twice with different methods, there are two Sequencing IDs whose last digits are 1 and 2. |
Raw data | The accession number of the raw data published in the database. |
Sequencing strategy | Sequencing technique intended for this library. |
Insert size (bp) | Fragment size for Paired reads. |
Read length (bp) | It refers to the length of a DNA or RNA sequence fragment generated by sequencing technology. |
Total base (Gb) | It refers to the total number of bases produced by a sequencing experiment. |
Reads GC (%) | It refers to the percentage of guanine (G) and cytosine (C) nucleotides present in the raw sequencing reads obtained from a sequencing experiment. |
Sequencing platform | The sequencing platforms and instrument models. |
Publication | The PMID of the published article associated with the data. |
Assembly Metadata | |
Assembly strategy | The assembly strategy and assembly software, separated by commas. |
Assembly ID | The accession number of the assembly data published in the database. |
Scaffold number/Transcript number | "Scaffold number" refers to the count or quantity of scaffolds present in a genome assembly. "Transcript number" refers to the count or quantity of transcripts present in a transcriptomic dataset. |
N50 (bp) | It represents the sequence length at which half of the entire assembly's cumulative length is contained in sequences of equal or greater length. |
Completeness (%) | Genome/transcriptome completeness, as calculated by BUSCO. |
Total size (Mb) | The total number of bases in a genome or transcriptome. |
Coverage (X) | Depth of coverage. It refers to the average number of times each base in the genome/transcriptome has been sequenced. |
Gene number | The number of genes annotated in a set of genetic data or in a particular organism's genome. Note that for transcriptome assemblies, this column refers to the number of ORFs (open reading frames). |
Average gene length (bp) | The average length of all genes annotated in a set of genetic data or in a particular organism's genome. |
CDS completeness (%) | Used to roughly assess the integrity of CDS. Calculation method: Number of CDS with complete structure / Total number of CDS × 100%. CDS with complete structure: CDS with both start and stop codons (starting with ATG and ending with TGA/TAA/TAG). |
Annotated level | Used to roughly evaluate genome quality. The annotated level is also defined as Genome Quality (GQ). The calculation method and level definitions of the GQ metric are given in the earlier section titled 'Annotation'. |
Codon table | The genetic code (codon table) applied to the genome or transcriptome, estimated from codon usage as described in the earlier section titled 'Annotation'. |
Average CDS length (bp) | The mean length of coding sequences (CDS) annotated in a set of genetic data or a specific organism. |
Average exons per gene | The average number of exons found in each gene within a dataset or a specific organism's genome. Exons are the coding regions of DNA or RNA that contain the information necessary for producing functional proteins. |
Average exon length (bp) | The mean length of exons annotated within a dataset or a specific organism's genome. |
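
The Sample ID, Nucleic ID, and Sequencing ID formats described in the table can be parsed with a small helper. This is a minimal sketch under two assumptions that the table does not state: the provider code consists of letters only (the table shows only "MW"), and the sequencing repeat order is a single digit. The names `parse_p10k_id` and `P10KId` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Data type codes as listed for the Nucleic ID.
DATA_TYPES = {
    1: "Bulk DNA",
    2: "Bulk RNA",
    3: "Single cell DNA",
    4: "Single cell RNA",
    5: "Metagenomics/Metatranscriptomics",
    6: "Public database",
}

# Assumption: provider is letters only; the repeat order is one digit.
ID_PATTERN = re.compile(r"^P10K-([A-Za-z]+)-(\d{6})(?:-([1-6])(\d)?)?$")


class P10KId(NamedTuple):
    provider: str
    number: str
    data_type: Optional[str]  # None for a plain Sample ID
    repeat: Optional[int]     # None unless this is a Sequencing ID


def parse_p10k_id(identifier):
    """Split a Sample ID, Nucleic ID, or Sequencing ID into its parts."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a recognized P10K identifier: {identifier!r}")
    provider, number, data_type, repeat = match.groups()
    return P10KId(
        provider,
        number,
        DATA_TYPES[int(data_type)] if data_type else None,
        int(repeat) if repeat else None,
    )


print(parse_p10k_id("P10K-MW-000001"))
print(parse_p10k_id("P10K-MW-000001-1").data_type)  # Bulk DNA
print(parse_p10k_id("P10K-MW-000001-12").repeat)    # 2
```

Because each identifier extends the previous one by a single digit, one pattern with optional trailing groups covers all three levels.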
The P10K website is organized into seven main sections, each serving a specific purpose. A brief overview of each section follows.
The home page lets users search by P10K ID, species name, habitat, biosample, or gene name: enter a keyword in the search field and click 'Search' to retrieve the desired information. Users can also reach specific data by clicking the prominent numbers on the page, which are hyperlinked to the corresponding data. In addition, the hypothetical phylogenetic tree includes images of the species present and displays the number of recorded species at both the order and phylum levels.
More detailed information can be found on the browse page, which allows users to explore our data at the sample, genome, and gene levels.
The Sample page presents 22 fields of sample meta-information in a tabular format, and each individual sample page provides comprehensive details. Users can browse the listed items, search by keyword, download metadata, choose which columns to display, and sort columns. The Sample ID, Bioproject ID, and Biosample ID are clickable within the table; clicking these identifiers gives access to more detailed sample information, genome statistics, and sequencing information for the individual sample.
The Genome page presents 10 fields of genomic information in a tabular format. Users can browse the listed items, search by keyword, and sort columns. The Sample ID leads to the individual sample page, the Gene Number links to the gene list for that sample, and the Assembly ID can be used to search external databases (NGDC, NCBI, iMicrobe, etc.).
The Gene page presents 11 fields of gene information in a tabular format, and each individual gene page provides detailed information. Users can customize filters and download gene lists for specific samples of interest. Clicking a gene ID opens the individual gene page, which features a gene summary, GO annotations, and orthologous and paralogous genes, and lets users download CDS and protein sequences and view the gene's location with the genome visualization function.
The genome visualization page allows users to expand the species taxonomy tree on the left side to view the samples associated with each taxon. Gene locations are displayed by the genome visualization function above the table. Users can also adjust the number of entries shown per page and enter a Sample ID in the "Search" box.
The BLAST tool has been tailored for sequence searches and offers advanced functions such as distance trees and a multiple sequence alignment (MSA) viewer. These features facilitate comprehensive analysis and comparison of the retrieved sequences.
Users can submit data to the P10K database through the channels on the home page. Metadata can be submitted step by step or by bulk upload, and the submission framework for genome sequences and raw read data is built on the NGDC.
Users can access genome/transcriptome, CDS, and protein sequences by clicking individual sample links in the sample or genome browser; the download table is available on the individual sample page. Raw sequencing data can be accessed through the NGDC (https://ngdc.cncb.ac.cn/) under BioProject PRJCA017400.
Protist 10,000 Genomes Project. The Innovation, 2020, 1(3). (PMID: 34557722)
The P10K Database: A Data Portal for the Protist 10,000 Genomes Project (In Preparation)
The P10K Project is an open initiative, and collaborators are welcome to join. Users who want to join the P10K Project or use its data can contact us.
Here is the contact information:
Corresponding Email: miaowei@ihb.ac.cn
Technological Email: p10k@ihb.ac.cn
Telephone: 86-27-68780050
Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, Hubei, China.