Genome data submission preparation

GWH accepts whole genome assembly data (including/excluding organelle(s), plasmid(s), etc.). Individual organelle/plasmid genome, viral genome, and gene fragment sequence data are submitted to the GenBase database of the NGDC. You can submit all genome assembly data associated with the same BioProject and publication simultaneously in one batch submission.

Submission process description (More details)

Prepare the following information:

  1. BioProject and BioSample accession number
  2. Batch Meta File
  3. Genome Sequence File
  4. Genome Annotation File (optional)
  5. Assignment Information File (required for certain condition)
  6. Uploading Files through FTP
  7. Metagenome data submission
  8. Haplotype data submission

1. BioProject and BioSample accession number

If you have not yet created the relevant BioProject or BioSample, first create the project in BioProject and the sample in BioSample.

Different omics data can be associated with the same BioProject and BioSample. If you already created them when submitting other omics data, such as raw reads in the Genome Sequence Archive (GSA), reuse them instead of creating new ones.

2. Batch Meta File

The "Green column" is required. The "Blue column" is required under certain conditions. The "Gray column" is optional and can be left blank if the information is not available. The "Yellow area" has a drop-down menu that lets you select from a controlled vocabulary. Each item is explained in its prompt instructions; please fill in the items according to those requirements.

Please note: (1) Do not delete or insert any column; (2) Be careful when using the Excel "autofill function", which can introduce unintended changes, e.g., to dates or numbers; (3) Check the file carefully before submitting to shorten the audit time of your batch submission; (4) Use an English input method.

"Browse→Upload→Validate" the completed meta file in step 4 of the submission page. If there are any errors, follow the hints to correct them until the file passes the online quality control (that is, 'Checked OK').

3. Genome Sequence File

The genome sequence file is required. Please use the FASTA format, in which each record starts with a definition line followed by the sequence bases on the following line(s).

The simplest definition line consists of the ">" symbol and a sequence_ID (Figure 1). All sequence files must be in plain text using ASCII characters only. Use IUPAC (International Union of Pure and Applied Chemistry) codes for your genome sequence. The accepted filename suffixes are fsa/fa/fasta/gz/bz2.

Please note that all genome sequences from the same genome assembly must be in exactly one genome sequence file. In a batch submission, genome sequences from different genome assemblies must be stored in separate genome sequence files.

Example:

>Seq1
CCTTTAT...
>Seq2 chromosome 1
GGTAGGT...

Figure 1. Example for genome sequence file
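
The FASTA rules above can be checked locally before uploading. The following is a minimal sketch, not the validation GWH itself performs: the function name `check_fasta` is hypothetical, accepting lower-case bases is an assumption, and the duplicate sequence_ID check is an added assumption that the text above does not state.

```python
# IUPAC nucleotide codes. Accepting lower-case input is an assumption.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def check_fasta(lines):
    """Return a list of problems found in FASTA text given as an iterable of lines."""
    problems = []
    seen_ids = set()
    seen_header = False
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if not line.isascii():
            problems.append(f"line {lineno}: non-ASCII character")
            continue
        if line.startswith(">"):
            seen_header = True
            fields = line[1:].split()
            # A definition line needs ">" directly followed by a sequence_ID.
            if not fields or fields[0].startswith(">"):
                problems.append(f"line {lineno}: malformed definition line")
                continue
            seq_id = fields[0]
            # Assumption: sequence_IDs should be unique within one file.
            if seq_id in seen_ids:
                problems.append(f"line {lineno}: duplicate sequence_ID {seq_id}")
            seen_ids.add(seq_id)
            continue
        if not seen_header:
            problems.append(f"line {lineno}: sequence data before any definition line")
        invalid = sorted(set(line.upper()) - IUPAC_NUCLEOTIDES)
        if invalid:
            problems.append(f"line {lineno}: non-IUPAC characters {''.join(invalid)}")
    return problems
```

For an uncompressed file, `check_fasta(open("genome.fasta", encoding="ascii", errors="replace"))` would return an empty list when no problem is found; the filename is a placeholder.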

4. Genome Annotation File (optional)

The genome annotation file is optional. It is recommended to submit the annotation file corresponding to the sequence file in GFF/TBL format.

A GFF3 file is a plain text file with 9 tab-separated columns (Figure 2). For an explanation of annotation in GFF3 format, see https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/

Example:

Chr1    .    gene    790397    794331    .    +    .    ID=BsG00000000001;Name=BsG00000000001;Note=Similar to PCMP-H27: Pentatricopeptide repeat-containing protein At4g35130%2C chloroplastic (Arabidopsis thaliana);
Chr1    .    mRNA    790397    794331    .    +    .    ID=BsM00000000001;Parent=BsG00000000001;Name=BsM00000000001;Note=Similar to PCMP-H27: Pentatricopeptide repeat-containing protein At4g35130%2C chloroplastic (Arabidopsis thaliana);
Chr1    .    exon    790397    794331    .    +    .    ID=BsE00000000001;Parent=BsM00000000001
Chr1    .    CDS    790476    792863    .    +    0    ID=BsC00000000001;Parent=BsM00000000001
Chr1    .    five_prime_UTR    790397    790475    .    +    .    ID=BsF00000000001;Parent=BsM00000000001
Chr1    .    three_prime_UTR    792864    794331    .    +    .    ID=BsT00000000001;Parent=BsM00000000001

Figure 2. Example for GFF3 file
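
A common cause of GFF3 problems is that tabs are turned into spaces when the file is edited or copied. The sketch below checks a single row against the 9-column layout and a few rules from the GFF3 specification (1-based coordinates with start not greater than end, strand in `+`, `-`, `.`, `?`, and a phase of 0, 1, or 2 on CDS rows). The function name `check_gff3_line` is hypothetical, and the check is only an approximation of what the GWH pipeline may enforce.

```python
def check_gff3_line(line):
    """Return None if the row looks like a valid GFF3 feature row, else a message."""
    if not line.strip() or line.startswith("#"):
        return None  # blank lines, comments and directives are not feature rows
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 9:
        return f"expected 9 tab-separated columns, found {len(cols)}"
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    if not (start.isdigit() and end.isdigit()):
        return "start and end must be positive integers"
    if int(start) < 1 or int(start) > int(end):
        return "start must be >= 1 and <= end"
    if strand not in {"+", "-", ".", "?"}:
        return f"unexpected strand value {strand!r}"
    if phase not in {".", "0", "1", "2"}:
        return f"unexpected phase value {phase!r}"
    if ftype == "CDS" and phase == ".":
        return "CDS rows need a phase of 0, 1 or 2"
    return None
```

Running this over every line of an annotation file and printing the non-`None` results with their line numbers gives a quick pre-upload report.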

A TBL file is a plain text file with 5 tab-separated columns (Figure 3). For an explanation of annotation in TBL format, see https://www.ncbi.nlm.nih.gov/genbank/feature_table/

Example:

>Feature Sc_16
<1      1050    gene
                        gene    ATH1
<1      1009    CDS
                        product    acid trehalase
                        codon_start    2
<1      1050     mRNA
                        product    acid trehalase
3450    4536    gene
                        gene    YIP2
3522    3572    CDS
3706    4197
                        product    Yip2p
3450    3572    mRNA
3706    4536
                        product    Yip2p
                    

Figure 3. Example for TBL file
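
To illustrate how the five columns are used, here is a minimal sketch of a reader for the layout shown in Figure 3, assuming real tab separators: feature rows fill columns 1 to 3 (start, stop, feature key), continuation rows fill only columns 1 and 2 (an additional interval of the same feature), and qualifier rows fill only columns 4 and 5 (qualifier key and value). The function name `parse_tbl` and the returned dictionary layout are hypothetical, and partial-location markers such as `<1` are kept as plain strings.

```python
def parse_tbl(lines):
    """Parse feature-table lines into a list of feature dictionaries."""
    features = []
    seq_id = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith(">Feature"):
            # ">Feature <sequence_ID>" opens the table for one sequence.
            seq_id = line.split()[1]
            continue
        cols = line.split("\t")
        cols += [""] * (5 - len(cols))  # pad short rows to 5 columns
        start, stop, key, qualifier, value = cols[:5]
        if start and stop and key:
            features.append({"seq_id": seq_id, "key": key,
                             "intervals": [(start, stop)], "qualifiers": {}})
        elif start and stop and features:
            features[-1]["intervals"].append((start, stop))
        elif qualifier and features:
            features[-1]["qualifiers"][qualifier] = value
    return features
```

Applied to the YIP2 CDS rows of Figure 3, this yields one feature with two intervals and the qualifier `product` set to `Yip2p`.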

Please note that all genome annotation data from the same genome assembly must be in exactly one genome annotation file. In a batch submission, genome annotation data from different genome assemblies must be stored in separate genome annotation files.

5. Assignment Information File (required for certain condition)

When the assembly level is complete or draft at chromosome level, a chromosome assignment file is required. When the assembly level is complete, (1) all chromosomes must be included, meaning that each chromosome is a single sequence and there are no extra sequences; and (2) each sequence in the genome must be assigned to a chromosome, plasmid, or organelle.

An assignment file (csv format) should contain four items: 'Sequence ID', 'Chromosome/Plasmid name' or 'Type', 'Complete', 'Circular'.

The Sequence ID in the assignment file must come from the Sequence ID in the genome sequence file. For example, if the genome sequence file contains: ">ctg1 chromosome1", then the sequence ID is "ctg1".

In the chromosome assignment file, "Complete=true" indicates that the sequence represents an entire chromosome. "Complete=false" means that the sequence is associated with a certain chromosome but has not been localized to a specific position on it; that is, it represents only a segment of the chromosome. "Complete" is defined in the same way in plasmid/organelle assignment files.

The topology information "Circular=true" indicates that the sequence is circular (the beginning and end of the sequence can contain a gap). "Circular=false" indicates that the sequence is linear or a segment of a circular sequence (the beginning and end of the sequence cannot contain a gap).

A more detailed explanation can be found in the help.
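
The rule that every Sequence ID in the assignment file must come from the genome sequence file can be checked before upload. The sketch below makes several assumptions that the text above does not confirm: the CSV has one header row, the Sequence ID is in the first column, and the function names are hypothetical. The `require_all` flag models the "complete" assembly level, where every sequence needs an assignment.

```python
import csv
import io


def fasta_ids(fasta_text):
    """Collect the sequence ID (first word after '>') of each definition line."""
    return [line[1:].split()[0]
            for line in fasta_text.splitlines()
            if line.startswith(">") and line[1:].split()]


def check_assignment(fasta_text, assignment_csv, has_header=True, require_all=True):
    """Cross-check assignment rows against the IDs in the genome sequence file."""
    ids = set(fasta_ids(fasta_text))
    rows = list(csv.reader(io.StringIO(assignment_csv)))
    if has_header:
        rows = rows[1:]  # assumption: the first row holds the column names
    problems = []
    assigned = set()
    for row in rows:
        if not row:
            continue
        seq_id = row[0].strip()  # assumption: Sequence ID is the first column
        if seq_id not in ids:
            problems.append(f"{seq_id}: not found in genome sequence file")
        assigned.add(seq_id)
    if require_all:
        # Complete assemblies: each sequence must be assigned.
        for seq_id in sorted(ids - assigned):
            problems.append(f"{seq_id}: no assignment row")
    return problems
```

With a definition line such as ">ctg1 chromosome1", `fasta_ids` returns "ctg1", which matches the Sequence ID convention described above.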

6. Uploading Files through FTP

If you need to submit genome sequence files, genome annotation files, or chromosome/plasmid/organelle assignment files, please upload them separately via FTP.

Address: ftp://submit.big.ac.cn

User: Same as your GWH login user name

Password: Same as your GWH login password

Path: /GWH/Batchxxxxxxx
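
Any FTP client can be used for the upload. As one scripted option, the sketch below uses Python's standard `ftplib` module. The helper name `upload_files` is hypothetical, the credentials and filenames in the usage comment are placeholders, and the batch directory must be replaced with the path shown for your own batch. Whether the server needs particular connection settings (for example, passive mode or TLS) is not stated here, so treat the connection part as an assumption.

```python
import ftplib
import os


def upload_files(ftp, batch_dir, local_paths):
    """Upload each local file into batch_dir over an already logged-in FTP session."""
    ftp.cwd(batch_dir)
    uploaded = []
    for path in local_paths:
        name = os.path.basename(path)
        with open(path, "rb") as handle:
            # STOR stores the file under its base name in the current directory.
            ftp.storbinary(f"STOR {name}", handle)
        uploaded.append(name)
    return uploaded


# Example usage (placeholders, not run here):
#
# with ftplib.FTP("submit.big.ac.cn") as ftp:
#     ftp.login(user="YOUR_GWH_USER", passwd="YOUR_GWH_PASSWORD")
#     upload_files(ftp, "/GWH/Batchxxxxxxx", ["genome.fasta.gz", "genome.gff"])
```

Passing the connection in as an argument keeps the upload logic separate from the login step, so the same helper can be reused for several batches in one session.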

7. Metagenome data submission

Metagenome data include raw reads, primary metagenome, binned metagenome and metagenome-assembled genome (MAG). The raw reads should be submitted to the GSA (linked to primary BioSample accession) and the contigs made from overlapping reads can be submitted as the pieces of one or more genome assemblies to the Genome Warehouse (GWH).

Figure 4. Data structure for metagenome


Table 1. The requirements for different data types of metagenome

Deposited Database BioProject Primary BioSample Raw Reads Organism-specific BioSample(s) Genome Sequence Genome Annotation
Raw Reads (required) GSA / / /
Primary metagenome (optional) GWH / / /
Binned metagenome (optional) GWH Optional
Metagenome-Assembled Genome (MAG) (optional) GWH Optional

Please note:

Do not include sequences that you have merely downloaded from a public repository.

Register the project in the BioProject database and the physical metagenomic sample (Primary BioSample, the organism name should be "xxxx metagenome") in the BioSample database.

The raw reads submission is required and should be submitted to the Genome Sequence Archive (GSA) (linked to primary BioSample accession).

For MAG/Binned, both primary BioSample and organism-specific BioSample(s) are required. The organism-specific BioSamples must: (1) include the BioProject accession (the same as primary BioSample); (2) include all of the source attributes that are in the physical metagenomic sample (e.g., geo_loc_name, collection-date, lat-lon, isolation-source); (3) include a unique isolate name; (4) include a prokaryotic or eukaryotic organism name at genus/species level from the Taxonomy database.

For MAG/Binned, each organism-specific BioSample should link to the primary BioSample by providing "BioSample accession" and "BioSample for Metagenome Primary Assembly" in "Batch Meta File".

For MAG/Binned, if the organism name cannot be matched in the Taxonomy database during BioSample submission, please contact gsa@big.ac.cn to have your confirmed new organism name added; if the organism name can be matched in the Taxonomy database but was not provided during BioSample submission, please contact gsa@big.ac.cn to resolve it.

8. Haplotype data submission

GWH accepts haplotypes as separate genome assemblies and links them to each other. The haplotype assemblies of a polyploid genome share the same BioSample, have separate BioProjects, and are grouped under an umbrella BioProject that associates those BioProjects (Figure 5).


Figure 5. Haplotype data submission association.

The types of haplotypes can be:

a. Principal haplotype / Alternate haplotype

Use these when one assembly is much better than the other. Name them based on their sequence length or sequencing accuracy. Because each pseudohaplotype assembly is derived from the same sample, both assemblies share the same BioSample.

b. Haplotype 1 / Haplotype 2 / Haplotype 3 / Haplotype 4

Use these when the assemblies are of similar quality. When more than 2 haplotypes are present, use Haplotype 3 / Haplotype 4 for the additional assemblies.

c. Maternal haplotype / Paternal haplotype

Use these when the parental origin of each haplotype is known.

d. Diploid

Diploid cells contain two complete sets (2n) of chromosomes. A diploid assembly is a genome assembly for which a chromosome assembly is available for both sets of an individual's chromosomes. A diploid genome assembly is expected to represent the genome of an individual; therefore, alternate loci are not expected to be defined for it, although unlocalized or unplaced sequences may be part of the assembly.

e. Polyploid

Polyploid cells contain multiple complete sets of chromosomes.

f. Haploid-with-alt-loci

The collection of chromosome assemblies, unlocalized and unplaced sequences and alternate loci that represent an organism's genome. Any locus may be represented 0, 1 or >1 time, but entire chromosomes are only represented 0 or 1 times.

g. Unresolved-diploid

The assembly methodology creates separate sequences for the two haplotypes of a genome but the submitter is not able to distinguish them into two haplotypes. This type of genome assembly is an Unresolved diploid assembly, and is submitted with the Single or Batch submission option, whichever is the most appropriate. A genome assembly from a diploid in which many of the haplotypic sequences have been resolved but the two haplotypes have not been separated. Consequently, the assembly will be much larger than the expected haploid genome size and many genes will be present in two copies.

Please note:

Since the haplotype genomes of a polyploid genome are assembled separately from the same sample, they must be associated with the same BioSample, i.e. the same "Biosample Accession" (column 1 in the batch excel file).

Since a polyploid genome can be assembled into two or more separate haplotype genomes, a different BioProject needs to be created for each haplotype genome, i.e. a different "Haplotype BioProject" (column 38 in the batch excel file), in order to distinguish the data and preserve the associations between them. All of these BioProjects also belong to a general umbrella BioProject (namely the BioProject accession in step 2 of the submission page), which links the Haplotype BioProjects into an umbrella structure.

For example, to submit Principal haplotype and Alternate haplotype data, first create two separate BioProjects (skip this if they already exist) to distinguish the two haplotype genome assemblies. Then create the general umbrella BioProject and associate the two BioProjects with it by adding them to the associated projects in the basic information of the second step (Figure 6). Please note that an associated project can also be a project accession that has been released by another user, as long as its content is related to the content you submit. Finally, create the corresponding BioSample and associate it with the umbrella BioProject.

Figure 6. An example of BioProject Associate

When the submitted data are haplotypes, the correspondence between the different haplotype genomes needs to be reflected in the "Assembly name" (column 2 in the batch excel file): use the same prefix, followed by the suffix *.pri / *.alt, *.hap1 / *.hap2 / *.hap3 / *.hap4, or *.mat / *.pat. The suffixes abbreviate Principal haplotype / Alternate haplotype, Haplotype 1 / Haplotype 2 / Haplotype 3 / Haplotype 4, and Maternal haplotype / Paternal haplotype, respectively. For example, bSteHir1.pri & bSteHir1.alt, mCalJac1.pat & mCalJac1.mat, or rPleGil1.0.hap1 & rPleGil1.0.hap2.
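
The naming convention can be checked for a set of related assemblies before filling in the batch meta file. The following is a minimal sketch: the function name `check_haplotype_names` is hypothetical, and it assumes that the suffix is everything after the last "." in the assembly name and that suffixes from different groups are not mixed within one set.

```python
# Allowed suffix groups taken from the naming convention above.
SUFFIX_GROUPS = [
    {"pri", "alt"},
    {"hap1", "hap2", "hap3", "hap4"},
    {"pat", "mat"},
]


def check_haplotype_names(names):
    """Return a list of problems for one set of related haplotype assembly names."""
    problems = []
    # rpartition splits on the LAST ".", so "rPleGil1.0.hap1" keeps prefix "rPleGil1.0".
    parts = [name.rpartition(".") for name in names]
    prefixes = {prefix for prefix, _, _ in parts}
    suffixes = [suffix for _, _, suffix in parts]
    if len(prefixes) != 1:
        problems.append("assembly names do not share one prefix")
    if len(set(suffixes)) != len(suffixes):
        problems.append("a suffix is used more than once")
    if not any(set(suffixes) <= group for group in SUFFIX_GROUPS):
        problems.append("suffixes must all come from one group: pri/alt, hap1-hap4, or pat/mat")
    return problems
```

For example, `check_haplotype_names(["bSteHir1.pri", "bSteHir1.alt"])` reports no problems, while a pair with different prefixes or with suffixes from two groups is flagged.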