Genome Warehouse

If you have not yet created the related BioProject or BioSample, first create the project in BioProject and the sample in BioSample.

Different omics data can be associated with the same BioProject and BioSample. If these were already created when submitting other omics data, such as raw reads in the Genome Sequence Archive (GSA), they can be reused and do not need to be recreated.

Template files: GWH-batchsubmission-English.xlsx and GWH-batchsubmission-Chinese.xlsx

Green columns are required. Blue columns are required under certain conditions. Gray columns are optional and can be left blank if the information is not available. Yellow areas have a drop-down menu for selecting from a controlled vocabulary. Each item is explained in its prompt instructions; please fill in the items according to those requirements.

Please note: (1) Do not delete or insert any columns; (2) Be careful when using the Excel autofill function, which can introduce unintended changes, e.g., to dates or numbers; (3) Check the file carefully before submitting to shorten the audit time for your batch submission; (4) Use an English input method.

In step 4 of the submission page, use "Browse→Upload→Validate" to upload the completed meta file. If there are any errors, follow the hints to correct them until the file passes the online quality control (that is, 'Checked OK').

Example:

>Seq1

CCTTTAT...

>Seq2 chromosome 1

GGTAGGT...

The genome annotation file is optional. It is recommended to submit the annotation file corresponding to the sequence file in GFF/TBL format.

A GFF3 file is a plain text file with 9 tab-separated columns (Figure 2). For an explanation of annotation in GFF3 format, see https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/

Example:


Chr1    .    gene    790397    794331    .    +    .    ID=BsG00000000001;Name=BsG00000000001;Note=Similar to PCMP-H27: Pentatricopeptide repeat-containing protein At4g35130%2C chloroplastic (Arabidopsis thaliana);

Chr1    .    mRNA    790397    794331    .    +    .    ID=BsM00000000001;Parent=BsG00000000001;Name=BsM00000000001;Note=Similar to PCMP-H27: Pentatricopeptide repeat-containing protein At4g35130%2C chloroplastic (Arabidopsis thaliana);

Chr1    .    exon    790397    794331    .    +    .    ID=BsE00000000001;Parent=BsM00000000001

Chr1    .    CDS    790476    792863    .    +    0    ID=BsC00000000001;Parent=BsM00000000001

Chr1    .    five_prime_UTR    790397    790475    .    +    .    ID=BsF00000000001;Parent=BsM00000000001

Chr1    .    three_prime_UTR    792864    794331    .    +    .    ID=BsT00000000001;Parent=BsM00000000001

Figure 2. Example of a GFF3 file
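The nine-column layout can be checked with a short script. The following is a minimal Python sketch, not a GWH tool: the function name `parse_gff3_line` is hypothetical, the column names follow the GFF3 convention, and it assumes one feature per line with percent-encoded attribute values (such as `%2C` for a comma).

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Split one GFF3 feature line into its 9 tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds ';'-separated tag=value pairs with percent-encoded values.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attributes[tag] = unquote(value)
    record["attributes"] = attributes
    return record


line = "Chr1\t.\tCDS\t790476\t792863\t.\t+\t0\tID=BsC00000000001;Parent=BsM00000000001"
cds = parse_gff3_line(line)
print(cds["type"], cds["start"], cds["end"], cds["attributes"]["Parent"])
```

A line that was saved with spaces instead of tabs raises `ValueError`, which is a quick way to catch a common formatting mistake before upload.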

A TBL file is a plain text file with 5 tab-separated columns (Figure 3). For an explanation of annotation in TBL format, see https://www.ncbi.nlm.nih.gov/genbank/feature_table/

Example:


>Feature Sc_16
<1      1050    gene
                        gene    ATH1
<1      1009    CDS
                        product    acid trehalase
                        codon_start    2
<1      1050     mRNA
                        product    acid trehalase
3450    4536    gene
                        gene    YIP2
3522    3572    CDS
3706    4197
                        product    Yip2p
3450    3572    mRNA
3706    4536
                        product    Yip2p

Figure 3. Example of a TBL file
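To make the five-column layout concrete, here is a minimal Python sketch that labels each line of a TBL file. It is an illustration only: the function name `classify_tbl_line` is hypothetical, and it assumes that columns are separated by single tab characters, that feature lines use columns 1 to 3, that continuation intervals use columns 1 and 2, and that qualifier lines use columns 4 and 5.

```python
def classify_tbl_line(line):
    """Classify one line of a 5-column, tab-separated feature table."""
    if line.startswith(">Feature"):
        parts = line.split(None, 1)
        return ("header", parts[1].strip() if len(parts) > 1 else "")
    cols = line.rstrip("\n").split("\t")
    cols += [""] * (5 - len(cols))  # pad short lines to 5 columns
    start, stop, feature, qualifier, value = cols[:5]
    if start and stop and feature:
        return ("feature", start, stop, feature)
    if start and stop:
        return ("interval", start, stop)  # further span of the previous feature
    if qualifier:
        return ("qualifier", qualifier, value)
    raise ValueError(f"unrecognised line: {line!r}")


example = [
    ">Feature Sc_16",
    "<1\t1050\tgene",
    "\t\t\tgene\tATH1",
    "3522\t3572\tCDS",
    "3706\t4197",
    "\t\t\tproduct\tYip2p",
]
for row in example:
    print(classify_tbl_line(row))
```

Blank lines are not valid input for this sketch and should be skipped by the caller.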

Please note that all genome annotation data for one genome assembly must be placed in exactly one genome annotation file. In a batch submission, annotation data for different genome assemblies must be stored in separate genome annotation files.

The assignment information files include the chromosome assignment file, the plasmid assignment file, and the organelle assignment file.

When the assembly level is complete or draft in chromosome, a chromosome assignment file is required. When the assembly level is complete: (1) all chromosomes are included, meaning that each chromosome is represented by a single sequence and there are no extra sequences; and (2) each sequence in the genome must be assigned to a chromosome, plasmid, or organelle.
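The two conditions for a complete assembly can be expressed as a small check. The Python sketch below is an illustration under stated assumptions, not the validation that GWH performs: the function name `check_complete_assembly` is hypothetical, and the assignments are assumed to be available as a mapping from sequence ID to chromosome, plasmid, or organelle name.

```python
from collections import Counter


def check_complete_assembly(sequence_ids, assignments):
    """Return problems that conflict with the rules for a complete assembly.

    sequence_ids: IDs of all sequences in the genome sequence file.
    assignments:  dict mapping sequence ID -> chromosome/plasmid/organelle name.
    """
    problems = []
    # Rule (2): every sequence must be assigned.
    for seq_id in sequence_ids:
        if seq_id not in assignments:
            problems.append(f"{seq_id} is not assigned")
    # Rule (1): each chromosome is represented by a single sequence.
    for name, count in Counter(assignments.values()).items():
        if count > 1:
            problems.append(f"'{name}' is represented by {count} sequences")
    return problems


print(check_complete_assembly(["ctg1", "ctg2", "ctg3"],
                              {"ctg1": "1", "ctg2": "1"}))
```

An empty list means that neither rule was violated for the given input.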

An assignment file (csv format) should contain four items: 'Sequence ID', 'Chromosome/Plasmid name' or 'Type', 'Complete', 'Circular'.

The Sequence ID in the assignment file must come from the Sequence ID in the genome sequence file. For example, if the genome sequence file contains: ">ctg1 chromosome1", then the sequence ID is "ctg1".
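The rule above can be sketched in Python as follows. This is a minimal illustration that assumes the Sequence ID is the text between ">" and the first whitespace of each header line; the function name `read_sequence_ids` is hypothetical.

```python
def read_sequence_ids(fasta_lines):
    """Return the Sequence ID of every FASTA header line, in file order."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # A header with nothing after '>' has no usable ID.
            if not fields:
                raise ValueError("empty FASTA header")
            ids.append(fields[0])
    return ids


example = [">ctg1 chromosome1", "CCTTTAT", ">ctg2", "GGTAGGT"]
print(read_sequence_ids(example))  # prints ['ctg1', 'ctg2']
```

The same function accepts an open file object, because iterating over a text file yields its lines.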

In the chromosome assignment file, "Complete=true" indicates that the sequence represents a whole chromosome. "Complete=false" means that the sequence is associated with a certain chromosome but has not been localized to a specific position on it; that is, it represents only a segment of the chromosome. "Complete" is defined similarly in the plasmid and organelle assignment files.

The topology information "Circular=true" indicates that the sequence is circular (the beginning and end of the sequence may contain a gap). "Circular=false" indicates that the sequence is linear or is a segment of a circular sequence (the beginning and end of the sequence must not contain a gap).
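A chromosome assignment file with these four items could be produced as in the Python sketch below. Several details are assumptions made for illustration and should be checked against the GWH help and templates: that the file starts with a header row, that the columns appear in this order, and that the values are written as lowercase `true`/`false`. The function name `write_assignment` and the example rows are hypothetical.

```python
import csv
import io

HEADER = ["Sequence ID", "Chromosome/Plasmid name", "Complete", "Circular"]


def write_assignment(rows, fasta_ids, out):
    """Write assignment rows as CSV after checking IDs against the FASTA IDs."""
    known = set(fasta_ids)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)  # assumption: the file carries a header row
    for seq_id, name, complete, circular in rows:
        if seq_id not in known:
            raise ValueError(f"{seq_id} is not in the genome sequence file")
        writer.writerow([seq_id, name,
                         str(complete).lower(), str(circular).lower()])


buffer = io.StringIO()
write_assignment([("ctg1", "1", True, False), ("ctg2", "2", False, False)],
                 fasta_ids=["ctg1", "ctg2"], out=buffer)
print(buffer.getvalue())
```

Checking each Sequence ID against the genome sequence file before writing catches mismatches locally rather than during online validation.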

A more detailed explanation can be found in the help.

Address: ftp://submit.big.ac.cn

User: the same as your GWH login

Password: the same as your GWH login

Path: /GWH/Batchxxxxxxx
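The upload can also be scripted. The following Python sketch uses the standard `ftplib` module and assumes that the server accepts a plain FTP login on the default port, as the `ftp://` address suggests; the function names `connect` and `upload_files` are hypothetical, and `Batchxxxxxxx` stands for your own batch directory.

```python
import os
from ftplib import FTP


def connect(user, password, host="submit.big.ac.cn"):
    """Open an FTP session with the same credentials as the GWH login."""
    ftp = FTP(host)
    ftp.login(user=user, passwd=password)
    return ftp


def upload_files(ftp, remote_dir, local_paths):
    """Upload each local file into remote_dir over an open FTP session."""
    ftp.cwd(remote_dir)
    uploaded = []
    for path in local_paths:
        name = os.path.basename(path)
        with open(path, "rb") as handle:
            ftp.storbinary(f"STOR {name}", handle)  # binary mode keeps files intact
        uploaded.append(name)
    return uploaded


# Example use (not executed here; replace the placeholders with your own values):
#   ftp = connect("your_user", "your_password")
#   upload_files(ftp, "/GWH/Batchxxxxxxx", ["genome.fasta", "genome.gff"])
#   ftp.quit()
```

Passing the open session into `upload_files` keeps the login separate from the transfer, so several groups of files can be sent over one connection.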

	Deposited Database	BioProject	Primary BioSample	Raw Reads	Organism-specific BioSample(s)	Genome Sequence	Genome Annotation
Raw Reads (required)	GSA	√	√	√	/	/	/
Primary metagenome (optional)	GWH	√	√	√	/	/	/
Binned metagenome (optional)	GWH	√	√	√	√	√	Optional
Metagenome-Assembled Genome (MAG) (optional)	GWH	√	√	√	√	√	Optional

Do not include sequences that you have merely downloaded from a public repository.

Register the project in the BioProject database and the physical metagenomic sample (the Primary BioSample, whose organism name should be "xxxx metagenome") in the BioSample database.

Raw reads are required and should be submitted to the Genome Sequence Archive (GSA), linked to the primary BioSample accession.

For MAG/Binned, both the primary BioSample and organism-specific BioSample(s) are required. Each organism-specific BioSample must: (1) include the BioProject accession (the same as that of the primary BioSample); (2) include all of the source attributes that are in the physical metagenomic sample (e.g., geo_loc_name, collection-data, lat-lon, isolation-source, etc.); (3) include a unique isolate name; (4) include a prokaryotic or eukaryotic organism name at the genus or species level from the Taxonomy database.
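The four requirements can be checked before submission with a short script. The Python sketch below is only an illustration: the function name `check_organism_biosamples`, the dictionary keys, and all example values are hypothetical placeholders, and it assumes that "include" means the source attribute values are copied unchanged from the primary BioSample. It does not look up organism names in the Taxonomy database.

```python
def check_organism_biosamples(primary, organism_samples, source_attributes):
    """Return a list of problems found in organism-specific BioSample records."""
    problems = []
    seen_isolates = set()
    for sample in organism_samples:
        isolate = sample.get("isolate")
        label = isolate or "<no isolate>"
        # (1) same BioProject accession as the primary BioSample
        if sample.get("bioproject") != primary.get("bioproject"):
            problems.append(f"{label}: BioProject differs from primary BioSample")
        # (2) source attributes carried over from the physical sample
        for key in source_attributes:
            if sample.get(key) != primary.get(key):
                problems.append(f"{label}: {key} does not match primary BioSample")
        # (3) unique isolate name
        if not isolate or isolate in seen_isolates:
            problems.append(f"{label}: isolate name missing or duplicated")
        seen_isolates.add(isolate)
        # (4) an organism name must be present
        if not sample.get("organism"):
            problems.append(f"{label}: organism name missing")
    return problems


primary = {"bioproject": "PROJECT_PLACEHOLDER",
           "geo_loc_name": "placeholder location"}
bins = [
    {**primary, "isolate": "bin_01", "organism": "placeholder organism"},
    {**primary, "isolate": "bin_01"},  # duplicated isolate, no organism
]
for problem in check_organism_biosamples(primary, bins, ["geo_loc_name"]):
    print(problem)
```

The list of source attributes is passed in by the caller so that the check follows whatever attributes the primary BioSample actually carries.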

For MAG/Binned, each organism-specific BioSample should be linked to the primary BioSample by providing "BioSample accession" and "BioSample for Metagenome Primary Assembly" in the "Batch Meta File".

For MAG/Binned, if the organism name cannot be matched in the Taxonomy database during BioSample submission, please contact gsa@big.ac.cn to have your confirmed new organism name added. If the organism name can be matched in the Taxonomy database but was not provided during BioSample submission, please contact gsa@big.ac.cn to have it corrected.

a. Principal haplotype / Alternate haplotype

Use these names if one assembly is much better than the other. Please name them based on their sequence length or sequencing accuracy. Because each pseudohaplotype assembly is derived from the same sample, both assemblies share the same BioSample.

b. Haplotype 1 / Haplotype 2 / Haplotype 3 / Haplotype 4

Use these names if the assemblies are of similar quality. When more than 2 haplotypes are present, use Haplotype 3 / Haplotype 4 for the additional assemblies.

c. Maternal haplotype / Paternal haplotype

Use these names when the parental origin of each haplotype is known.

d. Diploid

Diploid cells contain two complete sets (2n) of chromosomes. A diploid assembly is a genome assembly for which a chromosome assembly is available for both sets of an individual's chromosomes. A diploid genome assembly is expected to represent the genome of an individual; therefore, alternate loci are not expected to be defined for it, although unlocalized or unplaced sequences may be part of the assembly.

e. Polyploid

Polyploid cells contain multiple complete sets of chromosomes.

f. Haploid-with-alt-loci

The collection of chromosome assemblies, unlocalized and unplaced sequences and alternate loci that represent an organism's genome. Any locus may be represented 0, 1 or >1 time, but entire chromosomes are only represented 0 or 1 times.

g. Unresolved-diploid

The assembly methodology creates separate sequences for the two haplotypes of a genome, but the submitter is not able to separate them into two haplotypes. This type of genome assembly is an unresolved-diploid assembly and is submitted with the Single or Batch submission option, whichever is most appropriate. It is a genome assembly from a diploid in which many of the haplotypic sequences have been resolved but the two haplotypes have not been separated. Consequently, the assembly will be much larger than the expected haploid genome size and many genes will be present in two copies.

Because the haplotype genomes of a polyploid genome are assembled separately from the same sample, they must all be associated with the same BioSample, i.e., the same "Biosample Accession" (column 1 in the batch Excel file).

Because a polyploid genome can be assembled into two or more separate haplotype genomes, a different BioProject must be created for each haplotype genome, i.e., a different "Haplotype BioProject" (column 38 in the batch Excel file), in order to distinguish the assemblies and preserve the association between the data. Each Haplotype BioProject also belongs to a general umbrella BioProject (the BioProject accession in step 2 of the submission page), which links the different haplotype genome data. The umbrella BioProject and the Haplotype BioProjects it links thus form an umbrella structure.

For example, to submit Principal haplotype and Alternate haplotype data, first create two separate BioProjects (skip this if they already exist) to distinguish the two haplotype genome assemblies. Then create the general umbrella BioProject and associate the two BioProjects with it by adding them to the associated projects in the basic information of the second step (Figure 6). Please note that an associated project can also be a project accession made public by other users, as long as its content is related to your submission. Finally, create the corresponding BioSample, which must be associated with the umbrella BioProject.
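The two constraints on haplotype rows (one shared BioSample, one Haplotype BioProject per assembly) can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name `check_haplotype_rows`, the dictionary keys, and the accession strings are hypothetical placeholders rather than real accessions or column names from the template.

```python
def check_haplotype_rows(rows):
    """Check the rows of one haplotype set taken from the batch meta file.

    Each row is a dict with the keys 'biosample' (column 1) and
    'haplotype_bioproject' (column 38); the key names are placeholders.
    """
    problems = []
    biosamples = {row["biosample"] for row in rows}
    if len(biosamples) != 1:
        problems.append("haplotype assemblies must share one BioSample accession")
    projects = [row["haplotype_bioproject"] for row in rows]
    if len(set(projects)) != len(projects):
        problems.append("each haplotype assembly needs its own Haplotype BioProject")
    return problems


rows = [
    {"biosample": "SAMPLE_A", "haplotype_bioproject": "PROJECT_H1"},
    {"biosample": "SAMPLE_A", "haplotype_bioproject": "PROJECT_H2"},
]
print(check_haplotype_rows(rows))  # prints []
```

The umbrella BioProject itself is entered in step 2 of the submission page, so it is not part of this row-level check.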

1. BioProject and BioSample accession number

2. Batch Meta File

3. Genome Sequence File

4. Genome Annotation File (optional)

5. Assignment Information File (required under certain conditions)

6. Uploading Files through FTP

7. Metagenome data submission

8. Haplotype data submission