The GWH Handbook (Version beta, July 2017) containing detailed data items' descriptions is freely available here.
The GWH Submission Quick Start Guide (Version beta, July 2017) containing submission descriptions is freely available here.
Designed for compatibility, Genome Warehouse (GWH) follows INSDC data standards and structures. All data are organized into three objects: BioProject, BioSample, and Genome (Figure 1). A BioProject, which bears an accession number prefixed with "PRJC", provides an overall description of an individual research initiative, including a basic description, organism, data type, submitter, funding information, and publication(s) if available.
Figure 1: Data model in GWH
Data relationships in GWH are as follows.
BioProject: is an overall description of a single research initiative, typically involving multiple samples.
BioSample: describes biological source material; each physically unique specimen should be registered as a single BioSample with a unique set of attributes.
Genome: describes the detailed genome assembly for a BioSample. One BioSample has one or more genome assemblies. For example, one plant sample may have a mitochondrial genome and a full genome. Each genome contains a genome sequence file, and should also contain a genome annotation file, AGP file(s), and genome assignment file(s).
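The relationships above can be sketched as nested records. This is only an illustration of the data model, not GWH's schema: the class and field names, the file names, and the placeholder accession "PRJCXXXXXX" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Genome:
    """One genome assembly belonging to a BioSample (field names are illustrative)."""
    title: str
    sequence_file: str                      # genome sequence file (FASTA)
    annotation_file: Optional[str] = None   # genome annotation file (GFF or TBL)
    agp_files: List[str] = field(default_factory=list)
    assignment_files: List[str] = field(default_factory=list)


@dataclass
class BioSample:
    """One physically unique specimen with its own set of attributes."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    genomes: List[Genome] = field(default_factory=list)


@dataclass
class BioProject:
    """One research initiative, typically involving multiple samples."""
    accession: str  # prefixed with "PRJC"; the value used below is a placeholder
    title: str
    samples: List[BioSample] = field(default_factory=list)


def count_genomes(project: BioProject) -> int:
    """Count genome assemblies across all samples of a project."""
    return sum(len(sample.genomes) for sample in project.samples)


# One plant sample with two assemblies, as in the example in the text.
plant = BioSample(name="plant-1", attributes={"tissue": "leaf"})
plant.genomes.append(Genome("mitochondrial genome", "mito.fasta"))
plant.genomes.append(Genome("full genome", "full.fasta", annotation_file="full.gff"))
project = BioProject(accession="PRJCXXXXXX", title="Example initiative", samples=[plant])
```

Here `count_genomes(project)` returns 2, because the single BioSample carries two Genome records.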
GWH stands for Genome Warehouse, a data repository for genome assembly data. It archives genome assembly sequences, genome annotations, and other associated data. GWH is one of the database resources of the BIG Data Center (BIGD), part of the Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS), and serves as a primary archive of genome assembly data for institutions and laboratories worldwide.
Only registered users can submit data using the Genome Sequence Submission (Gsub) System. Briefly, data submission requires the following steps.
a) Create a BIGD account and/or log in to GWH;
b) Enter metadata and specify a release date;
c) Submit data files.
Any user can freely register and create a Gsub account. After you submit your registration data, a confirmation email is automatically sent to you so that you can activate your account.
♦ If you have forgotten only your password, you can reset it by clicking “Forgot password”. You will receive an e-mail; please follow the URL in it to reset your password within 30 minutes.
♦ If you are already a member and you’ve forgotten both your GWH username and password, please feel free to contact us. We will do our best to help you.
Data submission requires that you log in to the Genome Sequence Submission (Gsub) System, so you need to create an account if you are not yet a member.
Please note that fields marked are required when submitting metadata.
The current version of GWH (1.0beta) supports submitting files either directly online or via FTP. It is highly recommended that you submit your files using a dedicated FTP tool (e.g., FileZilla). Please transfer your data files to the GWH FTP site using the following credentials:
User: Same as your Gsub username
Password: Same as your Gsub password
Path: /GWH/WGSXXXXXX (your submission ID).
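For scripted uploads, the same transfer can be sketched with Python's standard `ftplib`. This is a minimal sketch under stated assumptions: `FTP_HOST` is a placeholder because the real host name is not given here, plain FTP is assumed, and the function names are hypothetical.

```python
from ftplib import FTP
from pathlib import Path, PurePosixPath

# Placeholder only: replace with the actual GWH FTP host.
FTP_HOST = "ftp.example.org"


def remote_dir(submission_id: str) -> str:
    """Build the upload directory /GWH/<submission ID> described above."""
    return str(PurePosixPath("/GWH") / submission_id)


def upload(local_file: str, submission_id: str, user: str, password: str) -> None:
    """Upload one file in binary mode into the submission directory."""
    with FTP(FTP_HOST) as ftp:
        ftp.login(user=user, passwd=password)   # same credentials as the Gsub login
        ftp.cwd(remote_dir(submission_id))
        with open(local_file, "rb") as handle:
            ftp.storbinary(f"STOR {Path(local_file).name}", handle)
```

Binary mode is used so that compressed files are transferred byte for byte, which keeps their MD5 checksums unchanged.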
In the current version, we accept the following file formats for genome-associated data:
♦ Genome sequence: FASTA (Step3 Files)
♦ Genome annotation: GFF or TBL (Step3 Files)
♦ Sequence ordering and orientation information: AGP (Step3 Files). Note: required if the genome assembly is a complete genome or a chromosome-level draft genome.
♦ Sequence assignment information: CSV (Step4 Assignment). Note: required if the genome assembly is a scaffold-level or chromosome-level draft genome.
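The two notes above amount to a small rule table, sketched below. The assembly-level labels and the function name are assumptions made for the example, not GWH's own vocabulary, and the sketch assumes that the FASTA file is always required.

```python
from typing import Set

# Illustrative labels for assembly levels (hypothetical names).
COMPLETE = "complete"
DRAFT_CHROMOSOME = "draft-chromosome"
DRAFT_SCAFFOLD = "draft-scaffold"


def required_files(assembly_level: str) -> Set[str]:
    """Return the file formats that must accompany an assembly of this level."""
    required = {"FASTA"}  # assumed to be mandatory for every genome
    if assembly_level in (COMPLETE, DRAFT_CHROMOSOME):
        required.add("AGP")  # ordering and orientation information
    if assembly_level in (DRAFT_SCAFFOLD, DRAFT_CHROMOSOME):
        required.add("CSV")  # sequence assignment information
    return required
```

A chromosome-level draft genome falls under both notes, so it needs both the AGP and the CSV file in addition to the FASTA file.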
All files that you submit via FTP are regularly moved from the FTP site to a staging area for processing, so it is quite normal for files to “disappear” from FTP. If files pass the validation process, they will be made public or placed under controlled access according to the release date set by the user, and the status will change to 'Released' or 'Successful', respectively.
MD5 checksums are used to verify the integrity of transmitted data. An MD5 checksum is a 32-character hexadecimal string like "e3b5dd475c449300dd11f258538ff494".
♦ For Linux users, use: $ md5sum filename
♦ For Mac users, use: $ md5 filename
♦ For Windows users, use: $ certutil -hashfile filename MD5; if the hash is printed with spaces, remove them to obtain the 32-character code. Alternatively, use a third-party tool.
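As a platform-independent alternative to the commands above, the checksum can also be computed with Python's standard `hashlib`. The function names are hypothetical; `normalize_checksum` covers the case where a tool prints the hash with spaces or in upper case.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 checksum of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def normalize_checksum(text: str) -> str:
    """Remove all whitespace and lowercase the hash so that two values compare equal."""
    return "".join(text.split()).lower()
```

Comparing `md5_of_file(path)` with the normalized value you submitted is a quick way to catch an "MD5 code is inconsistent" error before uploading.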
♦ File does not exist
♦ MD5 code is inconsistent
♦ Illegal compressed format
♦ Not a plain text file
♦ Invalid fasta format
♦ Invalid genome sequence ID
♦ Invalid genome sequence ID start
♦ Repeat Sequence ID
♦ Null sequence, only seqID
♦ Starts/ends with N in genome sequence except for a circular sequence
♦ Sequence length < 200bp
♦ Invalid bases
♦ Invalid gff format
♦ Invalid tbl format
♦ Extra parent info for gene in gff
♦ Absent ID in gff
♦ Absent parent features in gff
♦ Absent parent info for RNA/transcript/exon/CDS/UTR feature in gff
♦ ID equal Parent ID in gff
♦ Repeat ID in gff
♦ CDS length is not a multiple of three except "Transl_except" or partial CDS
♦ Too short of CDS length
♦ Illegal start codon except "RNA-editing" or partial CDS
♦ Illegal stop codon except "Transl_except" or partial CDS
♦ Internal stop codon
♦ Illegal codon_start value in tbl
♦ Illegal frame value in gff
♦ Illegal strand value
♦ Repeat features in tbl/gff
♦ Redundant features in tbl/gff (e.g.: intron)
♦ Conflict sequence ID between parent and child features
♦ Conflict strand between parent and child features
♦ A feature coordinate falling out of the range of the corresponding parent feature
♦ ID and parent ID disorder in gff
♦ Illegal frame value of the first CDS region in a transcript except partial CDS
♦ Transcript features of "Trans-splicing" does not contain part info
♦ Sequence ID in genome annotation does not exist in genome sequence ID
♦ A feature coordinate in genome annotation falls out of the range of the corresponding genome sequence
♦ Sequence content is inconsistent with assembly level
♦ Invalid sequence assignment file format
♦ Invalid chromosome/organella/plasmid assignment name in sequence assignment file
♦ Sequence ID in assignment file does not exist in genome sequence ID
♦ A chromosome/organella/plasmid name corresponds to more than one complete sequence ID
♦ A chromosome/organella/plasmid name corresponds to more than one circular sequence ID
♦ A sequence ID has more than one assignment record
♦ Sequence may be of vector/adaptor/primer/index origin
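A few of the sequence-level checks listed above can be approximated locally before upload. The sketch below is not GWH's validator: the set of accepted bases (IUPAC nucleotide codes) and the function names are assumptions, and only five of the listed errors are covered.

```python
from typing import FrozenSet, List, Tuple

VALID_BASES = set("ACGTUNRYSWKMBDHV")  # assumption: IUPAC nucleotide codes
MIN_LENGTH = 200                       # from "Sequence length < 200bp"


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (sequence ID, upper-case sequence) pairs."""
    records: List[Tuple[str, str]] = []
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line.upper())
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def check_fasta(text: str, circular_ids: FrozenSet[str] = frozenset()) -> List[str]:
    """Return messages for a subset of the sequence errors listed above."""
    errors: List[str] = []
    seen = set()
    for seq_id, seq in parse_fasta(text):
        if seq_id in seen:
            errors.append(f"{seq_id}: Repeat Sequence ID")
        seen.add(seq_id)
        if not seq:
            errors.append(f"{seq_id}: Null sequence, only seqID")
            continue
        if len(seq) < MIN_LENGTH:
            errors.append(f"{seq_id}: Sequence length < 200bp")
        if set(seq) - VALID_BASES:
            errors.append(f"{seq_id}: Invalid bases")
        # Leading or trailing N is tolerated only for circular sequences.
        if seq_id not in circular_ids and (seq[0] == "N" or seq[-1] == "N"):
            errors.append(f"{seq_id}: Starts/ends with N")
    return errors
```

An empty list from `check_fasta` means only that none of these five conditions was detected; annotation, AGP, and assignment checks still happen on the server.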
When you submit data, you will find a button named “Release date” at the bottom of the "Step 2 General info" web page. After you specify the release date, the data will be released on that date. Note that the release of the BioProject and BioSample is also triggered by the release of the WGS-associated data. It is suggested that you set the release date of the Genome later than that of the BioProject or BioSample. If a paper citing the sequence or accession number is published prior to the specified date, the sequence will be released upon publication. Otherwise, GWH will release the sequence data on the specified date. The release date can be changed through the genome portal.
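The release rule described above (publication before the specified date triggers an earlier release) can be written as a short function. This is a sketch of the rule as stated in the text, with a hypothetical function name, not a description of how GWH implements it.

```python
from datetime import date
from typing import Optional


def effective_release_date(specified: date, published: Optional[date] = None) -> date:
    """Return the date on which the sequence becomes public.

    If a citing paper is published before the specified date, release happens
    upon publication; otherwise the specified date applies.
    """
    if published is not None and published < specified:
        return published
    return specified
```

For example, with a specified date of 2018-06-01 and a paper published on 2018-03-15, the function returns 2018-03-15; with no publication it returns 2018-06-01.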
A GWH accession number is prefixed with ‘GWH’, followed by 4 capital letters and 8 digits, for example, GWHXXXX00000000. Please cite the genome accession number GWHXXXX00000000 in your publication as follows (we recommend placing this paragraph in the Materials and Methods section of the paper):
The whole genome sequence data reported in this paper have been deposited in the Genome Warehouse in National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation, under accession number GWHXXXX00000000 that is publicly accessible at https://ngdc.cncb.ac.cn/gwh.
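Before citing an accession number, its format (the prefix GWH, then 4 capital letters, then 8 digits) can be checked with a regular expression. The function name is hypothetical, and the check confirms only the format, not that the accession exists.

```python
import re

# "GWH" + 4 capital letters + 8 digits, as described in the text.
ACCESSION_RE = re.compile(r"GWH[A-Z]{4}[0-9]{8}")


def is_gwh_accession(text: str) -> bool:
    """Return True if the whole string has the GWH accession format."""
    return ACCESSION_RE.fullmatch(text) is not None
```

The placeholder GWHXXXX00000000 used in the citation template passes this check, while a string with lower-case letters or only 7 digits does not.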
You are also welcome to visit us to explore possible collaboration or to learn more about GWH.
National Genomics Data Center
Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation
No.1 Beichen West Road, Chaoyang District
Beijing 100101, China
Tel: +86 (10) 8409-7858
+86 (10) 8409-7298
Fax: +86 (10) 8409-7720