The GWH Handbook (Version 2, May 2024), containing detailed descriptions of the data items, is freely available here.
The GWH Submission Quick Start Guide (Version beta, July 2017) containing submission descriptions is freely available here.
Designed for compatibility, Genome Warehouse (GWH) follows INSDC data standards and structures. All data are organized into three objects: BioProject, BioSample, and Genome (Figure 1). A "BioProject", bearing an accession number prefixed with "PRJC", provides an overall description of an individual research initiative, including a basic description, organism, data type, submitter, funding information, and publication(s) if available.
Figure 1: Data model in GWH
Data relationships in GWH are as follows.
BioProject: an overall description of a single research initiative, typically involving multiple samples.
BioSample: describes biological source material; each physically unique specimen should be registered as a single BioSample with a unique set of attributes.
Genome: describes a genome assembly for a BioSample. One BioSample has one or more genome assemblies; for example, one plant sample may have both a mitochondrial genome and a full genome. Each genome contains a genome sequence file and should also contain a genome annotation file, AGP file(s), and genome assignment file(s).
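The relationships above can be pictured as a small nested data structure. The following is only an illustrative sketch: the class and field names are hypothetical and are not the GWH schema, and the accession and file names are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Genome:
    """One genome assembly belonging to a BioSample (hypothetical fields)."""
    sequence_file: str                       # genome sequence file, always present
    annotation_file: Optional[str] = None    # genome annotation file, if provided
    agp_files: List[str] = field(default_factory=list)
    assignment_files: List[str] = field(default_factory=list)


@dataclass
class BioSample:
    """Biological source material; may have one or more genome assemblies."""
    name: str
    genomes: List[Genome] = field(default_factory=list)


@dataclass
class BioProject:
    """A single research initiative, typically involving multiple samples."""
    accession: str                           # prefixed with "PRJC"
    samples: List[BioSample] = field(default_factory=list)


# One plant sample with a mitochondrial genome and a full genome.
sample = BioSample(
    name="plant-sample",
    genomes=[
        Genome(sequence_file="mitochondrion.fasta"),
        Genome(sequence_file="full_genome.fasta", annotation_file="full_genome.gff"),
    ],
)
project = BioProject(accession="PRJCXXXXXX", samples=[sample])
```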
GWH stands for Genome Warehouse, a data repository for genome assembly data. It archives genome assembly sequences, genome annotations, and other associated data. GWH is one of the database resources of the National Genomics Data Center (NGDC), part of the Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS), and serves as a primary archive of genome assembly data for institutions and laboratories worldwide.
Only registered users can submit data using the BIG Submission Portal (BIG Sub). Briefly, data submission requires the following steps.
a) Create a BIGD account and/or log in to GWH;
b) Enter metadata information and specify the release date;
c) Submit data files.
Any user can freely register and create a Gsub account. After you submit your registration details, a confirmation email is sent to you automatically so that you can activate your account.
♦ If you have forgotten your password, you can reset it by clicking “Forgot password”. You will receive an email; please follow the URL in it to reset your password within 30 minutes.
♦ If you are already a member and have forgotten both your GWH username and password, please feel free to contact us. We will do our best to help you.
Data submission requires that you log in to the BIG Submission Portal (BIG Sub), so you need to create an account if you are not a member.
Please note that fields marked as required must be completed when submitting metadata.
The current version of GWH (1.0beta) supports two ways of submitting files: direct online upload and FTP. It is highly recommended that you submit your files using a dedicated FTP tool (e.g., FileZilla). Please transfer your data files to the GWH FTP site using the following credentials:
Address: ftp://submit.big.ac.cn
User: same as your Gsub login
Password: same as your Gsub login
Path: /GWH/WGSXXXXXX (your submission ID).
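For scripted uploads, the credentials above could be used from Python's standard ftplib module instead of a graphical FTP client. This is a minimal sketch under stated assumptions: the function names are hypothetical, and it assumes the server accepts plain FTP logins, which the text above does not specify.

```python
from ftplib import FTP
from pathlib import Path
from typing import List

FTP_HOST = "submit.big.ac.cn"  # host taken from the address above


def remote_dir(submission_id: str) -> str:
    """Build the upload directory for a submission ID, e.g. /GWH/WGSXXXXXX."""
    return f"/GWH/{submission_id}"


def upload_files(user: str, password: str, submission_id: str, paths: List[str]) -> None:
    """Upload local files into the submission directory (assumes plain FTP)."""
    with FTP(FTP_HOST) as ftp:
        ftp.login(user=user, passwd=password)   # same as your Gsub login
        ftp.cwd(remote_dir(submission_id))
        for path in map(Path, paths):
            with path.open("rb") as handle:
                # Binary mode, so compressed files are transferred unchanged.
                ftp.storbinary(f"STOR {path.name}", handle)
```

A call would look like upload_files("my_user", "my_password", "WGSXXXXXX", ["genome.fasta"]), with the placeholders replaced by your own login and submission ID.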
In the current version, we accept the following file formats for genome-associated data:
♦ Genome sequence: FASTA (Step3 Files)
♦ Genome annotation: GFF or TBL (Step3 Files)
♦ Sequence ordering and orientation information: AGP (Step3 Files)
Note: required if the genome assembly is a complete genome or a chromosome-level draft genome.
♦ Sequence assignment information: CSV (Step4 Assignment)
Note: required if the genome assembly is a scaffold-level or chromosome-level draft genome.
All files that you submit via FTP are regularly moved from the FTP site to a staging area for processing. Thus, it is quite normal for files to “disappear” from the FTP site. If files pass the validation process, they are made public or placed under controlled access according to the release date set by the user, and the status changes to 'Released' or 'Successful' respectively.
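The two notes on required file formats can be condensed into a small lookup. This is a sketch only: the level labels "complete", "chromosome", and "scaffold" are hypothetical names chosen for illustration, and FASTA is treated as always required because every genome contains a sequence file.

```python
def required_formats(assembly_level: str) -> set:
    """Return the file formats required for a given assembly level.

    assembly_level is a hypothetical label: "complete" for a complete genome,
    "chromosome" or "scaffold" for a draft genome at that level.
    """
    level = assembly_level.lower()
    required = {"FASTA"}                      # genome sequence is always needed
    if level in {"complete", "chromosome"}:
        required.add("AGP")                   # ordering and orientation
    if level in {"scaffold", "chromosome"}:
        required.add("CSV")                   # sequence assignment
    return required
```

Under these assumptions, a chromosome-level draft genome needs all three formats, whereas a complete genome needs FASTA and AGP only.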
MD5 checksums are used to verify the integrity of transmitted data. An MD5 checksum is a 32-character alphanumeric string like "e3b5dd475c449300dd11f258538ff494".
♦ For Linux users, use: $ md5sum filename
♦ For Mac users, use: $ md5 filename
♦ For Windows users, use: $ certutil -hashfile filename MD5, then remove any spaces from the resulting hash. Alternatively, use a third-party tool.
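As a platform-independent alternative to the commands above, the same checksum can be computed with Python's standard hashlib module. This is a minimal sketch; the function names are hypothetical, and normalize_checksum simply strips the spaces that some certutil versions print between bytes.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 checksum of a file as a 32-character hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large genome files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def normalize_checksum(text: str) -> str:
    """Remove spaces and lowercase a checksum, e.g. one copied from certutil."""
    return text.replace(" ", "").strip().lower()
```

Comparing md5_of_file("genome.fasta") with the normalized value reported by another tool confirms that both describe the same file content.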
The GWH quality control process is based on the table2asn software (https://www.ncbi.nlm.nih.gov/genbank/table2asn/). GWH integrates and further supplements its results and summarizes them into report files with the suffixes err, warning, and user. For the error types in GWH quality control output and their interpretation, please refer to the following links:
https://www.ncbi.nlm.nih.gov/genbank/validation
https://www.ncbi.nlm.nih.gov/genbank/new_asndisc_examples/
https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/ValidErrItem_8cpp_source.html
Here is a summary of common error types and suggested solutions.
When you submit data, you will find a button named “Release date” at the bottom of the "Step 2 General info" web page. Once you specify a release date, the data will be released on that date. Note that the release of the BioProject and BioSample is also triggered by the release of WGS-associated data. It is suggested that you set the release date of the Genome later than that of the BioProject or BioSample. If a paper citing the sequence or accession number is published prior to the specified date, the sequence will be released upon publication. Otherwise, GWH will release the sequence data on the specified date. The release date can be changed through the genome portal.
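The release rule described above amounts to taking the earlier of the specified date and the publication date. The sketch below only illustrates that rule; the function name is hypothetical and it is not how GWH itself schedules releases.

```python
from datetime import date
from typing import Optional


def effective_release_date(specified: date, published: Optional[date] = None) -> date:
    """Return the date on which sequence data would be released.

    If a paper citing the data is published before the specified date,
    release happens upon publication; otherwise on the specified date.
    """
    if published is not None and published < specified:
        return published
    return specified
```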
A GWH accession number is prefixed with ‘GWH’, followed by 4 capital letters and 8 digits, for example GWHXXXX00000000. Please cite the genome accession number GWHXXXX00000000 in your publication as follows (we recommend putting this paragraph in the Materials and Methods section of the paper):
The whole genome sequence data reported in this paper have been deposited in the Genome Warehouse [1] in the National Genomics Data Center [2], Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation, under accession number GWHXXXX00000000 that is publicly accessible at https://ngdc.cncb.ac.cn/gwh.
[1] Genome Warehouse: A Public Repository Housing Genome-scale Data. Genomics Proteomics Bioinformatics 2021, 19(4):584-589. [PMCID=PMC9039550]
[2] Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2024. Nucleic Acids Res 2024, 52(D1):D18-D32. [PMCID=PMC10767964]
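Before citing an accession number, it can be worth checking that it matches the format described above ('GWH', then 4 capital letters, then 8 digits). The following is a minimal sketch; the pattern is derived only from that description and the function name is hypothetical.

```python
import re

# 'GWH' + 4 capital letters + 8 digits, as described in the text above.
GWH_ACCESSION = re.compile(r"GWH[A-Z]{4}[0-9]{8}")


def is_gwh_accession(text: str) -> bool:
    """Return True if text has the shape of a GWH genome accession number."""
    return GWH_ACCESSION.fullmatch(text) is not None
```

This checks the shape of the string only; it does not confirm that the accession exists in GWH.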
If you have any questions, suggestions, or comments, or would like to report a bug, please feel free to contact us via email (GWH@big.ac.cn) or instant messaging (QQ Group: 541196594).
We would also be happy to welcome you for a visit to explore possible collaboration or to learn more about GWH.
Address:
National Genomics Data Center
Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation
No.1 Beichen West Road, Chaoyang District
Beijing 100101, China
Tel: +86 (10) 8409-7858
+86 (10) 8409-7298
Fax: +86 (10) 8409-7720