A description of the items in each metadata object (Version 2.1), with detailed data item descriptions, is freely available in Chinese (CN) and English (US) versions.
The GSA Submission Quick Start Guide (Version 2.3), which describes the submission procedure, is freely available in Chinese (CN) and English (US) versions.
Designed for compatibility, the Genome Sequence Archive (GSA) follows the data standards and structures of the International Nucleotide Sequence Database Collaboration (INSDC). The organizational framework of GSA data is based on the concepts of BIOPROJECT (corresponds to PROJECT in the BioProject database), BIOSAMPLE (corresponds to SAMPLE in the BioSample database), EXPERIMENT, and RUN.
Figure 1. Data model in GSA
The following are examples of metadata. Submitters can organize metadata objects flexibly.
♦ Comparative genome sequencing of three strains (paired-end). Include both paired-end read files in a single Run (Figure 2).
Figure 2. Comparative genome sequencing of three strains (paired-end)
♦ Technical and biological replicates. Biological replicates should be classified as two different samples; technical replicates should be considered as two different experiments.
Figure 3. Technical and biological replicates
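The replicate example above can be sketched as nested objects. This is only an illustration of the BIOPROJECT / BIOSAMPLE / EXPERIMENT / RUN hierarchy, not GSA's actual schema: the class names, field names, sample names, and file names are all hypothetical, and the nesting is a simplification of how the objects reference each other.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    # A Run lists the files sequenced together; a paired-end Run holds
    # both the forward and the reverse read file.
    files: List[str]


@dataclass
class Experiment:
    # One sequencing result (library and instrument) for a specific sample.
    title: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class BioSample:
    name: str
    experiments: List[Experiment] = field(default_factory=list)


@dataclass
class BioProject:
    title: str
    samples: List[BioSample] = field(default_factory=list)


def build_replicate_example() -> BioProject:
    """Two biological replicates; the first was sequenced twice (technical replicates)."""
    bio_rep1 = BioSample(
        name="strain_A_bio_rep1",  # hypothetical sample name
        experiments=[
            Experiment("tech_rep1", [Run(["A1_t1_1.fq.gz", "A1_t1_2.fq.gz"])]),
            Experiment("tech_rep2", [Run(["A1_t2_1.fq.gz", "A1_t2_2.fq.gz"])]),
        ],
    )
    bio_rep2 = BioSample(
        name="strain_A_bio_rep2",
        experiments=[
            Experiment("tech_rep1", [Run(["A2_t1_1.fq.gz", "A2_t1_2.fq.gz"])]),
        ],
    )
    return BioProject(title="Example project", samples=[bio_rep1, bio_rep2])
```

Biological replicates appear as separate samples, technical replicates as separate experiments under one sample, and each paired-end Run carries exactly two files.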
To create a submission, users need to register and log in to the BIG Data Center Submission Portal (BIG Sub, https://ngdc.cncb.ac.cn/gsub/). To simplify the submission procedure, GSA provides a user-friendly input wizard for data submission (Figure 4).
♦ All data associated with the same BIOPROJECT should be submitted to a single GSA.
♦ EXPERIMENT and RUN objects contain instrument and library information and are directly associated with sequence data.
♦ Each EXPERIMENT is a unique sequencing result for a specific sample.
♦ Paired-end data files (forward/reverse) must be listed together in the same RUN in order for the two files to be correctly processed as paired-end.
Figure 4. Graphic illustration of data submissions to GSA
Linked BioProject, BioSample, and GSA data are released as follows (Figure 5):
♦ Release of a BioProject record DOES NOT trigger release of the other linked data.
♦ Release of a BioSample record triggers release of the linked BioProject ONLY; it DOES NOT trigger release of the referencing GSA.
♦ Release of the GSA nucleotide sequence data DOES trigger release of the linked BioProject and BioSample records.
Figure 5. Release of linked BioProject/BioSample/GSA
1. A date can be set by authors to withhold the release of new submissions for a specified period.
2. The release date can be changed through the BIG Sub portal: https://ngdc.cncb.ac.cn/gsub/submit/gsa/[substitute your GSA accession number]/contents
3. If a paper citing the sequence or accession number is published prior to the specified date, the sequence will be released upon publication. Otherwise, GSA will release sequence data on the specified date.
4. As soon as they are available, please send the full publication details (all authors, title, journal, volume, pages, and date) to the following address: gsa@big.ac.cn
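The release rule in item 3 amounts to taking the earlier of the specified hold date and the publication date. A minimal sketch, assuming both are known as calendar dates; the function and parameter names are hypothetical and this is not GSA's implementation:

```python
from datetime import date
from typing import Optional


def effective_release_date(hold_until: date, published_on: Optional[date] = None) -> date:
    """Return the date on which the sequence data would be released.

    hold_until   -- release date specified by the submitter
    published_on -- publication date of a paper citing the data, if any
    """
    # Publication before the specified date releases the data early.
    if published_on is not None and published_on < hold_until:
        return published_on
    # Otherwise the data are released on the specified date.
    return hold_until
```

For example, with a hold date of 1 June and a paper published on 15 March of the same year, the function returns 15 March; with no publication, or a later one, it returns 1 June.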
Submitted data go through a three-step review process before being archived (Figure 1). The first step is online validation during metadata submission, in which both the structure and the vocabulary of the metadata are checked automatically. The second step is manual (expert) review, in which the data administrator double-checks the metadata to ensure that the information is accurate. The last step is quality control of the sequence files, in which both the format and the content of the files are checked and the quality of the files is evaluated.
Figure 1. GSA data curation & quality control process
GSA is short for Genome Sequence Archive, a data repository for raw sequencing data from genome, transcriptome, and other omics studies. It archives raw sequence data produced by a wide variety of sequencing platforms. GSA is one of the database resources of the National Genomics Data Center (NGDC), part of the Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS), and serves as a primary archive of genome sequencing data for institutions and laboratories worldwide.
Only registered users can submit data through the BIG Submission Portal (BIG Sub, https://ngdc.cncb.ac.cn/gsub/). Please refer to the GSA Submission Quick Start Guide.
Any user can register for free and create a BIG Sub account. After you submit your registration data, a confirmation email is sent to you automatically so that you can activate your account.
1) If you have forgotten your password, you can reset it by clicking “Forgot password”.
2) After submitting your email address, you will receive a response email. Click the URL in that email to update your password within 10 minutes; otherwise you will need to submit your email address again.
If you have any problems with your account, please email gsa@big.ac.cn for assistance.
After logging in, follow the steps below to complete the submission:
1) Create a GSA submission in the GSA database.
2) Register your project (BioProject) and biological samples (BioSamples) in the BioProject and BioSample databases, respectively, if you have not registered them before. Please refer to the GSA Submission Quick Start Guide.
3) Submit GSA metadata, that is, the information that links your project, samples/experiments, and file names.
4) Upload sequence data files by FTP.
In the current version of GSA, it is highly recommended that you upload your files with a dedicated FTP tool (such as FileZilla Client): log in to the FTP server and follow the tool's instructions to set the transfer mode. If you use the command-line FTP client, type the “binary” command before the “mput” command.
Transmitting your data files to the GSA FTP site
Address: ftp://submit.big.ac.cn
Username and password are the same as those you use to log in to BIG Sub.
NOTICE: Navigate (using the cd command) to the GSA folder in the Remote Site box, then upload your files. Uploaded files will be removed after the whole submission has finished processing.
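For scripted uploads, the same steps can be sketched with Python's standard `ftplib`. This is an illustration under assumptions, not an official GSA client: the remote folder name `GSA` is taken from the notice above (its exact spelling on the server is assumed), and the function names are hypothetical. `storbinary` transfers in binary mode, which corresponds to typing “binary” before “mput” in the command-line client.

```python
import os
from ftplib import FTP
from typing import Iterable, List


def upload_files(ftp, paths: Iterable[str], remote_dir: str = "GSA") -> List[str]:
    """Upload local files into remote_dir over an already logged-in FTP connection."""
    ftp.cwd(remote_dir)  # equivalent of "cd GSA"
    uploaded = []
    for path in paths:
        name = os.path.basename(path)
        with open(path, "rb") as handle:
            # storbinary always uses binary (image) transfer mode.
            ftp.storbinary(f"STOR {name}", handle)
        uploaded.append(name)
    return uploaded


def submit(username: str, password: str, paths: Iterable[str]) -> List[str]:
    """Log in with BIG Sub credentials and upload the given files."""
    with FTP("submit.big.ac.cn") as ftp:
        ftp.login(user=username, passwd=password)
        return upload_files(ftp, paths)
```

Keeping `upload_files` separate from the connection setup makes it possible to try the transfer logic against a stand-in object before touching the real server.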
After you have finished all of the above tasks, the GSA team will check your information and files and give you feedback.
In the current version, we recommend that read data be submitted in either FASTQ or BAM format. In addition, GSA accepts only the GZIP and BZIP2 compression formats (it DOES NOT accept 7-ZIP, RAR, or TAR), and it does not accept multiplexed data.
Format | File suffix | Description |
---|---|---|
Fastq format | .fastq.gz .fq.gz .fastq.bz2 .fq.bz2 | fastq files with constant read length |
BAM format | .bam | Binary SAM format for use by loaders that combine alignment and sequencing data |
HDF5 format | .bax.h5 .bas.h5 | HDF5 is a data model, library, and file format for storing and managing data. |
Reference_FASTA | .fasta.gz .fa.gz | Reference sequence file in single fasta format used to construct SRA archive file format. |
SFF format | .sff | 454 Standard Flowgram Format file |
SRF format | .srf | SRF is a generic format for DNA sequence data. This format has sufficient flexibility to store data from current and future DNA sequencing technologies. |
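Before uploading, file names can be checked against the suffixes listed in the table. The sketch below is only a local pre-check with hypothetical names; it matches suffixes case-sensitively, which is an assumption, and it says nothing about the content checks GSA performs on its side.

```python
from typing import Optional

# Suffixes copied from the format table; the keys are labels for this sketch.
ACCEPTED_SUFFIXES = {
    "FASTQ": (".fastq.gz", ".fq.gz", ".fastq.bz2", ".fq.bz2"),
    "BAM": (".bam",),
    "HDF5": (".bax.h5", ".bas.h5"),
    "Reference_FASTA": (".fasta.gz", ".fa.gz"),
    "SFF": (".sff",),
    "SRF": (".srf",),
}


def detect_format(filename: str) -> Optional[str]:
    """Return the table's format label for filename, or None if no suffix matches."""
    for label, suffixes in ACCEPTED_SUFFIXES.items():
        # str.endswith accepts a tuple and tests each suffix in turn.
        if filename.endswith(suffixes):
            return label
    return None
```

A name such as `sample_1.fq.gz` maps to `FASTQ`, while an uncompressed `reads.fastq` or an archive such as `reads.tar` matches nothing and returns `None`.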
All submitted files are regularly moved from FTP to a staging area for processing, so it is quite normal for files to “disappear” from FTP. Files that pass processing successfully will be made public or placed under controlled access according to the release date set by the user.
MD5 checksums are used to verify the integrity of transmitted data. An MD5 checksum is a 32-character alphanumeric string like "e3b5dd475c449300dd11f258538ff494".
♦ For Linux users, use: $ md5sum
♦ For Mac users, use: $ md5
♦ Windows users need to use a third-party tool, e.g. winmd5free.
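As a platform-independent alternative to the tools above, the checksum can also be computed with Python's standard `hashlib` module. The function name and chunk size below are arbitrary choices for this sketch; reading in chunks keeps memory use low for large sequence files.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the MD5 checksum of a file as a 32-character hexadecimal string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result should match what `md5sum` or `md5` reports for the same file.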
After accessing the GSA database through the BIG Sub account, please find the “Share” button in the last column “Operation” of this list as shown below.
By clicking the “Share” button, you can get the “Shared URL” as shown in the figure below. You can copy the URL and send it to editors so that they and the reviewers can access your data during peer review.
After the article is published, you can click the "Release Now" button in the last column “Operation” of the list as shown below.
Please click "Yes" in the "Confirmation Box" to trigger the GSA release. The release of GSA will trigger the release of BioProject and BioSample, so you DO NOT need to release BioProject and BioSample separately in their respective systems.
NOTICE: Data can be searched and downloaded in the GSA database as soon as they are archived.
When you have successfully submitted data to GSA, please consider using the following wording to describe data deposition in your manuscript:
The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive (Genomics, Proteomics & Bioinformatics 2021) in National Genomics Data Center (Nucleic Acids Res 2021), China National Center for Bioinformation / Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRAxxxxxx) that are publicly accessible at https://ngdc.cncb.ac.cn/gsa.
Please cite the following required publications.
The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types. Genomics, Proteomics & Bioinformatics 2021, https://doi.org/10.1016/j.gpb.2021.08.001
Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2021. Nucleic Acids Res 2021, 49(D1):D18–D28. https://doi.org/10.1093/nar/gkaa1022 [PMID=33175170]
If you have any questions, suggestions, or comments, or would like to report a bug, please feel free to contact us by email at gsa@big.ac.cn or by instant messaging (QQ Group: 548170081).
You are also welcome to visit us to explore possible collaboration or to learn more about GSA.
Address:
National Genomics Data Center
China National Center for Bioinformation / Beijing Institute of Genomics, Chinese Academy of Sciences
No.1 Beichen West Road, Chaoyang District
Beijing 100101, China
Tel: +86 (10) 8409-7340