GSA Data Model

Designed for compatibility, the Genome Sequence Archive (GSA) follows the data standards and structures of the International Nucleotide Sequence Database Collaboration (INSDC). GSA data are organized around four concepts: BIOPROJECT (corresponding to PROJECT in the BioProject database), BIOSAMPLE (corresponding to SAMPLE in the BioSample database), EXPERIMENT, and RUN.

Figure 1. Data model in GSA
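The hierarchy described above can be sketched as a handful of plain data classes. This is only an illustration of how the four concepts relate to one another: the attribute names (alias, platform, library_layout, files), the example values, and the build_example helper are assumptions made for this sketch and are not GSA's actual submission schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    """Sequence data files produced for one Experiment."""
    alias: str
    files: List[str] = field(default_factory=list)


@dataclass
class Experiment:
    """Library and instrument information for one BioSample."""
    alias: str
    biosample: str        # alias of the BioSample that was sequenced
    platform: str         # instrument information (hypothetical field name)
    library_layout: str   # e.g. "PAIRED" or "SINGLE" (hypothetical field name)
    runs: List[Run] = field(default_factory=list)


@dataclass
class BioSample:
    """Description of the biological material."""
    alias: str
    organism: str


@dataclass
class BioProject:
    """Umbrella record that groups samples, experiments, and runs."""
    title: str
    biosamples: List[BioSample] = field(default_factory=list)
    experiments: List[Experiment] = field(default_factory=list)


def build_example() -> BioProject:
    """Build one project with one sample, one experiment, and one paired-end run."""
    project = BioProject(title="Example project")
    sample = BioSample(alias="sample_1", organism="Oryza sativa")
    project.biosamples.append(sample)

    experiment = Experiment(
        alias="exp_1",
        biosample=sample.alias,
        platform="ILLUMINA",
        library_layout="PAIRED",
    )
    # Forward and reverse read files sit together in the same Run.
    experiment.runs.append(Run(alias="run_1", files=["s1_R1.fastq.gz", "s1_R2.fastq.gz"]))
    project.experiments.append(experiment)
    return project
```

Linking an Experiment to its BioSample by alias, rather than nesting one inside the other, is a choice made for this sketch so that several experiments can point at the same sample.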

Organization of metadata objects

The following are examples of how metadata can be organized. Submitters can arrange metadata objects flexibly.

♦   Comparative genome sequencing of three strains (paired-end). Include both paired-end read files in the same Run (Figure 2).

Figure 2. Comparative genome sequencing of three strains (paired-end)

♦   Technical and biological replicates.

Figure 3. Technical and biological replicates

Data submission and retrieval

To create a submission, users must register and log in to the BIG Data Center Submission Portal (BIG Sub, https://ngdc.cncb.ac.cn/gsub/). To simplify the submission procedure, GSA provides a user-friendly input wizard for data submission (Figure 4).

♦   All data associated with the same BIOPROJECT should be submitted to a single GSA.

♦   EXPERIMENT and RUN objects contain instrument and library information and are directly associated with sequence data.

♦   Each EXPERIMENT is a unique sequencing result for a specific sample.

♦   Paired-end data files (forward/reverse) must be listed together in the same RUN so that the two files are correctly processed as paired-end.

Figure 4. Graphic illustration of data submissions to GSA
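The paired-end rule above lends itself to a quick check before submission. The sketch below is not part of the GSA wizard; it assumes that each paired-end RUN holds exactly one forward and one reverse file, and the function name, run aliases, and file names are hypothetical.

```python
from typing import Dict, List


def check_paired_runs(runs: Dict[str, List[str]]) -> List[str]:
    """Return the aliases of runs that do not list exactly two files."""
    return [alias for alias, files in runs.items() if len(files) != 2]


runs = {
    "run_1": ["strainA_R1.fastq.gz", "strainA_R2.fastq.gz"],  # correct: both mates together
    "run_2": ["strainB_R1.fastq.gz"],                         # forward file alone
    "run_3": ["strainB_R2.fastq.gz"],                         # reverse file split into another run
}

# Aliases of the runs that would need to be merged before submission.
problems = check_paired_runs(runs)
```

Here run_2 and run_3 would be flagged, since the forward and reverse files of strain B were split across two runs instead of being listed together.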

Release of linked BioProject/BioSample/GSA

Linked BioProject, BioSample, and GSA data are released as follows (Figure 5):

♦   Release of BioProject records does not trigger release of the other linked data.

♦   Release of BioSample records triggers release of the linked BioProject only; it does not trigger release of the referencing GSA data.

♦   Release of GSA nucleotide sequence data triggers release of the linked BioProject and BioSample records.

Figure 5. Release of linked BioProject/BioSample/GSA
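The release rules can be pictured as a small lookup table that is followed transitively. The sketch below is an illustration rather than GSA's implementation, and it rests on one loudly stated assumption: releasing a BioSample reaches only its BioProject and not the GSA data that reference it. The names TRIGGERS and released_after are hypothetical.

```python
from typing import Dict, Set

# Record types released as a consequence of releasing the key.
# ASSUMPTION: BioSample -> GSA is deliberately absent (see the lead-in).
TRIGGERS: Dict[str, Set[str]] = {
    "BioProject": set(),
    "BioSample": {"BioProject"},
    "GSA": {"BioProject", "BioSample"},
}


def released_after(record_type: str) -> Set[str]:
    """Return every record type that is public once record_type is released."""
    released = {record_type}
    pending = [record_type]
    while pending:
        current = pending.pop()
        for linked in TRIGGERS[current]:
            if linked not in released:
                released.add(linked)
                pending.append(linked)
    return released
```

Under these assumptions, releasing a BioProject affects nothing else, while releasing GSA sequence data makes all three linked record types public.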

Release Policies and Disclaimers

1. Authors can set a date to withhold the release of a new submission for a specified period.

2. The release date can be changed through the BIG Sub portal: https://ngdc.cncb.ac.cn/gsub/submit/gsa/[substitute your GSA accession number]/contents

3. If a paper citing the sequence or accession number is published prior to the specified date, the sequence will be released upon publication. Otherwise, GSA will release sequence data on the specified date.

4. As soon as it is available, please send the full publication data (all authors, title, journal, volume, pages, and date) to gsa@big.ac.cn
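The timing rule in item 3 amounts to taking the earlier of two dates. The following is a minimal sketch of that rule only; the function and parameter names are hypothetical, and how GSA actually learns of a publication is not modeled.

```python
from datetime import date
from typing import Optional


def effective_release_date(hold_until: date, published_on: Optional[date] = None) -> date:
    """Return the date on which the sequence data become public."""
    # A paper published before the hold date releases the data on publication.
    if published_on is not None and published_on < hold_until:
        return published_on
    # Otherwise the data are released on the date the authors specified.
    return hold_until
```

For example, with a hold date of 1 June 2025, a paper published on 15 March 2025 would move the release to 15 March, whereas a paper published in August would leave the 1 June release unchanged.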

Data curation & quality control process

Submitted data go through a three-step review process before being archived (Figure 6). The first step is online validation during metadata submission, in which both the structure and the vocabulary of the metadata are checked automatically. The second step is manual review by an expert: a data administrator double-checks the metadata to ensure that the information is accurate. The last step is quality control of the sequence files, in which both the format and the content of the files are checked and their quality is evaluated.

Figure 6. GSA data curation & quality control process
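To make the first step more concrete, an automatic structure and vocabulary check might look roughly like the sketch below. Everything in it is an assumption for illustration: the required fields, the allowed terms, and the function name are hypothetical, since GSA's real validation rules are not given here.

```python
from typing import Dict, List

# Hypothetical rules, for illustration only; GSA's real field names and
# controlled vocabularies are not described in this document.
REQUIRED_FIELDS = ["title", "platform", "library_layout"]
ALLOWED_VALUES = {"library_layout": {"SINGLE", "PAIRED"}}


def validate_metadata(record: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    # Structure check: every required field must be present and non-empty.
    for name in REQUIRED_FIELDS:
        if not record.get(name):
            problems.append(f"missing field: {name}")
    # Vocabulary check: restricted fields must use one of the allowed terms.
    for name, allowed in ALLOWED_VALUES.items():
        value = record.get(name)
        if value and value not in allowed:
            problems.append(f"invalid value for {name}: {value}")
    return problems
```

A record that fails either check would be returned to the submitter before it reaches the manual review step.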