BioProject is a searchable collection of complete and incomplete (in-progress) large-scale molecular projects including genome sequencing and assembly, transcriptome, metagenomic, annotation, expression and mapping projects.
BioProject accession numbers are prefixed with 'PRJC', followed by one capital letter and 6 digits. For example, PRJCA000001.
See the Standard of BioProject metadata for detailed information.
BioSample contains descriptions of biological source materials used in studies that have data in other National Genomics Data Center databases such as Genome Sequence Archive, Genome Warehouse, Gene Expression Nebulas, Genome Variation Map, Methylation Bank, etc. The National Genomics Data Center, working collaboratively with multiple partner institutions/laboratories, develops a family of standards for big omics data representation, analysis, search and exchange.
BioSample accession numbers are prefixed with 'SAMC', followed by 6 digits. For example, SAMC000001.
See the Standard of BioSample metadata for detailed information.
The Genome Sequence Archive (GSA) is a data repository specialized for archiving raw sequence reads.
A GSA object consists of a series of Experiments and Runs.
GSA Accession No. is prefixed with 'CRA' and followed by 6 digits. For example, CRA000001.
Experiment Accession No. is prefixed with 'CRX' and followed by 6 digits. For example, CRX000001.
Run Accession No. is prefixed with 'CRR' and followed by 6 digits. For example, CRR000001.
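The accession formats described above can be checked with regular expressions. The following is a minimal sketch, not an official validator: the name `classify_accession` is hypothetical, and the patterns assume exactly the digit counts stated on this page.

```python
import re

# Patterns transcribed from the accession descriptions on this page.
# Assumption: the digit counts are exactly as stated (6 digits).
ACCESSION_PATTERNS = {
    "BioProject": re.compile(r"PRJC[A-Z][0-9]{6}"),
    "BioSample": re.compile(r"SAMC[0-9]{6}"),
    "GSA": re.compile(r"CRA[0-9]{6}"),
    "Experiment": re.compile(r"CRX[0-9]{6}"),
    "Run": re.compile(r"CRR[0-9]{6}"),
}


def classify_accession(accession):
    """Return the accession type, or None if no pattern matches."""
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(accession):
            return kind
    return None


print(classify_accession("PRJCA000001"))     # BioProject
print(classify_accession("SUBPRJCA000005"))  # None
```

Using `fullmatch` rather than `search` rejects strings such as SUBPRJCA000005 that merely contain a valid accession.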
Name | Description | Tips | Value Format |
*ID | Experiment IDs, prefixed with 'E' and followed by a natural number, such as E1, E2, E3.... The Experiment ID must be unique. | ||
*Experiment title | Short description that will identify the Experiment on public pages. It can have any format, but we suggest that you make it concise, unique, consistent, and as informative as possible. | Every Experiment title from the same Sample must be unique. | {text} |
*BioProject accession | BioProject accession. | Typically of the form PRJCA[number] (e.g., PRJCA000005), NOT SUBPRJCA[number]. | |
*BioSample name | Sample Name is a name that you choose for the sample. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible. | Every Sample Name from a single Submitter must be unique. | {text} |
*Platform | This column has drop-down menus that allow you to select from a controlled vocabulary. | Once specified for one row, these values can be copied-and-pasted down. See the Platform form for details. | |
*Library Construction / Experimental Design | Free-form description of the methods used to create the sequencing library; a brief 'materials and methods' section. | e.g., DNA of sorted NCSCs was extracted from the cell line using a QIAamp DNA Mini Kit and sheared to approximately 300-500 bp using a Covaris S220 instrument. The libraries were then constructed through end-repair, A-tailing, and adapter ligation, and bisulfite-converted using a Zymo EZ DNA Methylation Kit. | {text} |
Library name | Name of Library. | ||
*Strategy | This column has drop-down menus that allow you to select from a controlled vocabulary. | Once specified for one row, these values can be copied-and-pasted down. See the Strategy form for details. | |
*Source | This column has drop-down menus that allow you to select from a controlled vocabulary. | Once specified for one row, these values can be copied-and-pasted down. See the Source form for details. | |
*Selection | This column has drop-down menus that allow you to select from a controlled vocabulary. | Once specified for one row, these values can be copied-and-pasted down. See the Selection form for details. | |
*Layout | This column has drop-down menus that allow you to select from a controlled vocabulary. | Once specified for one row, these values can be copied-and-pasted down. | |
*Read length for mate 1 (bp) | Planned read length of mate 1 for your submission. | For PacBio Sequel and Ion Torrent series sequencers, this column may be left empty. | |
Read length for mate 2 (bp) | Planned read length of mate 2 for your submission. | Required for paired-end data only. | |
Insert size (bp) | Fragment size for Paired reads. Please provide a numerical value for the median interval of the insert size. | ||
Nominal size (bp) | Nominal size | ||
Nominal standard deviation (bp) | Standard deviation of insert size | ||
Planned number of cycles | Planned number of cycles for your submission. | When the Platform is Helicos HeliScope, the Planned number of cycles is required. |
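Before uploading the metadata sheet, the ID column can be checked for the format and uniqueness rules given in the table. This is a sketch under stated assumptions: `check_sheet_ids` is a hypothetical helper, and "natural number" is taken to mean a positive integer without leading zeros.

```python
import re


def check_sheet_ids(ids, prefix="E"):
    """Return a list of problems found in a column of sheet IDs.

    Assumption: a valid ID is the prefix followed by a positive
    integer without leading zeros (E1, E2, E3, ...).
    """
    pattern = re.compile(re.escape(prefix) + r"[1-9][0-9]*")
    problems = []
    seen = set()
    for value in ids:
        if not pattern.fullmatch(value):
            problems.append(f"{value!r}: expected {prefix!r} followed by a natural number")
        elif value in seen:
            problems.append(f"{value!r}: duplicate ID")
        seen.add(value)
    return problems


print(check_sheet_ids(["E1", "E2", "E3"]))  # []
print(check_sheet_ids(["E1", "E1", "X2"]))  # two problems reported
```

Passing `prefix="R"` applies the same check to Run IDs.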
Platform: the sequencing platforms and instrument models.
Platform | Instrument Model |
LS454 | 454 GS |
LS454 | 454 GS 20 |
LS454 | 454 GS FLX |
LS454 | 454 GS FLX Titanium |
LS454 | 454 GS FLX+ |
LS454 | 454 GS Junior |
Capillary Technologies | AB 310 Genetic Analyzer |
Capillary Technologies | AB 3130 Genetic Analyzer |
Capillary Technologies | AB 3130xL Genetic Analyzer |
Capillary Technologies | AB 3500 Genetic Analyzer |
Capillary Technologies | AB 3500xL Genetic Analyzer |
Capillary Technologies | AB 3730 Genetic Analyzer |
Capillary Technologies | AB 3730xL Genetic Analyzer |
ABI Solid | AB 5500 Genetic Analyzer |
ABI Solid | AB 5500xl Genetic Analyzer |
ABI Solid | AB 5500x-Wl Genetic Analyzer |
ABI Solid | AB SOLiD 4 System |
ABI Solid | AB SOLiD 4hq System |
ABI Solid | AB SOLiD PI System |
ABI Solid | AB SOLiD System 1.0 |
ABI Solid | AB SOLiD System 2.0 |
ABI Solid | AB SOLiD System 3.0 |
BGISeq | BGISEQ-100 |
BGISeq | BGISEQ-500 |
BGISeq | BGISEQ-1000 |
BGISeq | BGISEQ-2000 |
BGISeq | DNBSEQ-T7 |
BGISeq | MGISEQ-2000RS |
CapitalBio Company | BioelectronSeq 4000 |
Bionano Genomics | BioNano IRYS |
Bionano Genomics | BioNano SAPHYR |
Complete Genomics | Complete Genomics |
Daan Gene | DA8600 |
Helicos BioSciences Corporation | Helicos HeliScope |
HYK Genetic | HYK-PSTAR-IIA |
Illumina | Illumina HiSeq X Ten |
Illumina | Illumina Genome Analyzer |
Illumina | Illumina Genome Analyzer II |
Illumina | Illumina Genome Analyzer IIx |
Illumina | Illumina HiScanSQ |
Illumina | Illumina HiSeq 1000 |
Illumina | Illumina HiSeq 1500 |
Illumina | Illumina HiSeq 2000 |
Illumina | Illumina HiSeq 2500 |
Illumina | Illumina HiSeq 3000 |
Illumina | Illumina HiSeq 4000 |
Illumina | Illumina MiSeq |
Illumina | Illumina MiniSeq |
Illumina | Illumina NovaSeq 5000 |
Illumina | Illumina NovaSeq 6000 |
Illumina | Illumina Nextseq 500 |
Illumina | Illumina Nextseq 550 |
Illumina | Illumina iSeq 100 |
Berry Genomics | NextSeq CN500 |
IonTorrent | Ion Torrent PGM |
IonTorrent | Ion Torrent Proton |
IonTorrent | Ion Torrent S5 |
IonTorrent | Ion Torrent S5 XL |
Oxford Nanopore | OXFORD_NANOPORE GridION |
Oxford Nanopore | OXFORD_NANOPORE MinION |
Oxford Nanopore | OXFORD_NANOPORE PromethION |
PacBio SMRT | PacBio RS |
PacBio SMRT | PacBio RS II |
PacBio SMRT | PacBio Sequel |
PacBio SMRT | PacBio Sequel II |
Strategy: the sequencing technique intended for the library.
Strategy | Sequencing strategy used in the experiment |
---|---|
WGA | Whole genome amplification. |
WGS | Whole genome shotgun. |
WES | Whole exome sequencing is a genomic technique for sequencing all of the protein-coding genes in a genome (known as the exome). |
WXS | Random sequencing of exonic regions selected from the genome. |
RNA-Seq | Random sequencing of whole transcriptome. |
miRNA-Seq | Micro RNA and other small non-coding RNA sequencing. |
Tn-Seq | Gene fitness determination through transposon seeding. |
WCS | Whole chromosome (or other replicon) shotgun. |
CLONE | Genomic clone based (hierarchical) sequencing. |
POOLCLONE | Shotgun of pooled clones (usually BACs and Fosmids). |
AMPLICON | Sequencing of overlapping or distinct PCR or RT-PCR products. |
CLONEEND | Clone end (5', 3', or both) sequencing. |
FINISHING | Sequencing intended to finish (close) gaps in existing coverage. |
ChIP-Seq | Direct sequencing of chromatin immunoprecipitates. |
MNase-Seq | Direct sequencing following MNase digestion. |
DNase-Hypersensitivity | Sequencing of hypersensitive sites, or segments of open chromatin that are more readily cleaved by DNaseI. |
Bisulfite-Seq | Sequencing following treatment of DNA with bisulfite to convert cytosine residues to uracil depending on methylation status. |
EST | Single pass sequencing of cDNA templates. |
FL-cDNA | Full-length sequencing of cDNA templates. |
CTS | Concatenated Tag Sequencing. |
MRE-Seq | Methylation-Sensitive Restriction Enzyme Sequencing strategy. |
MeDIP-Seq | Methylated DNA Immunoprecipitation Sequencing strategy. |
MBD-Seq | Direct sequencing of methylated fractions sequencing strategy. |
Synthetic-Long-Read | Binning and barcoding of large DNA fragments to facilitate assembly of the fragment. |
ATAC-seq | Assay for Transposase-Accessible Chromatin (ATAC) strategy is used to study genome-wide chromatin accessibility. It is an alternative method to DNase-seq that uses an engineered Tn5 transposase to cleave DNA and to integrate primer DNA sequences into the cleaved genomic DNA. |
ChIA-PET | Direct sequencing of proximity-ligated chromatin immunoprecipitates. |
FAIRE-seq | Formaldehyde Assisted Isolation of Regulatory Elements. |
Hi-C | Chromosome Conformation Capture technique where a biotin-labeled nucleotide is incorporated at the ligation junction, enabling selective purification of chimeric DNA ligation junctions followed by deep sequencing. |
ncRNA-Seq | Capture of other non-coding RNA types, including post-translation modification types such as snRNA (small nuclear RNA) or snoRNA (small nucleolar RNA), or expression regulation types such as siRNA (small interfering RNA) or piRNA/piwi/RNA (piwi-interacting RNA). |
RAD-Seq | Restriction Site Associated DNA Sequence. |
RIP-Seq | Direct sequencing of RNA immunoprecipitates (includes CLIP-Seq, HITS-CLIP and PAR-CLIP). |
SELEX | Systematic Evolution of Ligands by EXponential enrichment. |
ssRNA-seq | Strand-specific RNA sequencing. |
Targeted-Capture | Targeted-Capture sequencing. |
Tethered Chromatin Conformation Capture | Tethered Chromatin Conformation Capture sequencing. |
TCR-seq | High throughput sequencing to map T-cell receptor (TCR) repertoires at high resolution. |
BCR-seq | High throughput sequencing to map B-cell receptor (BCR) repertoires at high resolution. |
MeRIP-Seq | MeRIP-Seq maps m6A-methylated RNA. Deep sequencing provides high-resolution reads of m6A-methylated RNA. |
OTHER | Library strategy not listed (please include additional info in the “design description”). |
Source: the library source specifies the type of source material that is being sequenced.
Source | Type of genetic source material sequenced |
---|---|
GENOMIC | Genomic DNA (includes PCR products from genomic DNA). |
TRANSCRIPTOMIC | Transcription products or non-genomic DNA (EST, cDNA, RT-PCR, screened libraries). |
METATRANSCRIPTOMIC | Transcription products from community targets. |
METAGENOMIC | Mixed material from metagenome. |
SYNTHETIC | Synthetic DNA. |
VIRAL RNA | Viral RNA. |
OTHER | Other, unspecified, or unknown library source material. (please include additional info in the “design description”) |
Selection: whether any method was used to select and/or enrich the material being sequenced.
Selection | Method of selection or enrichment used in the Experiment |
unspecified | Library enrichment, screening, or selection is not specified. (please include additional info in the “design description”) |
RANDOM | Random selection by shearing or other method. |
PCR | Source material was selected by designed primers. |
RANDOM PCR | Source material was selected by randomly generated primers. |
RT-PCR | Source material was selected by reverse transcription PCR. |
HMPR | Hypo-methylated partial restriction digest. |
MF | Methyl Filtrated. |
CF-S | Cot-filtered single/low-copy genomic DNA. |
CF-M | Cot-filtered moderately repetitive genomic DNA. |
CF-H | Cot-filtered highly repetitive genomic DNA. |
CF-T | Cot-filtered theoretical single-copy genomic DNA. |
MDA | Multiple displacement amplification. |
MSLL | Methylation Spanning Linking Library. |
cDNA | complementary DNA. |
ChIP | Chromatin immunoprecipitation. |
MNase | Micrococcal Nuclease (MNase) digestion. |
DNAse | Deoxyribonuclease (DNase) digestion. |
Hybrid Selection | Selection by hybridization in array or solution. |
Reduced Representation | Reproducible genomic subsets, often generated by restriction fragment size selection, containing a manageable number of loci to facilitate re-sampling. |
Restriction Digest | DNA fractionation using restriction enzymes. |
5-methylcytidine antibody | Selection of methylated DNA fragments using an antibody raised against 5-methylcytosine or 5-methylcytidine (m5C). |
MBD2 protein methyl-CpG binding domain | Enrichment by methyl-CpG binding domain. |
CAGE | Cap-analysis gene expression. |
RACE | Rapid Amplification of cDNA Ends. |
size fractionation | Physical selection of size appropriate targets. |
Padlock probes capture method | Circularized oligonucleotide probes. |
Poly-A | polyA enriched RNA-seq. |
other | Other library enrichment, screening, or selection process. (please include additional info in the “design description”) |
Name | Description | Tips | Value Format |
*ID | Run IDs, prefixed with 'R' and followed by a natural number such as: R1, R2, R3... The Run ID must be unique. | ||
*Run title | Short description that will identify the Run on public pages. It can have any format, but we suggest that you make it concise, unique, consistent, and as informative as possible. | Every Run title within the same Experiment must be unique. | {text} |
*BioProject accession | BioProject accession. | Typically of the form PRJCA[number] (e.g., PRJCA000005), NOT SubPRJCA[number]. | PRJCA[number] |
*Experiment ID | Experiment IDs, prefixed with 'E' and followed by a natural number, such as E1, E2, E3… | ||
*Run data file type | This column has drop-down menus that allow you to select from a controlled vocabulary. | ||
*File name 1 | All data file names must be unique and must not contain spaces, brackets, periods, or forward (/) or backward (\) slashes. | 1. Fastq files can be compressed with gzip or bzip2 (zip and rar are NOT accepted). 2. Do not compress BAM files. 3. Data from PacBio Sequel and Ion Torrent series sequencers can be uploaded as tar archives. 4. Double-check that your file names are accurate before sending them to us. | |
*MD5 checksum 1 | An MD5 checksum is a 32-character alphanumeric string. | 1. Mac and Linux users can generate MD5 checksums with the native command-line tools "md5sum" (Linux) and "md5" (macOS). 2. Windows users need to download a third-party utility, such as winmd5free. | 32-character alphanumeric string |
File name 2 | All data file names must be unique and must not contain spaces, brackets, periods, or forward (/) or backward (\) slashes. | File name 2 and MD5 checksum 2 are required for paired-end data only. | |
MD5 checksum 2 | An MD5 checksum is a 32-character alphanumeric string. | | 32-character alphanumeric string |
Reference file name | Reference name. | Applies when the Run data file type is BAM. 1. If you want to submit your reference file to our FTP site, fill in reference_name and reference_md5. We accept only Fasta files compressed with gzip or bzip2. 2. If your reference file is already in another database, fill in the Assembly Name or Accession and the Assembly Accession URL. 3. For PacBio Sequel and Ion Torrent series sequencers, this column may be left empty. | |
MD5 for reference file | MD5 for reference file. | | 32-character alphanumeric string |
Assembly Name or Accession | Assembly Name or Accession. | ||
Assembly Accession URL | Assembly Accession URL. | | URL |
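As an alternative to the command-line tools mentioned in the MD5 checksum rows, a checksum can be computed with Python's standard `hashlib` module. This is a minimal sketch; the function names `md5_of_file` and `looks_like_md5` are illustrative and not part of any GSA tool.

```python
import hashlib
import re

# A 32-character hexadecimal string, as described in the table above.
MD5_FORMAT = re.compile(r"[0-9a-fA-F]{32}")


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks
    so that large sequence files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_like_md5(value):
    """Check that a value entered in the sheet has the MD5 format."""
    return MD5_FORMAT.fullmatch(value) is not None
```

The checksum should be computed on the file exactly as it will be uploaded, that is, on the compressed file if the file is compressed.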
This page reviews the submission file formats currently supported by the GSA, and gives guidance to submitters about current file formats and policies regarding GSA submissions.
File types | File suffix | Applicable platforms | Is recommended |
Fastq | .fastq.gz, .fq.gz, .fastq.bz2, .fq.bz2 | All Platforms | Yes |
Bam | .bam | All Platforms | Yes |
Sff | .sff | LS454, ION_TORRENT, BGISEQ-100, DA8600 | |
Complete Genomics Native | .tar.gz, .tar | Complete Genomics, BGISEQ-500, BGISEQ-1000 | |
Solid Native | .tar.gz, .tar | ABI SOLID | |
PacBio_HDF5 | .tar, .tar.gz | PacBio RS, PacBio RS II | Recommended for PacBio RS / PacBio RS II |
PacBio Sequel Native | .tar, .tar.gz | PacBio Sequel | Recommended for PacBio Sequel |
Ab1 | .ab1 | CAPILLARY | |
Oxford Nanopore Native | .tar, .tar.gz | Oxford Nanopore | |
10x Genomics | .tar, .tar.gz | | |
Bnx | .bnx.gz, .bnx.bz2 | Bionano Genomics | |
Fasta | .fasta.gz, .fasta.bz2, .fa.gz, .fa.bz2 | | |
Helicos Native | .tar, .tar.gz | Helicos BioSciences Corporation | |
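A file name can be checked against the suffixes in the table above before upload. The sketch below is illustrative only: `file_types_for` is a hypothetical helper, suffix matching is assumed to be case-sensitive, and the native tar-based formats are omitted because several of them share the same .tar and .tar.gz suffixes.

```python
# Suffixes transcribed from the file format table on this page.
# The tar-based native formats are left out because their
# suffixes do not identify a single file type.
ACCEPTED_SUFFIXES = {
    "Fastq": (".fastq.gz", ".fq.gz", ".fastq.bz2", ".fq.bz2"),
    "Bam": (".bam",),
    "Sff": (".sff",),
    "Ab1": (".ab1",),
    "Bnx": (".bnx.gz", ".bnx.bz2"),
    "Fasta": (".fasta.gz", ".fasta.bz2", ".fa.gz", ".fa.bz2"),
}


def file_types_for(name):
    """Return the file types whose accepted suffixes match the name."""
    return [
        file_type
        for file_type, suffixes in ACCEPTED_SUFFIXES.items()
        if name.endswith(suffixes)
    ]


print(file_types_for("sample_R1.fq.gz"))  # ['Fastq']
print(file_types_for("reads.zip"))        # []
```

An empty result for a Fastq file usually means that the file is uncompressed or was compressed with an unsupported tool such as zip.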
Read data can be submitted in several standard and platform-specific formats. We recommend that read data be submitted in Fastq or BAM format.
Single and paired reads are accepted as Fastq files that meet the following requirements:
1) Quality scores must be in Phred scale. Both ASCII and space-delimited decimal encoding of quality scores are supported. We will automatically detect the Phred quality offset of either 33 or 64.
2) No technical reads (adapters, linkers, barcodes) are allowed.
3) Single reads must be submitted using a single Fastq file and can be submitted with or without read names.
4) Paired reads must be submitted using two Fastq files.
5) Paired read names must have a suffix identifying the first and second read from the pair, for example '/1' and '/2' (regular expression for the reads: "^@([a-zA-Z0-9_-]+:[0-9]+:[a-zA-Z0-9]+:[0-9]+:[0-9]+:[0-9-]+:[0-9-]+) ([12]):[YN]:[0-9]*[02468]:[ACGTN]+$").
6) The first line for each read must start with '@'.
7) The base calls and quality scores must be separated by a line starting with '+'.
8) The Fastq files must be compressed using gzip or bzip2.
9) The regular expression for bases is “^([ACGTNactgn.]*?)$”
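Several of the requirements above can be checked locally before upload. The following sketch is not the GSA's own validation: `check_fastq_records` is a hypothetical helper, it assumes each read occupies exactly four lines, and it covers only requirements 5, 6, 7, and 9, using the two regular expressions exactly as quoted in the list.

```python
import re

# Regular expressions copied from requirements 5 and 9 above.
PAIRED_NAME = re.compile(
    r"^@([a-zA-Z0-9_-]+:[0-9]+:[a-zA-Z0-9]+:[0-9]+:[0-9]+:[0-9-]+:[0-9-]+) "
    r"([12]):[YN]:[0-9]*[02468]:[ACGTN]+$"
)
BASES = re.compile(r"^([ACGTNactgn.]*?)$")


def check_fastq_records(lines, paired=False):
    """Return (record_number, problem) tuples for a Fastq file.

    Assumption: every read is stored as exactly four lines.
    """
    lines = [line.rstrip("\n") for line in lines]
    problems = []
    if len(lines) % 4 != 0:
        problems.append((None, "line count is not a multiple of 4"))
    for i in range(0, len(lines) - len(lines) % 4, 4):
        number = i // 4 + 1
        header, bases, separator, _quality = lines[i:i + 4]
        if not header.startswith("@"):
            problems.append((number, "first line does not start with '@'"))
        elif paired and not PAIRED_NAME.match(header):
            problems.append((number, "read name does not match the paired-read pattern"))
        if not BASES.match(bases):
            problems.append((number, "unexpected character in base calls"))
        if not separator.startswith("+"):
            problems.append((number, "separator line does not start with '+'"))
    return problems
```

For a compressed file, the lines can be supplied with `gzip.open(path, "rt")` or `bz2.open(path, "rt")` from the standard library.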
Submitted BAM files must be readable with Samtools and Picard.
BAM file names must end with the .bam suffix (e.g. ‘a.bam’).