BLAST Homepage and Selected Search Pages
Introducing the BLAST homepage and form elements/functions of selected search pages. https://ngdc.cncb.ac.cn/blast/home
Background
BLAST is a suite of programs provided by NCBI for aligning query sequences against those present in a selected target database. Like the NCBI BLAST homepage (https://blast.ncbi.nlm.nih.gov), the NGDC BLAST homepage (https://ngdc.cncb.ac.cn/blast/home) provides a web access point to these tools for sequence alignment.
The BLAST Homepage
The NGDC BLAST homepage consists of a brief introduction to the BLAST sequence alignment tool and four links to different kinds of BLAST services. Among them, nucleotide BLAST (blastn) searches nucleotide databases using a nucleotide query, protein BLAST (blastp) searches protein databases using a protein query, blastx searches protein databases using a translated nucleotide query, and tblastn searches translated nucleotide databases using a protein query. Clicking one of the images takes you to the corresponding search page.
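The four services differ only in the type of query and database they accept, so the choice can be written as a small lookup. The sketch below is purely illustrative: the `PROGRAMS` table and the `choose_program` helper are hypothetical names and are not part of the NGDC service.

```python
# Query and database sequence types for the four search pages,
# taken from the description of the homepage links.
PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide"),
    "blastp": ("protein", "protein"),
    "blastx": ("nucleotide", "protein"),    # the query is translated
    "tblastn": ("protein", "nucleotide"),   # the database is translated
}


def choose_program(query_type: str, database_type: str) -> str:
    """Return the program whose query/database types match the inputs."""
    for name, types in PROGRAMS.items():
        if types == (query_type, database_type):
            return name
    raise ValueError(f"no search page for {query_type} vs {database_type}")


print(choose_program("nucleotide", "protein"))  # blastx
```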
The Common BLAST Header
The common BLAST header provides a convenient way to navigate among different pages to access different contents or functions.
- The CNCB and NGDC names link to the CNCB homepage (https://www.cncb.ac.cn/) and the NGDC homepage (https://ngdc.cncb.ac.cn/), so that you can access non-BLAST functions and content from other databases and services available on the site.
- The "Home" tab links to the NGDC BLAST homepage, from other search or result pages.
- The BLAST service temporarily saves your search results for up to 24 hours. If you are logged in, the results are kept permanently. The "Recent Results" tab links to a page that keeps track of recently submitted search requests that have not expired. The Request ID uniquely assigned to each submitted search provides one-click access to that result.
- The "Saved Strategies" tab lists a set of search setups saved earlier. It allows the examination of search settings used, quick re-launch of these searches, and download of specific strategies for sharing or re-use in standalone BLAST.
- The "Help" tab points to a page with a list of links to help documents.
- The "Log in" button allows users to log in to NGDC SSO, a free account from NGDC that lets users customize their site preferences and manage the work they perform on the site. Results of BLAST searches performed while logged in remain permanently accessible through the "Recent Results" page, independent of the browser session. Saved strategies are also kept permanently.
The Recent Results Page
BLAST search results are available for up to 24 hours. The "Recent Results" tab displays a list of recently submitted search requests that have not expired. The list is session-specific and will be lost if the session cookie is cleared upon browser exit. For this reason, it is recommended that BLAST searches be done with an active login to an NGDC SSO account, with the NGDC SSO login indicated by an insert at the upper right. Each result is given as a row in the table, with the identifier in the Request ID column providing one-click access to the search result. The Program, Title and Database columns combine to provide a summary of a specific search. The "save", "download" and "delete" buttons allow saving the search strategy, downloading the BLAST result, and removing the BLAST result from the list.
The input box above the table is for retrieving other results using their assigned Request IDs, such as those shared among colleagues, used as teaching or demonstration examples, or those with issues encountered and reported to NGDC's blast-help group. As in NCBI BLAST, the formatting options table is no longer available following the switch to the updated BLAST results page.
The Saved Strategies Page
The "Saved Strategies" tab displays a list of search strategies you saved earlier. The first four columns provide a summary of the search settings for each saved entry. The "view" button loads the settings in a search page, while the "download" link saves the settings in a JSON-formatted text file, either for use with standalone BLAST or for reloading on the web service using the "Choose File" and "View" buttons. This is a useful way to share settings.
Functions of BLAST Search Pages
There are four BLAST search pages in the NGDC BLAST service, each performing a specific type of sequence alignment. These pages are the foundation of the NGDC BLAST service and are described in more detail below. Table 1 summarizes the key aspects of these pages. The pages access a set of common databases; a summary of the contents of these databases is given in Table 2.
Table 1. Key features of the BLAST search pages
| Search page | Query and database | Alignment | Programs and functions |
|---|---|---|---|
| nucleotide blast | nucleotide vs nucleotide | nucleotide | megablast (default): for sequence identification, intra-species comparison; discontiguous megablast: for cross-species comparison, searching with coding sequences; blastn: for searching with shorter queries, cross-species comparison |
| protein blast | protein vs protein | protein | blastp (default): general sequence identification and similarity searches; Quick BLASTP: with a kmer match to accelerate search speed for very similar proteins |
| blastx | nucleotide (translated) vs protein | protein | blastx: for identifying potential protein products encoded by a nucleotide query |
| tblastn | protein vs nucleotide (translated) | protein | tblastn: for identifying database sequences encoding proteins similar to the query |
Table 2 describes the contents of the NCBI and NGDC standard set of databases, which will help you select the proper database and set up more efficient searches.
Table 2. Contents of the common BLAST sequence databases
| Nucl. Databases | Content |
|---|---|
| NCBI nt | All GenBank + EMBL + DDBJ + PDB sequences, excluding sequences from PAT, EST, STS, GSS, WGS, TSA and phase 0, 1 or 2 HTGS sequences. Non-redundant, records with identical sequences collapsed into a single entry. |
| NCBI nt_euk | Eukaryota nt. |
| NCBI nt_prok | Prokaryota (bacteria and archaea) nt. |
| NCBI nt_viruses | Viruses nt. |
| NCBI nt_others | Artificial and other seqs nt. |
| NCBI rRNA/ITS databases | A collection of four databases: a 16S microbial rRNA database from NCBI's Targeted Loci Projects, 18S and 26S rRNA databases for fungi, plus an ITS database for fungi. |
| NCBI refseq_rna | Curated (NM_, NR_) plus predicted (XM_, XR_) sequences from NCBI Reference Sequence Project. |
| NCBI refseq_representative_genomes | NCBI RefSeq Reference and Representative genomes across broad taxonomy groups including eukaryotes, bacteria, archaea, viruses and viroids. These genomes are among the best quality genomes available with minimum redundancy - one genome per species for eukaryotes and diverse isolates for the same species for others. |
| NCBI TSA | Transcriptome Shotgun Assemblies, assembled from RNA-seq SRA data |
| NCBI pdb | Nucleotide sequences from the 3-dimensional structure records from Protein Data Bank. |
| NCBI Betacoronavirus | Betacoronavirus nucleotide sequence data (mixed DNA). |
| NCBI env_nt | Environmental samples. |
| NCBI human genome | Homo sapiens GRCh38.p13 (GCF_000001405.39) chromosomes plus unplaced and unlocalized scaffolds |
| NCBI mouse genome | Mus musculus GRCm39 (GCF_000001635.27) chromosomes plus unplaced and unlocalized scaffolds |
| NCBI mito | NCBI Genomic Mitochondrial Reference Sequences |

| Protein Databases | Content |
|---|---|
| NCBI nr | Non-redundant GenBank CDS translations + RefSeq + PDB + SwissProt + PIR + PRF, excluding those in PAT, TSA, and env_nr. |
| NCBI refseq_protein | Protein sequences from NCBI Reference Sequence project. |
| NCBI Landmark | The landmark database includes proteomes from representative genomes spanning a wide taxonomic range |
| NCBI swissprot | Last major release of the UniProtKB/SWISS-PROT protein sequence database (no incremental updates). |
| NCBI pdb | Protein sequences from the 3-dimensional structure records from the Protein Data Bank. |
| NCBI env_nr | Protein sequences translated from the CDS annotation of metagenomic nucleotide sequences. |
| NCBI tsa_nr | Protein sequences translated from CDSs annotated on transcriptome shotgun assemblies. |
Elements of the Standard Nucleotide BLAST Search Page
The "nucleotide-blast" link loads the blastn (Standard Nucleotide BLAST) search page. The top of the page (below the common BLAST header) contains a set of tabs for quick navigation among the four core BLAST search pages. The default display of the page contains three sections with the functions described below.
Enter Query Sequence
The main input box takes nucleotide query sequences in various formats, such as an accession (NM_000249) or FASTA. For a single input, the "Query subrange" boxes define a segment of the query to use in the search. Use the "Choose File" button to upload a plain text file with one or more sequences in FASTA format. The "Align two or more sequences" checkbox changes the "Choose Search Set" section below to "Enter Subject Sequence", allowing comparison of the query against the sequences in the subject input box.
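To make the FASTA input and the query subrange concrete, here is a minimal sketch that splits FASTA text into records and extracts a segment. It assumes that subrange coordinates are 1-based and inclusive; `parse_fasta` and `subrange` are hypothetical helper names, and the example sequences are made up. The parser expects every sequence to have a definition line starting with ">".

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {identifier: sequence}."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word of the definition line.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def subrange(sequence: str, start: int, stop: int) -> str:
    """Return the segment from start to stop (1-based, inclusive)."""
    if not 1 <= start <= stop <= len(sequence):
        raise ValueError("subrange must lie within the sequence")
    return sequence[start - 1:stop]


example = ">seq1 demo record\nACGTAC\nGTTT\n>seq2\nGGCC\n"
records = parse_fasta(example)
print(records)                          # {'seq1': 'ACGTACGTTT', 'seq2': 'GGCC'}
print(subrange(records["seq1"], 3, 6))  # GTAC
```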
Choose Search Set
A BLAST database can be selected from the standard list using the pull-down menu. A search can be restricted to a subset of entries in the selected database by typing the name of a species, strain, or taxonomic group in the "Organism" textbox and selecting from the suggested list. The exclusion box to the right instead removes sequences from the selected organism from the search.
Program Selection
Three programs with different speed and sensitivity are available for nucleotide vs nucleotide searches. The default, megablast, is better for tasks such as identifying the input query and searching with a large genomic query. Discontiguous megablast works better for finding related sequences from other species. blastn works better for short input queries and for identifying short matches, and it also works better than megablast for cross-species searches. Clicking the "BLAST" button submits the search to the BLAST server for processing, and results are displayed automatically when the search completes. The "Algorithm parameters" link opens a normally collapsed section that gives access to additional parameter settings.
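For readers who also run standalone BLAST+, the three programs correspond to values of the `-task` option of the `blastn` executable. The sketch below only builds the argument list and does not run anything. `blastn_command` is a hypothetical helper, and the file and database names are placeholders.

```python
# Program name on the web page -> value of the BLAST+ "-task" option.
TASKS = {
    "megablast": "megablast",
    "discontiguous megablast": "dc-megablast",
    "blastn": "blastn",
}


def blastn_command(query_file: str, database: str, program: str = "megablast") -> list[str]:
    """Build (but do not run) a blastn command line for the chosen program."""
    return ["blastn", "-task", TASKS[program], "-query", query_file, "-db", database]


# Placeholder file and database names, for illustration only.
print(" ".join(blastn_command("query.fasta", "nt", "discontiguous megablast")))
# blastn -task dc-megablast -query query.fasta -db nt
```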
General Parameters
Parameters in this section specify the search sensitivity. "Max target sequences" sets the maximum number of database sequences with matches that BLAST saves for a given query. "Expect threshold" filters out less significant matches, that is, those with an Expect value above the setting. "Word size" sets the size of the initial seed match; smaller settings are more sensitive.
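The effect of word size can be illustrated with exact word matching alone. The sketch below is a simplification and not BLAST's actual seeding code; `shared_words` is a hypothetical helper and the two sequences are made up. It counts the words of a given length that occur in both sequences, and with a single mismatch in the middle, the longer word finds no seed while the shorter one does.

```python
def shared_words(query: str, subject: str, word_size: int) -> set[str]:
    """Return the words of length word_size that occur in both sequences."""
    def words(seq: str) -> set[str]:
        return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}
    return words(query) & words(subject)


query = "ACGTTGCAAGCTTAGC"
subject = "ACGTTGCATGCTTAGC"  # differs from the query only at position 9

print(len(shared_words(query, subject, 11)))   # 0: every 11-mer spans the mismatch
print(sorted(shared_words(query, subject, 7)))  # ['ACGTTGC', 'CGTTGCA', 'GCTTAGC']
```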
Scoring Parameters
Parameters here also affect the search sensitivity. "Match/Mismatch Scores" specifies the reward assigned to an exact match and the penalty assigned to a mismatch. The "Gap Costs" field specifies how gaps introduced in the alignment are penalized. For megablast, the default is linear: there is no penalty for opening a gap, and extending a gap incurs a penalty proportional to the length of the gap. For both parameters, non-default settings can be selected using the pull-down menu.
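The difference between linear and affine gap costs can be sketched as a raw-score calculation. The numbers below are illustrative only and are not claimed to be the service's defaults; `gap_cost` and `raw_score` are hypothetical helpers. A gap of length k is assumed to cost the opening cost plus k times the extension cost, so a linear scheme is simply one whose opening cost is zero.

```python
def gap_cost(length: int, gap_open: float, gap_extend: float) -> float:
    """Cost of one gap: opening cost plus extension cost per gapped position."""
    return gap_open + gap_extend * length


def raw_score(matches: int, mismatches: int, gap_lengths: list[int],
              reward: float, penalty: float,
              gap_open: float, gap_extend: float) -> float:
    """Raw alignment score; penalty is given as a negative number."""
    score = matches * reward + mismatches * penalty
    for length in gap_lengths:
        score -= gap_cost(length, gap_open, gap_extend)
    return score


# Illustrative alignment: 50 matches, 2 mismatches, one gap of length 3.
linear = raw_score(50, 2, [3], reward=1, penalty=-2, gap_open=0, gap_extend=2)
affine = raw_score(50, 2, [3], reward=1, penalty=-2, gap_open=5, gap_extend=2)
print(linear, affine)  # 40 35
```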
Filters and Masking
Parameters here specify whether low complexity sequences and organism-specific repeats should be filtered, and whether to filter only at the initial seed match stage or during alignment extension as well. Lower case letters in a FASTA query (provided in mixed upper and lower case, with lower case marking custom features) can also be masked.
When searching with genomic sequences, make sure the species-specific repeats for the source organism are selected. This helps prevent spurious hits from repeats from overwhelming the BLAST search and causing search errors.
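To see which regions the lower case masking option would affect, a query can be inspected before submission. The sketch below lists the lower case runs of a sequence as 1-based, inclusive intervals; `lowercase_intervals` is a hypothetical helper and the example sequence is made up.

```python
def lowercase_intervals(sequence: str) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (start, stop) pairs for each lower case run."""
    intervals = []
    start = None
    for position, letter in enumerate(sequence, start=1):
        if letter.islower():
            if start is None:
                start = position  # a new lower case run begins here
        elif start is not None:
            intervals.append((start, position - 1))
            start = None
    if start is not None:
        # The sequence ended inside a lower case run.
        intervals.append((start, len(sequence)))
    return intervals


print(lowercase_intervals("ACGTacgtACgt"))  # [(5, 8), (11, 12)]
```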
Elements of the Standard Protein BLAST Search Page
The "protein-blast" link loads the blastp (Standard Protein BLAST) search page. The top of this page has the same tabs and links as the "Standard Nucleotide BLAST" search page, with the same functions. The default page display contains three sections with the functions described below.
Enter Query Sequence
Refer to the description for "Standard Nucleotide BLAST" for details.
Choose Search Set
Most of the components are similar to the "Standard Nucleotide BLAST" page. The main difference is that the database pull-down menu has a smaller list of protein databases.
Program Selection
Two programs are available to satisfy different search needs. The default blastp is a general purpose protein alignment program for identifying a sequence or finding others similar to it. blastp-short is for short protein sequences. More complex searches may require adjustment of other search settings listed under the "Algorithm parameters" link, which is organized in a similar manner to that of the "Standard Nucleotide BLAST" page.
- General Parameters: This section is the same as that in the "Standard Nucleotide BLAST".
- Scoring Parameters: Eight score matrices from two families are supported. The default BLOSUM62 is the best general purpose matrix. For short queries, PAM30 is often selected. Each matrix has its own set of supported gap penalties under the Gap Costs menu. Protein alignment scores can be adjusted to account for biases in composition using the options under "Compositional adjustments".
- Filters and Masking: Parameters here specify whether low complexity regions should be filtered and whether filtering applies only at the seed lookup stage. Lower case letters in the FASTA query (in mixed upper and lower case, with lower case marking custom features) can also be masked. These settings are generally not needed when compositional adjustments are used.
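For comparison with standalone BLAST+, the program and scoring choices above correspond to the `-task`, `-matrix` and `-comp_based_stats` options of the `blastp` executable. The sketch below only assembles the argument list. `blastp_command` is a hypothetical helper, and the file and database names are placeholders. Options left as `None` are omitted, so the executable's own defaults apply.

```python
from typing import Optional


def blastp_command(query_file: str, database: str, task: str = "blastp",
                   matrix: Optional[str] = None,
                   comp_based_stats: Optional[int] = None) -> list[str]:
    """Build (but do not run) a blastp command line."""
    command = ["blastp", "-task", task, "-query", query_file, "-db", database]
    if matrix is not None:
        command += ["-matrix", matrix]
    if comp_based_stats is not None:
        command += ["-comp_based_stats", str(comp_based_stats)]
    return command


# A short peptide query searched with PAM30; names are placeholders.
print(" ".join(blastp_command("peptide.fasta", "swissprot",
                              task="blastp-short", matrix="PAM30")))
# blastp -task blastp-short -query peptide.fasta -db swissprot -matrix PAM30
```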
Items Unique to Translated Search Pages
The page layout for translated BLAST search pages is the same as that of "Standard Protein BLAST". However, they contain a few program-specific parameters.
- Translated blastx search: In the "Enter Query Sequence" section, a "Genetic code" field under the "Choose File" button specifies the codon table used to translate the input nucleotide query. Choose a code appropriate for the source of the query sequence. The remaining sections are the same as on the "Standard Protein BLAST" page.
- Translated tblastn search: The page layout is the same as the "Standard Protein BLAST" search page. The key difference is that the Database field lists available nucleotide databases instead.
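Translated searches compare proteins after translating the nucleotide side in all six reading frames. The sketch below shows what that translation looks like for a query, using only the standard genetic code (NCBI translation table 1). Other codes differ in a few codons and are not covered here. The function names and the example sequence are hypothetical, and codons containing letters other than A, C, G or T are rendered as "X".

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, listed with the codon
# positions cycling through T, C, A, G (third position fastest).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from the start of seq; '*' marks a stop."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq: str) -> dict[int, str]:
    """Translate frames +1..+3 and -1..-3 of a nucleotide sequence."""
    reverse = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
print(frames[1], frames[-1])  # MA* LGH
```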