VGSC: A web-based Vector Graph toolkit of genome Synteny and Collinearity

Manual

The software architecture of VGSC2 is shown in Fig. 1, which illustrates how the capabilities of VGSC (cyan background) are expanded with several new features (red background). Vertically, we added synteny and collinearity calculation as a step preceding plotting and integrated a web-based vector graphic editor as a subsequent step to enable online image editing. Horizontally, synteny and collinearity analysis for gene families widens the application scope of VGSC2.

1. Synteny and collinearity calculation

Previously, VGSC provided a downloadable command-line toolkit, written in Java, to support the offline plotting workflow, while VGSC-WEB (available at http://bio.njfu.edu.cn/vgsc-web) is a web-based user interface that provides the same functionality. Both required the user to prepare a collinearity file in which synteny blocks are pre-defined, usually generated by synteny and collinearity analysis tools such as MCScanX, SyMAP, and i-ADHoRe 3.0. In VGSC2, we integrated the dynamic programming algorithm from MCScanX, which builds chains of pairwise collinear genes by scoring adjacent collinear gene pairs, known as anchor genes, and penalizing the distance between anchor genes. As a result, in the fast mode of VGSC2, users only need to upload an alignment file (BLAST format) and an annotation file (GFF format) to complete the entire analysis and plotting procedure.
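The chaining idea described above (reward each anchor pair, penalize the gap between consecutive anchors) can be sketched as a small dynamic program. This is a minimal illustration, not the VGSC2 or MCScanX source code: the function name, the parameter values and the use of gene-order ranks as coordinates are all assumptions made for the example, and only same-orientation chains are handled.

```python
# Illustrative parameters only; they are assumptions, not values read from VGSC2.
MATCH_SCORE = 50   # reward for each anchor gene pair in a chain
GAP_PENALTY = 1    # cost per intervening gene between consecutive anchors
MAX_GAPS = 25      # anchors further apart than this are never linked


def best_collinear_chain(anchors, match_score=MATCH_SCORE,
                         gap_penalty=GAP_PENALTY, max_gaps=MAX_GAPS):
    """Return (score, chain) for the highest-scoring forward collinear chain.

    anchors: iterable of (rank_a, rank_b) tuples, the gene-order positions of
    a homologous gene pair on two chromosomes.
    """
    pts = sorted(set(anchors))
    if not pts:
        return 0, []
    n = len(pts)
    score = [match_score] * n   # best chain score ending at anchor i
    prev = [-1] * n             # back-pointer to the previous anchor in that chain
    for i in range(n):
        ai, bi = pts[i]
        for j in range(i):
            aj, bj = pts[j]
            da, db = ai - aj, bi - bj
            if da <= 0 or db <= 0:
                continue        # order must increase on both chromosomes
            gaps = max(da, db) - 1
            if gaps > max_gaps:
                continue        # too far apart to belong to the same block
            candidate = score[j] + match_score - gap_penalty * gaps
            if candidate > score[i]:
                score[i], prev[i] = candidate, j
    end = max(range(n), key=score.__getitem__)
    chain, k = [], end
    while k != -1:              # walk the back-pointers to recover the chain
        chain.append(pts[k])
        k = prev[k]
    chain.reverse()
    return score[end], chain
```

For example, `best_collinear_chain([(1, 1), (2, 2), (3, 3), (10, 4), (4, 50)])` links the first four anchors into one chain and leaves `(4, 50)` out, because its distance on the second chromosome exceeds `max_gaps`.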

2. Synteny and collinearity in gene family

In eukaryotic genomes, a set of similar genes formed by gene duplication and conversion is considered a gene family, and genes in such a family usually have similar biochemical functions. Therefore, synteny and collinearity at the gene family level play a significant role in studying the homology of gene families. Combined with the position information from the genome annotation, collinearity often implies important connections between the whole-genome and gene family scales. VGSC2 can convert a Newick format tree file into a tree plot or draw a collinearity circle plot with highlighted gene family connections.
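To make the Newick input concrete, the sketch below parses a Newick string into a nested structure from which a tree plot could be laid out. It is a minimal illustration under stated assumptions, not the parser used by VGSC2: the function names and the dictionary layout are hypothetical, the gene names are made up, and quoted labels and bracketed comments are not supported.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {'name': str, 'length': float or None, 'children': list}.
    """
    s = text.strip()
    if not s.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1                      # consume '('
            children.append(parse_node())
            while s[pos] == ",":
                pos += 1                  # consume ',' between siblings
                children.append(parse_node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                      # consume ')'
        start = pos
        while s[pos] not in ",():;":
            pos += 1
        name = s[start:pos].strip()
        length = None
        if s[pos] == ":":
            pos += 1                      # consume ':' before the branch length
            start = pos
            while s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if s[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Collect leaf labels (for a gene family tree, the gene names) left to right."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names
```

With the made-up input `"((geneA:0.1,geneB:0.2)n1:0.05,geneC:0.3)root;"`, `leaf_names(parse_newick(...))` lists the three genes in tree order, which is the order a tree plot would place them along the leaf axis.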

3. Online editor of vector graph

Using graphic approaches to study biological systems can provide an intuitive view and useful insights for analyzing complicated relationships, as indicated by many previous studies on a series of important biological topics, such as enzyme kinetics, drug metabolism systems and protein structure. As a key feature of VGSC, vector graph output distinguishes it from other synteny and collinearity analysis toolkits. Regular raster images use dot matrix data to represent a generally rectangular grid of pixels or points of color, whereas vector graphics represent the size and position of geometrical primitives, such as points, lines, curves and shapes. Their invariance under scaling, rotation and other transformations makes vector graphics widely used in scientific data visualization, especially in high-resolution image processing. In VGSC2, we integrated a frontend JavaScript engine to enable online editing. Users can manipulate the results of the previous analysis steps through selection, color editing, rotation, zooming and more. Finally, the edited result can be saved locally as a vector graph file or a high-resolution raster file.
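The difference from raster output can be illustrated with a few lines that emit SVG, a common vector format. The sketch below draws two chromosome bars and one synteny ribbon from geometric primitives only; because the drawing is described by coordinates inside a `viewBox`, it can be rendered at any size without loss of detail. This is an assumption-laden illustration, not VGSC2's rendering code (which runs in a JavaScript frontend): the function name, layout constants and colors are hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def synteny_svg(block_a, block_b, chrom_len_a, chrom_len_b, width=400, height=200):
    """Return an SVG string with two chromosome bars and one synteny ribbon.

    block_a, block_b: (start, end) coordinates of the syntenic block on each
    chromosome; chrom_len_a, chrom_len_b: chromosome lengths in the same unit.
    """
    def scale(x, chrom_len):
        # Map a genomic coordinate onto the horizontal drawing axis.
        return round(x / chrom_len * width, 2)

    a0, a1 = (scale(v, chrom_len_a) for v in block_a)
    b0, b1 = (scale(v, chrom_len_b) for v in block_b)
    top, bar = 20, 10                 # y position and thickness of the upper bar
    bottom = height - 30              # y position of the lower bar
    y_up = top + bar                  # ribbon starts at the lower edge of the upper bar
    mid = (y_up + bottom) / 2         # control-point height for the Bezier curves

    # The ribbon is one closed path: two cubic Bezier curves joined by straight edges.
    ribbon = (f"M {a0} {y_up} C {a0} {mid}, {b0} {mid}, {b0} {bottom} "
              f"L {b1} {bottom} C {b1} {mid}, {a1} {mid}, {a1} {y_up} Z")

    svg = ET.Element("svg", {"xmlns": SVG_NS, "viewBox": f"0 0 {width} {height}"})
    for y in (top, bottom):
        ET.SubElement(svg, "rect", {"x": "0", "y": str(y), "width": str(width),
                                    "height": str(bar), "fill": "#4477aa"})
    ET.SubElement(svg, "path", {"d": ribbon, "fill": "#ee6677", "fill-opacity": "0.5"})
    return ET.tostring(svg, encoding="unicode")
```

Writing the returned string to a `.svg` file gives an image that an editor or browser can zoom, rotate or recolor element by element, which is the kind of manipulation a vector editor exposes.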

4. Availability and examples

User-friendly and publicly accessible web servers represent the current trend in developing computational methods. They have significantly enhanced the impact of bioinformatics on biological science [14], driving computational biology into an unprecedented revolution. Here we also provide a web-based VGSC2, the Second Generation Vector Graph Toolkit of Genome Synteny and Collinearity, which is available at http://bio.njfu.edu.cn/vgsc2. The command-line executable of VGSC2 is free to download together with the full user manual and sample data. The EXAMPLES section of the page lists three typical examples of VGSC and VGSC2 functions and capabilities. The sample data of Oryza sativa vs. Zea mays, originally from the MCScanX examples, demonstrate online analysis and plotting with bar plots, circle plots, dot plots and dual synteny plots. The example data of Arabidopsis thaliana illustrate the new gene family plotting feature, including the family circle plot, family tree plot and family tree plot with tandem mark. The most recent example is a high-resolution vector graphic image from the research on Chenopodium quinoa published in Nature, in which the authors used VGSC to draw the circle plot and dual synteny plot showing the relationships among the C. pallidicaule, C. suecicum and C. quinoa genomes.