MIA is a one-stop platform for Multi-omics Integrative Analysis, dedicated to clustering samples and identifying feature genes with supervised and unsupervised learning algorithms.
With the explosive growth of diverse omics data, multi-omics integrative analysis has become a routine and popular approach to clustering samples and identifying feature genes. Despite many efforts to facilitate such analysis, challenges remain in multi-dimensional data integration, feature selection for unlabeled samples, and applicability to small sample sets with large numbers of features. Most importantly, a web service that facilitates such analysis is still lacking. To this end, we developed MIA, which can be used for a wide range of clustering and subtype identification tasks, e.g., disease subtyping, feature gene identification, disease classification, comparison of normal and disease samples, and tissue/cell clustering.
Algorithmically, MIA uses a tensor representation, fuzzy C-means clustering, and random forest to achieve multi-omics integration, clustering, and feature selection, respectively. It supports gene-centered omics data, including transcriptome, copy number variation, DNA methylation, RNA modification, histone modification, and proteomics data, but is not currently applicable to discontinuous data such as gene mutations. The web server supports integrative analysis of two or three datasets of different omics types, as well as analysis of a single dataset. Unlike existing tools that require gene filtering or selection before sample clustering, MIA can cluster samples using all submitted multi-omics data without any pre-filtering of genes.
When tested on glioblastoma multi-omics data (transcriptome, copy number variation, and DNA methylation), MIA clustered the samples into four subtypes with significantly different survival. MIA can also quantitatively prioritize feature genes by scoring their contributions to the clustering, and it can deliver clustering charts based on a user-customized list of feature genes. This enables further investigation of the molecular mechanisms underlying different groups or subtypes.
Collectively, MIA is a powerful multi-omics integrative analysis tool with the potential to be widely used across a broad range of omics studies.
MIA uses a tensor representation, fuzzy C-means clustering, and random forest to achieve multi-omics integration, clustering, and feature selection.
The data for multiple samples and genes across omics types are integrated into a three-dimensional tensor whose axes are sample, gene, and omics type. Random forest is a widely used classification algorithm in machine learning and also serves as a feature selection algorithm suited to high-dimensional data. It requires neither dimensionality reduction nor a change of feature space, which makes it appropriate for multi-omics data. Using random forest, MIA calculates an importance index for each gene and ranks all genes by this index, generating a list of feature genes. For unlabeled data, fuzzy C-means clustering is first used to generate sample labels, and random forest is then used to identify feature genes.
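The first two steps described above can be illustrated with a minimal sketch in plain Python. This is not the MIA implementation: the function names (build_tensor, flatten, fuzzy_c_means), the fuzzifier m = 2, the random initialization, and the choice to represent each sample by its flattened gene-by-omics slice are all assumptions made for illustration. The random forest ranking step is omitted here.

```python
import random


def build_tensor(omics_matrices):
    """Stack gene-by-sample matrices (one per omics type) into
    tensor[sample][gene][omics]. Assumes identical gene/sample order."""
    n_genes = len(omics_matrices[0])
    n_samples = len(omics_matrices[0][0])
    for matrix in omics_matrices:
        if len(matrix) != n_genes or any(len(row) != n_samples for row in matrix):
            raise ValueError("all omics matrices must share the same genes and samples")
    return [[[matrix[g][s] for matrix in omics_matrices] for g in range(n_genes)]
            for s in range(n_samples)]


def flatten(sample_slice):
    """Turn one sample's gene-by-omics slice into a flat feature vector."""
    return [value for gene in sample_slice for value in gene]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard fuzzy C-means; returns (hard labels, membership matrix)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    for _ in range(iters):
        # Cluster centers: means of the points weighted by membership ** m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / w_sum
                            for d in range(dim)])
        # Membership update: u_ik proportional to d_ik^(-2 / (m - 1)).
        new_u = []
        for p in points:
            d2 = [sq_dist(p, center) for center in centers]
            if any(v == 0.0 for v in d2):
                # Point coincides with a center: give it full membership there.
                hits = [1.0 if v == 0.0 else 0.0 for v in d2]
                total = sum(hits)
                new_u.append([h / total for h in hits])
            else:
                inv = [v ** (-1.0 / (m - 1.0)) for v in d2]
                total = sum(inv)
                new_u.append([v / total for v in inv])
        shift = max(abs(a - b) for ra, rb in zip(u, new_u) for a, b in zip(ra, rb))
        u = new_u
        if shift < tol:
            break
    # Hard label = cluster with the highest membership.
    labels = [max(range(c), key=lambda k: row[k]) for row in u]
    return labels, u
```

A call such as fuzzy_c_means([flatten(s) for s in build_tensor([expression, methylation])], c=2) would return one hard label per sample plus the full membership matrix; those labels could then be passed to any random forest implementation to rank genes by importance.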
In addition to the web server, we provide a local package written in C++ that runs from the command line on multiple operating systems and offers customizable parameter settings for users’ convenience.
Input (labeled samples):
Data matrix file: one dataset, or two or three datasets of different omics types; columns are samples and rows are genes (see the example file on the website); genes and samples should be arranged in the same order across datasets; TXT format and <30 Mb.
Sample label file: the first column is the sample and the second column is the label (see the example file on the website); TXT format and <3 Mb.
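Before uploading, the ordering requirement above can be checked locally. The sketch below is only an illustration under stated assumptions: it assumes tab-delimited files whose first row holds the sample names (after a leading corner cell) and whose first column holds the gene names, which the description above does not specify. The function names read_matrix and check_consistent are hypothetical and are not part of MIA.

```python
import csv


def read_matrix(path, delimiter="\t"):
    """Read a gene-by-sample matrix.

    Assumed layout (not confirmed by the MIA documentation): the first row
    holds sample names after a corner cell, and the first column holds gene
    names.
    """
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    if len(rows) < 2:
        raise ValueError(f"{path}: expected a header row and at least one gene row")
    samples = rows[0][1:]
    genes = [row[0] for row in rows[1:]]
    values = [[float(cell) for cell in row[1:]] for row in rows[1:]]
    for gene, row in zip(genes, values):
        if len(row) != len(samples):
            raise ValueError(f"{path}: gene {gene} has {len(row)} values, "
                             f"expected {len(samples)}")
    return samples, genes, values


def check_consistent(matrices):
    """Raise ValueError unless all datasets list samples and genes in the same order."""
    if not 1 <= len(matrices) <= 3:
        # The web server accepts one to three datasets.
        raise ValueError("expected between one and three datasets")
    ref_samples, ref_genes, _ = matrices[0]
    for index, (samples, genes, _) in enumerate(matrices[1:], start=2):
        if samples != ref_samples:
            raise ValueError(f"dataset {index}: sample order differs from dataset 1")
        if genes != ref_genes:
            raise ValueError(f"dataset {index}: gene order differs from dataset 1")
```

Calling check_consistent([read_matrix(p) for p in paths]) on the files to be submitted reports the first dataset whose gene or sample order differs from the first file.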
Output (labeled samples):
After uploading the input files and clicking the “run” button, the following result files can be obtained:
(1) feature_importance.txt (the ranked features with scores);
(2) orderByImportance.txt (the data matrix ordered by ranked features);
(3) runtime_log.txt (the runtime log).
Example results are provided on the website.
Input (unlabeled samples):
Data matrix file: same as for labeled samples.
Cluster number: for clustering unlabeled samples, users can specify the cluster number by choosing a number from 2 to 10; alternatively, choose “auto” to let MIA determine an optimal number automatically.
Output (unlabeled samples):
After uploading the input files, setting the cluster number, and clicking the “run” button, four types of files can be obtained:
(1) feature_importance.txt (the ranked features with scores);
(2) orderByImportance.txt (the data matrix ordered by ranked features);
(3) fuzzy_c_means_cluster_results.txt (the cluster information);
(4) runtime_log.txt (the runtime log).
Example results are provided on the website.
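For downstream work, such as building a customized list of feature genes, the ranked output can be post-processed with a few lines of Python. The sketch below is a hedged illustration: it assumes feature_importance.txt is tab-delimited with the gene name in the first column and its score in the second, which is not specified above, and the function name top_features is hypothetical.

```python
def top_features(path, k, delimiter="\t"):
    """Return the k highest-scoring (gene, score) pairs from a ranked feature file.

    Assumed layout (not confirmed by the MIA documentation): one feature per
    line, gene name in column 1 and numeric score in column 2. Lines whose
    second column is not a number (e.g. a header) are skipped.
    """
    ranked = []
    with open(path) as handle:
        for line in handle:
            parts = line.rstrip("\n").split(delimiter)
            if len(parts) < 2:
                continue
            try:
                score = float(parts[1])
            except ValueError:
                continue  # header or malformed line
            ranked.append((parts[0], score))
    # Sort explicitly rather than trusting the file order.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

The returned gene names could then be used to assemble a shortlist for further inspection; check the layout against an actual result file before relying on it.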
Note that MIA assigns a unique Task ID to each request. Users can retrieve their results by entering the Task ID on the homepage. MIA can be run either on the web server or locally with the MIA package (command line). To analyze larger data files (.txt; >30 Mb) or more than three datasets, use the MIA package, which is available on the download page; installation instructions are given in README.txt.