Text clustering based on pre-trained models and autoencoders.

Qiang Xu, Hao Gu, ShengWei Ji
Author Information
  1. Qiang Xu: School of Artificial Intelligence and Big Data, Hefei University, Hefei, Anhui, China.
  2. Hao Gu: School of Artificial Intelligence and Big Data, Hefei University, Hefei, Anhui, China.
  3. ShengWei Ji: School of Artificial Intelligence and Big Data, Hefei University, Hefei, Anhui, China.

Abstract

Text clustering is the task of grouping text data by similarity, and it is particularly important in the medical field. In healthcare, medical data clustering is a highly active and effective research area: it provides strong support for making correct medical decisions from medical datasets and also aids patient record management and medical information retrieval. As the healthcare industry develops and generates ever larger volumes of medical data, traditional medical data clustering faces significant challenges. Many existing text clustering algorithms are built on the bag-of-words model, which suffers from high dimensionality and sparsity and ignores word positions and context. Pre-trained models, by contrast, are a deep learning-based approach that treats text as a sequence and therefore accurately captures word positions and contextual information. Moreover, compared with traditional K-means and fuzzy C-means clustering models, deep learning-based clustering algorithms handle high-dimensional, complex, and nonlinear data better. In particular, autoencoder-based clustering algorithms can jointly learn data representations and clustering information, effectively reducing noise interference and errors during the clustering process. This paper combines pre-trained language models with a deep embedding clustering model. Experimental results demonstrate that our model performs well on four public datasets, outperforming most existing text clustering algorithms, and that it can be applied to medical data clustering.
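The abstract describes the general recipe rather than a specific architecture, so the following PyTorch sketch is only an illustration of that recipe: documents are embedded with a pre-trained encoder, an autoencoder compresses the embeddings, and a DEC-style clustering layer refines soft assignments with a Student's t kernel and a KL-divergence loss. The layer sizes, optimizer settings, iteration counts, and the random stand-in embeddings below are assumptions for illustration, not details taken from the paper.

# DEC-style sketch: cluster pre-trained sentence embeddings with an
# autoencoder plus a soft-assignment clustering layer. All hyperparameters
# here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

class DEC(nn.Module):
    def __init__(self, in_dim=768, latent_dim=32, n_clusters=4, alpha=1.0):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        self.centers = nn.Parameter(torch.zeros(n_clusters, latent_dim))
        self.alpha = alpha  # degrees of freedom of the Student's t kernel

    def soft_assign(self, z):
        # Student's t similarity between latent points and cluster centers.
        dist = torch.cdist(z, self.centers) ** 2
        q = (1.0 + dist / self.alpha) ** (-(self.alpha + 1) / 2)
        return q / q.sum(dim=1, keepdim=True)

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z), self.soft_assign(z)

def target_distribution(q):
    # Sharpened target distribution that emphasizes confident assignments.
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

# Stand-in for pre-trained embeddings; in practice these would come from a
# BERT-style encoder (e.g., mean-pooled token states for each document).
x = torch.randn(512, 768)

model = DEC()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Phase 1: pre-train the autoencoder on reconstruction alone.
for _ in range(100):
    opt.zero_grad()
    _, recon, _ = model(x)
    F.mse_loss(recon, x).backward()
    opt.step()

# Initialize cluster centers with K-means on the learned latent codes.
with torch.no_grad():
    z = model.encoder(x)
km = KMeans(n_clusters=4, n_init=10).fit(z.numpy())
model.centers.data = torch.tensor(km.cluster_centers_, dtype=torch.float32)

# Phase 2: jointly refine encoder and centers with KL(P || Q) plus
# a reconstruction term to keep the latent space faithful to the input.
for _ in range(100):
    opt.zero_grad()
    _, recon, q = model(x)
    p = target_distribution(q).detach()
    loss = F.kl_div(q.log(), p, reduction="batchmean") + F.mse_loss(recon, x)
    loss.backward()
    opt.step()

with torch.no_grad():
    labels = model(x)[2].argmax(dim=1)  # final hard cluster assignments

In a real pipeline the random matrix x would be replaced by embeddings from a pre-trained language model, and the resulting labels would be scored against gold annotations with metrics such as clustering accuracy or NMI.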
