DNA promoter task-oriented dictionary mining and prediction model based on natural language technology.
Ruolei Zeng, Zihan Li, Jialu Li, Qingchuan Zhang
Author Information
Ruolei Zeng: Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, USA.
Zihan Li: National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China. 18811325239@163.com.
Jialu Li: National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China.
Qingchuan Zhang: National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China. zhangqingchuan@btbu.edu.cn.
Promoters are DNA sequences that initiate transcription and regulate gene expression. Precisely identifying promoter sites is crucial for deciphering gene expression patterns and the roles of gene regulatory networks. Recent work in bioinformatics has leveraged deep learning and natural language processing (NLP) to improve promoter prediction accuracy, notably with convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and BERT models. However, current approaches often rely on arbitrary DNA sequence segmentation during BERT pre-training, which may not yield optimal results. To overcome this limitation, this article introduces a novel DNA sequence segmentation method. The method builds a more refined dictionary for DNA sequences, uses it for BERT pre-training, and employs an Inception neural network as the foundational model. The resulting BERT-Inception architecture captures information across multiple granularities. Experimental results show that the model improves performance on several downstream tasks and adds interpretability to the deep learning model, providing new perspectives for interpreting and understanding DNA sequence information. The source code is available at https://github.com/katouMegumiH/Promoter_BERT.
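
To make the contrast between arbitrary and dictionary-based segmentation concrete, the following minimal Python sketch compares overlapping fixed-length k-mers with a greedy longest-match segmentation against a small vocabulary. It is an illustration only and is not the article's algorithm: the function names, the toy vocabulary, the `max_len` parameter, and the greedy matching rule are all assumptions introduced here, since the abstract does not specify how the dictionary is mined or applied.

```python
def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping fixed-length k-mers.

    This stands in for the 'arbitrary' segmentation mentioned above.
    """
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def dictionary_tokens(seq, vocab, max_len=6):
    """Greedy longest-match segmentation against a vocabulary.

    Hypothetical illustration: at each position, take the longest
    substring (up to max_len) found in vocab; fall back to a single
    nucleotide when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(seq):
        for length in range(min(max_len, len(seq) - i), 0, -1):
            piece = seq[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Toy vocabulary (assumed for illustration, not mined from data).
toy_vocab = {"TATAAT", "GC"}

fixed = kmer_tokens("TATAAT", k=3)
mined = dictionary_tokens("GCTATAATGC", toy_vocab)
print(fixed)
print(mined)
```

In this toy setting the dictionary-based segmentation keeps the motif `TATAAT` as a single token, whereas fixed 3-mers split it across several overlapping tokens. A vocabulary of such variable-length tokens could then serve as the token set for BERT pre-training.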