The 9th Big Data Forum for Life and Health Sciences & The 21st Asian Bioinformatics Consortium Symposium (2024/10/16 - 2024/10/18)

Biological research has entered the era of big data, spanning a wide variety of omics data and a broad range of health data. These data are generated at ever-growing rates and are distributed throughout the world under heterogeneous standards and with varying, often limited, access. However, the promise of translating big data into big knowledge can be realized only if the data are publicly shared. Open access to omics and health big data is therefore essential for that translation and is becoming increasingly vital for advancing scientific research, human healthcare, and precision medicine.
It is our great pleasure to announce that the 9th Big Data Forum for Life and Health Sciences & The 21st Asian Bioinformatics Consortium Symposium will be held on October 16-18, 2024. Several renowned biomedical data scientists have agreed to speak. You are likewise cordially invited to share your work and take part in this exciting event.

Download the conference notice

Registration form

Organizing Committee

Zhang Zhang (Chair, BIG, CAS)
Yiming Bao (BIG, CAS)
Wenming Zhao (BIG, CAS)
Jingfa Xiao (BIG, CAS)
Songnian Hu (Institute of Microbiology, CAS)
Jun Yu (BIG, CAS)
Jingchu Luo (Peking University)

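Medical Genomics and Bioinformatics Committee, China Medicine Education Association

Biological Data Resources Committee, Chinese Society for Bioinformatics (in preparation)

Beijing Bioinformatics Research Society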

Previous Conferences

Invited Speakers

Professor
National Center for Bioinformatics, Faculty of Biological Sciences, Quaid-i-Azam University
Pakistan
Professor
Head of Bioinformation & DDBJ Center
National Institute of Genetics
Japan
Dr.
Deputy Director
Center of Genomics and Bioinformatics, Academy of Sciences of Uzbekistan
Uzbekistan
Professor
China National Center for Bioinformation
China
Professor
Head of Protein Sequence Resources
EMBL-EBI
United Kingdom
Professor
South China University of Technology
China
Professor
Shandong First Medical University and Shandong Academy of Medical Sciences
China
Associate Professor
National Center for Protein Sciences (Beijing)
China
Professor
Nanjing University
China
Professor
International Human Phenome Institutes (Shanghai)
China
Professor
Computer Network Information Center
China
Dr.
Lead Scientist, Computational Biology Branch
NCBI, NLM, National Institutes of Health
USA
Professor
Institute of Biophysics, CAS
China
Associate Professor
Tsinghua University
China
Professor
Group leader
HUN-REN Research Center for Natural Sciences
Hungary
Professor
Shanghai Institute of Nutrition and Health
China
Professor
Principal Investigator
Korea Bioinformation Center, Korea Research Institute of Bioscience and Biotechnology
Korea
Professor
China National Center for Bioinformation
China
Associate Professor
China National Center for Bioinformation
China
Professor
National Institute of Genetics
Japan
Professor
Deputy Director
The Information Center of the Chinese CDC
China
Professor
Changping Laboratory
China
Senior Engineer
Institute of Computing Technology, Chinese Academy of Sciences
China
Professor
Deputy Director
Shanghai Institute of Virology, Shanghai Jiao Tong University
China
Professor
China National Center for Bioinformation
China
Associate Professor
Peking University
China
Professor
Tsinghua University
China
Professor
Fudan University
China
Ph.D. candidate
Peking University
China
Professor
The Institute of Automation of the Chinese Academy of Sciences
China
Professor
Beijing Institute of Basic Medical Sciences
China
Professor
China National Center for Bioinformation
China
Professor
Academy of Mathematics and Systems Science, Chinese Academy of Sciences
China
Professor
China National Center for Bioinformation
China
Professor
Tsinghua University
China
Professor
Westlake University
China
Professor
National Center for Protein Sciences (Beijing)
China

Agenda

10/16 Wednesday 09:00 ~ 09:10
Welcome and Opening Remarks

Zhang Zhang, China National Center for Bioinformation
10/16 Wednesday 09:10 ~ 12:00
Session 1: Data integration and curation, chaired by Cheng Chang and Shuhui Song
09:10 - 09:55
Keynote talk: AI Transforming Protein Family Classification [Abstract]

In this keynote, I will explore how we are witnessing a revolution in artificial intelligence approaches that is transforming molecular biology and computational biology. I'll discuss how recent advancements in AI, particularly in structural prediction models, are reshaping our understanding and classification of protein families. I will begin by examining the impact of high-accuracy structural models, such as AlphaFold, on our ability to completely and accurately classify protein domains. I'll show how these models have enabled us to refine, define, and classify protein domain families with unprecedented precision. Next, I'll delve into the capabilities of deep learning models like Google Research's ProtENN, which have significantly expanded our ability to identify distant homologues for known protein families. I will also discuss our ongoing collaboration with Google Research, where we're exploring video transformer technologies for annotating InterPro. I believe these advancements collectively represent the most significant progress in protein classification in three decades. I will then cover the emergence of Large Language Models (LLMs) such as ChatGPT and their potential to develop high-throughput tools for annotating proteins, non-coding RNAs, and families. I'll discuss the challenges we face in harnessing these models to write accurate and verifiable annotations at scale, including our strategies to mitigate hallucination issues. Finally, I will explore how these AI-driven approaches are bridging the gap between protein sequence and structure classification, potentially leading to a unified classification system in the near future. I'll conclude with a discussion on the broader implications of these advancements for the field of molecular biology and the exciting possibilities they present for future research and applications.

Alex Bateman, EMBL-EBI

09:55 - 10:20
Information Standard for the Interchange of Proteomic Datasets [Abstract]

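The rapid development of proteomics research has generated massive amounts of experimental data, and managing and sharing proteomics datasets effectively is one of the major challenges the field now faces. The difficulty in exchanging and sharing large-scale proteomics data lies in the lack of a standard information specification that allows datasets to be shared efficiently and quickly while still providing sufficient experimental and biological information. The Proteomics Standards Initiative (PSI) of the Human Proteome Organization (HUPO) is the most authoritative international body for developing and publishing data standards and information guidelines in proteomics. Since 2002, PSI has organized research and delivered results in five areas: minimum information guidelines for proteomics, data formats, controlled vocabularies, data resources, and analysis software. However, no standard has yet been established, in China or internationally, for exchanging mass spectrometry-based proteomics experimental datasets between laboratories and data platforms. In addition, reuse of public proteomics datasets is becoming increasingly popular, but it remains largely limited to benchmarking studies and applications related to peptide and protein identification. Broad reuse of public quantitative experiments is limited by the lack of sample and experimental metadata, which makes it impossible to link the data files in each dataset unambiguously to samples. This study established an exchange standard for mass spectrometry-based proteomics datasets, describing in detail the specifications for the dataset metadata, experimental metadata, and experimental data files involved in dataset exchange. It will be the first information standard for proteomics dataset exchange and can serve as a foundation for future development in the field, enabling high-quality publication, exchange, and sharing of proteomics experimental datasets and promoting their reuse.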

Yunping Zhu, National Center for Protein Sciences (Beijing) [Speaker introduction]

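Yunping Zhu is a professor and doctoral supervisor at the Institute of Lifeomics, Academy of Military Medical Sciences, Academy of Military Sciences, and PI of the Bioinformatics Laboratory at the National Center for Protein Sciences. Zhu previously served as vice chair of the Biomedical Information Technology Branch of the China Medicinal Biotechnology Association and as chief expert of a major project under the 863 Program. Research interests are proteome bioinformatics and biomedical big data mining. Zhu developed a series of proteome informatics algorithms and established iProX, a proteomics data sharing and publication platform recognized by the international academic community. Zhu has published more than 200 SCI papers and holds more than 30 software copyrights and 11 Chinese invention patents. Awards include the First Prize of the China Electronic Information Science and Technology Award, the First Prize of the China Association of Inventions Invention and Innovation Award, the National Science and Technology Progress Award for Innovation Teams, the First Prize of the Beijing Science and Technology Award, the First Prize of the Chinese Preventive Medicine Science and Technology Award, and the First Prize for Military Teaching Achievement.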

10:20 - 10:45
Group Photo & Tea Break

10:45 - 11:10
An augmented Asian reference panel boosts genotype imputation and the identification of rare variants for bone mineral density [Abstract]

We present the South and East Asian reference Database (SEAD) panel (https://imputationserver.westlake.edu.cn/), which comprises whole genome sequencing data from 11,067 individuals across 17 countries in Asia. The SEAD panel, which excludes singleton variants, consists of 22,134 haplotypes and 88,294,957 variants. The SEAD panel demonstrated higher accuracy than 1kGP, TOPMed, and ChinaMAP in South Asian populations. As the proportion of South Asian ancestry increased, the proportion of well-imputed low-frequency and rare variants imputed with the SEAD panel progressively increased, whereas the proportion imputed with the TOPMed panel significantly decreased. Additionally, when imputing East Asian populations, the SEAD panel showed concordance comparable to the ChinaMAP panel, while the TOPMed panel was inferior. Finally, we applied the augmented SEAD panel to conduct a discovery and replication genome-wide association study (GWAS) for hip and femoral neck (FN) bone mineral density (BMD) traits within the 5,369 genotyped samples of the Westlake BioBank for Chinese (WBBC). The single-variant test suggests that rare variants near the SNTG1 gene are associated with hip BMD (rs60103302, MAF = 0.0092, P = 1.67×10⁻⁷). The variant-set analysis also suggests an association of this gene (P_slide_window = 9.08×10⁻⁹, P_gene_centric = 5.27×10⁻⁸). The gene and variants reached a suggestive level for FN BMD. This gene has not been reported previously and can only be detected by using an Asian reference panel. A preliminary in vitro experiment demonstrated that the identified rare variant could upregulate SNTG1 expression, which in turn inhibits the proliferation and differentiation of preosteoblasts.

Houfeng Zheng, Westlake University [Speaker introduction]

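Houfeng Zheng is a distinguished research fellow and doctoral supervisor at the School of Life Sciences, Westlake University; chair of the Biomedical Big Data Committee of the Zhejiang Society for Mathematical Medicine; standing committee member of the Osteoporosis Branch of the China International Exchange and Promotive Association for Medical and Health Care; and a lifetime member of the International Chinese Musculoskeletal Research Society (ICMRS). Zheng completed an undergraduate degree in clinical medicine in 2004, received a doctorate in medicine in 2010, and was a postdoctoral fellow in human genetics at McGill University, Canada (2010-2014). Main research directions are: (1) genetics of complex diseases and traits: influencing factors and susceptibility genes for adolescent bone mineral density; associations between vitamin D and multiple complex diseases; influencing factors and susceptibility genes for obesity; influencing factors and susceptibility genes for osteoporosis and fracture. (2) Gut microbiota and complex diseases: gut microbial features of osteoporosis and metabolic syndrome. (3) Population genetics: mainly haplotype studies in Chinese people, the genetic structure of Chinese populations, and studies of Han Chinese surnames. Zheng has published 35 first-author and corresponding-author papers in journals including Nature, Nature Genetics, and BMJ, and has given oral presentations at the American Society of Human Genetics annual meeting and the European Calcified Tissue Society. Zheng reviews for SCI-indexed journals including BMJ, Sci Transl Med, Genome Biol, and Nat Commun, for the American Society of Human Genetics international meeting, and for the National Natural Science Foundation of China. Zheng has led one grant from the Canadian Institutes of Health Research (CIHR), three grants from the National Natural Science Foundation of China, and one Distinguished Young Scholar project of the Zhejiang Provincial Natural Science Foundation.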

11:10 - 11:35
Phenome-wide association findings from China Kadoorie Biobank (CKB) [Abstract]

The China Kadoorie Biobank (CKB) is one of the world's largest prospective cohort studies. It aims to assess established and emerging risk factors for chronic diseases, understand the interplay between genes, lifestyle, and environmental factors, and identify new biological mechanisms linking risk factors to major chronic diseases for the development of new treatments. Based on data from over 0.5 million participants in the China Kadoorie Biobank (CKB) across 10 regions, our goals are to: 1. Examine the relationship between marital status and health risks among Chinese adults, and assess how marital status is linked to the risk of specific diseases and mortality in the population. 2. Investigate the association between parenthood and health risks among Chinese adults, and evaluate how parenthood relates to the risk of specific diseases and mortality in the population. 3. Study the impact of marital status on mortality rates in the Chinese population from 2021 to 2040.

Dianjianyi Sun, Peking University [Speaker introduction]

Dianjianyi Sun, Ph.D., is an Associate Professor at the Department of Epidemiology and Biostatistics, School of Public Health, Peking University. He is the director of the national co-ordinating center of China Kadoorie Biobank (CKB). His research interests are mainly in the epidemiology of cardiometabolic diseases, multi-omics analysis, aging and life-course research. He has published over 240 papers (H-index 33).

11:35 - 12:00
Plant Regulome: From Big Data, AI Models to the Cis-Regulatory Code [Abstract]

Plant genomes are enriched with complex and evolutionarily diverse non-coding regulatory sequences that follow unknown cis-regulatory codes. In the last decade, high-throughput sequencing technologies have amassed vast regulomic data across various model plants. Here, we developed ChIP-Hub, a comprehensive database that integrates over 16,000 publicly available datasets from nearly 600 publications on model plant species. Using this extensive dataset, we systematically annotated tissue- and cell-specific regulatory elements, including promoters and enhancers, and constructed corresponding gene regulatory networks. Additionally, comparative genomics approaches were employed to explore the functional conservation of tissue-specific regulatory elements and chromatin states across species. We also developed the SeiPlant model, leveraging dilated convolutions with residual connections to capture both linear and nonlinear sequence features, along with spatial basis function layers to enhance model scalability. This AI framework effectively integrates large-scale plant regulomic data, enabling cross-species prediction of regulatory elements based solely on genomic sequences. The model helps us understand the molecular mechanisms of gene regulation, how mutations in cis-regulatory sequences alter their function, and consequently how they impact plant traits. This understanding is ultimately crucial for designing synthetic regulatory sequences to enhance crop performance.

Dijun Chen, Nanjing University [Speaker introduction]

Dr. Dijun Chen is an Associate Professor of Bioinformatics at the School of Life Sciences, Nanjing University. He also serves as the Vice Chairman of the Biology Department. Dr. Chen received his B.S. degree in Bioinformatics from Harbin Medical University and his PhD from the Martin Luther University of Halle-Wittenberg, Germany. He completed postdoctoral research at the University of Potsdam and Humboldt University. His research group focuses on investigating the basic molecular principles and evolutionary diversity of gene regulation in development, growth, and disease in eukaryotic organisms using large-scale computational regulatory genomics, statistical, and machine learning methods (AI). He has published over 70 academic papers in Science Advances, Nature Communications (6 papers), Nature Plants, Nature Neuroscience and so on.

10/16 Wednesday 13:30 ~ 15:55
Session 2: Computational methods and AI models, chaired by Xin Li and Zhihua Zhang
13:30 - 14:15
Keynote talk: From big data, large foundation models to digital life

Xuegong Zhang, Tsinghua University

14:15 - 14:40
AI-empowered design of synthetic gene regulatory sequences

Xiaowo Wang, Tsinghua University

14:40 - 15:05
Towards digital cell biology: a new artificial intelligence-driven paradigm

Ge Yang, The Institute of Automation of the Chinese Academy of Sciences

15:05 - 15:30
AI-based large-scale and accurate de novo peptide sequencing [Abstract]

As the mainstream high-throughput method to identify protein sequences, tandem mass spectrometry (MS) plays an important role in proteomics research by generating mass spectra (MS1, MS2) and then analyzing the corresponding peptide sequences. However, current popular methods based on database search are limited by the reference protein sequence database and cannot identify protein sequences outside of the database, making them unsuitable for specific research areas such as neoantigen discovery, antibody design, and vaccine development. De novo peptide sequencing is a promising method to address these shortcomings. In recent years, the rapid development of AI techniques such as large language models has revolutionized many fields, including proteomics research. Here, we introduce π-PrimeNovo, a non-autoregressive Transformer-based deep learning model designed to perform accurate and efficient de novo peptide sequencing. With the proposed novel architecture, π-PrimeNovo achieves significantly higher accuracy and up to 69x faster sequencing compared to the state-of-the-art methods. This remarkable speed makes it highly suitable for computation-intensive peptide sequencing tasks such as metaproteomic research, where π-PrimeNovo efficiently identifies microbial species-specific peptides. Moreover, π-PrimeNovo has been shown to accurately mine phosphopeptides in a non-enriched phosphoproteomic dataset, offering an alternative way to detect low-abundance post-translational modifications (PTMs). We suggest that this work not only advances the development of peptide sequencing techniques but also introduces a transformative computational model with wide-ranging implications for biological research. Key words: computational proteomics, artificial intelligence, de novo peptide sequencing

Cheng Chang, National Center for Protein Sciences (Beijing) [Speaker introduction]

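Cheng Chang is a professor at the Institute of Lifeomics, Academy of Military Medical Sciences, Academy of Military Sciences, and a member of the Biological Data Resources Committee of the Chinese Society for Bioinformatics (in preparation). Chang has long worked on computational proteomics and bioinformatics and has developed workflows, algorithms, and software, with independent intellectual property rights, for the precise analysis of large-scale proteomics data. Chang received a bachelor's degree in engineering (electronic information engineering) from Hunan University in 2010 and a Ph.D. in science from the Academy of Military Medical Sciences in 2015. Chang received the First Prize of the 2020 China Association of Inventions Invention and Innovation Award (ranked 4th) and has published multiple SCI papers as first or corresponding author (including co-authorship in those roles) in journals such as Nature Communications, Nucleic Acids Research, and Briefings in Bioinformatics (H-index 20).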

15:30 - 15:55
DeLoop: a deep learning model for chromatin loop prediction from sparse ATAC-seq data

Zhihua Zhang, China National Center for Bioinformation

15:55 - 16:10 Tea Break
10/16 Wednesday 16:10 ~ 17:50
ABC Session: Resources of Asian Biodata Centers, chaired by Yiming Bao
16:10 - 16:35
Personal genomes and privacy: perspectives from a public repository [Abstract]

Every sovereign state intends to impose its own regulations on the access to and sharing of personal genomes. Ultimately, the authority to control shall rest on our comprehension of privacy and fundamental human rights. Since the present definition of privacy was developed over a century ago, we must rethink it in the context of the genomic era. The issue is related to the current dispute about the access and benefit sharing of genetic information in the Convention on Biological Diversity. I will outline key areas of discussion on this subject and propose restorative and reparative measures for possible harmful accidents in the future.

Masanori Arita, National Institute of Genetics

16:35 - 17:00
DDBJ updates in 2024

Yasukazu Nakamura, National Institute of Genetics

17:00 - 17:25
Introduction to the Korea BioData Station (K-BDS) for sharing biological data [Abstract]

The large variety of data available from biological research is a rich resource that can be used for innovative endeavors. However, we are facing considerable challenges in big data deposition, integration, and translation due to the complexity of biological data and its production at unprecedented exponential rates. To address these problems, in 2020, the Korean government officially announced a national strategy to collect and manage the biological data produced through national R&D fund allocations and provide the collected data to researchers. To this end, the Korea Bioinformation Center (KOBIC) developed a new biological data repository, the Korea BioData Station (K-BDS), for sharing data from individual researchers and research programs to create a data-driven biological study environment. The K-BDS is dedicated to providing free open access to a suite of featured data resources in support of worldwide activities.

Seungwoo Hwang, Korea Bioinformation Center, Korea Research Institute of Bioscience and Biotechnology

17:25 - 17:50
Database Resources of National Genomics Data Center, China National Center for Bioinformation

Yiming Bao, China National Center for Bioinformation

10/17 Thursday 09:00 ~ 11:40
Session 3: Genomics for precision medicine, chaired by Guohui Ding and Peilin Jia
09:00 - 09:45
Keynote talk: Discovery and validation of new diagnostic, predictive, and prognostic biomarkers in cancer research

Balázs Győrffy, HUN-REN Research Center for Natural Sciences

09:45 - 10:10
Population wellness and multiomics analysis of CAS cohort

Changqing Zeng, China National Center for Bioinformation

10:10 - 10:35
RNA-RNA Interactions of Virus Genome

Tao Huang, Shanghai Institute of Nutrition and Health

10:35 - 10:50 Tea Break
10:50 - 11:15
Multi-trait analysis identifies novel pleiotropic loci of asthma in East Asia populations

Xiao Chang, Shandong First Medical University and Shandong Academy of Medical Sciences [Speaker introduction]

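Xiao Chang is a high-level top talent of Shandong First Medical University, a professor at its School of Medical Artificial Intelligence and Big Data, head of the Translational Genomics research group, a Taishan Scholar Young Expert of Shandong Province, and a reviewer for the national young talent program. Chang received a Ph.D. from the Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, and previously worked at the University of Southern California and the Children's Hospital of Philadelphia. Chang's current research focuses on human genetic diseases and bioinformatics, with representative first-author or corresponding-author papers in Cell Research, Nature Communications, The Innovation, Journal of the National Cancer Institute (two papers), and Journal of Allergy and Clinical Immunology (three papers).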

11:15 - 11:40
Medical high throughput gene sequencing data sharing requirements [Abstract]

High-throughput sequencing technology has been widely employed in medical research and practice. However, how to share high-throughput gene sequencing data securely and use it effectively remains an open problem. Based on our own practice and in collaboration with several domestic institutions, we have put forward a set of standard solutions covering the definition of terms, basic principles, shared content, scenarios, and process management. We worked with regulatory authorities, medical institutions, enterprises, and others to verify the applicability of this standard solution. The scheme offers an important reference for sharing similar biological data.

Guohui Ding, International Human Phenome Institutes (Shanghai) [Speaker introduction]

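Dr. Guohui Ding is a professor and executive director of the Institute for Digital Health at the International Human Phenome Institutes (Shanghai). Dr. Ding's research covers biomedical informatics, secure and trusted sharing of biomedical data, next-generation biomedical data infrastructure, and comprehensive health. Dr. Ding has published more than 80 papers in authoritative journals in China and abroad, 28 of them as first or corresponding author; applied for 37 invention patents; obtained more than 30 software copyrights; served as lead translator of 4 professional books; taken part in developing 27 standards at various levels (9 of them national or international standards); undertaken 19 research and industry-academia translation projects in China and abroad; and completed one Shanghai innovative medical device application. Honors include the Chinese Academy of Sciences President's Award (Excellence Prize), the Sanofi Young Scientist Award, the Shanghai Science and Technology System May Fourth Youth Award, the Shanghai Natural Science Award (First Prize), membership of the Youth Innovation Promotion Association of the Chinese Academy of Sciences, Anhui Province Leading Talent in Innovation and Entrepreneurship (Innovation), the Shanghai Medical Science and Technology Award (Third Prize), and the National Maternal and Child Health Science and Technology Award (First Prize). Dr. Ding is an important participant in standards for biomedical data and bio-digital convergence: secretary-general of the Shanghai Society for Bioinformatics, Chinese expert in the International Electrotechnical Commission (IEC) working group on bio-digital convergence systems, member of the biometrics subcommittee of the National Information Technology Standardization Committee and deputy head of its genomic identification working group, and member of the national biological phenotype standards working group. Dr. Ding has led or taken part in developing China's series of standards on biomedical data security. In addition, Dr. Ding is deputy secretary-general of the Precision Medicine Committee of the Shanghai Biopharmaceutical Industry Association, and vice chair and secretary-general of the Precision Medicine Committee of the Chinese Maternal and Child Health Association.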

10/17 Thursday 13:30 ~ 16:50
Session 4: Youth Innovation Promotion Association CAS, chaired by Jingyao Zeng and Shuai Jiang
13:30 - 13:55
Genetic Diversity and the Pangenome Reference 2.0 of Chinese Populations [Abstract]

China is populated by 1.4 billion people of multiple ethnic groups with high cultural and linguistic diversity. The Chinese Pangenome Consortium (CPC) aims to produce high-quality genomic sequences from people representative of the majority ethnic group (Han) and the 55 other defined ethnic groups of China, as well as of multiple unrecognized ethnic groups not yet characterized genetically. In Phase II (CPC2), apart from the collection of 116 high-quality genome assemblies from 58 core individuals, CPC aims to produce high-quality, phased, chromosome-level haplotype sequences of 500 individuals representing diverse populations in China. We expect CPC2, as part of global efforts in human genomics, to make a considerable contribution towards the building of high-quality pangenome references and their application to various basic and clinical research questions. In particular, the CPC data have the potential for tracing missing links in our understanding of human genetic evolution and are valuable for explaining the heritability of complex diseases that is not explained by known variants.

Shuhua Xu, Fudan University [Speaker introduction]

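Shuhua Xu is a distinguished professor at the School of Life Sciences, Fudan University, and a Ten Thousand Talents Program awardee and research professor. Xu received a Ph.D. from Fudan University in 2006 and in the same year joined the CAS-MPG Partner Institute for Computational Biology, serving successively as assistant, associate, and full research professor. Xu established the Population Genomics Group (PGG) in 2009; headed a young scientist group jointly supported by the Max Planck Society and the Chinese Academy of Sciences from 2012 to 2018; became an adjunct distinguished professor at ShanghaiTech University in 2013; was appointed a distinguished research fellow (core member) of the Chinese Academy of Sciences and received the National Science Fund for Distinguished Young Scholars in 2015; was selected as a Shanghai Outstanding Academic Leader in 2016; became an adjunct distinguished research fellow at the CAS Center for Excellence in Animal Evolution and Genetics and was selected as a Ministry of Science and Technology Young and Middle-aged Leading Talent in Science and Technology Innovation in 2018; and received a Royal Society Newton Advanced Fellowship in 2019. In July 2021 Xu joined Fudan University as a distinguished professor and established the Human Population Omics laboratory (POG). Xu's main research field is human population genomics. Xu also serves as a council member of the Genetics Society of China, a standing council member of the Shanghai Society of Genetics, and a council member of the Shanghai Anthropological Society, and holds editorial roles at several SCI journals, including co-editor-in-chief of Molecular Genetics and Genomics and editorial positions at Scientific Reports, Human Genomics, Hereditas, and BMC Genomic Data (Senior Editor).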

13:55 - 14:20
Data-driven AI for Science : Methodological Approaches and Advances [Abstract]

Artificial intelligence methods that assist scientific research (AI for Science) have become an important research trend, and scientific data plays an indispensable role in them. For instance, AlphaFold has predicted protein structures with extremely high accuracy and has greatly improved research efficiency in structural biology; its success is inseparable from high-quality scientific data accumulated over decades, such as the PDB. This report explores data-driven innovation in AI4Science. Building on the speaker's technical foundation in big data and knowledge graph technology, combined with the capabilities of large language models, and targeting research needs typified by the interdisciplinary field of microbial nanomaterials, it introduces methods for constructing high-quality scientific datasets and attempts at AI4Science based on such data.

Yi Du, Computer Network Information Center [Speaker introduction]

Yi Du, Ph.D., is a Principal Investigator of the National Science Fund for Distinguished Young Scholars, Head of the Key Research and Development Program for Young Scientists, a Beijing Science and Technology Rising Star, and a member of the Youth Innovation Promotion Association of the Chinese Academy of Sciences. He received his Ph.D. from the Institute of Software (IOS) at the Chinese Academy of Sciences and has worked at the Computer Network Information Center (CNIC) of the Chinese Academy of Sciences since 2013. He currently serves as the Deputy Director of the Big Data Technology and Application Development Department, focusing on research in big data knowledge graphs and AI for Science (AI4S). He has undertaken more than ten research projects, published 70 academic papers, and been granted 30 invention patents. The projects he leads have provided services to various national ministries and commissions as well as bureaus of the Chinese Academy of Sciences, supporting several significant national missions.

14:20 - 14:45
Epigenetic Mechanisms Regulating Human Zygotic Genome Activation [Abstract]

After fertilization, most genes in the human genome are transcriptionally silent until the 8-cell stage, at which point the zygotic genome begins to be extensively transcribed. This process, known as zygotic genome activation (ZGA), is crucial for the further development of early human embryos. However, how the human zygotic genome is activated is still elusive. To explore the underlying mechanisms of human ZGA, we investigated the epigenetic landscapes in early human embryos, including chromatin accessibility and histone modifications. We uncovered that OCT4 is essential for human ZGA, making it the first transcription factor identified to regulate this process. Unexpectedly, we observed widespread non-canonical broad H3K27ac domains in human embryos prior to ZGA. The transition from these non-canonical broad H3K27ac domains to typical narrow H3K27ac peaks is associated with ZGA. Furthermore, histone deacetylases are critical for this transition and for ZGA. We also compared the contribution of parental genomes to ZGA using human parthenogenetic and androgenetic haploid embryos. Our results indicate that human ZGA is initiated from the paternal genome. In contrast, ZGA occurs simultaneously from both paternal and maternal genomes in mice. Mechanistically, a primate-specific transcription factor, ZNF675, with paternally biased chromatin accessibility and gene expression, is essential for human ZGA. Taken together, our studies provide insight into the regulatory mechanisms of human ZGA.

Lei Gao, Institute of Biophysics, CAS

14:45 - 15:00 Tea Break
15:00 - 15:25
Common optimization of system software for heterogeneous hardware

En Shao, Institute of Computing Technology, Chinese Academy of Sciences [Speaker introduction]

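En Shao is a senior engineer and master's supervisor at the Institute of Computing Technology, Chinese Academy of Sciences, a member of the Youth Innovation Promotion Association of the Chinese Academy of Sciences, a CCF senior member, and a Beijing Science and Technology Rising Star. Shao received a Ph.D. from the University of Chinese Academy of Sciences. In high-performance computer development, Shao is mainly responsible for developing cluster operating systems, with technical focus areas including computer architecture and supercomputing system software. Shao has led projects funded by the National Natural Science Foundation of China, the National Key Research and Development Program (Young Scientists Project), the CCF-Baidu Songguo Fund, and the CCF-Ant Research Fund.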

15:25 - 15:50
Genomic Epidemiology of Carbapenemase-producing Klebsiella pneumoniae in China [Abstract]

The rapid spread of carbapenemase-producing Klebsiella pneumoniae (cpKP) poses serious threats to public health; however, the underlying genetic basis for its dissemination is still unknown. We conducted a comprehensive genomic epidemiology analysis on 420 cpKP isolates collected from 70 hospitals in 24 provinces/autonomous regions/municipalities of China during 2009–2017 by short-/long-read sequencing. The results showed that most cpKP isolates were categorized into clonal group 258 (CG258), in which ST11 was the dominant clone. Phylogenetic analysis revealed three major clades including the top one of Clade 3 for CG258 cpKP isolates. Additionally, carbapenemase gene analysis indicated that blaKPC was dominant in the cpKP isolates, and most blaKPC genes were located in five major incompatibility (Inc) groups of blaKPC-harboring plasmids. Importantly, three advantageous combinations of host–blaKPC-carrying plasmid (Clade 3.1+3.2–IncFIIpHN7A8, Clade 3.1+3.2–IncFIIpHN7A8:IncR, and Clade 3.3–IncFIIpHN7A8:IncpA1763-KPC) were identified to confer cpKP isolates the advantages in both genotypes (strong correlation/coevolution) and phenotypes (resistance/growth/competition) to facilitate the nationwide spread of ST11/CG258 cpKP. Intriguingly, Bayesian skyline analysis illustrated that the three advantageous combinations might be directly associated with the strong population expansion during 2007–2008 and subsequent maintenance of the population of ST11/CG258 cpKP after 2008. We then examined drug resistance profiles of these cpKP isolates and proposed combination treatment regimens for CG258/non-CG258 cpKP infections. Thus, the findings of our systematic analysis shed light on the molecular epidemiology and genetic basis for the dissemination of ST11/CG258 cpKP in China, and much emphasis should be given to the close monitoring of advantageous cpKP–plasmid combinations.

Cuidan Li, China National Center for Bioinformation

15:50 - 16:50
Lightning Talks

10/18 Friday 09:00 ~ 12:05
Session 5: RNA (Single-cell) omics, chaired by Xiaomin Ying and Lina Ma
09:00 - 09:25
Computational immunology empowered by single-cell spatial omics and AI [Abstract]

Dissecting the immune responses underlying human diseases is difficult with traditional experimental approaches. The advent of single-cell spatial sequencing techniques enables digitalization of clinical samples, opening the door to in silico investigation of human immune mechanisms directly from clinical samples. However, due to physical and chemical limitations, current single-cell spatial sequencing techniques are imperfect. We develop a series of AI-based algorithms to synthesize single-cell, spatially-resolved, and transcriptome-wide digitalized clinical samples, which enables the generation of immunological hypotheses of human diseases in silico, exemplified by cancers and infectious diseases.

Xianwen Ren, Changping Laboratory

09:25 - 09:50
UDA-seq: universal droplet microfluidics-based combinatorial indexing for massive-scale multimodal single-cell sequencing [Abstract]

Single-cell combinatorial indexing sequencing via droplet microfluidics is an attractive approach for balancing cost, scalability, robustness, and accessibility. Nevertheless, existing methods require tailored protocols for individual modalities, which may constrain their potential for automation and hinder clinical applications. Here we introduce UDA-seq, a universal workflow that integrates a straightforward post-indexing step to enhance throughput and systematically adapt existing droplet-based single-cell multimodal methods. UDA-seq was benchmarked across various tissue and cell types, successfully enabling several prevalent multimodal tasks, including single-cell co-assay of RNA and VDJ, RNA and chromatin, and RNA and CRISPR perturbation. Notably, UDA-seq facilitated the efficient generation of over 100,000 high-quality single-cell profiles from three dozen frozen clinical biopsy specimens in a single-channel droplet microfluidics experiment. Subsequent analysis confirmed the power of our approach in identifying rare cell subpopulations associated with clinical phenotypes and probing the vulnerability of cancer cells.

Lan Jiang, China National Center for Bioinformation [Speaker introduction]

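Professor Lan Jiang has long worked on epigenetics and single-cell multi-omics. Jiang has undertaken a project of the national overseas high-level talent program, a sub-project of the National Key Research and Development Program, a sub-project of the Chinese Academy of Sciences Strategic Priority Research Program (Category B), and projects of the National Natural Science Foundation of China. Achievements include developing single-cell omics technologies and computational methods, establishing the new direction of histone modification-mediated imprinted genes, and reporting the patterns of transgenerational inheritance of DNA methylation in early vertebrate embryos. Awards include the Chinese Academy of Sciences President's Special Award, the Chinese Academy of Sciences Top 100 Outstanding Doctoral Dissertations, the Ray Wu Prize, the Charles A. King Trust Fellowship (USA), and the NIH K99 Award. Jiang has published 28 papers, including 13 as first or corresponding author (including co-authorship in those roles) in journals such as Cell, Nature, Genome Biology, and Advanced Science.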

09:50 - 10:15
Exploring the uncharted potential of noncanonical small RNAs and RNA modifications [Abstract]

Emerging evidence shows that paternally acquired phenotypes (e.g., metabolic disorders) induced by environmental stressors can be memorized by sperm beyond the DNA sequence, encoded in sperm RNAs and RNA modifications as a 'sperm RNA code'. However, RNA modifications present significant challenges in constructing complementary DNA libraries, thereby impeding the detection of highly modified small non-coding RNAs. To address this limitation, we developed PANDORA-seq, a novel sequencing technology designed to bypass the interference caused by RNA modifications. Using PANDORA-seq, we characterized the profiles of tRNA-derived small RNAs (tsRNAs) and rRNA-derived small RNAs (rsRNAs), identifying them as potential key mediators in the transmission of paternally acquired information to offspring.

Junchao Shi, China National Center for Bioinformation [Speaker introduction]

Dr. Junchao Shi is a principal investigator at the Beijing Institute of Genomics, Chinese Academy of Sciences, and the China National Center for Bioinformation. He earned his Ph.D. in Genetics, Genomics, and Bioinformatics from the University of California, Riverside. Dr. Shi's research focuses on developing non-coding RNA sequencing technologies, analytical tools, and databases. He also works on elucidating the biogenesis and functional mechanisms of non-coding RNAs. Additionally, he is dedicated to the development of biomolecular markers for clinical disease diagnosis and the design of small-molecule drugs based on non-coding RNAs and their modifications.

10:15 - 10:40
Intelligent spatial transcriptomics: methods and applications [Abstract]

Technological advances in spatial transcriptomics are critical for better understanding the structures and functions of tissues in biological research. The combination of intelligent or statistical algorithms with spatial transcriptomics has paved the way for deciphering tissue architecture. We have worked to advance intelligent spatial transcriptomics and developed a family of tools whose names begin with "STA". First, we created STAGATE, a graph attention auto-encoder that identifies spatial domains by learning low-dimensional latent embeddings that integrate spatial information and gene expression profiles. Second, we introduced STAligner for integrating and aligning ST datasets across different conditions, technologies, and developmental stages to enable spatially aware data integration, simultaneous spatial domain identification, and downstream comparative analysis. Third, we designed STAMarker for identifying spatial domain-specific variable genes with saliency maps in deep learning. Fourth, we developed STAGE, a spatial location-supervised auto-encoder generator for generating high-density spatial transcriptomics. Fifth, we developed STASCAN for deciphering fine-resolution cell-distribution maps in spatial transcriptomics.

Shihua Zhang, Academy of Mathematics and Systems Science, Chinese Academy of Sciences [Speaker introduction]

Shihua Zhang received a Ph.D. degree in applied mathematics and bioinformatics from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences in 2008. He joined the same institute as an Assistant Professor in 2008 and is currently a Professor. His research interests are mainly in bioinformatics and computational biology, machine learning, and deep learning. His main papers have been published in top journals including Cell, National Computational Biology, Nature Communications, Advanced Science, Genome Research, National Science Review, Cell Reports, Nucleic Acids Research, JMLR, IEEE TPAMI, IEEE TKDE, and IEEE TNNLS. His papers were selected among the “Top Ten Bioinformatics Advances in China” in both 2021 and 2022. He has won the Youth Science and Technology Award of China, the Ten Thousand Talent Program awards for Leading Talents (in Scientific and Technological Innovation) and Young Top-notch Talent, the NSFC Excellent Youth Fund, the CAS Outstanding Young Investigator Program, the Lu Jia-Xi Young Investigator Award, the Zhongchuang Software Talent Award, and the National Excellent Ph.D. Thesis Award, among others. He now serves as a Section Editor of PLOS Computational Biology and an Editor for Genomics, Proteomics & Bioinformatics, among other journals.

10:40 - 10:50 Tea Break
10:50 - 11:15
Deciphering the tumor cell state dynamics under drug treatment by a time-context graph model [Abstract]

Gemcitabine is commonly used to treat pancreatic ductal adenocarcinoma (PDAC), one of the most lethal cancer types. However, drug resistance is a critical challenge for improving PDAC chemotherapy. Here, we applied single-cell RNA sequencing (scRNA-seq) to PDAC patient-derived xenograft (PDX) models to study the complex cellular responses related to gemcitabine resistance. To reconstruct dynamic tumor cell responses from these static scRNA-seq snapshots, we proposed scConGraph, a scalable bi-layer graph model that can efficiently integrate cross-time context information. Based on scConGraph, we observed that stemness and endoplasmic reticulum stress contribute to intrinsic resistance. As for acquired resistance, cancer cells may resist or evade gemcitabine treatment by activating the cell cycle, entering quiescence, or inducing epithelial-mesenchymal transition. Notably, GDF15 exhibited recurrent and significant upregulation among acquired-resistance cell subpopulations. Experimental validation confirmed that inhibiting GDF15 sensitizes tumor cells to gemcitabine, suggesting a potential target for overcoming gemcitabine-induced chemoresistance.

Jin Gu, Tsinghua University [Speaker introduction]

Jin Gu, Ph.D., is an Associate Professor at the Institute of Information Processing, Department of Automation, Tsinghua University. His lab aims to develop machine learning methods to explore big biomedical data, especially diverse omics data, for precision and personalized medicine. Recently, he has been interested in finding biomarkers and therapeutic targets and in building predictive models of cancer using spatiotemporal omics. He is an editorial board member of Genomics, Proteomics & Bioinformatics and BMC Bioinformatics, and serves as a member of several professional committees of CAA and CAAI.

11:15 - 11:40
Pan-cancer single-cell dissection reveals phenotypically distinct B cell subtypes [Abstract]

Characterizing the compositional and phenotypic characteristics of tumor-infiltrating B cells (TIBs) is important for advancing our understanding of their role in cancer development. Here, we establish a comprehensive resource of human B cells by integrating single-cell RNA sequencing data of B cells from 649 patients across 19 major cancer types. We demonstrate substantial heterogeneity in their total abundance and subtype composition and observe immunoglobulin G (IgG)-skewness of antibody-secreting cell isotypes. Moreover, we identify stress-response memory B cells and tumor-associated atypical B cells (TAABs), two tumor-enriched subpopulations with prognostic potential, shared in a pan-cancer manner. In particular, TAABs, characterized by a high clonal expansion level and proliferative capacity as well as by close interactions with activated CD4 T cells in tumors, are predictive of immunotherapy response. Our integrative resource depicts distinct clinically relevant TIB subsets, laying a foundation for further exploration of functional commonality and diversity of B cells in cancer.

Yu Yang, Peking University [Speaker introduction]

Yu Yang is a PhD candidate in Prof. Zemin Zhang's lab at the Biomedical Pioneering Innovation Center (BIOPIC), Peking University, China.

11:40 - 12:05
Mosaic and continual integration of single-cell multimodal data [Abstract]

Integrating single-cell datasets produced by multiple omics technologies is essential for defining cellular heterogeneity. Mosaic integration, in which different datasets share only some of the measured modalities, poses major challenges, particularly regarding modality alignment and batch effect removal. Here, we present a deep probabilistic framework for the mosaic integration and knowledge transfer (MIDAS) of single-cell multimodal data. MIDAS simultaneously achieves dimensionality reduction, imputation and batch correction of mosaic data by using self-supervised modality alignment and information-theoretic latent disentanglement. We demonstrate its superiority to 19 other methods and its reliability by evaluating its performance in trimodal and mosaic integration tasks. We also constructed a single-cell trimodal atlas of human peripheral blood mononuclear cells and tailored transfer learning and reciprocal reference mapping schemes to enable flexible and accurate knowledge transfer from the atlas to new data. Applications in mosaic integration, pseudotime analysis and cross-tissue knowledge transfer on bone marrow mosaic datasets demonstrate the versatility and superiority of MIDAS. With the rapid growth of single-cell multimodal data, single-cell multimodal atlases require frequent updates to provide comprehensive knowledge. Updating an atlas conventionally requires reintegrating all data, which is computationally intensive and makes frequent updates infeasible. To address these challenges, we present Multimodal Integration with Continual Learning (MIRACLE), a novel online learning framework for adaptive and efficient integration of single-cell multimodal data. MIRACLE uses dynamic architecture and data rehearsal strategies to support continual learning, integrating diverse data while minimizing information loss over time. Our evaluations show that MIRACLE achieves accurate online integration with reduced computational requirements, effectively updates and expands atlases with new cross-tissue and cross-modal data, and precisely identifies novel cell types and transfers labels across datasets.

Xiaomin Ying, Beijing Institute of Basic Medical Sciences [Speaker introduction]

Dr. Xiaomin Ying is a professor at the Beijing Institute of Basic Medical Sciences. She received her B.S. and Ph.D. degrees from the National University of Defense Technology in 1997 and 2003, respectively. Her current research interests include interdisciplinary research in AI and biology, especially the development of algorithms and tools for multimodal biological data integration and its applications. She has published more than 70 papers. Her representative papers have been published in top journals including Nature Biotechnology, Gut, and Information Fusion.

10/18 Friday 13:30 ~ 16:35
Session 6: Biodiversity and Health Genomics, chaired by Weifeng Shi and Minxian Wang
13:30 - 14:15
Keynote talk: COG database update: principles and challenges

Michael Galperin, NCBI, NLM, National Institutes of Health

14:15 - 14:40
Multimodal Spatial Omics Integration Analysis [Abstract]

Compared to single omics, multi-omics provides a more comprehensive perspective and multi-dimensional analysis. By integrating various data such as genomics, transcriptomics, proteomics, metabolomics, and imaging omics, we can overcome the limitations of single omics, reveal complex interactions and hierarchical regulatory mechanisms in biological systems, and thus gain a more comprehensive understanding of biological processes. This talk will focus on key technologies and applications of multimodal spatial omics integration analysis. Multi-omics data often exhibit strong heterogeneity, high dimensionality, batch effects, and incompleteness, all of which pose new computational challenges. To address these, we have developed multimodal generative models that learn unified representations and perform cross-modal generation for heterogeneous multi-omics data. We use unsupervised learning and correlation analysis methods to discover intrinsic connections between omics layers and to identify key biomarkers and signaling pathway networks associated with clinical phenotypes and treatment responses, providing in-depth and accurate panoramic analysis for personalized cancer diagnosis and treatment.

Hongmin Cai, South China University of Technology [Speaker introduction]

Hongmin Cai is the Executive Dean and Professor of the School of Future Technology at South China University of Technology, and a doctoral supervisor. He is a recipient of the National Science Fund for Distinguished Young Scholars. Currently, he serves as the Vice Chairman of the Bioinformatics and Artificial Life Special Committee of the Chinese Association for Artificial Intelligence, a Standing Committee Member of the Bioinformatics Professional Committee of the China Computer Federation, and an overseas fund review expert for the French National Institute of Health and Medical Research (INSERM). He is an IET Fellow and IEEE Senior Member. He serves on the editorial boards of international journals such as IEEE Transactions on Emerging Topics in Computational Intelligence and Fundamental Research. Professor Cai has long been engaged in artificial intelligence for biomedical data analysis. His research findings have been published in journals such as IEEE Transactions on Pattern Analysis and Machine Intelligence, Nature Communications, and IEEE Transactions on Medical Imaging. He has led over 10 projects, including the National Science Fund for Distinguished Young Scholars, National Natural Science Foundation of China Joint Fund, and Ministry of Science and Technology International Cooperation projects. He has been granted 20 patents. His achievements have been recognized with awards such as the First Prize in Natural Science from the Guangdong Artificial Intelligence Industry Association, the Gold Award in the "Internet+" Innovation and Entrepreneurship Competition, and the Second Prize in Guangdong Education and Teaching Achievements.

14:40 - 15:05
Genome-Wide Prediction of Tissue-Specific Enhancers in Mammals Using Combinatorial Genomic Occupancy of Gene-Regulating Proteins [Abstract]

In this talk, I will present a sequence-based computational model developed by our research group that utilizes combinatorial transcription factor (TF) genomic occupancy to predict tissue-specific enhancers. Trained on diverse datasets, including ENCODE and the VISTA enhancer browser data, the model predicted 25,000 forebrain-specific cis-regulatory modules (CRMs) in the human genome. Validation through biochemical features, disease-associated SNPs, and in vivo zebrafish analysis confirmed its effectiveness. This model enhances the prediction of enhancers that lack well-characterized chromatin features, providing a valuable complement to experimental approaches in tissue-specific enhancer discovery.

Amir Ali Abbasi, National Center for Bioinformatics, Faculty of Biological Sciences, Quaid-i-Azam University

15:05 - 15:20 Tea Break
15:20 - 15:45
Most accurate mutations in SARS-CoV-2 genomes identified in Uzbek patients show novel amino acid changes

Mirzakamol Ayubov, Center of Genomics and Bioinformatics, Academy of Sciences of Uzbekistan

15:45 - 16:10
The challenge of data intelligence of infectious disease surveillance and early warning system in the era of big data [Abstract]

Artificial intelligence has been widely used in infectious disease surveillance and early warning. Artificial intelligence technology can rapidly collect, analyze and process large-scale data from multiple sources, identify complex patterns in the data based on efficient algorithms, accurately simulate and predict the development trend of epidemics, and assist in making optimal decisions. This talk briefly introduces the application of AI technology in infectious disease surveillance and early warning system, describes its advantages, and discusses the challenges faced in the future, aiming to provide a reliable reference for more effective monitoring and control of infectious disease epidemics.

Zhihang Peng, The Information Center of the Chinese CDC

16:10 - 16:35
The global virome project

Weifeng Shi, Shanghai Institute of Virology, Shanghai Jiao Tong University [Speaker introduction]

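Weifeng Shi, Ph.D., is a tenured professor and doctoral supervisor at Shanghai Jiao Tong University and Deputy Director of the Shanghai Institute of Virology. Shi was named a "Taishan Scholar Distinguished Expert" in 2015 and a Young Scholar of the "Changjiang Scholars Program" in 2019, and in the same year was selected for the National Hundred, Thousand and Ten Thousand Talents Project and awarded the title of "Young and Middle-aged Expert with Outstanding Contributions". In 2022 Shi received the honorary title of "Advanced Individual of the National Science and Technology System in the Fight against COVID-19", and in 2023 received funding from the National Science Fund for Distinguished Young Scholars as well as the 8th Zhongyuan Xiehe Life Medicine Innovation Breakthrough Award. Shi's research focuses on the origin and evolution of RNA viruses and on the transmission and variation of viral infectious diseases. Shi has published more than 140 SCI-indexed papers, which have been cited more than 60,000 times (Google Scholar); these include more than 90 research papers as first or corresponding author (including co-first and co-corresponding authorship) in journals such as Cell (2021), Nature (2015, 2021, 2024), Lancet (2013, 2020), Science (2021), Lancet Microbe (2022a, 2022b, 2023), Nature Reviews Microbiology (2019, 2023), Current Biology (2021), and Nature Communications (2014, 2020). Shi currently serves as Vice Chair of the Second Youth Committee of the Virology Committee of the Chinese Society for Microbiology and as a member of the First Committee of the Bioinformatics Branch of the Chinese Preventive Medicine Association, among other roles.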