Hubei Agricultural Sciences ›› 2025, Vol. 64 ›› Issue (7): 203-206. doi: 10.14088/j.cnki.issn0439-8114.2025.07.035

• Bioengineering •

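  • About the author: XIANG Chong (1981-), female, native of Jingmen, Hubei; associate professor, master's degree; research interests: big data technology and artificial intelligence. (Tel.) 18995589088; (E-mail) 21580555@qq.com.
  • Funding:
    Natural Science Foundation of Hubei Province (General Program) (2023AFB921)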

Optimization of genotype imputation for low-depth sequencing data and performance analysis of regression models

XIANG Chong, CHEN Can   

  1. School of Data and Information, Changjiang Polytechnic, Wuhan 430070, China
  • Received: 2025-04-14 Published: 2025-07-25 Online: 2025-08-22

Abstract: A new method for analyzing low-depth sequencing genomic data was established by optimizing the genotype imputation algorithm and screening for the optimal regression model. The results showed that, compared with the pre-optimization algorithm, the optimized genotype imputation algorithm raised accuracy from 95% to 98%, while parameter tuning and the selection of more efficient algorithms cut the time for a single imputation run from 24 h to 12 h, markedly improving processing efficiency. For continuous phenotypes (e.g., quantitative traits in GWAS), the ridge regression and linear regression models performed well: at 1.0× sequencing depth, their MSEs were 0.07 and 0.08 and their accuracies were 0.82 and 0.80, respectively. For classification problems (e.g., genomic selection), the logistic regression model benefited from its probabilistic formulation and achieved better classification performance (AUC=0.90) than the linear regression model (AUC=0.85).
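
For readers unfamiliar with the metrics quoted in the abstract, the following minimal sketch shows how MSE and AUC can be computed when comparing ridge and ordinary linear regression. It is not the authors' pipeline: the genotype matrix is simulated (0/1/2 allele dosages), the penalty value `lam=10.0`, the median split used to form binary labels, and the function names `fit_ridge`, `mse`, `auc` and `compare_on_simulated_data` are all assumptions made for illustration, so the numbers it prints are unrelated to those reported above.

```python
import numpy as np


def fit_ridge(X, y, lam):
    """Closed-form ridge fit: solve (X^T X + lam*I) w = X^T y.

    lam = 0 reduces to ordinary least squares (requires X^T X to be invertible).
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted phenotypes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


def compare_on_simulated_data(seed=0, lam=10.0):
    """Fit OLS and ridge on simulated dosages; return test MSE and AUC for each."""
    rng = np.random.default_rng(seed)
    n, p = 200, 50                      # hypothetical sample and marker counts
    X = rng.integers(0, 3, size=(n, p)).astype(float)
    beta = rng.normal(0.0, 0.3, size=p)
    y = X @ beta + rng.normal(0.0, 1.0, size=n)

    train, test = slice(0, 150), slice(150, n)
    x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
    # Binary trait for the AUC illustration: above/below the training median.
    labels = (y[test] > np.median(y[train])).astype(int)

    results = {}
    for name, penalty in (("linear", 0.0), ("ridge", lam)):
        w = fit_ridge(X[train] - x_mean, y[train] - y_mean, penalty)
        pred = (X[test] - x_mean) @ w + y_mean
        results[name] = {"mse": mse(y[test], pred), "auc": auc(labels, pred)}
    return results


if __name__ == "__main__":
    for model, metrics in compare_on_simulated_data().items():
        print(model, metrics)
```

The ridge penalty shrinks marker effects toward zero, which is the usual motivation for preferring it over plain least squares when markers are numerous or correlated; whether it helps on a given data set depends on the choice of `lam`, which would normally be tuned by cross-validation.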

Key words: low-depth sequencing data, genotype imputation, ridge regression model, performance analysis, linear regression model, logistic regression model
