Hubei Agricultural Sciences, 2026, Vol. 65, Issue (2): 202-208. doi: 10.14088/j.cnki.issn0439-8114.2026.02.030

• Information Engineering •

A non-contact estimation method for feed residue based on dual-modal MobileViTv2

CAI Xiao-jin1, BAI Tao1,2,3, LI Xiang1, QIAO Rui-qiang1

  1. College of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052, China;
    2. Engineering Research Center of Intelligent Agriculture, Ministry of Education, Urumqi 830052, China;
    3. Xinjiang Engineering Research Center for Agricultural Informatization, Urumqi 830052, China
  • Received: 2025-09-22; Published: 2026-03-04; Online: 2026-03-04
  • Corresponding author: BAI Tao (1979-), male, from Urumqi, Xinjiang; professor; research interests: agricultural big data and data mining; e-mail: bt@xjau.edu.cn
  • First author: CAI Xiao-jin (2000-), female, from Luohe, Henan; master's degree; research interest: computer vision; e-mail: cxj1558403@163.com
  • Funding:
    Major Science and Technology Special Project of Xinjiang Uygur Autonomous Region (2022A02011-4); Science and Technology Innovation 2030 Major Project of the Ministry of Science and Technology (2022ZD0115800); Basic Scientific Research Operating Expenses Project for Universities of Xinjiang Uygur Autonomous Region (XJEDU2022J009)



Abstract: To address the problems of traditional feed residue detection methods, namely their reliance on contact sensors, high cost, and the need to modify feeding troughs, a lightweight convolutional fusion regression model based on dual-modal MobileViTv2 (dual-modal MobileViTv2 + CMFIM + SE) was proposed to achieve non-contact, high-precision automatic estimation of feed residue. Taking RGB images and depth images as input, the model extracted multi-scale features from each modality through the dual-modal MobileViTv2 backbone and introduced a cross-modal multi-scale feature interaction module (CMFIM) at four levels to achieve joint spatial-channel interaction between RGB and depth features. An SE module was employed to adaptively recalibrate channel weights and enhance high-level semantic representation, and predictions were output through a multilayer perceptron regression head. On a self-built dataset, the mean absolute error (MAE) and root mean square error (RMSE) of the dual-modal MobileViTv2 + CMFIM + SE model were 98.24 g and 140.21 g, respectively, reductions of 21.65% and 16.73% compared with the dual-modal MobileViTv2 model without the CMFIM and SE modules, while the parameter count was only 9.9×10⁶. Combining high accuracy, strong robustness, and a lightweight design, the model provides a feasible technical path for precision feeding in smart livestock farming.
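The SE (squeeze-and-excitation) recalibration step mentioned in the abstract can be illustrated with a minimal NumPy sketch; this is not the authors' implementation, and the channel count, reduction ratio, and weight shapes below are illustrative assumptions only:

```python
import numpy as np

def se_recalibrate(feat, w1, b1, w2, b2):
    """Squeeze-and-Excitation channel recalibration on a (C, H, W) feature map.

    Squeeze: global average pooling reduces each channel to one scalar.
    Excitation: two fully connected layers (ReLU, then sigmoid) map those
    scalars to per-channel weights in (0, 1), which rescale the input.
    """
    z = feat.mean(axis=(1, 2))                    # squeeze: (C,)
    s = np.maximum(0.0, w1 @ z + b1)              # FC + ReLU: (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s + b2)))      # FC + sigmoid: (C,)
    return feat * s[:, None, None]                # rescale each channel

# Toy example: C = 8 channels, reduction ratio r = 4 (sizes are assumed).
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
w1, b1 = rng.standard_normal((2, 8)) * 0.1, np.zeros(2)
w2, b2 = rng.standard_normal((8, 2)) * 0.1, np.zeros(8)
out = se_recalibrate(feat, w1, b1, w2, b2)
print(out.shape)  # same shape as the input feature map
```

Because the sigmoid bounds each channel weight in (0, 1), the module can only attenuate channels relative to one another, which is how it emphasizes the more informative high-level semantic channels.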

Key words: dual-modal MobileViTv2, feed residue, non-contact estimation, RGB images, depth images

CLC number: