TY - JOUR
T1 - MLP-based multimodal tomato detection in complex scenarios
T2 - Insights from task-specific analysis of feature fusion architectures
AU - Chen, Wenjun
AU - Rao, Yuan
AU - Wang, Fengyi
AU - Zhang, Yu
AU - Wang, Tan
AU - Jin, Xiu
AU - Hou, Wenhui
AU - Jiang, Zhaohui
AU - Zhang, Wu
N1 - Publisher Copyright: © 2024
PY - 2024/6/1
Y1 - 2024/6/1
AB - Accurate and efficient tomato detection is essential for deploying robotic picking in practical agricultural applications, but detecting tomatoes from RGB images alone remains challenging in complex scenarios with fluctuating light, overlapping fruits, and occlusion by branches and leaves. The recent development of RGB-D sensors offers a promising opportunity to adopt multimodal fusion for high-quality fruit detection. However, whether existing multimodal fusion and feature extraction architectures are suitable for lightweight tomato detection, especially in complex agricultural scenarios, remains an open question. To address this, we proposed a multimodal fusion encoder that uses depth and near-infrared modalities to assist RGB images, making full use of the multimodal data. The encoder has a plug-and-play structure that can be implemented as an MLP-based (Multi-Layer Perceptron), ViT-based (Vision Transformer), or CNN-based (Convolutional Neural Network) architecture. We further developed a lightweight experimental detection framework based on YOLOv7-tiny by integrating the multimodal fusion encoder and, after a comprehensive analysis of the three architectures, put forward YOLO-DNA (Depth and Near-infrared Assisted) based on the MLP-based architecture. In addition, a multimodal tomato dataset containing visible, depth, and near-infrared images was established. Experimental results demonstrated that YOLO-DNA achieved an mAP0.5 of 98.13% and an mAP0.5:0.95 of 74.0%, an average increase of 5.01% in mAP0.5 and 14.55% in mAP0.5:0.95 over mainstream lightweight detection models, with a detection speed of 37.12 FPS, meeting the demand of real-time tomato detection. These findings have the potential to advance research on fruit detection for intelligent agricultural harvesting.
KW - Complex scenarios
KW - Feature fusion
KW - Multimodal
KW - Tomato detection
KW - YOLO
UR - https://www.scopus.com/pages/publications/85191426424
U2 - 10.1016/j.compag.2024.108951
DO - 10.1016/j.compag.2024.108951
M3 - Article
AN - SCOPUS:85191426424
SN - 0168-1699
VL - 221
JO - Computers and Electronics in Agriculture
JF - Computers and Electronics in Agriculture
M1 - 108951
ER -