OrchardTemporal: A Benchmark for Evaluating Vision-Language Models on Temporal Reasoning in Apple Orchards.
- L. Nguyen, N. Dethlefs and C. Liu.
Orchard management requires reasoning across time, yet existing agricultural multimodal benchmarks do not test temporal tasks that often require analysis of multiple images. We introduce OrchardTemporal, the first benchmark evaluating Vision-Language Models (VLMs) on temporal reasoning in apple orchards, comprising four tasks: growth stage classification, transition detection, cross-seasonal tree re-identification, and fruit load comparison. Evaluating four VLMs, we find that strong single-image classification capability (F1=0.83 for ChatGPT 4o) does not transfer to temporal reasoning: transition detection drops to F1=0.61, and cross-seasonal re-identification barely exceeds chance. Prompting strategies such as attribute-guided prompting and few-shot prompting do not improve model performance on multi-image temporal analysis tasks. These findings reveal that visual perception alone is insufficient for agricultural temporal tasks and establish baselines for future work.
- Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 9807-9813.