| Camera traps generate vast image collections for wildlife monitoring, yet turning these raw images into accurate species labels remains demanding. This paper introduces a reproducible deep learning benchmark for evaluating and comparing modern convolutional architectures on camera-trap image classification. Six representative models (ResNet-50, ResNet-101, EfficientNetV2-B3, EfficientNetV2-S, InceptionV3, and Inception-ResNetV2) were fine-tuned on the Taï National Park dataset, which contains 16,488 labeled images across eight classes; an additional 4,464 unlabeled test images were not used in evaluation. To enhance robustness, a comprehensive augmentation pipeline was applied, and hyperparameters were tuned by Bayesian optimization with a Tree-Structured Parzen Estimator. All models were assessed under stratified five-fold cross-validation to ensure a fair comparison. EfficientNetV2-S achieved the best performance, reaching 90.2 ± 1.1% accuracy and 0.89 ± 0.01 macro-F1. Beyond raw accuracy, the study examines common error sources such as glare, motion blur, and occlusion, highlighting practical limitations of current CNNs on ecological imagery. The paper contributes a standardized, cross-validated benchmarking framework that improves comparability between models and provides a reproducible baseline for future transformer-based and multimodal approaches to automated wildlife monitoring. |
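
The evaluation protocol named in the abstract (stratified five-fold cross-validation, with macro-F1 reported as mean ± spread) can be illustrated with a minimal, dependency-free sketch. This is not the authors' code: the function names `stratified_folds`, `macro_f1`, and `summarize` are hypothetical, the round-robin fold assignment is one simple way to stratify, and reading the "±" as the sample standard deviation across folds is an assumption, since the abstract does not say how the spread was computed.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign every sample index to one of k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's shuffled indices round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def summarize(fold_scores):
    """Return (mean, sample standard deviation) of the per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)
```

In use, each fold in turn would serve as the validation set while a model is trained on the other four; `macro_f1` is computed on each held-out fold and `summarize` condenses the five scores into a single "mean ± std" figure. Because macro-F1 weights all eight classes equally, it can sit below accuracy when rare classes are harder to classify.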
*** Title, author list and abstract as submitted during Camera-Ready version delivery. Small changes that may have occurred during processing by Springer may not appear in this window.