| Superpixels offer a compact image representation but usually produce strong over-segmentation. We propose a learning-based framework that predicts whether two adjacent superpixels should be merged. Each superpixel is represented using deep convolutional features aggregated by a NetVLAD layer into a fixed-dimensional embedding. A siamese architecture compares neighboring superpixels through an enriched pairwise representation and estimates merge probabilities with a multilayer perceptron. Trained end-to-end from ground-truth segmentations, the proposed method accurately predicts merge decisions on the challenging ISEG dataset and enables a hierarchical merging strategy that drastically reduces the number of superpixels while preserving most semantic boundaries. |
*** Title, author list and abstract as submitted during Camera-Ready version delivery. Small changes that may have occurred during processing by Springer may not appear in this window.