Elusive Images

Abstract

While the community has seen many advances in recent years to address the challenging problem of Finegrained Visual Categorization (FGVC), progress seems to be slowing—new state-of-the-art methods often distinguish themselves by improving top-1 accuracy by mere tenths of a percent. However, across all of the now-standard FGVC datasets, there remain sizeable portions of the test data that none of the current state-of-the-art (SOTA) models can successfully predict. This paper provides a framework for identifying and studying the errors that current methods make across diverse fine-grained datasets. Three models of difficulty—Prediction Overlap, Prediction Rank and Pairwise Class Confusion—are employed to highlight the most challenging sets of images and classes. Extensive experiments apply a range of standard and SOTA methods, evaluating them on multiple FGVC domains and datasets. Insights acquired from coupling these difficulty paradigms with the careful analysis of experimental results suggest crucial areas for future FGVC research, focusing critically on the set of elusive images that none of the current models can correctly classify.

Main Contributions

We propose a framework for identifying and analyzing challenging images and classes within a dataset, using 3 metrics based on aggregate predictions from multiple classification models: prediction overlap, prediction rank, and pairwise class confusion. The code is available.
We provide a standardized implementation and training recipe for several state-of-the-art FCVG models, allowing for fair comparison and efficient benchmarking. The code is available.
We introduce the iCub dataset, consisting of images of birds from the same species as CUB, but at a larger and more challenging scale, and meant to serve as an additional set of validation images for analysis.
We utilize the proposed analysis framework to analyze 5 standard FGVC datasets, along with iCub, using 6 different model types, to identify patterns in challenging images and classes.

Method

We propose three methods for analyzing image and class difficulty: prediction overlap, prediction rank, and pairwise class confusion. We use these methods to analyze difficulty in FGVC datasets.

Illusration of **prediction overlap** (left) and **prediction rank** (right).

Illustration of **pairwise class confusion**.

iCub Dataset

We propose the iCub dataset as an additional large-scale source of validation images for testing models trained to recognize birds from the 200 species found in CUB-200. iCub images are sourced from iNaturalist, and have a different visual distribution. This allows us to test CUB models on a more difficult set of data to see how they generalize.

iCub sample images — A random sampling of images from the proposed iCub dataset.

One way iCub differs from CUB is in the spatial distribution of the birds with respect to the image; in particular, iCub contains many birds that are small (occupy a small area of the image).

Results

For our analysis, we train five models (using different random seeds) for each of the model types. Accuracy of these models is shown below. We find that the models tend to underperform the results reported in their respective papers when they are reimplemented in a standardized context. We found that WS-DAN performed particularly well, despite being an older approach.

Validation accuracy on 6 datasets using 6 different models with 5 trials each.

We also show results of applying our proposed difficulty analysis methods: prediction overlap, prediction rank, and similar class confusion.

Prediction overlap measured on each dataset, using 30 predictions for each image. We show the percentage of images in each overlap group, with perfect overlap (no incorrect predictions) on the left and zero overlap (no correct predictions) on the right. In all datasets, we see a large percentage of the images are always predicted correctly, but there is also a significant portion of the images that are never predicted correctly, and many that are rarely predicted correctly.

A comparison between the prediction rank for each image and the average prediction rank for images in the same class. This shows a similar pattern as prediction overlap; on average, the rank is low (the correct class is one of the highest predictions), but there are specific images that have much higher rank.

Confusion matrices for each dataset. Darker color indicates more predictions. Red cells show similar classes, defined as classes for which the KL-divergence is more than 3 standard deviations below the mean (across all pairs of classes in the dataset), and blue cells show predictions in non-similar classes.

Finally, we show some examples of difficult images. We notice some patterns in these images such as camoflauge, occlusion, non-standard poses, or multiple objects. We hypothesize that these factors contribute to the image difficulty; but we must be careful not to confuse correlation with causation. Determining the real or exact reasons for misclassification is a challenging task.

Examples of challenging images: elusive images on the left, and hard images on the right.

Citation

@inproceedings{anderson2024elusive-images,
  title={Elusive Images: Beyond Coarse Analysis for Fine-grained Recognition},
  author={Anderson, Connor and Gwilliam, Matt and Gaskin, Evelyn and Farrell, Ryan},
  booktitle={WACV},
  year={2024}
}