
CIFAR 10 vs. CIFAR 100 is the most popular benchmark pairing for Out-of-Distribution (OOD) detection evaluation. Google, in their 2022 post "Towards Reliability in Deep Learning"[1], used CIFAR 10 vs. CIFAR 100 to demo their new state-of-the-art model Plex. The main feature of the CIFAR 10 vs. CIFAR 100 pairing is mutual exclusivity, meaning that CIFAR 100 only contains negative examples of the CIFAR 10 classes. For example, CIFAR 10 has the classes "automobiles" and "trucks" but not the class "pickup trucks", which only CIFAR 100 has. However, the class "pickup trucks" in CIFAR 100 falls under the superclass "vehicles 2", not under "automobiles" or "trucks". So how can CIFAR 100 be used to test the OOD performance of a classifier trained on CIFAR 10?

[1] https://ai.googleblog.com/2022/07/towards-reliability-in-deep-learning.html

2 Answers


Manual Mapping of Nearest Classes

The answer is a manual mapping between the two datasets' classes. A variant of this is described in the paper "Measuring human performance on CIFAR-100 vs CIFAR-10 OOD task"[1]*. One would manually select and annotate the CIFAR 100 classes best suited for measuring OOD performance on a given CIFAR 10 class. For example, "pickup trucks" would be manually selected from CIFAR 100, manually annotated as "trucks", and then used to measure OOD performance for the class "trucks". A minimal sketch of such a hand-made mapping is shown below.
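The sketch below assumes the torchvision class names for CIFAR 100; the NEAREST_CIFAR10_CLASS dictionary is an illustrative example I made up for this answer, not the mapping used in the cited paper.

```python
from torchvision.datasets import CIFAR100

# Hypothetical hand-made mapping: CIFAR-100 class name -> nearest CIFAR-10 class.
# Extend by hand for the CIFAR-100 classes you care about.
NEAREST_CIFAR10_CLASS = {
    "pickup_truck": "truck",
    "bus": "truck",
    "wolf": "dog",
    "leopard": "cat",
}

cifar100 = CIFAR100(root="data", train=False, download=True)
label_to_name = {i: name for i, name in enumerate(cifar100.classes)}

# Collect the CIFAR-100 test images manually mapped to the CIFAR-10 class "truck";
# these become the near-OOD probes for that class.
target_c10_class = "truck"
probe_indices = [
    i for i, label in enumerate(cifar100.targets)
    if NEAREST_CIFAR10_CLASS.get(label_to_name[label]) == target_c10_class
]
print(f"{len(probe_indices)} CIFAR-100 test images mapped to '{target_c10_class}'")
```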

[1] https://proceedings.neurips.cc/paper/2021/file/3941c4358616274ac2436eacf67fae05-Supplemental.pdf

*While the typical accuracy a human reaches is often known for classification tasks, there is a lack of such benchmark for near-OOD detection. We decided to measure human performance on the task of distinguishing CIFAR-100 and CIFAR-10. To do that, we wrote a simple graphical user interface (GUI) where a user is presented with a fixed number of images randomly chosen from the in-distribution and out-of-distribution test sets (CIFAR-10 and 100 in our case). The user then clicks on the images they believe belong to the in-distribution. To make this easier, we allow the user to choose the images belonging to the individual classes of the in-distribution. An example of our GUI is shown in Figure 8.


Class-wise confidence thresholding

The previous answer suggested manual annotation as a human-supervised method for mapping classes. I asked ChatGPT the same question, and the gist of its response was:

How could CIFAR 100 be used to measure OOD performance for a multi-class classifier trained on CIFAR 10?

"To measure the OOD performance of the model, we can evaluate its ability to correctly identify samples from CIFAR 100 as OOD. One approach is to use the model's predictive uncertainty to identify OOD samples. For example, if the model's predicted probability for a sample from CIFAR 100 is low for all classes, we can conclude that the sample is likely OOD. On the other hand, if the model's predicted probability for a sample from CIFAR 100 is high for one of the classes in CIFAR 10, we can conclude that the sample is not OOD."

  • On the other hand, I have some reservations about this argument, because I want "pickup trucks" in CIFAR 100 to be identified as the class "trucks" in CIFAR 10! My feeling is that this approach would instead identify it as an "unknown class". I am not so sure though! – Emad Ezzeldin Mar 17 '23 at 14:17