
I'm working on problem 2.3a in Shalev-Shwartz and Ben-David's machine learning textbook, which states:

An axis aligned rectangle classifier in the plane is a classifier that assigns 1 to a point if and only if it is inside a certain rectangle. Formally, given real numbers $a_1\leq b_1, a_2\leq b_2,$ define the classifier $h_{(a_1, b_1, a_2, b_2)}$ by $$ h_{(a_1, b_1, a_2, b_2)}(x_1, x_2) = \begin{cases}1&\textrm{if $a_1\leq x_1\leq b_1$ and $a_2\leq x_2\leq b_2$}\\ 0&\textrm{otherwise}\end{cases} $$ The class of all axis aligned rectangles in the plane is defined as $\mathcal{H}_\mathrm{rec}^2 = \{h_{(a_1, b_1, a_2, b_2)}:\textrm{$a_1\leq b_1$ and $a_2\leq b_2$}\}$ … rely on the realizability assumption. Let $A$ be an algorithm that returns the smallest rectangle enclosing all positive examples in the training set. Show that $A$ is an ERM (empirical risk minimizer).
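For concreteness, here is how I read the algorithm $A$ in the two-dimensional case, as a minimal Python sketch (the function names are my own, not from the book):

```python
import numpy as np

def fit_smallest_rectangle(X, y):
    """A: return (a1, b1, a2, b2) describing the tightest axis-aligned
    rectangle enclosing every positively labeled training point.
    X is an (m, 2) array of points, y an (m,) array of 0/1 labels."""
    pos = X[y == 1]
    if len(pos) == 0:
        return None  # no positive examples: use an "empty" rectangle
    a1, a2 = pos.min(axis=0)
    b1, b2 = pos.max(axis=0)
    return a1, b1, a2, b2

def h_rect(rect, X):
    """The classifier h_{(a1, b1, a2, b2)}: 1 inside the rectangle, 0 outside."""
    if rect is None:
        return np.zeros(len(X), dtype=int)
    a1, b1, a2, b2 = rect
    return ((X[:, 0] >= a1) & (X[:, 0] <= b1) &
            (X[:, 1] >= a2) & (X[:, 1] <= b2)).astype(int)
```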

I don't understand why $A$ would be an ERM when the domain/instance space is infinite. Below is the counterexample I considered to this claim.

For simplicity, consider the one-dimensional case, where the rectangles are intervals. Sample an instance $x\sim \mathcal{U}_{[0, 1]}$ from the uniform distribution over $[0, 1]$, and let the labeling function $f:[0, 1]\to\{0, 1\}$ be defined by $$ f(x) = \begin{cases} 1 &\textrm{if $x = 0$ or $x = 1$}\\ 0 &\textrm{otherwise.} \end{cases} $$ The realizability assumption holds, since there exists an $h^*$, namely the one whose "rectangle" is the singleton $\{1\}$, whose error set $\{0\}$ has measure $0$. However, for the particular sample $$ S = \{0, 1/4, 1/2, 3/4, 1\}, $$ the smallest rectangle returned by $A$ is the interval $[0, 1]$, and the resulting classifier $h_A$ has empirical error $L_S(h_A) = 3/5$. In contrast, the classifier $h^*$ has empirical error $L_S(h^*) = 1/5$, a lower error than the classifier returned by $A$, even though it does not contain all positive examples. Therefore $A$ does not minimize empirical error.
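To double-check the arithmetic, here is a quick numerical sketch of the two empirical errors (the interval/error helpers are my own notation):

```python
import numpy as np

# The 1-D sample from above, labeled by f(x) = 1 iff x = 0 or x = 1.
S = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.array([1, 0, 0, 0, 1])

def interval_classifier(a, b):
    """1-D analogue of an axis-aligned rectangle: h(x) = 1 iff a <= x <= b."""
    return lambda x: ((x >= a) & (x <= b)).astype(int)

def empirical_error(h, x, y):
    """Fraction of sample points on which h disagrees with the labels."""
    return np.mean(h(x) != y)

h_A    = interval_classifier(0.0, 1.0)  # smallest interval enclosing all positives
h_star = interval_classifier(1.0, 1.0)  # the singleton {1}

print(empirical_error(h_A, S, y))     # 0.6 = 3/5
print(empirical_error(h_star, S, y))  # 0.2 = 1/5
```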

  • I think you misunderstand the realizability assumption: the book states (on the same page, before the exercises): "the realizability assumption holds (that is, for some $h \in \mathcal{H}$, $L_{(\mathcal{D},f)}(h) = 0$)". I.e., there must exist a classifier achieving $0$ error. I guess you've also misunderstood what "measure 0" means: each point in your dataset has measure $\frac 15$, not $0$, since you should use a discrete measure, not the Lebesgue measure. –  Jan 05 '21 at 01:34
  • @Dmitry My choice of different measures for true and empirical error was intentional. The true error, defined as $L_{(\mathcal{D}, f)}(h) := \mathcal{D}(\{x : f(x)\neq h(x)\})$, is with respect to the underlying distribution $\mathcal{D}$, which in this case is the uniform distribution; that is why $L_{(\mathcal{D}, f)}(h^*) = 0$. On the other hand, the empirical error $L_S(h) = \#(\mathrm{errors})/\#S$ of a given sample is measured with respect to the discrete measure, regardless of the sample's underlying distribution. – dTdt Jan 05 '21 at 19:00
  • What I'm trying to say is that the most I can claim about $A$, given the realizability assumption, is that with probability 1 over random samples, $A$ is an ERM, but not always (see the sketch after these comments). – dTdt Jan 05 '21 at 19:01
  • I see. Since the book states, "To simplify the presentation, we sometimes omit the "with probability 1" specifier." (page 38), I think this is also what they mean in this case. –  Jan 05 '21 at 21:45
  • You're probably right, I forgot about that footnote when I wrote this. Thanks for helping me clear it up. – dTdt Jan 05 '21 at 22:16
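To illustrate the point made in the last comments: a sample drawn from $\mathcal{U}_{[0, 1]}$ almost surely contains no positive example under this $f$, in which case $A$ returns an empty interval and trivially achieves empirical error $0$, so it is an ERM on that sample. A small sketch of this (my own, with an arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Under D = U[0, 1] and f(x) = 1 iff x is 0 or 1, a positive example requires
# drawing exactly 0 or 1, an event of probability 0.
for m in (5, 50, 5000):
    x = rng.uniform(0.0, 1.0, size=m)
    y = ((x == 0.0) | (x == 1.0)).astype(int)
    # With no positives, A returns the empty interval, the all-zero predictor
    # makes no mistakes on S, and so A is an ERM on this sample.
    print(m, int(y.sum()))  # number of positives: 0 with probability 1
```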
