4

I'm working through the Coursera NLP course by Jurafsky & Manning, and the lecture on Good-Turing smoothing struck me as odd.

The example given was:

You are fishing (a scenario from Josh Goodman), and caught:
10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
...
How likely is it that the next species is new (i.e. catfish or bass)
Let's use our estimate of things-we-saw-once to estimate the new things.
3/18 (because N_1=3)

I get the intuition of using the count of items seen exactly once (N_1 = 3) to estimate the probability mass of unseen item types, but the next steps seem counterintuitive.

Why is the denominator left unchanged instead of incremented by the estimate of unseen item types? I.e., I would expect the probabilities to become:

Carp : 10 / 21
Perch : 3 / 21
Whitefish : 2 / 21
Trout : 1 / 21
Salmon : 1 / 21
Eel : 1 / 21
Something new : 3 / 21

It seems like the Good-Turing count penalizes seen items too much (trout, salmon, & eel are each taken down to 1/27); coupled with the need to adjust the formula for gaps in the counts (e.g., Perch & Carp would be zeroed out otherwise), it just feels like a bad hack.
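For concreteness, here is a small sketch of the computation I'm describing (my own code, not from the lecture), using the lecture's formula c* = (c+1) · N_{c+1} / N_c for the discounted counts:

```python
from collections import Counter

# The fishing sample from the lecture
counts = {"carp": 10, "perch": 3, "whitefish": 2,
          "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())                 # 18 fish total
freq_of_freq = Counter(counts.values())  # N_c: {10: 1, 3: 1, 2: 1, 1: 3}

# Mass reserved for unseen species: N_1 / N = 3/18
p_unseen = freq_of_freq[1] / N

# Discounted count for a species seen once: c* = 2 * N_2 / N_1 = 2/3
c_star_1 = 2 * freq_of_freq[2] / freq_of_freq[1]
p_trout = c_star_1 / N                   # (2/3)/18 = 1/27
```

This is where the 1/27 for trout, salmon, and eel comes from, and it also shows the gap problem: for perch (c = 3) the formula needs N_4, which is 0 here.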

Ghillie Dhu
  • 141
  • 4

1 Answer

3

There are no unseen item types in the given data, by definition. 3 is the count of items seen once, and those items are already included in the denominator of 18. If a previously unseen type appeared next, it would at that moment become a seen-once type. Since 3 of the 18 examples were of seen-once types, 3/18 is an estimate of the probability that the next item belongs to a seen-once type, which is exactly what a novel type looks like on its first appearance.

It is certainly a heuristic. There is no way to know whether there are 0 or 1000 other types out there.
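One way to see the intuition (my own illustration, not part of the original answer) is a leave-one-out check: hold out each fish in turn and ask whether its species is then unseen in the remaining 17. Exactly the singletons qualify, so the fraction is N_1/N:

```python
# Leave-one-out sketch: a held-out fish looks "novel" to the remaining
# data precisely when its species occurred only once in the full sample.
fish = (["carp"] * 10 + ["perch"] * 3 + ["whitefish"] * 2
        + ["trout", "salmon", "eel"])

novel = sum(1 for i, f in enumerate(fish)
            if f not in fish[:i] + fish[i + 1:])
p_new = novel / len(fish)  # 3/18
```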

Sean Owen
  • 6,595
  • 6
  • 31
  • 43
  • Just to clarify, I meant unseen item types in the population being sampled; obviously there are none in the sample itself. – Ghillie Dhu Jan 13 '15 at 18:58