11

Suppose we read a sequence of $n$ numbers one by one. How can we find the $k$'th smallest element using only $O(k)$ cells of memory and linear time, i.e. $O(n)$? My idea is to store the first $k$ terms of the sequence; when the $(k+1)$'th term arrives, delete a stored term that we are sure cannot be the $k$'th smallest element and store the new term in its place. For this we need an indicator that identifies such a removable term at each step, and this indicator must be quick to update. I started with the maximum, but it cannot be updated quickly: after the first deletion we lose track of the maximum and must search for it again in $O(k)$ time, which leads to $(n-k) \times O(k)$ time overall, and that is not linear. Maybe we should store the first $k$ terms of the sequence more intelligently.

How do I solve this problem?

Yuval Filmus
  • 276,994
  • 27
  • 311
  • 503
Shahab_HK
  • 147
  • 1
  • 8

2 Answers

16

Create a buffer of size $2k$. Read in $2k$ elements from the array. Use a linear-time selection algorithm to partition the buffer so that the $k$ smallest elements are first; this takes $O(k)$ time. Now read in another $k$ items from your array into the buffer, replacing the $k$ largest items in the buffer, partition the buffer as before, and repeat.

This takes $O((n/k) \cdot k) = O(n)$ time and $O(k)$ space, since there are about $n/k$ rounds and each round costs $O(k)$.
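A minimal Python sketch of this buffering scheme (the function name is my own; `heapq.nsmallest` is used as a stand-in for a true linear-time selection, so as written the sketch runs in $O(n \log k)$ rather than $O(n)$, but the buffering structure is exactly as described):

```python
import heapq

def kth_smallest_stream(stream, k):
    """Find the k-th smallest element (1-based) of a stream using
    O(k) memory, via the 2k-buffer trick described above."""
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == 2 * k:
            # Keep only the k smallest elements of the buffer.
            # A linear-time selection (e.g. median of medians) would
            # make this O(k); heapq.nsmallest is an O(k log k) stand-in.
            buf = heapq.nsmallest(k, buf)
    if len(buf) < k:
        raise ValueError("stream has fewer than k elements")
    # The k smallest survivors include the global k smallest.
    return heapq.nsmallest(k, buf)[-1]
```

Because each round discards only elements that are provably not among the $k$ smallest seen so far, the final buffer always contains the true answer.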

jbapple
  • 3,380
  • 17
  • 21
  • +1, this achieves the asked asymptotics. That being said, I don't believe this is faster than doing a single linear-time selection pass... except when $k$ is a small constant, where it provides an interesting perspective. For example, for $k = 1$ this algorithm reduces to the min function. – orlp Jan 01 '17 at 19:18
  • 1
    Sometimes, the linear-time selection algorithm uses too much space. For instance, it is not suitable for use in a streaming context or when the input array is immutable. – jbapple Jan 01 '17 at 19:25
  • Those are valid points. – orlp Jan 01 '17 at 19:28
3

You can do it in $O(k)$ memory and $O(n \log k)$ time by building a fixed-size max-heap from the first $k$ elements in $O(k)$ time, then iterating over the rest of the array, pushing each new element and popping the maximum in $O(\log k)$ per element, giving total time $O(k + n \log k) = O(n \log k)$.
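A sketch of this heap approach in Python (the function name is my own; Python's `heapq` is a min-heap, so the max-heap is simulated by negating values):

```python
import heapq

def kth_smallest_heap(stream, k):
    """k-th smallest (1-based) in O(k) memory and O(n log k) time:
    maintain a max-heap of the k smallest elements seen so far."""
    heap = []  # stores negated values, so heap[0] is -(current max)
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif -heap[0] > x:
            # x beats the current k-th smallest: pop the max, push x.
            heapq.heapreplace(heap, -x)
    if len(heap) < k:
        raise ValueError("stream has fewer than k elements")
    return -heap[0]  # the max of the k smallest is the k-th smallest
```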

You can do it in $O(\log n)$ auxiliary memory and $O(n)$ time by using the median-of-medians selection algorithm, selecting at $k$, and returning the first $k$ elements. With no change to asymptotics you can use introselect to speed up the average case. This is the canonical way to solve your problem.
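As an illustrative sketch only: the code below uses a random pivot, which is the common introselect path with expected $O(n)$ time; swapping in the median-of-medians pivot rule would give the deterministic worst-case $O(n)$ bound with the same interface.

```python
import random

def quickselect(a, k):
    """Select the k-th smallest (1-based) element of list `a`,
    partitioning `a` in place so that a[:k] are the k smallest.
    Random pivots give expected O(n) time; a median-of-medians
    pivot would make the worst case O(n) as well."""
    lo, hi = 0, len(a) - 1
    target = k - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        # Hoare-style partition around the pivot value.
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Recurse (iteratively) into the side containing `target`.
        if target <= j:
            hi = j
        elif target >= i:
            lo = i
        else:
            break  # target lies in the middle run of pivot-equal values
    return a[target]
```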

Now, technically $O(\log n)$ and $O(k)$ are incomparable. However, I argue that $O(\log n)$ is better in practice, as it is effectively constant: no computer system has more than $2^{64}$ bytes of memory, and $\log_2 2^{64} = 64$. Meanwhile $k$ can grow to be as large as $n$.

orlp
  • 13,386
  • 1
  • 24
  • 40
  • Note that you can improve the complexity of the heap-based algorithm to $O(n \log \min(k, n-k))$ by reversing the order used by the heap when that is advantageous, i.e., keeping a heap of the $n-k$ largest elements when $n-k < k$. – xavierm02 Jan 01 '17 at 15:04
  • @xavierm02 $O(\min(k, n-k)) = O(k)$. Proof: the worst case for $k$ is $n$. The worst case for $\min(k, n-k)$ is $n/2$. They are the same within a constant factor, thus $O(\min(k, n-k)) = O(k)$. – orlp Jan 01 '17 at 15:10
  • @xavierm02 That being said, it's still a nice speedup :) – orlp Jan 01 '17 at 15:19
  • $u_{n,k} = k$ is $O(k)$ but it is not $O(\min(k, n-k))$. Suppose it is. Then there are some $C$ and some $M$ such that for every $M \le k \le n$ we have $k \le C(n-k)$, which is clearly false (because we can take $n = k \to +\infty$).

    So $O(\min(k, n-k))\subsetneq O(k)$.

    – xavierm02 Jan 01 '17 at 15:27
  • @xavierm02 I'm unfamiliar with your $u_{n, k}$ notation. To be fair, I'm in general quite unfamiliar with multidimensional big-$O$ notation, especially considering that dimensions $n, k$ are not unrelated. – orlp Jan 01 '17 at 16:05
  • Thank you for your hints. So as I understand it, there is no algorithm yet that can do it in $O(k)$ memory and $O(n)$ time? – Shahab_HK Jan 01 '17 at 17:08
  • @Shahab_HK I don't believe so, but $O(\log n)$ memory is so little it doesn't matter either way. In fact it is very rare that $k < \log n$, and even if an $O(k)$-memory algorithm existed, given the choice I'd still choose $O(\log n)$. – orlp Jan 01 '17 at 18:05
  • @Shahab_HK It turns out I was wrong to believe that $O(k)$ memory and $O(n)$ time didn't exist; see the other answer. – orlp Jan 01 '17 at 19:16
  • @orlp: How do you do the median-of-medians selection algorithm in $O(\log n)$ memory and $O(n)$ time? – Jan 02 '17 at 05:15
  • @RickyDemer That's outside the scope of this question; I refer you to the Wikipedia articles on median of medians and introselect. – orlp Jan 02 '17 at 06:42
  • @orlp: I have asked a question about that on this site. – Jan 04 '17 at 07:08