11

Suppose we read a sequence of $n$ numbers one by one. How can we find the $k$'th smallest element using only $O(k)$ cells of memory and linear time, i.e. $O(n)$? My idea is to store the first $k$ terms of the sequence; when the $(k+1)$'th term arrives, delete a stored term that we are sure cannot be the $k$'th smallest element and store the new term in its place. For this we need an indicator that identifies such a removable term at each step, and this indicator must be quick to update. I started with the maximum, but it cannot be updated quickly: after the first deletion we lose track of the maximum and must search for it again in $O(k)$ time, which leads to $(n-k) \times O(k)$ time overall, and that is not linear. Maybe we should store the first $k$ terms of the sequence more intelligently.

How do I solve this problem?

Yuval Filmus
  • 276,994
  • 27
  • 311
  • 503
Shahab_HK
  • 147
  • 1
  • 8

2 Answers

16

Create a buffer of size $2k$. Read in $2k$ elements from the array. Use a linear-time selection algorithm to partition the buffer so that the $k$ smallest elements are first; this takes $O(k)$ time. Now read in another $k$ items from your array into the buffer, replacing the $k$ largest items in the buffer, partition the buffer as before, and repeat.

This takes $O((n/k) \cdot k) = O(n)$ time and $O(k)$ space, since there are about $n/k$ rounds and each round costs $O(k)$.
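A minimal Python sketch of this buffering scheme (the function name is my own; `heapq.nsmallest` is used as a stand-in for a true linear-time selection, so as written the sketch runs in $O(n \log k)$ rather than $O(n)$, but the buffering structure is exactly as described):

```python
import heapq

def kth_smallest_stream(stream, k):
    """Find the k-th smallest element (1-based) of a stream using
    O(k) memory, via the 2k-buffer trick described above."""
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == 2 * k:
            # Keep only the k smallest elements of the buffer.
            # A linear-time selection (e.g. median of medians) would
            # make this O(k); heapq.nsmallest is an O(k log k) stand-in.
            buf = heapq.nsmallest(k, buf)
    if len(buf) < k:
        raise ValueError("stream has fewer than k elements")
    # The k smallest survivors include the global k smallest.
    return heapq.nsmallest(k, buf)[-1]
```

Because each round discards only elements that are provably not among the $k$ smallest seen so far, the final buffer always contains the true answer.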

jbapple
  • 3,380
  • 17
  • 21
  • +1, this achieves the asked asymptotics. That being said, I don't believe this is faster than doing a single linear-time selection pass... except when $k$ is a small constant, where it provides an interesting perspective. For example, for $k = 1$ this algorithm reduces to the min function. – orlp Jan 01 '17 at 19:18
  • 1
    Sometimes, the linear-time selection algorithm uses too much space. For instance, it is not suitable for use in a streaming context or when the input array is immutable. – jbapple Jan 01 '17 at 19:25
  • Those are valid points. – orlp Jan 01 '17 at 19:28
3

You can do it in $O(k)$ memory and $O(n \log k)$ time by building a fixed-size max-heap from the first $k$ elements in $O(k)$ time, then iterating over the rest of the array, pushing each new element and popping the maximum in $O(\log k)$ per element, giving total time $O(k + n \log k) = O(n \log k)$.
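A sketch of this heap approach in Python (the function name is my own; Python's `heapq` is a min-heap, so the max-heap is simulated by negating values):

```python
import heapq

def kth_smallest_heap(stream, k):
    """k-th smallest (1-based) in O(k) memory and O(n log k) time:
    maintain a max-heap of the k smallest elements seen so far."""
    heap = []  # stores negated values, so heap[0] is -(current max)
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif -heap[0] > x:
            # x beats the current k-th smallest: pop the max, push x.
            heapq.heapreplace(heap, -x)
    if len(heap) < k:
        raise ValueError("stream has fewer than k elements")
    return -heap[0]  # the max of the k smallest is the k-th smallest
```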

You can do it in $O(\log n)$ auxiliary memory and $O(n)$ time by using the median-of-medians selection algorithm, selecting at $k$, and returning the first $k$ elements. With no change to asymptotics you can use introselect to speed up the average case. This is the canonical way to solve your problem.
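As an illustrative sketch only: the code below uses a random pivot, which is the common introselect path with expected $O(n)$ time; swapping in the median-of-medians pivot rule would give the deterministic worst-case $O(n)$ bound with the same interface.

```python
import random

def quickselect(a, k):
    """Select the k-th smallest (1-based) element of list `a`,
    partitioning `a` in place so that a[:k] are the k smallest.
    Random pivots give expected O(n) time; a median-of-medians
    pivot would make the worst case O(n) as well."""
    lo, hi = 0, len(a) - 1
    target = k - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        # Hoare-style partition around the pivot value.
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Recurse (iteratively) into the side containing `target`.
        if target <= j:
            hi = j
        elif target >= i:
            lo = i
        else:
            break  # target lies in the middle run of pivot-equal values
    return a[target]
```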

Now, technically $O(\log n)$ and $O(k)$ are incomparable. However, I argue that $O(\log n)$ is better in practice, as it is effectively constant: no computer system has more than $2^{64}$ bytes of memory, and $\log_2 2^{64} = 64$. Meanwhile $k$ can grow to be as large as $n$.

orlp
  • 13,386
  • 1
  • 24
  • 40
  • Note that you can improve the complexity of the heap-based algorithm to $O(n \log \min(k, n-k))$ by reversing the order used by the heap when that is advantageous, i.e., keeping a heap of the $n-k$ largest elements when $n-k < k$. – xavierm02 Jan 01 '17 at 15:04
  • @xavierm02 $O(\min(k, n-k)) = O(k)$. Proof: the worst case for $k$ is $n$. The worst case for $\min(k, n-k)$ is $n/2$. They are the same within a constant factor, thus $O(\min(k, n-k)) = O(k)$. – orlp Jan 01 '17 at 15:10
  • @xavierm02 That being said, it's still a nice speedup :) – orlp Jan 01 '17 at 15:19
  • $u_{n,k} = k$ is $O(k)$ but it is not $O(\min(k, n-k))$. Suppose it is. Then there are some $C$ and some $M$ such that for every $M \le k \le n$ we have $k \le C(n-k)$, which is clearly false (because we can take $n = k \to +\infty$).

    So $O(\min(k, n-k))\subsetneq O(k)$.

    – xavierm02 Jan 01 '17 at 15:27
  • @xavierm02 I'm unfamiliar with your $u_{n, k}$ notation. To be fair, I'm in general quite unfamiliar with multidimensional big-$O$ notation, especially considering that dimensions $n, k$ are not unrelated. – orlp Jan 01 '17 at 16:05
  • Thank you for your hints. So as I understand it, there is no algorithm yet that can do it in $O(k)$ memory and $O(n)$ time? – Shahab_HK Jan 01 '17 at 17:08
  • @Shahab_HK I don't believe so, but $O(\log n)$ memory is so little it doesn't matter either way. In fact it is very rare that $k < \log n$, and even if an $O(k)$-memory algorithm existed, given the choice I'd still choose $O(\log n)$. – orlp Jan 01 '17 at 18:05
  • @Shahab_HK It turns out I was wrong to believe that $O(k)$ memory and $O(n)$ time didn't exist; see the other answer. – orlp Jan 01 '17 at 19:16
  • @orlp: How do you do the median-of-medians selection algorithm in $O(\log n)$ memory and $O(n)$ time? – Jan 02 '17 at 05:15
  • @RickyDemer That's outside the scope of this question; I refer you to the Wikipedia articles on median of medians and introselect. – orlp Jan 02 '17 at 06:42
  • @orlp: I have asked a question about that on this site. – Jan 04 '17 at 07:08