
What follows is my algorithm for doing this in what I believe to be $O(n)$ time, along with my proof of that bound. My professor disagrees that it runs in $O(n)$ time and instead thinks that it runs in $\Omega(n^2)$ time. Any comments regarding the proof itself, or its style (i.e., my ideas may be clear but the presentation not), would be appreciated.

The original question:

Given $n$ numbers, find the $m \leq n$ largest among them in time $o(n \log n)$. You may not assume anything else about $m$.

My answer:

  1. Sort the first $m$ elements of the array. This takes $O(1)$ time, as the cost depends only on $m$, not $n$.
  2. Store them in a linked list, maintaining the sorted order. This also takes $O(1)$ time, for the same reason as above.
  3. Test every other element of the array against the least element of the linked list. This takes $O(n)$ time, as at most $n$ comparisons must be done.
  4. If the element is in fact greater, delete the first element of the linked list (the least one) and insert the new element at the position that keeps the list sorted. This takes $O(1)$ time because it is bounded above by a constant ($m$), as the list does not grow.
  5. Therefore, the total complexity for the algorithm is $O(n)$.
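The steps above can be sketched in Python (a sketch of my reading of the algorithm; a sorted Python list with `bisect.insort` stands in for the sorted linked list):

```python
import bisect

def m_largest(arr, m):
    """Keep the m largest elements seen so far in a sorted list.

    Steps 1-2: sort the first m elements (cost depends only on m).
    Steps 3-4: for each remaining element, compare it against the
    current minimum and, if larger, replace the minimum, re-inserting
    in sorted order.
    """
    window = sorted(arr[:m])          # steps 1-2
    for x in arr[m:]:                 # step 3: n - m comparisons
        if x > window[0]:             # window[0] is the least element
            window.pop(0)             # step 4: drop the least ...
            bisect.insort(window, x)  # ... and insert in sorted order
    return window

# m_largest([3, 1, 4, 1, 5, 9, 2, 6], 3) -> [5, 6, 9]
```

Note that `pop(0)` and `bisect.insort` each cost $O(m)$ per update, which matters for the complexity debate below when $m$ is not treated as a constant.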

I am aware that using a red-black tree instead of a linked list is more efficient in the constant terms (each update costs $O(\log_2 m)$ rather than $O(m)$), and the problem of keeping a pointer to the least element of the tree (to facilitate the comparisons) is eminently doable; it just didn't occur to me at the time.

What is my proof missing? Is there a more standard way of presenting it (even if it is incorrect)?

Raphael
soandos
  • "better than $O(n\log n)$" -- that is rough. I assume you mean $\Theta(n \log n)$? Another problem with the question is that it is not clear whether $m$ is fixed. – Raphael Apr 24 '12 at 18:38
  • @Raphael, I am sending you what I have. No idea what he meant. Assume that it really is $O(n \cdot \log_2(n))$. With regard to whether or not $m$ is fixed, I have no idea. When I calculated the complexity, I assumed that it was, but there was no basis for that assumption. – soandos Apr 24 '12 at 19:03
  • There are no $n$'s or $m$'s in the original problem. Based on context, I'm assuming you want to find the largest $m$ numbers in an array of $n$ numbers. EDIT: nevermind, it's in the question title. – Joe Apr 24 '12 at 19:56
  • @soandos: Then ask your professor; neither you nor we can be expected to work with half a question. By the way, the actual wording you quote has another problem: every algorithm is $O(1)$ if you fix your input size to 10 billion. – Raphael Apr 24 '12 at 20:22
  • @Raphael exactly. if all you're given is the quote in the original question, you have to clarify the question just to determine what $n$ is! – Joe Apr 24 '12 at 20:31
  • @Raphael, and others, see update – soandos Apr 24 '12 at 20:36
  • @soandos: I incorporated your update into the "quote". The answers by Louis and Joe address your problem sufficiently (imho). And unforgiven solves the exercise, of course. – Raphael Apr 24 '12 at 20:41
  • Agreed, thanks for helping with the clarity – soandos Apr 24 '12 at 20:43
  • @soandos If what you wanted was a solution to the exercise (thanks, unforgiven), then you should have just asked for it from the start. Instead, you asked something like: "is my algorithm $O(n)$ and how can I improve my proof?" – Joe Apr 25 '12 at 21:25

3 Answers


Here is an $O(n)$ algorithm solving the problem.

  1. Use a worst-case $O(n)$ selection algorithm to determine the $(n-m+1)$-th order statistic. Let $k$ be this number; it is the smallest of the $m$ largest numbers we are trying to determine.

  2. Now partition the array around the pivot $k$ using the QuickSort partition function. This step takes $O(n)$ time too.

  3. Output the $m$ largest numbers: these are $k$ and all of the numbers in the upper subarray produced by the partition in step 2.
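A sketch of these steps in Python. Caveat: for brevity this uses randomized quickselect, which is expected rather than worst-case $O(n)$; the answer's worst-case bound needs a deterministic selection algorithm such as median-of-medians. The list comprehension stands in for the in-place QuickSort partition.

```python
import random

def quickselect(arr, k):
    """Return the k-th smallest element (1-indexed) of arr.

    Randomized pivoting gives expected O(n) time; swapping in
    median-of-medians pivot selection would give worst-case O(n).
    """
    pivot = random.choice(arr)
    lo = [x for x in arr if x < pivot]
    hi = [x for x in arr if x > pivot]
    eq = len(arr) - len(lo) - len(hi)   # count of elements equal to pivot
    if k <= len(lo):
        return quickselect(lo, k)
    if k <= len(lo) + eq:
        return pivot
    return quickselect(hi, k - len(lo) - eq)

def m_largest_select(arr, m):
    n = len(arr)
    # Step 1: k is the (n - m + 1)-th order statistic.
    k = quickselect(arr, n - m + 1)
    # Step 2: partition around k (a linear scan here).
    larger = [x for x in arr if x > k]
    # Step 3: k (repeated if there are ties) plus the upper part
    # gives the m largest numbers.
    return larger + [k] * (m - len(larger))
```

Both phases are linear scans, so the whole thing is $O(n)$ regardless of $m$.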

Massimo Cafaro

Your algorithm takes $\Theta(n + mn)$ time. I suspect that your professor is looking for something that takes $O(n+ n\log m)$ time, which should be possible, maybe by using a heap...

The source of your disagreement with the professor is that he or she doesn't appear to think $m$ is a constant, despite how the question is worded. If it's not, then $\Theta(m)$ is a lot worse than $\Theta(\log m)$.
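A possible heap-based sketch of the $O(n + n \log m)$ approach hinted at above, assuming Python's `heapq` (a min-heap) in place of the linked list; each of the $n - m$ updates then costs $O(\log m)$ instead of $O(m)$:

```python
import heapq

def m_largest_heap(arr, m):
    """Return the m largest elements in O(m + n log m) time.

    The min-heap's root is the smallest of the m current candidates,
    so the comparison in the original algorithm becomes a peek at
    heap[0], and the delete-then-insert becomes heapreplace.
    """
    heap = arr[:m]
    heapq.heapify(heap)                 # O(m)
    for x in arr[m:]:                   # n - m iterations ...
        if x > heap[0]:                 # ... each O(1) to compare
            heapq.heapreplace(heap, x)  # ... and O(log m) to update
    return heap                         # the m largest, in heap order
```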

Louis

Correctness
Missing from your presentation is the loop invariant to establish correctness: you maintain the largest $m$ elements encountered so far in a linked list. Thus, by the end of your algorithm, you have tested all the elements, and so you have the largest $m$ elements in the array. Your algorithm is still correct, but stating the purpose of the linked list at the beginning of the description makes its correctness more explicit.

Running Time
$\log_{10}(10 \mbox{ billion}) = 10$ (with base 2, it's about 33), which is a heck of a lot smaller than 10 million. In the example given, $m$ is many times larger than $\log n$, so I don't think you can safely assume that $m$ is constant.

You should provide an algorithm that is $o(n \log n)$ as long as $m = o(n)$. Replacing your linked list and linear search with a balanced binary search tree or a min-heap achieves this running time: $O(m \log m + n \log m) = O(n \log m) = o(n \log n)$ (assuming $m = o(n)$; otherwise your running time is $\Theta(n \log n)$).
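As an aside (not part of the answer): Python's standard library already packages this heap-based strategy. `heapq.nlargest` maintains a size-$m$ min-heap internally and runs in $O(n \log m)$ time:

```python
import heapq

# heapq.nlargest keeps a min-heap of the m largest elements seen so
# far, in O(n log m) time, returning them in descending order.
print(heapq.nlargest(3, [3, 1, 4, 1, 5, 9, 2, 6]))  # [9, 6, 5]
```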

In case you're not familiar with the notation, the intuition behind $o(n \log n)$ is $< O(n \log n)$, but see the corresponding question on cs.SE for details of the notation.

Joe