Regarding the median-of-medians solution for finding the $k$th smallest element in an array, why does the algorithm split the array into $n/5$ subarrays (groups of 5 elements), where $n$ is the length of the initial array? Why not $n/7$ or $n/3$? Why 5?

theSongbird

1 Answer

Along with the explanation given on Wikipedia, I'll try to give more visual examples. The main point is that splitting into $\frac{n}{3}$ subarrays (groups of 3) breaks the usual linear-time analysis: the worst-case recurrence only gives an $O(n \log n)$ bound.

We can actually work through the three scenarios you discuss. I'm going to refer to area diagrams similar to the one depicted on Wikipedia. These area diagrams are a useful abstraction for seeing how large a subproblem we might have to recurse on in the worst case. The area diagrams will look something like this:

[Figure: general area diagram]

Here $M$ is the median of medians, and $(< M)$ and $(> M)$ represent the areas (the number of values in the array) that are potentially less than $M$ and greater than $M$, respectively, in the worst/best case. Note that in the area diagrams I present, I say the area is roughly (~) some value. This is because it may be off by one or two, and I am largely ignoring these small constants because they are insignificant to the analysis. If this bothers you, you can assume $n$ is of a convenient form such that the values are exact.
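As a rough sketch of where these areas come from, suppose the array is split into groups of $g$ elements with $g$ odd (the symbol $g$ is introduced here purely for illustration, and small constants are ignored as above). About half of the $\frac{n}{g}$ group medians are at most $M$, and each of those groups contains $\frac{g+1}{2}$ elements that are at most its own median, so

$$\#\{x \le M\} \gtrsim \frac{g+1}{2} \cdot \frac{n}{2g} = \frac{(g+1)\,n}{4g},$$

and symmetrically for the elements that are at least $M$. In the worst case we therefore recurse on at most roughly

$$n - \frac{(g+1)\,n}{4g} = \frac{(3g-1)\,n}{4g}$$

elements. Plugging in $g = 3, 5, 7$ gives $\frac{2n}{3}$, $\frac{7n}{10}$ and $\frac{5n}{7}$, which are the worst-case sizes used in the three scenarios.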

For groups of 3 (that is, $\frac{n}{3}$ subarrays), we end up with an area diagram like this:

[Figure: area diagram for the $\frac{n}{3}$ case]

It is clear that in the worst case we will have to recurse on roughly $\frac{4n}{6}$ elements of the array. This gives us the recurrence relation:

$$\begin{align} T(n) & = T\left(\frac{n}{3}\right) + T\left(\frac{4n}{6}\right) + cn\\ & = T\left(\frac{n}{3}\right) + T\left(\frac{2n}{3}\right) + cn \end{align}$$

This actually turns out to be on the order of $O(n \log n)$. I won't go through an explicit proof of this recurrence because I'm sure it's been done before. If you wish to prove it to yourself, you can use a similar approach to this, or use the recursion tree method, or use the Akra-Bazzi method. So this group size won't work with this analysis, because the bound we get is no longer linear.
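The difference between the recurrences can also be checked numerically. The following is a minimal Python sketch, not a proof: the helper `make_recurrence`, the base case $T(n) = n$ for $n \le 1$, the use of floors, and the cost constant $c = 1$ are all assumptions made here for illustration.

```python
from functools import lru_cache


def make_recurrence(p1, q1, p2, q2):
    """Build T(n) = T(floor(n*p1/q1)) + T(floor(n*p2/q2)) + n.

    The base case T(n) = n for n <= 1 and the unit cost c = 1 are
    illustrative assumptions, not part of the original analysis.
    """
    @lru_cache(maxsize=None)
    def T(n):
        if n <= 1:
            return n
        # Integer arithmetic avoids floating-point rounding in the floors.
        return T(n * p1 // q1) + T(n * p2 // q2) + n

    return T


T3 = make_recurrence(1, 3, 2, 3)    # groups of 3: T(n/3) + T(2n/3) + n
T5 = make_recurrence(1, 5, 7, 10)   # groups of 5: T(n/5) + T(7n/10) + n
T7 = make_recurrence(1, 7, 5, 7)    # groups of 7: T(n/7) + T(5n/7) + n

if __name__ == "__main__":
    # Print T(n)/n for growing n; a linear recurrence keeps this ratio bounded.
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, round(T3(n) / n, 2), round(T5(n) / n, 2), round(T7(n) / n, 2))
```

Under these assumptions the ratio $T(n)/n$ should keep growing for the groups-of-3 recurrence, while it stays below 10 for groups of 5 and below 7 for groups of 7.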

For groups of 5 ($\frac{n}{5}$ subarrays), we end up with an area diagram like this:

[Figure: area diagram for the $\frac{n}{5}$ case]

We similarly get a worst-case recurrence of the following:

$$T(n) = T\left(\frac{n}{5}\right) + T\left(\frac{7n}{10}\right) + cn$$

This is linear $O(n)$! You can use this method directly to prove it is linear.
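As a hedged sketch of one way the substitution argument can go (the constant $10$ is a choice made here for illustration, and floors and base cases are ignored): assume $T(m) \le 10cm$ for all $m < n$. Then

$$\begin{align} T(n) & \le 10c \cdot \frac{n}{5} + 10c \cdot \frac{7n}{10} + cn\\ & = 2cn + 7cn + cn\\ & = 10cn, \end{align}$$

so the bound carries over to $n$ and $T(n) = O(n)$. The argument goes through because $\frac{1}{5} + \frac{7}{10} = \frac{9}{10} < 1$.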

For groups of 7 ($\frac{n}{7}$ subarrays), we end up with an area diagram like this:

[Figure: area diagram for the $\frac{n}{7}$ case]

We similarly get a worst-case recurrence of the following:

$$T(n) = T\left(\frac{n}{7}\right) + T\left(\frac{10n}{14}\right) + cn$$

This is also linear! Again, you can use the method I described above to prove this.
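More generally, here is a sketch of the condition under which a linear guess works (the symbols $\alpha$, $\beta$ and $d$ are introduced here for illustration, and floors and base cases are ignored). For a recurrence $T(n) = T(\alpha n) + T(\beta n) + cn$, guessing $T(m) \le dcm$ for all $m < n$ gives

$$T(n) \le dc\,\alpha n + dc\,\beta n + cn = \bigl(d(\alpha + \beta) + 1\bigr)\,cn,$$

which is at most $dcn$ exactly when $d \ge \frac{1}{1 - (\alpha + \beta)}$. For groups of 7, $\alpha + \beta = \frac{1}{7} + \frac{10}{14} = \frac{6}{7}$, so $d = 7$ works. For groups of 3, $\alpha + \beta = \frac{1}{3} + \frac{2}{3} = 1$, so no finite $d$ satisfies the condition, which is why this argument cannot give a linear bound there.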

So, in conclusion:

Dividing the array into $\frac{n}{3}$ subarrays (groups of 3) will not be good because the bound is non-linear! Dividing it into $\frac{n}{5}$ or $\frac{n}{7}$ subarrays will be very good because it is linear! You can actually go on to use $\frac{n}{9}, \frac{n}{11}, \ldots$ and still get linear time!
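To experiment with different group sizes, here is a minimal Python sketch of median-of-medians selection. The function name `select` and its `group_size` parameter are hypothetical, and it partitions into new lists rather than in place, so it illustrates the idea rather than reproducing the original paper's algorithm.

```python
def select(values, k, group_size=5):
    """Return the k-th smallest element (0-indexed) of values.

    Illustrative sketch: the pivot is the median of the group medians,
    and group_size plays the role of the 3, 5 or 7 discussed above.
    """
    if group_size < 3:
        raise ValueError("group_size must be at least 3")
    items = list(values)
    if not 0 <= k < len(items):
        raise IndexError("k out of range")

    while len(items) > group_size:
        # Sort each group and take its median.
        groups = (sorted(items[i:i + group_size])
                  for i in range(0, len(items), group_size))
        medians = [g[len(g) // 2] for g in groups]

        # Recursively pick the median of the medians as the pivot.
        pivot = select(medians, len(medians) // 2, group_size)

        # Three-way partition around the pivot (handles duplicates).
        lower = [x for x in items if x < pivot]
        upper = [x for x in items if x > pivot]
        n_equal = len(items) - len(lower) - len(upper)

        if k < len(lower):
            items = lower
        elif k < len(lower) + n_equal:
            return pivot
        else:
            k -= len(lower) + n_equal
            items = upper

    # Small inputs are solved directly by sorting.
    return sorted(items)[k]
```

For example, `select(data, len(data) // 2, group_size=7)` returns the upper median of `data` using groups of 7. The result is the same for any valid group size; only the worst-case running time analysis changes.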

ryan
  • You may want to have a look at this paper. This paper claims it is possible to use group size 3 or 4 while maintaining the worst case linear running time. – fade2black Oct 23 '17 at 21:50
  • @fade2black, "The question whether the original select algorithm runs in linear time with groups of 3 remains open at the time of this writing." It appears they use a variant of selection (namely, "Repeated Step") that allows for groups of three in linear time and a variant of selection (namely, "Shifting Target") that allows for groups of four in linear time. I get the argument that they fail to present an input sequence for the standard select algorithm that successively only removes $1/3$ of the elements, but I feel this is not necessary to answer the question as it is still $O(n\log n)$ – ryan Oct 23 '17 at 22:09
  • @fade2black, the argument they present is interesting; however, I am curious what results have been developed in attempting to construct a structured input that removes only $1/3$ of the elements in each successive call. – ryan Oct 23 '17 at 22:14
  • A very good explanation, sir, thank you! But if any number of the form $n/(2k+1), k\in\mathbb{N}, k\ge2$ is a good division example, may I suppose that we pick $n/5$ out of convenience? – theSongbird Oct 24 '17 at 05:46
  • @theSongbird, it's probably largely a convention at this point. The original paper simply says the array should be divided into columns of length $n/c$ where $c \geq 5$. I would assume 5 is usually chosen because it is the smallest constant (therefore less work on sorting) that gives linear time overall. – ryan Oct 26 '17 at 01:45