Regarding the median of medians solution for finding the $k$th smallest element in an array, why does the algorithm split the array into $n/5$ subarrays (each of length 5), where $n$ is the length of the initial array? Why not $n/7$ or $n/3$? Why 5?
1 Answer
Along with the explanation given on Wikipedia, I'll try to give more visual examples. The main point is that splitting the array into $\frac{n}{3}$ subarrays (groups of 3 elements) makes the algorithm non-linear.
We can actually check the 3 scenarios you discuss. I'm going to be referring to area diagrams similar to the one depicted on Wikipedia. These area diagrams are a useful abstraction for seeing how large a subproblem we might have to recurse on in the worst case. The area diagrams will look something like this:
Here $M$ is the median of medians, and $(< M)$ and $(> M)$ represent the areas (the number of values in the array) that are potentially less than $M$ and greater than $M$ respectively, in the worst/best case. Note that in the area diagrams I present, I say the area is roughly (~) some value. This is because it may be off by one or two, and I am largely ignoring these small constants because they are insignificant to the analysis. If you are bothered by this, you can assume $n$ is of a convenient form such that the values are exact.
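As a rough sanity check on these areas, here is a small Python sketch. The function name `worst_case_fraction` and the restriction to odd group sizes are my own assumptions, not something taken from Wikipedia or the original paper. It counts the elements guaranteed to lie on one side of $M$: about half of the groups have a median at most $M$, and each of those groups contributes a little more than half of its elements. Small additive constants are ignored, as in the diagrams.

```python
from fractions import Fraction


def worst_case_fraction(group_size: int) -> Fraction:
    """Largest fraction of the n elements that can end up on one side of M.

    Assumes an odd group size >= 3 and ignores off-by-one effects.
    """
    if group_size < 3 or group_size % 2 == 0:
        raise ValueError("group_size is assumed to be odd and at least 3")
    # Roughly half of the n/group_size groups have a median <= M, and each
    # such group holds (group_size + 1) / 2 elements that are <= M.
    guaranteed_other_side = Fraction(group_size + 1, 2) * Fraction(1, 2 * group_size)
    return 1 - guaranteed_other_side


for g in (3, 5, 7):
    print(g, worst_case_fraction(g))
```

Under these assumptions the fractions come out as $\frac{2}{3}$, $\frac{7}{10}$ and $\frac{5}{7}$ for groups of 3, 5 and 7, which are the sizes of the larger recursive call in each scenario.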
For $\frac{n}{3}$ subarrays of size 3 we will end up with an area diagram like this:
It is clear that in the worst case we will have to recurse on roughly $\frac{4n}{6}$ of the elements in the array. This gives us the recurrence relation:
$$\begin{align} T(n) & = T\left(\frac{n}{3}\right) + T\left(\frac{4n}{6}\right) + cn\\ & = T\left(\frac{n}{3}\right) + T\left(\frac{2n}{3}\right) + cn \end{align}$$
This actually turns out to be $\Theta(n \log n)$. I won't go through an explicit proof of this recurrence because I'm sure it's been done before. If you wish to prove it to yourself, you can use a similar approach to this, or use the recursion tree method, or use the Akra-Bazzi method. So this subarray size won't work, because we now have non-linear time complexity.
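For reference, here is a sketch of the recursion-tree argument, ignoring floors and ceilings. In the tree for $T(n) = T\left(\frac{n}{3}\right) + T\left(\frac{2n}{3}\right) + cn$, the subproblem sizes on any level sum to at most $n$, so each level costs at most $cn$, and every complete level costs exactly $cn$. The shortest root-to-leaf path always takes the $\frac{n}{3}$ branch and has about $\log_3 n$ levels. The longest always takes the $\frac{2n}{3}$ branch and has about $\log_{3/2} n$ levels. Up to lower-order terms this gives
$$cn \log_3 n \le T(n) \le cn \log_{3/2} n,$$
so $T(n) = \Theta(n \log n)$.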
For $\frac{n}{5}$ subarrays of size 5 we will end up with an area diagram like this:
We similarly get a worst-case recurrence of the following:
$$T(n) = T\left(\frac{n}{5}\right) + T\left(\frac{7n}{10}\right) + cn$$
This is linear $O(n)$! You can use this method directly to prove it is linear.
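One way to check this is the substitution method. I am choosing it for illustration, and it is not necessarily the method linked above. Guess $T(n) \le an$ for some constant $a$ and plug the guess into the recurrence:
$$\begin{align} T(n) & \le a\frac{n}{5} + a\frac{7n}{10} + cn\\ & = \left(\frac{9a}{10} + c\right)n\\ & \le an \quad \text{whenever } a \ge 10c \end{align}$$
The guess goes through because $\frac{1}{5} + \frac{7}{10} = \frac{9}{10} < 1$. With groups of 3 the two fractions sum to exactly 1, so no constant $a$ can absorb the extra $cn$ term.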
For $\frac{n}{7}$ subarrays of size 7 we will end up with an area diagram like this:
We similarly get a worst-case recurrence of the following:
$$T(n) = T\left(\frac{n}{7}\right) + T\left(\frac{10n}{14}\right) + cn$$
This is also linear! Again, you can use the method I described above to prove this.
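To see the difference numerically, the sketch below evaluates the three worst-case recurrences with $c = 1$ and integer floors. The helper `make_recurrence` and the base case $T(n) = n$ for $n \le 1$ are assumptions of mine, chosen only to make the recursion concrete.

```python
from functools import lru_cache


def make_recurrence(small, large):
    """Build T(n) = T(floor(small * n)) + T(floor(large * n)) + n.

    `small` and `large` are (numerator, denominator) pairs.
    Assumed base case: T(n) = n for n <= 1.
    """
    (a, b), (p, q) = small, large

    @lru_cache(maxsize=None)
    def T(n):
        if n <= 1:
            return n
        return T(n * a // b) + T(n * p // q) + n

    return T


# Keys are the group sizes; values are the matching worst-case recurrences.
recurrences = {
    3: make_recurrence((1, 3), (2, 3)),
    5: make_recurrence((1, 5), (7, 10)),
    7: make_recurrence((1, 7), (5, 7)),
}

for n in (10**3, 10**4, 10**5, 10**6):
    ratios = {g: round(T(n) / n, 2) for g, T in recurrences.items()}
    print(n, ratios)
```

With this setup, the ratio $T(n)/n$ should keep growing for groups of 3, while it stays at most 10 for groups of 5 and at most 7 for groups of 7.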
So in conclusion
Dividing the array into $\frac{n}{3}$ subarrays will not be good because the resulting algorithm is non-linear! Dividing it into $\frac{n}{5}$ or $\frac{n}{7}$ subarrays will be very good because the algorithm is then linear! You can actually go on to use $\frac{n}{9}, \frac{n}{11}, \ldots$ and still get linear time!
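For experimenting with different group sizes, here is a minimal Python sketch of median-of-medians selection. The function name `select`, the 0-indexed `k`, and the three-way partition using list comprehensions are my own choices for readability, not the original paper's formulation.

```python
def select(values, k, group_size=5):
    """Return the k-th smallest element of values (k is 0-indexed)."""
    if group_size < 3:
        raise ValueError("group_size is assumed to be at least 3")
    items = list(values)
    if not 0 <= k < len(items):
        raise IndexError("k must satisfy 0 <= k < len(values)")

    while len(items) > group_size:
        # Split into consecutive groups and take the median of each one.
        groups = [items[i:i + group_size] for i in range(0, len(items), group_size)]
        medians = [sorted(g)[len(g) // 2] for g in groups]
        # Recursively pick the median of the medians as the pivot.
        pivot = select(medians, len(medians) // 2, group_size)

        lower = [x for x in items if x < pivot]
        upper = [x for x in items if x > pivot]
        n_equal = len(items) - len(lower) - len(upper)

        if k < len(lower):
            items = lower
        elif k < len(lower) + n_equal:
            return pivot
        else:
            k -= len(lower) + n_equal
            items = upper

    # Small enough to sort directly.
    return sorted(items)[k]


data = [(i * 7919) % 1009 for i in range(500)]
print(select(data, 250, group_size=5) == sorted(data)[250])
```

The result is the same for any group size; only the worst-case running time changes, as analyzed above.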

- You may want to have a look at this paper. This paper claims it is possible to use group size 3 or 4 while maintaining the worst case linear running time. – fade2black Oct 23 '17 at 21:50
- @fade2black, "The question whether the original select algorithm runs in linear time with groups of 3 remains open at the time of this writing." It appears they use a variant of selection (namely, "Repeated Step") that allows for groups of three in linear time and a variant of selection (namely, "Shifting Target") that allows for groups of four in linear time. I get the argument that they fail to present an input sequence for the standard select algorithm that successively only removes $1/3$ of the elements, but I feel this is not necessary to answer the question as it is still $O(n\log n)$ – ryan Oct 23 '17 at 22:09
- @fade2black, the argument they present is interesting; however, I am curious what results have been developed toward constructing a structured input that removes only $1/3$ of the elements in each successive call. – ryan Oct 23 '17 at 22:14
- A very good explanation, sir, thank you! But if any number of the form $n/(2k+1), k\in\mathbb{N}, k\geq2$ is a good division example, may I suppose that we pick $n/5$ out of convenience? – theSongbird Oct 24 '17 at 05:46
- @theSongbird, it's probably largely a convention at this point. The original paper simply says the array should be divided into columns of length $n/c$ where $c \geq 5$. I would assume 5 is usually chosen because it is the smallest constant (and therefore means less work sorting) that gives linear time overall. – ryan Oct 26 '17 at 01:45