
I'm trying to understand why quicksort using Lomuto partition and a fixed pivot is performing erratically, but overall poorly, on randomly generated inputs. I'm thinking that even though the inputs are randomly generated, there may be a lot of order to the sequences, but I'm not sure how to measure the level of disorder in the sequences. I thought about using the number of inversions, but I saw from this other question I asked that that's not really a good measure in this case.

The reason I suspect that my random sequences have a lot of "order" to them is that randomizing the pivot fixes the performance problem. But theoretically there shouldn't be any performance problem on these supposedly "random" input sequences.

Robert S. Barnes
  • One good measure of disorder for this sort of thing is Kolmogorov complexity. It basically says that the strings that are most disordered are the ones that are incompressible. It leads to the incompressibility method, which has been used to do things like average-case analysis of sorting algorithms, and finding the relation between average and worst-case analysis. – Peter May 07 '13 at 12:05
  • I should note that I'm an undergrad... I was looking for something a bit more straightforward, like maybe one of the measures in this paper (I just don't know which one): http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.8017 – Robert S. Barnes May 08 '13 at 07:02
  • You should suspect a programming error rather than an adversarial pivot case. Just sort a scrambled sequence of integers from 1 to N to see if your algorithm sorts! –  Dec 05 '15 at 22:47
  • @YvesDaoust I don't think that really matters. The amount of "non-monotonicity" is really just the Kolmogorov complexity of the string of length $\log n!$ that encodes the ordering of the elements in the sequence. Of course, it's not computable, and you have to think about deep strings like pseudorandom ones, but it's useful in the sense that every measure of disorder is essentially an approximation of the Kolmogorov complexity. And you don't need to compute it to prove things with it. Many complexity results have been shown with the incompressibility method. – Peter Dec 06 '15 at 07:56
  • @YvesDaoust That's why I posted it as a comment and not an answer. Kolmogorov complexity is the relevant way to understand randomness, if you're willing to go that deep. I'm well aware of the complexity of pseudorandom strings. If you're approximating Kolmogorov complexity, the strings will look random to your approximation, and if you're doing an analysis with proper Kolmogorov complexity, you can assume that the strings are truly random. – Peter Dec 07 '15 at 11:45
  • @YvesDaoust No, not even a little bit. I don't even agree that it's a programming question. The OP wants to understand the role that the randomness of the input plays in the performance of this algorithm. That gives you two problems: defining randomness and measuring it. K complexity is the only reasonable answer to the first. Yes, it's incomputable, but that just means you need an approximation to measure it empirically. As it is, I don't know whether the OP wants to measure randomness empirically or analytically. – Peter Dec 07 '15 at 12:06

1 Answer


Lomuto vs Hoare
Lomuto partition suffers when sorting equal keys, whilst Hoare partition does not.
Both partition schemes suffer equally when using a pivot distant from the median.
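
The equal-keys difference can be seen by counting comparisons. The following is only a minimal sketch, not the asker's code: the function names are made up for illustration, the Lomuto variant is assumed to use the last element as its fixed pivot and the Hoare variant the middle element, and an explicit stack replaces recursion so that deep, unbalanced splits do not hit Python's recursion limit.

```python
def lomuto_sort(a):
    """Sort `a` in place using Lomuto partition; return the number of key comparisons."""
    comparisons = 0
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = a[hi]  # assumption: fixed pivot = last element
        i = lo
        for j in range(lo, hi):
            comparisons += 1
            if a[j] <= pivot:  # equal keys all land on the left side
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]  # move pivot to its final position
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return comparisons


def hoare_sort(a):
    """Sort `a` in place using Hoare partition; return the number of key comparisons."""
    comparisons = 0
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = a[(lo + hi) // 2]  # assumption: fixed pivot = middle element
        i, j = lo - 1, hi + 1
        while True:
            i += 1
            comparisons += 1
            while a[i] < pivot:  # stops on keys equal to the pivot
                i += 1
                comparisons += 1
            j -= 1
            comparisons += 1
            while a[j] > pivot:  # stops on keys equal to the pivot
                j -= 1
                comparisons += 1
            if i >= j:
                break
            a[i], a[j] = a[j], a[i]
        stack.append((lo, j))
        stack.append((j + 1, hi))
    return comparisons


if __name__ == "__main__":
    n = 200
    print("Lomuto, all keys equal:", lomuto_sort([7] * n))
    print("Hoare,  all keys equal:", hoare_sort([7] * n))
```

On an input where every key is equal, each Lomuto partition in this sketch peels off only one element, so it performs n(n-1)/2 comparisons, while the Hoare scans stop on equal keys and meet near the middle, giving balanced splits.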

Measure of disorder
The measure of disorder to choose for the purposes of quicksort is simple.
A: How far removed from the median is the fixed pivot, compared to random data?
If you insist on using Lomuto partition and assume duplicate values are allowed, you need to add the following test against randomness:
B: How many duplicate elements are there, compared to random data?

Of course, it is rather silly to assume that duplicate values are allowed in your data set and still evaluate Lomuto partition. You should probably eliminate duplicates beforehand, switch to Hoare partition, or assume duplicates are rare.

Both measures are trivial to quantify using statistics.
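
One possible way to quantify measures A and B is sketched below. This is an illustration under stated assumptions, not a standard statistic: `pivot_offset` and `duplicate_fraction` are hypothetical helper names, the fixed pivot is assumed to be the last element unless `pivot_index` says otherwise, and ties are handled with a mid-rank.

```python
def pivot_offset(seq, pivot_index=-1):
    """Measure A: distance of the pivot's relative rank from the median.

    Returns a value in [0.0, 0.5]: 0.0 means the pivot is the median,
    0.5 means it is the minimum or maximum. Requires len(seq) >= 2.
    """
    pivot = seq[pivot_index]
    smaller = sum(1 for x in seq if x < pivot)
    equal = sum(1 for x in seq if x == pivot)  # includes the pivot itself
    # Mid-rank among ties, scaled so the minimum maps to 0 and the maximum to 1.
    rank = (smaller + (equal - 1) / 2) / (len(seq) - 1)
    return abs(rank - 0.5)


def duplicate_fraction(seq):
    """Measure B: fraction of elements that repeat an earlier value."""
    return 1 - len(set(seq)) / len(seq)


if __name__ == "__main__":
    import random

    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    offsets = []
    for _ in range(1000):
        seq = rng.sample(range(10_000), 101)  # distinct values in random order
        offsets.append(pivot_offset(seq))
    print("mean pivot offset:", sum(offsets) / len(offsets))
```

For distinct keys in uniformly random order, the pivot's relative rank is uniform, so the mean offset over many sequences should come out close to 0.25. A mean well above that on your generated inputs would suggest the fixed pivot position is systematically far from the median. The same two functions can also be applied to the subarrays produced during the sort, since the pivot quality matters at every level of recursion.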

We can rule out pathological data
Any other deviations from randomness will not matter for the purposes of analysing quicksort. As long as the pivot is close to the median it will perform well on all data that is not pathological.
The distance from random would have to be great indeed to be quicksort-pathological, so we can rule that out.

Never use any fixed pivot(s) in real code
Note that if you write real code with a fixed pivot*) (whatever that pivot may be), you open yourself up to a denial-of-service attack: an attacker can place a pathological value at exactly that position. You should therefore always choose a random element as the pivot.

*) or multiple pivots if you choose best of x pivots.
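
A randomized pivot can be added to Lomuto partition by swapping a randomly chosen element into the pivot position before partitioning. The sketch below assumes that approach; `quicksort_random_pivot` and its `rng` parameter are names chosen for illustration, and an explicit stack is used instead of recursion.

```python
import random


def quicksort_random_pivot(a, rng=random):
    """Sort `a` in place with Lomuto partition and a randomly chosen pivot.

    `rng` is any object with a `randint(lo, hi)` method (inclusive bounds),
    for example the `random` module or a `random.Random` instance.
    """
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        # Pick a random position and move that element into the pivot slot.
        k = rng.randint(lo, hi)
        a[k], a[hi] = a[hi], a[k]
        pivot = a[hi]
        i = lo
        for j in range(lo, hi):
            if a[j] <= pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return a


if __name__ == "__main__":
    data = list(range(2000))  # sorted input: worst case for a fixed last-element pivot
    quicksort_random_pivot(data)
    print(data[:5], data[-5:])
```

Python's default `random` generator is not designed to be unpredictable to an attacker, so if the denial-of-service concern is the motivation, passing `random.SystemRandom()` as `rng` may be a better fit. Random pivots do not help with the equal-keys weakness of Lomuto partition described earlier.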

Johan