
I've implemented an algorithm that, according to my analysis, should run in $O(n \log n)$ time.

However, when I plot the computation time against the cardinality of the input set, it looks roughly linear, and computing $R^2$ more or less confirms this. To sanity-check myself I then plotted $n$ on the $x$-axis against $n \log_2 n$ on the $y$-axis with Python, and that also looked linear. Computing $R^2$ (scipy.stats.linregress) confuses me further, as I get $R^2 = 0.9995811978450471$ when my $x$ and $y$ data are created like this:

import math

x, y = [], []
for n in range(2, 10000000):
    x.append(n)
    y.append(n * math.log2(n))
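The $R^2$ value above then comes from a call along these lines (a sketch; I square the r value that linregress returns):

from scipy import stats

# x, y as built above
result = stats.linregress(x, y)
print(result.rvalue ** 2)   # ~0.9996, i.e. a near-perfect linear fit to n*log2(n)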

Am I missing something fundamental? Am I using too few iterations for it to matter? When looking at the graph at http://bigocheatsheet.com/ it does not seem linear at all.

Andreas V.

1 Answer


Just some general observations.

  • $O(n \log n)$ is only an upper bound. If it's not tight, that's your explanation right there.
  • A $\Theta(n \log n)$ running time can have many different components, for instance

    $\qquad\displaystyle a \cdot n\log n + b \cdot n \log \log n + c \cdot \sqrt n + d \cdot n + e \cdot \log n + f$

    While technically the linearithmic term dominates, if $a$ is small compared to the other coefficients you will have a hard time detecting it.

  • Measuring wall-clock running time is endlessly noisy, in particular because the coefficients mentioned above get skewed by platform details. Try investigating counts instead, for instance of a dominant operation or block (see the sketches after this list).
  • Linear regression always "works". Since the "difference" between $n \log n$ and $n$ is rather small (also considering the point above), it's not surprising that you get a high $R^2$. Run linearithmic regression and compare (again, see the sketches after this list)!
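
To make the counting suggestion concrete, here is a minimal sketch; merge sort is only a stand-in for whatever algorithm you implemented, and the comparison count plays the role of the "dominant operation":

import random

def merge_sort(a, counter):
    # Sorts a copy of `a`, incrementing counter[0] once per element comparison.
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    left = merge_sort(a[:mid], counter)
    right = merge_sort(a[mid:], counter)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1                  # the dominant operation we count
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

for n in [2 ** k for k in range(10, 18)]:
    counter = [0]
    merge_sort([random.random() for _ in range(n)], counter)
    print(n, counter[0])                 # grows like n*log2(n), with no timing noise

Unlike wall-clock times, these counts are deterministic up to the random input, so the growth rate is much easier to read off.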
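And to make the regression suggestion concrete, a sketch that fits both a linear and a linearithmic model to the same data (synthetic here, standing in for your measurements) and compares the two fits via least squares:

import numpy as np

n = np.arange(2.0, 100000.0)
y = n * np.log2(n)                       # stand-in for measured times or counts

# Model 1: y ~ a*n + b                (linear)
# Model 2: y ~ a*n*log2(n) + b*n + c  (linearithmic)
A_lin = np.column_stack([n, np.ones_like(n)])
A_loglin = np.column_stack([n * np.log2(n), n, np.ones_like(n)])

for name, A in [("linear", A_lin), ("linearithmic", A_loglin)]:
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coeffs
    ss_res = np.sum(resid ** 2)
    r2 = 1 - ss_res / np.sum((y - y.mean()) ** 2)
    print(name, "R^2:", r2, "residual sum of squares:", ss_res)

Both $R^2$ values come out close to 1, which is exactly the trap described above; the residuals, however, differ by many orders of magnitude in favour of the linearithmic model.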
Raphael
  • Good point on suggesting to measure counts, or some other property that depends only on your algorithm and not on the implementation. – Discrete lizard Apr 25 '19 at 06:54
  • @Discretelizard I think it's fair to count implementation specifics; my point was to remove the machine (metal, I/O, OS, competing tasks, ...) from the measurement. – Raphael Apr 25 '19 at 08:07
  • Well, if you want to compare with the analysis of your algorithm, I'd say you should try to eliminate the rest. Measuring performance of implementations is a reasonable thing to do in general, but not necessarily what would be best here. – Discrete lizard Apr 25 '19 at 08:12
  • @Discretelizard The OP will have to clarify what they want. Imho, if you want to compare (the performance of) your implementation to (the analysis of) the abstract algorithm -- to check against performance bugs, say, or validate the model you used for analysis -- then counting only those things that appear identically in both implementation and algorithm is rather meaningless. – Raphael Apr 25 '19 at 08:14
  • Yes, the OP should clarify that more. However, what I propose is not necessarily meaningless. There are many cases in which the theoretical analysis isn't (or cannot be) as tight or as precise as you would expect the algorithm to behave on 'usual instances' (and where modelling these instances is also out of the question). For example, I recently implemented a distributed protocol for which the best theoretical result was a bound on the number of rounds 'with high probability'. This number of rounds is not an implementation detail, and it is not a priori clear how such a bound behaves in practice. – Discrete lizard Apr 25 '19 at 08:28