
My situation

I am writing a paper presenting a software module I developed, and I want to compare its runtime to that of other modules for the same task. I am aware of the drawbacks of runtime experiments, but please assume as given that there is no way around them in my case. (I can and do deduce some properties theoretically, but that doesn’t suffice for everything.)

The specific scenarios I want to use for benchmarking have two parameters: the complexity $n$ of the problem and a random seed $r$ that determines the details of the problem instance. Mainly I want to show the dependence on $n$. Going by preliminary investigations and theory, the influence of $r$ on the runtime is minor or negligible. A single task takes at most ten minutes to complete.

Actual question

I am looking for some commonly accepted or published procedure on performing such experiments or at least a list of common pitfalls (ideally published).

What I found so far

Nothing. Internet searches turn up all sorts of unrelated results, but then I may not be using the right terminology. Including the keyword “minimum” (a technique I know to be good standard practice; see below) didn’t help either.

How I would do it

  • Run all experiments on the same machine with potentially interfering software such as a GUI disabled as far as possible.

  • Subject all modules to the same selection of scenarios, i.e., the same $n$ and $r$.

  • For each scenario, test the different modules directly after each other, in random order. In other words, the loop over the different modules is the innermost one. This should avoid biasing individual modules due to slow fluctuations of the machine’s performance (e.g., due to temperature changes). The random order should avoid bias through effects such as caching or one module always being tested after the same one.

  • For each $n$, take the minimum runtime over several scenarios with different seeds as the benchmark. This should avoid biasing individual modules due to short-term fluctuations of the machine’s performance that make individual runs exceptionally slow. (A sketch of the whole procedure follows after this list.)
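
A minimal sketch of this procedure in Python, with `modules` and `make_scenario` as hypothetical placeholders for the actual modules under test and the scenario generation:

    import random
    import time
    from collections import defaultdict

    def benchmark(modules, make_scenario, ns, seeds):
        # `modules` maps a module name to a callable solving one scenario;
        # `make_scenario(n, r)` builds the problem instance for size n and seed r.
        runtimes = defaultdict(list)  # (module name, n) -> list of runtimes
        for n in ns:
            for r in seeds:
                scenario = make_scenario(n, r)
                # Innermost loop over the modules, in a fresh random order
                # for every scenario.
                order = list(modules.items())
                random.shuffle(order)
                for name, solve in order:
                    start = time.perf_counter()
                    solve(scenario)
                    runtimes[(name, n)].append(time.perf_counter() - start)
        # Benchmark value per module and n: the minimum runtime over all seeds.
        return {key: min(times) for key, times in runtimes.items()}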

Wrzlprmft
  • It might help to explain your reasoning why you think "there is no way around it in my case". But of course, probably as a separate question and link there because this question is focused well enough as it is. – Apiwat Chantawibul Oct 30 '17 at 12:46
  • @Billiska: I am not exactly sure what you want me to do. Why should I explain my reasoning for an experimental approach in a separate question? I have no question regarding this. – Wrzlprmft Oct 30 '17 at 12:49
  • I have to disagree with you taking the minimum runtime of repeated experiments. You seem to think there might be outliers upwards only. Might it be possible to also have outliers downwards? It is more typical to examine multiple statistics at the same time, e.g., mean, median, max. Who knows, they may show something you didn't expect. It's an empirical experiment after all. – Apiwat Chantawibul Oct 30 '17 at 12:53
  • Well, about my first suggestion: I was just being skeptical when I heard there is no way around doing runtime experiments. Please ignore it if you are sure runtime experiments are the way to go. – Apiwat Chantawibul Oct 30 '17 at 13:01
  • @Billiska: Might it be possible to also have outliers downwards? – There are no considerably easier problems (more precisely, they are astronomically unlikely), and computers do not have upward outliers in performance, AFAIK. – Wrzlprmft Oct 30 '17 at 14:02
  • I've a feeling something very much like this has been asked before. However, I searched when the question was posted and I couldn't find anything, and the asker is an experienced SE user so I guess they searched, too. – David Richerby Nov 01 '17 at 12:36
  • This is very broad; books can be written about the topic, e.g., McGeoch's "A Guide to Experimental Algorithmics". One might even say you're asking, "Is there any standard for doing science?". So I'm not sure that this is reasonably scoped. Do you have more specific questions? – Raphael Nov 01 '17 at 21:50
  • FWIW, avoid time measurements wherever possible. Counting things such as certain operations is much more robust. – Raphael Nov 01 '17 at 21:52
  • books can be written about the topic – Well, this is a reference request. If there is a book about the topic, then this answers my question. Taking a quick glance at the table of contents of the book you mention, however, it devotes only about thirty pages to what I am looking for, so I would suspect that there is a review paper or something similar that is a better match. But still, that book is the best answer I have so far. (Sorry for the late reply. I somehow managed to miss your first comment.) – Wrzlprmft Nov 04 '17 at 21:08
  • Loosely related: https://cs.stackexchange.com/q/39597/755, https://cs.stackexchange.com/q/29854/755, https://cs.stackexchange.com/q/74178/755. – D.W. Nov 05 '17 at 17:47

2 Answers


In addition to the elapsed time for each run, report seconds of user and system CPU time, total IP packets, and total disk I/Os, if only to verify that some numbers are consistently "low" and have negligible impact on elapsed time.
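
A minimal sketch of collecting some of these numbers from Python, assuming the module under test can be invoked as an external command; getrusage covers CPU time and block I/O, while packet counts would need OS-specific counters:

    import resource
    import subprocess
    import time

    def run_and_measure(cmd):
        # Difference of child-process resource usage before and after the run;
        # the exact semantics of the block I/O counters vary between Linux and BSD.
        before = resource.getrusage(resource.RUSAGE_CHILDREN)
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        elapsed = time.perf_counter() - start
        after = resource.getrusage(resource.RUSAGE_CHILDREN)
        return {
            "elapsed_s": elapsed,
            "user_s": after.ru_utime - before.ru_utime,
            "system_s": after.ru_stime - before.ru_stime,
            "blocks_in": after.ru_inblock - before.ru_inblock,
            "blocks_out": after.ru_oublock - before.ru_oublock,
        }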

At https://wiki.freebsd.org/BenchmarkAdvice, PHK and others offer good advice, including:

Use ministat to see if your numbers are significant. Consider buying "Cartoon guide to statistics"
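
ministat is a small command-line tool shipped with FreeBSD; as a rough Python stand-in (assuming SciPy is available), one can summarize two samples of runtimes and apply a t-test in a similar spirit:

    from statistics import mean, median, stdev
    from scipy import stats

    def compare_timings(a, b, alpha=0.05):
        # Print summary statistics for two runtime samples and test whether
        # their means differ significantly (Welch's t-test).
        for label, sample in (("A", a), ("B", b)):
            print(f"{label}: n={len(sample)} min={min(sample):.4f} "
                  f"median={median(sample):.4f} mean={mean(sample):.4f} "
                  f"stdev={stdev(sample):.4f}")
        t, p = stats.ttest_ind(a, b, equal_var=False)
        verdict = "significant" if p < alpha else "not significant"
        print(f"difference in means: {verdict} at alpha={alpha} (p={p:.3g})")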

J_H

C.C. McGeoch's "A Guide to Experimental Algorithmics" is a good reference for

  • how to set up experiments on algorithms,
  • how to interpret and use results, and
  • how to iterate towards more meaningful results if necessary.
Raphael