
Context

In light of the perishability of external resources, as illustrated by https://xkcd.com/1909/, I was wondering whether it is desirable to ensure that a single PDF, including its research code, can be reproduced as a stand-alone artifact.

This question pertains to articles produced with LaTeX that use code as part of their research. Often, this code is hosted in an external repository, which is convenient but may some day stop being accessible.

Sometimes, the code used in the research is included in the appendices of the article PDF, which makes it available for as long as the PDF exists. However, a scientist who wants to reproduce the results and/or build on the presented work still has to manually copy-paste that code and restore it to a runnable form (fixing line endings, indentation, and so on). This could be made easier in the following way (among undoubtedly many others):

One could include the code and data in the PDF itself, and write a separate reproducibility appendix containing a small script that extracts those embedded files from the PDF and reruns them to reproduce the research (and the report).
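The embedding half of this idea is already possible with standard LaTeX tooling. As a minimal sketch (the filenames here are hypothetical, invented for illustration), the embedfile package from CTAN attaches arbitrary files to the compiled PDF as standard PDF file attachments:

```latex
% Hypothetical sketch: attach the simulation code, its data, and a
% reproduction script to the article PDF as embedded files.
\documentclass{article}
\usepackage{embedfile}  % from CTAN; the attachfile package is an alternative
\begin{document}

% ... article text, methods, results ...

% Embed the resources the reproduction script needs.
\embedfile[desc={Simulation source code}]{simulate_pancakes.py}
\embedfile[desc={Input data (4 kB)}]{pancake.dat}
\embedfile[desc={One-command reproduction script}]{reproduce.sh}

\end{document}
```

A reader could then recover the attachments with common tools (for example, `pdfdetach -saveall article.pdf` from Poppler, or the attachments panel of most PDF viewers) and run the extracted script directly, with no copy-pasting from an appendix.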

However, I have never found an article with that level of reproducibility, even though I would expect it to save the people reproducing the work of others quite a bit of time.

Example

Suppose someone publishes an article describing how a certain type of pancake can be baked. To illustrate the method, code simulating pancake baking is written, and a particular pancake-baking datafile (4 kB) is created to drive the simulation. The results of the simulation are plotted in the report, and the code is published at somerepositoryhost.com/somerepository.

Fast forward 10 years: the original author has died, the repository is no longer hosted, and the particular pancake-baking datafile is not available anymore. Yet person V wants to verify whether that particular pancake could indeed be baked according to the presented method. Person V should be able to rewrite the code, and with it the results presented in the paper, from the method section alone. However, that is considerably more work than running a single command that does it all (as long as person V possesses hardware that can still run the code).

By lowering the amount of work required to reproduce the results, one could make it easier to verify articles and to build upon them.

Question

Why is this level of reproducibility not standard, at least for LaTeX-compiled articles in fields that rely on neither large external input datasets nor custom computing platforms?

Note

This is not to suggest that this would be the best, or even a perfect, way to guarantee reproducibility of scientific articles that rely on code; it is merely one approach that could speed up reproduction, and ease building upon such work, when externally referenced code and/or data resources have decayed.

a.t.
  • Am confused, if you already had the PDF, why do you need it to be able to replicate itself? (wrt maintainability) – Azor Ahai -him- Jan 08 '21 at 16:43
  • You seem to be assuming that the pdf has been built from LaTeX. Even then you beg the question: reproducing the pdf depends on a LaTeX compiler that still works on the source. Moreover, there are other ways to produce pdfs - from Word, or knitr, or using some publisher's more or less proprietary software. (Right now, arXiv calls for TeX source.) I don't really see the value of the question - see @AzorAhai-him- 's comment. – Ethan Bolker Jan 08 '21 at 16:46
  • To Ethan's comment - I don't think I've seen a single paper prepared in LaTeX in my field. – Azor Ahai -him- Jan 08 '21 at 16:48
  • What kind of reproducibility are you looking for? Are you afraid that in 20 years no then-modern computer can read pdfs, and you just want to make it easier to reproduce the text in some other future format, or do you want to include the code and data in the document itself so that reproduction of the research (rather than the document) is easier? – Maarten Buis Jan 08 '21 at 17:00
  • Because if we make self-reproducing articles, they may actually start spontaneously self-reproducing and soon the world will be overrun with pdfs... – Jon Custer Jan 08 '21 at 17:03
  • Are you asking if there is a PDF decompiler? Something that will reproduce the source from which the PDF was made? – Buffy Jan 08 '21 at 17:09
  • @AzorAhai-him- Thank you; to quickly reproduce the results and/or enhance the work presented, it could be desirable to ensure the PDF can replicate itself (if, for example, the externally referenced repository is no longer accessible, and/or datasets have gone missing over time). I also included the explicit assumption of LaTeX-compiled reports that rely on code to generate their results. – a.t. Jan 08 '21 at 17:21
  • @a.t. I'm not sure I follow. – Azor Ahai -him- Jan 08 '21 at 17:25
  • @MaartenBuis I indeed meant to include the code and data into the document itself so reproduction of the research can be done automatically. – a.t. Jan 08 '21 at 17:55
  • I think the premise of the question is based on a misunderstanding... That is, there is no "absolute" or "universal" computer format. Even ASCII formerly competed with EBCDIC (or whatever)... Unicode may not last 1000 years. Etc. PDF has been a stable standard for longer than many. – paul garrett Jan 08 '21 at 23:00

1 Answer


Probably the best answer is with another XKCD:

https://xkcd.com/927/

PDFs are a document standard right now. Their ubiquity makes their content more likely to remain reliably accessible in the future, because there will be a lot of value in maintaining methods to access or convert them. This applies to data storage methods, too: reading data from something like a CD is still fairly easy today, although it is becoming more difficult as fewer new computers ship with CD/DVD readers. However, it is much harder to access data from the same era that came on a lesser-used medium such as the Zip drive: if you don't have the special hardware from that era, you're in trouble. You can create a new standard, but in that case it's just that: a new standard.

If you added your imagined code, who's going to maintain it? Who says this won't break in the future? Why is it more likely to continue working when methods to read the PDF itself (which are much more ubiquitous) are broken?

In any event, when we talk about research being difficult to reproduce, we're referring to the methodology being insufficiently specified or the results depending too much on the experimental conditions for an independent replication. It is not about the actual documents being inaccessible or unable to be copied. This is a separate problem which some people do experience depending on their field, especially when there are relevant reports from the pre-digital era. Many journals have made efforts to digitize their older publications and reduce this issue; for what remains, librarians are professionals who can help.

Bryan Krause
  • Oh is that why people call thumb drives "zip drives"?? – Azor Ahai -him- Jan 08 '21 at 17:32
  • @AzorAhai-him- Haha that's quite possible, I can't say I personally saw the transformation though. I remember encountering them early on and then later finding them in another context where they had fallen out of use and was amazed that these things had actually taken off someplace. – Bryan Krause Jan 08 '21 at 17:36
  • Zip drives? Those are new-fangled things. How about Bernoulli drives? (And, yes, I've thrown away my old piles of both Zip and Bernoulli disks.) – Jon Custer Jan 08 '21 at 20:37