I am part of a small team of data analysts trying to optimise how we keep track of the assumptions and design choices made over the lifetime of a complex, multi-month data science project. Ultimately we want to (i) catch conceptual errors when they are made and (ii) ensure that it is clear at all times which assumptions underlie the results being produced.
Our current process relies on code-generated and manually maintained Excel spreadsheets. This process (i) is time-consuming and produces too many files, (ii) introduces a new source of error (copying results into the spreadsheet), and (iii) is not dynamic, i.e. it cannot easily be adapted as the project evolves.
While there are a large number of questions/answers focused on version control, reproducible research, and project organisation more broadly, we are not looking for a new suite of tools but for a lightweight process that addresses this very specific problem.
The ideal process:
- Is easy to maintain & as automated as possible, to reduce errors
- Captures all design choices and assumptions made
- Reflects the impact of any change at a glance, i.e. shows the (changes in) 'key metrics' that resulted from it
- Links every change to the related code version(s) (see the sketch after this list)
- Can easily be shared with other analysts & Principal Investigators
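For concreteness, here is a minimal sketch of the kind of code-driven log we have in mind, assuming Python and a git repository; the function name, file name, and all values below are illustrative, not part of our current setup:

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical log file: one JSON object per line, easy to diff and share.
LOG_PATH = Path("decision_log.jsonl")

def current_git_commit() -> str:
    """Return the hash of the current HEAD commit, linking entries to code versions."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def log_decision(description: str, assumptions: list[str], key_metrics: dict) -> None:
    """Append one design choice/assumption entry, tied to the current code version."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "commit": current_git_commit(),
        "description": description,
        "assumptions": assumptions,
        "key_metrics": key_metrics,
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(entry) + "\n")

# Example usage (all values illustrative):
log_decision(
    description="Switched outlier filter from 3-sigma to IQR rule",
    assumptions=["Revenue figures are roughly log-normal", "2019 data excluded"],
    key_metrics={"rmse": 0.142, "r_squared": 0.87},
)
```

Something along these lines would remove the manual copying step and make the history of choices and their metric impact diffable, but we are unsure whether this is the right shape for the process or whether an established practice already covers it.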