I am part of a small team of data analysts trying to optimise how we keep track of the assumptions and design choices made over the lifetime of a complex, multi-month data science project. Ultimately we want to (i) catch conceptual errors when they are made and (ii) ensure that it is clear at all times which assumptions underlie the results being produced.
Our current process relies on code-generated and manually maintained Excel spreadsheets. This process (i) is time-consuming and produces too many files, (ii) introduces a new source of error (copying results into the spreadsheet), and (iii) is not dynamic, i.e. it cannot easily be adapted as the project evolves.
While there are a large number of questions/answers focused on version control, reproducible research, and project organisation more broadly, we are not looking for a new suite of tools but for a lightweight process that addresses this very specific problem.
The ideal process:
- Is easy to maintain & as automated as possible, to reduce errors
- Captures all design choices and assumptions made
- Reflects the impact of any change at a glance, i.e. shows the (changes in) 'key metrics' that resulted from it
- Links every change to the related code version(s) (see the sketch after this list)
- Can easily be shared with other analysts & Principal Investigators
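For concreteness, here is a minimal sketch of the kind of code-driven log we have in mind, assuming Python and a git repository; the function name, file name, and all values below are illustrative, not part of our current setup:

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical log file: one JSON object per line, easy to diff and share.
LOG_PATH = Path("decision_log.jsonl")

def current_git_commit() -> str:
    """Return the hash of the current HEAD commit, linking entries to code versions."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def log_decision(description: str, assumptions: list[str], key_metrics: dict) -> None:
    """Append one design choice/assumption entry, tied to the current code version."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "commit": current_git_commit(),
        "description": description,
        "assumptions": assumptions,
        "key_metrics": key_metrics,
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(entry) + "\n")

# Example usage (all values illustrative):
log_decision(
    description="Switched outlier filter from 3-sigma to IQR rule",
    assumptions=["Revenue figures are roughly log-normal", "2019 data excluded"],
    key_metrics={"rmse": 0.142, "r_squared": 0.87},
)
```

Something along these lines would remove the manual copying step and make the history of choices and their metric impact diffable, but we are unsure whether this is the right shape for the process or whether an established practice already covers it.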