I'm about to give my first conference presentation, in historical linguistics. (In case it matters: I'm an undergraduate presenting at an undergraduate conference in the UK.) My presentation and its results come from corpus analysis using Python and pandas. I want to share my code (in a public GitHub repo), but at the moment it is fairly rubbish: messy, disorganised, and full of half-working things, tangents I abandoned, etc.
My question is: how far should I re-organise my code before I share it?
Should I:
1. Write a `reproduce` script, which draws together the bits of the different files (e.g. `.py` files, Jupyter notebooks) that I actually ended up using, and spits out the results I discuss in my presentation. Not share the previous versions, attempts, unused scripts, etc.
2. Leave the repository as is (i.e. with the results for different questions coming out of different scripts in a somewhat disorganised way), but delete unused code and write a `README` which explains which script does what.
3. Leave the repository as is, including all the guff. Write a `README` which explains which bits do what and which bits are guff.
4. Do something else entirely.
My initial thought was option 1, but then I read this question and specifically this article which suggest that the right thing to do would be option 3 - including all previous versions etc.
It's worth noting that the code is not massively complex (it's a corpus linguistics study, not computational linguistics/NLP, and probably only runs to about 300 lines in total, most of which are applying regexes and looking at frequencies of particular patterns in the corpus), and I imagine it is not particularly technically impressive to future graduate applications committees or employers :)
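To make option 1 concrete, here is a rough sketch of the kind of `reproduce` script I have in mind. It is not my actual code: the sample texts, the patterns, and the names `SAMPLE_CORPUS`, `PATTERNS` and `count_patterns` are all made-up placeholders, and it uses only the standard library rather than pandas to keep it short.

```python
import re
from collections import Counter

# Hypothetical stand-in for the corpus; a real script would load the corpus files.
SAMPLE_CORPUS = {
    "text_a": "thou art come and thou hast seen",
    "text_b": "you are come and you have seen",
}

# Hypothetical patterns of interest, keyed by the label used in the output.
PATTERNS = {
    "thou": re.compile(r"\bthou\b", re.IGNORECASE),
    "you": re.compile(r"\byou\b", re.IGNORECASE),
}


def count_patterns(corpus, patterns):
    """Return {text_name: Counter({label: number_of_matches})}."""
    results = {}
    for name, text in corpus.items():
        results[name] = Counter(
            {label: len(rx.findall(text)) for label, rx in patterns.items()}
        )
    return results


def main():
    # Print one tab-separated row per (text, pattern) pair.
    for name, counts in count_patterns(SAMPLE_CORPUS, PATTERNS).items():
        for label in PATTERNS:
            print(f"{name}\t{label}\t{counts[label]}")


if __name__ == "__main__":
    main()
```

The idea would be that someone can clone the repo, run this one file, and get the same frequency tables I show in my slides, without needing to know which notebook each number originally came from.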