How much should I tidy up code before I share it?

Question

I'm about to give my first conference presentation, in historical linguistics. (In case it matters: I'm an undergraduate presenting at an undergraduate conference in the UK.) My results come from corpus analysis using Python and pandas. I want to share my code (in a public GitHub repo), but at the moment it is fairly rubbish: messy, disorganised, and full of half-working things, tangents I abandoned, etc.

My question is: how far should I re-organise my code before I share it?

Should I:

1. Write a reproduce script, which draws together the different bits of different files (e.g. .py files, Jupyter notebooks) that I ended up actually using, and spits out the results I discuss in my presentation. Not share the previous versions, attempts, unused scripts etc.
2. Leave the repository as is (i.e. with the results for different questions coming out of different scripts in a somewhat disorganised way), but delete unused code and write a README which explains which script does what.
3. Leave the repository as is, including all the guff. Write a README which explains which bits do what and which bits are guff.
4. Do something else entirely.

My initial thought was option 1, but then I read this question and specifically this article which suggest that the right thing to do would be option 3 - including all previous versions etc.

It's worth noting that it is not massively complex (it's a corpus linguistics study, not computational linguistics/NLP, and probably only runs to about 300 lines in total, most of which are applying regexes and looking at frequencies of particular patterns in the corpus), and I imagine not particularly technically impressive to future graduate application committees or employers :)

Related: Should I share my horrible software? A different question but some of the answers discuss some of the points in your question. — GoodDeeds, Mar 20 '23 at 12:44
The article you linked does NOT in any way suggest that you should not do #1. Why do you think it does? — David Ketcheson, Mar 20 '23 at 15:39
Good book on the subject of writing good code: Code Complete by Steve McConnell. https://www.amazon.com/Code-Complete-Practical-Handbook-Construction/dp/0735619670/ I wish I had this book when I first started coding. — Boba Fit, Mar 21 '23 at 14:23

score 16 · Accepted Answer · answered Mar 20 '23 at 20:59

I want you to imagine yourself a few years from now. You're writing up your PhD thesis and need to check that result you presented at that undergraduate conference back in 2023. You can't really remember what you did, so you want to reproduce the calculations. You also want to remake the figures with nicer formatting. You comb back through your GitHub, find the repository, and, with bated breath, open it.

What would you want to see at that moment?

Is it a) a repository full of random, messy files with a script that reproduces some of the old work; b) a repository with a few files and a README that explains what some of them do; c) a repository full of random, messy files and a README?

Hopefully your answer is: d) none of the above.

Messy, undocumented code is of no use to anyone, and since chances are you're going to be the only person using this code, you're shooting yourself in the foot by leaving it that way. You've got a big job ahead of you to clean it up. In future, you will know to do these things as you go along instead of at the end of a project.

My advice:

1. Sit down and structure the code with a pen and paper. Which scripts/classes/functions depend on each other? Which order are they executed in? Which parts do what?
2. Remove defunct/unused parts of the code.
3. Add docstrings to all classes and functions. These should explain very clearly what each class/function does, and list all the variables in the functions and their types.
4. Add comments to any line of code where it's unclear from the code itself what it's doing (if you have good variable and function naming conventions this shouldn't be too many).
5. Commit the changes to the repository little and often. Much better to clean the code incrementally than do it all in one go and accidentally delete a vital script.
6. Related to the above, in future make sure you know how to use git efficiently. It sounds like you could have done with making some branches for those tangents you abandoned, rather than leaving them on the main branch.
7. If you really want to work on it, see if you can synthesise your code into a proper package rather than a collection of scripts. Your idea for a "reproduce" script would be akin to a Run() class for the package.
8. Write a Jupyter Notebook which reproduces every result the code spits out. Take advantage of the markdown cells to write a few sentences/paragraphs about each one. Commit the executed notebook to the repository.
9. Share your nice, clear code with the world without shame.
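
To make the docstring and "reproduce" advice above concrete, here is a minimal sketch of what a cleaned-up entry point might look like. Everything in it is hypothetical: the file name reproduce.py, the function count_matches, the ETH_VERB pattern and the two-sentence stand-in corpus are invented for illustration and are not taken from the asker's project. It uses only the standard library rather than pandas so that it runs on its own.

```python
"""Hypothetical reproduce.py: regenerate a frequency table from one place."""
import re
from collections import Counter

# Illustrative pattern only (an assumption, not the asker's regex):
# words ending in "-eth", such as "goeth" or "maketh".
ETH_VERB = re.compile(r"\b\w+eth\b", flags=re.IGNORECASE)


def count_matches(texts, pattern):
    """Count how often each form matching `pattern` occurs in `texts`.

    Args:
        texts (list[str]): Corpus documents, one string per document.
        pattern (re.Pattern): Compiled regular expression to search for.

    Returns:
        collections.Counter: Lower-cased matched form mapped to its frequency.
    """
    counts = Counter()
    for text in texts:
        # findall returns the full matched strings because the pattern
        # contains no capturing groups.
        counts.update(match.lower() for match in pattern.findall(text))
    return counts


def main():
    """Run every analysis step in order and print the results."""
    # Stand-in corpus; a real script would load the corpus files here.
    corpus = ["He goeth forth and she maketh bread.", "Goeth he not?"]
    for form, freq in count_matches(corpus, ETH_VERB).most_common():
        print(f"{form}\t{freq}")


if __name__ == "__main__":
    main()
```

The point of the layout is that each analysis step lives in a small documented function, and a single main() calls them in order, so one command regenerates every number in the talk.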

With these steps, the original, messy version of the code will still be visible in the commit history. But the latest version will be clean and well-documented. This means that it should work out of the box three+ years down the line, but you will also be able to go back and look at those deleted tangents should you need to.

A good way to learn how to write nice, well-documented code is to look at the best piece of public code in your field and examine its structure, documentation and commit history. When you write code in future, follow that example. An even better way is to start modifying such public code, e.g. to add new features, and submitting pull requests so that other people will review your changes and, in the process, your code.

This is exactly the right approach (+1). Always write for your future self (who has forgotten everything you did). — Ben, Mar 22 '23 at 09:09

score 0 · Answer 2 · answered Mar 22 '23 at 08:32

You should do #1 if you can. Doing #2 is better than nothing.

A combination of #1 and #3 is something you should do for your own records, but you don't need to share it. Think of this as a computational lab notebook.

Ian Sudbery
