Where should I host software for individual papers?

Question

I'm a teen who's writing a paper on Poisson-Disc distribution over the summer for fun, and I wanted to distribute the software I used for the simulations to maintain scientific integrity. Obviously I don't have anything like a university server I can host it on, so my first instinct is to host it on my GitHub. I guess this would work, but there are several problems with it, such as its fallibility. Furthermore, I'm not really used to professionalism yet, and I'm not sure if it's appropriate to host professional, academic stuff on my personal GitHub. I do go to a more-or-less significant technological program at my school, and since school starts back up in a few weeks, I guess I could talk to my teachers about hosting it on the school servers, but there's only a 50-50 chance they'll let me do that. Do y'all have any better ideas?

Lots of professional people use GitHub. Why is it fallible (presuming you use their servers, not your own)? — Jon Custer, Jul 26 '22 at 16:25
I thought of GitHub as more of a place to hold fast changing, collaborative code, but it makes sense that lots of professional people would use it. Thanks so much for the help! — Leland Kilborn, Jul 26 '22 at 16:28
A personal GitHub account with your actual name is certainly considered professional. This may be a duplicate of How to share computer code? though. — Anyon, Jul 26 '22 at 16:30
For a few hundred dollars per year you can host your own domain. For this option, choose a professional name, not a frivolous one. You need both the domain and hosting. You can host several domains with the same host about as easily as one. — Buffy, Jul 26 '22 at 17:13
Especially if you are thinking of building a scientific portfolio, I would counsel against using school resources, because you will graduate (or change schools) someday, and it's a bit of a toss-up whether you can keep using school resources as an alumnus. A personal GitHub account under your real name is definitely better in terms of future-proofing. — Stephan Kolassa, Jul 27 '22 at 07:35
@LelandKilborn Hosting professional people's code is literally what GitHub is *designed* for. Unless you consider open-source developers "unprofessional", in which case I'd point you at Linux... ;) — Graham, Jul 27 '22 at 14:18
If by “fallibility” you meant technical problems making your files inaccessible… You could provide your software on more than one host. For example, use Bitbucket in addition to GitHub. If one service is having problems, your readers could use the other one. — Basil Bourque, Jul 28 '22 at 04:34

score 27 · Answer 1 · answered Jul 26 '22 at 18:53

Use GitHub (as mentioned in the comments), but with a twist.

You can archive GitHub repos on Zenodo (it has a built-in GitHub integration for this) or in practically any other scientific research repository (as a tarball, for example). That ensures permanence. Visibility is easier on GitHub, though.
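As a rough illustration of the tarball route, here is a minimal Python sketch that packs a project folder into a compressed archive and computes a SHA-256 checksum you could quote next to the download link. The function name `make_archive` and the example paths are hypothetical, and this is only one way to do it; it is not Zenodo's own tooling.

```python
import hashlib
import tarfile
from pathlib import Path


def make_archive(source_dir: str, archive_path: str) -> str:
    """Pack source_dir into a .tar.gz file and return the archive's SHA-256 digest.

    Assumption: archive_path lies outside source_dir, so the archive
    does not end up containing itself.
    """
    source = Path(source_dir).resolve()
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the project folder.
        tar.add(source, arcname=source.name)

    # Hash the finished archive in chunks so large files are not loaded at once.
    digest = hashlib.sha256()
    with open(archive_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling something like `make_archive("poisson_sim", "poisson_sim-v1.0.tar.gz")` would produce a file you can upload by hand, and the returned checksum lets a reader confirm they downloaded the same bytes.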

And, as mentioned, a reasonably official-looking personal GitHub account (i.e., one under your real name) is professional enough. It's basically the link from the paper to GitHub that matters, not the other way round.

Oleg Lobachev

Just create a read-only branch or a tag for the version used in your paper. That assures that the version should be readily available for future investigators as well as allowing ongoing development. – Jul 28 '22 at 11:53
Zenodo will also give you a DOI for the specific release of the code, which will be pretty easy to cite in your papers, and very easy to find in a few years, even if your GitHub repository goes missing. – TonioElGringo Jul 28 '22 at 13:15
+1. You can also just upload a tarball to Zenodo and bypass GitHub entirely – thegreatemu Jul 28 '22 at 19:25

score 8 · Answer 2 · answered Jul 27 '22 at 13:20

Using GitHub for this is fine and is generally not seen as unprofessional. There is a bit more to sharing code properly than just sticking it into a Git repo, though. GitHub has a "releases" feature that lets you capture a snapshot of your code base at a point in time. You can then say, "I tested this code with release v1.2.3 from this GitHub repository." That way your code will always be retrievable in the state that worked, despite any later modifications, updates, re-releases, etc.

The problem is capturing the metadata and environment data that you used to run the code. There are tools for this, but one of the "easiest" ways is to write a Dockerfile that lists all the library and environment dependencies the code needs, include your released code in it, build the image, and host that on Docker Hub. GitLab also has a Docker image registry that you can use for free. GitHub and GitLab also have all sorts of built-in CI/CD pipelines you can use to automate this, but that might be going a little far.
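If a full Docker image feels like too much, a lighter complement (not a substitute) is to record the environment in a small file that lives next to the code. The sketch below assumes the simulations are written in Python, which the question does not state; the function names and the output file name are made up for illustration.

```python
import json
import platform
from importlib import metadata


def capture_environment() -> dict:
    """Collect the interpreter version, OS description, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing or malformed metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def write_environment(path: str) -> None:
    """Write the captured environment to a JSON file at the given path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(capture_environment(), f, indent=2)
```

Running `write_environment("environment.json")` right before the final simulation runs, and committing the resulting file with the release, gives later readers at least a list of the versions that were in play.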

By combining releases and Docker, a future user will be able to pull the exact same code you used, run it with the exact same versions of all the libraries, and hopefully get the exact same results. You should also document the provenance of the input data, preprocessing steps, etc., in case the code doesn't generate everything from scratch automatically.

score 0 · Answer 3 · answered Jul 27 '22 at 14:59

I'm not sure what you mean by GitHub's "fallibility." Due to its size, reputation and the funding behind it (Microsoft, now) it's likely one of the most reliable places to host a Git repo.

But the key characteristic of Git repos is that they're content-addressed: every commit ID is a cryptographic hash of the commit's contents and its parent history, a hash chain similar in spirit to a blockchain. So the source of the repo (i.e., where it's hosted) is not really that important. Commit 0b9b56adcdf56ff013421c50d0721d29fa08f43a is that commit regardless of where you got it. So it's not such a big deal if a repo moves around or came from a dodgy source; if someone has a copy of a repo with that commit in it, it's almost certainly the code I'm talking about when I talk about that commit. (Note that none of this applies to information outside of the repo, such as GitHub wikis or issues.)
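To make the "the ID is determined by the content" point concrete, here is a small Python sketch that reproduces the ID Git assigns to a file (a "blob" object). It assumes the default SHA-1 object format; the function name `git_blob_id` is just for illustration. Commit IDs are computed in the same way, over a commit header plus the tree ID, parent IDs, author information, and message.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git gives a file with exactly this content."""
    # Git hashes a short header ("blob <size>" followed by a NUL byte) plus the raw bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()
```

For example, `git_blob_id(b"hello world\n")` gives `3b18e512dba79e4c8300dd08aeb37f8e728b8dad`, the same value `git hash-object` reports for a file with that content on any machine, which is why the hosting location does not affect the identity of the code.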

So put your code up on GitHub, GitLab, Bitbucket, or any other site that looks as if it's not going to vanish too soon, and when you reference it, give not just the hosting location but also the ID of the commit containing the code used in your paper. The commit ID ensures that your code is identified regardless of the source, and being on GitHub (or any other major provider) makes it likely that the repo will be easy to find and download. But even if it vanishes from GitHub for some reason (which is unlikely unless you deliberately delete it), if there's enough identifying information in the repo and in your paper (e.g., if the paper's author and title are mentioned in the repo), a web search may find it, and the commit ID will verify that the correct repo has been found.

I removed a long discussion about blockchain, Merkle trees, and data structures, which eventually devolved into some rude/condescending comments. Let us remember our code of conduct. — cag51, Jul 28 '22 at 05:10
