
In an upcoming publication, I need to link to the data I used for the publication so that others can see/use the data as well — both for reviewing the given work and for intended future use. However, my institution has not offered any hosting solutions, and I have not (yet) found any acceptable external solutions which absolve me of financial and legal responsibility for maintaining the data and the hosting infrastructure. I will not be at the given institution for very much longer, so e.g. putting it on my personal site at the institution is not a solution. The primary data in question is about 12GB in size, so it needs to be a proper "repository" for the data rather than just e.g. an attachment distributed with the publication itself.

Nevertheless, I need to at least have a stable link to some place where the data can be located; the stability of the actual location is not as important as the stability of the link itself. How/where can I procure a permanent URL to link to research data in a publication which does not cost me anything as an individual?

errantlinguist
  • How much data? At least in chemistry it is almost always possible to include a supporting-information file hosted by the publisher (if it's not too much), or there are special archives for certain types of data you can use (an example would be the CCDC) –  Dec 23 '17 at 13:49
  • About 12GB, so I doubt that that would be a possible avenue... – errantlinguist Dec 23 '17 at 13:53
  • I recently discovered that my institute offers the use of a file hosting service that is maintained by a network of institutions for exactly such purposes. I had never heard about it. I suggest you talk to the helpful people at your institute's library. –  Dec 23 '17 at 18:09
  • Is it possible to register for a DOI in cases such as this? – Jim Belk Dec 23 '17 at 19:24
  • It's probably not the best idea because it requires other people to decide to host your data, but if you expect that to be the case, or if you have a computer constantly running anyway, IPFS might be a solution – lucidbrot Dec 23 '17 at 22:00
  • In addition to server-based hosting you might consider making a torrent available. Torrent applications can make use of online mirrors in addition to p2p transfer. – Wes Toleman Dec 24 '17 at 03:17
  • @WesToleman Reading your comment, I was reminded of Archive BitTorrents, because SE uses it. – Andrew T. Dec 24 '17 at 11:37
  • Regardless of how you choose to host the data, I recommend that the publication itself contain a cryptographic hash of the data file (I am assuming it will be distributed as an archive file such that a single hash will cover it all). That will allow anybody who wants to inspect the data to verify that they have the correct data, and it can also help a bit in tracking down the data should the original download link stop working. – kasperd Dec 24 '17 at 12:30
  • This question is now protected, so I can't answer, but you could have a look at http://academictorrents.com/ – Droplet Dec 24 '17 at 17:32
  • @errantlinguist https://goo.gl/ It is from Google. It has been around since 2009, hence I find it trustworthy. The good thing about this is that once you put the link in a hard place (a research paper or resume), you can be sure whoever accesses it will reach where you want them to. This is because you can change the content linked to your goo.gl link. So, suppose you host the 12 GB file in OneDrive for 5 years because you have free space; you can link that to your permanent goo.gl link. Later on, you get free space in GDrive, you move the 12 GB data there, and update your goo.gl link. :) – Rahul Dec 25 '17 at 12:22
  • There is an important distinction between a Uniform Resource Locator and a Uniform Resource Name; the latter might be a better fit for your requirements. The most common URN in this setting is probably a DOI. – chrylis -cautiouslyoptimistic- Dec 26 '17 at 07:41
  • I can't post a proper reply, because the thread is locked, but I think Mendeley Data (https://data.mendeley.com/) provides a free service for exactly that purpose. You will have to find out if they can host that much data and if you are in agreement with their sharing model. – George ZP Dec 26 '17 at 19:26
  • How about Google Drive? – Sarthak Mittal Dec 27 '17 at 08:28
  • Perhaps the question should state whether or not the data should also be immutable. Having a permanent link (i.e. one that always resolves) doesn't mean it always resolves to the same data. – jiggunjer Dec 28 '17 at 03:48

9 Answers


Maybe Zenodo or another "academic data repository"; googling that term will give you a list. Zenodo has some advantages:

  1. It gives you a DOI (Digital Object Identifier), a unique link and an academic standard for citations.
  2. You don't need acceptance to publish your data.
  3. It comes from an official EU project, OpenAIRE, which supports open access for EU-funded research.
  4. It is hosted by CERN.
  5. It runs free software in the entire stack.
Cochise

If you also have some code associated with this data that you might like to share, another option might be GitHub. You wouldn't host the 12GB dataset in a GitHub repository itself; instead you would host your code, and create a readme.md file (GitHub will do this virtually automatically for you) where you write out instructions or other narrative. This is where you would include a link to wherever you've chosen to host the data. You can then update this link any time you want or need (for example, if you change institutions).
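For concreteness, here is a minimal sketch of what such a readme.md might contain; every URL and name below is a placeholder for illustration, not a real location:

```
# Data and code for "My Publication Title" (2017)

## Data
The full dataset (~12 GB) is hosted externally at:
https://example-host.org/my-dataset.tar.gz

If that link ever dies, check back here; this file will be updated to point
at the data's current home.

## Code
The scripts in this repository reproduce the analyses in the paper.
```

If the data moves, only the link line needs editing; the URL of the repository itself stays stable.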

This has a number of advantages over simply finding a static place to stick the data and sharing that link:

  1. GitHub is almost a decade old and has over 20 million users, so it's not going anywhere
  2. Public repositories are free
  3. Including any code you want to share in the same place is very convenient
  4. The readme.md lets you write out whatever message you would like a future user to encounter, such as guidance not included in the original paper, errata, etc.
  5. Everything is updatable by you at any time, but still maintains the static link
  6. Using version control on your code is a fantastic habit to form
  7. GitHub makes it very easy to include copyright and licensing info
  8. You can use GitHub to build an entire website if you want to go that route (GitHub Pages), which can include what you've shared
Jeff
  • This doesn't answer the question at all. Your answer explains where to host code (OP didn't ask for that) and doesn't resolve the question about where to host the 12GB of actual data. –  Dec 23 '17 at 18:15
  • @DSVA It most certainly answers the question. He didn't ask where you host data, he asked how to get a static link to data. GitHub does that and more, which is what I wrote. – Jeff Dec 23 '17 at 18:17
  • @DSVA this is not a "bad" (downvote-worthy) answer even if it's not amazingly "good": I have seen people doing something similar, where they e.g. create a bare-bones GitHub repo with a few example files and a note saying "This data was used for Stark et al. 2017. 'Sustainable mining of dragonglass in coastal regions'. Westerosi Geology, pp. 12--44. Contact [email protected] for the entire dataset." The only bad part about this is that I can't have this data under a personal GitHub account and don't want to create an orphan account which no one at the department uses. – errantlinguist Dec 23 '17 at 18:24
  • @DSVA Unfortunately, this actually answers the question. Even the question states that "The stability of the actual location is not as important as the stability of the link itself." So GitHub, Bitbucket or whatever service you choose can host a permanent link to the possibly changing location of your data. And this is a good idea as well, since you can move your data around instead of committing to a single storage location for eternity. The answer could be made more version-control-repository-centric rather than GitHub-specific, though. – ifyalciner Dec 23 '17 at 18:59
  • +1 this is a great answer. GitHub will provide a link as close to permanent as you are likely to find anywhere on the Internet. This way you can put the 12GB of data anywhere you like, even on multiple free hosts, as many as you can find within reason, and provide a list of links in the GitHub readme. If a few links die (a site like Pingdom will monitor them for you, for free) you can always top the list up by uploading the data to some new hosts. – Darren H Dec 23 '17 at 19:30
  • The Github approach doesn't quite satisfy the requirements of the OP either; they will need to ensure that whatever hosting the readme.md points to stays valid, which is a possibly manual process, and also it's possible they may change their username at some point for whatever reason. However, if the latter is acceptable as a tradeoff, then using a bit.ly link that can be modified later would be a suitable frontend (and could be redirected anywhere later on). – fluffy Dec 24 '17 at 07:21
  • Tell us about GitHub in 50 years, then we'll talk. – einpoklum Dec 24 '17 at 17:25
  • I actually use GitHub to host a link to my data. The data is actually on a home server with a dynamic IP address. I have a cron job to check the IP address and update the GitHub page every time the IP address changes. It works great. (A rough sketch of such a job appears after these comments.) – Guangliang Dec 24 '17 at 23:09
  • The answer is not that bad, but it only answers half of what the OP asked. He is not looking only for a static link, but also for some place to put the data which "absolve[s] me of financial and legal responsibility for maintaining the data and the hosting infrastructure". There are lots of valid and complete answers here; this one is valid, but not complete. – Cochise Dec 25 '17 at 23:06
  • Why should GitHub be more permanent than all the other repositories which closed in the last years (BerliOS, Freshmeat, Freecode, gna!, gitorious, codehaus, code.google, Fedorahosted.org)? I update the list from time to time on: https://wiki.gentoo.org/wiki/Upstream_repository_shutdowns – Jonas Stein Dec 26 '17 at 01:19
  • @Cochise What you're describing as missing is not part of his question. – Jeff Dec 26 '17 at 06:59
  • @JonasStein Because GitHub is very big and those were all very small? I just looked at a few, but BerliOS reported 50k users versus GitHub's 20m, and Gitorious reported 11% of the Git market share versus GitHub's 87%. Nothing on the internet is 100% permanent. That doesn't change anything about GitHub being a very safe option. – Jeff Dec 26 '17 at 14:41
  • The only reason GitHub is being recommended is that "it is unlikely to go away anytime soon". As far as I can tell there's little benefit of GitHub over, say, a wordpress.com site. – icc97 Dec 26 '17 at 16:22
  • I would also add that you can generate a PURL for the GitHub repo (or for anything, for that matter). purl.org – thariri Dec 26 '17 at 17:18
  • Big files can be handled using https://git-lfs.github.com/ – md2perpe Dec 27 '17 at 10:06
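Following up on @Guangliang's comment above, here is a rough sketch of the kind of script a cron entry could run. The IP-echo service, paths, and repository layout are all assumptions for illustration, not a description of his actual setup:

```
import subprocess
import urllib.request

# Assumed layout: a local clone of the GitHub repo, containing one file
# that the repository's readme points readers at.
REPO = "/home/me/dataset-link-repo"
LINK_FILE = "DATA_LINK.md"

# Ask a public echo service (assumed here: ipify) for the current external IP.
current_ip = urllib.request.urlopen("https://api.ipify.org").read().decode().strip()

link_path = f"{REPO}/{LINK_FILE}"
try:
    with open(link_path) as f:
        old_contents = f.read()
except FileNotFoundError:
    old_contents = ""

new_contents = f"Download the dataset from: http://{current_ip}/my-dataset.tar.gz\n"

# Only commit and push when the address has actually changed.
if new_contents != old_contents:
    with open(link_path, "w") as f:
        f.write(new_contents)
    subprocess.run(["git", "-C", REPO, "add", LINK_FILE], check=True)
    subprocess.run(["git", "-C", REPO, "commit", "-m", "Update data link"], check=True)
    subprocess.run(["git", "-C", REPO, "push"], check=True)
```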

There are services that provide enough space for 12 GB of data. For example, Figshare provides 20 GB of free private storage (with a 5 GB file-size limit) and apparently unlimited public space. They state they can support larger files, but not through user upload.

When you publish the data you can assign a DOI to the data set (this can actually be done much earlier in the process, as a reserved number). Many journals use Figshare (and likely other services) for their "supporting information" as well. I do not know if adding such information is associated with costs.

I am only familiar with (not associated with) Figshare and do not know the limitations of other similar services, so see this as an example. Also look into the possibility of adding the data as supporting information to your article.

Peter Jansson
  • I see that figshare has been around for 6-7 years, which is a relatively long time on the Internet and a good sign since longevity is key here. – Luke Sawczak Dec 23 '17 at 17:50
  • I agree that the "right way" to link to a dataset is to assign a DOI to it. To that end, zenodo.org is a free service that accepts up to 50GB per dataset. – LCT Dec 23 '17 at 22:37
  • @LCT There's something to that but I think people tend to overemphasize the importance of a DOI. Don't get me wrong, it does have clear advantages, in the sense of being intended to be permanent, being a standard that academics are familiar with, and being compatible (in some sense) with existing citation formats, but let's not get carried away thinking that, say, anything without a DOI is necessarily inferior. – David Z Dec 24 '17 at 00:21
  • @DavidZ An archive without DOIs isn't necessarily inferior. An archive without a reasonably well-established method for permanent document identification is, though. – E.P. Dec 24 '17 at 10:38
  • @E.P. Right, I'm just calling out the implication (intended or not) that a DOI is the "right way" to permanently identify a resource and any other type of permanent identifier is the "wrong way". – David Z Dec 24 '17 at 10:47
  • @E.P. A cryptographic hash is a more reliable and more widely used way to identify the data than a DOI. The hash doesn't give you any URL for locating the data, but it does give you a way to verify that you got the right data once you have located it somehow. – kasperd Dec 25 '17 at 23:00

If your data is a collection of books, audio, or video files, you may host them on the Internet Archive's website, https://archive.org (upload page: https://archive.org/create/).

The Internet Archive is a San Francisco–based nonprofit digital library with the stated mission of "universal access to all knowledge." It provides free public access to collections of digitized materials, including websites, software applications/games, music, movies/videos, moving images, and nearly three million public-domain books. As of October 2016, its collection topped 15 petabytes. In addition to its archiving function, the Archive is an activist organization, advocating for a free and open Internet. [...] Founded by Brewster Kahle in May 1996.

It's free to upload and download.


Franck Dernoncourt

You could use a service that provides PURLs (persistent URLs).

Such a URL redirects to a target URL of your choice, and you can update the target URL in case you need to move to a new hosting location.
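As a quick illustration of the mechanics, here is a small Python sketch that resolves a PURL and reports where it currently points; the PURL below is made up for the example:

```
import urllib.request

# Hypothetical PURL, for illustration only; a real one would be registered
# at purl.org or w3id.org and pointed at wherever the data currently lives.
purl = "https://purl.org/example/my-dataset"

# The PURL service answers with an HTTP redirect to the current location of
# the data. urlopen follows redirects, so geturl() reports the final target.
with urllib.request.urlopen(purl) as response:
    print(purl, "currently resolves to", response.geturl())
```

The PURL you cite never changes; only the redirect target registered with the service does.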

Examples

  • The best known service is https://archive.org/services/purl/.

    Since 2016, the service is provided by the Internet Archive (blog post). From 1995 to 2016, it was provided by the OCLC.

    Lorcan Dempsey of OCLC welcomed the announcement as “a major step in the future sustainability and independence of this key part of the Web and linked data architectures. OCLC is proud to have introduced persistent URLs and purl.org in the early days of the Web and we have continued to host and support it for the last twenty years. We welcome the move of purl.org to the Internet Archive which will help them continue to archive and preserve the World’s knowledge as it evolves.”

    It uses several domain names, including purl.org, purl.net, and purl.com.

    You need an account on https://archive.org/ to create and manage your PURLs.

  • Another, younger service is https://w3id.org/, provided by a group of organizations that follow a social contract:

    There are a growing group of organizations that have pledged responsibility to ensure the operation of this website. These organizations are: […]. They are responsible for all administrative tasks associated with operating the service. The social contract between these organizations gives each of them full access to all information required to maintain and operate the website. The agreement is setup such that a number of these companies could fail, lose interest, or become unavailable for long periods of time without negatively affecting the operation of the site.

    They claim:

    All identifiers associated with this website are intended to be around for as long as the Web is around. This means decades, if not centuries.

    It uses the domain name w3id.org.

    To create and manage your PURLs, you need to submit a pull request on GitHub or send an email to their mailing list (a sketch of what such a registration might contain follows this list).

  • Some more.
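For a rough idea of what such a pull request contains: to the best of my knowledge, w3id.org registrations are small Apache configuration snippets, roughly along these lines (the directory and target URL are invented for illustration):

```
# Hypothetical .htaccess in a "my-dataset/" directory of the w3id.org repository
RewriteEngine on
RewriteRule ^(.*)$ https://current-host.example.org/my-dataset/$1 [R=302,L]
```

Moving the data then only requires a pull request that changes the target URL.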

Risk assessment

For the objective of getting a permanent HTTP URL (with the ability to change the redirect target) without having to pay anything, a PURL service would be the best choice:

  • Providing permanent HTTP URLs is the primary goal of these services, and their only reason for existence. Their whole focus will be on keeping these URLs working.

  • Providing such a service is not complex, and not hard on the servers, so there is a good chance that it can be kept online in the future, even with a very limited budget.

Other web services might also care about permanent URLs, but they have to care about much more stuff in addition, so their priorities are different, and they might have to discontinue their service because of commercial reasons.
As an example, take Google and look at how many services they discontinued (among them also services that provided URLs for their users’ content). And if there are businesses that could afford (and want) to keep URLs from unprofitable services alive, Google would certainly be among them, right?

unor
  • This is a useful suggestion, but note that OP is looking also for somewhere to host the data. A reasonably permanent URL for accessing the data is nice, but you'll still need some place to point it, which the OP seems to want to be free of charge. – user Dec 26 '17 at 21:00
  • @MichaelKjörling: Yeah, this post answers only the question in the title and the bold part. As the OP says "The stability of the actual location is not as important as the stability of the link itself", I don't think it makes sense to recommend a host here, as any gratis hosting service would do the job, given that the permalink can be updated. – unor Dec 26 '17 at 21:08
  • This makes more sense than a GitHub repo. – icc97 Dec 27 '17 at 07:00
  • I think this neatly combines with @FranckDernoncourt's answer – icc97 Dec 27 '17 at 07:04
  • isn't this what doi.org does too? With the added benefit that url = site + doi ? – jiggunjer Dec 28 '17 at 04:39
  • @jiggunjer: I’m not familiar with DOIs, but according to this answer, the services that allow gratis registrations require that the data gets uploaded to their own servers (which I would not recommend); and I think there are some gratis registrants only for specific scientific domains. -- If there is a registrant that offers gratis DOIs that can point to any URL, I guess this could be a good alternative to PURLs. A benefit of PURLs is that they can be semantic (you can choose to use meaningful words instead of only numbers). – unor Dec 28 '17 at 10:31

One recently launched service that addresses your problem is the Wolfram Data Repository:

The Wolfram Data Repository is a public resource that hosts an expanding collection of computable datasets, curated and structured to be suitable for immediate use in computation, visualization, analysis and more.

In the launch announcement, Stephen Wolfram writes:

With the Wolfram Data Repository (and Wolfram Notebooks) there’s finally a great way to do true data-backed publishing—and to ensure that data can be made available in an immediately useful and computable way.

In another part of the post, he writes:

Each entry in the Wolfram Data Repository has an associated webpage, which describes the data it contains [...] every entry also has a unique readable registered name, that’s used both for the URL of its webpage, and for the specification of the ResourceObject that represents the entry.

Regarding the size of the data sets, he writes:

There’s no limit in principle on the size of the data that can be stored in the Wolfram Data Repository. But for now, the “plumbing” is optimized for data that’s at most about a few gigabytes in size—and indeed the existing examples in the Wolfram Data Repository make it clear that an awful lot of useful data never even gets bigger than a few megabytes in size.

The announcement is very long and has much more about the rationale and vision behind this service and details of how it works. I couldn't find information about pricing -- presumably it's free for now -- or what promises Wolfram is making regarding the permanence of the data storage (except for the vague sentence "The Wolfram Data Repository, though, is intended to be something much more permanent"). But the service is fairly new so I expect those things will be clarified eventually. Wolfram Research is a serious company with high credibility in the scientific community and has been around since 1987, so this looks like an intriguing option for your data storage problem.

Dan Romik
  • Nice addition; I didn't know about it. But they expect the data to be in the user's account before submission, and an account that supports this size of data appears to be $103/month. http://www.wolfram.com/development-platform/pricing/ The vendor lock-in is another point to consider, but out of the scope of the OP's question. – Cochise Dec 25 '17 at 17:35

DataPort is an initiative from the IEEE. You can host up to 2 TB and you will receive a DOI.

Mychele

I need to link to data used for a publication...I need at least a stable link to some place where the data can be located

Provide a link to your personal site and redirect from there.
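For example, if the personal site runs on Apache, a single rule in an .htaccess file gives you a stable path to cite; the paths and target below are placeholders you would update whenever the data moves:

```
# .htaccess on your personal site; /my-dataset is the stable path you cite.
Redirect 302 /my-dataset https://current-host.example.org/my-dataset.tar.gz
```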


E.P. raised the issue

Google Drive data is mutable - it could be altered by the owner at any point (and, conversely, viewers do not have any guarantee that the data they see five years after publication, if it is still there, has not been altered in the meantime). This makes it completely unsuitable for this purpose.

This issue is orthogonal to the OP's question, but nonetheless interesting. It can be solved by taking a cryptographic hash of the data and including that hash in the publication.
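A minimal sketch of computing such a hash, assuming the data is distributed as a single archive (the file name is a placeholder); the printed digest is what you would include in the publication, and readers recompute it over their downloaded copy:

```
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so a 12 GB archive fits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_of("my-dataset.tar.gz"))  # placeholder file name
```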

user2768

Nothing lasts forever, but free file-hosting services exist, even without restrictions on size. Nothing in the world is really free, so these services will impose some other kind of restriction: advertisements, noticeable downtimes, low bandwidth, discomfort when uploading or downloading, really ugly and long (but stable!) URLs, etc. These services might also ask you for all your private data and sell it later, or send you lots of targeted spam. Choose a service that causes you as little discomfort as possible. That would be my solution.

How to find such a service would be a different question. I usually first find a site comparing dozens of free hosting services and then take it from there.

Leon Meier