61

I am a PhD student in Geophysics and work with large amounts of image data (hundreds of GB, tens of thousands of files). I know svn and git fairly well and have come to value a project history, combined with the ability to easily work together and to have protection against disk corruption. I also find git extremely helpful for having consistent backups, but I know that git cannot handle large amounts of binary data efficiently.

In my master's studies I worked on data sets of similar size (also images) and had a lot of problems keeping track of different versions on different servers/devices. Diffing 100 GB over the network really isn't fun, and it cost me a lot of time and effort.

I know that others in science seem to have similar problems, yet I couldn't find a good solution.

I want to use the storage facilities of my institute, so I need something that can use a "dumb" server. I would also like to have an additional backup on a portable hard disk, because I want to avoid transferring hundreds of GB over the network wherever possible. So I need a tool that can handle more than one remote location.

Lastly, I really need something that other researchers can use, so it does not need to be super simple, but it should be learnable in a few hours.

I have evaluated a lot of different solutions, but none seem to fit the bill:

  • svn is somewhat inefficient and needs a smart server
  • hg bigfile/largefile can only use one remote
  • git bigfile/media can also use only one remote, but is also not very efficient
  • attic doesn't seem to have a log, or diffing capabilities
  • bup looks really good, but needs a "smart" server to work

I've tried git-annex, which does everything I need it to do (and much more), but it is very difficult to use and not well documented. I've used it for several days and couldn't get my head around it, so I doubt any other coworker would be interested.

How do researchers deal with large datasets, and what are other research groups using?

To be clear, I am primarily interested in how other researchers deal with this situation, not just this specific dataset. It seems to me that almost everyone should have this problem, yet I don't know anyone who has solved it. Should I just keep a backup of the original data and forget all this version control stuff? Is that what everyone else is doing?

Aleksandr Blekh
  • 6,518
  • 4
  • 28
  • 54
Johann
  • 721
  • 1
  • 5
  • 5
  • If you tell me three things I might have an answer! 1. Does your medium-sized data get bigger? If so, how often? 2. Do you use programming languages/frameworks to do pattern matching and data analysis? 3. Does anyone else use this data? If so, do they change it? –  Feb 13 '15 at 10:40
  • I voted to "Leave Open" when reviewing Close Votes queue because I don't think this question is for specific situation. However, can anybody who is familiar with Software Recommendations SE tell us this question would fit into that site? –  Feb 13 '15 at 10:52
  • 2
    @scaaahu I don't think this is necessarily a software question; an acceptable answer could also describe a workflow or combination of tools and systems. (Anyways, being on topic somewhere else shouldn't play into the decision to close a question here.) –  Feb 13 '15 at 11:01
  • 2
    Just to protect against data corruption with image data, I periodically run a script that re-computes a checksum file with all files and their md5 checksums. The checksum file is then kept in git. Now I can immediately see with git diff if any of the checksums have changed. And I can also see which files have been removed & added. And if there are e.g. any signs of data corruption, then I can use the regular backups to restore old versions. Not perfect but better than nothing. (A minimal sketch of this approach follows these comments.) –  Feb 13 '15 at 11:30
  • @JukkaSuomela I think you should post that as an answer, not a comment. –  Feb 13 '15 at 11:48
  • @Johann Why not just files, without version control (but with backups)? – Piotr Migdal Feb 13 '15 at 12:22
  • @PiotrMigdal: Are you seriously asking why people should use version control, instead of just having a bunch of files with backups?-) –  Feb 13 '15 at 13:01
  • 1
    @JukkaSuomela I think it's a reasonable question when you've got very large datasets, if those datasets change frequently... in those cases, backup often is what's used as version control. –  Feb 13 '15 at 13:29
  • 1
    I'm voting to close this question as off-topic because it deals with data/databases rather than something specific to academia. The question is great, and (IMHO) should be moved to DataScience.SE or (perhaps) Databases.SE. – Piotr Migdal Feb 13 '15 at 14:02
  • @PiotrMigdal I don't know if this is off-topic, but this question also does not fit well with Databases.SE or DataScience.SE. I would like to know what other researchers/institutes do in practice to deal with this kind of problem - I've updated the question accordingly. –  Feb 15 '15 at 19:22
  • @DaveRose 1. Yes, I will hopefully add more experimental data and processed images, but not very often (maybe a few iterations); 2. Yes, that is part of my thesis; 3. Yes, others will hopefully use and change the data. –  Feb 15 '15 at 19:32
  • @JukkaSuomela That actually sounds pretty good (at least much better than anything I've found so far). –  Feb 15 '15 at 19:33
  • @PiotrMigdal Going without version control kind of pains me for the reasons I've stated in the question (especially: "did my data change without me noticing?" and then 2 days of diffing by hand). So I am looking for something smarter. –  Feb 15 '15 at 19:35
  • @Johann 1. How you store or version control it is in the domain of data science (regardless of whether you use it in academia, industry or for a hobby project). I really want to ensure the best answers, and it is good to go where there are many experts in data. 2. My point was only that it might not be a job for git. (And, all in all, git is a filesystem.) Do you want to diff per file, per line, or what? – Piotr Migdal Feb 15 '15 at 19:53
  • @PiotrMigdal 1) You are probably right, though I am still curious how scientists without a background in data science handle that situation; 2) No doubt, git cannot handle that kind of data. A diff per file would be enough, just to see if my data has changed (or I changed it inadvertently). The point is to have control over, and documentation of, how my data changed, through what action, and by whom. –  Feb 16 '15 at 05:16
  • 1
    @Johann Data scientists have different backgrounds. Mine is in quantum mechanics, for example. The whole point here is that: 1. StackExchange discourages so-called boat questions and 2. it's better to get best practices rather than how it is solved by people who had to solve it but had no idea. – Piotr Migdal Feb 18 '15 at 12:33
  • @Johann Are you sure you need to version control your dataset? If you are processing a big set of images, you usually want to version control the procedures you followed that led to a modified set of images. Therefore, you usually don't need to keep track of the images themselves. If you want to restore this modified set in the future, you only need to take the original dataset and apply the procedures you did according to some commit in your code. – r_31415 Feb 21 '15 at 00:50
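The checksum-manifest idea described in the comments above can be sketched in a few lines. This is a minimal, hypothetical script (the data directory and manifest name are made up, and it is not the commenter's actual script): it rewrites an MD5 manifest that you would commit to git, so that git diff shows changed, added, or removed files.

    import hashlib
    import os

    def md5sum(path, chunk=1 << 20):
        # Stream the file through MD5 so large images never need to fit in memory.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def write_manifest(data_dir, manifest="checksums.md5"):
        # Recompute checksums for every file under data_dir and rewrite the manifest.
        # Committing the manifest (not the data) to git makes any change visible via git diff.
        with open(manifest, "w") as out:
            for dirpath, _, filenames in os.walk(data_dir):
                for name in sorted(filenames):
                    path = os.path.join(dirpath, name)
                    out.write(f"{md5sum(path)}  {os.path.relpath(path, data_dir)}\n")

    if __name__ == "__main__":
        write_manifest("data/")  # hypothetical data directory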

10 Answers

13

What I am ending up using is a sort of hybrid solution:

  • backup of the raw data
  • git of the workflow
  • manual snapshots of workflow + processed data that are of relevance, e.g.:
    • standard preprocessing
    • really time-consuming
    • for publication

I believe it is seldom sensible to have a full revision history of a large amount of binary data, because the time required to review the changes will eventually be so overwhelming that it will not pay off in the long run. Perhaps a semi-automatic snapshot procedure (possibly saving some disk space by not replicating the unchanged data across different snapshots) would be of help.
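A minimal sketch of what such a semi-automatic snapshot could look like (my illustration, not the procedure described above; it assumes a POSIX filesystem that supports hard links, and the snapshot layout is made up): unchanged files are hard-linked to the previous snapshot instead of copied, so they take no extra disk space.

    import hashlib
    import os
    import shutil
    import time

    def file_md5(path, chunk=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def snapshot(src, snap_root, prev=None):
        # Copy `src` into a new timestamped directory under `snap_root`.
        # Files identical to the previous snapshot are hard-linked, not copied.
        dest = os.path.join(snap_root, time.strftime("%Y%m%d-%H%M%S"))
        for dirpath, _, filenames in os.walk(src):
            rel = os.path.relpath(dirpath, src)
            dest_dir = os.path.normpath(os.path.join(dest, rel))
            os.makedirs(dest_dir, exist_ok=True)
            for name in filenames:
                src_file = os.path.join(dirpath, name)
                dst_file = os.path.join(dest_dir, name)
                prev_file = os.path.normpath(os.path.join(prev, rel, name)) if prev else None
                if (prev_file and os.path.exists(prev_file)
                        and os.path.getsize(prev_file) == os.path.getsize(src_file)
                        and file_md5(prev_file) == file_md5(src_file)):
                    os.link(prev_file, dst_file)      # unchanged: hard link, no extra space
                else:
                    shutil.copy2(src_file, dst_file)  # new or changed: real copy
        return dest

Each call returns the path of the new snapshot, which you would pass as prev on the next run.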

norok2
  • 256
  • 4
  • 8
  • Well, I'm using find . -type f -print0 | xargs -0 md5sum > checksums.md5 to calculate the checksums, md5sum -c checksums.md5 to check them, and I version control the checksum file. That helps to check the data at different locations/on different machines. Seems to be the best we can do at the moment. – Johann Sep 29 '15 at 21:30
  • If by modifying your data you always change its file name, then it might be a good solution. Otherwise, I would highly recommend checking the data itself, for example with rsync on (a copy of) the original data. One other possibility, which is common in neuroscience (although I do not like it so much because sometimes it is not as well documented as it should be), is to use the nipype python package, which can be seen as a (sort of) workflow manager that manages the cache of binary data of the intermediate steps of the analysis automatically. – norok2 Oct 01 '15 at 09:11
  • @norok you've described a great general framework. I've implemented something similar in DVC tool - please take a look at my answer below. I'd appreciate your feedback. – Dmitry Petrov May 13 '17 at 23:10
11

Try looking at Git Large File Storage (LFS). It is new, but might be the thing worth looking at.

There is also a discussion on Hacker News that mentions a few other ways to deal with large files.

Piotr Migdal
  • 756
  • 5
  • 15
  • Unfortunately, Git LFS only accepts files up to 2 GB. In my case I'd like to manage 15 GB system images, so I'm still looking for a solution. – fralbo Nov 28 '19 at 16:04
9

This is a pretty common problem. I had this pain when I did research projects at a university, and I have it now in industrial data science projects.

I've created and recently released an open source tool to solve this problem - DVC.

It basically combines your code in Git with your data on your local disk or in the cloud (S3 and GCP storage). DVC tracks the dependencies between data and code and builds the dependency graph (DAG). It helps you make your project reproducible.

A DVC project can easily be shared: sync your data to the cloud (dvc sync command), share your Git repository, and provide access to your data bucket in the cloud.

"learnable in a few hours" - is a good point. You should not have any issues with DVC if you are familiar with Git. You really need to learn only three commands:

  1. dvc init - like git init. Should be done in an existing Git repository.
  2. dvc import - import your data files (sources). Local file or URL.
  3. dvc run - steps of your workflow, like dvc run python mycode.py data/input.jpg data/output.csv. DVC derives the dependencies between your steps automatically, builds the DAG and keeps it in Git.
  4. dvc repro - reproduce your data file. Example: vi mycode.py - change the code, and then dvc repro data/output.csv will reproduce the file (and all of its dependencies).

To share data through the cloud you need to learn a couple more DVC commands, plus basic S3 or GCP skills.

The DVC tutorials are the best starting point.

MD004
  • 310
  • 1
  • 3
  • 10
Dmitry Petrov
  • 261
  • 3
  • 4
  • 1
    Can this be used for only storing large binary files (mostly videos)? ML is not the goal; the goal is to have a repo to store large binary files. The repo should have caching, selective checkout/pull (like Perforce) and a file/directory locking mechanism. Is it suitable for such a purpose? – hemu May 28 '18 at 05:20
  • 1
    @hemu Yes. DVC works just fine for the basic large-data-file scenario without the ML features (like ML pipelines and reproducibility). Perforce-style lock semantics are not supported due to Git semantics. Please use per-file checkout instead. – Dmitry Petrov May 29 '18 at 05:37
9

I have dealt with similar problems with very large synthetic biology datasets, where we have many, many GB of flow cytometry data spread across many, many thousands of files, and need to maintain them consistently between collaborating groups at (multiple) different institutions.

Typical version control like svn and git is not practical for this circumstance, because it's just not designed for this type of dataset. Instead, we have fallen back on using "cloud storage" solutions, particularly Dropbox and BitTorrent Sync. Dropbox has the advantage that it does at least some primitive logging and version control and manages the servers for you, but the disadvantages are that it's a commercial service, you have to pay for large storage, and you're putting your unpublished data on commercial storage; you don't have to pay much, though, so it's a viable option. BitTorrent Sync has a very similar interface, but you run it yourself on your own storage servers and it doesn't have any version control. Both of them hurt my programmer soul, but they're the best solutions my collaborators and I have found so far.

  • There is a popular open source version of Dropbox, OwnCloud. I haven't tried it, though. –  Feb 13 '15 at 21:35
9

I have used Versioning on Amazon S3 buckets to manage 10-100GB in 10-100 files. Transfer can be slow, so it has helped to compress and transfer in parallel, or just run computations on EC2. The boto library provides a nice python interface.
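For illustration, a minimal sketch with boto3 (the successor of the boto library mentioned above); the bucket name, object keys, and version id are placeholders, not the author's setup.

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-geodata-bucket"  # placeholder bucket name

    # Enable versioning once per bucket; later overwrites keep the old versions.
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Upload (or re-upload) a file; S3 assigns a new VersionId each time.
    s3.upload_file("data/scan_0001.tif", bucket, "raw/scan_0001.tif")

    # List all stored versions of that object.
    versions = s3.list_object_versions(Bucket=bucket, Prefix="raw/scan_0001.tif")
    for v in versions.get("Versions", []):
        print(v["VersionId"], v["LastModified"], v["IsLatest"])

    # Restore a specific old version to a local file.
    s3.download_file(
        bucket, "raw/scan_0001.tif", "scan_0001_old.tif",
        ExtraArgs={"VersionId": "PUT-A-VERSION-ID-HERE"},  # placeholder version id
    )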

fritzo
  • 236
  • 1
  • 3
6

We don't version control the actual data files. We wouldn't want to, even if we stored them as CSV instead of in a binary form. As Riccardo M. said, we're not going to spend our time reviewing row-by-row changes on a 10M-row data set.

Instead, along with the processing code, I version control the metadata:

  • Modification date
  • File size
  • Row count
  • Column names

This gives me enough information to know if a data file has changed and an idea of what has changed (e.g., rows added/deleted, new/renamed columns), without stressing the VCS.
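A minimal sketch of how that metadata could be collected for CSV files (a hypothetical script, not the author's actual tooling); the JSON output is what would be committed, not the data itself.

    import csv
    import json
    import os
    import sys

    def describe(path):
        # Collect lightweight metadata that is cheap to diff in version control.
        st = os.stat(path)
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, [])
            row_count = sum(1 for _ in reader)
        return {
            "file": path,
            "modified": st.st_mtime,
            "size_bytes": st.st_size,
            "row_count": row_count,
            "columns": header,
        }

    if __name__ == "__main__":
        # e.g. python describe_data.py data/*.csv > data_metadata.json
        print(json.dumps([describe(p) for p in sys.argv[1:]], indent=2))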

BChan
  • 131
  • 1
  • 4
2

I haven't used them myself, but there was a similar discussion in a finance group.

The data repository software suggestions there were SciDB, ZFS, and http://www.urbackup.org/.

seanv507
  • 751
  • 3
  • 12
0

Generally speaking, there are 2 different approaches.

  • Store the data in one system and use another system to manage the version control overhead: If you have access to a blob store (like Amazon S3), then tools like DVC work pretty well for this. I believe they've already been mentioned in this thread. The benefit here is that blob stores essentially have no storage limits, so you don't have to worry about the scale of your data. The downside is that the workflow is a bit less intuitive and you have state stored in a few different systems.

  • Use a scalable system that can version both your data and metadata: The other approach is to use a platform that can do both. I work at XetHub and this is the approach we're taking. Here's an example repo containing Meta's Llama 2 models, which are basically binary files. Because we deduplicate repetitions in the data, we're able to store 660 GB of files using just 568 GB. You can keep using Git as the interface if you'd like, or use our simplified Xet command line interface instead. The downside is that while XetHub can scale to petabytes of data in your repos, we may struggle at the 100 PB+ scale that Amazon S3 can support.

Regarding your checklist, XetHub doesn't require you to run a server, shows you diffs in pull requests, etc.

0

You may take a look at my project called DOT: Distributed Object Tracker repository manager.
It is a very simple VCS for binary files for personal use (no collaboration).
It uses SHA1 for checksumming and block deduplication, with full P2P syncing.
One unique feature: an ad-hoc, one-time TCP server for pull/push.
It can also use SSH for transport.

It is not yet released, but might be a good starting point.
http://borg.uu3.net/cgit/cgit.cgi/dot/about/

Archie
  • 863
  • 8
  • 20
Borg
  • 1
0

You could try using Hangar. It is a relatively new player in the data version control world but does a really nice job by versioning the tensors instead of versioning the blob. The documentation is probably the best place to start. Since the data is stored as tensors, you should be able to use it directly inside your ML code (and Hangar now has data loaders for PyTorch and TensorFlow). With Hangar, you get all the benefits of git, such as zero-cost branching, merging, and time travel through history. One nice feature of cloning in Hangar is partial cloning: if you have 10 TB of data at your remote and only need 100 MB for prototyping your model, you can fetch just those 100 MB via a partial clone instead of a full clone.