
I am a PhD student in data science; essentially, I design models for a vision/language task. The dataset and the state-of-the-art models are public. For the past two years I have been using Docker for my dev environments and experiments, and I am wondering whether that is overkill. Most of my colleagues use Conda and they seem to be fine with it.

TL;DR: What are potential issues with using Conda to set up my development environment as an academic data scientist?

Context

Suppose you're designing or fine-tuning deep learning models; essentially, you're a data scientist. In this scenario, you're operating in an academic environment, which means there's no need to ship your models into production for a client or a demo. Your goal is to explore, train, and test your model on a given public dataset for your research paper. After conducting your experiments, you'll probably maintain a public repository to share your code and facilitate publication. Additionally, you have access to a remote cluster for computations, since you probably don't have four A100s in your local machine.

Note: The research process isn't quite as straightforward as this, as it's iterative and can be chaotic. I've just listed the main steps for simplicity.

Question

What could be the potential issues with using Conda to set up a deep learning environment?

My Attempted Answer

Here's my perspective, though I'm unsure of its accuracy. I don't use Conda myself; instead, I use Docker, and I'm wondering if that might be overkill.

Conda can construct a comprehensive development environment, encapsulated in a configuration file, environment.yml. This file can pin your Python packages, helper programs, and libraries with their versions. However, with Conda you don't specify your system's requirements explicitly; instead, you download Conda packages. Thus, for tools like Git, the CUDA Toolkit, and others, you need to find corresponding Conda packages.
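For illustration, here is a minimal sketch of such an environment.yml (package names and versions are illustrative, not a recommendation); note that even Git and the CUDA runtime are pulled in as Conda packages:

```yaml
name: vl-experiments        # hypothetical environment name
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.10
  - pytorch=2.1             # the framework itself
  - pytorch-cuda=11.8       # CUDA runtime libraries, as a Conda package
  - torchvision
  - git                     # even Git comes from a Conda package here
  - pip
  - pip:
      - transformers        # pip dependencies can be mixed in
```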

Now let's turn to the specifics of a deep learning environment, particularly the CUDA libraries. Thanks to these libraries, we can leverage the GPU for model computations. If you look at the CUDA documentation, you'll find extensive tables outlining the required dependencies for each host OS. Therefore, with Conda you cannot have a single configuration file that is host-independent: the Conda packages you need to install differ depending on the host's OS.
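As a concrete example (assuming NVIDIA's `nvidia` Conda channel, which to my knowledge only ships Linux and Windows builds):

```shell
# On a Linux x86-64 host this resolves and installs:
conda install -c nvidia cuda-toolkit

# On macOS the same command fails to resolve, because the nvidia
# channel provides no osx builds -- so an environment.yml listing
# cuda-toolkit is not host-independent.
```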

A possible solution is Docker, or any other container solution. A container shares the kernel with the host machine but bundles its own user space, which can come from any Linux distribution. This means you can use the same container across different machines, because the container includes the OS layer that Conda lacks.
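As a rough sketch of what this looks like in practice (the base-image tag is illustrative; pick one compatible with your host's GPU driver), a Dockerfile can pin the OS layer and the CUDA libraries together:

```dockerfile
# OS layer + CUDA runtime + cuDNN pinned in a single base image
# (tag is illustrative -- choose one matching your GPU driver)
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

# System tools come from the image's own package manager
RUN apt-get update && apt-get install -y --no-install-recommends \
        git python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies, pinned in a requirements.txt of your own
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

WORKDIR /workspace
```

The same image then runs unchanged on a laptop and on the cluster (started with `docker run --gpus all ...`, which requires the NVIDIA Container Toolkit on the host).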

I would welcome any corrections or additions to this post. I'm relatively inexperienced with these matters, so I may have overlooked something. Ultimately, I'm seeking a definitive answer to my question; I'm not certain whether my attempted answer is part of the solution or entirely incorrect. Any insights would be greatly appreciated.

Note: I read this exhaustive answer. The problem is that it does not fit my case: I do not have many different clusters, production pipelines, or any other complicated requirements. I "just" need to develop, train remotely, experiment, and share a reproducible model.
