DVC: Artifact Version Control for Machine Learning
If you’re learning Machine Learning, just starting your professional journey in the field, or already have some experience under your belt, one challenge you may face is keeping track of your data, experiments, and other artifacts. Fortunately, there’s a solution to this problem: DVC (Data Version Control), an open-source tool designed to be developer-friendly. With DVC, you won’t need to rewrite your code to use it. All you have to do is install it and start using it; it’s that simple.

Takeaways from this short blog
- The need for DVC as a tool for tracking and versioning data, experiments, and other artifacts in ML projects
- The common challenges faced by ML developers and the ways in which DVC aims to solve them
- A general overview of what a typical ML workflow looks like
- The prerequisites for following along with the rest of the blog series, including knowledge of Git commands, data pre-processing, and model training
ML Workflow
In order to understand the context of the discussion about machine learning development workflow, it’s important to understand that developing an ML project is not a one-time task. Instead, it is an iterative process that involves moving between different activities. The typical ML development workflow can be broadly divided into the following five phases:
- Data management and analysis
- Experimentation
- Solution development and testing
- Deployment and serving
- Monitoring and maintenance

Each phase requires a different set of tools and methodologies, and we will discuss those in the upcoming blogs.
Common Challenges in ML Project Collaboration
Collaborating on ML projects can come with its own set of challenges, including:
- Difficulty sharing large datasets among teammates
- Work duplication due to lack of visibility into other team members’ progress
- Slow updates and lack of visibility into a project’s latest status
- Unreliable or non-reproducible pipelines, resulting in models that vary with machine configuration
- Inconsistent data quality, hindering reproducibility
- Difficulty tracking and comparing model metrics
These are just a few examples of the types of challenges you may encounter while working on ML projects with a team. However, many of these issues can be overcome by ensuring that your projects are reproducible for everyone involved. Reproducibility ensures that the artifacts produced by experiments, such as models and datasets, can be used throughout the ML development workflow.
To achieve reproducibility, there are several prerequisites to consider, including:
- Environment dependency control
- Code version control
- Control over run parameters
- Automated pipelines
- Artifact version control
- Experiment results tracking
- Automated CI/CD and MLOps
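To make one of these points concrete, control over run parameters is often achieved by keeping them in a versioned file such as `params.yaml`, which DVC pipeline stages can declare as dependencies. A minimal hypothetical example (the stage name and values are illustrative):

```yaml
# Hypothetical params.yaml: run parameters versioned in Git alongside the code
train:
  learning_rate: 0.01
  epochs: 20
  seed: 42
```

Because this file lives in Git, every experiment’s parameters are recorded and can be compared across commits.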
Throughout this blog series, we will be covering these points as a checklist to help you overcome the common challenges of collaborating on ML projects.
Is there really a need for DVC?
As we all know, Jupyter Notebooks and Google Colab are popular choices for experimenting and developing Machine Learning models. However, as projects grow in complexity, code length, and number of experiments, it becomes increasingly difficult to keep everything organized. Additionally, Jupyter Notebooks are not the best tool for versioning code. To address these challenges, we need a system that can:
- Organize code into reusable units
- Use Git for version control
- Make dependencies and requirements explicit
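DVC pipelines are one way to meet these three requirements: stages declared in a `dvc.yaml` file name their commands, dependencies, and outputs explicitly, and the file itself is versioned with Git. A hypothetical two-stage pipeline (the script and file names are placeholders) might look like this:

```yaml
# Hypothetical dvc.yaml: each stage lists its command, inputs (deps), and outputs (outs)
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - model.pkl
```

With the dependencies spelled out like this, running `dvc repro` re-executes only the stages whose inputs have changed.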
Prerequisites
Before diving into this series of blogs, it’s important to have a basic understanding of Git commands such as git add, git commit, git push, git fetch, git pull, git checkout, git merge, and git rebase. Additionally, a foundation in data pre-processing, model training, and saving models will help you follow the content covered in these blogs.
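As a quick refresher on the basic Git cycle mentioned above, here is a short sketch you can run in a scratch directory (file and branch names are illustrative):

```shell
# Basic Git round-trip: init, stage, commit, branch, inspect history
mkdir -p /tmp/git-refresher && cd /tmp/git-refresher
git init -q
git config user.email "you@example.com" && git config user.name "Your Name"
echo "print('training...')" > train.py
git add train.py                        # stage the new file
git commit -q -m "Add training script"  # record a snapshot
git checkout -q -b experiment           # branch off for an experiment
git log --oneline                       # inspect the history
```

If these commands feel familiar, you have everything you need to follow along with the rest of the series.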
Blog Links
Stay tuned for our upcoming blog series where we’ll be discussing important topics in ML/Data Science such as:
- The best practices for structuring your working directory (Published)
- A step-by-step guide on how to install DVC
- How to connect DVC to remote storage options like GCP Bucket, AWS S3, and even Google Drive to store and version your data
- Tips and tricks for sharing artifacts across team members
Links to these informative blogs will be available soon, so make sure to check back and learn more about optimizing your ML/Data Science workflow.
That concludes our discussion for now. Be sure to follow me for further updates and insights on Machine Learning and Data Science. You can also connect with me on LinkedIn for more information and to stay in touch.