DVC: Artifact Version Control for Machine Learning
If you’re learning Machine Learning, just starting your professional journey in the field, or already have some experience under your belt, one challenge you may face is keeping track of your data, experiments, and other artifacts. Fortunately, there’s a solution to this problem: DVC (Data Version Control), an open-source tool designed to be developer-friendly. With DVC, you won’t need to rewrite your code to use it. All you have to do is install it and start using it; it’s that simple.

Takeaways from this short blog
- The need for DVC as a tool for tracking and versioning data, experiments, and other artifacts in ML projects
- The common challenges faced by ML developers and the ways in which DVC aims to solve them
- A general overview of what a typical ML workflow looks like
- The prerequisites for following along with the rest of the blog series, including knowledge of Git commands, data pre-processing, and model training
ML Workflow
In order to understand the context of the discussion about machine learning development workflow, it’s important to understand that developing an ML project is not a one-time task. Instead, it is an iterative process that involves moving between different activities. The typical ML development workflow can be broadly divided into the following five phases:
- Data management and analysis
- Experimentation
- Solution development and testing
- Deployment and serving
- Monitoring and maintenance

Each phase requires a different set of tools and methodologies, and we will discuss those in the upcoming blogs.
Common Challenges in ML Project Collaboration
Collaborating on ML projects can come with its own set of challenges, including:
- Difficulty sharing large datasets among teammates
- Work duplication due to lack of visibility into other team members’ progress
- Slow updates and lack of visibility into a project’s latest status
- Unreliable or non-reproducible pipelines, resulting in models that vary with machine configuration
- Inconsistent data quality, hindering reproducibility
- Difficulty tracking and comparing model metrics
These are just a few examples of the types of challenges you may encounter while working on ML projects with a team. However, many of these issues can be overcome by ensuring that your projects are reproducible for everyone involved. Reproducibility ensures that the artifacts produced by experiments, such as models and datasets, can be used throughout the ML development workflow.
To achieve reproducibility, there are several prerequisites to consider, including:
- Environment dependency control
- Code version control
- Control over run parameters
- Automated pipelines
- Artifact version control
- Experiment results tracking
- Automated CI/CD and MLOps
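To make one of these points concrete, control over run parameters is often achieved by keeping them in a versioned file such as `params.yaml`, which DVC pipeline stages can declare as dependencies. A minimal hypothetical example (the stage name and values are illustrative):

```yaml
# Hypothetical params.yaml: run parameters versioned in Git alongside the code
train:
  learning_rate: 0.01
  epochs: 20
  seed: 42
```

Because this file lives in Git, every experiment’s parameters are recorded and can be compared across commits.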
Throughout this blog series, we will be covering these points as a checklist to help you overcome the common challenges of collaborating on ML projects.
Is there really a need for DVC?
As we all know, Jupyter Notebooks and Google Colab are popular choices for experimenting and developing Machine Learning models. However, as projects grow in complexity, code length, and number of experiments, it becomes increasingly difficult to keep everything organized. Additionally, Jupyter Notebooks are not the best tool for versioning code. To address these challenges, we need a system that can:
- Organize code into reusable units
- Use Git for version control
- Make dependencies and requirements explicit
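DVC pipelines are one way to meet these three requirements: stages declared in a `dvc.yaml` file name their commands, dependencies, and outputs explicitly, and the file itself is versioned with Git. A hypothetical two-stage pipeline (the script and file names are placeholders) might look like this:

```yaml
# Hypothetical dvc.yaml: each stage lists its command, inputs (deps), and outputs (outs)
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - model.pkl
```

With the dependencies spelled out like this, running `dvc repro` re-executes only the stages whose inputs have changed.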
Prerequisites
Before diving into this series of blogs, it’s important to have a basic understanding of Git commands such as git add, git commit, git push, git fetch, git pull, git checkout, git merge, and git rebase. Additionally, a foundation in data pre-processing, model training, and saving models will help you follow the content covered in these blogs.
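As a quick refresher on the basic Git cycle mentioned above, here is a short sketch you can run in a scratch directory (file and branch names are illustrative):

```shell
# Basic Git round-trip: init, stage, commit, branch, inspect history
mkdir -p /tmp/git-refresher && cd /tmp/git-refresher
git init -q
git config user.email "you@example.com" && git config user.name "Your Name"
echo "print('training...')" > train.py
git add train.py                        # stage the new file
git commit -q -m "Add training script"  # record a snapshot
git checkout -q -b experiment           # branch off for an experiment
git log --oneline                       # inspect the history
```

If these commands feel familiar, you have everything you need to follow along with the rest of the series.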
Blog Links
Stay tuned for our upcoming blog series where we’ll be discussing important topics in ML/Data Science such as:
- The best practices for structuring your working directory (Published)
- A step-by-step guide on how to install DVC
- How to connect DVC to remote storage options like GCP Bucket, AWS S3, and even Google Drive to store and version your data
- Tips and tricks for sharing artifacts across team members
Links to these informative blogs will be available soon, so make sure to check back and learn more about optimizing your ML/Data Science workflow.
That concludes our discussion for now. Be sure to follow me for further updates and insights on Machine Learning and Data Science. You can also connect with me on LinkedIn for more information and to stay in touch.