What Directory Structure Should I Follow for an ML/DS Project?


Do you find yourself drowning in a sea of Jupyter Notebooks, struggling to keep track of your ML/DS projects and collaborate with your team? Look no further! This blog will guide you through the process of creating a well-organized project directory and provide tips for seamless collaboration with your team members, putting an end to your project management woes.

Why is Project Structure Important?

When it comes to working on Machine Learning and Data Science projects, it’s easy to get caught up in using Jupyter Notebooks. They’re great for experimenting and exploring data, and they make it easy to share your results with others. But what happens when it’s time to take your project to the next level and put it into production? That’s where having a good project structure becomes essential.

A well-organized project structure makes it easier for other developers to understand and work on your project. It also makes it easier to move your code from a development environment to a production environment. That’s why it’s important to start thinking about your project structure from the very beginning of your project, instead of trying to fix it later on.

Here’s why you should follow a good directory structure:

Others will thank you.

Well-organized code tends to be self-documenting in that the organization itself provides context for your code without much overhead. People will thank you for this because they can:

  • Collaborate more easily with you on this analysis
  • Learn from your analysis about the process and the domain
  • Feel confident in the conclusions at which the analysis arrives

You will thank you.

Ever tried to reproduce an analysis that you did a few months ago, or even a few years ago? You may have written the code yourself, but now it’s impossible to tell whether you should run a.py.old, b.py, or c.py to get things done. Here are some questions you have probably asked yourself, or will at some point:

  • Are we supposed to go in and join column X to the data before we get started or did that come from one of the notebooks?
  • Come to think of it, which notebook do we have to run first before running the plotting code: was it “process data” or “clean data”?
  • Et cetera, times infinity.

These types of questions are painful and are symptoms of a disorganized project. A good project structure encourages practices that make it easier to come back to old work.

Nothing here is binding

Disagree with a couple of the default folder names? Working on a project that’s a little nonstandard and doesn’t exactly fit with the current structure? Prefer to use a different package than one of the provided ones?

Go for it! This is a lightweight structure intended to be a good starting point for many projects.

Let’s break the suspense: here is the directory structure that will set you up well when starting an ML/DS project.

Best Directory Structure

The best repository structure is the Cookiecutter Data Science project structure, which contains all the elements needed to lay a good foundation for your project:

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-db-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e .`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

You can add additional folders as your project requires.
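To make the data flow concrete, here is a minimal sketch of what a script like src/data/make_dataset.py could contain. It reads an immutable file from data/raw and writes a cleaned copy to data/processed; the function name, the CSV file name, and the cleaning step are illustrative assumptions, not part of the template itself.

# src/data/make_dataset.py -- illustrative sketch, not the file the template generates
from pathlib import Path

import pandas as pd

RAW_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")


def make_dataset(filename: str = "dataset.csv") -> Path:
    """Read an immutable raw file, clean it, and write the processed version."""
    df = pd.read_csv(RAW_DIR / filename)

    # Placeholder cleaning step; real transformation logic goes here.
    df = df.drop_duplicates()

    PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
    output_path = PROCESSED_DIR / filename
    df.to_csv(output_path, index=False)
    return output_path


if __name__ == "__main__":
    make_dataset()

Keeping the raw data untouched and writing every derived artifact to interim/ or processed/ is what makes the pipeline easy to reproduce later.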

One more piece of good news: you don’t need to create this structure by hand. Just run the following commands to get started:

$ pip install cookiecutter
$ cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science
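The generated project includes a setup.py so the src package can be installed in editable mode; a minimal sketch along these lines shows the idea (the exact generated file may differ in its fields):

# setup.py -- minimal sketch; the file the template generates may differ in details
from setuptools import find_packages, setup

setup(
    name="src",
    packages=find_packages(),
    version="0.1.0",
    description="An ML/DS project based on the Cookiecutter Data Science template",
)

Running `pip install -e .` from the project root installs src in editable mode, so imports like `from src.features import build_features` work the same way in notebooks, scripts, and tests.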

Below is the preceding article in this blog series. Follow me to learn how industry-standard ML is implemented.

That concludes our discussion for now. Be sure to follow me for further updates and insights on Machine Learning and Data Science. You can also connect with me on LinkedIn to stay in touch.
