Data Version Control (DVC)

Blog Post
GitHub

Technologies

Docker
DVC
Python
GitHub Actions

Summary

Implemented Data Version Control (DVC) to version and manage datasets and machine learning models within a deep learning–driven drug discovery pipeline. Automated the curation of large, heterogeneous datasets from public repositories so that every machine learning workflow is reproducible and auditable. Versioning data alongside code made results traceable to their inputs, improved collaboration between data scientists and domain experts, and established a foundation for reproducible research.

What did I do?

Built reproducible ML pipelines using Data Version Control (DVC) to manage large-scale drug discovery data and ensure full experiment traceability.
Automated dataset curation from public sources (ChEMBL, PubChem) into a unified, versioned repository for consistent and transparent data workflows.
Used DVC to version datasets, features, and models, so that each experiment can be traced back to the inputs that produced it.
Integrated DVC with Docker to create isolated, consistent environments that preserved dependencies and configurations.
Reduced experimental drift, improved collaboration, and enabled reproducible deployment across local, HPC, and cloud environments.
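
To illustrate the idea behind the dataset versioning described above, here is a minimal Python sketch. It is not the project's code and not DVC's implementation: it only shows the general principle of recording a content hash for a data file in a small pointer file that can be committed to Git, then comparing against it later to detect changes. DVC records an MD5 hash in its `.dvc` files, which is why MD5 is used here. The pointer file format and the function names (`file_md5`, `write_pointer`, `has_changed`) are hypothetical and chosen for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small JSON pointer file next to the data file.

    The pointer format is hypothetical; it only mimics the idea of a
    lightweight, Git-friendly record of a large file's content hash.
    """
    pointer = {
        "path": data_path.name,
        "md5": file_md5(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2, sort_keys=True))
    return pointer_path


def has_changed(data_path: Path, pointer_path: Path) -> bool:
    """Return True if the data file no longer matches the recorded hash."""
    pointer = json.loads(pointer_path.read_text())
    return file_md5(data_path) != pointer["md5"]


if __name__ == "__main__":
    # Demo on a throwaway file; the file name and contents are placeholders.
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "compounds.csv"
        data_file.write_text("id,value\n1,0.5\n")
        pointer_file = write_pointer(data_file)
        print("changed after snapshot:", has_changed(data_file, pointer_file))

        data_file.write_text("id,value\n1,0.7\n")
        print("changed after edit:", has_changed(data_file, pointer_file))
```

Because only the small pointer file needs to live in Git, the large data file itself can be stored elsewhere, which is the same separation that makes versioning large datasets practical.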