Reproducible Research

23 Jun 2022

Using DVC pipelines with examples in R (stages)

DVC provides the building blocks for managing machine learning pipelines using Git repositories as a backend. If you already use Git to track your machine learning experiments, you will feel comfortable with DVC. [9 min read]

02 Jun 2022

Why Use DVC

Notebooks rule! We agree on that, but they can get messy very fast. The truth is they are not the best tool for good software engineering practices. Data Version Control (DVC) is a toolset that supports the development of machine learning experiments and encourages better practices. Are you ready to give it a try? [5 min read]

30 Nov 2021

GitHub flow for conducting research projects

The software industry has well-defined standards and procedures that rely heavily on tools such as GitLab. In research, however, we sometimes follow a more relaxed, less structured approach. At LABSIN we have recently begun to apply software industry practices to our daily work. The match is not perfect, since research differs from industry in some ways, but the benefits are clear. [9 min read]

29 Jun 2021

Deploying a simple ML model with Plumber 101

Sometimes notebooks are not enough and you will need to deploy your machine learning model into company infrastructure. The task involves a lot of software engineering knowledge, BUT with the Plumber package for R you can do the basics without much pain 😉. [6 min read]

29 Mar 2021

Inference with Observational Data

Despite selection and information bias, is it possible to do inference from non-randomized experiments? Good ol' statistics comes to the rescue with its strong theoretical framework. [6 min read]

16 Nov 2020

Three Common Ways for Comparing Two Dataset Distributions

From time to time you will need to compare the distributions of two datasets. There is plenty of information about this topic in statistics books and all over the Internet. In this post I discuss three very practical approaches coming from different perspectives. [3.5 min read] (updated 04/01/2021)

22 Sep 2020

Machine Learning Experimental Design 102

The usual approach of shuffling the data and splitting it into train and test sets may not be the best strategy in some cases for obtaining a good estimate of your model's error. Sometimes purely random splits do not guarantee the required level of dissimilarity. In addition, all the precautions you take to avoid contaminating your test set while manipulating the training set must also be applied to the test fold during cross-validation. [4 min read]

18 Sep 2020

Machine Learning Experimental Design 101

Experimental design in machine learning is well established. However, from time to time it is important to revisit the process and assess how much confidence you can place in your results. Machine learning shares a lot with statistics, but since machine learning practitioners tend to have a more practical outlook, experimental design is sometimes neglected when applied to real-world problems. This note explains the basic strategy followed in almost any machine learning experimental setup.

26 Aug 2020

Notebooks in Data Science Development and other tools for reproducibility.

Jupyter and RStudio notebooks have become the de facto standard for data science development. However, it is important to know their limitations and to recognize when it is time to move to a more "powerful" tool. Since a very significant portion of data science work is development, it is always important to be aware of the latest development tools.