Computer Science Notes

16 Nov 2020

Three Common Ways for Comparing Two Dataset Distributions

From time to time you will need to compare the distributions of two datasets. There is plenty of information about this topic in statistics books and all over the Internet. In this post I discuss three very practical approaches coming from different perspectives. [3.5 min read] (updated 04/01/2021)

11 Oct 2020

Resources for a Gentle Introduction to Machine Learning

Jurgen has his own list of recommended resources for new members of his lab. Well, we have ours too. Here is a portion of the list for a gentle introduction to Machine Learning for LABSIN new members. [5 min read]. (updated 09/24/2021).

22 Sep 2020

Machine Learning Experimental Design 102

The usual approach of shuffling the data and splitting it into train and test sets may not be the best strategy in some cases for getting a good error estimate of your model. Sometimes purely random splits do not guarantee the required level of dissimilarity between the sets. In addition, all the precautions you take to avoid contaminating your test set while manipulating the training set must also be taken for the test fold during cross-validation. [4 min read]
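
As a rough illustration of that last point, here is a minimal sketch in plain Python. It assumes the training-set manipulation is a simple standardization step; the function names `k_fold_indices` and `standardize_per_fold` are made up for this example and do not come from the post. The scaling statistics are computed on the training part of each fold only and then reused on the held-out fold.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def standardize_per_fold(values, k=5, seed=0):
    """Yield (train_scaled, test_scaled) for each fold.

    The mean and standard deviation come from the training part only and
    are then reused on the held-out fold, so no test statistics leak in.
    """
    folds = k_fold_indices(len(values), k, seed)
    for test_idx in folds:
        held_out = set(test_idx)
        train = [values[i] for i in range(len(values)) if i not in held_out]
        test = [values[i] for i in test_idx]
        mu = statistics.mean(train)
        sigma = statistics.pstdev(train) or 1.0  # guard against zero spread
        yield ([(x - mu) / sigma for x in train],
               [(x - mu) / sigma for x in test])
```

Computing `mu` and `sigma` once on the whole dataset before splitting would be the contaminated variant: every test fold would then have influenced the preprocessing of its own training data.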

18 Sep 2020

Machine Learning Experimental Design 101

Experimental design in machine learning is well established. However, from time to time it is important to revisit the process and analyze the confidence level you have in your results. Machine learning shares a lot with statistics, but since machine learning practitioners have a more practical vision, experimental design is sometimes neglected when applied to real-world problems. This note explains the basic strategy followed in almost any machine learning experimental setup.

06 Sep 2020

Are Boosting Algorithms the new baseline model for your Tabular data? Part 1

Neural networks rule the world of machine learning if and only if you have a lot of data, and only for a reduced set of problems. The fact is that for heterogeneous (numerical and categorical) tabular data, decision trees are still one of the best options. They also have the benefit of being (more) explainable to the customer. Boosted decision trees are among the most successful algorithms in data science competitions, but could they replace Random Forest, the absolute leader when you try a first model on your data? [updated]