Nobody doubts the importance for humankind of the PANINI sticker album for the FIFA World Cup. From a mathematical point of view, several interesting questions arise. How much money does a collector need to spend? How many other collectors do they need to interact with? What if a sticker pack had 6 stickers instead of 4? Rodralez, from LABSIN, developed an app to answer these and other questions. [3 min read]
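To give a flavor of the pack-size question, here is a minimal simulation sketch. It is not the app mentioned above. It assumes every sticker is equally likely, that stickers can repeat within a pack, and that the collector never swaps with anyone. The function names and the album size used in the usage note are hypothetical, chosen only for illustration.

```python
import random


def packs_to_complete(album_size, pack_size, rng):
    """Count the packs bought until every sticker has been seen at least once."""
    owned = set()
    packs = 0
    while len(owned) < album_size:
        # Assumption: each sticker in a pack is drawn uniformly, repeats allowed.
        for _ in range(pack_size):
            owned.add(rng.randrange(album_size))
        packs += 1
    return packs


def average_packs(album_size, pack_size, trials=200, seed=0):
    """Average number of packs over several simulated collectors (no swapping)."""
    rng = random.Random(seed)
    total = sum(packs_to_complete(album_size, pack_size, rng) for _ in range(trials))
    return total / trials
```

Calling `average_packs(50, 4)` and `average_packs(50, 6)` for a hypothetical 50-sticker album lets you compare the two pack sizes. Multiplying the result by a pack price of your choice gives a rough cost estimate under these assumptions.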
Given a prediction on a particular example, how sure is a Random Forest about it? To answer this question, it is necessary to look beyond the usual performance metrics and dive into the swampy waters of confidence interval estimation for statistical learning algorithms 😖. [6 min read] (updated 11/21/22)
Despite selection and information bias, is it possible to do inference from non-randomized experiments? Good ol' statistics comes to our aid with its strong theoretical framework. [6 min read]
The usual approach of shuffling the data and splitting it into train and test sets may not be the best strategy for getting a good estimate of your model's error. Sometimes purely random splits do not guarantee the required level of dissimilarity between the sets. In addition, all the precautions you take to avoid contaminating your test set while manipulating your training set must also be taken for the test fold during cross-validation. [4 min read]
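One way to enforce dissimilarity between train and test is to split by group rather than by row, so that related examples never end up on both sides. The sketch below is only an illustration of that idea, not necessarily the strategy the post describes. It assumes each example carries a group label (for instance a patient or device identifier), and the function name is hypothetical.

```python
import random


def group_train_test_split(groups, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that no group appears on both sides."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out whole groups, at least one, instead of individual rows.
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx
```

The same reasoning applies inside cross-validation: any step fitted on data, such as scaling or feature selection, should be fitted on the training folds only and then applied to the held-out fold.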
Experimental design in machine learning is well established. However, from time to time it is important to revisit the process and analyze how much confidence you can place in your results. Machine learning shares a lot with statistics, but since machine learning practitioners tend to have a more practical outlook, experimental design is sometimes neglected when tackling real-world problems. This note explains the basic strategy followed in almost any machine learning experimental setup.