Posts

29 Mar 2025

Maybe it is time to focus on open LLMs for doing research?

The growing dependence on closed, inaccessible systems in academic publishing undermines the core principles of transparency, reproducibility, and openness. It is time to embrace open models and community-driven innovation as the path forward for meaningful, equitable research. [3 min read]

01 Apr 2024

Converting your own finetuned language model to GGUF

Sometimes your finetuned language model works as expected but you need faster inference. Other times you need to reduce its memory footprint. By converting your models to the GGUF format you can store quantized models and run them on top of the fast llama.cpp inference engine. [5 min read]

30 Jul 2023

Sharing your own dataset on Hugging Face

What are the benefits of using Hugging Face for sharing your datasets? Not sure really, but let's try it to see what all this hype is about. [5 min read]

08 Apr 2023

Publishing your model on Hugging Face.

Share your model, share your dataset, and provide a simple mechanism for using them. That is what research is all about. Hugging Face provides you with a great infrastructure for doing that and a little more. [5 min read]

19 Mar 2023

Using CNN for a Domain name Generation Algorithm (2)

DGA is a mechanism used by malware for establishing contact with the C2 channel. This is the second post of the series on creating a simple DGA using techniques for text generation. In particular, it uses a CNN built with Keras and TensorFlow for R. [6 min read]

12 Mar 2023

Don't be afraid of AI. Embrace it

The use of artificial intelligence (AI) algorithms in various fields is becoming an integral part of our lives. While some people are opposed to their use, others have embraced the technology and are using it. I am one of them. [6 min read]

27 Feb 2023

Using CNN for a Domain name Generation Algorithm (1)

DGA is a mechanism used by malware for establishing contact with the C2 channel. The idea behind this post is to show how to create a simple DGA using techniques for text generation. In particular, it uses a CNN built with Keras and TensorFlow for R. This is the first part of a series of two. [6 min read]

31 Oct 2022

Clustering techniques for time series

The good old clustering analysis techniques present some differences when applied to time series. Too many to discuss in one simple post. However, I will do my best to provide some examples of two basic approaches for doing time series analysis [6min read].

28 Sep 2022

Paninimania!!

Nobody doubts the importance for humankind of the PANINI sticker album for the FIFA World Cup. From a mathematical point of view, several interesting questions arise. How much money does a collector need to spend? How many other collectors do they need to interact with? What if a sticker pack had 6 stickers instead of 4? Rodralez, from LABSIN, developed an app for answering these and other questions [3min read].

04 Sep 2022

Beyond Random Split for Assessing Statistical Model Performance

Sometimes the standard splitting techniques used for testing your machine learning models can underestimate the generalization performance of the model. In this post, I present some of the most common approaches for splitting your data beyond the classical random split approach. [5min read]

31 Jul 2022

Art with Data

The idea of making art with code is not new, but what about Data? Can data be a work of art? Well, the truth is that thanks to conceptualism, it is possible. Trust me! [4min read]

23 Jun 2022

Using DVC pipelines with examples in R (stages)

DVC provides the elements for managing machine learning pipelines using git repositories as a backend. If you use git for tracking your machine learning experiments, you will feel comfortable with DVC. [9min read]

02 Jun 2022

Why Using DVC

Notebooks rule! We agree on that, but they can get messy very fast. The truth is they are not the best tool for good software engineering practices. Data Version Control (DVC) is a toolset for helping the development of machine learning experiments favoring better practices. Are you ready to give it a try? [5min read]

09 May 2022

The Crisis of Computer Science careers in LATAM

Are there any tangible benefits for the IT sector when hiring university graduates? Can a two-year program fill the industry's needs? These are valid questions that we should ask ourselves if we want to bring the university to current times. [6min read]

18 Feb 2022

Tackling the limitations of tree-based algorithms

Tree-based algorithms suffer from severe limitations when applied to forecasting problems. They can't predict values beyond the range observed in the training data. However, not everything is lost. There are some alternative approaches to improve the performance of tree-based algorithms under such scenarios. [5min read]

30 Nov 2021

Github flow for conducting research projects

The software industry has well-defined standards and procedures which are heavily based on tools such as Gitlab. However, in research we sometimes follow a more relaxed, less structured approach. At LABSIN we have recently begun to apply software industry approaches to our daily work. The match is not perfect since research differs in some ways. But the benefits are clear. [9min read]

19 Oct 2021

NO, Data Science is not just cleaning and transforming data!

Decent programming skills, strong math and stats knowledge, and amazing visuals are not enough for a data science position in the industry. These are just necessary tools you will need for doing your daily tasks, but you must not lose sight of the ultimate goal: "to provide valuable information to decision-makers" (Duh!). This is how you can make a difference, and companies know it. [5min read]

12 Sep 2021

SHAP values with examples applied to a multi-classification problem.

We cannot keep treating our models as black boxes. Remember, nobody trusts computers for making a very important decision (yet!). That's why the interpretation of Machine Learning models has become a major research topic. SHAP is a very robust approach for providing interpretability to any machine learning model. For multi-classification problems, however, documentation and examples are not very clear. [8min read]

03 Aug 2021

How confident is Random Forest about its predictions?

Given a prediction on a particular example, how sure is Random Forest about it? To answer this question, it is necessary to look beyond the usual performance metrics and dive into the swampy waters of confidence interval estimation for statistical learning algorithms 😖. [6 min read] (updated 11/21/22)

29 Jun 2021

Deploying a simple ML model with Plumber 101

Sometimes notebooks are not enough and you will need to deploy your machine learning model into company infrastructure. The task involves a lot of software engineering knowledge, BUT with the Plumber package for R you can do the basics without too much pain 😉. [6 min read]

14 May 2021

Thoughts about differences in ML evaluation for Academia and Industry

The processes and methods followed in Academia for evaluating a Machine Learning model are different from the approaches used by the Industry. Why? [4min read]

02 Apr 2021

Selected Papers for the Week 13/2021

Brief comments about 3 Papers dealing with Generalization and Autodifferentiation [2min read]

29 Mar 2021

Inference with Observational Data

Despite selection and information bias, is it possible to do inference from non-randomized experiments? Good ol' statistics comes to our aid with its strong theoretical framework. [6min read]

04 Mar 2021

Machine Learning in Production

Working with machine learning is not what it used to be. Let's face it. Now, there is much less time for hacking and much more time for deployment. The situation is not new and certainly not bad at all. That is why you should be prepared for the new roles and positions offered by the market. [5min read] (updated 04/09/2021)

14 Dec 2020

Feature Selection Strategies

Feature selection is a topic any machine learning practitioner should master. There are plenty of strategies for performing feature selection. Some are more useful than others. Some have more limitations than benefits. Here, I mention the most common approaches for feature selection using information collected from articles, books and research papers. [5 min read]

16 Nov 2020

Three Common Ways for Comparing Two Dataset Distributions

From time to time you will need to compare the distributions of two datasets. There is plenty of information about this topic in statistics books and all over the Internet. In this post I discuss three very practical approaches coming from different perspectives. [3.5 min read] (updated 04/01/2021)

11 Oct 2020

Resources for a Gentle Introduction to Machine Learning

Jurgen has his own list of recommended resources for new members of his lab. Well, we have ours too. Here is a portion of the list for a gentle introduction to Machine Learning for new LABSIN members. [5 min read]. (updated 09/24/2021).

22 Sep 2020

Machine Learning Experimental Design 102

The usual approach of shuffling the data and splitting it into train and test sets may not be the best strategy for getting a good error estimate of your model. Sometimes pure random splits do not guarantee the required level of dissimilarity. In addition, all the precautions taken to avoid contaminating your test set while manipulating the training set must also be taken for the test fold during cross-validation. [4min read]

18 Sep 2020

Machine Learning Experimental Design 101

Experimental Design in Machine learning is well established. However, from time to time it is important to revisit the process to analyze the confidence level you have in your results. Machine learning shares a lot with statistics, but since Machine learning practitioners have a more practical vision, sometimes the experimental design is neglected when applied to real-world problems. This note explains the basic strategy followed in almost any machine learning experimental setup.

06 Sep 2020

Are Boosting Algorithms the new baseline model for your Tabular data? Part 1

Neural networks rule the world of machine learning IFF you have a lot of data, and only for a reduced set of problems. The fact is that for heterogeneous (numerical and categorical) tabular data, decision trees are still one of the best options. Also, they have the benefit of being (more) explainable to the customer. Boosted decision trees are among the most successful algorithms in data science competitions, but could they replace Random Forest, the absolute leader when you try a first model on your data? [updated]

26 Aug 2020

Notebooks in Data Science Development and other tools for reproducibility.

Jupyter and RStudio notebooks have become the de facto standard for data science development. However, it is important to know their limitations and recognize when to move to a more "powerful" tool. Since a very significant portion of the work in data science is related to development, it is always important to be aware of the latest development tools.

20 Aug 2020

Double Descent in Deep Learning

The U-shape observed when measuring model performance on the test set as a function of model flexibility does not hold when training deep learning models. (WHAT!!!????? Is the world going mad? Not really.)

13 Aug 2020

COVID-19 Resources for the Amateur Epidemiologist

Some useful resources for playing with epidemic models (mostly COVID-19, of course). From an explanation of the difficulties behind building epidemic models to the different approaches followed by researchers and enthusiasts (me) here in my home town, Mendoza City.

06 Aug 2020

Computer Science Notes