Sometimes your fine-tuned language model works as expected but you need faster inference. Other times you need to reduce its memory footprint. By converting your models to the GGUF format, you can store quantized models and run them on top of the fast llama.cpp inference engine. [5 min read]
What are the benefits of using Hugging Face for sharing your datasets? Not really sure, but let's try it and see what all the hype is about. [5 min read]
Share your model and your dataset, and provide a simple mechanism for using them. That is what research is all about. Hugging Face provides you with a great infrastructure for doing that and a little more. [5 min read]
A domain generation algorithm (DGA) is a mechanism used by malware to establish contact with its command-and-control (C2) channel. This is the second post in the series on creating a simple DGA using text generation techniques. In particular, it uses a CNN built with Keras and TensorFlow for R. [6 min read]
The use of artificial intelligence (AI) algorithms in various fields is becoming an integral part of our lives. While some people are opposed to their use, others have embraced the technology and are using it. I am one of them. [6 min read]
A domain generation algorithm (DGA) is a mechanism used by malware to establish contact with its command-and-control (C2) channel. The idea behind this post is to show how to create a simple DGA using text generation techniques. In particular, it uses a CNN built with Keras and TensorFlow for R. This is the first part of a two-part series. [6 min read]
The good old clustering analysis techniques present some differences when applied to time series. Too many to discuss in one simple post. However, I will do my best to provide some examples of two basic approaches to time series clustering [6 min read].
Nobody doubts the importance for humankind of the PANINI sticker album for the FIFA World Cup. From a mathematical point of view, several interesting questions arise. How much money does a collector need to spend? How many other collectors do they need to interact with? What if a sticker pack had 6 stickers instead of 4? Rodralez, from LABSIN, developed an app for answering these and other questions [3 min read].
Sometimes the standard splitting techniques used for testing your machine learning models can misestimate the generalization performance of the model. In this post, I present some of the most common approaches for splitting your data beyond the classical random split. [5 min read]
The idea of making art with code is not new, but what about Data? Can data be a work of art? Well, the truth is that thanks to conceptualism, it is possible. Trust me! [4min read]
DVC provides the elements for managing machine learning pipelines using Git repositories as a backend. If you use Git for tracking your machine learning experiments, you will feel comfortable with DVC. [9 min read]
Notebooks rule! We agree on that, but they can get messy very fast. The truth is they are not the best tool for good software engineering practices. Data Version Control (DVC) is a toolset that supports the development of machine learning experiments while encouraging better practices. Are you ready to give it a try? [5 min read]
Are there any tangible benefits for the IT sector when hiring university graduates? Can a two-year program meet the industry's needs? These are valid questions that we should ask ourselves if we want to bring the university into the present. [6 min read]
Tree-based algorithms suffer from severe limitations when applied to forecasting problems. They cannot predict values beyond the range observed in the training data. However, not everything is lost. There are some alternative approaches to improve the performance of tree-based algorithms in such scenarios. [5 min read]
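The limitation is easy to see with a toy example. The sketch below is a hand-rolled, one-dimensional regression tree written only for illustration (it is not the implementation of any library, and the names `fit_tree` and `predict` are hypothetical). Each leaf stores the mean of its training targets, so a query far to the right of the training range can only return the value of the rightmost leaf.

```python
from statistics import mean


def _sse(values):
    """Sum of squared errors around the mean of `values`."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def _build(pairs, min_leaf):
    """Recursively split sorted (x, y) pairs at the threshold that minimizes SSE."""
    ys = [y for _, y in pairs]
    if len(pairs) < 2 * min_leaf:
        return {"value": mean(ys)}
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot place a threshold between identical x values
        cost = _sse(ys[:i]) + _sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return {"value": mean(ys)}
    i = best[1]
    return {
        "threshold": (pairs[i - 1][0] + pairs[i][0]) / 2,
        "left": _build(pairs[:i], min_leaf),
        "right": _build(pairs[i:], min_leaf),
    }


def fit_tree(xs, ys, min_leaf=2):
    """Fit a toy 1-D regression tree; every leaf predicts the mean of its targets."""
    return _build(sorted(zip(xs, ys)), min_leaf)


def predict(tree, x):
    """Walk down the tree until a leaf is reached and return its stored mean."""
    while "value" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]


# Synthetic upward trend: y = 2x observed only for x in 0..19.
xs = list(range(20))
ys = [2.0 * x for x in xs]
tree = fit_tree(xs, ys)

inside = predict(tree, 10)      # a point inside the training range
far_future = predict(tree, 100)  # a point far beyond the training range
print(inside, far_future, max(ys))
```

The true trend at x = 100 would be 200, but the prediction never exceeds the largest target seen during training, and it is the same for any x beyond the last split.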
The software industry has well-defined standards and procedures that rely heavily on tools such as GitLab. However, in research we sometimes follow a more relaxed and less structured way of working. At LABSIN we have recently begun to apply software industry approaches to our daily work. The match is not perfect, since research differs in some ways. But the benefits are clear. [9 min read]
Decent programming skills, strong math and stats knowledge, and amazing visuals are not enough for a data science position in the industry. These are just the tools you will need for your daily tasks, but you must not lose sight of the ultimate goal: "to provide valuable information to decision-makers" (Duh!). This is how you can make a difference, and companies know it. [5 min read]
We cannot continue treating our models as black boxes. Remember, nobody trusts computers to make a very important decision (yet!). That's why the interpretation of machine learning models has become a major research topic. SHAP is a very robust approach for providing interpretability to any machine learning model. For multi-class classification problems, however, the documentation and examples are not very clear. [8 min read]
Given a prediction on a particular example, how sure is Random Forest about it? To answer this question, it is necessary to look beyond the usual performance metrics and dive into the swampy waters of confidence interval estimation for statistical learning algorithms 😖. [6 min read] (updated 11/21/22)
Sometimes notebooks are not enough and you will need to deploy your machine learning model into the company infrastructure. The task involves a lot of software engineering knowledge, BUT with the Plumber package for R you can do the basics without too much pain 😉. [6 min read]
The processes and methods followed in academia for evaluating a machine learning model are different from the approaches used by the industry. Why? [4 min read]
Despite selection and information bias, is it possible to do inference from non-randomized experiments? Good ol' statistics comes to the rescue with its strong theoretical framework. [6 min read]
Working with machine learning is not what it used to be. Let's face it. Now, there is much less time for hacking and much more time for deployment. The situation is not new and certainly not bad at all. That is why you should be prepared for the new roles and positions offered by the market. [5min read] (updated 04/09/2021)
Feature selection is a topic any machine learning practitioner should master. There are plenty of strategies for performing feature selection. Some are more useful than others. Some have more limitations than benefits. Here, I mention the most common approaches for feature selection using information collected from articles, books, and research papers. [5 min read]
From time to time you will need to compare the distributions of two datasets. There is plenty of information about this topic in statistics books and all over the Internet. In this post I discuss three very practical approaches coming from different perspectives. [3.5 min read] (updated 04/01/2021)
Jurgen has his own list of recommended resources for new members of his lab. Well, we have ours too. Here is a portion of the list for a gentle introduction to Machine Learning for new LABSIN members. [5 min read]. (updated 09/24/2021).
The usual approach of shuffling the data and splitting it into train and test sets may not be the best strategy for getting a good error estimate of your model. Sometimes purely random splits do not guarantee the required level of dissimilarity between the two sets. In addition, all the precautions taken to avoid contaminating your test set while manipulating the training set must also be taken for the test fold during cross-validation. [4 min read]
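One common way to get more dissimilar train and test sets is to keep related rows together, so that everything sharing a grouping key (a patient, a device, a user) lands on the same side of the split. The sketch below is a minimal, standard-library illustration of that idea; the function name `group_train_test_split`, the grouping key, and the toy data are all assumptions made for the example, not something taken from the post.

```python
import random


def group_train_test_split(groups, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that no group appears on both sides.

    `groups` holds one (hypothetical) group label per row of the dataset.
    """
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)

    # Hold out whole groups rather than individual rows.
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])

    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy data: eight rows belonging to four hypothetical patients.
groups = ["p1", "p1", "p2", "p2", "p2", "p3", "p4", "p4"]
train_idx, test_idx = group_train_test_split(groups)

train_groups = {groups[i] for i in train_idx}
held_out_groups = {groups[i] for i in test_idx}
print(sorted(train_groups), sorted(held_out_groups))
```

The same rule applies inside cross-validation: each fold should hold out whole groups, and any preprocessing fitted on the data should be refitted on the training part of each fold only.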
Experimental design in machine learning is well established. However, from time to time it is important to revisit the process to analyze the confidence level you have in your results. Machine learning shares a lot with statistics, but since machine learning practitioners have a more practical vision, the experimental design is sometimes neglected when applied to real-world problems. This note explains the basic strategy followed in almost any machine learning experimental setup.
Neural networks rule the world of machine learning if and only if you have a lot of data, and only for a reduced set of problems. The fact is that for heterogeneous (numerical and categorical) tabular data, decision trees are still one of the best options. Also, they have the benefit of being (more) explainable to the customer. Boosted decision trees are among the most successful algorithms in data science competitions, but could they replace Random Forest, the absolute leader when you try a first model on your data? [updated]
Jupyter and RStudio notebooks have become the de facto standard for data science development. However, it is important to know their limitations and to recognize when it is time to move to a more "powerful" tool. Since a very significant portion of the work in data science is related to development, it is always important to be aware of the latest development tools.
The U-shaped curve observed when measuring a model's performance on the test set as a function of its flexibility does not hold when training deep learning models. (WHAT!!!????? Is the world going mad? Not really.)
Some useful resources for playing with epidemic models (mostly COVID-19, of course). From an explanation of the difficulties behind building an epidemic model to the different approaches followed by researchers and enthusiasts (me) here in my hometown, Mendoza City.
Beware of the Random Forest Gini index for feature importance. Plus some other resources related to feature selection, such as how to use PCA and the problems (or not) behind collinearity.