Computer Science Notes

CS Notes is a simple blog to keep track of CS-related stuff I consider useful.

19 Oct 2021

NO, Data Science is not just cleaning and transforming data!

by Harpo Maxx (5 min read)

Thanks to Tincho Marchetta from LABSIN for pointing me to the basis for this article.

When discussing the skills and common activities for a Data Science position, you will find a lot of articles on the Internet saying that creating and tuning machine learning models is just a small part of the whole data science workflow. In fact, you will find several articles saying that Data Science is not as sexy as you may think and that you will spend most of your working hours dealing with messy and incomplete data, and just a couple of minutes per day using your math (or stats) knowledge to build a predictive model. Not to mention those articles discussing the benefits of having decent software engineering skills to deal with the whole data pipeline. Well, if you are expecting me to tell you this is not the truth, you will be disappointed. Sorry, but this is absolutely true 😥.

But… (there is always a but) programming, software engineering, and even math or stats are not the only skills you will need as a data scientist (yes, things get worse). The truth is that for a Data Science position (not to be confused with an MLE position) you will also need the ability to find and transmit useful information to the decision-makers. In other words, you need to hide the whole data transformation process (cleaning, modeling, etc.) and communicate the insights you have found in a succinct way.

For a Data Science position you will need the ability to find and transmit useful information to the decision-makers.

The two key words from the previous paragraph are communication and insights. The first may be pretty obvious. I’m sure you have heard about communication skills for a DS position. But, believe me, they are more important than you think. Communication skills are FUNDAMENTAL during all the stages of the data science workflow. Please keep in mind that communication does not only refer to building nice visualizations for a presentation. You need to express your ideas verbally in a clear way. Imagine you are waiting for a coffee and you have just three minutes to explain an idea to your PM. That’s the kind of communication skill you need to improve, and that is the kind of communication skill recruiters look for during interviews (sometimes more than strong programming and math skills).

The word insight is even more important. Actually, discussing it is the whole purpose of this post (sorry you had to read this far to find it). To be honest, this is something I discovered very recently. At LABSIN, we do consultancy jobs, and my usual approach was to be “agnostic to the problem”. I didn’t follow Conway’s Venn diagram. In fact, I focused mostly on the programming and machine learning areas and left the expert to make all the important decisions. Of course, I made tons of very nice visualizations using the latest visualization packages I could find. But that was my limit. My business knowledge was minimal. I presented the results to the experts and left them to decide how to continue.

Well, I have some news for you: a data scientist who works with data sitting at her desk, without ever talking about the “business”, only exists in large organizations with giant data departments. In these scenarios, data scientists end up being a sort of “highly qualified secretary” for those who know the business. However, as Data Science is still an incipient discipline in the industry, in the vast majority of cases you will need a deep understanding of the business. In other words, you can’t do data science dissociated from the business. Therefore, things like “breathing the business, its problems, and its KPIs” can’t be ignored. You can’t say something like: “You are the domain expert, tell me the minimum I need to get the models working”. Instead, you have to take ownership of the problem. That is the only way you can get valuable insights from the business data. After all, this is what people are paying you for.

Things like “breathing the business, its problems, and its KPIs” can’t be ignored.

I recently discovered the term storytelling, which in my opinion is more or less related to the concepts of communication + insights. Storytelling is a fundamental skill in data science positions. There is a well-known book about storytelling and, in the first chapter, the author stresses the importance of insights as part of your communication process: at the end of any presentation, you must leave a recommendation of steps to follow or actions to take.

As I told you, I used to be agnostic to the problem/business. So, I usually presented all the analysis results and stopped there, letting the expert propose a possible course of action. Well, this is clearly not a good approach. In fact, you have to give a recommendation at all costs. Your recommendations have the objective of forcing a discussion. If you give none, the decision-maker (or expert) can simply look at your results, say OK, and move on to the next thing in the schedule. Let me be clearer about this: even without knowing the domain, you have to propose an action (even an incredibly wrong action) in order to make your audience think and force a discussion that can eventually lead to a correct action!

Just a few more words…

To be honest, I was comfortable being agnostic to the problem. I accept I was wrong, but I will still need some time to learn this new skill. So, I will probably need to rewrite this post in a couple of months 😜. As usual, I finish this post with some places to continue reading and learning about this topic.

[1] A Forbes article explaining why storytelling is important. (way better than this post)

[2] A video with an example about storytelling applied to the legal business.

[3] Storytelling with Data (Cole Nussbaumer Knaflic). A (highly recommended) book about storytelling.