Image source: Carta Marina, ca 1544 – via Wikipedia
This article was originally published on datawanderings.com
Wanted: Data Scientist
Data Scientists find it difficult to explain their jobs to their parents, and there is a high chance their employers do not understand what they do, either. Word has it that a Data Scientist is an elusive hybrid of a programmer, a statistician, and a business analyst. To complicate matters further, every business has its own take on this definition, usually adding requirements to the mix rather than limiting the scope. This creative process has resulted in the Data Scientist being an idea of an employee rather than a flesh-and-blood office worker; something that Mary Shelley could have had a blast finding a name for. To see how absurd the expectations about a Data Scientist’s skill set have become, look at what seems to be a LinkedIn-wide agreement on their resume: proficiency in 4 programming languages, an understanding of distributed computing, a PhD in Physics or Engineering, communication and presentation skills, and preferably some production system administrator experience. Separately, these skills constitute at least 14 distinct career paths (I didn’t really calculate that). The eclectic mix of competencies that defines the role makes it close to impossible to find a person who could fill it. Accumulating this sort of knowledge – unless you are a Silicon Valley-bred engineer with a lot of time on your hands and unlimited funding – takes years. It requires practice, enthusiasm, and luck, as you would need to have worked on very distinct projects in the past and to have covered various roles along the way. No wonder the market struggles to fill Data Scientist vacancies. And, as the search is long and the goal is elusive, finding such a person is not only difficult, but also very expensive.
The keyword-based search cultivated by LinkedIn headhunters is similar to trying to catch a butterfly with an iron skillet: it is uninspiringly stupid. Obsessing about the level of Java expertise often misses the point by prioritizing years of experience over the situations in which this knowledge was applied. To a certain extent this approach is rational, as years can be measured and compared; a qualitative indicator cannot. However, it is also prejudiced: I continuously meet Business Intelligence folks with years spent on the job who construct their monthly reports through a correct orchestration of clicks and copy-paste. On one of my projects I worked with a senior ETL specialist who exercised his proficiency by building a complex data process flow. While impressive on the surface, it was mortifying on closer inspection: the complexity was uncalled for, powered by little understanding of the system architecture and, possibly, his ego.
Pro coders can build you a recommendation engine, but that doesn’t guarantee it won’t ruin your company. Gordon Linoff’s and Michael Berry’s Data Mining Techniques is not only an educational resource but also a thoroughly enjoyable read on how algorithms can go wrong. There is a story about a retail company that outsources the implementation of a you-might-also-like widget, popularised by Amazon, on its online platform. While shopping, customers are shown products compatible with the contents of their online baskets. If they consider getting a hammer, they are instantly recommended a set of nails. The widget is very successful, and the baskets grow fat. Oddly, instead of the anticipated swell in money, the company notices a counter-intuitive loss in revenue. Wait, weren’t the customers buying more? They were, and what they bought were indeed the algorithm-generated recommendations. A nuance of human behaviour was missed, however. Unsurprisingly (in hindsight), the cheaper a product was, the more people it attracted, and so the more aggressively it was pushed by the widget. The widget made these cheaper products easily accessible to everyone, essentially turning the goal of cross-selling into the nightmare of down-selling. Good tech in bad hands? Outsourcing gone awry?
A bird in the hand is worth two in the bush
Companies have plenty of clever resources. While none of them might go by “Data Scientist”, many have plenty of business-specific expertise and domain experience. These are the in-house developers, data analysts, or the lab crew. By expanding the search for the data magician to job boards, organisations miss out on building on the knowledge they already have and, instead, opt for starting from scratch.
From the cost perspective, it is more feasible to invest in current staff than to hire externally. Yet to some it sounds too far-fetched. Isn’t there a risk that throwing all that money at training an existing employee would a) be a waste of time, or b) result in the person skilling up and leaving the parents’ house? An external resource is ready now: you pay once and benefit right away. I say that is the classic two-birds-in-the-bush fallacy. Undoubtedly, the cost/risk calculation is a balancing act, but it is less straightforward than it looks. Yes, it costs time and money to get people trained. But neither is it cheap to get somebody new on board. Recruitment can run into months; then there is the monthly spending on the hire’s attractive salary; let’s also consider the time it takes for the person to get to know your business, its processes, and its people. Establishing personal relationships may be intangible, but it is often the key to unlocking the knowledge base of the company, or the key information that any analyst needs to do their job. Then we should multiply that cost by the number of Data Science recruits, as we don’t want the new hire to do everything single-handedly. Finally*, I will add to it the cost of the rest of the team sitting idle while the new experts are taking over. There are organisations that can afford a landslide business transformation, but most companies need and benefit from organic growth. The Data Science equivalent of the latter is investing internally and growing the resources you already have.
*I will spare my comments on organisations that refuse to train staff for fear of future attrition.
A little stir
Say we know the who. What about the how?
The key, and what the rest of this article makes a case for, is making this change in the team by applying the right mindset and through continuous education. As much as media outlets advise businesses to jump on the hype, businesses do not change abruptly. People, even the most talented people, do not grow new skills overnight. I realised this while working as an IT sales rep and talking to actual analysts: while I was trying to get my head around TensorFlow, the Kappa architecture, or whatever the latest thing was, some of them had only just learned that there is a data world beyond Excel or Access. What is happening in Silicon Valley reaches the rest of the industry years later and in a stripped-down form. It may be that Reinforcement Learning has the potential to change programming as we know it (or at least Go tournaments), but currently there are only a few people who can implement it, or even understand its workings. At the moment, creating a simple decision tree is a game changer for many: an analysis most of us can understand, design, test, and apply. An example of a decision tree would be working out which characteristics make customers most likely to purchase strawberry ice cream. A decision tree can turn a company’s approach by 180°, challenging former assumptions by looking at patterns in customer behaviour.
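To make the strawberry ice cream example a little more tangible, here is a minimal sketch of the very first step of building a decision tree: finding the single question that best separates buyers from non-buyers. Everything in it is an assumption made for illustration: the customer records, the two characteristics (age_group and season), and the function names are made up, and the split criterion used here is Gini impurity, which is one common choice among several. A real project would more likely reach for an existing library than hand-roll this.

```python
# Hypothetical toy data: every record and field name is invented for illustration.
CUSTOMERS = [
    {"age_group": "under_30", "season": "summer", "bought": True},
    {"age_group": "over_30", "season": "summer", "bought": True},
    {"age_group": "under_30", "season": "summer", "bought": True},
    {"age_group": "over_30", "season": "summer", "bought": True},
    {"age_group": "under_30", "season": "winter", "bought": False},
    {"age_group": "over_30", "season": "winter", "bought": False},
    {"age_group": "under_30", "season": "winter", "bought": False},
    {"age_group": "over_30", "season": "winter", "bought": True},
]


def gini(rows):
    """Gini impurity of the 'bought' label; 0.0 means the group is pure."""
    if not rows:
        return 0.0
    p = sum(r["bought"] for r in rows) / len(rows)
    return 2 * p * (1 - p)


def best_split(rows, features):
    """Return (feature, value, weighted_gini) for the best 'feature == value' question."""
    best = None
    for feature in features:
        for value in sorted({r[feature] for r in rows}):
            matches = [r for r in rows if r[feature] == value]
            others = [r for r in rows if r[feature] != value]
            # Weight each group's impurity by its share of the rows.
            score = (len(matches) * gini(matches) + len(others) * gini(others)) / len(rows)
            if best is None or score < best[2]:
                best = (feature, value, score)
    return best


if __name__ == "__main__":
    feature, value, score = best_split(CUSTOMERS, ["age_group", "season"])
    print(f"Best first question: is {feature} == {value}? (weighted Gini {score:.4f})")
```

On this made-up data the season turns out to be a far better first question than the age group. A full decision tree simply repeats the same search inside each resulting group until the groups are pure enough or too small to split further.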
Disrupting industries is sexy. It is really hard, too. Alongside epitomes such as Airbnb and Uber, you’ll find virtual graveyards of ideas, from smart torches to yet another baby-recording device. The cost of disruption is well portrayed in the Silicon Valley series: the villain, Hooli Corporation, spends millions of dollars chasing the next big thing and maintaining a “we-are-on-schedule” façade while everything, the tech and the team, is literally falling apart (including actual human injuries caused by faulty software). Innovation is the right mindset, but I would argue that disruption is not the right model for every business. So, how about a little stir?
The infallible art of taking steps back
By no means do I want to claim that I have the recipe for creating a superstar Data Scientist. Below are some thoughts on the problem and a handful of approaches to consider. I have called the list the art of taking steps back because it is not focused on learning new skills or technologies. Instead, I look at strengthening the knowledge foundation of the team and revisiting the core business problems. One step back at a time.
1. Where does your data live? Nailing the information architecture.
Some organisations have their information architecture well documented and accessible internally. Many don’t. Many have gotten themselves into such a mess that they need to hire an outsider to help them find the way through their corporate information mazes. Sometimes the best asset there is was created when the company made its first acquisition 10 years ago, has not been updated since, and the person who created it has long left. Some companies thrive in an equilibrium of an IT department doing their ol’ thing and the analysts working on their side of the office with the curated data sets they have been delivered. Both are perfect situations for a Data Scientist to be conceived.
Getting a grip on the corporate information architecture is a standard requirement for any data science project: you cannot predict the future if you don’t understand the past. The task is to document the inbound and outbound data flow: where the data lives, how it is captured, how it is processed, how the systems connect with each other, what exceptions are tolerated, what the software and hardware specifications of the architectural components are, who has access to the database, and how they access it. Getting this information documented, reviewed, and shared creates an invaluable asset. It’s a reference document for all sorts of projects: Can our system handle supporting that new application? Do we really need that new NoSQL store? Can database consolidation make some of the current reporting nightmares go away? Along the way, the team documenting the system will identify its potential drawbacks, pinpoint missing architectural or security pieces, and see what is redundant and what can be fixed. This will happen because every analyst has, by definition, a taste for investigation. Plus, they will be talking to people during the process, and if there is one thing people tend not to keep to themselves, it is their complaints.
2. What is data, again? Building resistance to trends.
Cutting-edge business comes from a profound understanding of how things work now and from building on it. So, what is data?
Spending some time covering classic data theory can help answer that question. Another step back: it’s time to put the student hat on again. Revisiting the core data warehousing concepts never fails to enlighten; Inmon and Kimball are very friendly.
Once the system’s documentation is in place, the theory review empowers educated questions about how things work in (one’s) reality. Refreshing the basics – or covering them for the first time – allows for an informed assessment of new tools in the future. Brushing up on RDBMS knowledge makes you fad-proof when new technologies are discussed. It’s easy to get caught up in the tech frenzy otherwise: vendors promising near-real-time processing on systems not designed for it, analytical tools with a shiny UI but little functionality, Artificial Intelligence black boxes, a processing engine that requires you to move the data into it every time and cements your lousy marriage to the IT department. Any standard data warehousing training or coursebook would be good enough.
It is useful to know the theory, because theory facilitates best practices. A couple of weeks ago I read an article with the click-bait title “9 Mistakes to Avoid When Starting Your Career in Data Science” that kicks off its list with ‘too much focus on theory’. If only everybody were focusing on the theory too much, we wouldn’t be living in the reality of duplicate databases, legacy systems on life-long support, and algorithms that cause harm to people’s finances, employment, and health. Theory helps people assess what they have. Theory is based on studies and, likely, on the mistakes of others trying to solve problems similar to yours.
3. Do we need Big Data?
There is no Data Science conversation that goes without mentioning Big Data.
Big Data systems are inherently different from RDBMS, and while they can be better at distributing the computing load, this perk comes at the cost of no or forced joins, very limited indexing, and little data governance. There is a lot of buzz around Big Data, and understandably so: companies like Netflix or Amazon have built impressive applications and provide unmatched customer experience through massive, intelligent processing of data. The buzz is fed by the natural longing for new and cool technology, by inherent curiosity, and by envy of the competition.
Many Big Data projects fail because they were either poorly motivated or badly executed. Worse yet, these initiatives are not cheap and risk getting companies into trouble. The unambiguously titled article “Lean Big Data: How to Avoid Wasting Money with Big Data Technologies and Get Some ROI” by ex-Spotify’s Adam Kawa elaborates further on the reasons why Big Data projects die. Some of these problems could be avoided through education, resistance to fads, and a correct assessment of current and available tools. I discussed the first two above; the assessment is more self-explanatory. Such an examination can be done in many ways: new tools can be installed and tested in virtual environments; Amazon has flexible deployment options that can cut down development costs. Applying ‘old’ tools to a new problem re-evaluates their capability: it reveals both their advanced options and their shortcomings. MS Excel and Notepad++ are capable of magic if you know how to use them. Bugged by some unstructured text analysis? How about using regular expressions to extract keywords or numbers?
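As a rough illustration of the regular-expression idea, the sketch below pulls numbers and a handful of keywords out of a free-text note using nothing but Python’s standard re module. The note itself, the keyword list, and the function name are all hypothetical; in practice the patterns would have to be tuned to whatever your own text actually looks like.

```python
import re

# Hypothetical free-text note; the wording and the keyword list are made up.
NOTE = "Order 1042 delayed: customer reports 3 damaged items, refund of 49.90 EUR requested."

# Integers or decimals written with a dot, e.g. "3" or "49.90".
NUMBER = re.compile(r"\d+(?:\.\d+)?")

# A small, assumed list of keywords worth flagging, matched as whole words.
KEYWORDS = re.compile(r"\b(delayed|damaged|refund)\b", re.IGNORECASE)


def extract(text):
    """Return (numbers, keywords) found in text, in order of appearance."""
    numbers = [float(n) for n in NUMBER.findall(text)]
    keywords = [k.lower() for k in KEYWORDS.findall(text)]
    return numbers, keywords


if __name__ == "__main__":
    numbers, keywords = extract(NOTE)
    print("numbers: ", numbers)
    print("keywords:", keywords)
```

The same two patterns can be pasted almost unchanged into the regex search of a text editor such as Notepad++, which is rather the point: the technique matters more than the tool.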
4. Get that RDBMS hat off.
If the only thing you know is running, then you will ruin a basketball match trying to make a who-is-faster contest out of it. Similarly, if the only thing you know is RDBMS, you risk approaching big data problems as if they were relational problems.
Many Business Intelligence professionals need to make a mental leap when approaching Data Science. Data Science is not just classic databases, but also programming languages, a myriad of NoSQL options, algorithms SQL cannot support, and an unprecedented world of noisy data that is by no means in the divine Third Normal Form. Data Science, in combination with new tools and (though not necessarily) Big Data, allows for the analysis of these funnily formatted sets and for escaping the strict (but comfy) nature of relational databases. Benefiting from these capabilities takes a mix of technical acumen and an open mind. The bottom line: if the first thing you want to try out on Hadoop is a join, you are lacking at least one of them.
5. Learning Paths.
What’s the best route to learning the ‘Data Scientist toolkit’? There is an abundance of compilation blog posts listing all the courses one can take to become a great data scientist. These courses are very popular and I admire the determination of people who pass them. My feeling though – but keep in mind I’m a comfortable European 9-to-5 worker – is that many people in full employment find such training unrealistic to complete. Rarely are we given an opportunity to spend a considerable chunk of office hours on studying a technology that is only potentially useful. Out-of-office hours are tricky if you have a family, a social life, or a sports routine you’d like to maintain. I don’t think I have to convince anybody that the most natural way to learn is on the go. My approach to learning, or to investing in a team’s skill set, would be to base it on a project’s ROI: work on solutions that can bring future benefits. Speeding up a process through optimisation or automation would be a good example of an ROI-based approach: often these are well-studied problems with predictably high gains. Struggling with data warehouse performance is a chance both to reexamine the design decisions behind it and to evaluate potential new tools, e.g. Hadoop. An honest study of past marketing campaigns could spawn the consensus that we really suck at intuition. Looking at the manual struggles of the analysts’ team could bring scripting to the table. Running a project pilot is essential: the risks are kept in check, and the team is given a chance to learn on the case. Technologies, once befriended, can be successfully applied to scenarios that weren’t considered before. Perhaps most importantly, people leave their comfort zones through these exercises and become more likely to question the status quo, research on their own, and experiment.
Summary
A web Data Science course would have by now covered the basics of distributed computing, before stepping up the game to talk about neural trees and decision networks. Amazon has just made another trillion dollars as you were reading this chapter. But hey, remember Karate Kid? The future adolescent champion didn’t start fighting until he tidied up the house of the old master.