Ferris is a full-stack data scientist at LinkedIn who enjoys building products at the forefront of intelligent technology. He understands that the next generation won’t be concerned with how to use technology to do things, but will expect technology to do and adapt for them.
As a data scientist, I am usually heads down in numbers, patterns, and code, but as crazy as it sounds, one of the hardest parts of my job is actually describing what I do. There are plenty of resources that offer descriptions and guides on the career of a data scientist. I’ve heard them described as those at the intersection of statistics, hacking abilities, and domain expertise. Or, as data analysts who live in San Francisco.
Rather than add a new definition to the collection, I thought I’d take a data-centric approach towards defining the role. I looked at what skills people with the title “Data Scientist” have listed on their LinkedIn profiles and aggregated the top ten by occurrence*.
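*Corrected using a measure called TF-IDF (term frequency-inverse document frequency), which down-weights skills that are common across all job titles.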
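To make the TF-IDF correction concrete, here is a minimal sketch of one way it could be computed. The post does not spell out the exact formulation, so this is an assumption: each job title is treated as a "document", each skill as a "term", and the toy profiles and the `tfidf_by_title` helper are hypothetical and only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data: job title -> list of profiles (each a set of skills).
profiles_by_title = {
    "Data Scientist": [
        {"Machine Learning", "Python", "Microsoft Office"},
        {"Machine Learning", "R", "Microsoft Office"},
        {"Python", "SQL", "Microsoft Office"},
    ],
    "Accountant": [
        {"Microsoft Office", "Auditing"},
        {"Microsoft Office", "Tax"},
    ],
    "Web Developer": [
        {"JavaScript", "Python", "Microsoft Office"},
        {"JavaScript", "SQL"},
    ],
}


def tfidf_by_title(profiles_by_title):
    """Score each skill per title (assumed formulation, not the author's).

    tf  = share of profiles with this title that list the skill
    idf = log(number of titles / number of titles where the skill appears)
    """
    n_titles = len(profiles_by_title)
    # How many profiles list each skill, per title.
    counts = {
        title: Counter(skill for profile in profiles for skill in profile)
        for title, profiles in profiles_by_title.items()
    }
    # In how many distinct titles each skill appears at least once.
    titles_with_skill = Counter(skill for c in counts.values() for skill in c)

    scores = {}
    for title, c in counts.items():
        n_profiles = len(profiles_by_title[title])
        scores[title] = {
            skill: (n / n_profiles) * math.log(n_titles / titles_with_skill[skill])
            for skill, n in c.items()
        }
    return scores


scores = tfidf_by_title(profiles_by_title)
ranked = sorted(scores["Data Scientist"].items(), key=lambda kv: kv[1], reverse=True)
for skill, score in ranked:
    print(f"{skill:20s} {score:.3f}")
```

In this toy data, "Microsoft Office" is the most frequently listed skill among data scientists by raw count, but because it appears under every title its score drops to zero, while "Machine Learning" rises to the top.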
While this list sheds some light on what skills are most frequently included on the profiles of data scientists, it’s difficult to understand how they relate to each other when we’re just looking at a static ranking. To dig a bit deeper, I explored the relationships among these skills by representing and visualizing them as a network. Voilà, the Data Science Skill Network (high-resolution image):
In the network, each node is a skill. Skills are connected when both are listed together in a profile, with the connection growing stronger the more often they are listed together. Since the goal was to visualize the relationships between skills, I clustered similar skills together, represented by colors. Next, skills were sized depending on how well connected they were, and to what extent they influenced other skills in the network, using a measure called network centrality. While there are plenty of conclusions to be drawn, both figures highlight a few key themes. Namely, that today’s data scientists typically:
Approach data with a mathematical mindset
- We see that machine learning, data mining, data analysis, and statistics are all highly ranked skills in the network. This indicates that being able to understand and represent data mathematically, with statistical intuition, is a key skill for data scientists.
Use a common language to access, explore and model data
- Python, R, and MATLAB are the three most popular languages for visualization and model development, and SQL is the most common for data access. When it comes to data, extracting, exploring, and testing hypotheses are a big part of the job, so it’s no surprise to see these skills rising to the top.
Develop strong computer science and software engineering backgrounds
- We also see computer science and software engineering skill sets, with Java, C++, Algorithms, and Hadoop all having notable real estate on the network visualization. These skills are primarily used to build systems that leverage data.
In my experience, most data scientists will not be experts in all of these categories (math, tools, and software development), but will instead specialize or hone their skills in one or two of them. Taken together, the categories therefore offer a more holistic view of the skills represented within a typical data science team.
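For readers curious about how a skill network like the one described above can be assembled, here is a minimal sketch. It makes several assumptions: the profiles are made-up toy data, the helper names are hypothetical, and weighted degree (the sum of a skill's connection strengths) stands in for "network centrality", since the post does not say which centrality measure was used. The clustering step is omitted.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: each profile is the set of skills it lists.
profiles = [
    {"Python", "Machine Learning", "SQL"},
    {"Python", "Machine Learning", "Statistics"},
    {"R", "Statistics", "Machine Learning"},
    {"Java", "Hadoop", "SQL"},
]


def cooccurrence_edges(profiles):
    """Count how often each pair of skills is listed on the same profile."""
    edges = Counter()
    for skills in profiles:
        # Sort so each undirected pair always gets the same (a, b) key.
        for a, b in combinations(sorted(skills), 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum the connection weights touching each skill (assumed centrality)."""
    strength = defaultdict(int)
    for (a, b), weight in edges.items():
        strength[a] += weight
        strength[b] += weight
    return dict(strength)


edges = cooccurrence_edges(profiles)
strength = weighted_degree(edges)

# Skills with higher strength would be drawn as larger nodes.
for skill, s in sorted(strength.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{skill:18s} {s}")
```

A graph library would typically take over from here for layout, community detection, and richer centrality measures; the plain-Python version is only meant to show what "connected when listed together" and "stronger the more often" mean in practice.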
I hope this helped to shed some light on what a data scientist is, and what skills are required to become one. These analyses are all pulled from the skills you list on your LinkedIn profile, so hopefully this is also a reminder for you to keep your profile up to date.
Thank you, and I’d be interested in hearing your thoughts below.
(This post was originally published on LinkedIn.)