Being a data scientist is hard. In addition to the combination of advanced mathematics and coding skills required to do the job, it’s a newer role for many organizations, so data scientists are called upon to navigate corporate landscapes, source the right IT resources, and establish new workflows across departments to do their jobs effectively.
These best practices will help data leaders be more effective at their jobs, lead the way for future data scientists, and establish a department that’s innovative and productive.
Optimize use of open-source
Because open-source tools are such an important part of the data science technology stack, hiring criteria should reflect this.
Data scientists with experience contributing to open-source projects will have a better understanding of how to evaluate and manage open-source tools by looking at code activity, package metadata, release history, and project contributors.
They should also understand when and how to make pull requests if packages can be updated, enhanced, or made more secure to meet an organization’s needs. In addition to hiring data scientists and developers with open-source expertise, consider working with a vendor that provides support for open-source tools and libraries.
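As a rough illustration of the evaluation signals mentioned above (release history, contributors, package metadata), here is a minimal Python sketch. The `PackageInfo` fields, the thresholds, and the example package are all hypothetical; each team would choose its own criteria and pull the values from its own package index or repository host.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PackageInfo:
    """Hypothetical summary of an open-source package, filled in by hand or by a script."""
    name: str
    last_release: date
    releases_past_year: int
    active_contributors: int
    has_license: bool


def health_flags(pkg, today, max_days_since_release=365, min_contributors=2):
    """Return a list of warnings; an empty list means no warning was triggered.

    The thresholds are illustrative assumptions, not an industry standard.
    """
    flags = []
    if (today - pkg.last_release).days > max_days_since_release:
        flags.append("no recent release")
    if pkg.releases_past_year == 0:
        flags.append("no releases in the past year")
    if pkg.active_contributors < min_contributors:
        flags.append("few active contributors")
    if not pkg.has_license:
        flags.append("missing license metadata")
    return flags


# Example with made-up data for a made-up package.
stale = PackageInfo("example-lib", date(2020, 1, 1), 0, 1, False)
for warning in health_flags(stale, today=date(2024, 1, 1)):
    print(f"{stale.name}: {warning}")
```

A checklist like this does not replace reading the code or the issue tracker, but it gives reviewers a consistent starting point when comparing candidate packages.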
Institute a security-aware culture
When data scientists don’t monitor for potential threats, vulnerabilities inevitably creep into models over time. Data science leaders must step up and collaborate with IT and security leaders to take charge of their data science and machine learning pipelines.
Because these pipelines usually involve the use of open-source libraries, it’s important to understand an organization’s risk tolerance for open-source software. Learn about Common Vulnerabilities and Exposures (CVEs), how to look for them, and how to monitor environments for high-risk packages. Ignoring a CVE with a high severity score can result in data breaches and unstable applications.
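To make the idea of risk tolerance concrete, the sketch below maps severity scores to the qualitative bands used by CVSS v3 (low, medium, high, critical) and flags packages whose findings exceed a chosen tolerance. The package names, advisory IDs, and scores are invented for illustration, and the default tolerance is an assumption; a real workflow would read findings from a vulnerability scanner or advisory feed.

```python
SEVERITY_ORDER = ["none", "low", "medium", "high", "critical"]


def cvss_severity(score):
    """Map a CVSS v3 base score (0.0 to 10.0) to its qualitative rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score out of range: {score}")
    if score == 0.0:
        return "none"
    if score < 4.0:
        return "low"
    if score < 7.0:
        return "medium"
    if score < 9.0:
        return "high"
    return "critical"


def packages_over_tolerance(findings, tolerance="medium"):
    """Return findings whose severity is strictly above the accepted tolerance.

    `findings` is a list of (package, advisory_id, score) tuples.
    """
    limit = SEVERITY_ORDER.index(tolerance)
    flagged = []
    for package, advisory_id, score in findings:
        severity = cvss_severity(score)
        if SEVERITY_ORDER.index(severity) > limit:
            flagged.append((package, advisory_id, severity))
    return flagged


# Hypothetical scan results; the IDs are placeholders, not real CVE entries.
findings = [
    ("example-parser", "ADVISORY-0001", 9.8),
    ("example-utils", "ADVISORY-0002", 5.3),
    ("example-http", "ADVISORY-0003", 7.5),
]
for package, advisory_id, severity in packages_over_tolerance(findings):
    print(f"{package}: {advisory_id} rated {severity}")
```

Agreeing on the tolerance value with IT and security leaders ahead of time turns the question of which packages to block into a shared policy rather than a case-by-case debate.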
Devise a team structure that maximizes impact
Many data scientists don’t start out on teams. Instead, they are scattered across the organization and assigned to specific lines of business to solve particular problems.
This is usually effective for organizations starting their data science journeys because it’s easier to demonstrate business impact with small, focused projects. But over time, data scientists will need to collaborate to develop processes and eliminate redundancy. They will also need to work with IT to understand how to put projects into production, assess the limits of their resources, and understand security standards.
Many organizations have found success in adopting a hub-and-spoke model, in which some data scientists remain within lines of business, while others work in a data science lab or center to help data scientists and analysts across the organization.
Customize presentations by line of business
To solve business problems, data science teams should understand how to speak the language of the business units they work with. It’s essential that common terms and acronyms are used in presentations with their respective lines of business. This will help establish common ground in defining and evaluating success.
Alignment with lines of business is also important in building out custom dashboards that serve unique needs. Referring to these dashboard metrics on a regular basis in joint meetings will also be important as new data science project goals are discussed and the effects of decision-making based on model output are evaluated.
Bring IT and developers into the POC phase
Nearly half of data science projects never make it to production. One way to help ensure that models reach end users and bring value to the business is to involve IT and software developers early in the process, particularly so that security protocols are met from the start.
Early evaluations of the software components that will be used to build a model help ensure that data will be managed securely once the model is in production. IT teams can also help secure the right infrastructure for model training and production, and developers will help ensure a better end-user experience for the final product.
Establish a workflow
Because data science is a new function in many organizations, custom workflows must be established. Many organizations are turning to software development for a workflow model, using Agile principles and Scrum methods for data science output. But this doesn’t always work for research-intensive data science. As data scientists know, the steps needed to arrive at the final goal are not always clear. Research and data exploration can yield results that drastically shift the course of a project.
With this in mind, data scientists can still adopt an Agile methodology (especially helpful for data science projects that become web applications) and tweak it to suit their goals and processes. Keep in mind that projects frequently revert to previous stages and new deliverables can be added in each stage, so keep deadlines soft to allow for changes in course as projects unfold.
Data scientists should not work in silos. Those scattered across the organization should meet regularly to discuss processes, tools, and projects, while those in centralized structures should meet regularly with business managers. Through regular communication, data scientists will learn more quickly, grow their skill sets, make a better case for the resources they need, and provide more value to the organization overall.
Implement tools that mitigate bias and maximize fairness
Data science and machine learning are increasingly used to help make decisions that impact people’s lives through credit scoring, job and college applicant scoring, and even potential healthcare outcomes. When implemented thoughtfully, machine learning can improve human decision-making and reduce racial disparity. On the other hand, when machine learning models are implemented without regard for bias or fairness, they can reinforce and exacerbate human biases.
The most important steps data scientists can take are to understand biases in their data and understand how their models make decisions. Fortunately, several new open-source tools are available to help data scientists do this, such as Fairlearn, InterpretML, and LIME.
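One simple way to start looking for bias in model output is to compare how often each group receives a positive prediction. The sketch below computes per-group selection rates and their largest gap, often called the demographic parity difference, in plain Python. The predictions and group labels are made up, and this is only one of many fairness measures; the libraries named above offer fuller and better-tested tooling.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Fraction of positive (1) predictions within each group."""
    if len(predictions) != len(groups):
        raise ValueError("predictions and groups must have the same length")
    totals = defaultdict(int)
    positives = defaultdict(int)
    for prediction, group in zip(predictions, groups):
        totals[group] += 1
        if prediction == 1:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(predictions, groups):
    """Gap between the highest and lowest group selection rates (0 means equal rates)."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical binary model decisions for two groups, "a" and "b".
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(selection_rates(predictions, groups))                # {'a': 0.75, 'b': 0.25}
print(demographic_parity_difference(predictions, groups))  # 0.5
```

A large gap does not prove that a model is unfair on its own, but it signals where the data and the model’s decision logic deserve a closer look.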
Be a voice for ethics and explainability
As modern data science becomes more and more ingrained into day-to-day business practices, politics, and society, it’s important that questions around bias and fairness be on the minds of every data scientist, business leader, and academic.
A failure to proactively address these areas poses a strategic risk to enterprises and institutions across competitive, financial, and even legal dimensions. We see an opportunity for data professionals to exert leadership within their organizations and drive change.
Doing so will raise the discipline’s stature in the organizations that depend on it. More importantly, it will bring the innovation and problem solving for which the profession is known to bear on critical problems affecting society.
Start by exploring the talent within the organization: seek out ethics professionals in other areas, such as ethics attorneys, managers in ethics and compliance, or even a professional ethicist.
Work with ethics-minded individuals to host workshops or seminars that involve senior leadership. Explain the risks and negative impacts of ignoring bias in data and models, and lead the charge across the department and organization.