The rise of data science over the last decade has been driven by ease of access to rich data and by significant reductions in the cost of processing it. These days anyone with a credit card can set up a cloud-based data warehouse and tracking system within minutes, but achieving a return on that investment is not so straightforward. This is not to say that effective use of data science can’t be very profitable, only that a return is never guaranteed.

There are three key reasons why data science projects can potentially fail:

1. Solving the wrong problem

Most data science applications are about optimization: take a product and use data to make it better, faster and easier to use. Ideally you would optimize the product as a whole for revenue, but this is rarely possible, as it requires accounting for every element that can influence revenue, and for the relationships between those elements, which makes the number of permutations to be tested explode combinatorially. This is why optimization problems are usually tackled at a smaller scale: increasing consumption via better recommendations, increasing conversions with re-targeting, and so on. However, this narrower view can lead to large resources being dedicated to solving a problem that has little impact on overall revenue.

Taking an example from games, product managers obsess over converting non-spenders, and with good reason: most F2P games persuade only around 2% of players ever to spend. However, the most obvious route to improving this number, offering a large ‘first payment’ discount, is a poor way to improve revenue, as it deflates the value of in-game content and may put off repeat purchases. There are analogues in retail CRM systems, which have established a ‘race to the bottom’ on pricing to the point where consumers will now only spend if they believe they are getting a substantial discount.

2. Mismatch of problem, technology and personnel

While the pool of qualified data scientists and engineers has swollen in recent years, the diverse and ever-growing range of technology solutions means that any single data scientist or team may only have experience with a small range of vendors. This matters because each technology is designed for a particular class of data. For instance, Hadoop is a great solution for batch processing, but it is poorly suited to data that is drip-fed in real time. Similarly, NoSQL databases are a good fit when the data structure needs to be flexible, but they will not perform as well as a relational database on large, static, structured data. Additionally, while a normalized schema may be appropriate for traditional database technology and low-dimensional problems, flat, wide structures will give a performance boost in modern column-oriented warehouses such as Amazon Redshift (itself derived from PostgreSQL, which is row-oriented).
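To make the normalized-versus-flat distinction concrete, here is a minimal sketch in Python (using hypothetical table and field names, not a real schema) of denormalizing a fact table against a dimension table into one wide row per event, the shape a columnar warehouse scans most efficiently:

```python
# Hypothetical normalized tables: a dimension table keyed by player_id
# and a fact table that references it. Names are illustrative only.
players = {
    1: {"country": "US", "platform": "ios"},
    2: {"country": "DE", "platform": "android"},
}

purchases = [
    {"player_id": 1, "item": "gems_small", "usd": 0.99},
    {"player_id": 2, "item": "gems_large", "usd": 9.99},
]

def flatten(facts, dimension):
    """Join facts to their dimension rows once, up front,
    producing one flat, wide record per event."""
    return [{**row, **dimension[row["player_id"]]} for row in facts]

flat = flatten(purchases, players)
# Each flat row now carries country/platform directly, so an analytical
# query can scan just the columns it needs with no join at query time.
```

The trade-off is the classic one: the flat form duplicates dimension values across rows, buying faster scans at the cost of storage and update flexibility.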

Mismatches between the core business problems and the technology and/or personnel are common across the games industry, and are typical of small companies that simply cannot invest in a large, diverse data team. In these cases the impact on data productivity can be devastating, with latency and data-complexity issues causing all but the simplest business use cases to be abandoned.

3. Data integrity

Ultimately, the best-laid plans of data scientists are undone by the simplest of errors. Erroneous data feeds are among the most common issues in data science projects, and are usually caused by a lack of communication with product developers and/or a lack of understanding of how the product operates. In recent times the proliferation of third-party cloud services, and the need to combine data from them, has vastly increased the opportunity for data bugs to spawn and propagate.
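One cheap defence against erroneous feeds is validating records at the point of ingestion. A minimal sketch, assuming hypothetical field names and rules rather than any particular tracking schema:

```python
# Defensive validation for an incoming event feed (illustrative rules).
REQUIRED = {"event", "user_id", "ts"}

def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    missing = REQUIRED - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "usd" in record and record["usd"] < 0:
        problems.append("negative revenue")
    return problems

good = {"event": "purchase", "user_id": 42, "ts": 1700000000, "usd": 0.99}
bad = {"event": "purchase", "usd": -1.0}
assert validate(good) == []        # clean record passes
assert len(validate(bad)) == 2     # missing fields and negative revenue
```

Quarantining records that fail such checks, rather than silently loading them, keeps a single bad feed from contaminating every downstream analysis.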

With time and diligence, problem data feeds can be corrected, but the process introduces costly delays that can reduce confidence in data usage across a business.

More generally, it is the mismatch between simplified business goals and the way they are defined as analysis projects that causes data science projects to fail. Commercial management often doesn’t fully understand the process required to conduct an analysis project, and this is compounded by their invariable requirement for broad answers, quickly. Analysts need to stand firm and become better communicators and negotiators, as it is they who ultimately take responsibility for the scope of the projects they agree to lead. Research topics should be broken down into their constituent parts, with gap analysis undertaken on the available data, tools and human resources, so that realistic expectations and timescales are set.
