Data Science 101

Three questions you need to answer to succeed in data-driven projects
Data-driven projects face quite a few challenges and barriers on the way to success. Here is a look at how you can overcome them by asking yourself three questions. Data has probably become the most valuable asset a company can have nowadays. It can give you insights into your customers’ behaviour

Performing Nonlinear Least Square and Nonlinear Regressions in R
Linear regression is a basic tool. It works on the assumption that there exists a linear relationship between the independent and dependent variables, also known as the explanatory variables and the output. However, not all problems have such a linear relationship. In fact, many of the problems we see today are

75 Big Data terms everyone should know
This article is a continuation of my first article, 25 Big Data terms everyone should know. Since it got such an overwhelmingly positive response, I decided to add an extra 50 terms to the list. Just to give you a quick recap, I covered the following terms in my first

10 Rules for Creating Reproducible Results in Data Science
In recent years, evidence has been mounting that points to a crisis in the reproducibility of scientific research results. Reviews of papers in the fields of psychology and cancer biology found that only 40% and 10% of the results, respectively, could be reproduced. Nature published the results of a survey of

How Faulty Data Breaks Your Machine Learning Process
This article is part of a media partnership with PyData Berlin, a group helping support open-source data science libraries and tools. To learn more about this topic, please consider attending our fourth annual PyData Berlin conference on June 30-July 2, 2017. Miroslav Batchkarov and other experts will be giving talks

Boost Your Data Wrangling with R
The R language is often perceived as a language for statisticians and data scientists. Quite a long time ago, this was mostly true. However, over the years the flexibility R provides via packages has turned it into a more general-purpose language. R was open sourced in 1995, and since

Confused by data visualization? Here’s how to cope in a world of many features
The late data visionary Hans Rosling mesmerised the world with his work, contributing to a more informed society. Rosling used global health data to paint a stunning picture of how our world is a better place now than it was in the past, bringing hope through data. Now more than

Three Mistakes that Set Data Scientists up for Failure
The rise of the data scientist continues and social media is filled with success stories – but what about those who fail? There are no cover articles about the failures of the many data scientists who don’t live up to the hype and don’t meet the needs of their stakeholders.

Big Data 101: Intro To Probabilistic Data Structures
When analyzing big data, we often need to check properties such as the number of items in the dataset, the number of unique items, and their occurrence frequency. Hash tables or hash sets are usually employed for this purpose. But when the dataset becomes so enormous that
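As a rough illustration of the exact, hash-based approach the excerpt mentions, here is a minimal Python sketch. The function name `exact_stats` and the sample `events` list are made up for this example and do not come from the article.

```python
from collections import Counter


def exact_stats(items):
    """Return (total items, distinct items, per-item frequency).

    Counter is a hash table keyed by item, so every distinct item
    is kept in memory.
    """
    freq = Counter(items)        # item -> number of occurrences
    total = sum(freq.values())   # number of items seen
    unique = len(freq)           # number of distinct items
    return total, unique, freq


# Hypothetical sample data, for illustration only
events = ["a", "b", "a", "c", "a", "b"]
total, unique, freq = exact_stats(events)
print(total, unique, freq["a"])
```

The answers are exact, but memory use grows with the number of distinct items, which is the cost that becomes a problem on very large datasets.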

Programming with R – How to Get a Frequency Table of a Categorical Variable as a Data Frame
Categorical data is data that takes its values from a predefined set. Recording a person’s age as “Child”, “Adult” or “Senior” instead of as a number is one example of treating age as categorical. However, before using categorical data, one must know about various
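The idea of turning a numeric age into a category and tabulating the result can be sketched as follows. The article itself works in R; this sketch uses plain Python instead, and the age cut-offs, the function names `age_group` and `frequency_table`, and the sample ages are all assumptions made for illustration.

```python
from collections import Counter


def age_group(age):
    """Map a numeric age to a category (cut-offs are illustrative assumptions)."""
    if age < 18:
        return "Child"
    if age < 65:
        return "Adult"
    return "Senior"


def frequency_table(values):
    """Return one row per category, most frequent first, like a small data frame."""
    counts = Counter(values)
    return [{"category": c, "count": n} for c, n in counts.most_common()]


# Hypothetical sample data
ages = [4, 35, 70, 16, 42, 81, 29]
table = frequency_table(age_group(a) for a in ages)
for row in table:
    print(row["category"], row["count"])
```

Each row pairs a category with its count, which is the shape a frequency table takes once it is stored as a data frame.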