Data science, analytics, machine learning, big data… All familiar terms in today’s tech headlines, but they can seem daunting, opaque or just plain impossible. Despite their slick gleam, they are *real* fields and you can master them! We’ll dive into what data science consists of and how we can use Python to perform data analysis for us.
Data science is a large field covering everything from data collection and cleaning to standardization, analysis, visualization and reporting. Depending on your interests, there are many different positions, companies and fields that touch data science. You can use data science to analyze language, recommend videos, or determine new products from customer or marketing data. Whether it’s for a research field, your business or the company you work for, there are many opportunities to use data science and analysis to solve your problems.
When we talk about using big data in data science, we are talking about large scale data science. What “big” is depends a bit on who you ask. Most projects or questions you’d like to answer don’t require big data, since the dataset is small enough to be downloaded and parsed on your computer. Most big data problems arise out of data that can’t be held on one computer. If you have large data requiring several (or more) computers to store, you can benefit from big data parsing libraries and analytics.
So what does Python have to do with it? Python has emerged over the past few years as a leader in data science programming. While there are still plenty of folks using R, SPSS, Julia or several other popular languages, Python’s growing popularity in the field is evident in the growth of its data science libraries. Let’s take a look at a few of them.
Pandas
One of the most popular data science libraries is Pandas. Developed by data scientists familiar with R and Python, it has grown to support a large community of scientists and analysts. It has many built-in features, such as the ability to read data from many sources, create large dataframes (or matrices / tables) from those sources and compute aggregate analytics based on the questions you’d like to answer. It also has some built-in visualizations for charting and graphing your results, as well as several export functions to turn your completed analysis into an Excel spreadsheet.
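As a quick taste of what that looks like (the file and column names below are invented for the example), a read-aggregate-export workflow fits in a few lines:

```python
import pandas as pd

# Read a CSV into a dataframe (hypothetical file and columns)
df = pd.read_csv("sales.csv")

# Peek at a few rows and at how the numeric columns are distributed
print(df.head())
print(df.describe())

# Aggregate: average sale amount per region
summary = df.groupby("region")["amount"].mean()

# Export the finished analysis to an Excel spreadsheet
summary.to_excel("sales_summary.xlsx")
```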
Agate
A younger library which aims to solve data analysis problems is agate. Agate was developed with journalism in mind, and has many great features for dataset analysis. Do you have a few spreadsheets you need to analyze and compare? Do you have a database on which you’d like to run some statistics? Agate has a smaller learning curve and fewer dependencies than Pandas, and it has some really neat charting and viewing features so you can see your results quickly.
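Here’s a rough sketch of that kind of workflow in agate (again, the file and column names are made up for the example):

```python
import agate

# Load a spreadsheet that has been exported as CSV (hypothetical file and columns)
table = agate.Table.from_csv("employees.csv")

# Run some quick statistics on a numeric column
print(table.aggregate(agate.Mean("salary")))
print(table.aggregate(agate.Median("salary")))

# agate's built-in text charts let you eyeball results right in the terminal
table.pivot("department").print_bars("department", "Count")
```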
Bokeh
If you’re interested in creating visualizations of your finished dataset, Bokeh is a great tool. It can be used with agate, Pandas, other data analysis libraries or pure Python. Bokeh helps you make striking visualizations and charts of all types without much code.
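For example, a simple interactive chart takes only a handful of lines (the numbers here are invented):

```python
from bokeh.plotting import figure, output_file, show

# Toy data to plot (hypothetical values)
years = [2011, 2012, 2013, 2014]
average_pay = [71000, 74000, 77000, 75000]

# Build a line chart and write it out as an interactive HTML page
output_file("average_pay.html")
p = figure(title="Average pay by year", x_axis_label="Year", y_axis_label="Average pay ($)")
p.line(years, average_pay, line_width=2)
show(p)
```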
There are many other libraries to explore, but these are a great place to start if you’re interested in data science with Python. Now let’s talk about “big data.”
Working with Big Data: MapReduce
When working with large datasets, it’s often useful to utilize MapReduce. MapReduce is a technique for working with big data: you first map the data using a particular attribute, filter or grouping, and then reduce those groups using a transformation or aggregation mechanism. For example, if I had a collection of cats, I could first map them by what color they are and then reduce each color group by counting it. At the end of the MapReduce process, I would have a list of all the cat colors and the number of cats in each of those color groupings.
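To make that concrete, here is the cat example as a tiny, pure-Python map and reduce (no cluster required):

```python
from collections import defaultdict

cats = ["black", "tabby", "black", "white", "tabby", "black"]

# Map: emit a (color, 1) pair for each cat
mapped = [(color, 1) for color in cats]

# Group: collect the pairs for each color together
grouped = defaultdict(list)
for color, count in mapped:
    grouped[color].append(count)

# Reduce: sum each group to get the number of cats per color
totals = {color: sum(counts) for color, counts in grouped.items()}
print(totals)  # {'black': 3, 'tabby': 2, 'white': 1}
```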
Almost every data science library has some MapReduce functionality built in. There are also numerous larger libraries you can use to manage the data and MapReduce over a series of computers (or a cluster / grouping of computers). Python can speak to these services and software and extract the results for further reporting, visualization or alerting.
Hadoop
One of the most popular frameworks for MapReduce over large datasets is Apache’s Hadoop. Hadoop uses cluster computing to allow for faster processing of large datasets. There are many Python libraries you can use to send your data or jobs to Hadoop; which one you choose should come down to what is simplest to set up with your infrastructure and clearest for your use case.
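As one illustration (mrjob is just one common option; pick whichever library fits your setup), a word-count job that can run locally or be sent to a Hadoop cluster looks roughly like this:

```python
# word_count.py -- a sketch using mrjob, one of several Python libraries
# for writing Hadoop Streaming jobs
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Map: emit (word, 1) for every word in the input line
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Reduce: sum the counts for each word
        yield word, sum(counts)


if __name__ == "__main__":
    MRWordCount.run()
```

While developing you can run a job like this locally against a plain text file, then point the same code at your Hadoop cluster when you’re ready to scale.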
Spark
If you have large data which might work better in streaming form (real-time data, log data, API data), then Apache’s Spark is a great tool. PySpark, the Python Spark API, allows you to quickly get up and running and start mapping and reducing your dataset. It’s also incredibly popular with machine learning problems, as it has some built-in algorithms.
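Here is a minimal sketch of the same cat-counting idea in PySpark (run locally here; cluster configuration will differ):

```python
from pyspark.sql import SparkSession

# Start a Spark session (local in this sketch)
spark = SparkSession.builder.appName("cat-colors").getOrCreate()

cats = ["black", "tabby", "black", "white", "tabby", "black"]

# The same map and reduce as before, distributed across Spark workers
counts = (
    spark.sparkContext.parallelize(cats)
    .map(lambda color: (color, 1))
    .reduceByKey(lambda a, b: a + b)
)
print(counts.collect())  # e.g. [('black', 3), ('tabby', 2), ('white', 1)]

spark.stop()
```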
There are several other large scale data and job libraries you can use with Python, but for now we can move along to looking at data with Python.
Exploring Data with Python
Let’s take a quick look at what we can do with some simple data using Python. I took a look around Kaggle and found San Francisco city employee salary data. Since I know a few folks in San Francisco, and the city’s rising rent and cost of living have been in the news lately, I thought I’d take a look.
After downloading the dataset, I started up my Jupyter Notebook, which is really just a fancy name for a Python terminal I can run in my browser. This is incredibly useful when you’re first learning and want to come back to your scratchpad of thoughts. I use Jupyter Notebooks when I’m first exploring data so I can see what I found interesting as I continue to explore, and easily save my work in one place so I can come back to it later.
First, I imported the data and read it into a Pandas dataframe. Then I wanted to see the data: I looked at a few of the rows and used the dataframe’s describe method to see how the data is distributed.
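The original notebook snippet isn’t embedded here, but that first step looks roughly like this (the file name and column names follow the Kaggle download, so treat them as assumptions):

```python
import pandas as pd

# Read the Kaggle salary data into a dataframe
salaries = pd.read_csv("Salaries.csv")

# Look at a few rows and at how the numeric columns are distributed
print(salaries.head())
print(salaries.describe())
```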
I noticed the dataset covered numerous years, but I was most interested in the most recent data, so I decided to make a new dataframe of just that data.
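Continuing the sketch, and assuming a ‘Year’ column as in the Kaggle download:

```python
# Keep only the most recent year in the dataset
latest_year = salaries["Year"].max()
recent = salaries[salaries["Year"] == latest_year].copy()
print(recent.shape)
```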
I looked up the average yearly rental cost from the latest reports on Priceonomics. I wanted to know what percentage of their income the average city employee was paying for rent. (This assumes they live in a single-income household with no children in a one-bedroom apartment – probably unlikely, but it’s a starting point).
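Something along these lines, with a placeholder standing in for the Priceonomics rent figure used in the original analysis:

```python
# Placeholder: substitute the latest average one-bedroom rent from Priceonomics
average_monthly_rent = 3500
average_yearly_rent = average_monthly_rent * 12

# What share of the average city employee's total pay would go to rent?
average_pay = recent["TotalPay"].mean()
print(average_yearly_rent / average_pay * 100)
```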
Ouch! Considering most financial advice instructs you to spend no more than 30% of your salary on housing expenses, this is shocking. On that note, how many city employees make less per year than the average one-bedroom apartment costs to rent?
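A sketch of that check:

```python
# Employees whose total yearly pay is below a year of average rent
below_rent = recent[recent["TotalPay"] < average_yearly_rent]
print(below_rent.shape)
```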
Yikes. Using the dataframe’s shape method (which returns the number of rows and columns), we can see there are more than 11,000 employees in that group. I also noticed ‘TotalPay’ is a combination of ‘BasePay’ and ‘OvertimePay’. I wondered how many city employees *needed* to work overtime to afford to live. Since the ‘BasePay’ column didn’t properly import as a number, we must do some conversion first.
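The original snippet isn’t shown here, but a rough sketch of the conversion, together with the 70%-of-income cutoff described below, might be:

```python
# BasePay imported as strings (some rows hold placeholder text),
# so coerce it to numbers and turn bad values into NaN
recent["BasePay"] = pd.to_numeric(recent["BasePay"], errors="coerce")

# Rough cutoff: if a year of rent exceeds 70% of base pay, call it unaffordable
unaffordable = recent[average_yearly_rent > recent["BasePay"] * 0.7]
print(unaffordable.shape)
```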
After converting, I made a rough assumption: if you spend more than 70% of your income on rent, you can’t afford to live there.
So more than 20,000 city employees can’t afford the average one-bedroom apartment in San Francisco on their own salary. Although not entirely surprising given the news coverage in recent years, that’s still quite extreme. I wanted to see how many city employees were earning more than $1K in overtime annually.
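A sketch of that count (this assumes ‘OvertimePay’ imported as a number; if not, convert it the same way as ‘BasePay’):

```python
# Employees earning more than $1,000 in overtime in the latest year
overtime = recent[recent["OvertimePay"] > 1000]
print(overtime.shape)
```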
More than 15,000! You can see there’s much more to explore here. Within this dataset we haven’t explored which types of employees make more or less money, or whether employees are getting normal pay raises and promotions. We could also do more research to determine whether the average family household in San Francisco has two incomes and what those are, as well as how many bedrooms the average family in San Francisco has. We could look into average wages for other jobs in San Francisco (how much do teachers make? Cab drivers? What about manual labor? Restaurant staff?). We could also map the dataset against the Priceonomics data to show which neighborhoods the average city employee can afford and how much longer their commute is due to the rent increases. And we could use the many sites tracking average cost of living to build a salary converter showing how much you would need to make in San Francisco to maintain your quality of life.
Regardless of what questions you are interested in learning about, you can see that with only a little bit of Python, data analysis is simple and straightforward. With Python, you can ingest and transform data in less than 10 minutes and start exploring your questions immediately.
Where to go from here
There are many different online courses for an introduction to Python. I recommend taking a look at a few and determining which fits your needs.
If you’re interested in learning Pandas, start with their tutorials. If you want to begin with agate, their tutorial is also full of good examples. If you want to focus on visualization, take a look at Bokeh’s User Guide.
If you want to get started with MapReduce, take your first steps with Hadoop via Michael Noll’s excellent introduction. If you’d rather use Spark, Josh Rosen has a great video introduction.
Whatever your path to becoming a Python data scientist, remember to stay curious! There are always new and different ways to explore data and new questions to answer. Through your curiosity and willingness to learn, you will have a long and successful career as a data analyst.