Introduction

A couple of months ago a client of mine asked me the following question: “What is the fastest data structure object in Python for Big Data analysis today?” I get questions like this one all the time. Some of them are not easy to answer at all, and it takes some time to find the right optimized solutions for them. In general, I do this for fun on the weekends and at night.

At that time, my first simple answer was the Python List object. I have used the List object in many Data Science projects, including Data Pipeline and Extract-Transform-Load (ETL) production systems. Then the following questions came to mind: Can I use the List object for data manipulation and analysis of millions or billions of rows? What if I divide a Data Science project into small tasks and run them asynchronously using the Python asyncio library? Based on these questions, I decided to spend some time looking for practical solutions for Big Data analysis using the Python Data Ecosystem libraries. To keep things simple and get results quickly, the programs calculate the Arithmetic Mean, Median and Sample Standard Deviation of a one-dimensional NumPy array of floats. For the runtime comparison I’ll use the following libraries:

  • NumPy: The fundamental Python package for scientific computing.
  • Numba: Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and FORTRAN, without having to switch languages or Python interpreters.
  • asyncio: The Python asynchronous programming library.

Why use NumPy?

As the NumPy website says, NumPy is the fundamental package for scientific computing with Python. It provides a powerful N-dimensional array object and sophisticated (broadcasting) functions. With NumPy, Python programs generally run faster, handle data more consistently, and gain a large set of numerical and matrix functions. Maybe because of this there is no longer a reason to use Python List objects for this kind of work? It’s important to mention that many Python Data Ecosystem libraries, such as Pandas, SciPy and Matplotlib, are built on top of NumPy.
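As a rough illustration of the array-oriented functions mentioned above, the three statistics used later in this article can also be obtained from NumPy's built-in routines. This is only a sketch for comparison, not the code that was timed below; the array size and variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative data: the integers 0..999999 stored as 64-bit floats
one_dimensional_array = np.arange(1_000_000, dtype=np.float64)

# Built-in, vectorized reductions; no explicit Python loop is needed
arithmetic_mean = np.mean(one_dimensional_array)
median = np.median(one_dimensional_array)
# ddof=1 divides by (n - 1), which gives the sample standard deviation
sample_standard_deviation = np.std(one_dimensional_array, ddof=1)

print("Arithmetic Mean: {}".format(arithmetic_mean))
print("Median: {}".format(median))
print("Sample Standard Deviation: {}".format(sample_standard_deviation))
```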

Used Python Algorithms

I decided to use simple calculations of the Arithmetic Mean, Median and Sample Standard Deviation to measure the execution time (runtime) of the Python programs and compare them. The test data is generated as a one-dimensional NumPy array with a 64-bit float data type. The following three Python algorithms were implemented and analyzed:

  1. NumPy array
  2. NumPy array with asyncio asynchronous library
  3. NumPy array with Numba library
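For reference, the three statistics can be written as follows, where x_1, ..., x_n are the observations and x_(k) denotes the k-th smallest value. These are the standard textbook definitions; the notation is introduced here only for illustration and does not appear in the listings.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}

\tilde{x} =
\begin{cases}
x_{((n+1)/2)} & \text{if } n \text{ is odd} \\
\frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{if } n \text{ is even}
\end{cases}
```

The divisor n − 1 in the second formula is what makes it the sample (rather than population) standard deviation.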

NumPy Array Program

Let’s look at the code for each algorithm. Every algorithm has its own class object and a main calling program to follow the Object-Oriented Programming (OOP) methodology. This class object contains the following five methods:

  • calculate_number_observation() – calculates the number of observations
  • calculate_arithmetic_mean() – calculates the arithmetic mean
  • calculate_median() – calculates the median
  • calculate_sample_standard_deviation() – calculates the sample standard deviation
  • print_exception_message() – prints the exception message if an exception occurs

Listing 1 shows the summary statistics class object code using NumPy array only.

import sys
import traceback
import time
from math import sqrt


class SummaryStatistics(object):
    """
    calculate number of observations, arithmetic mean, median
    and sample standard deviation using standard procedures
    """

    def __init__(self):
        pass

    def calculate_number_observation(self, one_dimensional_array):
        """
        calculate number of observations
        :param one_dimensional_array: numpy one dimensional array
        :return number of observations value
        """
        number_observation = 0
        try:
            number_observation = one_dimensional_array.size
        except Exception:
            self.print_exception_message()
        return number_observation

    def calculate_arithmetic_mean(self, one_dimensional_array, number_observation):
        """
        calculate arithmetic mean
        :param one_dimensional_array: numpy one dimensional array
        :param number_observation: number of observations
        :return arithmetic mean value
        """
        arithmetic_mean = 0.0
        try:
            sum_result = 0.0
            for i in range(number_observation):
                sum_result += one_dimensional_array[i]
            arithmetic_mean = sum_result / number_observation
        except Exception:
            self.print_exception_message()
        return arithmetic_mean

    def calculate_median(self, one_dimensional_array, number_observation):
        """
        calculate median
        :param one_dimensional_array: numpy one dimensional array
        :param number_observation: number of observations
        :return median value
        """
        median = 0.0
        try:
            one_dimensional_array.sort()
            half_position = number_observation // 2
            if not number_observation % 2:
                median = (one_dimensional_array[half_position - 1] + one_dimensional_array[half_position]) / 2.0
            else:
                median = one_dimensional_array[half_position]
        except Exception:
            self.print_exception_message()
        return median

    def calculate_sample_standard_deviation(self, one_dimensional_array, number_observation, arithmetic_mean):
        """
        calculate sample standard deviation
        :param one_dimensional_array: numpy one dimensional array
        :param number_observation: number of observations
        :param arithmetic_mean: arithmetic mean value
        :return sample standard deviation value
        """
        sample_standard_deviation = 0.0
        try:
            sum_result = 0.0
            for i in range(number_observation):
                sum_result += pow((one_dimensional_array[i] - arithmetic_mean), 2)
            sample_variance = sum_result / (number_observation - 1)
            sample_standard_deviation = sqrt(sample_variance)
        except Exception:
            self.print_exception_message()
        return sample_standard_deviation

    def print_exception_message(self, message_orientation="horizontal"):
        """
        print full exception message
        :param message_orientation: horizontal or vertical
        :return none
        """
        try:
            exc_type, exc_value, exc_tb = sys.exc_info()
            file_name, line_number, procedure_name, line_code = traceback.extract_tb(exc_tb)[-1]
            time_stamp = " [Time Stamp]: " + str(time.strftime("%Y-%m-%d %I:%M:%S %p"))
            file_name = " [File Name]: " + str(file_name)
            procedure_name = " [Procedure Name]: " + str(procedure_name)
            error_message = " [Error Message]: " + str(exc_value)
            error_type = " [Error Type]: " + str(exc_type)
            line_number = " [Line Number]: " + str(line_number)
            line_code = " [Line Code]: " + str(line_code)
            if message_orientation == "horizontal":
                print("An error occurred:{};{};{};{};{};{};{}".format(time_stamp, file_name, procedure_name, error_message, error_type, line_number, line_code))
            elif message_orientation == "vertical":
                print("An error occurred:\n{}\n{}\n{}\n{}\n{}\n{}\n{}".format(time_stamp, file_name, procedure_name, error_message, error_type, line_number, line_code))
            else:
                pass
        except Exception:
            pass

Listing 1. Summary statistics class object code using NumPy array
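To see how the median branch in Listing 1 behaves for odd and even array sizes, a small standalone sketch may help. The helper name median_of_sorted_copy and the sample values are made up for this illustration; unlike Listing 1, it sorts a copy so the caller's array is left unchanged.

```python
import numpy as np


def median_of_sorted_copy(values):
    """Return the median using the same sort-and-pick logic as Listing 1."""
    ordered = np.sort(values)  # np.sort returns a sorted copy
    n = ordered.size
    half_position = n // 2
    if n % 2 == 0:
        # even size: average the two middle values
        return (ordered[half_position - 1] + ordered[half_position]) / 2.0
    # odd size: take the single middle value
    return ordered[half_position]


odd_result = median_of_sorted_copy(np.array([3.0, 1.0, 2.0]))
even_result = median_of_sorted_copy(np.array([4.0, 1.0, 3.0, 2.0]))
print(odd_result, even_result)
```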

Listing 2 shows the summary statistics main program. As you can see, this program creates the summary_statistics class object and then calls its methods. The NumPy library needs to be imported to generate the one-dimensional array. The program runtime is measured with a timer from the time module. Note that time.clock(), which was available when these tests were run, has been removed in Python 3.8; time.perf_counter() is its replacement.

import time
import numpy as np

from class_summary_statistics import SummaryStatistics


def main(one_dimensional_array):
    # create summary statistics class object
    summary_statistics = SummaryStatistics()

    # calculate number of observations
    number_observation = summary_statistics.calculate_number_observation(one_dimensional_array)
    print("Number of Observation: {} ".format(number_observation))

    # calculate arithmetic mean
    arithmetic_mean = summary_statistics.calculate_arithmetic_mean(one_dimensional_array, number_observation)
    print("Arithmetic Mean: {} ".format(arithmetic_mean))

    # calculate median
    median = summary_statistics.calculate_median(one_dimensional_array, number_observation)
    print("Median: {} ".format(median))

    # calculate sample standard deviation
    sample_standard_deviation = summary_statistics.calculate_sample_standard_deviation(one_dimensional_array, number_observation, arithmetic_mean)
    print("Sample Standard Deviation: {} ".format(sample_standard_deviation))


if __name__ == '__main__':
    # perf_counter() replaces time.clock(), which was removed in Python 3.8
    start_time = time.perf_counter()
    # one million rows, matching the output shown below; change the size for other runs
    one_dimensional_array = np.arange(1000000, dtype=np.float64)
    main(one_dimensional_array)
    end_time = time.perf_counter()
    print("Program Runtime: {} seconds".format(round(end_time - start_time, 1)))

Listing 2. Summary statistics main program code using NumPy array
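A runtime measurement like the one in Listing 2 can be wrapped in a small helper. The sketch below uses time.perf_counter(); the helper name measure_runtime and the summed range are assumptions made for this example only.

```python
import time


def measure_runtime(function, *args):
    """Run function(*args) once and return (result, elapsed seconds)."""
    start_time = time.perf_counter()
    result = function(*args)
    end_time = time.perf_counter()
    return result, end_time - start_time


total, runtime = measure_runtime(sum, range(1_000_000))
print("Program Runtime: {} seconds".format(round(runtime, 1)))
```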

With one million rows the summary statistics main program will show the result below:

 

Number of Observation: 1000000
Arithmetic Mean: 499999.5
Median: 499999.5
Sample Standard Deviation: 288675.27893349814
Program Runtime: 1.3 seconds
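The values above can be cross-checked without running the loops. Since the test array holds the integers 0 to n − 1, the mean and median both equal (n − 1)/2, and the sample variance works out to n(n + 1)/12. The short sketch below only verifies this arithmetic; it is not part of the timed programs.

```python
from math import sqrt

n = 1_000_000  # number of observations, values 0, 1, ..., n - 1

expected_mean = (n - 1) / 2
expected_median = (n - 1) / 2
expected_sample_std = sqrt(n * (n + 1) / 12)

print(expected_mean, expected_median, expected_sample_std)
```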

 

NumPy Array with asyncio Library

Listing 3 shows the summary statistics class object code that uses the Python asyncio asynchronous library. Note that the main() method starts the event loop with calculate_number_observation() as the first and only task. Each method then awaits the next one, so the calculations are chained rather than run in parallel.

import sys
import time
import traceback
import asyncio
from math import sqrt


class SummaryStatisticsAsyncio(object):
    """
    calculate number of observations, arithmetic mean, median
    and sample standard deviation using asyncio library
    """

    def __init__(self):
        pass

    async def calculate_number_observation(self, one_dimensional_array):
        """
        calculate number of observations
        :param one_dimensional_array: numpy one dimensional array
        :return none
        """
        try:
            print('start calculate_number_observation() procedure')
            await asyncio.sleep(0)
            number_observation = one_dimensional_array.size
            print("Number of Observation: {} ".format(number_observation))
            await self.calcuate_arithmetic_mean(one_dimensional_array, number_observation)
            print("finished calculate_number_observation() procedure")
        except Exception:
            self.print_exception_message()

    async def calcuate_arithmetic_mean(self, one_dimensional_array, number_observation):
        """
        calculate arithmetic mean
        :param one_dimensional_array: numpy one dimensional array
        :param number_observation: number of observations
        :return none
        """
        try:
            print('start calcuate_arithmetic_mean() procedure')
            await self.calculate_median(one_dimensional_array, number_observation)
            sum_result = 0.0
            await asyncio.sleep(0)
            for i in range(number_observation):
                sum_result += one_dimensional_array[i]
            arithmetic_mean = sum_result / number_observation
            print("Arithmetic Mean: {} ".format(arithmetic_mean))
            await self.calculate_sample_standard_deviation(one_dimensional_array, number_observation, arithmetic_mean)
            print("finished calcuate_arithmetic_mean() procedure")
        except Exception:
            self.print_exception_message()

    async def calculate_median(self, one_dimensional_array, number_observation):
        """
        calculate median
        :param one_dimensional_array: numpy one dimensional array
        :param number_observation: number of observations
        :return none
        """
        try:
            print('starting calculate_median()')
            await asyncio.sleep(0)
            one_dimensional_array.sort()
            half_position = number_observation // 2
            if not number_observation % 2:
                median = (one_dimensional_array[half_position - 1] + one_dimensional_array[half_position]) / 2.0
            else:
                median = one_dimensional_array[half_position]
            print("Median: {} ".format(median))
            print("finished calculate_median() procedure")
        except Exception:
            self.print_exception_message()

    async def calculate_sample_standard_deviation(self, one_dimensional_array, number_observation, arithmetic_mean):
        """
        calculate sample standard deviation
        :param one_dimensional_array: numpy one dimensional array
        :param number_observation: number of observations
        :param arithmetic_mean: arithmetic mean value
        :return none
        """
        try:
            print('start calculate_sample_standard_deviation() procedure')
            await asyncio.sleep(0)
            sum_result = 0.0
            for i in range(number_observation):
                sum_result += pow((one_dimensional_array[i] - arithmetic_mean), 2)
            sample_variance = sum_result / (number_observation - 1)
            sample_standard_deviation = sqrt(sample_variance)
            print("Sample Standard Deviation: {} ".format(sample_standard_deviation))
            print("finished calculate_sample_standard_deviation() procedure")
        except Exception:
            self.print_exception_message()

    def print_exception_message(self, message_orientation="horizontal"):
        """
        print full exception message
        :param message_orientation: horizontal or vertical
        :return none
        """
        try:
            exc_type, exc_value, exc_tb = sys.exc_info()
            file_name, line_number, procedure_name, line_code = traceback.extract_tb(exc_tb)[-1]
            time_stamp = " [Time Stamp]: " + str(time.strftime("%Y-%m-%d %I:%M:%S %p"))
            file_name = " [File Name]: " + str(file_name)
            procedure_name = " [Procedure Name]: " + str(procedure_name)
            error_message = " [Error Message]: " + str(exc_value)
            error_type = " [Error Type]: " + str(exc_type)
            line_number = " [Line Number]: " + str(line_number)
            line_code = " [Line Code]: " + str(line_code)
            if message_orientation == "horizontal":
                print("An error occurred:{};{};{};{};{};{};{}".format(time_stamp, file_name, procedure_name, error_message, error_type, line_number, line_code))
            elif message_orientation == "vertical":
                print("An error occurred:\n{}\n{}\n{}\n{}\n{}\n{}\n{}".format(time_stamp, file_name, procedure_name, error_message, error_type, line_number, line_code))
            else:
                pass
        except Exception:
            pass

    def main(self, one_dimensional_array):
        """
        start the event loop asynchronous process
        :param one_dimensional_array: numpy one dimensional array
        """
        try:
            # asyncio.run() creates the event loop, runs the coroutine as its
            # only task and closes the loop afterwards (it replaces the older
            # get_event_loop() / run_until_complete() / close() sequence)
            asyncio.run(self.calculate_number_observation(one_dimensional_array))
        except Exception:
            self.print_exception_message()

Listing 3. Summary statistics asyncio class object code with Python asynchronous library

The summary statistics asyncio main program is shown in Listing 4. As you can see the main() method is the only one to be called.

import time
import numpy as np

from class_summary_statistics_asyncio import SummaryStatisticsAsyncio


def main(one_dimensional_array):
    # create summary statistics asyncio class object
    summary_statistics_asyncio = SummaryStatisticsAsyncio()

    # call main method
    summary_statistics_asyncio.main(one_dimensional_array)


if __name__ == '__main__':
    # perf_counter() replaces time.clock(), which was removed in Python 3.8
    start_time = time.perf_counter()
    one_dimensional_array = np.arange(1000000000, dtype=np.float64)
    main(one_dimensional_array)
    end_time = time.perf_counter()
    print("Program Runtime: {} seconds".format(round(end_time - start_time, 1)))

Listing 4. Summary statistics asyncio main program code with Python asynchronous library

With one billion rows the summary statistics asyncio main program will show the result below. I have included the printing of the start/finish procedures to show how the asynchronous process works in this particular case.

 

start calculate_number_observation() procedure
Number of Observation: 1000000000
start calcuate_arithmetic_mean() procedure
starting calculate_median()
Median: 499999.5
finished calculate_median() procedure
Arithmetic Mean: 499999.5
start calculate_sample_standard_deviation() procedure
Sample Standard Deviation: 288675.27893349814
finished calculate_sample_standard_deviation() procedure
finished calcuate_arithmetic_mean() procedure
finished calculate_number_observation() procedure
Program Runtime: 1504.4 seconds
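The order of the messages above follows from how await works: awaiting a coroutine runs it to completion before the caller continues, so the nested calls execute one after another on a single thread. The sketch below reproduces that ordering with two hypothetical coroutines (outer and inner are names invented for this example) and records the events in a list.

```python
import asyncio

events = []


async def inner():
    events.append("start inner")
    await asyncio.sleep(0)  # yields to the event loop; no other task is waiting
    events.append("finished inner")


async def outer():
    events.append("start outer")
    await inner()  # outer is suspended until inner has finished
    events.append("finished outer")


asyncio.run(outer())
print(events)
```

Because only one task exists, asyncio.sleep(0) has nothing else to switch to, so the work is carried out sequentially.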

 

NumPy Array with Numba Library Program

The summary statistics class object code with the Numba library is shown in Listing 5. Check the Numba GitHub repository to learn more about this Open Source NumPy-aware optimizing compiler for Python. It’s important to mention that Numba supports CUDA GPU programming. As you can see, the debugging code has been removed so that the methods can run in compiled mode.

from math import sqrt

from numba import jit


class SummaryStatisticsNumba(object):
    """
    calculate number of observations, arithmetic mean, median
    and sample standard deviation using numba library
    """

    def __init__(self):
        pass

    # The compiled functions are declared as static methods because Numba's
    # nopython mode cannot type an ordinary Python instance (self); they can
    # still be called on an instance with the same arguments as before.
    @staticmethod
    @jit(nopython=True)
    def calculate_number_observation(one_dimensional_array):
        """
        calculate number of observations
        :param one_dimensional_array: numpy one dimensional array
        :return number of observations value
        """
        number_observation = one_dimensional_array.size
        return number_observation

    @staticmethod
    @jit(nopython=True)
    def calcuate_arithmetic_mean(one_dimensional_array, number_observation):
        """
        calculate arithmetic mean
        :param one_dimensional_array: numpy one dimensional array
        :param number_observation: number of observations
        :return arithmetic mean value
        """
        sum_result = 0.0
        for i in range(number_observation):
            sum_result += one_dimensional_array[i]
        arithmetic_mean = sum_result / number_observation
        return arithmetic_mean

    @staticmethod
    @jit(nopython=True)
    def calculate_median(one_dimensional_array, number_observation):
        """
        calculate median
        :param one_dimensional_array: numpy one dimensional array
        :param number_observation: number of observations
        :return median value
        """
        one_dimensional_array.sort()
        half_position = number_observation // 2
        if not number_observation % 2:
            median = (one_dimensional_array[half_position - 1] + one_dimensional_array[half_position]) / 2.0
        else:
            median = one_dimensional_array[half_position]
        return median

    @staticmethod
    @jit(nopython=True)
    def calculate_sample_standard_deviation(one_dimensional_array, number_observation, arithmetic_mean):
        """
        calculate sample standard deviation
        :param one_dimensional_array: numpy one dimensional array
        :param number_observation: number of observations
        :param arithmetic_mean: arithmetic mean value
        :return sample standard deviation value
        """
        sum_result = 0.0
        for i in range(number_observation):
            sum_result += pow((one_dimensional_array[i] - arithmetic_mean), 2)
        sample_variance = sum_result / (number_observation - 1)
        sample_standard_deviation = sqrt(sample_variance)
        return sample_standard_deviation

Listing 5. Summary statistics class object code with Numba library
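A common way to use Numba is to compile a plain module-level function and call it once on a small input, so that compilation time is not counted in a later measurement. The sketch below illustrates that idea under stated assumptions: the function name sum_of_squared_deviations and the array size are invented here, and the fallback decorator exists only so the example still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch only: run uncompiled if Numba is missing
    def njit(function):
        return function


@njit
def sum_of_squared_deviations(one_dimensional_array, arithmetic_mean):
    """Explicit loop of the kind Numba can compile to machine code."""
    sum_result = 0.0
    for i in range(one_dimensional_array.size):
        sum_result += (one_dimensional_array[i] - arithmetic_mean) ** 2
    return sum_result


data = np.arange(100_000, dtype=np.float64)
mean = data.sum() / data.size

# First call triggers compilation for float64 arrays (warm-up)
sum_of_squared_deviations(data[:10], mean)

# Later calls with the same argument types reuse the compiled code
sample_variance = sum_of_squared_deviations(data, mean) / (data.size - 1)
sample_standard_deviation = np.sqrt(sample_variance)
print(sample_standard_deviation)
```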

The summary statistics Numba main program is shown in Listing 6.

import time
import numpy as np

from class_summary_statistics_numba import SummaryStatisticsNumba


def main(one_dimensional_array):
    # create summary statistics numba class object
    class_summary_statistics_numba = SummaryStatisticsNumba()

    # calculate number of observations
    number_observation = class_summary_statistics_numba.calculate_number_observation(one_dimensional_array)
    print("Number of Observation: {} ".format(number_observation))

    # calculate arithmetic mean
    arithmetic_mean = class_summary_statistics_numba.calcuate_arithmetic_mean(one_dimensional_array, number_observation)
    print("Arithmetic Mean: {} ".format(arithmetic_mean))

    # calculate median
    median = class_summary_statistics_numba.calculate_median(one_dimensional_array, number_observation)
    print("Median: {} ".format(median))

    # calculate sample standard deviation
    sample_standard_deviation = class_summary_statistics_numba.calculate_sample_standard_deviation(one_dimensional_array, number_observation, arithmetic_mean)
    print("Sample Standard Deviation: {} ".format(sample_standard_deviation))


if __name__ == '__main__':
    # perf_counter() replaces time.clock(), which was removed in Python 3.8
    start_time = time.perf_counter()
    one_dimensional_array = np.arange(1000000000, dtype=np.float64)
    main(one_dimensional_array)
    end_time = time.perf_counter()
    print("Program Runtime: {} seconds".format(round(end_time - start_time, 1)))

Listing 6. Summary statistics Numba main program

With one billion rows the summary statistics Numba main program will show the result below:

 

Number of Observation: 1000000000
Arithmetic Mean: 499999999.067109
Median: 499999999.5
Sample Standard Deviation: 288675134.73899055
Program Runtime: 40.2 seconds

 

It’s very exciting to see the calculations finish in 40.2 seconds with one billion rows in the NumPy array. I think it’s time to use NumPy arrays with the Numba library wherever possible in Big Data Science projects. Some research and testing may be required for specific cases.

Laptop Hardware Parameters

Here are the laptop hardware parameters used to run the Python programs:

  • Windows 10 64-bit OS
  • Intel Core™ i7-2670QM CPU @ 2.20 GHz
  • 16 GB RAM
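The array sizes used in the tests can be related to the 16 GB of RAM listed above. A float64 element takes 8 bytes, so the memory needed for the data alone can be estimated as follows; this is a back-of-the-envelope sketch that ignores the interpreter and any temporary copies.

```python
import numpy as np

bytes_per_value = np.dtype(np.float64).itemsize  # 8 bytes

for number_of_rows in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    size_in_bytes = number_of_rows * bytes_per_value
    size_in_gib = size_in_bytes / 2**30
    print("{:>13,} rows: {:.3f} GiB".format(number_of_rows, size_in_gib))
```

By this estimate the one-billion-row array alone occupies about 7.45 GiB.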

Programs Runtime Comparison

Table 1 shows the program runtimes for 1 million, 10 million, 100 million and 1 billion rows.

                                  Used Python Algorithms
Number of Rows   NumPy array     NumPy array with asyncio   NumPy array with Numba
                                 asynchronous library       library
1 million        1.3 seconds     1.4 seconds                0.9 seconds
10 million       12.8 seconds    12.7 seconds               1.1 seconds
100 million      2.16 minutes    2.17 minutes               4.4 seconds
1 billion        22.66 minutes   22.64 minutes              40.2 seconds


Table 1: Programs Runtime Comparison
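The speedup of the Numba version over the plain NumPy array version can be read off Table 1 by dividing the runtimes. The sketch below does that arithmetic with the values copied from the table, converting minutes to seconds first; the dictionary names are chosen for this example.

```python
# Runtimes in seconds, copied from Table 1 (minutes converted to seconds)
numpy_runtime = {
    "1 million": 1.3,
    "10 million": 12.8,
    "100 million": 2.16 * 60,
    "1 billion": 22.66 * 60,
}
numba_runtime = {
    "1 million": 0.9,
    "10 million": 1.1,
    "100 million": 4.4,
    "1 billion": 40.2,
}

speedup = {
    rows: round(numpy_runtime[rows] / numba_runtime[rows], 1)
    for rows in numpy_runtime
}
for rows, factor in speedup.items():
    print("{}: about {}x faster with Numba".format(rows, factor))
```

With these numbers the ratio grows from roughly 1.4 at one million rows to roughly 34 at one billion rows.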

Conclusions

  1. There is no practical difference between using a NumPy array alone and using a NumPy array with the asyncio asynchronous library. This is expected for calculations like these: asyncio interleaves tasks on a single thread, which helps when tasks wait on I/O but not when they are busy computing. These calculations are therefore not sufficient to judge the performance of the asyncio asynchronous library in Python Data Science projects, and more research may be required to find the right applications.
  2. The combination of NumPy array with Numba library provided the best performance for data manipulation and analysis in these tests, compared with NumPy array alone and NumPy array with the asyncio asynchronous library. I was very impressed to see a 40.2 second execution time with one billion rows in the NumPy array. I wonder if R programs can do that today! If not, maybe it’s time for R programmers to learn Python and its Data Ecosystem libraries. One more thing: make sure to write Python programs using OOP methodology for real production environments with Continuous Integration software development and deployment practices.
  3. In Data Pipeline and Extract-Transform-Load (ETL) system projects with different types of data sources, the NumPy array with Numba library implementation is one of the best programming practices for Big Data analysis today. There shouldn’t be a need to use Python List objects for it.

Feel free to email Ernest any questions about his article.

 
