Dataconomy

Frequency Distribution Analysis using Python Data Stack – Part 1

by Ernest Bonat, Ph.D.
June 6, 2017
in Data Science, Resources

During my years as a consulting data scientist, many clients have asked me for frequency distribution reports on their business data. These reports help company management make sound business decisions quickly. In this paper I show how to design and develop a generic frequency distribution library that reduces development time and produces a summary table and chart image for your clients. An important topic covered in this paper is the conversion of top-down Python code into a generic, reusable superclass library for future Object-Oriented Programming (OOP) development applied to data analytics and visualization.

I’ll be using the following three main Python Data Stack libraries:

    1. NumPy – the fundamental package for scientific computing.
    2. pandas – an open source library providing high-performance, easy-to-use data structures and data analysis tools.
    3. Matplotlib – a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.

Table of Contents

  • Frequency Statistical Definitions
  • Network Server Activities Frequency Distribution Analysis
  • Network Server Activities Analysis
  • Frequency Distribution Main Library

Frequency Statistical Definitions

The frequency of a particular data value is the number of times that value occurs. A frequency distribution is a tabular summary (frequency table) of data showing the number of observations (outcomes) in each of several non-overlapping categories called classes. The objective is to provide a simple interpretation of the data that cannot be obtained quickly by looking only at the raw data.
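As a small illustration of these definitions, the sketch below counts how often each class occurs and converts the counts to percent frequencies. The helper name frequency_table and the sample observations are made up for this example and are not part of the library developed later in the paper.

```python
from collections import Counter


def frequency_table(values):
    """Return (class, frequency, percent frequency) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (label, count, round(100.0 * count / total, 1))
        for label, count in counts.most_common()
    ]


# Made-up observations, used only to illustrate the definitions.
observations = ["Error", "Info", "Error", "Warning", "Error", "Info"]

for label, count, percent in frequency_table(observations):
    print("{}\t{}\t{}%".format(label, count, percent))
```

Each class appears once in the result, and the percent frequencies add up to 100 apart from rounding.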

Frequency Distribution Analysis can be used for both Categorical (qualitative) and Numerical (quantitative) data types. I have seen it used most often for Categorical data, especially during data cleansing with the pandas library. In general, there are two types of frequency tables: Univariate (used with a single variable) and Bivariate (used with two variables). Univariate tables will be used in this paper. Bivariate frequency tables are presented as (two-way) Contingency Tables. Contingency tables are the input to the Chi-squared Test of Independence, while the Chi-squared Goodness-of-Fit Test works on a one-way frequency table. We'll cover these topics in future papers.
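To make the difference between the two table types concrete, here is a minimal pandas sketch. The six rows are invented for illustration; value_counts gives a univariate table and pandas.crosstab gives a two-way contingency table.

```python
import pandas as pd

# Invented sample rows, used only to contrast the two table types.
df = pd.DataFrame({
    "Priority": ["Error", "Warning", "Error", "Info", "Alert", "Error"],
    "Category": [
        "Firewall Event", "Firewall Event", "Firewall Event",
        "Authenticated Access", "Intrusion Prevention", "Firewall Event",
    ],
})

# Univariate frequency table: one variable, one count per class.
univariate = df["Priority"].value_counts()

# Bivariate frequency table (contingency table): one count per pair of classes.
bivariate = pd.crosstab(df["Priority"], df["Category"])

print(univariate)
print(bivariate)
```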



Network Server Activities Frequency Distribution Analysis

The Windows network server activities log file (network_activities.csv) is shown in Table 1.

Time     Priority  Category              Message
10:47.2  Info      Firewall Event        SonicWALL initializing
10:55.2  Error     Firewall Event        Interface X0 Link Is Down
10:55.2  Warning   Firewall Event        Interface X1 Link Is Up
10:55.2  Error     Firewall Event        Interface X2 Link Is Down
10:55.2  Info      Authenticated Access  Administrator login allowed
10:55.2  Error     Firewall Event        Interface X4 Link Is Down
10:55.2  Alert     Intrusion Prevention  Possible port scan detected
10:55.2  Error     Firewall Event        Interface X6 Link Is Down
10:55.2  Info      Authenticated Access  GUI administration session ended
10:55.2  Error     Firewall Event        Interface X8 Link Is Down
10:55.2  Error     Firewall Event        Interface X9 Link Is Down
11:02.2  Alert     Firewall Event        SonicWALL activated
33:20.4  Warning   Firewall Event        Interface X8 Link Is Up
33:23.4  Warning   Firewall Event        Interface X9 Link Is Up
33:56.0  Error     Firewall Event        Interface X8 Link Is Down

Table 1. Fifteen rows of network activities log file.

As Table 1 shows, the log file contains four columns: Time, Priority, Category and Message. In a real production environment this log file may have hundreds of thousands of rows.

Network Server Activities Analysis

The server administration team has requested a statistical analysis and report of the network activities for maintenance and management review. In general, this frequency report includes two components:

  1. Frequency Summary Table
  2. Percent Frequency Distribution Chart
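As a rough sketch of the first component, the snippet below builds a frequency summary for the Priority column. The fifteen values are typed in from Table 1 rather than read from network_activities.csv, and the column labels Frequency and Percent are my own choice for this example.

```python
import pandas as pd

# Priority column of Table 1, typed in row by row.
priorities = [
    "Info", "Error", "Warning", "Error", "Info",
    "Error", "Alert", "Error", "Info", "Error",
    "Error", "Alert", "Warning", "Warning", "Error",
]

# Absolute frequency of each class.
counts = pd.Series(priorities, name="Priority").value_counts()

# Summary table with the absolute and the percent frequency per class.
summary = pd.DataFrame({
    "Frequency": counts,
    "Percent": (100.0 * counts / counts.sum()).round(1),
})

print(summary)
```

The Percent column holds the values that the percent frequency distribution chart would plot.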

Code Listing 1 shows simple top-down Python code for Frequency Distribution Analysis.


[codesyntax lang=”python” lines=”normal”]

import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def main():

    # Frequency Distribution 1 (Vertical Bar Chart)
    # ---------------------------------------------

    # set file path name
    file_path_name = r"C:\Users\Ernest\git\test-code\test-code\src\percent_frequency_distribution\network_activities.csv"

    # set image path name
    image_path_name1 = r"C:\Users\Ernest\git\test-code\test-code\src\percent_frequency_distribution\network_activities.png"

    # get network activity data frame
    df_network_activity1 = pd.read_csv(filepath_or_buffer = file_path_name, sep = ",")

    # get relative frequencies in a pandas series
    ds_network_activity1 = df_network_activity1["Priority"].value_counts(normalize = True)
    print(ds_network_activity1)

    # define the x and y axes (Series.items() replaces iteritems(), removed in pandas 2.0)
    x_axis = []
    y_axis = []
    for x, y in ds_network_activity1.items():
        x_axis.append(x)
        y_axis.append(y * 100)

    # build and plot the network activity vertical bar chart
    colors = []
    for x_value in x_axis:
        if x_value == "Error":
            colors.append('r')
        elif x_value == "Warning":
            colors.append('y')
        else:
            colors.append('g')
    plt.style.use("ggplot")
    x_pos = np.arange(len(x_axis))
    rects = plt.bar(x_pos, y_axis, width = 0.7, color = colors, align = "center", alpha = 0.7, label = "Amount of Messages")
    # write the percent value on top of each bar
    for rect in rects:
        rec_x = rect.get_x()
        rec_width = rect.get_width()
        rec_height = rect.get_height()
        height_format = float("{0:.1f}".format(rec_height))
        plt.text(rec_x + rec_width / 2, rec_height, str(height_format) + "%", horizontalalignment = "center", verticalalignment = 'bottom')
    plt.xticks(x_pos, x_axis)
    plt.xlabel("Priority")
    plt.ylabel("Percent Frequency")
    plt.title("Priority Message Percent Frequency Distribution")
    plt.legend(loc = 1)
    plt.tight_layout()
    plt.savefig(image_path_name1, dpi = 100)
    plt.show()

    # Frequency Distribution 2 (Horizontal Bar Chart)
    # -----------------------------------------------

    # set image file path name
    image_path_name2 = r"C:\Users\Ernest\git\test-code\test-code\src\percent_frequency_distribution\network_activities2.png"

    # get network activity data frame for priority and message columns
    df_network_activity2 = pd.read_csv(filepath_or_buffer = file_path_name, sep = ",")

    # group by priority column
    df_column_group = df_network_activity2.groupby("Priority")

    # get relative frequencies by message column
    ds_network_activity2 = df_column_group["Message"].value_counts(normalize = True)

    # define the x and y axes, keeping only the "Error" group
    # (each index entry is a (priority, message) tuple)
    x_axis = []
    y_axis = []
    for x, y in ds_network_activity2.items():
        if x[0] == "Error":
            x_axis.append(x[1])
            y_axis.append(y * 100)

    # build and plot the network activity horizontal bar chart
    plt.style.use("ggplot")
    x_pos = np.arange(len(x_axis))
    colors = ["r"]
    rects = plt.barh(x_pos, y_axis, color = colors, align = "center", alpha = 0.8, label = "Amount of Messages")
    # write the (truncated) percent value near the end of each bar
    for rect in rects:
        rec_y = rect.get_y()
        rec_width = int(rect.get_width())
        rec_height = rect.get_height()
        plt.text(rec_width - 0.6, rec_y + rec_height / 2, str(rec_width) + "%", horizontalalignment = "center", verticalalignment = 'bottom')
    plt.yticks(x_pos, x_axis)
    plt.xlabel("Percent Frequency")
    plt.ylabel("Error Message")
    plt.title("Error Server Percent Frequency Distribution")
    plt.legend(loc = 1)
    plt.tight_layout()
    plt.savefig(image_path_name2, dpi = 100)
    plt.show()

if __name__ == '__main__':
    start_time = time.time()
    main()
    end_time = time.time()
    print("Program Runtime: " + str(round(end_time - start_time, 1)) + " seconds" + "\n")

[/codesyntax]

Code Listing 1. Top-down code for Frequency Distribution Analysis.

As Code Listing 1 shows, most of the input data is hardcoded in the program. The only way to reuse it is to copy and paste it into another module file and then change the input values, which is a lot of work and a very bad programming practice. The hardcoded inputs include the data file and image paths, the data column name, and many plot parameters.
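One simple way out of the copy-and-paste trap is to pass the data frame and the column name as parameters instead of writing them into the code. The function name percent_frequency and the sample data below are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd


def percent_frequency(data_frame, column_name):
    """Return the class labels and percent frequencies of one column."""
    relative = data_frame[column_name].value_counts(normalize=True)
    labels = list(relative.index)
    percents = [round(float(value) * 100, 1) for value in relative.values]
    return labels, percents


# Hypothetical sample data; any data frame and column name can be passed in.
sample = pd.DataFrame({"Priority": ["Error", "Error", "Error", "Warning"]})
labels, percents = percent_frequency(sample, "Priority")
print(labels, percents)
```

The same function now works for any column of any data frame without editing its body.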

I have seen many Python programmers implement data analytics this way in Jupyter Notebook or a modern text editor. It is as if they don't know the importance of Object-Oriented Programming design and implementation, Continuous Integration deployment practices, unit and system tests, and so on.

Frequency Distribution Main Library

We need a reusable and extensible library to considerably reduce Data Analytics development time and the amount of code required. I have developed a frequency_distribution_superclass.py module that contains the frequency distribution class library FrequencyDistributionLibrary(object), shown in Code Listing 2.

[codesyntax lang=”python” lines=”normal”]

import os
import sys
import traceback
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
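import config  # project settings module (not shown in this listing); must define error_class, warning_class and plot_style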

class FrequencyDistributionLibrary(object):

   """

   generic frequency distribution superclass library
   """        
   def __init__(self):
       pass
        
   def print_exception_message(self, message_orientation = "horizontal"):

       """

       print full exception message

       :param message_orientation: horizontal or vertical

       :return none
       """

       try:

           exc_type, exc_value, exc_tb = sys.exc_info()
           file_name, line_number, procedure_name, line_code = traceback.extract_tb(exc_tb)[-1]            
           time_stamp = " [Time Stamp]: " + str(time.strftime("%Y-%m-%d %I:%M:%S %p"))
           file_name = " [File Name]: " + str(file_name)
           procedure_name = " [Procedure Name]: " + str(procedure_name)
           error_message = " [Error Message]: " + str(exc_value)        
           error_type = " [Error Type]: " + str(exc_type)                    
           line_number = " [Line Number]: " + str(line_number)                
           line_code = " [Line Code]: " + str(line_code)
           if (message_orientation == "horizontal"):

               print( "An error occurred:{};{};{};{};{};{};{}".format(time_stamp, file_name, procedure_name, error_message, error_type, line_number, line_code))
           elif (message_orientation == "vertical"):
               print( "An error occurred:\n{}\n{}\n{}\n{}\n{}\n{}\n{}".format(time_stamp, file_name, procedure_name, error_message, error_type, line_number, line_code))
           else:
               pass                    
       except Exception:
           pass

       
   def get_project_directory_path(self):

       """

       get project directory path from the calling file
       """
       project_directory_path = None
       try:  
           project_directory_path = os.path.dirname(sys.argv[0])            
       except Exception:
           self.print_exception_message()                    
       return project_directory_path


   def format_float_number(self, decimal_point, real_value):

       """
       format float numbers with digits
       :param decimal_point:
       :param real_value:
       :return formatted float number
       """
       format_value = 0.0
       try:
           if decimal_point == 1:
               format_value = float("{0:.1f}".format(real_value))
           elif decimal_point == 2:
               format_value = float("{0:.2f}".format(real_value))
           elif decimal_point == 3:
               format_value = float("{0:.3f}".format(real_value))
           elif decimal_point == 4:
               format_value = float("{0:.4f}".format(real_value))
           elif decimal_point == 5:
               format_value = float("{0:.5f}".format(real_value))
           else:
               format_value = float("{0:.3f}".format(real_value))
       except Exception:                                                          
           self.print_exception_message()
       return format_value
 

   def load_x_y_axis_data(self, data_file_name, column_name, group_by_colum = None, column_name_class = None):

       """
       define x and y axis data
       :param data_file_name: csv data file path and name
       :param column_name: column whose values are counted
       :param group_by_colum: optional column to group by before counting
       :param column_name_class: group (class) to keep when group_by_colum is set
       :return x and y axis data
       """
       x_axis = []
       y_axis = []
       try:
           data_frame = pd.read_csv(filepath_or_buffer = data_file_name, sep = ",")
           if (group_by_colum is not None):
               data_frame = data_frame.groupby(group_by_colum)
           data_serie = data_frame[column_name].value_counts(normalize = True)
           if (group_by_colum is not None):
               # grouped result is indexed by (group value, column value) tuples
               # Series.items() replaces iteritems(), which was removed in pandas 2.0
               for x, y in data_serie.items():
                   if x[0] == column_name_class:
                       x_axis.append(x[1])
                       y_axis.append(self.format_float_number(1, y * 100))
           else:
               for x, y in data_serie.items():
                   x_axis.append(x)
                   y_axis.append(self.format_float_number(1, y * 100))
       except Exception:
           self.print_exception_message()
       return x_axis, y_axis


   def print_summary_table(self, first_column_name, second_column_name, x_axis, y_axis):
       """
       print tabular summary table
       :param first_column_name: class column
       :param second_column_name: frequency numerical column
       :param x_axis: x axis data
       :param y_axis: y axis data
       :return none
       """
       try:  
           print("{}\t{}".format(first_column_name, second_column_name))
           for x, y in zip(x_axis, y_axis):
               print("{}\t\t{}".format(x, str(y) + "%"))
       except Exception:
           self.print_exception_message()
        

   def build_bar_chart_vertical(self, x_axis, y_axis, image_file_name, plot_xlabel, plot_ylabel, plot_title, plot_legend):        

       """
       build vertical bar chart
       :param x_axis: x axis data
       :param y_axis: y axis data
       :param image_file_name: image file path and name
       :return none
       """
       try:
           colors = []
           for x_value in x_axis:
               if x_value == config.error_class:
                   colors.append('r')
               elif x_value == config.warning_class:
                   colors.append('y')
               else:
                   colors.append('g')          
           plt.style.use(config.plot_style)       
           x_pos = np.arange(len(x_axis))         
           rects = plt.bar(x_pos, y_axis, width = 0.7, color = colors, align = "center", alpha = 0.7, label = plot_legend)
           for rect in rects:
               rec_x = rect.get_x()
               rec_width = rect.get_width()        
               rec_height = rect.get_height()  
               height_format = self.format_float_number(1, rec_height)      
               plt.text(rec_x + rec_width / 2, rec_height, str(height_format) + "%", horizontalalignment = "center", verticalalignment = 'bottom')
           plt.xticks(x_pos, x_axis)
           plt.xlabel(plot_xlabel)
           plt.ylabel(plot_ylabel)      
           plt.title(plot_title)    
           plt.legend(loc = 1)    
           plt.tight_layout()
           plt.savefig(image_file_name, dpi = 100)
           plt.show()       
       except Exception:                                                          
           self.print_exception_message()
          

   def build_bar_chart_horizontal(self, x_axis, y_axis, image_file_name, plot_xlabel, plot_ylabel, plot_title, plot_legend):        

       """
        build horizontal bar chart
       :param x_axis: x axis data
       :param y_axis: y axis data
       :param image_file_name: image file path and name
       :return none
       """
       try:  
           plt.style.use(config.plot_style)  
           x_pos = np.arange(len(x_axis))                     
           colors = ["r"]    
           rects = plt.barh(x_pos, y_axis, color = colors, align = "center", alpha = 0.8, label = plot_legend)    
           for rect in rects:    
               rec_y = rect.get_y()
               # keep the float width so the label retains its decimal digit
               rec_width = rect.get_width()
               width_format = self.format_float_number(1, rec_width)
               rec_height = rect.get_height()
               plt.text(rec_width - 0.8, rec_y + rec_height / 2, str(width_format) + "%", horizontalalignment = "center", verticalalignment = 'bottom')
           plt.yticks(x_pos, x_axis)   
           plt.xlabel(plot_xlabel)
           plt.ylabel(plot_ylabel)      
           plt.title(plot_title)    
           plt.legend(loc = 1)    
           plt.tight_layout()
           plt.savefig(image_file_name, dpi = 100)
           plt.show()   
       except Exception:                                                          

           self.print_exception_message()

[/codesyntax]

Code Listing 2. Frequency distribution superclass FrequencyDistributionLibrary(object).
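Code Listing 2 imports a config module that is not listed in this paper. The sketch below is only a guess at what such a module could contain: the three attribute names are the ones the listing reads, and the values are taken from the literals hardcoded in Code Listing 1.

```python
# config.py (hypothetical contents, inferred from how Code Listing 2 uses it)

# Priority class drawn in red by build_bar_chart_vertical.
error_class = "Error"

# Priority class drawn in yellow by build_bar_chart_vertical.
warning_class = "Warning"

# Matplotlib style name passed to plt.style.use().
plot_style = "ggplot"
```

Keeping these values in one module means a different log format or chart style needs a change in a single place.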

This library contains six main methods used in this paper for a complete Frequency Distribution Analysis:

  1. print_exception_message(self, message_orientation = "horizontal")
  2. format_float_number(self, decimal_point, real_value)
  3. load_x_y_axis_data(self, data_file_name, column_name, group_by_colum = None, column_name_class = None)
  4. print_summary_table(self, first_column_name, second_column_name, x_axis, y_axis)
  5. build_bar_chart_vertical(self, x_axis, y_axis, image_file_name, plot_xlabel, plot_ylabel, plot_title, plot_legend)       
  6. build_bar_chart_horizontal(self, x_axis, y_axis, image_file_name, plot_xlabel, plot_ylabel, plot_title, plot_legend)     
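The grouped branch of load_x_y_axis_data depends on how pandas indexes the result of a grouped value_counts: each entry is keyed by a (group value, column value) tuple, and the relative frequencies are computed within each group. The standalone sketch below shows that behavior on made-up rows; the message texts are invented for the example.

```python
import pandas as pd

# Made-up rows, used only to show the shape of the grouped result.
df = pd.DataFrame({
    "Priority": ["Error", "Error", "Error", "Error", "Warning"],
    "Message": ["Link Down", "Link Down", "Link Down", "Timeout", "Link Up"],
})

# Relative frequency of each message within its priority group.
grouped = df.groupby("Priority")["Message"].value_counts(normalize=True)

# Keep only the "Error" group, as the library does with column_name_class.
error_rows = [
    (key[1], round(float(value) * 100, 1))
    for key, value in grouped.items()
    if key[0] == "Error"
]

print(grouped)
print(error_rows)
```

Because the frequencies are normalized per group, the percentages of the selected class add up to 100 on their own.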

In Part 2 we'll cover how to inherit from this library to create a subclass module, with real business examples of Frequency Distribution Analysis.

Tags: Frequency Distribution Analysis, Matplotlib, NumPy, pandas, Python


COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.
