Dataconomy

Programming with R – How to Get a Frequency Table of a Categorical Variable as a Data Frame

by Chaitanya Sagar
June 10, 2019
in Data Science, Data Science 101, Resources

Categorical data is data that takes its values from a predefined set. Recording a person’s age as “Child”, “Adult” or “Senior” instead of as a number is one example of treating age as categorical. However, before using categorical data, one must know about its various forms.




First of all, categorical data may or may not have a defined order. To say that the size of a box is small, medium or large means that there is an order defined as small<medium<large. The same does not hold for, say, sports equipment, which could also be categorical data but is differentiated by names like dumbbell, grippers or gloves; that is, you cannot order the items on any meaningful basis. Categories which can be ordered are known as “ordinal”, while those where there is no such ordering are “nominal” in nature.
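The distinction can be sketched in base R with factors. This is a minimal illustration, not part of the original walkthrough; the vectors `box_size` and `equipment` are made-up example data.

```r
# Ordinal: box sizes with a defined order small < medium < large
box_size <- factor(c("medium", "small", "large", "small"),
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)

# Nominal: sports equipment, where no ordering is defined
equipment <- factor(c("dumbbell", "gloves", "grippers", "gloves"))

is.ordered(box_size)   # TRUE
is.ordered(equipment)  # FALSE

# Comparisons are only meaningful for the ordered factor
smaller_than_large <- box_size < "large"
smaller_than_large     # TRUE TRUE FALSE TRUE
```

Declaring the order up front lets R compare and sort the levels by their meaning rather than alphabetically.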

Analysts often convert numerical data to categorical data to make things easier. Besides using an “Adult”, “Child” or “Senior” class instead of age as a number, there can also be special cases such as using “regular item” or “accessory” for equipment. In many problems, the output is also categorical: whether a customer will churn or not, whether a person will buy a product or not, whether an item is profitable, and so on. All problems where the output is categorical are known as classification problems. R provides various ways to transform and handle categorical data.

A simple way to transform data into classes is to use the split and cut functions available in R, or the cut2 function in the Hmisc library.
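As a small sketch of the age example, cut() can turn a numeric age into the three classes. The ages and the thresholds (under 18 is a child, 65 and over is a senior) are assumptions chosen for illustration only.

```r
# Hypothetical ages; the breaks below are illustrative assumptions
age <- c(4, 17, 18, 35, 64, 65, 80)

# right = FALSE makes the intervals [0,18), [18,65), [65,Inf)
age_group <- cut(age,
                 breaks = c(0, 18, 65, Inf),
                 labels = c("Child", "Adult", "Senior"),
                 right = FALSE)

data.frame(age, age_group)
```

With right = FALSE a person aged exactly 18 falls into “Adult” and one aged exactly 65 into “Senior”; the default right = TRUE would place them in the lower class instead.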




Let’s use the iris dataset to categorize data. This dataset ships with R and can be used directly by name; the ‘attach’ function additionally makes its columns accessible by name. The dataset consists of 150 observations of 5 features – Sepal Length, Sepal Width, Petal Length, Petal Width and Species.

attach(iris) #Put the columns of the built-in iris dataset on the search path

x=iris #store a copy of the dataset into x

#using the split function
list1=split(x, cut(x$Sepal.Length, 3)) #This will create a list of 3 data frames split on the basis of Sepal.Length
summary(list1) #View the class ranges for list1
          Length Class      Mode
(4.3,5.5] 6      data.frame list
(5.5,6.7] 6      data.frame list
(6.7,7.9] 6      data.frame list

#using Hmisc library
library(Hmisc)
list2=split(x, cut2(x$Sepal.Length, g=3)) #This will also create a similar list but with the left boundary included
summary(list2) #View the class ranges for list2
          Length Class      Mode
[4.3,5.5) 6      data.frame list
[5.5,6.4) 6      data.frame list
[6.4,7.9] 6      data.frame list

The first list, list1, divides the dataset into 3 groups by splitting the range of sepal length into equal-width intervals. The second list, list2, also divides the dataset into 3 groups based on sepal length, but it tries to keep an equal number of values in each group. We can check this using the range function.

#Range of sepal.length
range(x$Sepal.Length) #The output is 4.3 to 7.9
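The contrast between equal-width and equal-count binning can also be sketched in base R alone. Here quantile() stands in for cut2(g=3); this is an approximation, so its boundaries and counts may differ slightly from what Hmisc reports. The names `sl`, `equal_width` and `equal_freq` are illustrative.

```r
sl <- iris$Sepal.Length

# Equal-width bins: the range 4.3 to 7.9 is split into 3 intervals of the same width
equal_width <- cut(sl, 3)

# Approximately equal-count bins: break at the tertiles of the data
tertiles <- quantile(sl, probs = c(0, 1/3, 2/3, 1))
equal_freq <- cut(sl, breaks = tertiles, include.lowest = TRUE)

table(equal_width)  # group sizes vary, interval widths are equal
table(equal_freq)   # group sizes are closer to each other, widths vary
```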

We can see that list1 consists of three groups: the first has the range 4.3-5.5, the second 5.5-6.7 and the third 6.7-7.9, so each group spans the same width of 1.2. The groups in list2 have the ranges 4.3-5.5, 5.5-6.4 and 6.4-7.9. That is the one difference between the two outputs: list1 keeps the range of the three groups equal, while list2 keeps the number of values in each group balanced. An alternative to the code above is to just add the group range as another feature in the dataset:

x$class <- cut(x$Sepal.Length, 3) #Add the equal-width class label instead of creating a list of data
x$class2 <- cut2(x$Sepal.Length, g=3) #Add the equal-count class label; g must be named, as the second positional argument of cut2 is cuts

If the classes are to be indexed as numbers 1,2,3… instead of their actual range, we can convert the output to numeric and store it in a new feature. The indexes are also easier to work with than the range of each group.

x$group=as.numeric(x$class) #Store the class index as a new feature, keeping class as it is

In our example, each value of group will now be one of 1, 2 or 3. Suppose we now want to find the number of values in each class. How many rows fall into class 1? Or class 2? We can use the table() function present in R to give us that count.

class_length=table(x$group)
class_length #The sizes are 59,71 and 20 as indicated in the output below
1  2  3
59 71 20
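The numeric indexes used above are the positions of the factor levels, so the original range labels can always be recovered from them. A minimal base-R sketch (the names `cls`, `idx` and `recovered` are illustrative):

```r
cls <- cut(iris$Sepal.Length, 3)  # factor with three range labels
idx <- as.numeric(cls)            # position of each label: 1, 2 or 3

levels(cls)                       # the labels, in index order

# Index i maps back to levels(cls)[i]
recovered <- levels(cls)[idx]
identical(recovered, as.character(cls))  # TRUE

table(idx)                        # same counts as the table of ranges
```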

This is a good way to get a quick summary of the classes and their sizes. However, this is where it ends. We cannot easily make further computations or use this information in our dataset. Moreover, class_length is a table and needs to be transformed to a Data Frame before it can be useful. The issue is that transforming a table into a Data Frame names the variables Var1 and Freq, as the table does not retain the original feature name.

#Transforming the table to a Data Frame
class_length_df=as.data.frame(class_length)
class_length_df #The output is:
  Var1 Freq
1    1   59
2    2   71
3    3   20
#Here we see that the variable is named as Var1. We need to rename the variable using the names() function
names(class_length_df)[1]="group" #Changing the first variable Var1 to group
class_length_df
  group Freq
1     1   59
2     2   71
3     3   20

In this case, where we have only a few variables, renaming is easy, but in a large dataset it is risky: one can accidentally rename another important feature.
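One way to sidestep the positional rename in base R is to name the variable when the table is built. The sketch below assumes the same equal-width grouping as above; the responseName argument of as.data.frame() for tables sets the name of the count column.

```r
x <- iris
x$group <- as.numeric(cut(x$Sepal.Length, 3))

# Naming the argument to table() carries the name into the Data Frame,
# so no positional rename of Var1 is needed
class_length_df <- as.data.frame(table(group = x$group),
                                 responseName = "freq")

names(class_length_df)  # "group" "freq"
class_length_df
```

Because the name is attached at creation, there is no index such as [1] that could point at the wrong column.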

As is often the case in R, there is more than one way to do the same thing. All this hassle could have been avoided if there were a function that generates the class sizes as a Data Frame to start with. The “plyr” package has the count() function, which accomplishes this task. Using it is as simple as passing the original Data Frame and the name of the variable we want to count.

#Using the plyr library
library(plyr)
class_length2=count(x,"group") #Using the count function
class_length2 #The output is:
  group freq
1     1   59
2     2   71
3     3   20

The same output, in fewer steps. Let’s verify our output:

#Checking the data type of class_length2

class(class_length2) #Output is data.frame

The plyr package is very useful when it comes to categorical data. As we see, the count() function is really flexible and can generate the Data Frame we want. It is now easy to add the frequency of the categorical data to the original Data Frame x.
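Attaching the frequencies to the original Data Frame can be sketched with merge(). To keep the example free of package dependencies, the counts are built here with base R as a stand-in for the count() result; the names `group_freq` and `x_with_freq` are illustrative.

```r
x <- iris
x$group <- as.numeric(cut(x$Sepal.Length, 3))

# Frequency Data Frame with columns group and freq (base-R stand-in for count())
group_freq <- as.data.frame(table(group = x$group), responseName = "freq")
# table() returns the group as a factor; convert it back to numeric for the join
group_freq$group <- as.numeric(as.character(group_freq$group))

# Add the frequency of each row's group as a new column
x_with_freq <- merge(x, group_freq, by = "group")

head(x_with_freq)
```

Note that merge() may reorder the rows; if the original order matters, ave(x$group, x$group, FUN = length) returns the same frequencies in place.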

Comparison

The table() function is really useful as a quick summary and, with a little work, can produce an output similar to that given by the count() function. When we go a little further towards N-way tables, the table() output transformed to a Data Frame works much like the count() function:

#Using the table for 2 way
two_way=as.data.frame(table(subset(x,select=c("class","class2"))))
two_way
      class    class2 Freq
1 (4.3,5.5] [4.3,5.5)   52
2 (5.5,6.7] [4.3,5.5)    0
3 (6.7,7.9] [4.3,5.5)    0
4 (4.3,5.5] [5.5,6.4)    7
5 (5.5,6.7] [5.5,6.4)   49
6 (6.7,7.9] [5.5,6.4)    0
7 (4.3,5.5] [6.4,7.9]    0
8 (5.5,6.7] [6.4,7.9]   22
9 (6.7,7.9] [6.4,7.9]   20

two_way_count=count(x,c("class","class2"))
two_way_count
      class    class2 freq
1 (4.3,5.5] [4.3,5.5)   52
2 (4.3,5.5] [5.5,6.4)    7
3 (5.5,6.7] [5.5,6.4)   49
4 (5.5,6.7] [6.4,7.9]   22
5 (6.7,7.9] [6.4,7.9]   20

The two results contain the same counts, but there is a noticeable difference: count() omits the combinations whose frequency is zero, whereas table() reports every possible combination of the variables. The count() output is therefore more compact. What if we want the N-way frequency table of the entire Data Frame? In this case, we can simply pass the entire Data Frame into the table() or count() function. However, table() can be very slow here, because it computes a frequency for every possible combination of feature values, whereas count() only calculates and displays the combinations that actually occur.

#For the entire dataset
full1=count(x) #much faster
full2=as.data.frame(table(x))
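The size gap can be estimated without building the full table. The sketch below uses base R only and the untouched iris data; it compares the number of cells table() would have to fill with the number of distinct rows that actually occur, and shows that dropping zero-frequency rows from a table-based Data Frame gives the compact form. All variable names are illustrative.

```r
# Cells in a full N-way table: product of the distinct values per column
n_cells <- prod(sapply(iris, function(col) length(unique(col))))

# Combinations that actually occur: the distinct rows
n_observed <- nrow(unique(iris))

c(cells = n_cells, observed = n_observed)

# Two-way case: keep only the non-zero rows of the table-based Data Frame
x <- iris
x$class <- cut(x$Sepal.Length, 3)
two_way <- as.data.frame(table(class = x$class, Species = x$Species))
nonzero <- two_way[two_way$Freq > 0, ]

nrow(two_way)  # every combination of class and Species
nrow(nonzero)  # only the combinations present in the data
```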

What if we want to display our data in a cross-tabulated format instead of displaying as a list? We have a function xtabs for this purpose.

cross_tab = xtabs(~ class + class2, x)
cross_tab
           class2
class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
  (4.3,5.5]        52         7         0
  (5.5,6.7]         0        49        22
  (6.7,7.9]         0         0        20

However, the object returned by xtabs is not a Data Frame; its class is “xtabs” “table”.

class(cross_tab)
[1] "xtabs" "table"

Converting it to a Data Frame regenerates the same output as the table() function does.

y=as.data.frame(cross_tab)
y
      class    class2 Freq
1 (4.3,5.5] [4.3,5.5)   52
2 (5.5,6.7] [4.3,5.5)    0
3 (6.7,7.9] [4.3,5.5)    0
4 (4.3,5.5] [5.5,6.4)    7
5 (5.5,6.7] [5.5,6.4)   49
6 (6.7,7.9] [5.5,6.4)    0
7 (4.3,5.5] [6.4,7.9]    0
8 (5.5,6.7] [6.4,7.9]   22
9 (6.7,7.9] [6.4,7.9]   20

There is another difference when we use cross-tabulated output for N-way classification with N≥3. As only 2 features can be shown in one cross-tabulation, xtabs splits the data on the third variable and displays a separate cross-tabulated output for each of its values. Illustrating the same for class, class2 and Species:

threeway_cross_tab = xtabs(~ class + class2 + Species, x)
threeway_cross_tab

, , Species = setosa

          class2
class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
(4.3,5.5]        45         2         0
(5.5,6.7]         0         3         0
(6.7,7.9]         0         0         0

, , Species = versicolor

          class2
class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
(4.3,5.5]         6         5         0
(5.5,6.7]         0        28         8
(6.7,7.9]         0         0         3

, , Species = virginica

          class2
class       [4.3,5.5) [5.5,6.4) [6.4,7.9]
(4.3,5.5]         1         0         0
(5.5,6.7]         0        18        14
(6.7,7.9]         0         0        17

The output becomes larger and more difficult to read as N increases for an N-way cross-tabulated output. In this situation again, the count() function seamlessly produces a clean output that is easy to read.

threeway_cross_tab_df = count(x, c('class', 'class2', 'Species'))
threeway_cross_tab_df
      class    class2    Species freq
1  (4.3,5.5] [4.3,5.5)     setosa   45
2  (4.3,5.5] [4.3,5.5) versicolor    6
3  (4.3,5.5] [4.3,5.5)  virginica    1
4  (4.3,5.5] [5.5,6.4)     setosa    2
5  (4.3,5.5] [5.5,6.4) versicolor    5
6  (5.5,6.7] [5.5,6.4)     setosa    3
7  (5.5,6.7] [5.5,6.4) versicolor   28
8  (5.5,6.7] [5.5,6.4)  virginica   18
9  (5.5,6.7] [6.4,7.9] versicolor    8
10 (5.5,6.7] [6.4,7.9]  virginica   14
11 (6.7,7.9] [6.4,7.9] versicolor    3
12 (6.7,7.9] [6.4,7.9]  virginica   17

The same output is presented in a concise way by count(). The count() function in plyr package is thus very useful when it comes to counting frequencies of categorical variables.
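If plyr is not available, a similar long-format frequency Data Frame can be sketched in base R with aggregate(). This is an alternative technique, not the count() function itself; the helper column `n` and the name `freq_df` are hypothetical, and Species is used as the second variable to avoid the Hmisc dependency.

```r
x <- iris
x$class <- cut(x$Sepal.Length, 3)
x$n <- 1  # helper column of ones to be summed per combination

# Sum the ones within each observed combination of class and Species
freq_df <- aggregate(n ~ class + Species, data = x, FUN = sum)

class(freq_df)  # "data.frame"
freq_df
```

Like count(), this lists only the combinations that occur in the data, with the original feature names retained.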


