Table of Contents

### 1. C4.5

** What does it do?** C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified.

**Wait, what’s a classifier?** A classifier is a tool in data mining that takes a bunch of data representing things we want to classify and attempts to predict which class the new data belongs to.

**What’s an example of this?** Sure, suppose a dataset contains a bunch of patients. We know various things about each patient like age, pulse, blood pressure, VO2max, family history, etc. These are called attributes.

*Now:* Given these attributes, we want to predict whether the patient will get cancer. The patient can fall into 1 of 2 classes: will get cancer or won’t get cancer. C4.5 is told the class for each patient.

*And here’s the deal: *Using a set of patient attributes and the patient’s corresponding class, C4.5 constructs a decision tree that can predict the class for new patients based on their attributes.

**Cool, so what’s a decision tree?** Decision tree learning creates something similar to a flowchart to classify new data. Using the same patient example, one particular path in the flowchart could be:

- Patient has a history of cancer
- Patient is expressing a gene highly correlated with cancer patients
- Patient has tumors
- Patient’s tumor size is greater than 5cm

*The bottom line is: *At each point in the flowchart is a question about the value of some attribute, and depending on those values, he or she gets classified. You can find lots of examples of decision trees.

**Is this supervised or unsupervised?** This is supervised learning, since the training dataset is labeled with classes. Using the patient example, C4.5 doesn’t learn on its own that a patient will get cancer or won’t get cancer. We told it first, it generated a decision tree, and now it uses the decision tree to classify.

**You might be wondering how C4.5 is different than other decision tree systems?**

- First, C4.5 uses information gain when generating the decision tree.
- Second, although other systems also incorporate pruning, C4.5 uses a single-pass pruning process to mitigate over-fitting. Pruning results in many improvements.
- Third, C4.5 can work with both continuous and discrete data. My understanding is it does this by specifying ranges or thresholds for continuous data thus turning continuous data into discrete data.
- Finally, incomplete data is dealt with in its own ways.

**Why use C4.5?** Arguably, the best selling point of decision trees is their ease of interpretation and explanation. They are also quite fast, quite popular and the output is human readable.

**Where is it used?** A popular open-source Java implementation can be found over at OpenTox. Orange, an open-source data visualization and analysis tool for data mining, implements C4.5 in their decision tree classifier.

Checkout how I used C5.0 (latest version of C4.5)

### 2. Support vector machines

**What does it do?** Support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At a high-level, SVM performs a similar task like C4.5 except SVM doesn’t use decision trees at all.

**Whoa, a hyper-what?** A hyperplane is a function like the equation for a line, . In fact, for a simple classification task with just 2 features, the hyperplane can be a line.

As it turns out…SVM can perform a trick to project your data into higher dimensions. Once projected into higher dimensions…

…SVM figures out the best hyperplane which separates your data into the 2 classes.

**Do you have an example?** Absolutely, the simplest example I found starts with a bunch of red and blue balls on a table. If the balls aren’t too mixed together, you could take a stick and without moving the balls, separate them with the stick.

*You see: *When a new ball is added on the table, by knowing which side of the stick the ball is on, you can predict its color.

**What do the balls, table and stick represent?** The balls represent data points, and the red and blue color represent 2 classes. The stick represents the hyperplane which in this case is a line.

**And the coolest part? **SVM figures out the function for the hyperplane.

**What if things get more complicated?** Right, they frequently do. If the balls are mixed together, a straight stick won’t work. Here’s the work-around: Quickly lift up the table throwing the balls in the air. While the balls are in the air and thrown up in just the right way, you use a large sheet of paper to divide the balls in the air. You might be wondering if this is cheating: Nope, lifting up the table is the equivalent of mapping your data into higher dimensions. In this case, we go from the 2 dimensional table surface to the 3 dimensional balls in the air.

**How does SVM do this?** By using a kernel we have a nice way to operate in higher dimensions. The large sheet of paper is still called a hyperplane, but it is now a function for a plane rather than a line. Note from Yuval that once we’re in 3 dimensions, the hyperplane must be a plane rather than a line.

I found this visualization super helpful:

Reddit also has 2 great threads on this in the ELI5 and ML subreddits.

**How do balls on a table or in the air map to real-life data?** A ball on a table has a location that we can specify using coordinates. For example, a ball could be 20cm from the left edge and 50cm from the bottom edge. Another way to describe the ball is as (x, y) coordinates or (20, 50). x and y are 2 dimensions of the ball.

*Here’s the deal: *If we had a patient dataset, each patient could be described by various measurements like pulse, cholesterol level, blood pressure, etc. Each of these measurements is a dimension.

*The bottomline is: *SVM does its thing, maps them into a higher dimension and then finds the hyperplane to separate the classes.

**Margins are often associated with SVM?** **What are they?** The margin is the distance between the hyperplane and the 2 closest data points from each respective class. In the ball and table example, the distance between the stick and the closest red and blue ball is the margin.

*The key is: *SVM attempts to maximize the margin, so that the hyperplane is just as far away from red ball as the blue ball. In this way, it decreases the chance of misclassification.

**Where does SVM get its name from?** Using the ball and table example, the hyperplane is equidistant from a red ball and a blue ball. These balls or data points are called support vectors, because they support the hyperplane.

**Is this supervised or unsupervised?** This is a supervised learning, since a dataset is used to first teach the SVM about the classes. Only then is the SVM capable of classifying new data.

**Why use SVM?** SVM along with C4.5 are generally the 2 classifiers to try first. No classifier will be the best in all cases due to the No Free Lunch Theorem. In addition, kernel selection and interpretability are some weaknesses.

**Where is it used?** There are many implementations of SVM. A few of the popular ones are scikit-learn, MATLAB and of course libsvm.

### 3. Apriori

**What does it do?** The Apriori algorithm learns association rules and is applied to a database containing a large number of transactions.

**What are association rules?** Association rule learning is a data mining technique for learning correlations and relations among variables in a database.

**What’s an example of Apriori?** Let’s say we have a database full of supermarket transactions. You can think of a database as a giant spreadsheet where each row is a customer transaction and every column represents a different grocery item.

*Here’s the best part: *By applying the Apriori algorithm, we can learn the grocery items that are purchased together a.k.a association rules.

*The power of this is: *You can find those items that tend to be purchased together more frequently than other items — the ultimate goal being to get shoppers to buy more. Together, these items are called itemsets.

*For example: *You can probably quickly see that chips + dip and chips + soda seem to frequently occur together. These are called 2-itemsets. With a large enough dataset, it will be much harder to “see” the relationships especially when you’re dealing with 3-itemsets or more. That’s precisely what Apriori helps with!

**You might be wondering how Apriori works?** Before getting into the nitty gritty of algorithm, you’ll need to define 3 things:

- The first is the
**size**of your itemset. Do you want to see patterns for a 2-itemset, 3-itemset, etc.? - The second is your
**support**or the number of transactions containing the itemset divided by the total number of transactions. An itemset that meets the support is called a frequent itemset. - The third is your
**confidence**or the conditional probability of some item given you have certain other items in your itemset. A good example is given chips in your itemset, there is a 67% confidence of having soda also in the itemset.

The basic Apriori algorithm is a 3 step approach:

**Join.**Scan the whole database for how frequent 1-itemsets are.**Prune.**Those itemsets that satisfy the support and confidence move onto the next round for 2-itemsets.**Repeat.**This is repeated for each itemset level until we reach our previously defined size.

**Is this supervised or unsupervised?** Apriori is generally considered an unsupervised learning approach, since it’s often used to discover or mine for interesting patterns and relationships.

But wait, there’s more…

Apriori can also be modified to do classification based on labelled data.

**Why use Apriori?** Apriori is well understood, easy to implement and has many derivatives.

On the other hand…

The algorithm can be quite memory, space and time intensive when generating itemsets.

**Where is it used?** Plenty of implementations of Apriori are available. Some popular ones are the ARtool, Weka, and Orange.

This is a condensed republished post. For the full Top 10 list check out Rayli.net.

image credit: Ronnie Parsons

this article? Subscribe to our weekly newsletter to never miss out!