Dataconomy

Understanding Dimensionality Reduction and its Applications

by Dmitry Storcheus
January 27, 2016

Dimensionality reduction as means of feature extraction

Feature extraction is a broad and essential area of data science. Its goal is to extract salient, informative features from input data so that they can be used in predictive algorithms. Modern data scientists work with large amounts of data that are hard to process at once: data can be raw, unstructured, high dimensional, or noisy. Extracting salient features is therefore vital for successful applications of machine learning algorithms. Feature extraction is also a widely discussed research topic. As a primary chair of the Feature Extraction: Modern Questions and Challenges workshop at NIPS-2015, I distinguish three key areas of feature extraction: variable selection, dimensionality reduction, and representation learning. Variable selection is concerned with selecting the best subset of the available variables; dimensionality reduction studies how to shrink the size of data while preserving the most important information; and representation learning aims at learning informative representations of data with neural networks. Here we will focus on dimensionality reduction, with examples and applications.

Principal Component Analysis

There are many diverse examples of high dimensional datasets that are difficult to process at once: videos, emails, user logs, satellite observations, and even human gene expressions. For such data we need to throw away unnecessary and noisy dimensions and keep only the most informative ones. A classic and well-studied algorithm for reducing dimension is Principal Component Analysis (PCA), with its nonlinear extension Kernel PCA (KPCA). Assuming that data is real-valued, the goal of PCA is to project input data onto a lower dimensional subspace, preserving as much variance within the data as possible.

An example of dimensionality reduction: email classification

Let’s set up a specific example to illustrate how PCA works. Assume that you have a database of emails and you want to classify each email as spam/not spam using some numerical machine learning algorithm. To achieve this goal, you construct a mathematical representation of each email as a bag-of-words vector. This is a vector of word counts, where each position corresponds to a specific word from a fixed vocabulary. For a given email, each entry in the bag-of-words vector is the number of times the corresponding word appears in that email (0 if it does not appear at all).
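
As a minimal sketch of this representation, the snippet below counts vocabulary words in a single email. The helper name bag_of_words, the six-word vocabulary, and the sample email are all hypothetical and chosen only for illustration; the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return a list of word counts, one entry per vocabulary word."""
    # Naive tokenization: lowercase and split on whitespace. This is an
    # assumption for the sketch; real pipelines also handle punctuation.
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never appear in the email.
    return [counts[word] for word in vocabulary]


# Hypothetical toy vocabulary and email, for illustration only.
vocabulary = ["lottery", "credit", "pay", "dog", "cat", "tree"]
email = "Pay now to claim your lottery prize and lottery credit"

vector = bag_of_words(email, vocabulary)
print(vector)  # [2, 1, 1, 0, 0, 0]
```

Each email becomes one such vector, and stacking the vectors row by row gives the data matrix used in the rest of the example.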

Assume you have constructed a bag-of-words vector from each email, and as a result you have a sample of bag-of-words vectors x1, …, xm. However, not all dimensions (words) of your vectors are informative for the spam/not spam classification. For instance, the words “lottery”, “credit”, and “pay” would be better features for spam classification than “dog”, “cat”, and “tree”. As a mathematical way to reduce the dimension, we will use PCA.

For PCA you construct the d-by-d covariance matrix of your mean-centered sample x1, …, xm, where d is the number of words in the vocabulary, and compute its eigenvectors and eigenvalues. Next, sort the eigenvalues in decreasing order and choose the top p. Applying PCA to your sample of vectors means projecting them onto the eigenvectors corresponding to the top p eigenvalues. Your output data is then the projection of the original data onto these p eigenvectors, so the dimension of the projected data has been reduced to p.
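
The procedure just described can be sketched in a few lines of NumPy. This is an illustrative sketch, not a production implementation: the function name pca_project and the toy count matrix are hypothetical. It assumes the rows of X are the m samples and the columns are the d words, so the covariance matrix of the features is d-by-d.

```python
import numpy as np


def pca_project(X, p):
    """Project the rows of X (m samples, d features) onto the top-p eigenvectors."""
    # Center the data so that the covariance is measured around the mean.
    X_centered = X - X.mean(axis=0)
    # d-by-d sample covariance matrix (rowvar=False: columns are variables).
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices and returns ascending eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Re-sort in decreasing order of eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Keep the eigenvectors that belong to the top-p eigenvalues and project.
    projected = X_centered @ eigenvectors[:, :p]
    return projected, eigenvalues


# Hypothetical word counts: 5 emails (rows) by 6 vocabulary words (columns).
X = np.array([
    [2, 1, 1, 0, 0, 0],
    [3, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 1, 2, 1, 1],
    [1, 1, 2, 0, 0, 0],
], dtype=float)

Z, eigenvalues = pca_project(X, p=2)
print(Z.shape)  # (5, 2)
```

Each row of Z is the reduced, p-dimensional representation of one email.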

A reader might wonder: what is special about projecting bag-of-words vectors onto the top eigenvectors of the covariance matrix? How does it help to extract the most informative part of the original data? This is illustrated in the 2-dimensional picture below, where the blue points are observations (2-dimensional for simplicity).

[Figure: 2-dimensional observations (blue points) with the 1st and 2nd eigenvector directions]

The eigenvectors of the covariance matrix have a special property: they point in the directions of greatest variance within the data. As you can see in the picture, the 1st eigenvector points in the direction of highest variance, and the 2nd eigenvector points in the direction of highest variance within the subspace orthogonal to the 1st. Thus, projecting onto the top eigenvectors preserves the maximum variance, and, roughly speaking, capturing more variance means capturing more information to analyze.
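
The variance property can be stated more precisely. The notation below is introduced here for illustration and does not appear in the original text: C is the covariance matrix of the data, w is a unit-length direction, and λ1 ≥ λ2 ≥ … are the eigenvalues of C with corresponding orthonormal eigenvectors v1, v2, ….

```latex
\operatorname{Var}(w^\top x) = w^\top C\, w,
\qquad
\max_{\|w\| = 1} w^\top C\, w = \lambda_1 \quad \text{(attained at } w = v_1\text{)},
\qquad
\max_{\|w\| = 1,\ w \perp v_1} w^\top C\, w = \lambda_2 \quad \text{(attained at } w = v_2\text{)}.
```

Consequently, the projection onto the top p eigenvectors retains variance λ1 + … + λp out of the total λ1 + … + λd, which is the largest amount any p-dimensional linear projection can retain.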

[Figure: eigenvalue plots for two datasets. Left, Graph #1: exponential decay at 3. Right, Graph #2: exponential decay at 7.]

Another question is how to choose the number of top eigenvectors to project onto. In my experience, a good way to choose it is to plot the eigenvalues and find the point on the plot where they start to decay exponentially. The eigenvalue plots for two different datasets (left and right) are shown in the charts above. On the left chart the point of exponential decay is 3, and on the right chart it is 7, which means that one should select the top 3 eigenvalues for the left dataset and the top 7 for the right one. You should also think outside the box and check whether PCA is an appropriate method for your problem at all. In the example above, the left plot of eigenvalues shows a fast exponential decay, so PCA is a great fit for that problem. However, the eigenvalues on the right decay almost linearly, so PCA is not recommended there.
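
Reading the decay point off a plot is a visual judgment. As a rough numerical complement, and not the method described above, a common alternative is to keep the smallest number of eigenvalues whose cumulative share of the total variance reaches a chosen threshold. The sketch below assumes that alternative; the function name choose_num_components, the 0.9 threshold, and both eigenvalue lists are hypothetical.

```python
import numpy as np


def choose_num_components(eigenvalues, threshold=0.9):
    """Smallest p whose top-p eigenvalues hold at least `threshold` of the variance."""
    values = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    # Cumulative fraction of total variance explained by the top-k eigenvalues.
    cumulative = np.cumsum(values) / values.sum()
    # First index where the cumulative fraction reaches the threshold.
    p = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point round-off when threshold is close to 1.
    return min(p, len(values))


# Hypothetical spectra: one decaying quickly, one decaying almost linearly.
fast_decay = [10, 6, 3, 0.5, 0.3, 0.2]
linear_decay = [6, 5, 4, 3, 2, 1]

print(choose_num_components(fast_decay))    # 3
print(choose_num_components(linear_decay))  # 5
```

With the fast-decaying spectrum, three of six directions already hold 90% of the variance. With the nearly linear one, five of six are needed, which matches the observation that PCA brings little benefit in that case.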

Finally, after you have computed the low-dimensional PCA projection of your bag-of-words vectors, you can use this projection instead of the original vectors in classification algorithms, such as logistic regression or support vector machines, to classify the emails as spam/not spam. When projections are used instead of the original vectors, algorithm training will be much faster and overfitting will be reduced.
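
To make this last step concrete, here is a small end-to-end sketch: project toy word counts onto the top 2 eigenvectors and train a classifier on the projection. Everything in it is an assumption made for illustration: the count matrix and labels are invented, and logistic regression is implemented with plain gradient descent only to keep the example self-contained. In practice one would normally rely on a library implementation.

```python
import numpy as np


def train_logistic_regression(Z, y, lr=0.1, steps=2000):
    """Fit weights and bias by gradient descent on the mean log loss."""
    w = np.zeros(Z.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(Z @ w + b)))  # sigmoid
        error = probs - y                           # gradient w.r.t. the logits
        w -= lr * (Z.T @ error) / len(y)
        b -= lr * error.mean()
    return w, b


def predict(Z, w, b):
    """Label 1 (spam) when the linear score is positive, else 0."""
    return (Z @ w + b > 0).astype(int)


# Hypothetical counts; columns: lottery, credit, pay, dog, cat, tree.
X = np.array([
    [2, 1, 1, 0, 0, 0],
    [3, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 1, 2, 1, 1],
    [0, 0, 0, 2, 2, 0],
], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = spam, 0 = not spam (invented labels)

# PCA: center, eigendecompose the covariance, keep the top-2 eigenvectors.
X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
top = eigenvectors[:, ::-1][:, :2]  # eigh sorts ascending, so reverse columns
Z = X_centered @ top                # 6 emails, now 2 features each

w, b = train_logistic_regression(Z, y)
predictions = predict(Z, w, b)
print(predictions)
```

The classifier here sees only 2 numbers per email instead of 6. On a real vocabulary with many thousands of words, that reduction is where the training speed-up comes from.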
