Dataconomy

The Applications of Machine Learning Through Unstructured Text Data

by Nick Pendar
October 14, 2015
in Artificial Intelligence

Unstructured text data represents the biggest data set available to enterprises, yet most are unable to process the vast amount of data they collect to get any meaningful insight. Up to 80 percent of data available to enterprises is unstructured, and it comes in a variety of forms, such as intellectual property, financial statements, CRM notes, news, analyst reports and social media posts. If this data is analyzed correctly, enterprises stand to gain knowledge on everything from customer sentiment to service level optimization. With the right tools, businesses can implement a wide range of applications that draw on past experiences to make better business decisions in the future.

Enterprises can realize the true potential of their unstructured text data by employing a machine-learning model. If trained on the appropriate data, a machine learning model can be very helpful in streamlining business processes and decision making. However, creating appropriate training sets for the right machine learning problem is easier said than done.

Take, for instance, a recent case study in which supervised machine-learning models were trained to classify tweets with no prior training sets.

At Skytree, we were presented with the problem of tagging tweets by the categories they belonged to. At first glance, this seemed like a classic text categorization problem, but after a deeper look into the data, several major challenges immediately presented themselves. Tweets are very short, so they contain very little lexical signal and make copious use of alternate spellings and abbreviations, which increases noise in the data. To further complicate the task, there were no training sets for the target categories to train a model. One possible approach to this problem, without the use of machine learning, would have been to manually create keyword sets that target each category. This is not desirable, however, because manually creating keyword sets is a time-consuming process that produces inadequate results. The keywords are often incomplete, so they miss potential matches, and they are often ambiguous, which leads to false positives. The task also required a high-precision model that could be deployed to a production environment quickly, and it had to be adaptive so that it would improve with feedback over time.

To create a seed high-precision model, a training set was still needed (i.e., a set of tweets labeled as positive examples of a category and a set labeled as negative examples). Instead of creating the training set manually, we decided to approach the task from a different angle. Starting from a single Wikipedia article, we traversed the knowledge base and collected article titles up to a specified depth, which gave us a large number of keywords to create a training set. For example, to train a classifier for the category “NBA,” we would start traversing Wikipedia at the article NBA and collect titles that would include names of NBA teams, players, stadiums and so on. Tweets containing any of the keywords were defined as positive and the rest as negative. While this is obviously a false assumption, the rationale was that with a large enough training set, we could build a classifier that generalizes beyond the initial keywords. After creating the training set this way, we trained a machine-learning model tuned for high precision and ran the resulting model on new and completely unseen tweets. As suspected, the resulting model did exhibit very high precision on new data; almost every tweet that it labeled as positive was indeed positive.
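
The traversal and labeling steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `LINKS` dictionary is a hypothetical toy link graph standing in for Wikipedia, the function names are invented for the example, and the simple substring matching is a placeholder for whatever matching was actually used.

```python
from collections import deque

# Hypothetical toy link graph standing in for Wikipedia: title -> linked titles.
LINKS = {
    "NBA": ["Los Angeles Lakers", "Boston Celtics", "Basketball"],
    "Los Angeles Lakers": ["LeBron James", "Staples Center"],
    "Boston Celtics": ["TD Garden"],
    "Basketball": ["Olympic Games"],
    "Olympic Games": ["Athens"],
}

def collect_titles(start, max_depth):
    """Breadth-first traversal returning every title within max_depth links of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        title, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for linked in LINKS.get(title, []):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, depth + 1))
    return seen

def weak_label(tweet, keywords):
    """Label a tweet positive (1) if it contains any keyword, otherwise negative (0)."""
    text = tweet.lower()
    return int(any(keyword in text for keyword in keywords))

# Collected titles become the keyword set (lowercased for matching).
keywords = {title.lower() for title in collect_titles("NBA", max_depth=2)}

tweets = [
    "LeBron James was unstoppable tonight",
    "Traffic downtown is terrible again",
]
labels = [weak_label(tweet, keywords) for tweet in tweets]
```

The depth limit is the main knob in this sketch: a larger depth yields more keywords but also drifts further from the category (here, depth 3 would already pull in "Athens").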

The question, however, was, “Is this model any better than the initial keyword set?” In order to answer this question, we trained models on a topic, withheld one or more keywords and checked whether the classifier would label tweets containing those keywords and no other keyword as positive. For example, we trained a classifier for the category “NBA” without using the word “NBA” as a keyword, or a classifier for the category “sports” without using the keyword “baseball.” In all these cases, the classifier was able to recover relevant tweets that contained the omitted keywords and more. Additionally, these initial models had sufficient precision and recall to use in production.
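
A withheld-keyword check of this kind might look like the sketch below. The article does not say which model was used, so a small multinomial naive Bayes classifier is swapped in purely for illustration; the tweets, keywords and function names are all made up for the example.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def weak_label(tweet, keywords):
    """Positive (1) if any token is a keyword, otherwise negative (0)."""
    return int(any(token in keywords for token in tokenize(tweet)))

def train_naive_bayes(tweets, labels):
    """Count tokens per class; assumes both classes appear in the labels."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter(labels)
    for tweet, label in zip(tweets, labels):
        counts[label].update(tokenize(tweet))
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab

def log_odds(tweet, model):
    """Log-odds of the positive class with add-one smoothing; unseen tokens are skipped."""
    counts, docs, vocab = model
    total_pos = sum(counts[1].values()) + len(vocab)
    total_neg = sum(counts[0].values()) + len(vocab)
    score = math.log(docs[1] / docs[0])
    for token in tokenize(tweet):
        if token in vocab:
            p_pos = (counts[1][token] + 1) / total_pos
            p_neg = (counts[0][token] + 1) / total_neg
            score += math.log(p_pos / p_neg)
    return score

# Withhold "nba" from the keyword set used to label the training data.
all_keywords = {"nba", "lakers", "celtics", "knicks"}
withheld = "nba"
training_keywords = all_keywords - {withheld}

training_tweets = [
    "lakers win playoffs game tonight",
    "celtics lose playoffs game in overtime",
    "knicks win game with late dunk",
    "traffic downtown is terrible tonight",
    "new coffee shop opens downtown",
    "rain expected in the city tomorrow",
]
labels = [weak_label(tweet, training_keywords) for tweet in training_tweets]
model = train_naive_bayes(training_tweets, labels)

# Tweets that contain only the withheld keyword and no training keyword.
held_out = ["nba playoffs game was amazing"]
recovered = [tweet for tweet in held_out if log_odds(tweet, model) > 0]
```

In this toy setup the held-out tweet is recovered because words such as "playoffs" and "game" co-occurred with the training keywords. Raising the decision threshold above 0 is one simple way to trade recall for higher precision.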

The results we gathered from this exercise would not have been possible without taking advantage of a large dataset. The dataset provided the machine-learning algorithm with enough linguistic variation and related lexical patterns to allow it to pick up additional reliable signals.

With unstructured text data tasks on large datasets such as tweets, manually curating keywords would take a lifetime to accomplish. External knowledge sources such as Wikipedia can bootstrap the learning process and aid in the curation of training data in cases where the task at hand requires a vast amount of world knowledge otherwise inaccessible to machine learning systems. When large unstructured text datasets are pulled in to create training sets, machine learning can distinguish signal from noise. The key to deriving strong value out of unstructured text datasets is to approach the task with what is available, rather than build manually annotated training data from the ground up.

Tags: Data, Skytree, Twitter
