Dataconomy

Mind Your Internet Manners: When, Where and How to Crawl for Data

by Ran Geva
March 22, 2018
in BI & Analytics, Technology & IT

A wealth of information, an ocean of data – and more funny cat videos than you could watch in a lifetime. The internet is all that and more, at the service of humanity that seeks to know more, do more and be more than ever before.

Much of that data is out there for the benefit of web users – within limits, of course. Some websites are happy to share their data with others; some aren’t. Websites that provide services to users – stock tips, information about jobs and salaries, TV or movie recommendations – need to fetch the data that visitors will be seeking in order to make the service useful, and they get that data by sending out web crawlers. Other sites privately use the data they collect to advance their own businesses, in turn sharing data they generate for the benefit of mankind.

Table of Contents

  • Trust makes the internet go ’round
  • When trust is broken
  • Walking the thin line

Trust makes the internet go ’round

Of course, a proper web crawling system observes the “rules of the internet road”: the rules laid out in each site’s robots.txt file, along with the convention of not collecting data from pages that sit behind a login (which indicates that the data is not there to be freely collected). As in many other aspects of life, sharing web data is based on trust; bad players who break that trust make things harder for the decent folk who collect data according to the rules. Failure to observe those rules creates internet chaos and destroys the bonds of trust between data owners and the internet community. Plus, it’s just bad manners.
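To make the robots.txt rule concrete, here is a minimal sketch in Python of how a crawler might check a site's rules before fetching a page. It uses the standard library's `urllib.robotparser`. The robots.txt content, the `polite-bot` user agent and the `example.com` URLs are all made up for illustration, and the rules are inlined so the sketch runs without touching the network; a real crawler would download the file from the site it intends to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# A real crawler would fetch this from https://<site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical crawler name, used only for this illustration.
USER_AGENT = "polite-bot"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set the crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, url: str) -> bool:
    """Return True only if the site's rules allow our user agent to fetch url."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

public_ok = may_crawl(parser, "https://example.com/jobs/salaries")
private_ok = may_crawl(parser, "https://example.com/private/accounts")
delay_seconds = parser.crawl_delay(USER_AGENT)

print("public page allowed:", public_ok)
print("private page allowed:", private_ok)
print("requested delay between requests (s):", delay_seconds)
```

A well-mannered crawler would skip any URL for which `may_crawl` returns `False` and wait at least the requested delay between requests. Note that robots.txt says nothing about login walls; staying out of pages that require credentials is a separate courtesy the crawler has to implement itself.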

The internet community needs to keep that trust intact. Crawling data sites democratizes access to information and makes otherwise difficult-to-access data easily available to people who need it. Government data on jobs and salaries, for example, could be used by an employment site to give users an idea of what a realistic salary is for their specific profession, based on experience and location. Investment analysis sites could crawl a site that has information about stock prices, history, trends, etc., and use that data in planning forecasts.

When trust is broken

A recent case involving a company called HiQ Labs and LinkedIn illustrates what can go wrong in the trust relationship. HiQ has been scraping the profiles of LinkedIn users to keep track of their careers, gathering data from public profiles only. LinkedIn took offense at this, claiming that its data was not there for crawlers to “raid.” It should be noted that LinkedIn keeps its data behind a login screen, indicating that it indeed has rules that it expects crawlers to observe. LinkedIn accused HiQ of violating the Computer Fraud and Abuse Act (CFAA), committing the internet equivalent of wire fraud.

HiQ has claimed that it did nothing wrong, and that it did not violate any laws or agreements. According to attorneys for HiQ, “To choke off speech and the precursor of speech, the gathering of facts and the analysis of information, is a dangerous path down which we should not go.” As egregious as the LinkedIn people consider HiQ’s tactics, the court has so far agreed with HiQ, saying that LinkedIn’s claim of violations of the CFAA is out of place. LinkedIn, which feels it has a strong case, is appealing.


Walking the thin line

The case highlights the thin line between legitimate crawling and impolite (at the very least) scraping. As mentioned, the internet is a cooperative in a sense: sites that provide services, as well as others who have a need for data, must cooperate with those providing the data. In the final analysis, the “transaction” of crawling and sharing depends on the goodwill of both sides. Information is there to be used, not abused, and if the latter happens, it ruins it for the rest of us.

It would be a worthy idea for those who believe in responsible crawling to work together to root out those who give them a bad reputation. Done properly, web crawling opens up information, promotes freedom and enhances democracy.
