A wealth of information, an ocean of data – and more funny cat videos than you could watch in a lifetime. The internet is all that and more, at the service of humanity that seeks to know more, do more and be more than ever before.
Much of that data is out there for the benefit of web users – within limits, of course. Some websites are happy to share their data with others; some aren’t. Websites that provide services to users – stock tips, information about jobs and salaries, TV or movie recommendations – need to fetch the data that visitors will be seeking in order to make the service useful, and they get that data by sending out web crawlers. Other sites privately use the data they collect to advance their own businesses, in turn sharing data they generate for the benefit of mankind.
Trust makes the internet go ’round
Of course, a proper web crawling system observes the “rules of the internet road” – those laid out in each site’s robots.txt file – and avoids collecting data from sites that sit behind logins (a sign that the data is not there to be freely collected). As in many other aspects of life, sharing web data is based on trust; bad players who break that trust make things harder for the decent folk who collect data according to the rules. Failure to observe those rules creates internet chaos and destroys the bonds of trust between data owners and the internet community. Plus, it’s just bad manners.
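For readers who want to see what observing those rules looks like in practice, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The robots.txt content, the `polite-bot` user agent, the `example.com` URLs and the helper names `build_parser` and `may_fetch` are all made up for illustration; a real crawler would download the file from the site it intends to visit rather than use an inlined sample.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
# A real crawler would fetch this file from the target site before crawling.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the site's rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Assumed crawler name and URLs, for illustration only.
print(may_fetch(parser, "polite-bot", "https://example.com/jobs"))
print(may_fetch(parser, "polite-bot", "https://example.com/private/salaries"))
print(parser.crawl_delay("polite-bot"))
```

With this sample file, the crawler would be allowed to fetch the first URL, would skip the second because it falls under the disallowed `/private/` path, and would be asked to wait between requests. Note that robots.txt says nothing about login-protected pages; staying out of those is a separate courtesy the crawler has to implement itself.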
The internet community needs to keep that trust intact. Web crawling democratizes access to information and makes otherwise difficult-to-access data easily available to people who need it. Government data on jobs and salaries, for example, could be used by an employment site to give users an idea of what a realistic salary is for their specific profession based on experience and location. Investment analysis sites could crawl a site that has information about stock prices, history and trends, and use that data in their forecasts.
When trust is broken
A recent case involving a company called HiQ Labs and LinkedIn illustrates what can go wrong in the trust relationship. HiQ has been scraping LinkedIn to keep track of users’ careers, gathering data from public profiles only. LinkedIn took offense at this, claiming that its data was not there for crawlers to “raid.” It should be noted that LinkedIn keeps its data behind a login screen, indicating that it indeed has rules that it expects crawlers to observe. LinkedIn accused HiQ of violating the Computer Fraud and Abuse Act (CFAA), committing the internet equivalent of wire fraud.
HiQ has claimed that it did nothing wrong, and that it did not violate any laws or agreements. According to attorneys for HiQ, “To choke off speech and the precursor of speech, the gathering of facts and the analysis of information, is a dangerous path down which we should not go.” As egregious as the LinkedIn people consider HiQ’s tactics, the court has so far agreed with HiQ, saying that LinkedIn’s claim of violations of the CFAA is out of place. LinkedIn, which feels it has a strong case, is appealing.
Walking the thin line
This highlights the thin line between legitimate crawling and impolite (at the very least) scraping. As mentioned, the internet is a cooperative in a sense – sites that provide services, as well as others who have a need for data, must cooperate with those providing the data. In the final analysis, the “transaction” of crawling and sharing depends on the goodwill of both sides. Information is there to be used, not abused, and when it is abused, it ruins things for the rest of us.
It would be a worthy idea for those who believe in responsible crawling to work together to root out those who give them a bad reputation. Done properly, web crawling opens up information, promotes freedom and enhances democracy.