Analyze the WWW and dark web with structured web data feeds.

By Eran Levy, Webhose.io

The web has become an omnipresent force in our lives, with almost every conceivable activity – from learning to shopping to socializing – moving partially or wholly to the online realm. But researchers, data scientists and data journalists see the web as more than the sum total of the services it provides; they see untapped potential for data mining and analysis. After all, if the online realm is where much of life happens, surely the data these activities generate can give us some insight into life itself?

While this notion seems sensible enough, actually analyzing the web presents unique challenges to any would-be analyst. Foremost among these is acquiring the data, which is dispersed across thousands or millions of websites, generally as unstructured textual content. This data needs to be extracted and converted into a machine-readable format before any serious analysis can be performed – which is where Webhose.io and other web data providers come in.

The search for a scalable solution

Data is extracted from websites through a process of crawling the site, grabbing the relevant content, and then parsing it into a format and database where it can be mined for analytical insights.

Generally, there are three main ways one could go about doing this:

  • Build a solution in-house – while it's easy enough to string together a few lines of Python or JavaScript to scrape a specific site, this grows more complex in large-scale operations, which involve many different sites, content formats, languages, and so on.
  • Use a third-party scraping tool – these are pieces of software that someone has already coded, which one can "point" at a certain site or group of sites in order to grab specific data. These solutions are often very effective at transforming web content into a specific, spreadsheet-style format; however, they are naturally limited in scope and require the analyst to pre-define the site or sites they wish to monitor.
  • Rely on a web data-as-a-service provider – these are companies that do all the "heavy lifting" involved in crawling the World Wide Web at scale on their end, and then provide access to the crawled and parsed data via an API. This allows any analyst to access massive amounts of web data by writing a few lines of code, and to easily incorporate this data into other products, research or analysis.
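To make the crawl-grab-parse process concrete, here is a minimal sketch of the "parse" step of an in-house approach, using only the Python standard library. The sample HTML and the `headline` class name are invented for illustration; a real crawler would first fetch pages (for example with urllib or requests) before parsing them.

```python
from html.parser import HTMLParser

# Invented sample markup standing in for a fetched page.
SAMPLE_HTML = """
<html><body>
  <h2 class="headline">Web data at scale</h2>
  <h2 class="headline">Structured feeds for analysts</h2>
</body></html>
"""

class HeadlineParser(HTMLParser):
    """Collect the text of every <h2 class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_data(self, data):
        if self.in_headline and data.strip():
            self.headlines.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
print(parser.headlines)  # ['Web data at scale', 'Structured feeds for analysts']
```

Even this toy version hints at the scaling problem: every site lays out its content differently, so each new source means new parsing rules to write and maintain.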

While the first two solutions are great for smaller-scale operations, or when you're tracking changes in a specific site, a data journalist who wants to understand global trends across the World Wide Web, or to derive insights from the mass of conversations taking place on news sites and in online discussion forums, would probably prefer the third option.

You can read more about the differences between these approaches in this guide to web data extraction.

Leveling the playing field

The applications of broad access to structured web data go even further when one considers that currently only a few massive companies actually have access to this type of data. These would be organizations such as Google and Microsoft that are already crawling the web at a massive scale, and endlessly analyzing and querying it to perfect their search and advertising services.

However, these companies choose to keep the data on their servers and generally do not grant other companies access to the raw datasets, but only to their processed results (for example, the results of a search query on Google). Thus, a team of data scientists or data journalists cannot explore and investigate web data directly, or build their own innovative stories, research or products on this potentially invaluable data source.

The Webhose.io API aims to change this state of affairs and give every interested party access to the same kind of data that the engineering teams at Google and Microsoft have. This is achieved through flexible pricing plans, starting with a free tier that includes 1,000 monthly API calls.
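Working with such an API typically means fetching a JSON response and flattening it into records for analysis. The sketch below parses a sample payload; the field names (`posts`, `site`, `title`) are illustrative assumptions modeled on common feed shapes, not the exact Webhose.io response schema, and the payload itself is invented.

```python
import json

# Hypothetical JSON payload shaped like a structured web data feed response.
# Field names are illustrative assumptions, not the documented Webhose.io schema.
SAMPLE_RESPONSE = """
{
  "totalResults": 2,
  "posts": [
    {"title": "Markets rally", "language": "english", "site": "example-news.com"},
    {"title": "New data tools", "language": "english", "site": "example-blog.org"}
  ]
}
"""

def extract_posts(raw_json):
    """Turn an API response into a flat list of (site, title) records."""
    data = json.loads(raw_json)
    return [(post["site"], post["title"]) for post in data.get("posts", [])]

records = extract_posts(SAMPLE_RESPONSE)
print(records)
# [('example-news.com', 'Markets rally'), ('example-blog.org', 'New data tools')]
```

Because the heavy lifting of crawling and parsing happens on the provider's side, the analyst's code can stay this small regardless of how many sites feed the data.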

Even a two-person startup, or a data journalist working from home, can thus access the massive, constantly updated database of crawled web data which powers Webhose's existing customers – including many leading web monitoring and data analytics companies such as Salesforce, Sysomos, Meltwater and over 200 others.

The next frontier: Shedding light on the dark web

In addition to covering news, forums and e-commerce sites on the public web, in 2017 Webhose.io introduced the Dark Web Data Feed, which employs the same crawling and API infrastructure to give cyber analysts and researchers access to structured data from the Tor network. This product is already being used by cybersecurity organizations in the public and private sectors, and can be made available to independent researchers upon request.

About the author

Eran Levy has been working in the big data industry for the past five years, and currently heads data-driven digital marketing at Webhose.io – a leading provider of structured web data at scale.

To start analyzing the web as a massive data repository, create a free Webhose.io account.