20/6/2016

Portia

 

Scrape websites visually.

Portia is an open-source scraper that helps non-coding journalists pull data from websites. It is a browser-based visual tool that runs as part of the Scrapinghub platform.

To begin, you simply need to create a template by clicking the elements on pages you would like to scrape, and Portia will create a spider to scrape similar pages from the website.

For example, the following video demonstrates how Portia can be used to build a spider for allrecipes.com:

Portia is particularly useful if you need to crawl paginated listings: results that are spread across multiple pages, a common feature of most e-commerce sites. You can keep the crawler from visiting unnecessary pages by using the target categories as the start URLs and adjusting the follow patterns. As a result, Portia extracts data more efficiently than scrapers that require you to crawl unrelated pages.
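To make the idea of start URLs and follow patterns concrete, here is a minimal sketch in plain Python. It is not Portia's own code or API. The site, the URL patterns, and the helper names (`should_follow`, `filter_links`) are all hypothetical, chosen only to illustrate how regular-expression patterns can restrict a crawl to one paginated category.

```python
import re

# Hypothetical values for illustration; not taken from Portia or a real site.
START_URL = "https://shop.example.com/category/laptops"
FOLLOW_PATTERNS = [r"/category/laptops(\?page=\d+)?$"]  # the category and its pages
EXCLUDE_PATTERNS = [r"/login", r"/cart"]                 # pages never worth visiting


def should_follow(url, follow_patterns=FOLLOW_PATTERNS,
                  exclude_patterns=EXCLUDE_PATTERNS):
    """Return True if the URL matches a follow pattern and no exclude pattern."""
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    return any(re.search(p, url) for p in follow_patterns)


def filter_links(links):
    """Keep only the links the crawler should visit, preserving their order."""
    return [link for link in links if should_follow(link)]


if __name__ == "__main__":
    found = [
        "https://shop.example.com/category/laptops?page=2",
        "https://shop.example.com/category/phones?page=2",
        "https://shop.example.com/cart",
    ]
    for link in filter_links(found):
        print(link)
```

Starting from the category page and following only links that match the pattern means the crawler never leaves the listing it was pointed at, which is where the efficiency gain comes from.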

Currently, Portia supports a number of large JavaScript frameworks, such as Backbone, Angular, and Ember, and the team is working on an update that will support React-based websites in the near future.

Visit the Portia webpage here or its GitHub page here.
