11/8/2015

Tabula 1.0 released: Tool helps to extract data from PDFs

 

If you have ever received data in a PDF and tried to extract a table of numbers, you know the problem. While running text can be extracted (sort of), data tables end up very messy. Without a tool, journalists are forced to tediously type in the numbers by hand.

Tabula comes to the rescue: The free tool is installed on your computer and runs in a local browser window. How does it work? Upload any PDF, then draw a rectangle over the table you want to extract. The data can then be exported to CSV and opened in any spreadsheet program.
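Once the table is exported, the CSV can also be read by a script instead of a spreadsheet. The following is a minimal sketch, not part of Tabula itself: it uses only Python's standard `csv` module, the helper names `parse_number` and `load_table` are made up for illustration, and it assumes that numbers use a comma as the thousands separator. The sample data is invented and stands in for a real export.

```python
import csv
import io


def parse_number(cell):
    """Convert a cell like '1,234.5' to a float; return the stripped text if it is not numeric."""
    # Assumption: commas are thousands separators, not decimal marks.
    cleaned = cell.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return cell.strip()


def load_table(csv_file):
    """Read all rows from an open CSV file object and convert numeric cells."""
    reader = csv.reader(csv_file)
    return [[parse_number(cell) for cell in row] for row in reader]


# Invented sample standing in for a CSV file exported from Tabula.
sample = io.StringIO('Region,Population\nNorth,"1,234"\nSouth,987\n')
rows = load_table(sample)

for row in rows:
    print(row)
```

With a real export, the `io.StringIO` sample would be replaced by something like `open("export.csv", newline="", encoding="utf-8")`, where the file name is a placeholder for wherever the CSV was saved.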

Tabula has just been updated to version 1.0, after roughly two years of earlier releases. The tool is available for download and installation on Windows, Mac and Linux.

The biggest new feature of version 1.0 is the overhauled user interface. To quote the release notes: "The new interface improves page selection and streamlines a typical user’s workflow."

Please note that Tabula only works with "real" PDFs that contain actual text, not with scanned images. If you have a scanned document, the only option is OCR software. Failing that, you are back to typing in the numbers by hand. Let's hope it does not come to that.

Work on the tool has been supported by the Knight Foundation Prototype Fund. Read the release notes or download the tool from the Tabula homepage.
