Company Information Extractor


2016 was a huge year for corporate watchdog journalism. One need only mention two words - Panama Papers - to know why. And, in their wake, there has been significant growth in journalistic company reporting and tools to support it.

One such tool is the Company Information Extractor from Dato Capital. The Extractor works in-browser, and allows you to pull company and director names from documents. All you have to do is enter a website URL, upload a file, or enter text directly. As it stands, the tool can read and extract information from PDF, Word, Excel, HTML and TXT files.

Once you've entered the content you want to search, the system scans it for mentions of companies and directors, checking them against a database of 14 million companies and 12 million directors that is updated daily. The database covers the United Kingdom, Spain, Luxembourg, Panama, Gibraltar, BVI, Cayman Islands and the Netherlands.

We spoke to Eduardo Amo, Dato Capital’s Managing Director, to find out more.

DDJ: What is the idea behind the Company Information Extractor?

Eduardo Amo: Over the past few years we developed several text processing systems to find connections between companies in different jurisdictions, and we thought it would be useful for our users to take advantage of these systems. Many customers spend a lot of time finding where a company is incorporated, looking for typos in company names and compiling lists of the companies and directors mentioned in large text documents, so we wanted to speed up these processes and open a new way of accessing our database at the same time. By giving public access to the tool we receive very valuable feedback and interesting ideas for improving it.

What types of data does it provide access to?

The tool provides a list of companies and directors with links to their profiles. Along with the names, it lists the country where each company is incorporated and the country where each director holds appointments. We are working on an option for exporting the list in a reusable format such as CSV.

Where is data sourced from?

The data is sourced from the Dato Capital company database, which currently has over 14 million companies and 12 million directors from 8 countries. The sources for the database are mainly public records and company registers.

How can journalists benefit from the tool?

Journalists can save time when they have to look for company or director names in court documents or other unstructured text files such as emails or logs. The tool performs OCR so this can also apply to image files and PDF documents. In addition, they can review their own texts looking for typos.

Daniele Grasso, a data journalist from El Confidencial (Spain), is actively using it. When I asked him how he uses it, he said:

“The data unit at El Confidencial also plays a core part in the biggest investigations developed in the newsroom. A tool like this is going to be really useful for cutting down time-consuming activities such as searching for people and companies in, for example, extensive court records. All the more so at a moment when the use of data is becoming part of the investigative journalism process worldwide.”

Image: Highlight of detected companies in a document.

How was the tool developed?

The tool is part of Dato Capital’s internal text processing systems and is based on approximate string matching algorithms. It’s constantly upgraded as we add more countries and company types to the database. This is very important for two reasons: we can’t detect many companies from jurisdictions that are not in our database, and if the database contains a company whose name is a substring of the name that appears in the document, a false positive result is listed.
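To make the idea of approximate string matching concrete, here is a minimal Python sketch. It is an editorial illustration, not Dato Capital's implementation: the three-entry register, the function names `normalise` and `find_mentions`, and the 0.9 similarity threshold are all assumptions made for the example, and the similarity score comes from the standard library's `difflib.SequenceMatcher` rather than whatever algorithm the Extractor actually uses.

```python
import re
from difflib import SequenceMatcher

# Hypothetical mini-register standing in for a real company database.
KNOWN_COMPANIES = [
    "ACME HOLDINGS LIMITED",
    "FINANCIAL INVESTMENTS LIMITED",
    "NORTHWIND TRADING SL",
]


def normalise(text):
    """Upper-case the text and replace punctuation runs with single spaces."""
    return re.sub(r"\W+", " ", text.upper()).strip()


def find_mentions(text, names=KNOWN_COMPANIES, threshold=0.9):
    """Return (register_name, matched_text, score) for names found approximately."""
    words = normalise(text).split()
    hits = []
    for name in names:
        target = normalise(name)
        size = len(target.split())
        best = None
        # Slide a window with the same word count as the name across the text
        # and keep the most similar window that clears the threshold.
        for i in range(len(words) - size + 1):
            window = " ".join(words[i:i + size])
            score = SequenceMatcher(None, target, window).ratio()
            if score >= threshold and (best is None or score > best[2]):
                best = (name, window, score)
        if best is not None:
            hits.append(best)
    return hits


# A misspelt mention ("Holdngs") is still matched to the register entry.
print(find_mentions("The contract was signed by Acme Holdngs Limited in 2015."))

# The substring problem: the document names a different, longer company,
# but the shorter register name sits inside it and is reported anyway.
print(find_mentions("Payments went to Global Financial Investments Limited."))
```

The second call shows why coverage matters: if the longer name were also in the register, a matcher could prefer the longest match and drop the shorter one, but with the longer company missing there is nothing to tell the two apart.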

How are you working on sourcing data and expanding your country coverage?

We choose every new country depending on our clients’ needs. Worldwide changes in financial regulations, company acts and tax laws affect these decisions. Our primary goal is to save time for our clients, so we also consider where we can improve document delivery time and how well connected the new jurisdiction is to the existing ones.

Our data is sourced from official publications so the list of data sources is already pre-defined. The hard work is in integrating these heterogeneous data models into our systems and in accessing the data. Usually we make agreements with the official sources in order to reduce friction. Scheduling is important because in large countries some datasets can take months to be processed.

As we try to have director search in every country, we have to carefully study each Data Protection Act to comply with them and avoid removal of important business information from our website.

Have you made any interesting discoveries or stumbled upon surprising information during your time working on the tool?

An interesting discovery is the large number of legit companies with generic names such as “FINANCIAL INVESTMENTS LIMITED” or “LA FUNDACION”.

What is your favourite thing about the Extractor?

My favourite thing is its simplicity. You just upload a file and the results are shown, whether it is a TXT, PDF, TIFF or other file. Another thing I like is the positive feedback we are receiving and how it helps us improve our services.

Explore the Company Information Extractor here.