4/3/2016

Libpostal

 

A C library for multilingual street address parsing and normalization.

How many times have you searched for an address that cannot be found? The ways we express places and addresses vary significantly across countries, regions, languages and cultures. For instance, 45 W 5th St and 45 West Fifth Street are the same place, expressed differently. Yet a standardized expression of location is crucial for successful geocoding.

Recognizing this problem, Al Barrentine developed Libpostal, a parser trained on OpenStreetMap data that helps you geocode addresses around the world, regardless of language or differences in how they are written.

The aim of Libpostal is to produce canonical address strings by normalizing abbreviations across languages (as in the 45 W 5th St example above) and to parse them, so that different expressions of an address can be matched to a single geolocation.

To normalize addresses, Libpostal uses per-language dictionaries that map phrases to their associated abbreviations.
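The dictionary idea can be illustrated with a minimal Python sketch. This is a toy, not Libpostal's implementation: the `DICTIONARIES` table, its entries and the `normalize` function are all hypothetical, and the real library may return several candidate expansions rather than a single string.

```python
# Toy per-language dictionaries: abbreviation -> canonical phrase.
# These entries are illustrative assumptions, not Libpostal's actual data.
DICTIONARIES = {
    "en": {"w": "west", "st": "street", "5th": "fifth", "ave": "avenue"},
    "de": {"str": "strasse", "pl": "platz"},
}


def normalize(address, lang):
    """Lowercase, drop commas and periods, and expand known abbreviations."""
    table = DICTIONARIES[lang]
    tokens = address.lower().replace(",", " ").replace(".", " ").split()
    # Tokens without a dictionary entry pass through unchanged.
    return " ".join(table.get(tok, tok) for tok in tokens)
```

With this sketch, `normalize("45 W 5th St", "en")` and `normalize("45 West Fifth Street", "en")` produce the same string, which is what lets the two spellings be treated as one address.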

The team then built a language classifier, using around 80 million language-labelled address strings from OpenStreetMap data as a basis to learn common address features for particular languages. Using this language classifier, Libpostal knows which abbreviation dictionary to draw from when certain linguistic features are detected and, as a result, it is able to connect multilingual and abbreviated expressions of an address with its single geolocation.
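To make the role of the classifier concrete, here is a deliberately simplified Python stand-in. Libpostal's classifier is learned from labelled data; this sketch instead counts hand-picked cue words, and both `LANGUAGE_CUES` and `guess_language` are hypothetical names used only for illustration.

```python
# Hand-picked cue words per language. This is an assumption for illustration;
# a trained classifier would learn such features from labelled addresses.
LANGUAGE_CUES = {
    "en": {"street", "st", "road", "rd"},
    "de": {"strasse", "str", "platz", "weg"},
    "fr": {"rue", "boulevard", "bd", "chemin"},
}


def guess_language(address):
    """Return the language whose cue words match the most tokens, or None."""
    tokens = address.lower().replace(",", " ").replace(".", " ").split()
    scores = {
        lang: sum(tok in cues for tok in tokens)
        for lang, cues in LANGUAGE_CUES.items()
    }
    best = max(scores, key=scores.get)
    # No cue word matched at all: decline to guess.
    return best if scores[best] > 0 else None
```

The returned language code would then select which abbreviation dictionary to apply during normalization.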

Following this learning process, the Libpostal parsing library can parse input text in 60 languages for addresses in more than 100 countries, with a 98.9% accuracy rate.

Although it is not a full geocoder, Libpostal preprocesses free-form address data into normalized forms that other geocoding applications can compare and index.
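As a rough sketch of how a downstream application might index such normalized forms, the Python below groups records by a normalized key. The `ABBREVIATIONS` table, `normalized_key` and `build_index` are hypothetical stand-ins; in practice the key would come from Libpostal's output rather than from this toy normalizer.

```python
from collections import defaultdict

# Illustrative abbreviation table; an assumption, not Libpostal's data.
ABBREVIATIONS = {"w": "west", "st": "street", "5th": "fifth"}


def normalized_key(address):
    """Toy stand-in for a normalizer: lowercase and expand abbreviations."""
    tokens = address.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def build_index(records):
    """Map each normalized key to the ids of the records that share it."""
    index = defaultdict(list)
    for record_id, address in records:
        index[normalized_key(address)].append(record_id)
    return index


records = [(1, "45 W 5th St"), (2, "45 West Fifth Street"), (3, "12 Main St.")]
index = build_index(records)
```

Records 1 and 2 end up under the same key, so a lookup for either spelling finds both.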

Bindings are available for Python, Ruby, Go, Java, PHP, and Node.js, and bindings for other languages can be written depending on your needs.

Libpostal’s source code is available on GitHub, and it has also been integrated into Mapzen Search.

Read more about Libpostal here.
