17/4/2018

New York Seeks Haikus: A Data Through Design exhibit

 

By Jeremy Neiman, with editing by Abigail Pope-Brooks

Six years ago, New York City passed a law requiring city agencies to make their data publicly available. Since then, over 1,600 datasets have been published on the city’s open data portal, and new data is added constantly.

Open Data Week -- a city-wide endeavor by the NYC Mayor’s Office of Data Analytics -- was a celebration of this progress, and coincided with the Data Through Design exhibit. The goal of this exhibit was to challenge artists to use this open data to tell insightful and interesting stories about the city.

As a native New Yorker, data scientist and civil servant whose job is to use the city’s open data in creative new ways, I was intrigued by this challenge.

It takes 400,000 government employees to keep New York City (the United States’ most populous metro area) running. From maintaining infrastructure and providing emergency services to creating innovative methods for reducing waste, homelessness and crime, it all comes down to them.

So, for Data Through Design, I wanted to tell the story of those 400,000 individuals working behind the scenes and what they do to keep my hometown alive and thriving. Eventually, this concept evolved into a program that algorithmically generated haikus based on job descriptions for city government jobs.

New York City Seeks

The data comes from the NYC Jobs dataset. It contains current job postings available on the City of New York’s official jobs site. Most of the data on the open data portal is about what happens in the city and what is done by the city. This dataset offers a glimpse into who and what is behind those things.

Each job posting contains information such as agency, salary range and posting date. For this project, I used three columns: civil service title, job description and job qualifications. For example:

Additionally, I used a dataset of syllable counts for words, from Jim Kang’s phonemenon. I had to modify the data to include non-standard words such as agency acronyms (e.g. NYPD, 4 syllables, and DOT, 3 syllables).
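A minimal Python sketch of that modification, assuming the syllable counts are held in a plain dictionary. The base entries and the function names are illustrative assumptions, not the actual dataset or the project's code; only the two acronym counts come from the text above.

```python
# Hypothetical stand-in for the external syllable dataset.
BASE_SYLLABLES = {"data": 2, "analysis": 4, "city": 2}

# Manual additions for agency acronyms, which are read letter by letter.
ACRONYM_OVERRIDES = {"nypd": 4, "dot": 3}


def build_syllable_table(base, overrides):
    """Return a new table in which the overrides win over base entries."""
    table = dict(base)
    table.update(overrides)
    return table


def syllables(word, table):
    """Look up a word case-insensitively; return None if it is unknown."""
    return table.get(word.lower())


table = build_syllable_table(BASE_SYLLABLES, ACRONYM_OVERRIDES)
```

Returning None for unknown words lets a caller skip them rather than guess a count.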

Generating haikus

The goal was to use NYC job descriptions to create haikus. Originally a Japanese poetic form, haikus are poems which contain three lines with five syllables in the first line, seven in the second and five in the third.

To produce a haiku, I used a custom Markov chain method. A Markov chain generates a sequence one value at a time: given the current value, it picks the next one according to the probabilities of what follows it. In this case, given a word, what words are likely to follow?

The first step is to determine these probabilities. I divided the data up by civil service title (Computer Systems Manager, Painter, Civil Engineer, etc.) and, for each title, built a separate corpus from the text in the job description and preferred skills fields. Then I split the corpus into sentences, divided each sentence into words and counted the number of times each word followed another.
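The counting step might look like the following Python sketch. The tokenization rules (splitting sentences on ., ! and ?, lowercasing, keeping letters, digits and apostrophes) and the function names are assumptions made for illustration; the original code may differ. The function handles one corpus, so it would be called once per civil service title.

```python
import re
from collections import Counter, defaultdict


def split_sentences(text):
    """Naive sentence splitter: break on runs of '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def build_transitions(texts):
    """Map each word to a Counter of the words that follow it.

    Pairs are counted within sentences only, so the last word of one
    sentence is never linked to the first word of the next.
    """
    transitions = defaultdict(Counter)
    for text in texts:
        for sentence in split_sentences(text):
            words = re.findall(r"[a-z0-9']+", sentence.lower())
            for current, following in zip(words, words[1:]):
                transitions[current][following] += 1
    return transitions
```

Dividing each count by the total for its row would give the transition probabilities; the raw counts are enough for weighted sampling.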

The following example shows the most common words to follow “data” for a Computer Systems Manager. Given a table like this, the Markov chain will pick a random next word weighted by the probabilities. It will then take that resulting word and repeat the process again and again.
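A weighted pick of this kind can be sketched in Python with the standard library's random.choices. The function name and the follower counts used with it are hypothetical, not taken from the dataset.

```python
import random


def pick_next(counts, rng):
    """Pick one follower word at random, weighted by how often it was seen.

    counts: mapping of follower word -> number of times it followed the
            current word.
    rng:    a random.Random instance, passed in so runs can be reproduced.
    """
    words = list(counts)
    weights = [counts[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


# Hypothetical follower counts for the word "data".
followers_of_data = {"analysis": 9, "entry": 1}
next_word = pick_next(followers_of_data, random.Random(0))
```

Because the weights are raw counts, a word seen nine times is nine times as likely to be picked as a word seen once.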

Since I was generating haikus, I had a strict syllable constraint. So, I only considered the next word if it fit within the syllable limit. For example, if I was on the first line (5 syllables) and my current word was “data”, I wouldn’t choose “analysis” or “integration” as the next word because either would put the line over 5 syllables.
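As a rough Python sketch of that filter, assuming follower counts and syllable counts are both plain dictionaries (the names and numbers here are illustrative, not the project's own):

```python
def fitting_followers(counts, syllable_table, syllables_left):
    """Keep only followers whose syllable count fits in what is left of the line.

    Words missing from the syllable table are dropped, since there is no
    way to know whether they fit.
    """
    return {
        word: count
        for word, count in counts.items()
        if word in syllable_table and syllable_table[word] <= syllables_left
    }


# "data" has used 2 of the first line's 5 syllables, so 3 remain.
syllable_table = {"analysis": 4, "integration": 4, "entry": 2, "is": 1}
followers = {"analysis": 5, "integration": 3, "entry": 2, "is": 4}
allowed = fitting_followers(followers, syllable_table, syllables_left=3)
```

The surviving counts can then be used as weights for the random pick.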

During this process the generator would at times write itself into an impossible state, where no word that could follow fit within the syllable limit. In this case, it would go back and try a different word to see if that led to a valid haiku.
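One way to implement this kind of retry is depth-first backtracking, sketched below in Python. This is an illustration under stated assumptions, not the author's code: the function name, the tiny transition and syllable tables, and the rule that a word may not straddle two lines are all assumptions, and every syllable count is assumed to be at least 1.

```python
import random

LINE_LENGTHS = (5, 7, 5)


def generate_haiku(transitions, syllable_table, start_word, rng):
    """Return three lists of words, or None if no haiku starts with start_word."""
    if start_word not in syllable_table:
        return None
    lines = [[], [], []]

    def extend(word, line_index, used):
        used += syllable_table[word]
        if used > LINE_LENGTHS[line_index]:
            return False  # the word overflows the current line
        lines[line_index].append(word)
        next_line, next_used = line_index, used
        if used == LINE_LENGTHS[line_index]:
            next_line, next_used = line_index + 1, 0
            if next_line == len(LINE_LENGTHS):
                return True  # all three lines are exactly full
        followers = transitions.get(word, {})
        candidates = [w for w in followers if w in syllable_table]
        weights = [followers[w] for w in candidates]
        while candidates:
            # Weighted pick without replacement, so each follower is tried once.
            i = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
            choice = candidates.pop(i)
            weights.pop(i)
            if extend(choice, next_line, next_used):
                return True
        lines[line_index].pop()  # dead end: take the word back
        return False

    return lines if extend(start_word, 0, 0) else None


# Tiny hypothetical tables; "key" and "integration" are deliberate dead ends.
SYLLABLES = {"ensure": 2, "data": 2, "is": 1, "integration": 4, "key": 1,
             "accurate": 3, "neat": 1, "timely": 2, "and": 1, "ready": 2,
             "for": 1, "audit": 2}
TRANSITIONS = {"ensure": {"data": 1}, "data": {"is": 1, "integration": 5},
               "is": {"accurate": 1, "key": 1}, "accurate": {"neat": 1},
               "neat": {"timely": 1}, "timely": {"and": 1},
               "and": {"ready": 1}, "ready": {"for": 1}, "for": {"audit": 1}}

haiku = generate_haiku(TRANSITIONS, SYLLABLES, "ensure", random.Random(0))
```

Removing each tried candidate from the pool guarantees the search terminates, because no word is retried at the same position.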

The haikus came out of the process in a raw state: all lower case, no punctuation, and sometimes they just weren’t very good. The biggest problem I found was that, because haikus are so short, the results were often incomplete thoughts.

The Markov chain would be in the middle of a sentence when it hit the syllable count and stopped short. I tried to correct for that by only ending with words that could be logical ending words, but it didn’t work for every situation.
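The text does not say how “logical ending words” were identified. One plausible approach, sketched here in Python purely as an assumption, is to accept a final word only if it ends at least one sentence in the corpus. The function names and sample sentences are hypothetical.

```python
import re


def sentence_final_words(texts):
    """Collect every word that ends at least one sentence in the corpus."""
    endings = set()
    for text in texts:
        for sentence in re.split(r"[.!?]+", text):
            words = re.findall(r"[a-z0-9']+", sentence.lower())
            if words:
                endings.add(words[-1])
    return endings


def ends_plausibly(haiku_lines, endings):
    """True if the haiku's last word was seen ending a sentence."""
    return haiku_lines[-1][-1] in endings


corpus = ["Work with project leads to create a design and construction process."]
endings = sentence_final_words(corpus)
```

A check like this only looks at the last word, so it cannot tell a complete sentence from a fragment that happens to end on the same word.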

For example, both of the following end with “design and construction process” but only the first is a complete sentence:

 

Work with project leads

to create a design and

construction process

 

and

 

The new york city

department of design and

construction process

 

Some of these results were actually amusing:

 

The environment

and the environment and

the environment.

 

and

 

The city of New

York City: open data,

open government.

 

Through a semi-manual, semi-automated editing process, I cleaned up the haikus to get presentable results. The final piece had 750+ haikus.

Adding audience participation

I now had the ability to generate haikus, but how should I present them? I wanted to give some insight into how the algorithm worked. To do that, I decided to show the haikus as they were iteratively created. Each word the algorithm tries is shown, and when it hits a fail state it deletes words and tries again.
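A display like this could be driven by a log of the generator's steps. The Python sketch below assumes a hypothetical event format of ("add", word) and ("delete", None) tuples and replays it into the successive states of the visible text; it is not the exhibit's actual display code.

```python
def replay(events):
    """Yield the visible text after each event.

    events: iterable of ("add", word) or ("delete", None) tuples, in the
            order the generator tried words and backed out of dead ends.
    """
    words = []
    for action, word in events:
        if action == "add":
            words.append(word)
        elif action == "delete":
            words.pop()
        else:
            raise ValueError(f"unknown action: {action!r}")
        yield " ".join(words)


# Hypothetical trace: "key" leads to a dead end and is removed again.
trace = [("add", "ensure"), ("add", "data"), ("add", "key"),
         ("delete", None), ("add", "is")]
frames = list(replay(trace))
```

Each frame could then be drawn with a short pause in between to animate the search.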

Here is how that looks:

Exhibit organizer Michelle Ho suggested printing the haikus on labels as they were generated, so that Data Through Design exhibit goers could take them home as souvenirs.

I added a button that would generate and print the next haiku when pressed:

The result was a whimsical and interactive look at some of the responsibilities and skills of the city’s civil servants.

Here are a few favorites:

The New York City

government is a plus but

is not required.

 

Finance is seeking

a dynamic intern to

function as a team.

 

Ensure data is

accurate, neat, timely and

ready for audit.

 

Profound knowledge of

trunk water and wastewater

collection system.

 

Who wants to become

part of an information

security staff?

About the author

Jeremy Neiman is a carbon-based bipedal life form descended from an ape from an utterly insignificant little blue-green planet orbiting a small yellow sun in the western spiral arm of the Milky Way. He is an award-winning poet, best known for his entry to the Martin Luther King Poetry Contest in the fourth grade. Having mastered poetry at such a young age, he moved on to new and different pursuits. Jeremy currently works as a data scientist at the NYC Department of Sanitation (DSNY) and previously worked for IBM's Smarter Cities initiative. His other endeavors can be found on his website here.

Explore the code and the haikus here.
