Broca is an unconventional tool, developed by Francis Tseng and Alex Spangher, for experimenting with natural language processing (NLP) approaches when conventional libraries fall short.

"Conventional NLP methods—bag-of-words or vector space representations of documents, for example—generally work well, but sometimes not well enough, or worse yet, not well at all. At that point, you might want to try out a lot of different methods that aren’t available in popular NLP libraries," writes Tseng.

To this end, Broca is built around pipes: discrete processing stages, each taking input and producing output, which are chained together into pipelines.

For example, a pipeline that contains a preprocessing pipe and a vectorizing pipe would look like this:


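The example itself appears to be missing from the article, so here is a minimal sketch of what such a pipe-and-pipeline design might look like. The names (`Pipe`, `Preprocess`, `Vectorize`, `Pipeline`) are illustrative and should not be taken as Broca's actual API:

```python
from collections import Counter

class Pipe:
    """Hypothetical base class: a pipe maps one stage's output to the next."""
    def __call__(self, docs):
        raise NotImplementedError

class Preprocess(Pipe):
    def __call__(self, docs):
        # Toy preprocessing: lowercase and strip trailing punctuation.
        return [d.lower().strip('.') for d in docs]

class Vectorize(Pipe):
    def __call__(self, docs):
        # Toy vectorizing: bag-of-words term counts.
        return [Counter(d.split()) for d in docs]

class Pipeline:
    """Chains pipes so each pipe's output feeds the next pipe's input."""
    def __init__(self, *pipes):
        self.pipes = pipes
    def __call__(self, docs):
        for pipe in self.pipes:
            docs = pipe(docs)
        return docs

pipeline = Pipeline(Preprocess(), Vectorize())
vectors = pipeline(['Cats chase mice.', 'Dogs chase cats.'])
```

The appeal of this structure is that stages are interchangeable: swapping in a different tokenizer or vectorizer means replacing one pipe, not rewriting the pipeline.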
At Broca's core, explains Tseng, is the ability to rapidly prototype several NLP approaches that share components. When pipelines share components, Broca bundles them into a multi-pipeline that avoids redundant processing by freezing each pipe's output to disk.

"These frozen outputs are identified by a hash derived from the input data and other factors. If frozen output exists for a pipe and its input, that frozen output is “defrosted” and returned, saving unnecessary processing time. This way, you can tweak different components of the pipeline without worrying about needing to re-compute a lot of data. Only the parts that have changed will be re-computed."
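The freezing scheme Tseng describes resembles disk memoization keyed by a hash of the pipe and its input. A minimal sketch of that pattern (the function names and cache layout here are hypothetical, not Broca's code):

```python
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp()

def freeze_key(pipe_name, data):
    # Identify output by a hash derived from the pipe and its input data,
    # as the quoted description suggests.
    raw = pipe_name.encode() + pickle.dumps(data)
    return hashlib.sha1(raw).hexdigest()

def cached(pipe_name, func, data):
    path = os.path.join(CACHE_DIR, freeze_key(pipe_name, data))
    if os.path.exists(path):
        # Frozen output exists for this pipe and input: "defrost" it.
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = func(data)  # compute, then freeze to disk
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

calls = []
def tokenize(docs):
    calls.append(1)  # track how many times real work happens
    return [d.split() for d in docs]

docs = ['a b', 'c d']
out1 = cached('tokenize', tokenize, docs)
out2 = cached('tokenize', tokenize, docs)  # second call is defrosted from disk
```

Because the key changes whenever the input data changes, only the downstream pipes affected by a tweak get recomputed, which is exactly the property the quote highlights.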


Although still in development, the library currently contains pipes for tokenization (including one implementing disambiguated core semantics), common preprocessors, and more.

Visit the Broca GitHub page to learn more.