Syed Reza
1450056758915 PROJECT


There is a seldom-discussed issue in the computational linguistics community: how corpora are created and shared. Professor Andrew Rosenberg of the CUNY Speech Research lab raises it and describes it in great detail in his 2012 paper "Rethinking the Corpus: Moving towards Dynamic Linguistic Resources".

In short, linguistic corpora are datasets of raw audio paired with some form of labeling or markup. Algorithms are trained on these datasets, and future algorithms are benchmarked against them so that results can be compared. The corpus therefore lies at the heart of linguistic research, acting as both the data source and the standard instrument for measuring improvement.

However, there is currently no good way to build this vital resource collaboratively. Independent labs often create these datasets at great cost. There is the cost of gathering the data, which involves finding speakers of various classifications (native, non-native, male, female, etc.) and can involve paying speakers for their time. Then there is the cost of getting highly trained linguists and acute listeners to label the collected speech correctly. Add whatever operational costs are involved, and a speech corpus turns out to be a substantially expensive resource to produce.

Of course, software developers build things collaboratively all the time through version control systems. Such systems have proven to be a flexible way to collaborate on shared resources. Furthermore, web applications like GitHub make Git available to a larger audience, providing friendly views for certain file types such as Markdown.

And so in building Reciprosody, I wrapped the functionality of SVN in a user-friendly web interface. To make corpora more approachable and easier to browse, I wrote a JS view for the popular Praat TextGrid format for annotated speech. Because corpora often involve very large files, I wrote a custom uploader built on a chunked, resumable upload system, which lets the user resume an upload from where he or she left off. These massive uploads are unarchived server-side and checked into a backend SVN repository.
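
To make the idea of a chunked, resumable upload concrete, here is a minimal Python sketch. It is not Reciprosody's actual uploader: `UploadStore`, `upload`, and `checksum` are hypothetical names, and an in-memory dictionary stands in for the HTTP endpoint and the file on the server. The point is only the resume logic. The client asks the server how many bytes it already holds and continues from that offset.

```python
import hashlib


class UploadStore:
    """Hypothetical server side: keeps the bytes received so far per upload."""

    def __init__(self):
        self._received = {}  # upload_id -> bytearray of bytes received so far

    def offset(self, upload_id):
        """Return how many bytes of this upload the server already holds."""
        return len(self._received.get(upload_id, b""))

    def append(self, upload_id, offset, chunk):
        """Accept a chunk only if it starts exactly where the stored data ends."""
        buf = self._received.setdefault(upload_id, bytearray())
        if offset != len(buf):
            raise ValueError(f"expected offset {len(buf)}, got {offset}")
        buf.extend(chunk)
        return len(buf)

    def data(self, upload_id):
        """Return everything received so far for this upload."""
        return bytes(self._received[upload_id])


def upload(store, upload_id, payload, chunk_size=4, fail_after=None):
    """Send payload in chunks, starting from the offset the server reports.

    fail_after simulates a dropped connection after that many chunks.
    Returns True once the whole payload has been received, else False.
    """
    offset = store.offset(upload_id)  # resume point (0 for a fresh upload)
    sent = 0
    while offset < len(payload):
        if fail_after is not None and sent >= fail_after:
            return False  # simulated connection loss
        chunk = payload[offset:offset + chunk_size]
        offset = store.append(upload_id, offset, chunk)
        sent += 1
    return True


def checksum(data):
    """SHA-256 hex digest, used to confirm the reassembled upload is intact."""
    return hashlib.sha256(data).hexdigest()
```

Calling `upload` once with `fail_after=2` and then again without it picks up at the stored offset instead of resending the first chunks. Comparing `checksum` values on both sides is one simple way to confirm the reassembled archive before it is unpacked and checked in.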

Reciprosody aims not only to make building a corpus easier, but also to make it more available to researchers. By providing a platform for faster edits, it is hoped that corpus errors can be corrected more quickly. Versioning the corpus lets researchers report results against different versions of the same corpus. This gives a sliding window of results, in which older versions fall out of use as newer versions take their place. Rather than the corpus remaining a static resource, a platform such as Reciprosody allows it to grow with the times. That's the hope.

Work on Reciprosody was done under the guidance of Professor Andrew Rosenberg at the CUNY Speech Lab. Support for this work is provided by the National Science Foundation.


Reciprosody Source Code