ANNOUNCE: brillig 0.3 - not quite the Brill tagger

wren ng thornton wren at freegeek.org
Wed Sep 7 19:37:14 BST 2011


On 9/7/11 12:40 PM, Rogan Creswick wrote:
> Is anyone else interested in supporting the Apache UIMA CAS format(s)?
> I'm not a *huge* fan of the gritty system design details in UIMA (it
> seems absurdly difficult to actually use an analysis engine / pear in
> an application) but at least the file format for annotations is
> somewhat standardized.

Do you have a reference to the specifications? If it's sufficiently 
standardized I could put something together.

I already have parsers for a number of common tagging formats (or parser 
formats treated as mere tagging formats):

* "Brown format", i.e. the format people usually mean when they talk 
about the Brown corpus, rather than the format in which the Brown 
corpus was originally distributed
* CoNLL-X shared task format
* NeGra Export Format for Annotated Corpora, version 3
* TnT

and the beginnings of a framework for being able to swap them around 
without a care. Once I get a break from teaching long enough to post my 
tagger to Hackage, this'll be in there too.
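
To illustrate the idea of swapping formats around, here is a minimal sketch, 
written in Python for brevity. It is not brillig's code or API; every name in 
it (read_brown, read_conllx, write_brown, convert) is hypothetical. It assumes 
"Brown format" means one sentence per line with word/tag tokens, and that 
CoNLL-X input has tab-separated columns with FORM in column 2 and POSTAG in 
column 5, with blank lines between sentences. Everything beyond the word and 
its tag is discarded, which is what treating a parse format as a mere tagging 
format amounts to here.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Token = Tuple[str, str]   # (word, tag)
Sentence = List[Token]


def read_brown(text: str) -> Iterator[Sentence]:
    """Assumed layout: one sentence per line, tokens written as word/tag."""
    for line in text.splitlines():
        sentence = []
        for tok in line.split():
            # Split on the last slash so words such as "1/2" stay intact.
            word, _, tag = tok.rpartition("/")
            sentence.append((word, tag))
        if sentence:
            yield sentence


def read_conllx(text: str) -> Iterator[Sentence]:
    """Tab-separated columns; keep only FORM (col 2) and POSTAG (col 5)."""
    sentence: Sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append((cols[1], cols[4]))
    if sentence:
        yield sentence


def write_brown(sentences: Iterable[Sentence]) -> str:
    """Render sentences back out as word/tag tokens, one sentence per line."""
    return "".join(
        " ".join(f"{word}/{tag}" for word, tag in s) + "\n" for s in sentences
    )


# Hypothetical registries: adding a format means adding one entry here.
READERS: Dict[str, Callable[[str], Iterator[Sentence]]] = {
    "brown": read_brown,
    "conllx": read_conllx,
}
WRITERS: Dict[str, Callable[[Iterable[Sentence]], str]] = {
    "brown": write_brown,
}


def convert(text: str, source: str, target: str) -> str:
    """Read `text` in the source format and re-emit it in the target format."""
    return WRITERS[target](READERS[source](text))
```

Because every reader produces the same list-of-(word, tag) representation, any 
reader can be paired with any writer, so supporting a further format costs one 
reader and one writer rather than a converter per pair of formats.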

However, some formats, like those called "Penn Treebank format", aren't 
standardized well enough to permit a real implementation; everybody's 
Penn POS annotations are different. The treebank format itself is fine; 
it's just the POS formats that are intractable.

-- 
Live well,
~wren


