ANNOUNCE: brillig 0.3 - not quite the Brill tagger

Aleksandar Dimitrov aleks.dimitrov at googlemail.com
Wed Sep 7 19:55:07 BST 2011


On Wed, Sep 07, 2011 at 02:37:14PM -0400, wren ng thornton wrote:
> On 9/7/11 12:40 PM, Rogan Creswick wrote:
> >Is anyone else interested in supporting the Apache UIMA CAS format(s)?
> >I'm not a *huge* fan of the gritty system design details in UIMA (it
> >seems absurdly difficult to actually use an analysis engine / pear in
> >an application) but at least the file format for annotations is
> >somewhat standardized.
> 
> Do you have a reference to the specifications? If it's sufficiently
> standardized I could put something together.
> 
> I already have parsers for a number of common tagging formats (or
> parser formats treated as mere tagging formats):
> 
> * "Brown format", i.e. the format people usually mean when they talk
> about the Brown corpus, rather than the actual format used for
> originally distributing the Brown corpus
> * CoNLL-X shared task format
> * NeGra Export Format for Annotated Corpora, version 3
> * TnT
> 
> and the beginnings of a framework for being able to swap them around
> without a care. Once I get a break from teaching long enough to post
> my tagger to Hackage, this'll be in there too.

UIMA is *much* more than just a framework for unifying tagsets or tagger
interfaces.

It isn't even specifically focused on text or NLP applications. Theoretically,
you could annotate videos or sound files with it. It just provides a common
infrastructure for annotating *data* of any kind.

The basic components are:

- The CAS (Common Analysis Structure), which contains the data and the annotations
- The Analysis Engines (annotators), which read the CAS, then put annotations
  in it. They have access to the complete CAS, which means they have access to
  the previous annotators' output
- The annotation type system, which is defined at compile-time. Annotators can
  consume either raw data or other annotators' annotations, as defined by the
  type system. The type system also defines what the annotations look like, and
  what they can contain. Technically they're plain old Java objects, which means
  they're quite limited (no multiple inheritance, etc.)

A typical process would be: put plain text into the CAS. Run a tokenizer over
it, which will populate the CAS with token annotations. Run a sentence boundary
detector to add sentence annotations. Write a PoS-AE (analysis engine) that
looks at all the token annotations within each sentence annotation and adds
PoS-tag information to the Token objects. Etc.
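
To make that flow concrete, here is a rough sketch in plain Python. It does
not use UIMA and does not mirror its API: the names CAS, Token, Sentence,
tokenizer, sentence_splitter, pos_tagger and run_pipeline are all made up for
illustration, and the "tagger" is a toy rule. It only shows the idea of several
engines reading and extending one shared annotation store.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    """Stand-off annotation: a typed span over the CAS text (hypothetical)."""
    begin: int
    end: int


@dataclass
class Token(Annotation):
    pos: Optional[str] = None  # filled in later by the PoS engine


@dataclass
class Sentence(Annotation):
    pass


@dataclass
class CAS:
    """Toy stand-in for a CAS: the raw data plus every annotation so far."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, ann):
        self.annotations.append(ann)

    def select(self, ann_type, within=None):
        """Annotations of one type, optionally only those inside a span."""
        return [a for a in self.annotations
                if isinstance(a, ann_type)
                and (within is None
                     or (within.begin <= a.begin and a.end <= within.end))]

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]


def tokenizer(cas):
    # Consumes raw text, produces Token annotations.
    for m in re.finditer(r"\w+|[^\w\s]", cas.text):
        cas.add(Token(m.start(), m.end()))


def sentence_splitter(cas):
    # Consumes Token annotations, produces Sentence annotations.
    begin, last = None, None
    for tok in cas.select(Token):
        if begin is None:
            begin = tok.begin
        last = tok
        if cas.covered_text(tok) in {".", "!", "?"}:
            cas.add(Sentence(begin, tok.end))
            begin = None
    if begin is not None:  # trailing text without final punctuation
        cas.add(Sentence(begin, last.end))


def pos_tagger(cas):
    # Consumes Sentence and Token annotations, updates the Tokens in place.
    # The tagging rule is a placeholder, not a real tagger.
    for sent in cas.select(Sentence):
        for tok in cas.select(Token, within=sent):
            word = cas.covered_text(tok)
            if not word[0].isalnum():
                tok.pos = "PUNCT"
            elif word.lower() in {"the", "a", "an"}:
                tok.pos = "DET"
            else:
                tok.pos = "X"


def run_pipeline(text, engines):
    cas = CAS(text)
    for engine in engines:
        engine(cas)  # each engine sees everything added before it
    return cas


cas = run_pipeline("The cat sat. A dog barked!",
                   [tokenizer, sentence_splitter, pos_tagger])
for sent in cas.select(Sentence):
    print([(cas.covered_text(t), t.pos) for t in cas.select(Token, within=sent)])
```

The order of the engines matters here: the sentence splitter relies on the
tokens already being in the CAS, and the tagger relies on both.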

You could read through the official documentation to get an idea of what UIMA is
all about: http://uima.apache.org/documentation.html

There was a publication somewhere about UIMA, written by Tilo Götz and Oliver
Suhre … Ah, here it is:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.5824&rep=rep1&type=pdf

This should get you started. I don't think there's really a "specification" for
UIMA.

Regards,
Aleks