ANNOUNCE: brillig 0.3 - not quite the Brill tagger

Aleksandar Dimitrov aleks.dimitrov at googlemail.com
Wed Sep 7 19:25:13 BST 2011


On Wed, Sep 07, 2011 at 09:40:43AM -0700, Rogan Creswick wrote:
> 2011/9/7 Eric Kow <eric.kow at gmail.com>:
> > But hopefully I won't have to, because I was actually just saying
> > something incredibly simple and non-technical, that the brillig
> > executable could just provide a thin wrapper around different kinds of
> > taggers (as alternatives to each other, completely disjoint).
> > You know, files go in, tags come out...
> 
> Is anyone else interested in supporting the Apache UIMA CAS format(s)?
> I'm not a *huge* fan of the gritty system design details in UIMA (it
> seems absurdly difficult to actually use an analysis engine / pear in
> an application) but at least the file format for annotations is
> somewhat standardized.

Oh yes, I am, very much so. I've been toying with the notion of writing
something UIMA-equivalent for Haskell. You know, where we can actually make the
type system not be the abominable monstrosity Java forces onto UIMA.

We don't need to provide all the bells and whistles UIMA does, but just the CAS
and some sort of common abstraction for analysis engines and the types you use
in a project to represent annotations.
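
To make that a bit more concrete, here is a minimal sketch of what such a core
might look like. Every name in it (Span, Annotation, CAS, AnalysisEngine, the
toy tokenizer and tagger) is made up for illustration; it is not an existing
library and not UIMA's actual API, just one possible shape under the assumption
that annotations are stand-off spans over a single text.

```haskell
-- Hypothetical sketch: none of these types exist in a published library.

-- | A half-open character span [begin, end) into the document text.
data Span = Span { begin :: Int, end :: Int }
  deriving (Eq, Ord, Show)

-- | A stand-off annotation carrying a project-specific payload 'a'.
data Annotation a = Annotation { annSpan :: Span, annValue :: a }
  deriving (Eq, Show)

-- | A toy CAS: the subject text plus the annotations collected so far.
data CAS a = CAS { casText :: String, casAnnotations :: [Annotation a] }
  deriving (Eq, Show)

-- | An analysis engine reads a CAS and returns an enriched one.
type AnalysisEngine a = CAS a -> CAS a

-- | The text covered by a span.
covered :: CAS a -> Span -> String
covered cas (Span b e) = take (e - b) (drop b (casText cas))

-- | The annotation type a particular project might choose (an assumption).
data Ann = Token | POS String
  deriving (Eq, Show)

-- | Toy engine: marks space-separated words as tokens.
tokenize :: AnalysisEngine Ann
tokenize cas =
  cas { casAnnotations = casAnnotations cas ++ map tok (spans 0 (casText cas)) }
  where
    tok s = Annotation s Token
    spans _ [] = []
    spans i s =
      let (ws, rest) = span (== ' ') s
          (w, rest') = break (== ' ') rest
          b = i + length ws
          e = b + length w
      in if null w then [] else Span b e : spans e rest'

-- | Toy engine: tags every token as "NN" (a placeholder, not a real tagger).
tagAllNN :: AnalysisEngine Ann
tagAllNN cas =
  cas { casAnnotations = anns ++ [Annotation s (POS "NN") | Annotation s Token <- anns] }
  where
    anns = casAnnotations cas

-- | Engines are plain functions, so a pipeline is function composition.
pipeline :: AnalysisEngine Ann
pipeline = tagAllNN . tokenize

main :: IO ()
main = mapM_ report (casAnnotations cas)
  where
    cas = pipeline (CAS "files go in" [])
    report a = putStrLn (covered cas (annSpan a) ++ "\t" ++ show (annValue a))
```

The point of the type parameter is that each project picks its own annotation
type and the compiler checks it, rather than going through a separate type
system description.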

> It would also be nice to provide some sort of a bridge to another rich
> set of NLP libraries, while the Haskell infrastructure is getting off
> the ground.

That would be either NLTK (Python) or OpenNLP (Java).

Either bridge would be a huge effort, though.

> (In a tangential note: This thread has been great for bringing some
> tagging libraries to my attention... I didn't realize there were so
> many options already!)

Consider the TreeTagger:
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

I find it to be one of the best available.

Regards,
Aleks
> > but this was before I looked
> > at the training file format and understood that this is what sequor
> > provides.  Oh well, this probably makes brillig just a bit redundant in
> > infrastructure terms. :-)
> >
> >> For what it's worth, I just trained Sequor  (using several spelling
> >> features as encoded in the data/mlcomp2.features template) on the
> >> initial 90% of the Brown corpus, and tested on the final 10%, and got
> >> an accuracy of 96.2%. Training takes several hours, but tagging runs
> >> at more than 3000 words/second.
> >
> > Cool!
> >
> > PS. can we have a small release with '-rtsopts'?
> >
> > --
> > Eric Kow <http://erickow.com>