Ann: Chatter - a simple library for language processing

Rogan Creswick creswick at gmail.com
Tue Nov 19 18:40:23 GMT 2013


On Tue, Nov 19, 2013 at 2:48 AM, Grzegorz Chrupała <G.A.Chrupala at uvt.nl> wrote:

> Nice!
>
> Regarding working with Text in the Tokenize lib, I'm just wondering,
> wouldn't it be just as efficient to just use "pack . tokenize .
> unpack"? There is quite a bit of character-by-character processing
> involved in tokenization anyway.
>

I *think* there's a performance benefit to using Text over String, but I
haven't benchmarked the Text version of tokenize against your String
version. They are pretty similar, but IIRC, there were some simplifications
in 2-3 places because the Text API offered some more natural tools for
string manipulation.
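
To make the two options concrete, here is a minimal sketch. It is not the actual Chatter or tokenize code: `tokenizeString` is a hypothetical stand-in that only splits on whitespace, and the names `tokenizeViaString` and `tokenizeText` are made up for illustration. Note that if the String tokenizer returns a list of tokens, the wrapper needs `map pack` rather than a bare `pack`.

```haskell
import qualified Data.Text as T

-- Hypothetical stand-in for a String-based tokenizer (assumption: the real
-- tokenizer does more than split on whitespace).
tokenizeString :: String -> [String]
tokenizeString = words

-- Option 1: reuse the String tokenizer and convert at the boundaries.
tokenizeViaString :: T.Text -> [T.Text]
tokenizeViaString = map T.pack . tokenizeString . T.unpack

-- Option 2: tokenize the Text value directly with the Text API.
tokenizeText :: T.Text -> [T.Text]
tokenizeText = T.words

main :: IO ()
main = do
  let input = T.pack "This is a test ."
  print (tokenizeViaString input)
  print (tokenizeText input)
```

Both functions have the same type, so callers would not notice which one is in use; the difference is only in the intermediate String that option 1 allocates.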

There isn't a lot of content out there on Data.Text performance, but I did
find a reference to an old post of Bryan O'Sullivan's that compares String
(named 'list', if I'm reading it correctly), ByteString and Text:

http://web.archive.org/web/20100222031602/http://www.serpentine.com/blog/2009/12/10/the-performance-of-data-text/

I'll try to run some criterion benchmarks of the tokenizers, but it may be
a few days to a week before I get a chance to do it right.
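
A rough sketch of what such a benchmark could look like, assuming the criterion and text packages are installed. The tokenizers here are whitespace-splitting placeholders, not the real String or Text tokenizers, and the input is synthetic, so any numbers it produces say nothing about the actual libraries.

```haskell
import Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified Data.Text as T

-- Placeholder tokenizers (assumption: the real ones would be imported
-- from the tokenize library and from Chatter instead).
tokenizeString :: String -> [String]
tokenizeString = words

tokenizeText :: T.Text -> [T.Text]
tokenizeText = T.words

main :: IO ()
main = do
  -- Synthetic input; it is built lazily, so the first benchmark iteration
  -- also pays for constructing it.
  let inputS = unwords (replicate 10000 "This is a test .")
      inputT = T.pack inputS
  defaultMain
    [ bgroup "tokenize"
        [ bench "String" $ nf tokenizeString inputS
        , bench "Text" $ nf tokenizeText inputT
          -- The "pack . tokenize . unpack" style wrapper from the question.
        , bench "Text via String" $
            nf (map T.pack . tokenizeString . T.unpack) inputT
        ]
    ]
```

`nf` forces each result to normal form, so the whole token list is evaluated rather than just its first constructor.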

--Rogan



> --
> Grzegorz
>
>
> On Mon, Nov 18, 2013 at 10:53 PM, Rogan Creswick <creswick at gmail.com>
> wrote:
> > I've been working on a simple NLP library over the past month or two, and I
> > think it may finally be useful to others.  I would love to hear comments,
> > criticisms, contributions, etc... ;)
> >
> > My main objective was to make it extremely easy to do basic NLP tasks in
> > Haskell, such as POS tagging and document similarity (and later, chunking,
> > NER, co-ref resolution, etc...).
> >
> > The best example of this is Part-of-speech tagging with Chatter:
> >
> > {{{
> > cabal install chatter
> > ghci
> >> :m +NLP.POS
> >> t <- defaultTagger
> >> tagStr t "This is a test."
> > "This/dt is/bez a/at test/nn ./."
> > }}}
> >
> > Chatter provides POS tagging (with backoff taggers, and a ~83% accurate
> > trained default tagger), TF-IDF measures, and cosine document similarity.
> >
> > It also currently contains an adapted version of the Tokenize library,
> > because I wanted to tokenize Text.  That's a short-term solution; I haven't
> > had time to make a patch to the tokenize lib.
> >
> > Links:
> >  - Hackage: http://hackage.haskell.org/package/chatter-0.0.0.2
> >  - Github: http://github.com/creswick/chatter
> >
> > --Rogan
> >
> >
> > _______________________________________________
> > NLP mailing list
> > NLP at projects.haskell.org
> > http://projects.haskell.org/cgi-bin/mailman/listinfo/nlp
> >
>