ANN: Chatter - a simple library for language processing

Rogan Creswick creswick at gmail.com
Tue Nov 19 23:06:43 GMT 2013


On Tue, Nov 19, 2013 at 10:40 AM, Rogan Creswick <creswick at gmail.com> wrote:

> On Tue, Nov 19, 2013 at 2:48 AM, Grzegorz Chrupała <G.A.Chrupala at uvt.nl> wrote:
>>
>> Regarding working with Text in the Tokenize lib, I'm just wondering,
>> wouldn't it be just as efficient to just use "pack . tokenize .
>> unpack"?
>
>

> I'll try and run some criterion benchmarks of the tokenizers, but it may
> be a few days to a week before I get a chance to do it right.
>

I actually found some time at lunch today, and had a 17-million-token corpus
of Linux mailing list message bodies to tokenize as a test suite.  The
results below are from a run on 1/4 of that corpus (I was in a bit of a
hurry, but criterion is pretty confident in the precision of the timings).

I compared the tokenize :: Text -> [Text] I created with:

strTokenizer :: Text -> [Text]
strTokenizer txt = map T.pack (StrTok.tokenize $ T.unpack txt)

where StrTok.tokenize is the tokenize :: String -> [String] from the
tokenize library.
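
For reference, this is roughly what the harness looks like.  It's a minimal
sketch rather than the exact code I ran; the chatter module name
NLP.Tokenize.Text and the corpus path are assumptions:

import           Criterion.Main    (bench, bgroup, defaultMain, nf)
import           Data.Text         (Text)
import qualified Data.Text         as T
import qualified Data.Text.IO      as TIO
import qualified NLP.Tokenize      as StrTok  -- tokenize :: String -> [String]
import qualified NLP.Tokenize.Text as TxtTok  -- assumed module; tokenize :: Text -> [Text]

-- Wrap the String tokenizer so both tokenizers take Text input.
strTokenizer :: Text -> [Text]
strTokenizer txt = map T.pack (StrTok.tokenize $ T.unpack txt)

main :: IO ()
main = do
  -- Hypothetical path to the 1/4 slice of the mailing-list corpus.
  corpus <- TIO.readFile "lkml-quarter.txt"
  defaultMain
    [ bgroup "tokenizing"
        [ bench "Text Tokenizer"   $ nf TxtTok.tokenize corpus
        , bench "String Tokenizer" $ nf strTokenizer corpus
        ]
    ]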

It looks like using Text directly takes about 19% less time.  Here are the
criterion reports:

benchmarking tokenizing/Text Tokenizer
collecting 100 samples, 1 iterations each, in estimated 468.8980 s
mean: 4.629273 s, lb 4.619905 s, ub 4.641448 s, ci 0.950
std dev: 54.34576 ms, lb 43.32373 ms, ub 83.77020 ms, ci 0.950

benchmarking tokenizing/String Tokenizer
collecting 100 samples, 1 iterations each, in estimated 523.9081 s
mean: 5.697734 s, lb 5.683067 s, ub 5.709531 s, ci 0.950
std dev: 66.58823 ms, lb 42.25746 ms, ub 109.9214 ms, ci 0.950

The specific numbers are, of course, not very important.  The means are
about 1 second apart (4.6 vs. 5.7) and the upper/lower bounds on the means
don't intersect (4.64 vs. 5.68).
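
(That's where the 19% comes from: (5.698 - 4.629) / 5.698 ≈ 0.19, so the Text
version takes about 19% less time, or roughly a 1.23x speedup.)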

Note that these comparisons all assume that you *have* Text rather than
String.  Given that we can often choose, we should also benchmark the String
tokenizer on a String corpus, without the pack/unpack step, to see how much
of the performance difference is due to that layer of wrapping (see the
sketch below).
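
Something along these lines would do it (again just a sketch; the corpus path
is a placeholder, and the corpus is read as a plain String this time):

import           Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified NLP.Tokenize   as StrTok

main :: IO ()
main = do
  -- Same corpus slice as before, but kept as a String so no
  -- pack/unpack conversion is involved at all.
  strCorpus <- readFile "lkml-quarter.txt"
  defaultMain
    [ bgroup "tokenizing"
        [ bench "String Tokenizer (no conversion)" $ nf StrTok.tokenize strCorpus ]
    ]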

--Rogan



> --Rogan
>
>
>
>> --
>> Grzegorz
>>
>>
>> On Mon, Nov 18, 2013 at 10:53 PM, Rogan Creswick <creswick at gmail.com>
>> wrote:
>> > I've been working on a simple NLP library over the past month or two,
>> > and I think it may finally be useful to others.  I would love to hear
>> > comments, criticisms, contributions, etc... ;)
>> >
>> > My main objective was to make it extremely easy to do basic NLP tasks in
>> > Haskell, such as POS tagging and document similarity (and later, Chunking,
>> > NER, co-ref resolution, etc...).
>> >
>> > The best example of this is Part-of-speech tagging with Chatter:
>> >
>> > {{{
>> > cabal install chatter
>> > ghci
>> >> :m +NLP.POS
>> >> t <- defaultTagger
>> >> tagStr t "This is a test."
>> > "This/dt is/bez a/at test/nn ./."
>> > }}}
>> >
>> > Chatter provides POS tagging (with backoff taggers, and a ~83% accurate
>> > trained default tagger), TF-IDF measures, and cosine document similarity.
>> >
>> > It also currently contains an adapted version of the Tokenize library,
>> > because I wanted to tokenize Text.  That's a short-term solution; I haven't
>> > had time to make a patch to the tokenize lib.
>> >
>> > Links:
>> >  - Hackage: http://hackage.haskell.org/package/chatter-0.0.0.2
>> >  - Github: http://github.com/creswick/chatter
>> >
>> > --Rogan
>> >
>> >
>> > _______________________________________________
>> > NLP mailing list
>> > NLP at projects.haskell.org
>> > http://projects.haskell.org/cgi-bin/mailman/listinfo/nlp
>> >
>>
>
>