ANN: Chatter - a simple library for language processing

Rogan Creswick creswick at gmail.com
Tue Nov 19 23:45:10 GMT 2013


On Tue, Nov 19, 2013 at 3:06 PM, Rogan Creswick <creswick at gmail.com> wrote:

> On Tue, Nov 19, 2013 at 10:40 AM, Rogan Creswick <creswick at gmail.com> wrote:
>
>> On Tue, Nov 19, 2013 at 2:48 AM, Grzegorz Chrupała <G.A.Chrupala at uvt.nl> wrote:
>>
>>> Regarding working with Text in the Tokenize lib, I'm just wondering:
>>> wouldn't it be just as efficient to use "pack . tokenize .
>>> unpack"?
>>
>>
>
>> I'll try to run some criterion benchmarks of the tokenizers, but it may
>> be a few days to a week before I get a chance to do it right.
>>
>
> I actually found some time at lunch today, and had a 17-million-token
> corpus of Linux mailing list message bodies to tokenize as a test suite.
>  The results below were run on 1/4 of that corpus (I was in a bit of a
> hurry, but criterion is pretty confident in the precision of the timings).
>
> I compared the tokenize :: Text -> [Text] I created with:
>
> strTokenizer :: Text -> [Text]
> strTokenizer txt = map T.pack (StrTok.tokenize $ T.unpack txt)
>
> where StrTok.tokenize is the tokenize :: String -> [String] from the
> tokenize library.
>
> It looks like using Text directly is just about 19% faster.  Here are the
> criterion reports:
>
> benchmarking tokenizing/Text Tokenizer
> collecting 100 samples, 1 iterations each, in estimated 468.8980 s
> mean: 4.629273 s, lb 4.619905 s, ub 4.641448 s, ci 0.950
> std dev: 54.34576 ms, lb 43.32373 ms, ub 83.77020 ms, ci 0.950
>
> benchmarking tokenizing/String Tokenizer
> collecting 100 samples, 1 iterations each, in estimated 523.9081 s
> mean: 5.697734 s, lb 5.683067 s, ub 5.709531 s, ci 0.950
> std dev: 66.58823 ms, lb 42.25746 ms, ub 109.9214 ms, ci 0.950
>
> The specific numbers are, of course, not very important.  The means are
> about 1 second apart (4.6 vs. 5.7), and the upper/lower bounds on the means
> don't intersect (ub 4.64 vs. lb 5.68).
>
> Note that these comparisons all assume that you already *have* Text,
> instead of String.  Given that we can often choose, we should also compare
> without the pack/unpack step and see how much of the performance difference
> is due to that layer of wrapping.
>
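
In sketch form, the setup for the three configurations is roughly the
following (simplified, not necessarily the exact code in the repo; the
corpus path and the chatter/tokenize module names here are illustrative):

import Criterion.Main (bench, bgroup, defaultMain, nf)
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified NLP.Tokenize as StrTok      -- tokenize package (module name assumed)
import qualified NLP.Tokenize.Text as TxtTok -- chatter's Text tokenizer (module name assumed)

-- The String tokenizer wrapped in Text conversions, as defined above.
strTokenizer :: Text -> [Text]
strTokenizer txt = map T.pack (StrTok.tokenize $ T.unpack txt)

main :: IO ()
main = do
  txtCorpus <- TIO.readFile "benchmarks/corpus.txt" -- corpus location assumed
  let strCorpus = T.unpack txtCorpus
  defaultMain
    [ bgroup "tokenizing"
      [ bench "String Tokenizer (no packing/unpacking)" $
          nf StrTok.tokenize strCorpus
      , bench "Text Tokenizer" $
          nf TxtTok.tokenize txtCorpus
      , bench "String Tokenizer (wrapped in Text.unpack/pack)" $
          nf strTokenizer txtCorpus
      ]
    ]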

Here are results that include running the tokenize package tokenizer
(String -> [String]) on string content directly, without packing:

benchmarking tokenizing/String Tokenizer (no packing/unpacking)
collecting 100 samples, 1 iterations each, in estimated 612.1870 s
mean: 5.385096 s, lb 5.378611 s, ub 5.394577 s, ci 0.950
std dev: 39.68656 ms, lb 29.19886 ms, ub 57.89271 ms, ci 0.950

benchmarking tokenizing/Text Tokenizer
collecting 100 samples, 1 iterations each, in estimated 496.5222 s
mean: 4.864673 s, lb 4.857420 s, ub 4.874651 s, ci 0.950
std dev: 43.27871 ms, lb 33.95738 ms, ub 54.68661 ms, ci 0.950

benchmarking tokenizing/String Tokenizer (wrapped in Text.unpack/pack)
collecting 100 samples, 1 iterations each, in estimated 589.2096 s
mean: 5.841642 s, lb 5.836346 s, ub 5.849926 s, ci 0.950
std dev: 33.41265 ms, lb 23.13976 ms, ub 49.96165 ms, ci 0.950

It looks like Text is still faster, but by a smaller margin when the String
tokenizer skips the pack/unpack step (~9%, versus ~18-19% with the wrapping).

I've pushed the corpora and the benchmarking suite to the chatter git repo
if you want to take a look at the specifics!

(github url: https://github.com/creswick/chatter)
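
(If you want to rerun them, cabal's benchmark support should do it:
something like "cabal configure --enable-benchmarks && cabal bench",
though the exact target names may differ.)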

--Rogan