CLT Toolkit

wren ng thornton wren at freegeek.org
Sat Nov 27 21:15:00 EST 2010


On 11/27/10 4:06 AM, Daniël de Kok wrote:
> On Nov 27, 2010, at 7:21 AM, wren ng thornton wrote:
>> algorithms. I started the project because the CCG supertaggers available
>> (C&C Tools; OpenCCG) are too integrated in their own projects to
>> facilitate doing my kind of research, and also because the current
>> standard for HMM tagging (TnT) is closed source. So the goal (as a
>> Haskell library) is to make it as openly reusable as possible.
>
> For what it is worth, my Citar (C++) and Jitar (Java) taggers are nearly identical to TnT:
>
> - They use a trigram HMM model.
> - Linear interpolation smoothing is used for estimating the probability of trigrams.
> - Unknown word probabilities are estimated using suffixes.
> - There are some tricks that are not described in Brants's paper that are necessary to achieve the same performance (e.g., using different estimators for unknown words that are capitalized/uncapitalized).
>
> https://github.com/danieldk/citar
> https://github.com/danieldk/jitar
>
> Both are available under an open-source license.

Good to know. I'll definitely take a look at them.

Part of my goal with the library, though, is to have it serve as more of 
a toolkit so that people can experiment with different smoothing and 
backoff methods as well as different inference algorithms. The 
modularity over inference algorithms is necessary for my particular 
research, though modularity over models is nice for other kinds of 
research (and trivial to do in Haskell).
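
As a rough illustration of what modularity over smoothing could look
like, here is a minimal sketch. It is not code from the library: every
name in it (Counts, Smoother, interpolate, backoff, scoreTags) is
hypothetical, the interpolation weights are supplied by the caller
rather than estimated, and the backoff variant does no discounting, so
it is not a properly normalized distribution. The only point is that
the scoring function is parameterized by the smoother.

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical n-gram count tables over tags of type t.
data Counts t = Counts
  { uni   :: M.Map t Double
  , bi    :: M.Map (t, t) Double
  , tri   :: M.Map (t, t, t) Double
  , total :: Double
  }

-- Count how often each item occurs in a list.
tally :: Ord k => [k] -> M.Map k Double
tally xs = M.fromListWith (+) [(x, 1) | x <- xs]

-- Collect unigram, bigram, and trigram counts from one tag sequence.
countsFrom :: Ord t => [t] -> Counts t
countsFrom ts = Counts
  { uni   = tally ts
  , bi    = tally (zip ts (drop 1 ts))
  , tri   = tally (zip3 ts (drop 1 ts) (drop 2 ts))
  , total = fromIntegral (length ts)
  }

-- Look up a count, treating unseen keys as zero.
look :: Ord k => k -> M.Map k Double -> Double
look = M.findWithDefault 0

-- Relative frequency, defined as zero when the denominator is zero.
ratio :: Double -> Double -> Double
ratio n d = if d == 0 then 0 else n / d

-- A smoother estimates P(t3 | t1, t2) from the counts.
type Smoother t = Counts t -> t -> t -> t -> Double

-- Linear interpolation of unigram, bigram, and trigram estimates.
-- The weights (l1, l2, l3) are assumed to sum to one.
interpolate :: Ord t => (Double, Double, Double) -> Smoother t
interpolate (l1, l2, l3) c t1 t2 t3 =
    l1 * ratio (look t3 (uni c)) (total c)
  + l2 * ratio (look (t2, t3) (bi c)) (look t2 (uni c))
  + l3 * ratio (look (t1, t2, t3) (tri c)) (look (t1, t2) (bi c))

-- Crude backoff: use the highest-order estimate that was observed.
-- No discounting is applied, so this is for illustration only.
backoff :: Ord t => Smoother t
backoff c t1 t2 t3
  | n3 > 0    = ratio n3 (look (t1, t2) (bi c))
  | n2 > 0    = ratio n2 (look t2 (uni c))
  | otherwise = ratio (look t3 (uni c)) (total c)
  where
    n3 = look (t1, t2, t3) (tri c)
    n2 = look (t2, t3) (bi c)

-- Transition score of a tag sequence under any smoother: the product
-- of the smoothed trigram probabilities along the sequence.
scoreTags :: Smoother t -> Counts t -> [t] -> Double
scoreTags p c ts = product (zipWith3 (p c) ts (drop 1 ts) (drop 2 ts))
```

Swapping smoothing methods is then just a matter of passing a
different argument, e.g. scoreTags (interpolate (0.1, 0.3, 0.6)) c tags
versus scoreTags backoff c tags. The same trick would apply to
inference algorithms by abstracting over the decoder instead.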

-- 
Live well,
~wren


