High-precision arithmetic for calculating N-gram probabilities

Dan Maftei ninestraycats at gmail.com
Thu Oct 13 15:19:48 BST 2011


Hello all!

The need for high-precision arithmetic arose when I began randomly
generating output from a trigram letter model. Briefly, given a "search
space," defined as the subset of the trigram probability map in which every
trigram begins with the same two-letter history, I walk through that space,
trigram by trigram, checking a randomly generated number between 0.0 and 1.0
(inclusive) against a running sum of probabilities. This running sum is
logically equivalent to Map.fold (+) 0 searchSpace. Whereas the fold returns
1 as expected, my function gives 0.9999999998 after having summed every
probability. Presumably this is because floating-point addition is not
associative, so the order in which my probabilities, stored as Doubles, are
summed changes the rounding error in the total.
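
For concreteness, here is a minimal sketch of the walk I mean. The names
(SearchSpace, pick) are my own, and I have simplified the keys to just the
third letter of each trigram:

    import qualified Data.Map as Map
    import Data.Map (Map)

    -- Probabilities of each possible next letter, given a fixed
    -- two-letter history.
    type SearchSpace = Map Char Double

    -- Walk the space in key order, accumulating probabilities until
    -- the running sum reaches the random draw r.  Nothing covers the
    -- case where rounding error leaves the total just short of r.
    pick :: Double -> SearchSpace -> Maybe Char
    pick r space = go 0 (Map.toAscList space)
      where
        go _   []              = Nothing
        go acc ((c, p) : rest)
          | acc + p >= r       = Just c
          | otherwise          = go (acc + p) rest

With Doubles the accumulated total can come out at 0.9999999998, so a draw
close to 1.0 falls through to Nothing.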

I wager high-precision arithmetic is a problem not just for N-gram models
but for NLP in general. (I am new to NLP, so this is an educated guess. :-)
Domain-specific solutions to this problem should be mentioned in the NLP for
the Working Programmer book. For example, since N-gram probabilities are
merely ratios of integer counts, I will rewrite my model using the Rational
data type instead of Double. I predict this will be faster than Doubles,
with less space overhead and more precision.
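
As a sketch of what I have in mind (prob and its count arguments are
illustrative):

    import Data.Ratio ((%))

    -- An N-gram probability as an exact ratio of integer counts:
    -- occurrences of the trigram over occurrences of its two-letter
    -- history.
    prob :: Integer -> Integer -> Rational
    prob trigramCount historyCount = trigramCount % historyCount

    -- Rational addition incurs no rounding, so a full search space
    -- sums to exactly 1, in any order:
    --   sum [prob 1 3, prob 1 3, prob 1 3] == 1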

What solutions have others come up with?

Dan