High-precision arithmetic for calculating N-gram probabilities

David Hall dlwh at cs.berkeley.edu
Thu Oct 13 17:34:58 BST 2011


On Thu, Oct 13, 2011 at 7:19 AM, Dan Maftei <ninestraycats at gmail.com> wrote:
> Hello all!
>
> The need for high-precision arithmetic arose when I began randomly
> generating output from a trigram letter model. Briefly, given a "search
> space," defined as the subset of the trigram probabilities map in which
> every trigram begins with the same two-letter history, I walk through
> that space, trigram by trigram, checking a randomly generated number
> between 0.0 and 1.0 (inclusive) against a running sum of probabilities.
> This running sum is logically equivalent to Map.fold (+) 0 searchSpace.
> The fold returns 1 as expected, but my function gives 0.9999999998
> after summing every probability, presumably because of the order in
> which my probabilities, stored as Doubles, are summed.
>
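The usual way out is to make the sampler robust to that last bit of
rounding instead of chasing an exact total. A minimal sketch of such a
walk (the names here are mine, not from your code):

  import qualified Data.Map as Map

  -- Walk the candidates, subtracting each probability from the random
  -- draw r, and take the first entry that exhausts it. Falling back to
  -- the last candidate covers the case where rounding leaves the total
  -- slightly below 1.0, so exact equality never matters.
  sample :: Double -> Map.Map k Double -> k
  sample r space = go r (Map.toList space)
    where
      go _  [(k, _)]        = k    -- fallback: last candidate
      go r' ((k, p) : rest)
        | r' < p            = k
        | otherwise         = go (r' - p) rest
      go _  []              = error "sample: empty distribution"
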
> I wager high-precision arithmetic is a problem not just for N-gram models,
> but for NLP in general. (I am new to NLP, so this is an educated guess :-)
> Domain-specific solutions to this problem should be mentioned in the NLP
> for the Working Programmer book. For example, since N-gram probabilities
> are merely ratios of integer counts, I will re-write my model using the
> Rational data type instead of Double. I predict this will be faster and
> more precise than Doubles, with less space overhead.
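
For what it's worth, the Rational version of the probability table is
easy to sketch (assuming the counts are kept as Integers; the names are
mine):

  import Data.Ratio ((%))
  import qualified Data.Map as Map

  -- Probabilities as exact ratios of integer counts: the elements of
  -- the resulting map sum to exactly 1 % 1 by construction, so the
  -- 0.9999999998 problem cannot arise.
  probabilities :: Map.Map k Integer -> Map.Map k Rational
  probabilities counts = Map.map (% total) counts
    where total = sum (Map.elems counts)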

That might work here, but in practice you use Doubles and accept that
they're imprecise, and so you never check for exact equality, ever.
When you start smoothing and using the clever backoff models, you'll
find that you want Doubles anyway. Also, you'll probably want to store
log(double) values, using + as times and logSum (== a + log(1 +
exp(b - a))) as sum. This will avoid underflow.
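
In Haskell, logSum is just a couple of lines (a sketch, not from any
particular library):

  -- Probabilities are kept as their natural logs, so "times" is (+).
  -- "Plus" is log (exp a + exp b); factoring out the larger argument
  -- avoids computing exp of a large negative log-probability directly,
  -- which would underflow to zero.
  logSum :: Double -> Double -> Double
  logSum a b
    | a >= b    = a + log (1 + exp (b - a))
    | otherwise = b + log (1 + exp (a - b))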

Fun fact: many modern language models actually use something like 5
bits of precision. Language models are inaccurate enough (the
difference between 1E-6 and 1.1E-6 isn't that big, relative to how
messy and approximate the n-gram assumption really is) that it just
doesn't matter.

I'll be surprised if Rationals turn out to be faster or more useful
than Doubles, anyway.

-- David

>
> What solutions have others come up with?
>
> Dan
>


