High-precision arithmetic for calculating N-gram probabilities

Dan Maftei ninestraycats at gmail.com
Thu Oct 13 19:40:15 BST 2011


Yeah, log sums are actually a necessity when calculating perplexity. :)
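
For concreteness, the perplexity calculation I have in mind looks roughly
like this (just a sketch with illustrative names, not my actual code):

  import Data.List (foldl')

  -- Perplexity over a sequence of per-token probabilities, computed by
  -- summing log probabilities instead of multiplying the raw Doubles,
  -- which would underflow on any text of real length.
  perplexity :: [Double] -> Double
  perplexity probs = exp (negate total / n)
    where
      total = foldl' (\acc p -> acc + log p) 0 probs
      n     = fromIntegral (length probs)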

If I ever get around to profiling Rationals vs. Doubles I'll let people know
what I find. But upon reflection, I suspect they won't make a significant
difference in performance.
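
If I do get to it, the comparison would probably look something like this
(a criterion sketch over made-up counts, not my model code):

  import Criterion.Main (bench, bgroup, defaultMain, nf)
  import Data.Ratio ((%))

  -- Time summing the same distribution stored as Doubles vs. Rationals.
  main :: IO ()
  main = defaultMain
    [ bgroup "summing 10000 probabilities"
        [ bench "Double"   $ nf sum doubles
        , bench "Rational" $ nf sum rationals
        ]
    ]
    where
      counts    = [1 .. 10000] :: [Integer]
      total     = sum counts
      rationals = map (% total) counts :: [Rational]
      doubles   = map (\c -> fromIntegral c / fromIntegral total) counts
                    :: [Double]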

However, I can't get around the fact that randomizing output requires
comparing floating-point values, even if only with <=. You have a search
space, say, all trigrams starting with "th", and a random number falls
somewhere inside that space. To find out where it falls, I need to check the
random number against the running sum of trigram probabilities. Are there
other ways?
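
To make that concrete, the walk looks roughly like this (a sketch with
illustrative names, not my actual code). The only comparison needed is <=,
and the last trigram absorbs whatever rounding error is left over:

  import qualified Data.Map as Map
  import System.Random (randomRIO)

  -- Walk the search space in key order, accumulating probability mass,
  -- and return the first trigram whose running total reaches the draw.
  pick :: Map.Map String Double -> IO String
  pick searchSpace = do
    r <- randomRIO (0.0, 1.0)
    let go _   [(tri, _)]        = tri   -- last bucket soaks up rounding error
        go acc ((tri, p) : rest)
          | r <= acc + p         = tri
          | otherwise            = go (acc + p) rest
        go _   []                = error "empty search space"
    return (go 0.0 (Map.toList searchSpace))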

Dan

On Thu, Oct 13, 2011 at 5:34 PM, David Hall <dlwh at cs.berkeley.edu> wrote:

> On Thu, Oct 13, 2011 at 7:19 AM, Dan Maftei <ninestraycats at gmail.com>
> wrote:
> > Hello all!
> >
> > The need for high-precision arithmetic arose when I began randomly
> > generating output from a trigram letter model. Briefly, given a
> > "search space," defined as a subset of the trigram probabilities map
> > such that every trigram begins with the same two-letter history, I
> > walk through said space, trigram by trigram, checking a randomly
> > generated number between 0.0 and 1.0 (inclusive) against the running
> > sum of probabilities. This running sum is logically equivalent to
> > Map.fold (+) 0 searchSpace. Whereas the fold returns 1 as expected,
> > my function gives 0.9999999998 after having summed every probability.
> > Presumably this is due to the order in which my probabilities, stored
> > as Doubles, are summed.
> >
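
A quick GHCi illustration of the two effects above: Doubles that "should"
sum to 1 often don't quite, and the total can depend on the order of
summation:

  ghci> sum (replicate 10 0.1) :: Double
  0.9999999999999999
  ghci> let xs = 1.0 : replicate 4 1.0e-16 :: [Double]
  ghci> (sum xs, sum (reverse xs))
  (1.0,1.0000000000000004)
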
> > I wager high-precision arithmetic is a problem not just for N-gram
> > models, but for NLP in general. (I am new to NLP so this is an
> > educated guess :-) Domain-specific solutions to this problem should
> > be mentioned in the NLP for the Working Programmer book. For example,
> > since N-gram probabilities are merely ratios of integer counts, I
> > will re-write my model using the Rational data type instead of
> > Double. I predict this will be faster than Doubles, with less space
> > overhead, and more precise.
>
> That might work here, but in practice you use Doubles and accept that
> they're imprecise, and so you never check for exact equality, ever.
> When you start smoothing and using the clever backoff models, you'll
> find that you want to do that anyway. Also, you'll probably want to
> use log(double) and operate using + as times and logSum (== a + log(1
> + exp(b-a))) as sum. This will avoid underflow.
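
In Haskell, a logSum along the lines David describes might look like this
(my sketch; the only wrinkles are keeping the larger argument first so the
exp can't overflow, and treating -Infinity as log 0):

  -- Add two probabilities that are stored in log space:
  -- logSum a b == log (exp a + exp b), computed stably.
  logSum :: Double -> Double -> Double
  logSum a b
    | isInfinite a && a < 0 = b                  -- a is log 0
    | isInfinite b && b < 0 = a                  -- b is log 0
    | a >= b                = a + log (1 + exp (b - a))
    | otherwise             = b + log (1 + exp (a - b))

Summing a whole search space in log space is then a fold of logSum over the
log probabilities, starting from log 0.
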
>
> Fun fact: many modern language models actually use something like 5
> bits of precision. Language models are inaccurate enough (the
> difference between 1E-6 and 1.1E-6 isn't that big, relative to how
> messy and approximate the n-gram assumption really is) that it just
> doesn't matter.
>
> I'll be surprised if Rationals are faster and/or more useful than
> Doubles, anyway.
>
> -- David
>
> >
> > What solutions have others come up with?
> >
> > Dan
> >
>