High-precision arithmetic for calculating N-gram probabilities

David Hall dlwh at cs.berkeley.edu
Sat Oct 15 03:55:52 BST 2011


On Thu, Oct 13, 2011 at 11:40 AM, Dan Maftei <ninestraycats at gmail.com> wrote:
> Yeah, log sums are actually a necessity when calculating perplexity. :)
> If I ever get around to profiling Rationals vs. Doubles I'll let people know
> what I found. But upon reflection, I see that they shouldn't make a
> significant difference in performance.
> However, I can't get around the fact that randomizing output requires
> checking for equality, if only <=. You have a search space, say, all
> trigrams starting with "th", and a random probability fits somewhere inside
> that space. To find out where it fits, I need to check the random
> probability against the sum of trigram probabilities. Are there other ways?
> Dan

<= is generally ok. Typically, you check 2 * abs(high - low) / (high + low)
<= epsilon, or something.
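
For concreteness, here is a rough Haskell sketch of both ideas, the
relative-tolerance check and the running-sum walk from the quoted message.
The names approxEq and sampleTrigram are invented for illustration; this is
a sketch, not the code under discussion. Letting the last bucket absorb any
leftover rounding error means you never compare against exactly 1.0:

    -- Illustrative sketch only; approxEq and sampleTrigram are made-up names.
    import qualified Data.Map as Map
    import           Data.Map (Map)

    -- Relative closeness: 2 * |a - b| / (a + b) <= eps, assuming a, b > 0.
    approxEq :: Double -> Double -> Double -> Bool
    approxEq eps a b = 2 * abs (a - b) / (a + b) <= eps

    -- Walk the search space in a fixed order, returning the first trigram
    -- whose running probability sum reaches the random draw r in [0, 1].
    -- The final entry is returned unconditionally, so rounding error in
    -- the total (the 0.9999999998 problem) can never leave r unmatched.
    sampleTrigram :: Double -> Map String Double -> Maybe String
    sampleTrigram r space = go 0 (Map.toList space)
      where
        go _   []                = Nothing
        go _   [(tri, _)]        = Just tri
        go acc ((tri, p) : rest)
          | r <= acc + p         = Just tri
          | otherwise            = go (acc + p) rest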


>
> On Thu, Oct 13, 2011 at 5:34 PM, David Hall <dlwh at cs.berkeley.edu> wrote:
>>
>> On Thu, Oct 13, 2011 at 7:19 AM, Dan Maftei <ninestraycats at gmail.com>
>> wrote:
>> > Hello all!
>> >
>> > The need for high-precision arithmetic arose when I began randomly
>> > generating output from a trigram letter model. Briefly, given a "search
>> > space," defined as a subset of the trigram probabilities map such that
>> > every
>> > trigram begins with the same two-letter history, I walk through said
>> > space,
>> > trigram by trigram, checking a randomly generated number between 0.0 and
>> > 1.0
>> > (inclusive) against the running sum of probabilities. This running sum
>> > is
>> > logically equivalent to Map.fold (+) 0 searchSpace. Whereas the fold
>> > returns
>> > 1 as expected, my function gives 0.9999999998 after having summed every
>> > probability. Presumably this is due to the order in which my
>> > probabilities,
>> > stored as Doubles, are summed.
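
That diagnosis is right: each Double addition rounds, so long running sums
drift, and the result even depends on the order of the additions. A quick
GHCi illustration with made-up numbers:

    Prelude> foldl (+) 0 [0.1, 0.2, 0.3] :: Double
    0.6000000000000001
    Prelude> foldr (+) 0 [0.1, 0.2, 0.3] :: Double
    0.6
    Prelude> sum (replicate 10 0.1) :: Double
    0.9999999999999999
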
>> >
>> > I wager high-precision arithmetic is a problem not just for N-gram
>> > models,
>> > but for NLP in general. (I am new to NLP so this is an educated guess
>> > :-)
>> > Domain-specific solutions to this problem should be mentioned in the NLP
>> > for
>> > the Working Programmer book. For example, since N-gram probabilities are
>> > merely ratios of integer counts, I will re-write my model using the
>> > Rational
>> > data type instead of Double. I predict this will be faster than Doubles
>> > with
>> > less space overhead, and more precise.
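
For reference, the Rational version is only a few lines with Data.Ratio
(probsFromCounts is an invented name), and the exactness part holds: the
probabilities below really do sum to 1 % 1. Whether it pays off in speed or
space is another question; see the reply below.

    import Data.Ratio ((%))

    -- Exact probabilities from integer counts (assumes a non-empty list
    -- with a positive total). probsFromCounts is an illustrative name.
    probsFromCounts :: [Integer] -> [Rational]
    probsFromCounts counts = map (% total) counts
      where total = sum counts

    -- ghci> sum (probsFromCounts [3, 5, 2])
    -- 1 % 1
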
>>
>> That might work here, but in practice you use Doubles and accept that
>> they're imprecise, and so you never check for exact equality, ever.
>> When you start smoothing and using the clever backoff models, you'll
>> find that you want to do that anyway. Also, you'll probably want to
>> use log(double) and operate using + as times and logSum (== a + log(1
>> + exp(b-a))) as sum. This will avoid underflow.
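
Spelled out in Haskell, that logSum is just the formula above with the
larger argument pulled out front so the exp can never overflow; this is a
sketch, not any particular library's version:

    -- Probabilities are stored as their natural logs, so multiplication
    -- becomes (+) and addition becomes logSum. Sketch only.
    logSum :: Double -> Double -> Double
    logSum x y
      | isInfinite hi && hi < 0 = hi           -- both inputs are log 0
      | otherwise               = hi + log (1 + exp (lo - hi))
      where
        hi = max x y
        lo = min x y

    -- Summing a whole distribution, e.g. for normalisation, without
    -- ever leaving log space:
    logSumAll :: [Double] -> Double
    logSumAll = foldl logSum (log 0)
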
>>
>> Fun fact: many modern language models actually use something like 5
>> bits of precision. Language models are inaccurate enough (the
>> difference between 1E-6 and 1.1E-6 isn't that big, relative to how
>> messy and approximate the n-gram assumption really is) that it just
>> doesn't matter.
>>
>> I'll be surprised if Rationals are faster and/or more useful than
>> Doubles, anyway.
>>
>> -- David
>>
>> >
>> > What solutions have others come up with?
>> >
>> > Dan
>> >
>
>


