High-precision arithmetic for calculating N-gram probabilities

David Hall dlwh at cs.berkeley.edu
Sun Oct 16 20:18:22 BST 2011


On Sun, Oct 16, 2011 at 5:26 AM, Dan Maftei <ninestraycats at gmail.com> wrote:
> Thank you Wren and David. Here is what I gleaned so far:
> 1 Standard practice is to use Doubles for probabilities. Is the main reason
> for this because the small precision loss isn't worth reclaiming by using
> more complicated types? Further, do we often sum probabilities in NLP? If
> so, Ratios are a bad idea, due to many gcd calls. But if our only
> manipulation is multiplication, Ratios will not suffer a performance hit.


That's right. Summing is very common because that's how you get
marginal probabilities. I don't have much experience with rationals.

> 2 I'm unclear what log (_ + 1) and exp (_ - 1) refer to, or what is
> inaccurate about the standard log and exp functions. Same goes for what a
> and b represent in logSum (== a + log(1 + exp(b-a))). If I want to multiply
> probabilities, I use exp $ foldl' (+) 0 (map log probs), which is equivalent
> to foldl' (*) 1 probs. Is there something inaccurate or incorrect about
> that? NB: the probability of a large text is intractable, since foldl' (+) 0
> (map log probs) gives an exponent in the thousands. However, the perplexity
> of said text is computable, it just defers the exp function till after we
> multiply the log sums by -1/N: exp $ negate (1 / n) * logSums.


log(_ + 1) just means--it looks pretty Scala-y to me ;)--the log of
some double + 1. In floating point, 1 + (small double) is basically 1,
and so the naive implementation of log(_ + 1) would return 0, or
nearly so. a and b are just doubles representing (possibly
unnormalized) log probabilities.

But in reality, log(1 + small double) is close to the small double.
Thus, you should use log1p as a better approximation.
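
For a concrete picture, here is one common way to compute log1p by hand
(a sketch; the correction trick is standard, but treat the numbers in
the comments as approximate):

    -- | Accurate log(1 + x) for small x, using the rounded value of 1 + x
    --   to correct for the precision lost in the addition itself.
    log1p :: Double -> Double
    log1p x
      | u == 1    = x                      -- 1 + x rounded all the way back to 1
      | otherwise = log u * x / (u - 1)    -- rescale log u by how much 1 + x rounded
      where u = 1 + x

    main :: IO ()
    main = do
      let x = 1e-15 :: Double
      print (log (1 + x))   -- naive: roughly 1.11e-15, off by about 10%
      print (log1p x)       -- roughly 1.0e-15, close to x as expected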

Don't call that logSums, call it sumOfLogs, or something. logSums
refers to the log of the sum of the probabilities, which is something
like m + (log $ foldl (+) 0.0 (map (\p -> exp (p - m)) logProbs))
where m = foldl1 max logProbs.
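
Written out as a standalone function (a sketch; the name logSumExp and
the choice of foldl' are mine):

    import Data.List (foldl')

    -- | Log of the sum of probabilities that are given in the log domain.
    --   Shifting by the maximum keeps the exp's from underflowing.
    logSumExp :: [Double] -> Double
    logSumExp []       = error "logSumExp: empty list"
    logSumExp logProbs = m + log (foldl' (+) 0 (map (\p -> exp (p - m)) logProbs))
      where m = maximum logProbs

For two arguments this is exactly wren's a + log(1 + exp(b - a)), with
a taken to be the larger of the two.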

As for your examples, exp $ foldl' (+) 0 (map log probs) is fine,
except, as you pointed out, exp'ing will typically go poorly. foldl'
(*) 1 probs is of course only identical in the world of infinite
precision reals; with Doubles the direct product underflows long
before the sum of logs gives you any trouble.
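
To make that concrete, and to show the perplexity trick you describe
(a sketch; the corpus is made up, so the numbers are only illustrative):

    import Data.List (foldl')

    -- Per-token probabilities of some text under a model (made-up numbers).
    probs :: [Double]
    probs = replicate 2000 1.0e-3

    sumOfLogs :: Double
    sumOfLogs = foldl' (+) 0 (map log probs)
      -- about -13815.5; exp of this (and foldl' (*) 1 probs) underflows to 0

    perplexity :: Double
    perplexity = exp (negate (1 / n) * sumOfLogs)   -- 1000.0, no huge exp needed
      where n = fromIntegral (length probs)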


> 3 I am still unsure how to neatly randomize output, given the inaccuracy of
> Doubles (I have a working function but it requires a guard which is only
> necessary due to Double imprecision). David gave "2 * abs(high - low)/(high
> + low) <= epsilon", but I can't figure out what the high and low refer to, or
> where to get that epsilon from. Isn't it machine-dependent? If not, I could
> use the value Wikipedia gives for double-precision types, 1.11e-16.

Err, maybe I should have said "tolerance" instead of epsilon. high and
low are just poorly named variables. x and y may have been better.
That code basically says "ensure that the absolute difference between
x and y is small relative to their average".

epsilon/tolerance is your choice, 1E-6 is usually fine. 1E-4 is ok if
you're feeling sloppy.
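
In code, the check looks something like this (a sketch; approxEqual is
just my name for it, and it assumes the inputs are nonnegative and not
both zero, which holds for probabilities):

    -- | True when x and y agree to within a relative tolerance:
    --   their absolute difference is small compared to their average.
    approxEqual :: Double -> Double -> Double -> Bool
    approxEqual tolerance x y = 2 * abs (x - y) / (x + y) <= tolerance

So a guard like approxEqual 1e-6 total 1.0 is the kind of check you'd
put around "do these probabilities sum to one" before sampling.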

-- David


> Dan
>
>>
>> Message: 1
>> Date: Sat, 15 Oct 2011 20:36:45 -0400
>> From: wren ng thornton <wren at freegeek.org>
>> Subject: Re: High-precision arithmetic for calculating N-gram
>>        probabilities
>> To: nlp at projects.haskell.org
>> Message-ID: <4E9A271D.6090505 at freegeek.org>
>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>
>> On 10/13/11 12:34 PM, David Hall wrote:
>> > Also, you'll probably want to
>> > use log(double) and operate using + as times and logSum (== a + log(1
>> > + exp(b-a)) as sum. This will avoid underflow.
>>
>> Actually, for (a + log(1 + exp(b - a))) you want to first ensure that a
>>  >= b. Otherwise you'd want to do (b + log(1 + exp(a - b)))
>>
>> Also, there are more accurate implementations of log(1+_) and (exp _-1)
>> than the naive versions which become extremely inaccurate for small
>> arguments.
>>
>>
>> If you want to deal with probabilities in the log-domain, the logfloat
>> package[1] will handle all these details for you. In addition, it
>> defines things using newtypes so that you get type safety instead of
>> mixing up which things are in the log-domain and which are in the
>> normal-domain; and it provides a Num instance so that you can say what
>> you mean instead of what the implementation is doing. In fact, the
>> package was specifically designed for NLP tasks needing to do this sort
>> of thing. While the primary goal is to maximize precision, the secondary
>> goal is to maximize performance within those constraints; and the
>> implementation has been tested thoroughly on these criteria.
>>
>> The only caveat is that the performance tuning was done a while ago,
>> back when GHC had horrible code generation for doubles, and so it uses
>> -fvia-C which has since been deprecated. I've yet to install GHC7 in
>> order to re-tune for the new backends.
>>
>>
>> [1] http://hackage.haskell.org/package/logfloat
>>
>> --
>> Live well,
>> ~wren
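
For reference, using the package looks roughly like this (a sketch from
memory of the Data.Number.LogFloat API; check the haddocks for the
exact names and signatures):

    import Data.Number.LogFloat (LogFloat, logFloat, logFromLogFloat)

    -- Tiny probabilities whose direct product would underflow a Double.
    ps :: [LogFloat]
    ps = map logFloat (replicate 2000 1.0e-3 :: [Double])

    main :: IO ()
    main = do
      let total = product ps                     -- (*) works in the log domain
      print (logFromLogFloat total :: Double)    -- log of the product, about -13815.5

The point of the Num instance is exactly what wren says: you write
product and (*) as usual, and the log-domain bookkeeping happens behind
the newtype.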
>>
>>
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Sat, 15 Oct 2011 20:40:42 -0400
>> From: wren ng thornton <wren at freegeek.org>
>> Subject: Re: High-precision arithmetic for calculating N-gram
>>        probabilities
>> To: nlp at projects.haskell.org
>> Message-ID: <4E9A280A.10609 at freegeek.org>
>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>
>> On 10/13/11 2:40 PM, Dan Maftei wrote:
>> > Yeah, log sums are actually a necessity when calculating perplexity. :)
>> >
>> > If I ever get around to profiling Rationals vs. Doubles I'll let people
>> > know
>> > what I found. But upon reflection, I see that they shouldn't make a
>> > significant difference in performance.
>>
>> The big performance problem with Rationals is that they're stored in
>> normal form. That means the implementation has to run gcd every time you
>> manipulate them. If you're really interested in performance, then you'd
>> want to try working with unnormalized ratios rather than using the
>> Rational type. Of course, then you have to be careful about not letting
>> the numbers get too large, otherwise the cost of working with full
>> Integers will be the new performance sinkhole.
>>
>> --
>> Live well,
>> ~wren
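
A minimal version of the unnormalized-ratio idea might look like this
(a sketch; the type and the names are made up for illustration):

    -- | A ratio that is never reduced: multiplication never calls gcd,
    --   at the price of the Integers growing without bound.
    data RawRatio = Integer :/ Integer

    mulR :: RawRatio -> RawRatio -> RawRatio
    mulR (a :/ b) (c :/ d) = (a * c) :/ (b * d)

    -- | Collapse to a Double only when you actually need the value.
    toDouble :: RawRatio -> Double
    toDouble (a :/ b) = fromInteger a / fromInteger b

As wren says, whether this beats Doubles depends on how large you let
the numerators and denominators grow before you collapse them.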


