High-precision arithmetic for calculating N-gram probabilities

Dan Maftei ninestraycats at gmail.com
Sun Oct 30 00:29:27 BST 2011


On Sun, Oct 16, 2011 at 8:18 PM, David Hall <dlwh at cs.berkeley.edu> wrote:

> On Sun, Oct 16, 2011 at 5:26 AM, Dan Maftei <ninestraycats at gmail.com>
> wrote:
> > Thank you Wren and David. Here is what I gleaned so far:
> > 1 Standard practice is to use Doubles for probabilities. Is the main
> > reason for this because the small precision loss isn't worth reclaiming
> > by using more complicated types? Further, do we often sum probabilities
> > in NLP? If so, Ratios are a bad idea, due to many gcd calls. But if our
> > only manipulation is multiplication, Ratios will not suffer a
> > performance hit.
>
>
> That's right. Summing is very common because that's how you get
> marginal probabilities. I don't have much experience with rationals.
>
> > 2 I'm unclear what log (_ + 1) and exp (_ - 1) refer to, or what is
> > inaccurate about the standard log and exp functions. Same goes for what
> > a and b represent in logSum (== a + log(1 + exp(b - a))). If I want to
> > multiply probabilities, I use exp $ foldl' (+) 0 (map log probs), which
> > is equivalent to foldl' (*) 1 probs. Is there something inaccurate or
> > incorrect about that? NB: the probability of a large text is too small
> > to represent as a Double, since foldl' (+) 0 (map log probs) gives an
> > exponent in the thousands. However, the perplexity of said text is
> > computable; it just defers the exp until after we multiply the log sums
> > by -1/N: exp $ negate (1 / n) * logSums.
>
>
> log(_ + 1) just means--it looks pretty Scala-y to me ;)--the log of
> some double + 1. In Double arithmetic, 1 + (small double) rounds to
> exactly 1, so the naive implementation of log(_ + 1) would return 0,
> or nearly so. a and b are just doubles representing (possibly
> unnormalized) log probabilities.
>
> But in reality, log(1 + small double) is close to the small double
> itself. Thus, you should use log1p as a better approximation.
>
> Don't call that logSums, call it sumOfLogs, or something. logSums
> refers to the log of the sum of the probabilities, which is something
> like log $ foldl (+) 0.0 (map (\p -> exp (p - foldl1 max logProbs))
> logProbs)
>
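
(Side note: to make the perplexity calculation in point 2 above concrete,
here is a minimal self-contained sketch; the names perplexity, probs and
sumOfLogs are only illustrative, and probs is assumed to be the list of
per-event model probabilities.)

import Data.List (foldl')

-- Perplexity as exp of the negated average log probability.
-- Assumes probs is non-empty.
perplexity :: [Double] -> Double
perplexity probs = exp (negate (1 / n) * sumOfLogs)
  where
    n         = fromIntegral (length probs)
    sumOfLogs = foldl' (+) 0 (map log probs)

-- e.g. perplexity [0.25, 0.25, 0.25, 0.25] evaluates to 4.0, up to rounding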

Hm, I'm not sure why the map in your logSums snippet is necessary:

-- log of sum of raw probabilities
logSums = log $ foldl (+) 0.0 probs

Perhaps I misread you.
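
(For reference, here is the logSum idea written out as a self-contained
sketch, assuming the inputs are log-probabilities rather than raw
probabilities; the max is factored out so the exps stay in a safe range
and is added back at the end. The log1p import assumes base >= 4.9, where
Numeric exports it; otherwise something like the math-functions package
provides it.)

import Numeric (log1p)

-- log (exp a + exp b) for two log-probabilities, without leaving log space
logAdd :: Double -> Double -> Double
logAdd a b
  | a >= b    = a + log1p (exp (b - a))
  | otherwise = b + log1p (exp (a - b))

-- log of the sum of the probabilities, given their logs (non-empty list)
logSum :: [Double] -> Double
logSum logProbs = m + log (sum (map (\p -> exp (p - m)) logProbs))
  where m = maximum logProbs

-- e.g. logSum (map log [0.2, 0.7, 0.1]) is 0.0 (i.e. log 1), up to rounding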


>
> As for your examples, exp $ foldl' (+) 0 (map log probs) is fine,
> except, as you pointed out, exp'ing will typically go poorly (the
> result underflows to 0). foldl' (*) 1 probs is of course only
> identical in the world of infinite precision reals.
>
>
> > 3 I am still unsure how to neatly randomize output, given the
> > inaccuracy of Doubles (I have a working function but it requires a
> > guard which is only necessary due to Double imprecision). David gave
> > "2 * abs(high - low)/(high + low) <= epsilon", but I can't figure out
> > what the high and low refer to, or where to get that epsilon from.
> > Isn't it machine-dependent? If not, I could use the value from the
> > Wikipedia page for double precision, 1.11e-16.
>
> Err, maybe I should have said "tolerance" instead of epsilon. high and
> low are just poorly named variables. x and y may have been better.
> That code basically says "ensure that the absolute difference between
> x and y is small relative to their average."
>
> epsilon/tolerance is your choice, 1E-6 is usually fine. 1E-4 is ok if
> you're feeling sloppy.
>
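
(Written out as Haskell, that check would be something like the sketch
below; closeEnough is a made-up name, and x and y are assumed to be
non-negative, as probabilities are.)

-- True when x and y differ by at most tol, relative to their magnitude
closeEnough :: Double -> Double -> Double -> Bool
closeEnough tol x y = 2 * abs (x - y) / (x + y) <= tol

-- e.g. closeEnough 1e-6 total 1.0 to test that a sum of probabilities
-- came out close enough to 1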

Your check makes sense (other than why you're multiplying by 2), but does
it deal with the rounding errors that come from adding Doubles in
different orders? Let me clarify with a specific example, to make the
writing easier on the eyes:

I'm modeling character sequences using trigrams, and I'm looking at the
two-character history 't' 'h'. If I fold the values in this reduced trigram
probabilities map, I get 1, as expected:

> Map.fold (+) 0.0 (Map.filterWithKey thFilter probs)
1

However, consider generating a random character given the history 't' 'h'.
My method (I think this is standard) is to choose a random number between 0
and 1, inclusive, and see where it "lands" inside the reduced probabilities
map. For example, if I have:

't' 'h' 'a' -> 0.2
't' 'h' 'e' -> 0.7
't' 'h' 'o' -> 0.1

and the random number lands between 0 and 0.2, I output 'a'; between 0.2
and 0.9, 'e'; and between 0.9 and 1.0, 'o'.

Doing this iteratively means walking through the probabilities map, and
comparing the random number against a running sum of probabilities. Of
course, this is functional programming, so what we do is keep track of our
"search space" by plucking elements from the (ordered) probabilities map
one by one using (Map.elemAt 0) and (Map.deleteAt 0), and use an
accumulator for the running sum.

Herein lies my problem: due to rounding errors from adding Doubles in a
different order, the running sum, after looking at every element in the
map, comes out to just less than 1.0. In practice this is only an issue
when the random number generated is exactly 1.0.
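
(For reference, here is the walk above as a minimal sketch, using
Map.toAscList instead of repeated Map.elemAt 0 / Map.deleteAt 0 but doing
the same thing; the fallback to the last key is the guard I mentioned,
absorbing the case where rounding leaves the running total just below the
random draw.)

import qualified Data.Map as Map

-- Pick the key whose cumulative-probability interval contains r,
-- for r drawn from [0, 1].
sample :: Double -> Map.Map Char Double -> Char
sample r probMap = go r (Map.toAscList probMap)
  where
    go _ [(c, _)]        = c    -- last key absorbs any rounding gap
    go x ((c, p) : rest)
      | x <= p           = c
      | otherwise        = go (x - p) rest
    go _ []              = error "sample: empty distribution"

-- e.g. sample 0.95 (Map.fromList [('a', 0.2), ('e', 0.7), ('o', 0.1)])
-- gives 'o'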