High-precision arithmetic for calculating N-gram probabilities

wren ng thornton wren at freegeek.org
Sun Oct 16 01:36:45 BST 2011


On 10/13/11 12:34 PM, David Hall wrote:
> Also, you'll probably want to
> use log(double) and operate using + as times and logSum (== a + log(1
> + exp(b-a))) as sum. This will avoid underflow.

Actually, for (a + log(1 + exp(b - a))) you first want to ensure that
a >= b; otherwise you'd want to use (b + log(1 + exp(a - b))) instead.

Also, there are more accurate implementations of log(1 + x) and 
(exp x - 1), conventionally called log1p and expm1, than the naive 
compositions, which become extremely inaccurate for small arguments 
(the intermediate 1 + x rounds away nearly all of x).
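
As a sketch of one such implementation, here's the well-known Kahan 
trick (not necessarily what any particular library does internally), 
which uses the rounding of 1 + x itself to correct the naive result:

    -- Accurate log(1 + x) for small x: if 1 + x rounds to 1, then x
    -- itself is the correctly rounded answer; otherwise rescale the
    -- naive result by how much x was perturbed by the addition.
    log1p :: Double -> Double
    log1p x
        | u == 1    = x
        | otherwise = log u * (x / (u - 1))
        where u = 1 + x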


If you want to deal with probabilities in the log-domain, the logfloat 
package[1] will handle all these details for you. In addition, it wraps 
log-domain values in a newtype, so the type system keeps you from mixing 
up which values are in the log-domain and which are in the normal 
domain; and it provides a Num instance so that you can say what you mean 
rather than spell out what the implementation is doing. In fact, the 
package was designed specifically for NLP tasks that need to do this 
sort of thing. The primary goal is to maximize precision, and the 
secondary goal is to maximize performance within that constraint; the 
implementation has been tested thoroughly against both criteria.
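
By way of illustration, usage looks roughly like this (a sketch only; 
I'm taking logFloat as the conversion from the normal domain and 
relying on the Num and Fractional instances; see the haddocks for the 
exact API):

    import Data.Number.LogFloat (LogFloat, logFloat)

    -- Chain together n-gram probabilities; (*) on LogFloat is (+) on
    -- the underlying log-domain Double, so long products don't underflow.
    sentenceProb :: [Double] -> LogFloat
    sentenceProb = product . map logFloat

    -- And (+) on LogFloat performs the logSum trick above, so
    -- normalizing a distribution reads as ordinary arithmetic.
    normalize :: [LogFloat] -> [LogFloat]
    normalize xs = map (/ sum xs) xs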

The only caveat is that the performance tuning was done a while ago, 
back when GHC had horrible code generation for doubles, and so it uses 
-fvia-C, which has since been deprecated. I've yet to install GHC 7 in 
order to re-tune for the new backends.


[1] http://hackage.haskell.org/package/logfloat

-- 
Live well,
~wren
