I’ve been playing with Microsoft Research’s ngram server in my spare time. I’ll write some more about this in time, I hope.
The API returns its probabilities as base 10 logs. Because the probabilities themselves are so very low, it makes more sense to return a log probability. I could be wrong, but I think it’s normal practice to use base 2 or base e for log probabilities. Logs are easy to convert from one base to another, so it doesn’t really matter. But I was wondering why base 10 logs are returned.
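For what it’s worth, here is a minimal sketch of that conversion using the change-of-base rule. The function name `convert_log_base` is my own invention for illustration, not something the ngram API provides, and the sample value is just an example log probability.

```python
import math


def convert_log_base(log_value, from_base, to_base):
    """Convert a logarithm from one base to another.

    Uses the change-of-base rule: log_b(x) = log_a(x) * ln(a) / ln(b).
    """
    return log_value * math.log(from_base) / math.log(to_base)


# A base 10 log probability, of the kind the API returns.
log10_p = -5.2721823

# The same probability expressed as a base 2 log and a natural log.
log2_p = convert_log_base(log10_p, 10, 2)
ln_p = convert_log_base(log10_p, 10, math.e)

print("log2:", log2_p)
print("ln:  ", ln_p)
```

The conversion is a single multiplication by a constant, which is why the choice of base is mostly a matter of convenience.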
The penny dropped when I was just playing around printing out the base 10 logs of some probabilities:
log10(1.0) = 0.0
log10(0.1) = -1.0
log10(0.01) = -2.0
log10(0.001) = -3.0
I felt silly (since log10(x) = y by definition means x = 10^y). But it’s really helpful to look at a number like -5.2721823 and know there are 5 zeros between the decimal point and the first nonzero digit. It’s even easier to parse than 0.0000053434.
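Here is a small sketch of that reading-off trick, in case it helps. The helper `leading_zeros` is a hypothetical name of my own, and it assumes the log is negative (a probability below 1). Note the edge case: when the log is an exact integer such as -5.0, the probability is 0.00001, which has only 4 leading zeros.

```python
import math


def leading_zeros(log10_p):
    """Count zeros between the decimal point and the first nonzero digit.

    Assumes log10_p is a negative base 10 log (a probability below 1).
    """
    if log10_p >= 0:
        raise ValueError("expected a negative base 10 log")
    # ceil handles exact powers of ten: -5.0 is 0.00001, which has 4 zeros.
    return math.ceil(-log10_p) - 1


log10_p = -5.2721823
probability = 10 ** log10_p

print("zeros after the decimal point:", leading_zeros(log10_p))
print("probability: %.10f" % probability)
```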