s-news
[Top] [All Lists]

Re: Strange behavior with fitted lm objects

To: <a296180@mica.fmr.com>
Subject: Re: Strange behavior with fitted lm objects
From: "David M Smith" <dsmith@insightful.com>
Date: Mon, 26 Mar 2001 09:14:55 -0800
Cc: "S-News" <s-news@wubios.wustl.edu>, <support@insightful.com>
Importance: Normal
In-reply-to: <15039.22091.383766.690729@arbres2a>
I'm not sure exactly *why* the fitted values come out differently (in the
15th decimal place), but such problems are difficult to avoid in
floating-point arithmetic.  In general, x == y is not a well-defined
expression for floating point (in S-PLUS's usage: double) numbers x and y.

S-PLUS 6 provides the function all.equal which allows you to compare two
numbers (or generally, two objects) within a tolerance depending on the
machine's floating-point precision.  Following your example:

> fv <- fitted(lm(tempDF$y ~ tempDF$f))
> names(fv) <- NULL  ## all.equal will report differences in names
> all.equal(fv[1], fv[3])
[1] T

As an aside, unique() would perhaps be more useful if it did its comparisons
within floating-point error, but alas it does not.

You can find discussions of floating-point arithmetic in most books on
programming, and the comp.lang.c FAQ discusses it as well.  I've also
included an S-PLUS FAQ which isn't directly related to this problem but
which may be enlightening.

# David Smith


Question

Sometimes numbers which should be equal aren't recognised as such when
tested with ==.  In the example below, the output from seq() seems correct,
but if I try and compare the result with certain numbers it doesn't work
properly.

> index<-seq(0.772,1.0,0.012)
> index
 [1] 0.772 0.784 0.796 0.808 0.820 0.832 0.844 0.856 0.868 0.880 0.892 0.904
0.916 0.928
[15] 0.940 0.952 0.964 0.976 0.988 1.000
> index==0.820
[1] F F F F F F F F F F F F F F F F F F F F
> index==0.832
[1] F F F F F F F F F F F F F F F F F F F F

Answer

S-PLUS (in common with most computer programs) uses a representation known
as "floating point" to store real numbers (floats).  In floating point, real
numbers are represented by a sign S (+1 or -1), a mantissa M (a number
between 1 and 2, with a fixed number of digits), and an exponent E (an
integer which can take values in a range which includes zero).  The value of
a floating point number is then S*M*R^E, where R is a fixed base.  Floating
point numbers are often displayed with a base R=10, but since internally the
numbers are represented using a binary pattern the base R=2 is used.

Floating point is an approximation. The IEEE standard for 32 bit float (as
used by the S-PLUS storage mode single) supports 1 bit of sign, 8 bits of
exponent, and 23 bits of mantissa. Since a normalized binary-point mantissa
always has the form 1.xxxxx... the leading 1 is dropped and you get
effectively 24 bits of mantissa. The number 1000.43 (and many, many others)
is not exactly representable in single or double precision format. 1000.43
is actually represented as the following bitpattern (the "s" shows the
position of the sign bit, the "e"s show the positions of the exponent bits,
and the "m"s show the positions of the mantissa bits):

seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm
01000100011110100001101110000101
The shifted mantissa is 1111101000.01101110000101 or 1000 + 7045/16384. The
fractional part is 0.429992675781. With 24 bits of mantissa you only get
about 1 part in 16M of precision for single precision. The double type
provides more precision (53 bits of mantissa).

Because there is usually no exact floating-point representation for any real
number, testing equality of floating-point values is always a dubious
practice, and this is what is causing the original problem. It is probably
the case that there is no exact floating-point representation for the number
being tested for (0.82), and the actual value S-PLUS is using for the
comparison is slightly different from the result of the calculations used in
seq().

When comparing floats, you should always compare with a tolerance, e.g
abs(x-c)<eps to test x==c. Some numbers (e.g. 0 and small integers) always
have exact representations, but even then there may be problems, e.g. in
floating point arithmetic (2/3)*3 is not guaranteed to be equal to 2.
(Incidentally, there is a function in S-PLUS 5.0 (all.equal) which does
allow comparison of floats within a machine-precision tolerance.)

The reason why the output from seq() seems to be exact is due to the fact
that S-PLUS rounds floats when printing according to options(digits).
Setting options(digits) to a large number will reveal that the output from
seq() wasn't quite what you thought it was.

--
David M Smith <dsmith@insightful.com>
S-PLUS Product Marketing Manager, Insightful Corp, Seattle WA
Tel: +1 (206) 283 8802 x360
Fax: +1 (206) 283 0347

MathSoft is now Insightful!  See www.insightful.com for details.

> -----Original Message-----
> From: s-news-owner@lists.biostat.wustl.edu
> [mailto:s-news-owner@lists.biostat.wustl.edu]On Behalf Of David Kane -
> GEODE <David Kane
> Sent: Monday, March 26, 2001 06:56
> To: s-news@lists.biostat.wustl.edu
> Subject: [S] Strange behavior with fitted lm objects
>
>
>
> Hi,
>
> I see some strange behavior when generating the fitted values from an lm
> object. Consider:
> > version
> Version 6.0 Release 1 for Sun SPARC, SunOS 5.6 : 2000
> > set.seed(9)
> > tempDF <- data.frame( y = rnorm(6), f =
> as.factor(paste("level", rep(1:2,3), sep = "")))
> > tempDF
>            y      f
> 1 -0.2009222 level1
> 2  0.4997202 level2
> 3  0.3564666 level1
> 4  0.4703904 level2
> 5 -0.4061262 level1
> 6 -1.6078728 level2
> > fitted(lm(tempDF$y ~ tempDF$f))
>            1          2           3          4           5          6
>  -0.08352725 -0.2125874 -0.08352725 -0.2125874 -0.08352725 -0.2125874
>
> At this point, everything looks OK. The fitted values are the mean of the
> response at the level of each factor.
>
> > tapply(tempDF$y, tempDF$f, mean)
>       level1     level2
>  -0.08352725 -0.2125874
>
> Unfortunately, the results are not *identical* within each level.
> Consider:
>
> > unique(fitted(lm(tempDF$y ~ tempDF$f)))
> [1] -0.08352725 -0.21258740 -0.08352725 -0.21258740 -0.21258740
> > fitted(lm(tempDF$y ~ tempDF$f))[1] == fitted(lm(tempDF$y ~ tempDF$f))[3]
>  1
>  F
> > options(digits = 15)
> > fitted(lm(tempDF$y ~ tempDF$f))[1:3]
>                    1                  2                   3
>  -0.0835272529974701 -0.212587401068558 -0.0835272529974702
>
>
> Any insights on this problem would be much appreciated.
>
> Thanks,
>
> Dave Kane
>
> --
> David Kane
> Geode Capital Management
> 617-563-0122
> david.d.kane@fmr.com
> ---------------------------------------------------------------------
> This message was distributed by s-news@lists.biostat.wustl.edu.  To
> unsubscribe send e-mail to s-news-request@lists.biostat.wustl.edu with
> the BODY of the message:  unsubscribe s-news
>


<Prev in Thread] Current Thread [Next in Thread>