Re: Poisson glm with aggregated data.

To: s-news@wubios.wustl.edu
Subject: Re: Poisson glm with aggregated data.
From: Gerald.Jean@spgdag.ca
Date: Thu, 6 Dec 2001 14:17:28 -0500
Hello S-users,

I received only one reply to my posting regarding "Poisson glm with
aggregated data".  Thanks to Brian Ripley for his comments, very useful as
usual.  His reply follows, with my original posting at the end.  As Prof.
Ripley suggested, tightening the convergence criterion helped.  But the
coefficients still show small differences.  I guess this is due to the
reduced variability in the aggregated data, although I thought that would
only affect the standard errors?

Prof. Ripley questioned modeling the claim frequencies directly, rather
than passing the exposure to glm as an offset so that the Poisson
hypothesis would be met more closely.  Here is my reply to him, sent by
private mail.

The continuous variables used in the glm call have the following meanings.

"nsininct" represents the number of claims in a cell.
"unsousi" represents the exposure, the number of units exposed in a cell,
usually the number of policy-years; this number is adjusted by the
actuaries for trends and other economic factors.
"unsous" represents the raw exposure in a cell.

As for representing the exposure through an offset, it is something I
have debated with myself quite a bit.  Coming from a rather different
modeling flavour (direct marketing), and being the only in-house
statistician, I decided to stick with the traditional way actuaries have
been modeling claim frequency; in particular I followed pretty closely
J.M. Brockman and T.S. Wright in their 1992 paper, "Statistical Motor
Rating: Making Effective Use Of Your Data", presented to the Institute of
Actuaries in April of 1992.  In this paper, without going into the details,
they do model the frequencies (number of claims / exposure) directly.

Out of curiosity I tried to account for exposure through an offset as you
suggested; I did it with both data sets, the "large" one and the "small"
one.  For the "small" data set, the coefficients were fairly close to what
I had obtained with both data sets when modeling the frequencies directly,
after I tightened the convergence criterion; but the Std. Errors were
extremely small and consequently the t-values blew up into the thousands.
For the "large" data set, the coefficients were not as close to the ones
obtained with the frequencies as dependent variable, but the Std. Errors
were a little more reasonable, although still pretty small.

I know that the t-values would be lower if we took over-dispersion into
account, which I usually do but did not for this little exercise; but
wouldn't the dispersion need to be uncomfortably big to bring the very
large t-values down to more usual levels?  As an example, a variable with
a t-value of -0.21216 when modeling the frequency directly now has a
t-value of over 200.
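
To put a number on "uncomfortably big": in a quasi-Poisson fit the
standard errors are multiplied by the square root of the estimated
dispersion, so the t-values shrink by that same factor.  The following is
a minimal sketch in R syntax (not checked against S+6), on simulated data;
the names d, x, expo and y and all the numbers are made up for
illustration and have nothing to do with the claims data.

```r
# Simulated over-dispersed counts; names and values are illustrative only.
set.seed(1)
n   <- 500
d   <- data.frame(x = rnorm(n), expo = runif(n, 50, 150))
mu  <- d$expo * exp(-2 + 0.3 * d$x)
d$y <- rnbinom(n, mu = mu, size = 2)    # negative binomial => over-dispersion

fit.p <- glm(y ~ x + offset(log(expo)), family = poisson(),      data = d)
fit.q <- glm(y ~ x + offset(log(expo)), family = quasipoisson(), data = d)

se.p <- summary(fit.p)$coefficients[, "Std. Error"]
se.q <- summary(fit.q)$coefficients[, "Std. Error"]
phi  <- summary(fit.q)$dispersion       # Pearson chi-square / residual df

# Coefficients are identical; standard errors scale by sqrt(phi).
se.ratio <- se.q / se.p
print(rbind(se.ratio = se.ratio, sqrt.phi = sqrt(phi)))

# Dispersion needed to bring a t-value of 200 down to about 2.
phi.needed <- (200 / 2)^2
print(phi.needed)
```

By this arithmetic a t-value of 200 would need a dispersion of about
10000 to come down to 2, which supports the suspicion that over-dispersion
alone is unlikely to be the whole explanation.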


I would welcome any thoughts on this topic,


---------------------- Forwarded by Gérald Jean/Assurances/SPGDAG/Desjardins
on 2001/12/06 14:00 ---------------------------


Prof Brian Ripley <ripley@stats.ox.ac.uk>@lists.biostat.wustl.edu on
2001/12/04 17:42:35

Sent by:   s-news-owner@lists.biostat.wustl.edu


To:       <Gerald.Jean@spgdag.ca>
cc:       <s-news@wubios.wustl.edu>
Subject:  Re: [S] Poisson glm with aggregated data.

glm has a very sloppy default convergence criterion.  You need to tighten
it, for example by epsilon = 1e-10, to get reproducible results.
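
As an illustration of the effect of the tolerance, here is a minimal
sketch in R syntax (not checked against S+6).  It uses glm.control() to
set epsilon, with a deliberately loose value standing in for a sloppy
default; the data are simulated and the names d, x, expo and y are
hypothetical.

```r
# Simulated Poisson data; names and values are illustrative only.
set.seed(2)
n   <- 200
d   <- data.frame(x = rnorm(n), expo = runif(n, 10, 100))
d$y <- rpois(n, d$expo * exp(-1 + 0.5 * d$x))

# Same model, loose versus tight convergence tolerance on the deviance.
fit.loose <- glm(y ~ x + offset(log(expo)), family = poisson(), data = d,
                 control = glm.control(epsilon = 1e-2))
fit.tight <- glm(y ~ x + offset(log(expo)), family = poisson(), data = d,
                 control = glm.control(epsilon = 1e-10))

# The tight fit runs at least as many IRLS iterations as the loose one.
print(c(loose = fit.loose$iter, tight = fit.tight$iter))

# Discrepancy in the coefficients left by stopping early.
print(max(abs(coef(fit.loose) - coef(fit.tight))))
```

Two fits of the same model that stop at different points can differ in
the later decimals for this reason alone, so comparisons between data
sets are best made with the tight setting.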

Also, I think you should be using an offset and not dividing by unsousi,
as it seems unlikely that nsininct / unsousi is Poisson.  Something like

glm(nsininct ~ bmm + zonecong + offset(log(unsousi)), family = poisson(),
    data = inc.all.agg)
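
A property of the offset form that bears on the original question: when
the model contains only the factors used for the aggregation, summing the
counts and the exposures within cells leaves the Poisson likelihood
unchanged as a function of the coefficients, so the fine and the
aggregated data should give the same estimates.  A minimal sketch in R
syntax on simulated data (the names raw, agg, f, g, expo and y are
hypothetical, not the variables of this thread):

```r
# Simulated cell-level data with two factors; illustrative values only.
set.seed(3)
n   <- 400
raw <- data.frame(f    = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
                  g    = factor(sample(c("u", "v"), n, replace = TRUE)),
                  expo = runif(n, 5, 50))
raw$y <- rpois(n, raw$expo *
                  exp(-1 + 0.4 * (raw$f == "b") - 0.3 * (raw$g == "v")))

# Sum counts and exposure within each f x g cell.
agg <- aggregate(raw[, c("y", "expo")],
                 by = list(f = raw$f, g = raw$g), FUN = sum)

ctl     <- glm.control(epsilon = 1e-10)
fit.raw <- glm(y ~ f + g + offset(log(expo)), family = poisson(),
               data = raw, control = ctl)
fit.agg <- glm(y ~ f + g + offset(log(expo)), family = poisson(),
               data = agg, control = ctl)

# Coefficients and standard errors agree up to convergence tolerance.
print(cbind(raw = coef(fit.raw), agg = coef(fit.agg)))
se.raw <- summary(fit.raw)$coefficients[, "Std. Error"]
se.agg <- summary(fit.agg)$coefficients[, "Std. Error"]
```

The deviances of the two fits differ, since the saturated models differ,
but the coefficients and their standard errors should not.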

It's not clear to me that what you have done does aggregate exactly: the
sum of nsininct / unsousi * unsous is not the (sum of nsininct) /(sum of
unsousi) * (sum of unsous) in general.
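
A small worked example of that inequality, in R syntax.  The variable
names are those of the thread, but the numbers are made up: one
aggregated cell built from two original rows.

```r
# Two original rows that fall into the same aggregated cell (made-up numbers).
nsininct <- c(2, 10)      # claims
unsousi  <- c(100, 200)   # adjusted exposure
unsous   <- c(150, 180)   # raw exposure

# Weighted response total as seen by the fit on the original rows.
before <- sum(nsininct / unsousi * unsous)             # 3 + 9 = 12

# The same quantity as seen by the fit on the aggregated row.
after  <- sum(nsininct) / sum(unsousi) * sum(unsous)   # 12 / 300 * 330 = 13.2

print(c(before = before, after = after))
```

The two agree only in special cases, for instance when unsous is the same
multiple of unsousi in every row of the cell.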

On Tue, 4 Dec 2001 Gerald.Jean@spgdag.ca wrote:

> Hi S-users,
>
> S+6, on NT4
>
> I have a large, very large data set consisting of several factor
> variables and of four continuous variables.  The data comes from
> transactional data and the continuous variables are aggregated (summed)
> over the factors.  A large model has been fitted to this data, and as I
> needed to do further work with only a few of the factors I re-aggregated
> over those factors.  My question:
>
> Why is it that if I fit a two-factor model (in this example) over the big
> data set I get different coefficients than when fitting the same model
> over the smaller data set?
>
> > inc.freq.glm.all <- glm(nsininct / unsousi ~ bmm + zonecong,
> +                         family = poisson(link = log),
> +                         data = inc.all.agg, weights = unsous)
>
> > inc.all.agg.2v <- aggregate(inc.all.agg[, c('unsous', 'unsousi',
> +                                             'nsininct', 'en15incf')],
> +                             by = list(bmm = inc.all.agg[, 'bmm'],
> +                               zonecong = inc.all.agg[, 'zonecong']),
> +                             FUN = sum)
>
> > inc.freq.glm.2v <- glm(nsininct / unsousi ~ bmm + zonecong,
> +                        family = poisson(link = log),
> +                        data = inc.all.agg.2v, weights = unsous)
>
> > ttt.merge <- merge(ttt.all, ttt.2v, by = 'row.names')
> > row.names(ttt.merge) <- ttt.merge[, 1]
> > ttt.merge <- ttt.merge[, -1]
> > round(ttt.merge, 5)
>
>               Value.x Std..Error.x  t.value.x  Value.y Std..Error.y  t.value.y
>  (Intercept) -5.07729      0.01282 -395.94985 -5.07304      0.01291 -393.05116
> bmm Erreur    0.15913      0.03744    4.24999  0.14441      0.03781    3.81901
> bmm Mauvais   0.73437      0.05390   13.62538  0.73928      0.05371   13.76433
> bmm Montreal -0.36010      0.03419  -10.53170 -0.36659      0.03495  -10.48868
> bmm Moyen     0.40942      0.03096   13.22550  0.41195      0.03096   13.30679
>     zonecong -0.01763      0.08312   -0.21216 -0.00122      0.08401   -0.01457
>
> I don't understand why the coefficients are not the same; I checked the
> sums over the factors and they are OK.
> Looking at the coefficients and the t-values, it almost looks as if "bmm
> Montreal" and "bmm Moyen" had been inverted in the aggregation process?
> The division by "unsousi" is not a typo; this variable is "unsous"
> adjusted for trends, inflation and other factors deemed appropriate by
> the actuaries.
>
> Thanks for any insights into this,
>
> Gérald Jean
> Consulting analyst (statistics), Actuarial
> telephone : (418) 835-4900 ext. (7639)
> fax       : (418) 835-6657
> e-mail    : gerald.jean@spgdag.ca
>
> "In God we trust all others must bring data"  W. Edwards Deming
>
> ---------------------------------------------------------------------
> This message was distributed by s-news@lists.biostat.wustl.edu.  To
> unsubscribe send e-mail to s-news-request@lists.biostat.wustl.edu with
> the BODY of the message:  unsubscribe s-news
>

--
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
