'Re: [Rd] Discourage the weights= option of lm with summarized data'


List:       r-devel
Subject:    Re: [Rd] Discourage the weights= option of lm with summarized data
From:       peter dalgaard <pdalgd () gmail ! com>
Date:       2017-11-28 12:01:24
Message-ID: 1678DE0E-9691-4D20-B23C-790DAAB10EE9 () gmail ! com

My local R-devel version now has (in ?lm)

     Non-‘NULL’ ‘weights’ can be used to indicate that different
     observations have different variances (with the values in
     ‘weights’ being inversely proportional to the variances); or
     equivalently, when the elements of ‘weights’ are positive integers
     w_i, that each response y_i is the mean of w_i unit-weight
     observations (including the case that there are w_i observations
     equal to y_i and the data have been summarized). However, in the
     latter case, notice that within-group variation is not used.
     Therefore, the sigma estimate and residual degrees of freedom may
     be suboptimal; in the case of replication weights, even wrong.
     Hence, standard errors and analysis of variance tables should be
     treated with care.
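
For concreteness, here is a minimal sketch of the replication-weight case, reusing the small data set from the start of this thread (quoted further down). The object names are made up for illustration, and the rescaling by sqrt(df_w / df_full) is only an illustration, not something lm() does; it assumes each weight really counts w_i identical observations.

```r
# Summarized data: the last row stands for n identical observations.
n <- 10
x <- c(1, 2, 3, 4)
y <- c(1, 2, 5, 4)
w <- c(1, 1, 1, n)

fit_w    <- lm(y ~ x, weights = w)      # summarized data, replication weights
fit_full <- lm(rep(y, w) ~ rep(x, w))   # data expanded back to sum(w) rows

# The point estimates agree.
stopifnot(isTRUE(all.equal(unname(coef(fit_w)), unname(coef(fit_full)))))

# The residual degrees of freedom do not: 4 - 2 versus 13 - 2.
df_w    <- df.residual(fit_w)
df_full <- df.residual(fit_full)

# Both fits share the same (weighted) residual sum of squares, so the
# standard errors differ only by the factor sqrt(df_w / df_full).
se_w    <- coef(summary(fit_w))[, "Std. Error"]
se_full <- coef(summary(fit_full))[, "Std. Error"]
se_adj  <- se_w * sqrt(df_w / df_full)
print(cbind(se_w, se_adj, se_full))

# Multiplying all weights by a constant leaves the standard errors unchanged,
# since lm() treats weights as relative precisions rather than counts.
se_w5 <- coef(summary(lm(y ~ x, weights = 5 * w)))[, "Std. Error"]
```

The adjusted column matches the full-data standard errors here only because the summarized rows are exact replicates; when y_i is the mean of w_i distinct observations, the within-group variation is lost and no such rescaling recovers it.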

OK?

-pd
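
P.S. A small sketch of why the binomial glm() example quoted below behaves differently. The binomial family fixes the dispersion at 1, so no sigma is estimated from the residual degrees of freedom and replication weights give the usual standard error. The variable names are made up for illustration.

```r
# Three encodings of 51 successes in 100 Bernoulli trials.
y2 <- c(0, 1)
w2 <- c(49, 51)
fit_rep <- glm(y2 ~ 1, weights = w2, family = binomial)   # replication weights

y1 <- 0.51
w1 <- 100
fit_prop <- glm(y1 ~ 1, weights = w1, family = binomial)  # proportion + trials

y_all <- rep(y2, w2)
fit_all <- glm(y_all ~ 1, family = binomial)              # one row per trial

# Intercept (log odds) and its standard error for each encoding.
est <- sapply(list(rep = fit_rep, prop = fit_prop, full = fit_all),
              function(f) c(coef = unname(coef(f)),
                            se   = unname(sqrt(diag(vcov(f)))[1])))
print(est)
```

All three columns should agree up to convergence tolerance. The residual deviance and its degrees of freedom still differ between the encodings, so only the coefficient table is comparable.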

> On 12 Oct 2017, at 13:48 , Arie ten Cate <arietencate@gmail.com> wrote:
> 
> OK. We have now three suggestions to repair the text:
> - remove the text
> - add "not" at the beginning of the text
> - add at the end of the text a warning; something like:
> 
> "Note that in this case the standard errors of the parameter estimates
> are in general not correct, and hence neither are the t values and the
> p values. Also, the number of degrees of freedom is not correct. (The
> parameter estimates themselves are correct.)"
> 
> A remark about the glm example: the Reference manual says: "For a
> binomial GLM prior weights are used to give the number of trials when
> the response is the proportion of successes ....".  Hence in the
> binomial case the weights are frequencies.
> With y <- 0.51 and w <- 100 you get the same result.
> 
> Arie
> 
> On Mon, Oct 9, 2017 at 5:22 PM, peter dalgaard <pdalgd@gmail.com> wrote:
> > AFAIR, it is a little more subtle than that.
> > 
> > If you have replication weights, then the estimates are right, it is "just"
> > that the SE from summary.lm() are wrong. Somehow, the text should reflect this.
> >
> > It is of some importance when you put glm() into the mix, because you can in
> > fact get correct results from things like
> >
> > y <- c(0,1)
> > w <- c(49,51)
> > glm(y~1, weights=w, family=binomial)
> > 
> > -pd
> > 
> > > On 9 Oct 2017, at 07:58 , Arie ten Cate <arietencate@gmail.com> wrote:
> > > 
> > > Yes.  Thank you; I should have quoted it.
> > > I suggest removing this text or adding the word "not" at the beginning.
> > > 
> > > Arie
> > > 
> > > On Sun, Oct 8, 2017 at 4:38 PM, Viechtbauer Wolfgang (SP)
> > > <wolfgang.viechtbauer@maastrichtuniversity.nl> wrote:
> > > > Ah, I think you are referring to this part from ?lm:
> > > > 
> > > > "(including the case that there are w_i observations equal to y_i and the
> > > > data have been summarized)"
> > > >
> > > > I see; indeed, I don't think this is what 'weights' should be used for (the
> > > > other part before that is correct). Sorry, I misunderstood the point you
> > > > were trying to make.
> > > >
> > > > Best,
> > > > Wolfgang
> > > > 
> > > > -----Original Message-----
> > > > From: R-devel [mailto:r-devel-bounces@r-project.org] On Behalf Of Arie ten Cate
> > > > Sent: Sunday, 08 October, 2017 14:55
> > > > To: r-devel@r-project.org
> > > > Subject: [Rd] Discourage the weights= option of lm with summarized data
> > > > 
> > > > Indeed: Using 'weights' is not meant to indicate that the same
> > > > observation is repeated 'n' times.  As I showed, this gives erroneous
> > > > results. Hence I suggested that it is discouraged rather than
> > > > encouraged in the Details section of lm in the Reference manual.
> > > > 
> > > > Arie
> > > > 
> > > > -----Original Message-----
> > > > On Sat, 7 Oct 2017, wolfgang.viechtbauer@maastrichtuniversity.nl wrote:
> > > > 
> > > > Using 'weights' is not meant to indicate that the same observation is
> > > > repeated 'n' times. It is meant to indicate different variances (or to
> > > > be precise, that the variance of the last observation in 'x' is
> > > > sigma^2 / n, while the first three observations have variance
> > > > sigma^2).
> > > > 
> > > > Best,
> > > > Wolfgang
> > > > 
> > > > -----Original Message-----
> > > > From: R-devel [mailto:r-devel-bounces@r-project.org] On Behalf Of Arie ten Cate
> > > > Sent: Saturday, 07 October, 2017 9:36
> > > > To: r-devel@r-project.org
> > > > Subject: [Rd] Discourage the weights= option of lm with summarized data
> > > > 
> > > > In the Details section of lm (linear models) in the Reference manual,
> > > > it is suggested to use the weights= option for summarized data. This
> > > > must be discouraged rather than encouraged. The motivation for this is
> > > > as follows.
> > > > 
> > > > With summarized data the standard errors get smaller with increasing
> > > > numbers of observations. However, the standard errors in lm do not get
> > > > smaller when, for instance, all weights are multiplied by the same
> > > > constant larger than one, since the inverse weights are merely
> > > > proportional to the error variances.
> > > > 
> > > > Here is an example of the estimated standard errors being too large
> > > > with the weights= option. The p value and the number of degrees of
> > > > freedom are also wrong. The parameter estimates are correct.
> > > > 
> > > > n <- 10
> > > > x <- c(1,2,3,4)
> > > > y <- c(1,2,5,4)
> > > > w <- c(1,1,1,n)           # last row summarizes n identical observations
> > > > xb <- c(x,rep(x[4],n-1))  # restore the original data
> > > > yb <- c(y,rep(y[4],n-1))
> > > > print(summary(lm(yb ~ xb)))           # full data: 11 residual df
> > > > print(summary(lm(y ~ x, weights=w)))  # summarized data: 2 residual df
> > > > 
> > > > Compare with PROC REG in SAS, with a WEIGHT statement (like R) and a
> > > > FREQ statement (for summarized data).
> > > > 
> > > > Arie
> > > > 
> > > > ______________________________________________
> > > > R-devel@r-project.org mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-devel
> > 
> > --
> > Peter Dalgaard, Professor,
> > Center for Statistics, Copenhagen Business School
> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > Phone: (+45)38153501
> > Office: A 4.23
> > Email: pd.mes@cbs.dk  Priv: PDalgd@gmail.com

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes@cbs.dk  Priv: PDalgd@gmail.com

