List:       r-sig-mixed-models
Subject:    [R-sig-ME] negatively biased variance estimates of random factors with few levels
From:       "kalman.toth via R-sig-mixed-models" <r-sig-mixed-models@r-project.org>
Date:       2024-02-18 14:25:56

Dear List Members,

On the glmmFAQ webpage Ben Bolker writes:

"Treating factors with small numbers of levels as random will in the best case lead \
to very small and/or imprecise estimates of random effects; in the worst case it will \
lead to various numerical difficulties such as lack of convergence, zero variance \
estimates, etc.. (A small [simulation exercise](https://rpubs.com/bbolker/4187) shows \
that at least the estimates of the standard deviation are downwardly biased in this \
case; it's not clear whether/how this bias would affect the point estimates of fixed \
effects or their estimated confidence intervals.) In the classical method-of-moments \
approach these problems may not arise (because the sums of squares are always well \
defined as long as there are at least two units), but the underlying problems of lack \
of power are there nevertheless. "

I am struggling to understand the reasons behind the potential bias in REML estimates
when dealing with a small number of factor levels. To explore this issue, I modified
Ben Bolker's simulation as follows:

- set the sd1/sd2 ratio to 5/1 (to reduce the number of zero variance estimates, so
that truncation at the zero boundary cannot be the source of the bias)

- compared the REML estimates with ANOVA (method-of-moments) estimates, because 1) the
ANOVA variance estimates are expected to be unbiased and 2) for these balanced
datasets the two estimates should be identical (the estimator is written out right
after this list)

- extended the simulation to sample sizes (numbers of f1 levels) of 3, 5, 10 and 20
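
For reference, the ANOVA estimate computed below is the standard method-of-moments
formula for a balanced one-way layout; a minimal sketch (the helper name mom_sd is
mine, not part of the original simulation):

mom_sd <- function(MSb, MSw, n2) {
  # between-group variance = (MS_between - MS_within) / n2, where n2 is the
  # number of replicates per group; sqrt() returns NaN whenever MSb < MSw
  sqrt((MSb - MSw) / n2)
}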

My conclusions based on the extended simulation are:

- Consistency of REML and ANOVA estimates: the estimates are indeed the same

- Negative bias: Both methods give negatively biased estimates!?

- Impact of sample size: the magnitude of the bias decreases as the sample size
increases (11, 7, 4, and 1% for sample sizes 3, 5, 10 and 20, respectively), but even
sample sizes as large as 20 or 50 may still give negatively biased estimates

I am particularly interested in how to treat factors that are conceptually viewed as
random effects but have only a few levels. This scenario is common in the datasets I
work with. I understand that power can be an issue and that the SE can be large when
the sample size is small. But why would both the ANOVA and the REML estimates be
biased? And what could be done about the bias, apart from treating the factor as
fixed? (I prefer to treat them as fixed if they really have few levels, e.g. 3-4, but
not with 8-10 levels.)

Regards,

Kalman

Please find the modified simulation below:

nrep <- 5  # number of levels of the replicate factor f2

simfun <- function(n1 = 20, n2 = nrep, sd1 = 1, sd2 = 0.2) {
  # n1 levels of the random factor f1, crossed with n2 levels of f2
  d <- expand.grid(f1 = factor(seq(n1)), f2 = factor(seq(n2)))
  # one group effect per f1 level, recycled across f2 (f1 varies fastest)
  u1 <- rnorm(n1, mean = 0, sd = sd1)
  d$y <- rnorm(n1 * n2, mean = u1, sd = sd2)
  d
}
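
## Quick look at the simulated layout (this example is my addition, not part
## of the original script): each f1 level appears once at every f2 level,
## sharing one group effect u1 per f1 level
head(simfun(n1 = 3, n2 = 2))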

library(lme4)

fitfun <- function(d = simfun()) {
  # REML estimate of the f1 standard deviation
  lme <- sqrt(unlist(VarCorr(lmer(y ~ (1 | f1), data = d))))
  # ANOVA (method-of-moments) estimate from the one-way mean squares
  atab <- summary(aov(y ~ f1, data = d))[[1]]
  MSb <- atab['f1', 'Mean Sq']
  MSw <- atab['Residuals', 'Mean Sq']
  n2 <- nlevels(d$f2)  # replicates per f1 level (was the global 'nrep')
  an <- sqrt((MSb - MSw) / n2)  # NaN when MSb < MSw
  list(lme = lme, anova = an)
}
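
## Sanity check (my addition): on a single simulated dataset the REML and
## method-of-moments estimates should agree up to numerical tolerance
str(fitfun(simfun(n1 = 5)))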

set.seed(101)
nsim <- 500
sd_dist0 <- replicate(nsim, fitfun())  # n1 = 20
sd_dist1 <- replicate(nsim, fitfun(simfun(n1 = 10)))
sd_dist2 <- replicate(nsim, fitfun(simfun(n1 = 5)))
sd_dist3 <- replicate(nsim, fitfun(simfun(n1 = 3)))
sd_List <- list(lme.20   = unlist(sd_dist0[1, ]),
                lme.10   = unlist(sd_dist1[1, ]),
                lme.5    = unlist(sd_dist2[1, ]),
                lme.3    = unlist(sd_dist3[1, ]),
                anova.20 = unlist(sd_dist0[2, ]),
                anova.10 = unlist(sd_dist1[2, ]),
                anova.5  = unlist(sd_dist2[2, ]),
                anova.3  = unlist(sd_dist3[2, ]))


plotfun <- function(x, title) {
  par(las = 1, bty = "l")
  hist(x, breaks = 50, col = "gray", main = title, xlab = "est. sd", freq = FALSE)
}

# 2 x 4 grid: REML estimates (top row) and ANOVA estimates (bottom row),
# for n1 = 20, 10, 5, 3 from left to right
par(mfrow = c(2, 4))
invisible(mapply(plotfun, sd_List, names(sd_List)))

sapply(sd_List, function(x) mean(x == 0 | is.na(x)))  # proportion of zero or NA estimates

sfun <- function(x) {
  r <- list(mean  = mean(x, na.rm = TRUE),
            mean2 = mean(x[which(x > 1e-5)]),  # mean excluding (near-)zero estimates
            sem   = sd(x, na.rm = TRUE) / sqrt(length(x)))
  r <- with(r, c(r, list(lwr = mean - 2 * sem, upr = mean + 2 * sem)))
  unlist(r)
}
print(s_tab <- sapply(sd_List, sfun), digits = 3)

# percent bias relative to the true sd1 = 1
bias_pct  <- round((s_tab["mean", ] - 1) * 100)
bias_pct2 <- round((s_tab["mean2", ] - 1) * 100)
bias_pct
bias_pct2  # bias excluding the zero sd estimates

_______________________________________________
R-sig-mixed-models@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

