Model Misspecification and Assumption Violations With the Linear Mixed Model: A Meta-Analysis

Abstract: This meta-analysis attempts to synthesize the Monte Carlo (MC) literature for the linear mixed model under a longitudinal framework. The meta-analysis aims to inform researchers about conditions that are important to consider when evaluating model assumptions and adequacy. In addition, the meta-analysis may be helpful to those wishing to design future MC simulations in identifying simulation conditions. The current meta-analysis will use the empirical type I error rate as the effect size, and MC simulation conditions will be coded to serve as moderator variables. The type I error rate for the fixed and random effects will be explored as the primary dependent variable. Effect sizes were coded from 13 studies, resulting in a total of 4,002 and 621 effect sizes for fixed and random effects, respectively. Meta-regression and proportional odds models were used to explore variation in the empirical type I error rate effect sizes. Implications for applied researchers and researchers planning new MC studies will be explored.

Keywords: linear mixed model, longitudinal data, type I error rate, meta-analysis

The University of Iowa, Iowa City, USA
Corresponding Author: Brandon LeBeau, The University of Iowa, 311 Lindquist Center, Iowa City, IA 52242, USA. Email: brandon-lebeau@uiowa.edu
Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (http://www.creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).

Introduction

The linear mixed model (LMM), also commonly referred to as a multilevel model (Goldstein, 2010) or hierarchical linear model (Raudenbush & Bryk, 2002), is an extension of the multiple regression model to account for cluster dependency arising from nested designs. Included within nested designs are longitudinal designs, where repeated measurements are nested within an individual. This data setup serves as the primary focus of this article. These models were first introduced in the early 1980s (Laird & Ware, 1982), and the rapid improvement in computational power has helped them become a popular data analysis method for researchers.

The LMM takes the following general matrix form:

Y_j = X_j β + Z_j b_j + e_j.  (1)

This is very similar to the multiple regression model, except that there are now additional terms, Z_j b_j. The b_j are random effects, which serve as additional residual terms and represent cluster-specific deviations from the average growth curve, and Z_j represents the design matrix for the random effects. The rest of the terms in the model are identical to multiple regression: Y_j represents the dependent variable for cluster j, X_j is the design matrix of covariates, β is a vector of fixed effects, and e_j is a vector of within-cluster residuals (i.e., residuals for every observation).

The random components of the LMM (i.e., b_j and e_j) are commonly assumed to be identically and independently normally distributed with means of zero and a specified variance matrix. These common assumptions can be summed up as follows: b_j ∼ iid N(0, G) and e_j ∼ iid N(0, σ²) (Raudenbush & Bryk, 2002). In addition, the independence assumption of the within-cluster residuals (i.e., e_j) is conditional on the random effects specified in the model (Browne & Goldstein, 2010). These random effects are what account for the dependency due to repeated measures, although it is possible to allow the within-cluster residuals to be correlated due to the time lag in repeated measurements, a phenomenon called serial correlation (Diggle, Heagerty, Liang, & Zeger, 2002). This serial correlation may be especially important if the time lag between measurement occasions is short (Browne & Goldstein, 2010).
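The data-generating structure in Equation 1 and the distributional assumptions above can be illustrated with a small simulation. The sketch below is a minimal illustration only, written in Python rather than the R tooling used in the article itself; the sample sizes, variances, and coefficients are made-up values, assuming a random intercept and random slope for time with a diagonal G matrix.

```python
import random

random.seed(1)

n_clusters, n_occasions = 50, 6   # hypothetical sample sizes
beta = [10.0, 0.5]                # fixed effects: intercept and slope for time
g_sd = [2.0, 0.3]                 # SDs of random intercept and random slope (diagonal G)
sigma = 1.0                       # SD of the within-cluster residuals

data = []
for j in range(n_clusters):
    # b_j ~ N(0, G): cluster-specific deviations from the average growth curve
    b0 = random.gauss(0, g_sd[0])
    b1 = random.gauss(0, g_sd[1])
    for t in range(n_occasions):
        e = random.gauss(0, sigma)                    # e ~ N(0, sigma^2), independent here
        y = (beta[0] + b0) + (beta[1] + b1) * t + e   # Equation 1 for one observation
        data.append((j, t, y))
```

Adding serial correlation would amount to making the within-cluster residuals e correlated across adjacent occasions rather than drawing them independently, as in the AR-type structures discussed below.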
The statistical assumptions for the LMM are difficult to assess analytically due to the computationally intensive and iterative procedure for obtaining model estimates. In addition, when normality of the random components is not assumed, the mathematics becomes increasingly difficult and intractable. As a result, Monte Carlo (MC) methods have been used to explore the relationship between assumption violations and model performance (Skrondal, 2000). MC studies have the advantage of strong internal validity due to the researcher directly manipulating the conditions of interest. This direct manipulation is not unlike true experiments, where the researcher can isolate the source of problems when estimating parameters (Skrondal, 2000).

The major drawback of MC studies is the potential lack of external validity (Skrondal, 2000). This weakness stems from the MC results being conditional on the conditions chosen for study. For example, if researchers conducting an MC study only simulate the random effects coming from a normal or a chi-square(1) distribution, the question must be asked whether the study results can be generalized beyond those two distributions. Hoaglin and Andrews (1975) started a discussion of best practices when reporting and conducting MC studies. More recently, Paxton, Curran, Bollen, Kirby, and Chen (2001) and Skrondal (2000) have offered design considerations to improve external validity. Many of the recommendations surround reducing the number of replications to increase the coverage of the simulation conditions (Skrondal, 2000).

Although the papers by Paxton et al. (2001) and Skrondal (2000) offer suggestions for improving new MC studies, the design considerations from past MC studies cannot be altered to improve their external validity. As such, the current study aims to leverage prior MC studies to help improve the external validity of these studies, better understand gaps in simulation conditions, and succinctly inform applied researchers of assumption violations that can greatly affect study results. This article aims to accomplish these three goals by quantitatively synthesizing the MC longitudinal LMM literature with a meta-analysis. The meta-analysis allows for the pooling of study conditions across numerous MC studies to increase sample size and the depth of coverage of simulation conditions. In addition, many MC studies only report descriptive statistics for the study results and may miss complex interaction effects found through inferential modeling. A meta-regression was performed to overcome this limitation of some of the MC literature. A similar study for the one- and two-factor ANOVA models was done by Harwell, Rubinstein, Hayes, and Olds (1992).

Common Simulation Conditions With the LMM

MC studies exploring assumption violations with the LMM have focused primarily on the impact of nonnormal random effect distributions (LeBeau, 2012, 2013; Maas & Hox, 2004, 2005), the effect of serial correlation (Browne, Draper, Goldstein, & Rasbash, 2002; Ferron, Dailey, & Yi, 2002; Kwok, West, & Green, 2007; LeBeau, 2012; Murphy & Pituch, 2009), missing data (Black, Harel, & McCoach, 2011; Kwon, 2011; Mallinckrodt, Clark, & David, 2001), and estimation method (Delpish, 2006; Overall & Tonidandel, 2010). The random effect distributions simulated tend to be normally distributed, skewed (such as a chi-square distribution), or heavy tailed (such as a t or Laplace distribution). The serial correlation structures also tend to fall into three categories: independent structures, autoregressive (AR) type structures, or banded structures (such as the moving average models).

Most MC studies include sample size as a simulation condition, where, surprisingly, there is little variation in the sample sizes used. The number of repeated measures is commonly less than 10, and the number of clusters is rarely larger than 100. The choice of sample size conditions is likely a function of using maximum likelihood for estimation. Maximum likelihood is an asymptotic estimation method; as such, understanding how the estimation method behaves for small samples is informative, and increasing sample size would likely improve estimation.

The number of fixed and random effects is another aspect of the simulation design that is chosen by the researcher. Unfortunately, these are two design choices that are commonly not manipulated directly. Instead, the number of fixed effects and random effects are held constant across studies. In addition, there is even less variation in the number of fixed and random effects chosen by researchers. It is uncommon for the number of fixed effects to be larger than six and the number of random effects to be larger than two.

The advantage of the current meta-analysis is the attempt to combine unique design choices made by independent researchers. This can offer some insight into data conditions that were not manipulated directly in a single MC study (such as the number of fixed effects), but do vary across studies. In addition, the slightly distinctive design choices made for a single manipulated condition (such as sample size or random effect distribution) can again be combined to explore whether one condition has a larger effect than others. This can aid applied researchers in their ability to understand the implications when certain model assumptions are not met in practice. In addition, it can help inform researchers designing MC studies about gaps in the literature that would be worthy of further study.
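The simulation conditions described above form a factorial grid whose cells multiply quickly. A minimal sketch of how such a design space is enumerated, using made-up condition levels in the spirit of the designs discussed (not coded from any particular study):

```python
from itertools import product

# Illustrative condition levels; the specific values are hypothetical.
cluster_sizes = [25, 50, 100]                      # number of clusters (individuals)
repeated_measures = [4, 6, 8]                      # measurement occasions per individual
re_distributions = ["normal", "chi-square(1)", "Laplace"]
serial_structures = ["independent", "AR(1)", "MA(1)"]

cells = list(product(cluster_sizes, repeated_measures,
                     re_distributions, serial_structures))
replications = 500                                  # replications per cell
total_model_fits = len(cells) * replications        # 81 cells -> 40,500 model fits
```

The tension noted by Skrondal (2000) is visible here: with a fixed computational budget, lowering the replications per cell frees resources to widen the grid of conditions, improving coverage of the design space.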
Assumption Violations With the LMM

Results of the MC studies with assumption violations can be grouped into three categories. Estimation of the fixed effects tends to be unbiased regardless of the assumption violations (e.g., Kwok et al., 2007; LeBeau, 2013; Maas & Hox, 2004; Murphy & Pituch, 2009). This has been shown with nonnormal random effect distributions (LeBeau, 2013; Maas & Hox, 2004), different sample sizes (LeBeau, 2013; Maas & Hox, 2004, 2005), with the presence of serial correlation (Kwok et al., 2007; LeBeau, 2012; Murphy & Pituch, 2009), and with different estimation algorithms (Delpish, 2006). Therefore, if the researcher is solely interested in estimates of the fixed effects, little care about assumption violations is needed.

However, if the researcher is interested in estimates of the random effects or in inference with the LMM, then specific attention needs to be paid to assumption violations. Nonnormal random effects can inflate estimates of the random effects, especially in small sample size conditions (LeBeau, 2013; Maas & Hox, 2004). In addition, not modeling serial correlation when present can also cause an inflation in the random effects and underestimate the standard errors for the fixed effects (Kwok et al., 2007; LeBeau, 2012; Murphy & Pituch, 2009). This can lead to an inflation in the empirical type I error rate. Finally, misspecifying, specifically underspecifying, the random effect structure can also lead to severe inflation of the empirical type I error rates for fixed effects (LeBeau, 2012).

These results suggest that checking model assumptions is important when researchers are interested in conducting inference, which likely encompasses most applied researchers. In addition, inflated estimates of the variance of the random components can lead researchers to include predictors to explain variation when, in reality, this variation is smaller than expected. A quantitative synthesis can be informative for applied researchers to show which assumption violations are crucial to achieving valid inferences. The MC literature can also be better informed through the ability of this study to include moderator variables that were not directly manipulated within a study, but varied between studies, such as the number of fixed effects.

Research Questions

Based on the prior MC literature, the following research questions were explored in the current meta-analysis:

Research Question 1: Is there evidence that the type I error rate is different from the nominal rate of 0.05 for fixed and random effects?
Research Question 2: To what extent does the independence assumption of the within-cluster residuals affect the empirical type I error rate?
Research Question 3: To what extent does the normality assumption of the random effects affect the empirical type I error rate?
Research Question 4: To what extent do the MC study characteristics moderate the relationships found in the above questions?

Method

Data Collection

Articles, dissertations, conference papers, or unpublished documents were gathered to attempt to answer the research questions above. Documents were selected if they were simulation studies that reported empirical type I error rates for the fixed effects. Only studies with continuous outcome variables were included to keep the comparison consistent. The simulation studies must include longitudinal data conditions, specifically multiple measurement occasions for individuals and sample sizes that are often smaller than those of cross-sectional models (Singer & Willett, 2003).

Based on the above criteria, the population of studies is defined as all possible MC LMM studies exploring data conditions similar to longitudinal studies using a continuous dependent variable. An initial search was performed in March 2012, and follow-up searches were performed in April 2013 and June 2014. Articles were selected for relevance based on their titles. Abstracts from the articles selected by their titles were read to determine their relevance. If a study met the criteria established above, it was set aside to be read and to have information coded from it.

A Boolean search was used with the Eric, PsycInfo, and Dissertation Abstract databases to search for documents to be coded. The Boolean search string took the following form: ("Monte Carlo" or simulation) and ("linear mixed model" or "hierarchical linear model" or "mixed effects" or "mixed-effects" or longitudinal or LMM or HLM or LMEM) and not generalized and not (nonlinear or Bayesian or SEM or "structural equation model").

The search identified 223 articles for review. Of those 223 articles, a total of 25 were selected for inclusion and were read further for the meta-analysis. There were three primary reasons why studies were excluded: (a) the article did not report empirical type I error rates (or it was not an outcome), (b) the study was cross-sectional, or (c) the study was done in the structural equation modeling framework.

After studies were found in the above manner, Google Scholar was used to find articles that cited them. Footnote chasing was also used by exploring the titles in the reference lists of the articles selected. The titles and abstracts of studies identified through Google Scholar were screened for inclusion.

From each of the sampled studies, effect sizes and a number of study characteristics were coded to help with the data analysis step of the meta-analysis. Three independent readers, who had completed PhD training or were in a PhD program related to quantitative methods, completed the coding of the studies included in this meta-analysis. One individual coded all the studies, and the other two individuals independently coded approximately half of the studies to evaluate coding consistency. Internal consistency was high on all coded variables, including the primary dependent variable, the empirical type I error rate. For the primary dependent variable, 98% of the values were coded the same across coders. Those that did not match were reviewed by looking at the original study for verification of the correct values.
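The double-coding check described above amounts to a simple percent-agreement computation between the primary coder and a second coder. A minimal sketch with made-up coded values (the study itself reports 98% agreement):

```python
# Hypothetical coded type I error rates from two coders for the same
# simulation cells; the values here are invented for illustration.
coder_1 = [0.050, 0.063, 0.048, 0.110, 0.052]
coder_2 = [0.050, 0.063, 0.048, 0.101, 0.052]

matches = sum(a == b for a, b in zip(coder_1, coder_2))
percent_agreement = 100 * matches / len(coder_1)
# Any disagreement (the fourth cell here) would be resolved by
# checking the value against the original article.
```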
Data Evaluation

Studies were first checked to ensure that they contained the dependent variable of interest, the empirical type I error rate. If a study did not contain the empirical type I error rate, or if it used a nominal type I error rate other than 0.05, the study was not coded. The studies were then checked for methodological/coding flaws. Evidence that the data were generated accurately was explored first. If a study appeared to show inaccuracies when the model assumptions had been met, compared with the body of MC literature, it was to be excluded from the sample of studies due to severe methodological or coding flaws. No studies were removed due to methodological or coding flaws.

All studies were read one at a time to code the variables of interest. Each MC study contributes many effect sizes to the meta-analysis; however, due to the quasi-random number generation within each study, the effect sizes are assumed to be independent within a study. There may still be coder or author effects that need to be considered, and this potential dependency was adjusted for by using an LMM, as discussed in more detail below.

Accuracy in coding was checked to ensure that all the effect sizes were coded properly. Summary statistics and plots were used to examine the distribution of effect sizes, looking for possible extreme values caused by erroneous coding. Very large or small empirical type I error rates were checked against the values published in the manuscripts. After errors were corrected, exploratory and inferential data analyses were used to attempt to explain variation in effect sizes. The coded variables and analyses are described in more detail below.

Dependent Variable

The primary dependent variable in the current meta-analysis was the empirical type I error rate for each condition in the MC studies. The type I error rate is commonly reported as the proportion of tests that reject a true null hypothesis. Statistical theory informs us that the proportion of rejected tests when the null hypothesis is true should be very close to the α value set by the researcher. Deviations in the proportion of tests that reject a true null hypothesis from the α value reflect problems in estimation; as a result, hypothesis tests are too conservative (proportion less than α) or liberal (proportion greater than α).

Independent Variables

The primary independent variables were the conditions that the MC studies directly manipulated. Variables commonly manipulated are the cluster sample size (i.e., how many individuals), the presence of serial correlation, the kind of serial correlation structure assumed, and the number of measurement occasions. Other conditions that are commonly not manipulated within an MC study but that were coded included how many fixed effects are in the model, the number of random effects in the model, the number of replications within a cell of the study design, the estimation method (e.g., full information maximum likelihood, restricted maximum likelihood [REML]), and whether the design was balanced or unbalanced.

The random effects are commonly assumed to follow a normal distribution, and violating this assumption has been studied thoroughly with numerous MC studies. The simulated random effect distribution was coded. In addition to the name of the coded distribution, the theoretical and empirical skewness and kurtosis values were coded for the random effect distribution. These independent variables were used to help determine whether the skewness or the kurtosis of the distribution has a larger impact on the type I error rate.

Data Analysis

Exploratory data analyses were used to explore variation in the effect sizes. If significant variation in the empirical type I error rates was found, an LMM was used to see whether any moderator variables explained variation in the type I error rates. The LMM was chosen due to the hierarchical structure of the empirical type I error rates, which is illustrated in Figure 1.

Figure 1. Data structure for model.

The empirical type I error rates are proportions. As a result, the variance is a function of the specific value of the empirical type I error rate, and the sampling distribution is unlikely to be normally distributed. Therefore, the empirical type I error rates were transformed using the Freeman–Tukey transformation (Freeman & Tukey, 1950). This transformation takes the following form:

t_k = (1/2) [arcsin √(x_k / (n_k + 1)) + arcsin √((x_k + 1) / (n_k + 1))],  (2)

where t_k is the transformed proportion, x_k is the number of type I errors, and n_k is the total number of replications.

Many articles reported the empirical type I error rate as a proportion, not as the number of type I errors for each cell of the MC design. To calculate the transformation, the number of type I errors made in each cell of the design was found by taking x_k = n_k × π_k, where π_k is the empirical type I error rate reported by the study.

The variance of the transformed proportions (t_k) is

v_k = 1 / (4 n_k + 2),  (3)

where v_k is the variance and n_k is the number of replications.
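A minimal sketch of Equations 2 and 3, including the x_k = n_k × π_k conversion for studies that reported only proportions. Python is used here purely for illustration (the article's own computations used R), and the example values are made up:

```python
import math

def freeman_tukey(x, n):
    # Freeman-Tukey double arcsine transformation (Equation 2)
    return 0.5 * (math.asin(math.sqrt(x / (n + 1)))
                  + math.asin(math.sqrt((x + 1) / (n + 1))))

def ft_variance(n):
    # Sampling variance of the transformed proportion (Equation 3)
    return 1 / (4 * n + 2)

# A hypothetical cell reporting an empirical type I error rate of 0.06
# over 500 replications; the count of type I errors is backed out first.
n_k = 500
pi_k = 0.06
x_k = round(n_k * pi_k)      # x_k = n_k * pi_k
t_k = freeman_tukey(x_k, n_k)
v_k = ft_variance(n_k)
```

Note that the variance in Equation 3 depends only on the number of replications, which is what makes the transformed proportions convenient as effect sizes with known sampling variances.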
Miller (1978) defined a back-transformation to convert t_k back into the raw proportion metric, defined as follows:

π_k = (1/2) {1 − sgn(cos 2t_k) √(1 − [sin 2t_k + (sin 2t_k − 1/sin 2t_k) / n_k]²)},  (4)

where π_k is the back-transformed empirical type I error rate, t_k is the transformed value from Equation 2, and n_k is the number of replications. Results will be back-transformed to the empirical type I error rate metric for use in figures and tables.

Inferential model. The LMM was fitted with REML, as this has been shown to produce less biased estimates of the random components (Fitzmaurice, Laird, & Ware, 2004; Raudenbush, 2009). The LMM took the following general form:

t_k = β_0 + β_1 X_1k + ... + β_t X_tk + b_k + e_k.  (5)

In Equation 5, t_k represents the transformed empirical type I error rate coded from the articles. β_0 is an intercept, and β_1, ..., β_t represent the relationships between the predictor variables, X_1k, ..., X_tk, and the dependent variable, t_k. Finally, this model contains b_k, which represent study-specific residual terms that are assumed to follow a normal distribution with mean zero and variance τ². The e_k represent sampling errors that are assumed to follow a normal distribution with mean zero and known sampling variance v_k calculated from Equation 3 above. Within a study, the empirical type I error rates were treated as independent due to the quasi-random number generation used by MC studies (Rubinstein & Kroese, 2016). The moderators chosen for inclusion in the LMM were informed by the exploratory data analysis.

First, an omnibus model with no covariates was used to explore the heterogeneity in the effect sizes. The Q test was used to assess the amount of heterogeneity (Cooper, Hedges, & Valentine, 2009). If this test was significant, covariates were added to attempt to explain variation in the effect sizes with a meta-regression (Cooper et al., 2009). The covariates take the form of the simulation conditions that were coded and discussed above. Descriptive analyses helped to inform which covariates were included in the model. Significant predictors were identified when the z value was greater than 2.33 in absolute value, representing a p value less than .01. This level of significance was selected to help control for compounding type I error rates from the many tests of predictors in the analysis and to better reflect covariates of practical significance.

To assess the amount of explanatory power of the predictors, an R²_Meta statistic defined by Aloe, Becker, and Pigott (2010) was used. The statistic takes the following form:

R²_Meta = 1 − (τ̂²_cond / τ̂²_uncond),  (6)

where τ̂²_cond and τ̂²_uncond represent the conditional and unconditional estimates of the between-study variation.

A similar analysis to that described above was also done using the empirical type I error rate for the random effects. In this analysis, the dependent variable was the empirical type I error rate of the random effects, and the independent variables were the simulation conditions described in more detail above. There were fewer studies that examined the empirical type I error rate for the random effects; therefore, many coded study conditions did not have variation and were omitted from the model.

Proportional Odds Models (POMs). POMs (Yee, 2015) were also explored to further attempt to understand variation in the empirical type I error rates for the fixed effects. A new dependent variable was defined that represented three ordinal categories. These ordinal categories represented conservative tests (with a level of significance less than .05), accurate tests, and liberal tests (with a level of significance greater than .05).

To determine which group each observation belonged in, confidence intervals (CIs) were created for the empirical type I error rate reported for each condition. These took the following form:

CI = π_k ± 2 × √(π_k (1 − π_k) / n_k),  (7)

where π_k represents the empirical type I error rate for the fixed effects and n_k represents the number of replications for each simulation condition coded. If 0.05 was contained by the CI, then the dependent variable was coded as 1. If 0.05 was greater than the upper bound of the CI, the dependent variable took a value of 0 (an example of a conservative test); if it was less than the lower bound of the CI, the dependent variable took a value of 2 (an example of a liberal test). There was a total of 162 (4%) that were below the lower bound of the CI, 2,897 (72%) that were within the confidence band, and 943 (24%) that were above the confidence band.

This variable was then modeled using a POM with three categories. This model took the following general form:

log[P(Y ≤ c) / (1 − P(Y ≤ c))] = α_c + β X_k,  (8)

where the log odds of being less than or equal to a given category, c, is modeled as a function of the covariates (X_k). In the current example, where c = 3, there were two cumulative logits that were modeled simultaneously: L_1 = log(π_1 / (π_2 + π_3)) and L_2 = log((π_1 + π_2) / π_3).
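The CI-based ordinal coding in Equation 7 can be sketched as follows, a minimal illustration in Python with hypothetical cells (the article's models themselves were fitted in R):

```python
import math

def classify(pi_k, n_k, nominal=0.05):
    # CI from Equation 7: pi_k +/- 2 * sqrt(pi_k * (1 - pi_k) / n_k)
    half_width = 2 * math.sqrt(pi_k * (1 - pi_k) / n_k)
    lower, upper = pi_k - half_width, pi_k + half_width
    if nominal > upper:
        return 0   # conservative test: empirical rate sits below 0.05
    if nominal < lower:
        return 2   # liberal test: empirical rate sits above 0.05
    return 1       # accurate test: CI contains 0.05

# Hypothetical cells: (empirical type I error rate, replications)
a = classify(0.051, 1000)    # CI contains 0.05 -> accurate
b = classify(0.090, 10000)   # CI excludes 0.05 from below -> liberal
c = classify(0.020, 10000)   # CI excludes 0.05 from above -> conservative
```

Note how the number of replications drives the width of the band: with few replications, even fairly inflated rates remain classified as accurate, which is one reason replications per cell were coded as a moderator.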
If there is evidence that the slopes vary, these will be allowed to vary between the two cumulative logits modeled. This strategy revealed that the covariate, model term (i.e., intercept, linear slope for time, other within slopes), did not satisfy the parallel regression assumption; as a result, these were allowed to vary between the cumulative logits described above.

A POM was also fitted for the empirical type I error rate of the random effects. There was a total of 160 (26%) that were below the lower bound of the CI, 320 (52%) that were within the confidence band, and 132 (22%) that were above the confidence band. As in the fixed-effects analysis, the parallel regression assumption was not tenable for the covariate reflecting the variance component (i.e., within-cluster residual, variance of intercept, etc.), and these were allowed to vary between the cumulative logits.

Software. Data analysis was performed with R (R Core Team, 2015) using the metafor package (Viechtbauer, 2010).

Article summary information for the meta-analysis is shown in Table 1 (two articles, Kwon [2011] and LeBeau [2012], are represented in four rows of Table 1 as they had two studies within a single article). As can be seen from Table 1, there are a total of 4,002 empirical type I error rates for the fixed effects, with an average unweighted type I error rate of 0.063 and 95% CI = [0.055, 0.070]. The distribution of weighted effect sizes was highly concentrated between 0 and 0.1, but there were effect sizes greater than 0.15 and even one effect size larger than 0.4.

Table 1. Article Summary Information of Unweighted Empirical Type I Error Rates for the Fixed Effects and Other Monte Carlo Conditions.

Author | Source | K | Repl | Avg T1 | Med T1 | Min T1 | Max T1 | FE | Range CS | Range wCS
Black (2011) | Jour | 40 | 1,000 | 0.088 | 0.075 | 0.000 | 0.310 | 4 | (50, 50) | (20, 20)
Browne (2000) | Jour | 20 | 930 | 0.058 | 0.053 | 0.003 | 0.099 | 2 | (12, 48) | (18, 18)
Browne (2002) | Jour | 14 | 10,000 | 0.053 | 0.054 | 0.047 | 0.057 | 2 | (65, 65) | (62, 62)
Delpish (2006) | Diss | 64 | 500 | 0.054 | 0.052 | 0.044 | 0.091 | 4 | (30, 100) | (30, 30)
Ferron (2002) | Jour | 192 | 10,000 | 0.054 | 0.052 | 0.045 | 0.079 | 4.75 | (30, 500) | (3, 12)
Kwon (2011) | Diss | 324 | 250 | 0.051 | 0.048 | 0.000 | 0.448 | 6 | (40, 160) | (45, 45)
Kwon (2011) | Diss | 540 | 250 | 0.049 | 0.044 | 0.008 | 0.192 | 10 | (40, 160) | (45, 45)
LeBeau (2013) | Conf | 244 | 300 | 0.059 | 0.057 | 0.033 | 0.100 | 4 | (30, 50) | (6, 12)
LeBeau (2012) | Diss | 1,500 | 500 | 0.063 | 0.061 | 0.015 | 0.119 | 5 | (25, 50) | (6, 8)
LeBeau (2012) | Diss | 750 | 500 | 0.082 | 0.070 | 0.024 | 0.274 | 5 | (25, 25) | (6, 8)
Maas (2004) | Jour | 144 | 1,000 | 0.059 | 0.058 | 0.038 | 0.088 | 4 | (30, 100) | (5, 50)
Maas (2005) | Jour | 108 | 1,000 | 0.054 | 0.054 | 0.037 | 0.075 | 4 | (30, 100) | (5, 50)
Mallinckrodt (2001) | Jour | 32 | 3,000 | 0.059 | 0.058 | 0.050 | 0.072 | 4 | (100, 100) | (7, 7)
Murphy (2009) | Jour | 64 | 10,000 | 0.059 | 0.052 | 0.047 | 0.125 | 4 | (30, 200) | (5, 8)
Overall (2010) | Jour | 66 | 1,500 | 0.062 | 0.063 | 0.039 | 0.087 | 4.33 | (100, 100) | (9, 9)
Total | — | 4,002 | 1,123 | 0.063 | 0.058 | 0.000 | 0.448 | — | — | —

Note. K = number of effect sizes; Repl = replications; T1 = type I error rate; Med = median; Min = minimum; Max = maximum; FE = fixed effects; CS = cluster size; wCS = within-cluster size; Jour = journal article; Diss = dissertation; Conf = conference paper.

Additional summary statistic information for weighted back-transformed empirical type I error rates, separated by various potential moderators, can be seen in Table 2. From the table, there appear to be differences based on many of these moderators. For example, missing a random effect appears to have a strong impact on the type I error rate, with a mean of 0.078 compared with 0.055 when all random effects are modeled. Figure 2 shows the interaction between missing a random effect and the fixed effect term (e.g., intercept, time, or other within). As can be seen from Figure 2, the empirical type I error is only inflated for the fixed effect terms associated with time; the others are similar to one another.

Table 2 also shows that there are some differences in the empirical type I error rate for differing simulated random effect distributions. The empirical type I error rates for the Laplace and chi-square(1) distributions were inflated at 0.067 and 0.066, respectively, compared with the other two distributions at 0.054 and 0.056 for normal and uniform, respectively. The distributions of the empirical type I error rate for the simulated random effect distributions also showed evidence of being positively skewed. This can be seen from the median being less than the mean, and from large maximum values, most notably for the normal, Laplace, and chi-square(1) distributions.

The effect of sample size is difficult to see from Table 2. For many sample sizes, the empirical type I error rate is close to the theoretical value of 0.05. Some deviations occur when the cluster sample size is 25 and the within-cluster sample size is 6, 8, or 20. There may be more complicated effects that underlie the data here, and these differences will be explored in more detail with the inferential model.

Finally, considering the variance and the number of replications, the weighted average empirical type I error rate was 0.058 with a 95% CI = [0.054, 0.062]. In addition, the omnibus Q test was significant, Q(4,001) = 22,798, p < .0001, suggesting that there is significant variation in the empirical type I error rates. The estimate for the between-study variance (i.e., τ²) was 0.003 for the omnibus model using 3,842 empirical type I error rates. The sample size differs compared with Table 1 due to missing data on the cluster and within-cluster sample sizes arising from reporting practices; for example, some articles did not provide tables for the entire factorial research design and instead aggregated over some simulation conditions in their reported tables.

Inferential statistics. Expanding on the significant Q test, simulation conditions were added to the model to attempt to explain variation in the empirical type I error rate. The predictors explained significant variation, as shown by the significant chi-square test for moderators, Q_M(43) = 7,108, p < .0001, R²_Meta = .049. Although the predictors are highly significant based on the moderator chi-square test, the explanatory power of the model is small. This could be attributable to the small amount of variation between studies. The significant predictors can be seen in Table 3.
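Each effect size in this synthesis is an empirical type I error rate, the proportion of Monte Carlo replications rejecting a true null hypothesis, and rates can be classified as conservative, accurate, or liberal against a confidence band around the nominal .05 that depends on the number of replications. A minimal sketch of both calculations (the function names and the normal-approximation band are my own simplifications, not code from the article):

```python
import math
from statistics import NormalDist


def empirical_type1(p_values, alpha=0.05):
    """Proportion of Monte Carlo replications rejecting a true null."""
    return sum(p < alpha for p in p_values) / len(p_values)


def mc_ci(alpha, replications, level=0.95):
    """Normal-approximation band for the rejection proportion under H0."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = math.sqrt(alpha * (1 - alpha) / replications)
    return alpha - z * se, alpha + z * se


def classify(rate, replications, alpha=0.05):
    """Label a simulated rate relative to the Monte Carlo confidence band."""
    lo, hi = mc_ci(alpha, replications)
    if rate < lo:
        return "conservative"
    if rate > hi:
        return "liberal"
    return "accurate"
```

With 1,000 replications the band is roughly (0.037, 0.064), so a rate of 0.078 would be labeled liberal; studies with more replications get a tighter band.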
Note that a handful of predictors were significant but had back-transformed estimates of zero to three decimal places (i.e., 0.000).

The average empirical type I error rate for the intercept term (i.e., initial status) is very close to 0.05, at 0.048. This suggests that, on average, empirical type I error rate control is very good for the reference group; more specifically, this is for dissertations, a normal random effect distribution, and independent fitted and generated serial correlation structures. More simply, this would represent situations where the model assumptions have been adequately met. Assumption violations, such as nonnormal random effects, do not affect the empirical type I error rate for the intercept. This can be seen from Table 3, where the back-transformed values associated with the intercept are zero to two or three decimal places.

On average, the time or linear trend fixed effect did not show evidence of being significantly different from the initial status.

Table 2. Summary Statistics of Weighted Back-Transformed Empirical Type I Error Rate for the Fixed Effects by Parameter, Level of Parameter, Article Source, Cluster Size, Within-Cluster Size, Serial Correlation, Random Effect Distribution, and Missing Random Effect.

Moderator | Avg T1 | Med T1 | Min T1 | Max T1 | K
Term
  Intercept | .056 | .055 | .003 | .448 | 1,835
  Time | .060 | .057 | .000 | .274 | 1,693
  Within | .065 | .066 | .023 | .110 | 474
Level parameter
  Level 1 | .061 | .059 | .003 | .274 | 2,044
  Level 2 | .057 | .055 | .000 | .448 | 1,958
Cluster sample size
  12 | .059 | .079 | .003 | .099 | 10
  25 | .072 | .067 | .021 | .274 | 1,500
  30 | .061 | .060 | .036 | .124 | 264
  40 | .045 | .045 | .009 | .177 | 288
  48 | .048 | .048 | .036 | .058 | 10
  50 | .058 | .058 | .000 | .310 | 922
  65 | .053 | .053 | .046 | .056 | 14
  80 | .048 | .049 | .003 | .224 | 288
  100 | .054 | .052 | .036 | .086 | 258
  160 | .052 | .049 | .013 | .448 | 288
  200 | .050 | .049 | .046 | .057 | 32
  500 | .050 | .050 | .046 | .056 | 40
Random effect distribution
  Chi-square(1) | .066 | .062 | .021 | .268 | 846
  Laplace | .067 | .064 | .015 | .272 | 846
  Normal | .054 | .053 | .000 | .448 | 2,262
  Uniform | .056 | .056 | .037 | .068 | 48
Article source
  Dissertations | .059 | .057 | .003 | .448 | 3,178
  Journal | .058 | .056 | .000 | .310 | 680
  Conference paper | .059 | .057 | .033 | .100 | 144
Missing random effect
  No | .055 | .054 | .000 | .448 | 3,252
  Yes | .078 | .070 | .024 | .274 | 750
Within-cluster sample size
  3 | .051 | .050 | .047 | .057 | 24
  4 | .055 | .054 | .044 | .078 | 72
  5 | .057 | .056 | .040 | .082 | 92
  6 | .066 | .063 | .023 | .274 | 1,221
  7 | .058 | .056 | .049 | .071 | 32
  8 | .067 | .062 | .015 | .272 | 1,205
  9 | .060 | .062 | .038 | .086 | 66
  12 | .060 | .060 | .034 | .100 | 96
  18 | .054 | .053 | .003 | .099 | 20
  20 | .078 | .079 | .000 | .310 | 40
  30 | .054 | .052 | .037 | .091 | 124
  45 | .049 | .049 | .003 | .448 | 864
  50 | .054 | .055 | .036 | .086 | 60
  62 | .053 | .053 | .046 | .056 | 14

Note. T1 = type I error rate; Med = median; Min = minimum; Max = maximum; K = number of effect sizes; Intercept = regression terms modeling the intercept; Time = regression terms modeling time; Within = regression terms of another within-cluster variable.
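The omnibus and moderator Q tests referred to in this section come from the standard meta-analytic machinery: an inverse-variance weighted mean of the effect sizes and Cochran's Q, a weighted sum of squared deviations that is chi-square distributed under homogeneity. The article fit these models with metafor; the plain-Python sketch below only illustrates the arithmetic:

```python
def weighted_mean_and_q(effects, variances):
    """Inverse-variance weighted mean and Cochran's Q heterogeneity
    statistic; under homogeneity Q ~ chi-square with k - 1 df."""
    weights = [1.0 / v for v in variances]
    wsum = sum(weights)
    mean = sum(w * y for w, y in zip(weights, effects)) / wsum
    q = sum(w * (y - mean) ** 2 for w, y in zip(weights, effects))
    return mean, q
```

A significant Q, as reported above for both the fixed and random effect rates, motivates adding the coded simulation conditions as moderators in a meta-regression.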
Figure 2. Density plot of weighted empirical type I error rates by missing a RE and fixed effect term. Note. RE = random effect.

Table 3. Fixed Effect Meta-Regression Results for Significant Predictors in the Transformed (β̂_t) and Type I Error Rate (β̂_π) Metrics.

Variable | β̂_t | β̂_π | 99% CI | Z
Intercept | 0.224 | 0.048 | [0.019, 0.090] | 6.296
w/ sample size | −0.025 | −0.000 | [−0.000, −0.000] | −3.629
Fit SC UN | 0.051 | 0.002 | [0.001, 0.002] | 24.716
Time: Miss RE | 0.171 | 0.028 | [0.025, 0.031] | 41.378
Fit SC ARMA: Miss RE | −0.024 | −0.000 | [−0.000, −0.000] | −5.439
Within: Fit SC ARMA | 0.028 | 0.000 | [0.000, 0.000] | 8.097
Within: Fit SC MA2 | 0.035 | 0.000 | [0.000, 0.001] | 9.473
Time: Fit SC AR1: Miss RE | −0.119 | −0.000 | [−0.000, −0.000] | −19.527
Within: Fit SC AR1: Miss RE | 0.027 | 0.001 | [0.003, 0.001] | 3.438
Time: Fit SC ARMA: Miss RE | −0.115 | −0.000 | [−0.000, −0.000] | −18.905
Within: Fit SC ARMA: Miss RE | 0.028 | 0.001 | [0.002, 0.001] | 3.596
Time: Fit SC MA1: Miss RE | −0.084 | −0.000 | [−0.000, −0.000] | −12.953
Time: Fit SC MA2: Miss RE | −0.105 | −0.000 | [−0.000, −0.000] | −16.131
Within: Fit SC MA2: Miss RE | 0.027 | 0.002 | [0.004, 0.001] | 3.352

Note. Reference groups were dissertations, normal random effect distributions, intercept fixed effect terms, and independent fitted and generated serial correlation structures. CI = confidence interval; RE = random effect; Fit = fitted; SC = serial correlation; ARMA = autoregressive moving average; MA = moving average; Miss = missing; w/ = within; colon (:) = an interaction.

Figure 3. Interaction plot showing the three-variable interaction between fixed effect term, fitted serial correlation structure (Toeplitz and unstructured omitted from figure), and misspecification of the random effect structure. Note. AR = autoregressive; ARMA = autoregressive moving average; MA = moving average; RE = random effect.
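Table 3 reports estimates both in a transformed metric and back-transformed to the type I error rate metric. Given that the reference list includes Freeman and Tukey (1950) and Miller (1978), the transformation was presumably the Freeman-Tukey double arcsine with Miller's inverse; the sketch below assumes that parameterization (matching metafor's "PFT" measure) and is illustrative only:

```python
import math


def ft_double_arcsine(x, n):
    """Freeman-Tukey (1950) double arcsine transform of x events in n
    trials (the half-angle parameterization used by metafor's 'PFT')."""
    return 0.5 * (math.asin(math.sqrt(x / (n + 1))) +
                  math.asin(math.sqrt((x + 1) / (n + 1))))


def ft_variance(n):
    """Approximate sampling variance of the transformed proportion."""
    return 1.0 / (4 * n + 2)


def ft_inverse(t, n):
    """Miller's (1978) inverse of the double arcsine transform."""
    s = math.sin(2 * t)
    return 0.5 * (1 - math.copysign(1, math.cos(2 * t)) *
                  math.sqrt(1 - (s + (s - 1 / s) / n) ** 2))
```

The variance depends only on n, which is why the number of replications drives the weight each simulated rate receives in the meta-regression.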
However, there was evidence of inflation, on average, when the random effect structure was misspecified: an increase of 0.028 in the empirical type I error rate metric. This results in an estimated empirical type I error rate of 0.076, an approximately 50% increase in the empirical type I error rate metric.

There is evidence that including some form of serial correlation does improve the empirical type I error rate but does not fully overcome the inflation due to misspecification of the random effect structure. This can be seen from the negative coefficients for the time by fitted serial correlation by misspecification of the random effect three-way interaction. Estimates for these terms are not as large as the time by misspecification of the random effect two-way interaction, suggesting an adjustment but not a full correction of the inflation. The first-order autoregressive (AR(1)) and first-order autoregressive moving average (ARMA(1, 1)) structures seem to do the best job of correcting the inflation, as shown in Figure 3. The figure shows that, for the fixed effects associated with time under a misspecified random effect structure, the empirical type I error rate is inflated. The effect is particularly large when the fitted serial correlation structure is independent (i.e., assuming no serial correlation underlies the data). Including some form of serial correlation does, however, help to limit the strong inflation.

Other within-individual variables behave very similarly to the intercept term. There are significant terms labeled as "Within" in Table 3, but many of these have very small estimates, suggesting terms of little practical significance.

POM. Results for the final POM can be seen in Figure 4 for covariates of primary interest. This plot depicts an interaction between the generated and fitted serial correlation structures, the model term, whether the random structure was misspecified, and the two cumulative logits, shown with solid lines (L1) and dashed lines (L2). As can be seen for the intercept term (leftmost panel), the second cumulative logit is very close to 1, indicating that very few of the observations were in the third group. Furthermore, the probability of being in Group 1 is also very small, indicating that fixed effect terms associated with the intercept tend to have the nominal rate enclosed in the 95% CI, indicating adequate type I error rate coverage.

Figure 4. Interaction plot showing results of the partial proportional odds model by the fitted serial correlation structure, generated serial correlation structure, fixed effect term, and misspecification of the random effect structure. Note. P(Y ≤ 1) represents the probability of being a conservative test (a level of significance less than .05) compared with accurate or liberal tests (a level of significance greater than .05). P(Y ≤ 2) represents the probability of being a conservative or accurate test compared with liberal tests. AR = autoregressive; ARMA = autoregressive moving average; MA = moving average; RE = random effect.

Table 4. Article Summary Information of Unweighted Empirical Type I Error Rates for the Random Effects and Other Monte Carlo Conditions.

Author | Source | K | Repl | Avg T1 | Med T1 | Min T1 | Max T1 | RE | Range CS | Range wCS
Black (2011) | Jour | 20 | 1,000 | 0.153 | 0.080 | 0.000 | 0.860 | 2 | (50, 50) | (20, 20)
Browne (2000) | Jour | 20 | 930 | 0.103 | 0.091 | 0.032 | 0.186 | 2 | (12, 48) | (18, 18)
Browne (2002) | Jour | 11 | 1,000 | 0.057 | 0.059 | 0.040 | 0.075 | 3 | (65, 65) | (62, 62)
Delpish (2006) | Diss | 48 | 500 | 0.100 | 0.052 | 0.045 | 0.400 | 2 | (30, 100) | (30, 30)
Kwon (2011) | Diss | 162 | 250 | 0.043 | 0.048 | 0.000 | 0.132 | 2 | (40, 160) | (45, 45)
Kwon (2011) | Diss | 162 | 250 | 0.043 | 0.048 | 0.000 | 0.124 | 2 | (40, 160) | (45, 45)
Maas (2004) | Jour | 108 | 1,000 | 0.097 | 0.051 | 0.005 | 0.429 | 2 | (30, 100) | (5, 50)
Maas (2006) | Jour | 81 | 1,000 | 0.067 | 0.064 | 0.035 | 0.116 | 2 | (30, 100) | (5, 50)
Total | — | 621 | 561 | 0.066 | 0.052 | 0.000 | 0.860 | — | — | —

Note.
K = number of effect sizes; Repl = replications; T1 = type I error rate; Med = median; Min = minimum; Max = maximum; RE = random effects; CS = cluster size; wCS = within-cluster size; Jour = journal article; Diss = dissertation.

The middle panel in Figure 4 shows the model results for fixed effects associated with the linear slope. As can be seen from the figure, there is an interaction effect where misspecification of the random effect structure significantly decreases the probability of being in Group 1 or 2. This suggests that the probability of having an inflated type I error rate is larger than .75 in many cases. When all the random effect terms are correctly modeled, the probability of having an inflated type I error rate decreases significantly, as shown in the bottom plot of the middle panel.

Finally, the rightmost panel in Figure 4 shows the effect for terms associated with other Level 1 slopes, not including the intercept or linear slope for time. Compared with the intercept, the probability of being in both Groups 1 and 2 is slightly smaller, suggesting that the probability of being in Group 3 is higher than for the intercept and the linear slope when the random effect structure is correctly modeled. Throughout all the conditions, the AR(1) fitted serial correlation structure did increase the probability of being in Group 1 or 2, especially for the linear slope and other within slopes (middle and right panels of Figure 4). The independent generated serial correlation structures also provided the highest probability of being in Group 1 or 2, a situation that is not surprising given that these are cases where serial correlation is not present in the data, and the random effects should adequately represent the dependency due to repeated measures.

The weighted average empirical type I error rate for the random effects was 0.066, with a 95% CI = [0.048, 0.087]. The omnibus Q test was also significant, Q(491) = 14,891, p < .001, suggesting that there is significant variation that may be explained by the coded study conditions. The estimate of the between-study variation was 0.003.

Inferential models. With the significant Q test, a meta-regression was performed to attempt to explain the significant variation in the empirical random effect type I error rates. Significant predictors at the .01 level are shown in Table 5. The conditional model explained significant variation, as shown by the significant chi-square moderator test, Q_M(9) = 2,633, p < .001, R²_Meta = .759. Table 5 shows the weighted empirical type I error rate was 0.106 for the reference condition, being those with average sample size, dissertations, two random and fixed effects, and the variance of the within-cluster residuals. The other two significant predictors shown in the table are adjustments for the variances of the random effect terms. Specifically, the empirical type I error rate of the variance of the random intercepts tends to be slightly inflated compared with the within-cluster residuals. In contrast, the empirical type I error rate of the variance of the random slopes tends to be smaller (i.e., less biased) compared with the within-cluster residuals.

Table 5. Random Effect Meta-Regression Results for Significant Predictors in the Transformed (β̂) and Type I Error Rate (β̂_π) Metrics.

Variable | β̂ | β̂_π | 99% CI | Z
Intercept | 0.334 | 0.106 | [0.055, 0.172] | 8.246
Var intercept | 0.067 | 0.003 | [0.002, 0.004] | 24.168
Var slope | −0.055 | −0.000 | [−0.000, −0.000] | −19.909
Number FE 6 | −0.134 | −0.000 | [−0.000, −0.000] | −2.806
Number FE 10 | −0.134 | −0.000 | [−0.000, −0.000] | −2.796

Note. Reference groups were dissertations, two fixed and random effects, and within-cluster residuals. CI = confidence interval; FE = fixed effect.
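The serial correlation structures compared here differ mainly in how fast within-cluster dependence decays with the time lag: AR(1) decays geometrically, whereas an MA(1)-type band is nonzero only at lag 1, one plausible reason the AR(1) and ARMA(1, 1) structures track longitudinal data better. A small illustration of the implied correlation matrices (not code from any of the coded studies):

```python
def ar1_corr(n_times, rho):
    """AR(1) correlation matrix: corr(t_i, t_j) = rho ** |i - j|,
    so dependence decays geometrically as the time lag grows."""
    return [[rho ** abs(i - j) for j in range(n_times)]
            for i in range(n_times)]


def ma1_corr(n_times, theta):
    """MA(1)-type banded correlation: nonzero only at lag 1 (value
    theta), which cuts off dependence entirely beyond adjacent waves."""
    return [[1.0 if i == j else (theta if abs(i - j) == 1 else 0.0)
             for j in range(n_times)] for i in range(n_times)]
```

For six equally spaced waves with rho = 0.5, the AR(1) matrix still carries correlation 0.03 at lag 5, while the MA(1) band is exactly zero past lag 1.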
Of note, the cluster sample size was also significant; however, this term was not reported in the table because the parameter estimate was −0.000 (in the transformed scale) and was not deemed significant in practice.

Finally, Figure 4 also depicts, across all conditions, that the probability of being in Group 1 (type I error rate smaller than the nominal rate) is not common. Liberal tests are a much larger concern than conservative significance tests, particularly when serial correlation is present in the underlying data.

Random Effects

The unweighted empirical type I error rates for the random effects can be seen in Table 4. Of the 13 articles coded, only seven reported empirical type I error rates for the random effects, resulting in a total of 621 effect sizes. On average, the unweighted empirical type I error rate was 0.066, with evidence of positive skew and large variation as shown by the minimum and maximum values. Considering the number of replications and variance, the weighted average empirical type I error rate was 0.066.

POM. Results for the POM can be seen in Figure 5 for significant covariates. This plot shows the effect of the number of fixed effects and the random effect term on the probability of the empirical type I error rate being conservative, accurate, or liberal. In this figure, the solid lines represent the probability of a conservative test, the dashed lines represent the probability of an accurate or conservative test, and one minus the probability of the dashed lines represents liberal tests. Increasing the number of fixed effects significantly increases the probability of being in Group 1 or 2 compared with Group 3, suggesting that these tests are more likely to be accurate or conservative rather than liberal, with the probability being very close to 1. However, with only a couple of fixed effect terms in the model, the probability of a liberal test is much larger, as likely as .7 for the significance test associated with the variance of the intercept.

Figure 5. Interaction plot showing results of the partial proportional odds model by the random effect term and number of fixed effects. Note. P(Y ≤ 1) represents the probability of being a conservative test (a level of significance less than .05) compared with accurate or liberal tests (a level of significance greater than .05). P(Y ≤ 2) represents the probability of being a conservative or accurate test compared with liberal tests.

The within-cluster residuals (labeled as Res) and the variance of the intercept tend to be accurate tests (most likely to be in Group 2), particularly when more fixed effects are included in the model. In contrast, there is a rather large probability (about .90) of the empirical type I error rate being conservative for the variance of the slope when more than six fixed effects are included in the model, as shown in Figure 5.

Discussion

The purpose of this meta-analysis was to help improve the external validity of MC LMM studies, better understand gaps in simulation conditions, and inform applied researchers of assumption violations that can affect study results. A meta-analysis of the empirical type I error rates from MC studies was performed, and the manipulated and nonmanipulated study conditions were combined to empirically understand which are most important.

There are a handful of takeaway messages from this meta-analysis. First, not modeling a random effect when that term underlies the data leads to inflated type I error rates. For example, in the study by LeBeau (2012), the data were generated with a random effect for time, but this random effect was omitted during model fitting. This leads to inflated type I error rates for the fixed effect parameters associated with that random effect. This inflation of the type I error rate could lead researchers to reject true null hypotheses more often than one would expect given their specified α level. Given that the level of significance is one aspect researchers have direct control over, the disconnect between the empirical value and the one specified is problematic for applied researchers.
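The first takeaway, that omitting a random effect which underlies the data inflates the type I error rate of the associated fixed effect, is easy to reproduce in a toy simulation. The sketch below exaggerates the misspecification by fitting a pooled OLS regression that ignores clustering entirely; all settings are illustrative assumptions, not conditions from the coded studies:

```python
import math
import random


def simulate_rejection_rate(n_clusters=25, n_times=6, slope_sd=0.5,
                            reps=500, seed=1):
    """Empirical type I error rate for the fixed effect of time (true
    slope = 0) when cluster-specific random slopes are ignored by a
    pooled OLS fit of y on time."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xs, ys = [], []
        for _cluster in range(n_clusters):
            b1 = rng.gauss(0, slope_sd)  # omitted random slope for time
            for t in range(n_times):
                xs.append(float(t))
                ys.append(b1 * t + rng.gauss(0, 1))
        n = len(xs)
        xm, ym = sum(xs) / n, sum(ys) / n
        sxx = sum((x - xm) ** 2 for x in xs)
        beta = sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / sxx
        sse = sum((y - ym - beta * (x - xm)) ** 2
                  for x, y in zip(xs, ys))
        se = math.sqrt(sse / (n - 2) / sxx)
        rejections += abs(beta / se) > 1.96  # z test of the time slope
    return rejections / reps
```

With these settings the rejection rate for the time slope lands well above the nominal .05, while setting slope_sd=0, so there is no random slope to omit, returns it to roughly the nominal level.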
The problem of misspecification of the random effect structure in this way may occur most often when models fail to converge. In this situation, fixing a random effect to zero can drastically improve the convergence rate and may be a step taken by applied researchers. Understanding ways to overcome this inflation, or better ways of detecting serial correlation, would be meaningful for the literature.

Fitting a serial correlation structure has the effect of helping to correct the inflation in type I error rates when a random effect is missing. However, fitting the serial correlation structure does not fully recover the type I error rate to nominal levels. In addition, by looking at the parameter estimates and Figure 3, the AR(1) and ARMA(1, 1) structures seem to do a better job of reducing the type I error rate than the MA(1) or MA(2) serial correlation structures (−.013 and −.012 vs. −.0006 and −.010, respectively). This may be due to the AR(1) and ARMA(1, 1) structures better representing longitudinal data structures (i.e., correlation decreasing between measurement occasions as the time lag increases), whereas the MA structures may better align with other nested data structures.

No relationship was found between the random effect distribution and the type I error rate for the fixed effects, which has been found previously in the literature (LeBeau, 2013; Maas & Hox, 2004, 2005). Similarly, the generated serial correlation structure does not have an impact on the type I error rates of the fixed effects. This suggests that the random effects do an adequate job of accounting for the dependency due to repeated measurements, regardless of the type of serial correlation structure underlying the data.

Finally, the small effects related to the sample sizes are an interesting finding. With maximum likelihood being an asymptotic estimation method, the larger sample sizes may have provided better estimates. The small effect helps to inform researchers with relatively small sample sizes that the type I error rate can be held in check with few observations and relatively few clusters (12 clusters was the smallest condition in the current meta-analysis). However, sample size may play an important role in the estimates of the variances of the random components and would be worthy of further study.

Informing Applied Researchers

There are three recommendations for applied researchers with respect to empirical type I error rates for the LMM. First, results suggest that conservative tests are less common with the LMM for the fixed effect terms (see Figure 4); instead, significance tests are much more likely to be liberal. One aspect to be particularly careful with is when the random effect for time is not included in the final model. This could be a case where the random effect structure is misspecified, and severe inflation of the empirical type I error rate may be present for fixed effects associated with the linear slope. In these situations, exploring variation in the trajectories of individuals would be helpful to see whether there is variation in the data not being captured by the LMM. If there is evidence of misspecification of the random effect structure and the inclusion of the random effect for the linear slope leads to nonconvergence, using robust standard errors may be a way to alleviate elevated empirical type I error rates.

A second recommendation is to check for the presence of serial correlation in the data, especially when the repeated measurements are measured close in time. Adding serial correlation to the LMM can help to statistically adjust for a potential source of random effect misspecification and help alleviate severe empirical type I error rate inflation. This meta-analysis also found that simpler serial correlation structures (AR(1)) perform just as well as more complicated structures (ARMA(1, 1)), as shown in Figure 3 and Figure 4.

Finally, statistical tests for the random effects also show evidence of being too liberal on average (see Table 5), particularly with few fixed effects included in the model (see Figure 5). This can be problematic if applied researchers are counting on these tests for inclusion of random effects in the LMM. However, as shown by the fixed effect results, it is extremely problematic to underspecify the random effect structure; therefore, including more random effects than needed is likely less problematic as long as the model still converges. Research into this would be worthy of further study.

Informing Future MC Studies

This meta-analysis can help inform additional areas of study. First, none of the MC studies allowed for variables to vary besides the initial status and linear trend. The implications for the empirical type I error rates when the trajectory of additional variables is allowed to vary between clusters would be interesting to explore, especially in relation to cases when the random structure is misspecified.

Second, the number of fixed effects in the MC studies coded was homogeneous and was not significant in the final inferential model. As can be seen in Table 1, most studies had between four and six fixed effects, and only a single study included more, with 10 fixed effects. Many applied studies using the LMM tend to have more than four to six fixed effects; for example, Harwell, Post, Medhanie, Dupuis, and LeBeau (2013) included 33 covariates in a three-level longitudinal LMM. Better understanding the implications of many fixed effects in relation to assumption violations and small sample size conditions would be a welcome addition to the MC literature. Relatedly, little attention in the MC LMM literature has been given to three-level models, and these would be worthy of more attention.
Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

References marked with an asterisk indicate studies included in the meta-analysis.

Aloe, A. M., Becker, B. J., & Pigott, T. D. (2010). An alternative to R2 for assessing linear models of effect size. Research Synthesis Methods, 1, 272-283.
*Black, A. C., Harel, O., & McCoach, D. B. (2011). Missing data techniques for multilevel data: Implications of model misspecification. Journal of Applied Statistics, 38, 1845-1865.
*Browne, W., & Draper, D. (2000). Implementation and performance issues in the Bayesian and likelihood fitting of multilevel models. Computational Statistics, 15, 391-420.
*Browne, W., Draper, D., Goldstein, H., & Rasbash, J. (2002). Bayesian and likelihood methods for fitting multilevel models with complex level-1 variation. Computational Statistics & Data Analysis, 39, 203-225.
Browne, W., & Goldstein, H. (2010). MCMC sampling for a multilevel model with nonindependent residuals within and between cluster units. Journal of Educational and Behavioral Statistics, 35, 453-473.
Cooper, H., Hedges, L. V., & Valentine, J. C. (2009). The handbook of research synthesis and meta-analysis. New York, NY: Russell Sage Foundation.
*Delpish, A. (2006). Comparison of estimators in hierarchical linear modeling: Restricted maximum likelihood versus bootstrap via minimum norm quadratic unbiased estimators (Doctoral dissertation). Florida State University, Tallahassee.
Diggle, P., Heagerty, P., Liang, K.-Y., & Zeger, S. (2002). Analysis of longitudinal data. Oxford, UK: Oxford University Press.
*Ferron, J., Dailey, R., & Yi, Q. (2002). Effects of misspecifying the first-level error structure in two-level models of change. Multivariate Behavioral Research, 37, 379-403.
Fitzmaurice, G., Laird, N., & Ware, J. (2004). Applied longitudinal analysis. Hoboken, NJ: Wiley-IEEE.
Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and the square root. The Annals of Mathematical Statistics, 21, 607-611.
Goldstein, H. (2010). Multilevel statistical models. West Sussex, UK: Wiley.
Harwell, M. R., Post, T. R., Medhanie, A., Dupuis, D. N., & LeBeau, B. (2013). A multi-institutional study of high school mathematics curricula and college mathematics achievement and course taking. Journal for Research in Mathematics Education, 44, 742-774.
Harwell, M. R., Rubinstein, E. N., Hayes, W. S., & Olds, C. C. (1992). Summarizing Monte Carlo results in methodological research: The one- and two-factor fixed effects ANOVA cases. Journal of Educational Statistics, 17, 315-339.
Hoaglin, D. C., & Andrews, D. F. (1975). The reporting of computation-based results in statistics. The American Statistician, 29, 122-126.
Kwok, O., West, S., & Green, S. (2007). The impact of misspecifying the within-subject covariance structure in multiwave longitudinal multilevel models: A Monte Carlo study. Multivariate Behavioral Research, 42, 557-592.
*Kwon, H. (2011). A Monte Carlo study of missing data treatments for an incomplete level-2 variable in hierarchical linear models (Doctoral dissertation). The Ohio State University, Columbus.
Laird, N., & Ware, J. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963-974.
*LeBeau, B. (2012). The impact of ignoring serial correlation in longitudinal data analysis with the linear mixed model: A Monte Carlo study (Doctoral dissertation). University of Minnesota, Twin Cities.
*LeBeau, B. (2013, April). Impact of non-normal level one and two residuals on the linear mixed model. Paper presented at the American Educational Research Association Conference, San Francisco, CA.
*Maas, C., & Hox, J. (2004). The influence of violations of assumptions on multilevel parameter estimates and their standard errors. Computational Statistics & Data Analysis, 46, 427-440.
*Maas, C., & Hox, J. (2005). Sufficient sample sizes for multilevel modeling. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 1(3), 86-92.
*Mallinckrodt, C. H., Clark, W. S., & David, S. R. (2001). Type I error rates from mixed effects model repeated measures versus fixed effects ANOVA with missing values imputed via last observation carried forward. Drug Information Journal, 35, 1215-1225.
Miller, J. J. (1978). The inverse of the Freeman-Tukey double arcsine transformation. The American Statistician, 32, 138.
*Murphy, D., & Pituch, K. (2009). The performance of multilevel growth curve models under an autoregressive moving average process. The Journal of Experimental Education, 77, 255-284.
*Overall, J. E., & Tonidandel, S. (2010). The case for use of simple difference scores to test the significance of differences in mean rates of change in controlled repeated measurements designs. Multivariate Behavioral Research, 45, 806-827.
Paxton, P., Curran, P. J., Bollen, K. A., Kirby, J., & Chen, F. (2001). Monte Carlo experiments: Design and implementation. Structural Equation Modeling, 8, 287-312.
R Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Available from https://www.R-project.org/
Raudenbush, S. W. (2009). Analyzing effect sizes: Random-effects models. The Handbook of Research Synthesis and Meta-Analysis, 2, 295-316.
Raudenbush, S. W., & Bryk, A. (2002). Hierarchical linear models: Applications and data analysis methods. Thousand Oaks, CA: Sage.
Rubinstein, R. Y., & Kroese, D. P. (2016). Simulation and the Monte Carlo method (Vol. 10). Hoboken, NJ: John Wiley.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. Oxford, UK: Oxford University Press.
Skrondal, A. (2000). Design and analysis of Monte Carlo experiments: Attacking the conventional wisdom. Multivariate Behavioral Research, 35, 137-167.
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1-48. Retrieved from http://www.jstatsoft.org/v36/i03/
Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. New York, NY: Springer.
Yee, T. W. (2010). The VGAM package for categorical data analysis. Journal of Statistical Software, 32(10), 1-34.
Yee, T. W. (2015). Vector generalized linear and additive models: With an implementation in R. New York, NY: Springer.

Author Biographies

Brandon LeBeau is an assistant professor of Educational Statistics and Measurement at the University of Iowa. His research interests include longitudinal data, reproducible workflows, and research software development.

Yoon Ah Song is a graduate student at the University of Iowa in the Educational Measurement and Statistics program. She is interested in item response theory and linear mixed models.

Wei Cheng Liu is a graduate of the Educational Measurement and Statistics program at the University of Iowa. He is interested in program evaluation.

Model Misspecification and Assumption Violations With the Linear Mixed Model: A Meta-Analysis

SAGE Open, Volume 8 (4): 1 – Dec 22, 2018

Publisher
SAGE
Copyright
Copyright © 2022 by SAGE Publications Inc, unless otherwise noted. Manuscript content on this site is licensed under Creative Commons Licenses.
ISSN
2158-2440
eISSN
2158-2440
DOI
10.1177/2158244018820380

Abstract

This meta-analysis attempts to synthesize the Monte Carlo (MC) literature for the linear mixed model under a longitudinal framework. The meta-analysis aims to inform researchers about conditions that are important to consider when evaluating model assumptions and adequacy. In addition, the meta-analysis may be helpful to those wishing to design future MC simulations in identifying simulation conditions. The current meta-analysis will use the empirical type I error rate as the effect size, and MC simulation conditions will be coded to serve as moderator variables. The type I error rate for the fixed and random effects will be explored as the primary dependent variable. Effect sizes were coded from 13 studies, resulting in a total of 4,002 and 621 effect sizes for fixed and random effects, respectively. Meta-regression and proportional odds models were used to explore variation in the empirical type I error rate effect sizes. Implications for applied researchers and researchers planning new MC studies will be explored.

Keywords: linear mixed model, longitudinal data, type I error rate, meta-analysis

Introduction

The linear mixed model (LMM), also commonly referred to as a multilevel model (Goldstein, 2010) or hierarchical linear model (Raudenbush & Bryk, 2002), is an extension of the multiple regression model to account for cluster dependency arising from nested designs. Included within nested designs are longitudinal designs where repeated measurements are nested within an individual. This data setup will serve as the primary focus of this article. This series of models was first introduced in the early 1980s (Laird & Ware, 1982), and the rapid improvement in computational power has helped these models become a popular data analysis method for researchers.

The LMM takes the following general matrix form:

Y_j = X_j β + Z_j b_j + e_j.   (1)

This is very similar to the multiple regression model, except now there are additional terms, Z_j b_j. b_j are random effects, which serve as additional residual terms and represent cluster-specific deviations from the average growth curve, and Z_j represents the design matrix for the random effects. The rest of the terms in the model are identical to a multiple regression, where Y_j represents the dependent variable for cluster j, X_j is the design matrix of covariates, β is a vector of fixed effects, and e_j is a vector of within-cluster residuals (i.e., residuals for every observation).

The random components of the LMM (i.e., b_j and e_j) are commonly assumed to be identically and independently normally distributed with means of zero and a specified variance matrix. These common assumptions can be summed up as follows: b_j ∼ iid N(0, G) and e_j ∼ iid N(0, σ²I) (Raudenbush & Bryk, 2002). In addition, the independence assumption of the within-cluster residuals (i.e., e_j) is conditional on the random effects specified in the model (Browne & Goldstein, 2010). These random effects are what account for the dependency due to repeated measures, although there is the ability to allow the within-cluster residuals to be correlated due to the time lag in repeated measurements, a phenomenon called serial correlation (Diggle, Heagerty, Liang, & Zeger, 2002). This serial correlation may be especially important if the time lag between measurement occasions is short (Browne & Goldstein, 2010).

The University of Iowa, Iowa City, USA

Corresponding Author:
Brandon LeBeau, The University of Iowa, 311 Lindquist Center, Iowa City, IA 52242, USA.
Email: brandon-lebeau@uiowa.edu

Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (http://www.creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).
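The growth-model setup in Equation 1 can be made concrete with a small simulation. The sketch below is illustrative only and not the authors' code (the original study's analyses were done in R); all parameter values and function names here are hypothetical. It generates longitudinal data with a random intercept and random slope for time, with b_j ∼ N(0, G) and e_j ∼ N(0, σ²I) as described above:

```python
import numpy as np

def simulate_lmm_cluster(n_occasions, beta, G, sigma2, rng):
    """Simulate one cluster (individual) from Equation 1:
    Y_j = X_j beta + Z_j b_j + e_j, with b_j ~ N(0, G), e_j ~ N(0, sigma2 * I).
    A random-intercept, random-slope growth model is assumed for illustration."""
    time = np.arange(n_occasions, dtype=float)
    X = np.column_stack([np.ones(n_occasions), time])  # fixed-effects design: intercept + time
    Z = X.copy()                                       # random effects on intercept and slope
    b = rng.multivariate_normal(np.zeros(G.shape[0]), G)  # cluster-specific deviations
    e = rng.normal(0.0, np.sqrt(sigma2), n_occasions)     # within-cluster residuals
    return X @ beta + Z @ b + e

rng = np.random.default_rng(1)
beta = np.array([10.0, 0.5])            # hypothetical average intercept and growth rate
G = np.array([[2.0, 0.1], [0.1, 0.3]])  # hypothetical random-effect covariance matrix
y = np.array([simulate_lmm_cluster(6, beta, G, sigma2=1.0, rng=rng) for _ in range(100)])
print(y.shape)  # (100, 6): 100 clusters with 6 repeated measures each
```

Here Z_j reuses the intercept-and-time columns of X_j, which is the common growth-model case the text describes; a nonnormal distribution for b_j or a serially correlated e_j would be substituted at the two draws marked above to mimic the assumption violations studied in the MC literature.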
The statistical assumptions for the LMM are difficult to assess analytically due to the computationally intensive and iterative procedure of obtaining model estimates. In addition, when normality of the random components is not assumed, the mathematics becomes increasingly more difficult and intractable. As a result, Monte Carlo (MC) methods have been used to explore the relationship between assumption violations and model performance (Skrondal, 2000). MC studies have the advantage of strong internal validity due to the researcher directly manipulating the conditions of interest. The direct manipulation is not unlike true experiments, where the researcher can isolate the source of problems when estimating parameters (Skrondal, 2000).

The major drawback in MC studies is the potential lack of external validity (Skrondal, 2000). This weakness stems from the MC results being conditional on the conditions chosen to study. For example, if researchers conducting an MC study only simulate the random effects coming from a normal or chi-square(1) distribution, the question must be asked: can the study results be generalized beyond those two distributions? Hoaglin and Andrews (1975) started a discussion of best practices when reporting and conducting MC studies. More recently, Paxton, Curran, Bollen, Kirby, and Chen (2001) and Skrondal (2000) have offered design considerations to improve external validity. Much of the recommendations surround reducing the number of replications to increase the coverage of the simulation conditions (Skrondal, 2000).

Although the papers by Paxton et al. (2001) and Skrondal (2000) offer suggestions for improving new MC studies, the design considerations from past MC studies cannot be altered to improve the external validity. As such, the current study aims to leverage prior MC studies to help improve the external validity of these studies, better understand gaps in simulation conditions, and succinctly inform applied researchers of assumption violations that can greatly affect study results. This article aims to accomplish these three goals by quantitatively synthesizing the MC longitudinal LMM literature with a meta-analysis. The meta-analysis allows for the pooling of study conditions across numerous MC studies to increase sample size and depth of coverage of simulation conditions. In addition, many MC studies only report descriptive statistics for the study results and may miss complex interaction effects found through inferential modeling. A meta-regression was performed to overcome this limitation of some of the MC literature. A similar study for the one- and two-factor ANOVA models was done by Harwell, Rubinstein, Hayes, and Olds (1992).

Common Simulation Conditions With the LMM

MC studies exploring assumption violations with the LMM have focused primarily on the impact of nonnormal random effect distributions (LeBeau, 2012, 2013; Maas & Hox, 2004, 2005), the effect of serial correlation (Browne, Draper, Goldstein, & Rasbash, 2002; Ferron, Dailey, & Yi, 2002; Kwok, West, & Green, 2007; LeBeau, 2012; Murphy & Pituch, 2009), missing data (Black, Harel, & McCoach, 2011; Kwon, 2011; Mallinckrodt, Clark, & David, 2001), and estimation method (Delpish, 2006; Overall & Tonidandel, 2010). The random effect distributions simulated tend to be normally distributed, skewed (such as a chi-square distribution), or have heavy tails (such as a t or Laplace distribution). The serial correlation structures also tend to fall into three categories: independent structures, autoregressive (AR) type structures, or banded structures (such as the moving average models).

Most MC studies include sample size as a simulation condition, where, surprisingly, there is little variation in sample sizes used. The number of repeated measures is commonly less than 10 and the number of clusters rarely is larger than 100. The choice of sample size conditions is likely a function of using maximum likelihood for estimation. Maximum likelihood is an asymptotic estimation method; as such, understanding how the estimation method behaves for small samples is informative, and increasing sample size would likely improve estimation.

The number of fixed and random effects is another aspect of the simulation design that is chosen by the researcher. Unfortunately, these are two design choices that are commonly not manipulated directly. Instead, the number of fixed effects and random effects are held constant across studies. In addition, there is even less variation in the number of fixed and random effects chosen by researchers. It is uncommon for the number of fixed effects to be larger than six and the number of random effects to be larger than two.

Assumption Violations With the LMM

Results of the MC studies with assumption violations can be grouped into three categories. Estimation of the fixed effects tends to be unbiased regardless of the assumption violations (e.g., Kwok et al., 2007; LeBeau, 2013; Maas & Hox, 2004; Murphy & Pituch, 2009). This has been shown with nonnormal random effect distributions (LeBeau, 2013; Maas & Hox, 2004), different sample sizes (LeBeau, 2013; Maas & Hox, 2004, 2005), with the presence of serial correlation (Kwok et al., 2007; LeBeau, 2012; Murphy & Pituch, 2009), and with different estimation algorithms (Delpish, 2006). Therefore, if the researcher is solely interested in estimates of the fixed effects, then little care to assumption violations is needed.

However, if the researcher is interested in estimates of the random effects or inference with the LMM, then researchers need to pay specific attention to assumption violations. Nonnormal random effects can inflate estimates of the random effects, especially in small sample size conditions (LeBeau, 2013; Maas & Hox, 2004). In addition, not modeling serial correlation when present can also cause an inflation in the random effects and underestimate the standard errors for the fixed effects (Kwok et al., 2007; LeBeau, 2012; Murphy & Pituch, 2009). This can lead to an inflation in the empirical type I error rate. Finally, misspecifying, specifically underspecifying, the random effect structure can also lead to severe inflation of the empirical type I error rates for fixed effects (LeBeau, 2012).

These results suggest that checking of model assumptions is important when researchers are interested in conducting inference, which likely encompasses most applied researchers. In addition, inflated estimates of the variance of the random components can lead researchers to include predictors to explain variation when, in reality, this variation is smaller than expected. A quantitative synthesis can be informative for applied researchers to show which assumption violations are crucial to achieving valid inferences. The MC literature can also be better informed through the ability of this study to include moderator variables that were not directly manipulated within a study, but were between studies, such as the number of fixed effects.

Research Questions

Based on the prior MC literature, the following research questions were explored in the current meta-analysis:

Research Question 1: Is there evidence that the type I error rate is different from the nominal rate of 0.05 for fixed and random effects?
Research Question 2: To what extent does the independence assumption of the within-cluster residuals affect the empirical type I error rate?
Research Question 3: To what extent does the normality assumption of the random effects affect the empirical type I error rate?
Research Question 4: To what extent do the MC study characteristics moderate the relationships found in the above questions?

Method

Data Collection

Articles, dissertations, conference papers, or unpublished documents were gathered to attempt to answer the research questions above. Documents were selected if they were simulation studies that reported empirical type I error rates for the fixed effects. Only studies with continuous outcome variables were included to keep the comparison consistent. The simulation studies must include longitudinal data conditions, specifically multiple measurement occasions for individuals that are often smaller than cross-sectional models (Singer & Willett, 2003).

Based on the above criteria, the population of studies is defined as all possible MC LMM studies exploring data conditions similar to longitudinal studies using a continuous dependent variable. An initial search was performed in March 2012 and follow-up searches were performed in April 2013 and June 2014. Articles were selected for relevance based on their title. Abstracts from the articles selected by their titles were read to determine their relevance. If the study met the criteria established above, it was set aside to be read and to have information coded from the study.

A Boolean search was used with the ERIC, PsycInfo, and Dissertation Abstracts databases to search for documents to be coded. The Boolean search string took the following form: ("Monte Carlo" or simulation) and ("linear mixed model" or "hierarchical linear model" or "mixed effects" or "mixed-effects" or longitudinal or LMM or HLM or LMEM) and not (generalized or nonlinear or Bayesian or SEM or "structural equation model").

The search identified 223 articles for review. Of those 223 articles, a total of 25 were selected for inclusion and were read further to include in the meta-analysis. There were three primary reasons why studies were excluded: (a) the article did not report empirical type I error rates (or it was not an outcome), (b) the study was cross sectional, or (c) the study was done in the structural equation modeling framework.

After studies were found using the above manner, Google Scholar was used to find articles that cited the articles found above. Footnote chasing was also used by exploring the titles in the reference list of the articles selected. The titles and abstracts of studies identified through Google Scholar were screened for inclusion.

From each of the sampled studies, effect sizes and a number of study characteristics were coded to help with the data analysis step of the meta-analysis. Three independent readers, who had completed PhD training or were in a PhD program related to quantitative methods, completed the coding of the studies included in this meta-analysis. One individual coded all the studies and the other two individuals independently coded approximately half of the studies to evaluate coding consistency. Internal consistency was high on all coded variables, including the primary dependent variable, the empirical type I error rate. For the primary dependent variable, 98% of the values were coded the same across coders. Those that did not match were reviewed by looking at the original study for verification of the correct values.

Data Evaluation

Studies were first checked to ensure that they contained the dependent variable of interest, the empirical type I error rate. If the studies did not contain the empirical type I error rate, or if the study used a nominal type I error rate other than 0.05, the study was not coded. The studies were then checked for methodological/coding flaws. Evidence that the data were generated accurately was explored first. If studies appeared to show inaccuracies when the model assumptions had been met compared with the body of MC literature, they would be excluded from the sample of studies due to severe methodological or coding flaws. No studies were removed due to methodological or coding flaws.

All studies were read one at a time to code the variables of interest. Each MC study contributes many effect sizes to the meta-analysis; however, due to the quasi-random number generation within each study, the effect sizes are assumed to be independent within a study. However, there may be coder or author effects that need to be considered, and this potential dependency was adjusted by using an LMM and is discussed in more detail below.

Accuracy in coding was checked to ensure that all the effect sizes were coded properly. Summary statistics and plots were used to examine the distribution of effect sizes, looking for possible extreme values because of erroneous coding. Very large or small empirical type I error rates were checked against the values published in the manuscripts. After errors were corrected, exploratory and inferential data analyses were used to attempt to explain variation in effect sizes. The coded variables and analyses are described in more detail below.

Dependent Variable

The primary dependent variable in the current meta-analysis was the empirical type I error rate for each condition in the MC studies. The type I error rate is commonly reported as the proportion of tests that reject a true null hypothesis. Statistical theory informs us that the proportion of rejected tests when the null hypothesis is true should be very close to the α value set by the researcher. Deviations in the proportion of tests that reject a true null hypothesis from the α value reflect problems in estimation; as a result, hypothesis tests are too conservative (proportion less than the α) or liberal (proportion greater than the α).

Independent Variables

The primary independent variables were the conditions that the MC studies directly manipulated. Variables commonly manipulated are the cluster sample size (i.e., how many individuals), presence of serial correlation, what kind of serial correlation structure is assumed, and the number of measurement occasions. Other conditions that are commonly not manipulated within an MC study that were coded included how many fixed effects are in the model, the number of random effects in the model, the number of replications within a cell of the study design, estimation method (e.g., full information maximum likelihood, restricted maximum likelihood [REML]), and whether the design was balanced or unbalanced.

The random effects are commonly assumed to follow a normal distribution, and violating this assumption has been studied thoroughly with numerous MC studies. The simulated random effect distribution was coded. In addition to the name of the coded distribution, the theoretical and empirical skewness and kurtosis values were coded for the random effects distribution. These independent variables were used to help determine whether the skewness or kurtosis of the distribution has a larger impact on the type I error rate.

Data Analysis

Exploratory data analyses were used to explore variation in the effect sizes. If significant variation in the empirical type I error rates was found, an LMM was used to see whether any moderator variables explain variation in the type I error rates. The LMM was chosen due to the hierarchical structure of the empirical type I error rates, which is illustrated in Figure 1.

Figure 1. Data structure for model.

The empirical type I error rates are proportions. As a result, the variance is a function of the specific value of the empirical type I error rates, and the sampling distribution is unlikely to be normally distributed. Therefore, the empirical type I error rates were transformed using the Freeman–Tukey transformation (Freeman & Tukey, 1950). This transformation takes the following form:

t_k = (1/2) [ arcsin sqrt( x_k / (n_k + 1) ) + arcsin sqrt( (x_k + 1) / (n_k + 1) ) ],   (2)

where t_k is the transformed proportion, x_k is the number of type I errors, and n_k is the total number of replications. Many articles reported the empirical type I error rate as a proportion, not the number of type I errors for each cell of the MC design. To calculate the transformation, the number of type I errors made in each cell of the design was found by taking x_k = n_k × π_k, where π_k is the empirical type I error rate reported by the study.

The variance of the transformed proportions (t_k) is

v_k = 1 / (4 × (n_k + 2)),   (3)

where v_k is the variance and n_k is the number of replications. Miller (1978) defined a back-transformation to convert back into the raw proportion metric, defined as follows:

π_k = (1/2) [ 1 − sign(cos 2t_k) × sqrt( 1 − ( sin 2t_k + (sin 2t_k − 1 / sin 2t_k) / n_k )² ) ],   (4)

where π_k is the back-transformed empirical type I error rate, t_k is the transformed value from Equation 2, and n_k is the number of replications. Results will be back-transformed to the empirical type I error rate metric for use in figures and tables.

Inferential model. The LMM was fitted with REML as this has been shown to produce less biased estimates of the random components (Fitzmaurice, Laird, & Ware, 2004; Raudenbush, 2009). The LMM took the following general form:

t_k = β_0 + β_1 X_1k + ⋯ + β_t X_tk + b_k + e_k.   (5)

In Equation 5, t_k represents the transformed empirical type I error rate coded from the articles. β_0 is an intercept, and β_1, …, β_t represent the relationship between the predictor variables, X_1k, …, X_tk, and the dependent variable, t_k. Finally, this model contains b_k, which represent study-specific residual terms that are assumed to follow a normal distribution with mean zero and variance τ². The e_k represent sampling errors that are assumed to follow a normal distribution with mean zero and known variance v_k calculated from Equation 3 above. Within a study, the empirical type I error rates were treated as independent due to the quasi-random number generation used by MC studies (Rubinstein & Kroese, 2016). The moderators chosen to be included in the LMM were informed by the exploratory data analysis.

First, an omnibus model with no covariates was used to explore the heterogeneity in the effect sizes. The Q test was used to assess the amount of heterogeneity (Cooper, Hedges, & Valentine, 2009). If this test was significant, covariates were added to attempt to explain variation in the effect sizes with a meta-regression (Cooper et al., 2009). The covariates take the form of simulation conditions that were coded and discussed above. Descriptive analyses helped to inform which covariates were included in the model. Significant predictors were identified when the z value was greater than 2.33 in absolute value, representing a p value less than .01. This level of significance was selected to help control for compounding type I error rates from many tests of predictors in the analysis and to better reflect covariates of practical significance.

To assess the amount of explanatory power of the predictors, an R²_Meta statistic defined by Aloe, Becker, and Pigott (2010) was used. The statistic takes the following form:

R²_Meta = 1 − ( τ̂²_cond / τ̂²_uncond ),   (6)

where τ̂²_cond and τ̂²_uncond represent the conditional and unconditional estimates of the between-study variation.

A similar analysis to that described above was also done using the empirical type I error rate for the random effects. In this analysis, the dependent variable was the empirical type I error rate of the random effects and the independent variables were the simulation conditions described in more detail above. There were fewer studies that studied the empirical type I error rate for the random effects; therefore, many study conditions coded did not have variation and were omitted from the model.

Proportional Odds Models (POMs). POMs (Yee, 2015) were also explored to further attempt to understand variation in the empirical type I error rates for the fixed effects. A new dependent variable was defined that represented three ordinal categories. These ordinal categories represented conservative tests (with a level of significance less than .05), accurate tests, and liberal tests (with a level of significance greater than .05).

To determine which group each observation belonged in, confidence intervals (CIs) were created for the empirical type I error rate reported for each condition. These took the following form:

CI = π_k ± 2 × sqrt( π_k (1 − π_k) / n_k ),   (7)

where π_k represents the empirical type I error rate for the fixed effects and n_k represents the number of replications for each simulation condition coded. If 0.05 was contained by the CI, then the dependent variable was coded as 1. If 0.05 was greater than the upper bound of the CI, the dependent variable took a value of 0 (an example of a conservative test); if it was less than the lower bound of the CI, the dependent variable took a value of 2 (an example of a liberal test). There was a total of 162 (4%) that were below the lower bound of the CI, 2,897 (72%) that were within the confidence band, and 943 (24%) that were above the confidence band.

This variable was then modeled using a POM that had three categories. This model took the following general form:

log[ P(Y ≤ c) / (1 − P(Y ≤ c)) ] = α_c + β X_k,   (8)

where the log odds of being less than or equal to a given category c is modeled as a function of the covariates (X_k). In the current example, where c = 3, there were two cumulative logits that were modeled simultaneously, L_1 = log[ π_1 / (π_2 + π_3) ] and L_2 = log[ (π_1 + π_2) / π_3 ].

The POM assumes that the regression weights are consistent between the two cumulative logits defined above, referred to as the parallel regression assumption (Yee, 2015). This assumption was explored empirically using nested model chi-square comparisons (Yee, 2015). If there was evidence that the slopes vary, these were allowed to vary between the two cumulative logits modeled. This strategy revealed that the covariate, model term (i.e., intercept, linear slope for time, other within slopes), did not satisfy the parallel regression assumption; as a result, these were allowed to vary between the cumulative logits described above.

A POM was also fitted for the empirical type I error rate of the random effects. There was a total of 160 (26%) that were below the lower bound of the CI, 320 (52%) that were within the confidence band, and 132 (22%) that were above the confidence band. As in the fixed-effects analysis, the parallel regression assumption was not tenable for the covariate reflecting the variance component (i.e., within-cluster residual, variance of intercept, etc.), and these were allowed to vary between the cumulative logits.

Software. Data analysis was performed with R (R Core Team, 2015) using the metafor package (Viechtbauer, 2010). This included the fitting of the model shown in Equation 5 and the calculation of the Freeman–Tukey transformation, back-transformation, and the variance shown in Equations 2 through 4. POMs were fitted with the VGAM package (Yee, 2010, 2015). Figures were generated with the ggplot2 package within R (Wickham, 2009).

Limitations. There are at least three limitations of this study. The first stems from the nature of the meta-analysis, in that studies that did not report the empirical type I error rate cannot be included in this meta-analysis. Second, the studies included in this meta-analysis are not a random sample of the population of MC studies; therefore, the results are fixed to the MC conditions coded from the included studies. The extent to which the studies not included (due to omission or not reporting the empirical type I error rate) are significantly different from the included studies could bias the results. For this reason, some care needs to be taken when interpreting the results, and the external validity may be affected. Both limitations overlap, as the MC studies that did not report the type I error rate were also more likely to be journal articles. The degree to which these studies are different from the ones included in this meta-analysis may bias the results. Finally, many of the coded studies included in the meta-analysis were published in social science journals or were other document types completed within a social science context. The degree to which the data conditions included in the primary studies differ in disciplines outside of social science domains may affect the external validity.

Results

Fixed Effects

Summary information about the 13 articles coded in the meta-analysis is shown in Table 1 (two articles, Kwon [2011] and LeBeau [2012], are represented in four rows of Table 1 as they had two studies within a single article). As can be seen from Table 1, there are a total of 4,002 empirical type I error rates for the fixed effects with an average unweighted type I error rate of 0.063 and 95% CI = [0.055, 0.070]. The distribution of weighted effect sizes was highly concentrated between 0 and 0.1, but there were effect sizes greater than 0.15 and even one effect size larger than 0.4.

Table 1. Article Summary Information of Unweighted Empirical Type I Error Rates for the Fixed Effects and Other Monte Carlo Conditions.

Author               Source  K      Repl    Avg T1  Med T1  Min T1  Max T1  FE     CS range    wCS range
Black (2011)         Jour    40     1,000   0.088   0.075   0.000   0.310   4      (50, 50)    (20, 20)
Browne (2000)        Jour    20     930     0.058   0.053   0.003   0.099   2      (12, 48)    (18, 18)
Browne (2002)        Jour    14     10,000  0.053   0.054   0.047   0.057   2      (65, 65)    (62, 62)
Delpish (2006)       Diss    64     500     0.054   0.052   0.044   0.091   4      (30, 100)   (30, 30)
Ferron (2002)        Jour    192    10,000  0.054   0.052   0.045   0.079   4.75   (30, 500)   (3, 12)
Kwon (2011)          Diss    324    250     0.051   0.048   0.000   0.448   6      (40, 160)   (45, 45)
Kwon (2011)          Diss    540    250     0.049   0.044   0.008   0.192   10     (40, 160)   (45, 45)
LeBeau (2013)        Conf    244    300     0.059   0.057   0.033   0.100   4      (30, 50)    (6, 12)
LeBeau (2012)        Diss    1,500  500     0.063   0.061   0.015   0.119   5      (25, 50)    (6, 8)
LeBeau (2012)        Diss    750    500     0.082   0.070   0.024   0.274   5      (25, 25)    (6, 8)
Maas (2004)          Jour    144    1,000   0.059   0.058   0.038   0.088   4      (30, 100)   (5, 50)
Maas (2005)          Jour    108    1,000   0.054   0.054   0.037   0.075   4      (30, 100)   (5, 50)
Mallinckrodt (2001)  Jour    32     3,000   0.059   0.058   0.050   0.072   4      (100, 100)  (7, 7)
Murphy (2009)        Jour    64     10,000  0.059   0.052   0.047   0.125   4      (30, 200)   (5, 8)
Overall (2010)       Jour    66     1,500   0.062   0.063   0.039   0.087   4.33   (100, 100)  (9, 9)
Total                —       4,002  1,123   0.063   0.058   0.000   0.448   —      —           —

Note. K = number of effect sizes; Repl = replications; T1 = type I error rate; Med = median; Min = minimum; Max = maximum; FE = fixed effects; CS = cluster size; wCS = within-cluster size; Jour = journal article; Diss = dissertation; Conf = conference paper.

Additional summary statistic information for weighted back-transformed empirical type I error rates separated by various potential moderators can be seen in Table 2. From the table, there appear to be differences based on many of these moderators. For example, missing a random effect appears to have a strong impact on the type I error rate, with a mean of 0.078 compared with 0.055 when all random effects are modeled. Figure 2 shows the interaction between missing a random effect and the fixed effect term (e.g., intercept, time, or other within). As can be seen from Figure 2, the empirical type I error is only inflated for the fixed effect terms associated with time; the others are similar to one another.

Table 2 also shows that there are some differences in the empirical type I error rate for differing simulated random effect distributions. The empirical type I error rates for the Laplace and chi-square(1) distributions were inflated at 0.067 and 0.066, respectively, compared with the other two distributions at 0.054 and 0.056 for normal and uniform, respectively. The distributions of the empirical type I error rate for the simulated random effect distributions also showed evidence of being positively skewed. This can be seen from the median being less than the mean and from large maximum values, most notably for the normal, Laplace, and chi-square(1) distributions.

The effect of sample size is difficult to see from Table 2. For many sample sizes, the empirical type I error rate is close to the theoretical value of 0.05. Some deviations from this occur when the cluster sample size is 25 and the within-cluster sample size is 6, 8, and 20. There may be more complicated effects that underlie the data here, and these differences will be explored in more detail with the inferential model.

Finally, considering the variance and the number of replications, the weighted average empirical type I error rate was 0.058 with a 95% CI = [0.054, 0.062]. In addition, the omnibus Q test was significant, Q(4,001) = 22,798, p < .0001, suggesting that there is significant variation in the empirical type I error rates. The estimate for the between-study variance (i.e., τ̂²) was 0.003 for the omnibus model using 3,842 empirical type I error rates. The sample size differs compared with Table 1 due to missing data on the cluster and within-cluster sample sizes arising from reporting practices. For example, some articles did not provide tables for the entire factorial research design. Instead, in the reported tables from the article, there was some aggregation over simulation conditions.

Inferential statistics. Expanding on the significant Q test, simulation conditions were added to the model to attempt to explain variation in the empirical type I error rate. The predictors explained significant variation as shown by the significant chi-square test for moderators, Q(43) = 7,108, p < .0001, R²_Meta = .049. Although the predictors are shown to be highly significant based on the moderator chi-square test, the explanatory power of the model is small. This could be attributable to the small amount of variation between studies. The significant predictors can be seen in Table 3. Note, a handful of predictors were significant but had back-transformed estimates of zero to three decimal places (i.e., 0.000).

The average empirical type I error rate for the intercept term (i.e., initial status) is very close to 0.05 at 0.048. This suggests that, on average, the empirical type I error rate control is very good for the reference group. More specifically, this is for dissertations, a normal random effect distribution, and independent fitted and generated serial correlation structures. More simply, this would represent the situations where model assumptions have been adequately met. Assumption violations, such as nonnormal random effects, do not affect the empirical type I error rate for the intercept. This can be

Table 2. Summary Statistics of Weighted Back-Transformed Empirical Type I Error Rate for the Fixed Effects by Parameter, Level of Parameter, Article Source, Cluster Size, Within-Cluster Size, Serial Correlation, Random Effect Distribution, and Missing Random Effect.
Moderator Avg T1 Med T1 Min T1 Max T1 K Term Intercept .056 .055 .003 .448 1,835 Time .060 .057 .000 .274 1,693 Within .065 .066 .023 .110 474 Level parameter Level 1 .061 .059 .003 .274 2,044 Level 2 .057 .055 .000 .448 1,958 Cluster sample size 12 .059 .079 .003 .099 10 25 .072 .067 .021 .274 1,500 30 .061 .060 .036 .124 264 40 .045 .045 .009 .177 288 48 .048 .048 .036 .058 10 50 .058 .058 .000 .310 922 65 .053 .053 .046 .056 14 80 .048 .049 .003 .224 288 100 .054 .052 .036 .086 258 160 .052 .049 .013 .448 288 200 .050 .049 .046 .057 32 500 .050 .050 .046 .056 40 Random effect distribution Chi-square(1) .066 .062 .021 .268 846 Laplace .067 .064 .015 .272 846 Normal .054 .053 .000 .448 2,262 Uniform .056 .056 .037 .068 48 Article source Dissertations .059 .057 .003 .448 3,178 Journal .058 .056 .000 .310 680 Conference paper .059 .057 .033 .100 144 Missing random effect No .055 .054 .000 .448 3,252 Yes .078 .070 .024 .274 750 Within-cluster sample size 3 .051 .050 .047 .057 24 4 .055 .054 .044 .078 72 5 .057 .056 .040 .082 92 6 .066 .063 .023 .274 1,221 7 .058 .056 .049 .071 32 8 .067 .062 .015 .272 1,205 9 .060 .062 .038 .086 66 12 .060 .060 .034 .100 96 18 .054 .053 .003 .099 20 20 .078 .079 .000 .310 40 30 .054 .052 .037 .091 124 45 .049 .049 .003 .448 864 50 .054 .055 .036 .086 60 62 .053 .053 .046 .056 14 Note. T1 = type I error rate; Med = median; Min = minimum; Max = maximum; K = number of effect sizes; Intercept = regression terms modeling the intercept; Time = regression terms modeling time; Within = regression terms of another within-cluster variable. seen from Table 3, where the back-transformed values asso- On average, the time or linear trend fixed effect did not ciated with the intercept are zero to two or three decimal show evidence of being significantly different from the ini- places. tial status. However, there was evidence of inflation, on LeBeau et al. 9 Figure 2. 
Figure 2. Density plot of weighted empirical type I error rates by missing a RE and fixed effect term. Note. RE = random effect.

On average, misspecification of the random effect structure corresponded to an increase of 0.028 in the empirical type I error rate metric for the time term. This results in an estimated empirical type I error rate of 0.076, approximately a 50% increase in the empirical type I error rate metric.

There is evidence that including some form of serial correlation does improve the empirical type I error rate, but it does not fully overcome the inflation due to misspecification of the random effect structure. This can be seen from the negative coefficients for the three-way interaction of time by fitted serial correlation by misspecification of the random effect structure. Estimates for these terms are not as large as the two-way interaction of time by misspecification of the random effect structure, suggesting an adjustment but not a full correction of the inflation. The first-order autoregressive (AR(1)) and first-order autoregressive moving average (ARMA(1, 1)) structures seem to do the best job of correcting the inflation, as shown in Figure 3. The figure shows that for the fixed effect terms associated with time under a misspecified random effect structure, the empirical type I error rate is inflated. The effect is particularly large when the fitted serial correlation structure is independent (i.e., assuming no serial correlation underlies the data). Including some form of serial correlation does, however, help to limit the strong inflation.

Other within-individual variables behave very similarly to the intercept term. There are significant terms labeled as "Within" in Table 3, but many of these have very small estimates, suggesting terms of little practical significance.

Table 3. Fixed Effect Meta-Regression Results for Significant Predictors in the Transformed (β) and Type I Error Rate (β_π) Metrics.

Variable                      β        β_π     99% CI (β_π)       Z
Intercept                     0.224    0.048   [0.019, 0.090]     6.296
w/ sample size               −0.025   −0.000   [−0.000, −0.000]  −3.629
Fit SC UN                     0.051    0.002   [0.001, 0.002]     24.716
Time: Miss RE                 0.171    0.028   [0.025, 0.031]     41.378
Fit SC ARMA: Miss RE         −0.024   −0.000   [−0.000, −0.000]  −5.439
Within: Fit SC ARMA           0.028    0.000   [0.000, 0.000]     8.097
Within: Fit SC MA2            0.035    0.000   [0.000, 0.001]     9.473
Time: Fit SC AR1: Miss RE    −0.119   −0.000   [−0.000, −0.000]  −19.527
Within: Fit SC AR1: Miss RE   0.027    0.001   [0.003, 0.001]     3.438
Time: Fit SC ARMA: Miss RE   −0.115   −0.000   [−0.000, −0.000]  −18.905
Within: Fit SC ARMA: Miss RE  0.028    0.001   [0.002, 0.001]     3.596
Time: Fit SC MA1: Miss RE    −0.084   −0.000   [−0.000, −0.000]  −12.953
Time: Fit SC MA2: Miss RE    −0.105   −0.000   [−0.000, −0.000]  −16.131
Within: Fit SC MA2: Miss RE   0.027    0.002   [0.004, 0.001]     3.352

Note. Reference groups were dissertations, normal random effect distributions, intercept fixed effect terms, and independent fitted and generated serial correlation structures. CI = confidence interval; RE = random effect; Fit = fitted; SC = serial correlation; ARMA = autoregressive moving average; MA = moving average; Miss = missing; w/ = within; colon (:) = an interaction.

Figure 3. Interaction plot showing the three-variable interaction between fixed effect term, fitted serial correlation structure (Toeplitz and unstructured omitted from the figure), and misspecification of the random effect structure. Note. AR = autoregressive; ARMA = autoregressive moving average; MA = moving average; RE = random effect.

POM. Results for the final POM can be seen in Figure 4 for the covariates of primary interest. The plot depicts an interaction between the generated and fitted serial correlation structures, the model term, whether the random effect structure was misspecified, and the two cumulative logits, shown with solid lines (L1) and dashed lines (L2). As can be seen for the intercept term (leftmost panel), the second cumulative logit is very close to 1, indicating that very few of the observations were in the third group. Furthermore, the probability of being in Group 1 is also very small, indicating that fixed effect terms associated with the intercept tend to have the nominal rate enclosed in the 95% CI, indicating adequate type I error rate coverage.

Figure 4. Interaction plot showing results of the partial proportional odds model by the fitted serial correlation structure, generated serial correlation structure, fixed effect term, and misspecification of the random effect structure. Note. P(Y ≤ 1) represents the probability of being a conservative test (a level of significance less than .05) compared with accurate or liberal tests (a level of significance greater than .05). P(Y ≤ 2) represents the probability of being a conservative or accurate test compared with liberal tests. AR = autoregressive; ARMA = autoregressive moving average; MA = moving average; RE = random effect.

Table 4. Article Summary Information of Unweighted Empirical Type I Error Rates for the Random Effects and Other Monte Carlo Conditions.

Author          Source  K    Repl   Avg T1  Med T1  Min T1  Max T1  RE   Range CS   Range wCS
Black (2011)    Jour    20   1,000  0.153   0.080   0.000   0.860   2    (50, 50)   (20, 20)
Browne (2000)   Jour    20   930    0.103   0.091   0.032   0.186   2    (12, 48)   (18, 18)
Browne (2002)   Jour    11   1,000  0.057   0.059   0.040   0.075   3    (65, 65)   (62, 62)
Delpish (2006)  Diss    48   500    0.100   0.052   0.045   0.400   2    (30, 100)  (30, 30)
Kwon (2011)     Diss    162  250    0.043   0.048   0.000   0.132   2    (40, 160)  (45, 45)
Kwon (2011)     Diss    162  250    0.043   0.048   0.000   0.124   2    (40, 160)  (45, 45)
Maas (2004)     Jour    108  1,000  0.097   0.051   0.005   0.429   2    (30, 100)  (5, 50)
Maas (2005)     Jour    81   1,000  0.067   0.064   0.035   0.116   2    (30, 100)  (5, 50)
Total           —       621  561    0.066   0.052   0.000   0.860   —    —          —

Note. K = number of effect sizes; Repl = replications; T1 = type I error rate; Med = median; Min = minimum; Max = maximum; RE = random effects; CS = cluster size; wCS = within-cluster size; Jour = journal article; Diss = dissertation.
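The "transformed" metric in the meta-regression tables is the Freeman–Tukey double arcsine scale (Freeman & Tukey, 1950), with the "type I error rate metric" obtained by back-transforming via Miller's (1978) inverse. A sketch of both directions — a re-implementation of the standard published formulas, not the authors' code — where `x` is the number of rejections out of `n` Monte Carlo replications:

```python
import math

def freeman_tukey(x, n):
    """Freeman-Tukey double arcsine transform of a proportion x/n (radians)."""
    return 0.5 * (math.asin(math.sqrt(x / (n + 1)))
                  + math.asin(math.sqrt((x + 1) / (n + 1))))

def miller_inverse(t, n):
    """Miller's (1978) inverse of the double arcsine transform."""
    s = math.sin(2 * t)
    inner = s + (s - 1 / s) / n
    return 0.5 * (1 - math.copysign(1.0, math.cos(2 * t))
                  * math.sqrt(1 - inner ** 2))
```

For example, 25 rejections in 500 replications (an empirical rate of .05) round-trips back to approximately .05. The transform stabilizes the variance of proportion-type effect sizes, which is why the meta-regression coefficients are reported in both metrics.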
The middle panel in Figure 4 shows the model results for fixed effects associated with the linear slope. As can be seen from the figure, there is an interaction effect whereby misspecification of the random effect structure significantly decreases the probability of being in Group 1 or 2. This suggests that the probability of having an inflated type I error rate is larger than .75 in many cases. When all the random effect terms are correctly modeled, the probability of having an inflated type I error rate decreases significantly, as shown in the bottom plot of the middle panel.

Table 5. Random Effect Meta-Regression Results for Significant Predictors in the Transformed (β) and Type I Error Rate (β_π) Metrics.

Variable        β        β_π     99% CI (β_π)       Z
Intercept       0.334    0.106   [0.055, 0.172]     8.246
Var intercept   0.067    0.003   [0.002, 0.004]     24.168
Var slope      −0.055   −0.000   [−0.000, −0.000]  −19.909
Number FE 6    −0.134   −0.000   [−0.000, −0.000]  −2.806
Number FE 10   −0.134   −0.000   [−0.000, −0.000]  −2.796

Note. Reference groups were dissertations, two fixed and random effects, and within-cluster residuals. CI = confidence interval; FE = fixed effect.
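The two cumulative logits in the POM map a linear predictor to P(Y ≤ 1) (conservative) and P(Y ≤ 2) (conservative or accurate); under the partial proportional odds relaxation used here, each logit carries its own coefficients for covariates that violate the parallel regression assumption. A schematic of that mapping — sign conventions and the hypothetical threshold/slope values below are illustrative, not estimates from this study:

```python
import math

def logistic(z):
    """Inverse logit."""
    return 1 / (1 + math.exp(-z))

def cumulative_probs(alphas, betas, x):
    """P(Y <= j) = logistic(alpha_j + x'beta_j) for each cumulative logit j.

    In a partial proportional odds model, the beta vector may differ
    across logits; in a plain proportional odds model they are shared.
    """
    return [logistic(a + sum(bi * xi for bi, xi in zip(b, x)))
            for a, b in zip(alphas, betas)]

# Hypothetical values: two thresholds, one covariate whose slope differs by logit.
probs = cumulative_probs([-2.0, 1.5], [[0.5], [-1.0]], [1.0])
p_liberal = 1 - probs[1]  # P(Y = 3): probability of a liberal test
```

This is the quantity plotted in Figures 4 and 5: solid lines for P(Y ≤ 1), dashed lines for P(Y ≤ 2), with one minus the dashed line giving the probability of a liberal test.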
Finally, the rightmost panel in Figure 4 shows the effect for terms associated with other Level 1 slopes, not including the intercept or the linear slope for time. Compared with the intercept, the probability of being in both Groups 1 and 2 is slightly smaller, suggesting that the probability of being in Group 3 is higher than for the intercept and the linear slope when the random effect structure is correctly modeled.

Throughout all the conditions, the AR(1) fitted serial correlation structure did increase the probability of being in Group 1 or 2, especially for the linear slope and other within slopes (middle and right panels of Figure 4). The independent generated serial correlation structures also provided the highest probability of being in Group 1 or 2, a situation that is not surprising given that these are cases where serial correlation is not present in the data, and the random effects should adequately represent the dependency due to repeated measures. Finally, Figure 4 also depicts, across all conditions, that being in Group 1 (a type I error rate smaller than the nominal rate) is not common. Liberal tests are a much larger concern than conservative significance tests, particularly when serial correlation is present in the underlying data.

Random Effects

The unweighted empirical type I error rates for the random effects can be seen in Table 4. Of the 13 articles coded, only seven reported empirical type I error rates for the random effects, resulting in a total of 621 effect sizes. On average, the unweighted empirical type I error rate was 0.066, with evidence of positive skew and large variation as shown by the minimum and maximum values.

Considering the number of replications and variance, the weighted average empirical type I error rate was 0.066 with a 95% CI = [0.048, 0.087]. The omnibus Q test was also significant, Q(491) = 14,891, p < .001, suggesting that there is significant variation that may be able to be explained by the study conditions that were coded. The estimate of the between-study variation was 0.003.

Inferential models. With the significant Q test, a meta-regression was performed to attempt to explain the significant variation in the empirical random effect type I error rates. Significant predictors at the .01 level are shown in Table 5. The conditional model explained significant variation, as shown by the significant chi-square moderator test, Q_M(9) = 2,633, p < .001; R²_Meta = .759. Table 5 shows that the weighted empirical type I error rate was 0.106 for the reference condition: average sample size, dissertations, two random and fixed effects, and the variance of the within-cluster residuals. The other two significant predictors shown in the table are adjustments for the variance of the random effect terms. Specifically, the empirical type I error rate for the variance of the random intercepts tends to be slightly inflated compared with the within-cluster residuals. In contrast, the empirical type I error rate for the variance of the random slopes tends to be smaller (i.e., less biased) compared with the within-cluster residuals. Of note, the cluster sample size was also significant; however, this term was not reported in the table because the parameter estimate was −0.000 (in the transformed scale) and was not deemed significant in practice.

POM. Results for the POM can be seen in Figure 5 for the significant covariates. The plot shows the effect of the number of fixed effects and the random effect term on the probability of the empirical type I error rate being conservative, accurate, or liberal. In this figure, the solid lines represent the probability of a conservative test, the dashed lines represent the probability of an accurate or conservative test, and one minus the probability of the dashed lines represents liberal tests. Increasing the number of fixed effects significantly increases the probability of being in Group 1 or 2 compared with Group 3, suggesting that these tests are more likely to be accurate or conservative rather than liberal, with the probability being very close to 1. However, with only a couple of fixed effect terms in the model, the probability of a liberal test is much larger, as high as .7 for the significance test associated with the variance of the intercept.

Figure 5. Interaction plot showing results of the partial proportional odds model by the random effect term and number of fixed effects. Note. P(Y ≤ 1) represents the probability of being a conservative test (a level of significance less than .05) compared with accurate or liberal tests (a level of significance greater than .05). P(Y ≤ 2) represents the probability of being a conservative or accurate test compared with liberal tests.

The within-cluster residuals (labeled as Res) and the variance of the intercept tend to be accurate tests (most likely to be in Group 2), particularly when more fixed effects are included in the model. In contrast, there is a rather large probability (about .90) of the empirical type I error rate being conservative for the variance of the slope when more than six fixed effects are included in the model, as shown in Figure 5.

Discussion

The purpose of this meta-analysis was to help improve the external validity of MC LMM studies, better understand gaps in simulation conditions, and inform applied researchers of assumption violations that can affect study results. A meta-analysis of the empirical type I error rates from MC studies was performed, and the manipulated and nonmanipulated study conditions were combined to empirically understand which are most important.

There are a handful of takeaway messages from this meta-analysis. First, not modeling a random effect when that term underlies the data leads to inflated type I error rates. For example, in the study by LeBeau (2012), the data were generated with a random effect for time, but this random effect was omitted during model fitting. This leads to inflated type I error rates for the fixed effect parameters associated with that random effect. This inflation could lead researchers to reject true null hypotheses more often than one would expect given their specified α level. Given that the level of significance is one aspect researchers have direct control over, the disconnect between the empirical value and the one specified is problematic for applied researchers.

The problem of misspecification of the random effect structure in this way may occur most often when models fail to converge. In this situation, fixing a random effect to zero can drastically improve the convergence rate and may be a step taken by applied researchers. Understanding ways to overcome this inflation, or to better detect serial correlation, would be meaningful for the literature.

Fitting a serial correlation structure helps to correct the inflation in type I error rates when a random effect is missing. However, fitting the serial correlation structure does not fully recover the type I error rate to nominal levels. In addition, looking at the parameter estimates and Figure 3, the AR(1) and ARMA(1, 1) structures seem to do a better job of reducing the type I error rate than the MA(1) or MA(2) serial correlation structures (−.013 and −.012 vs. −.0006 and −.010, respectively). This may be due to the AR(1) and ARMA(1, 1) structures better representing longitudinal data structures (i.e., correlation decreasing between measurement occasions as the time lag increases), whereas the MA structures may better align with other nested data structures.

No relationship was found between the random effect distribution and the type I error rate for the fixed effects, consistent with what has been found previously in the literature (LeBeau, 2013; Maas & Hox, 2004, 2005). Similarly, the generated serial correlation structure does not have an impact on the type I error rates of the fixed effects. This suggests that the random effects do an adequate job of accounting for the dependency due to repeated measurements, regardless of the type of serial correlation structure underlying the data.

Finally, the small effects related to the sample sizes are an interesting finding. With maximum likelihood being an asymptotic estimation method, the larger sample sizes may have provided better estimates. The small effect helps to inform researchers with relatively small sample sizes that the type I error rate can be held in check with few observations and relatively few clusters (12 clusters was the smallest condition in the current meta-analysis). However, sample size may play an important role in the estimates of the variances of the random components and would be worthy of further study.

Informing Applied Researchers

There are three recommendations for applied researchers with respect to empirical type I error rates for the LMM. First, the results suggest that conservative tests are less common with the LMM for the fixed effect terms (see Figure 4). Instead, significance tests are much more likely to be liberal. One aspect to be particularly careful with is when the random effect for time is not included in the final model. This could be a case where the random effect structure is misspecified, and severe inflation of the empirical type I error rate may be present for fixed effects associated with the linear slope. In these situations, exploring variation in the trajectories of individuals would be helpful to see whether there is variation in the data not being captured by the LMM. If there is evidence of misspecification of the random effect structure and the inclusion of the random effect for the linear slope leads to nonconvergence, using robust standard errors may be a way to alleviate elevated empirical type I error rates.

A second recommendation is to check for the presence of serial correlation in the data, especially when the repeated measurements are taken close in time. Adding serial correlation to the LMM can help to statistically adjust for a potential source of random effect misspecification and help alleviate severe empirical type I error rate inflation. Finally, this meta-analysis also found that simpler serial correlation structures (AR(1)) perform just as well as more complicated structures (ARMA(1, 1)), as shown in Figure 3 and Figure 4.

Finally, statistical tests for the random effects also show evidence of being too liberal on average (see Table 5), particularly with few fixed effects included in the model (see Figure 5). This can be problematic if applied researchers are counting on these tests for the inclusion of random effects in the LMM. However, as shown by the fixed effect results, it is extremely problematic to underspecify the random effect structure; therefore, including more random effects than needed is likely less problematic, as long as the model still converges. Research into this would be worthy of further study.

Informing Future MC Studies

This meta-analysis can help inform additional areas of study. None of the MC studies allowed variables besides the initial status and linear trend to vary. The implications for the empirical type I error rates when the trajectory of additional variables is allowed to vary between clusters would be interesting to explore, especially in relation to cases when the random effect structure is misspecified.

Second, the numbers of fixed effects in the MC studies coded were homogeneous and were not significant in the final inferential model. As can be seen in Table 1, most studies had between four and six fixed effects, and only a single study included more, with 10 fixed effects. Many applied studies using the LMM tend to have more than four to six fixed effects; for example, Harwell, Post, Medhanie, Dupuis, and LeBeau (2013) included 33 covariates in a three-level longitudinal LMM. Better understanding the implications of many fixed effects, in relation to assumption violations and small sample size conditions, would be a welcome addition to the MC literature. Relatedly, little attention in the MC LMM literature has been given to three-level models, and these would be worthy of more attention.
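For researchers designing new MC studies, the number of replications determines how finely conservative, accurate, and liberal tests can be distinguished, because the Monte Carlo confidence band around the nominal rate narrows with more replications. A sketch of one common banding rule (a normal-approximation binomial interval; the exact rule used to form the three groups in this meta-analysis may differ):

```python
import math

def mc_ci(alpha=0.05, reps=1000, z=1.96):
    """Approximate 95% Monte Carlo confidence band around the nominal
    rate for a simulation with `reps` replications (normal approximation)."""
    se = math.sqrt(alpha * (1 - alpha) / reps)
    return alpha - z * se, alpha + z * se

def classify(rate, alpha=0.05, reps=1000):
    """Label an empirical type I error rate relative to the MC band."""
    lo, hi = mc_ci(alpha, reps)
    if rate < lo:
        return "conservative"
    if rate > hi:
        return "liberal"
    return "accurate"
```

With 1,000 replications the band around .05 is roughly (.036, .064), so a rate such as the 0.078 observed under a missing random effect would be flagged as liberal; studies using only 250 replications, as some coded here did, have a much wider band and correspondingly less power to detect inflation.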
Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

References marked with an asterisk indicate studies included in the meta-analysis.

Aloe, A. M., Becker, B. J., & Pigott, T. D. (2010). An alternative to R2 for assessing linear models of effect size. Research Synthesis Methods, 1, 272-283.
*Black, A. C., Harel, O., & McCoach, D. B. (2011). Missing data techniques for multilevel data: Implications of model misspecification. Journal of Applied Statistics, 38, 1845-1865.
*Browne, W., & Draper, D. (2000). Implementation and performance issues in the Bayesian and likelihood fitting of multilevel models. Computational Statistics, 15, 391-420.
*Browne, W., Draper, D., Goldstein, H., & Rasbash, J. (2002). Bayesian and likelihood methods for fitting multilevel models with complex level-1 variation. Computational Statistics & Data Analysis, 39, 203-225.
Browne, W., & Goldstein, H. (2010). MCMC sampling for a multilevel model with nonindependent residuals within and between cluster units. Journal of Educational and Behavioral Statistics, 35, 453-473.
Cooper, H., Hedges, L. V., & Valentine, J. C. (2009). The handbook of research synthesis and meta-analysis. New York, NY: Russell Sage Foundation.
*Delpish, A. (2006). Comparison of estimators in hierarchical linear modeling: Restricted maximum likelihood versus bootstrap via minimum norm quadratic unbiased estimators (Doctoral dissertation). Florida State University, Tallahassee.
Diggle, P., Heagerty, P., Liang, K.-Y., & Zeger, S. (2002). Analysis of longitudinal data. Oxford, UK: Oxford University Press.
*Ferron, J., Dailey, R., & Yi, Q. (2002). Effects of misspecifying the first-level error structure in two-level models of change. Multivariate Behavioral Research, 37, 379-403.
Fitzmaurice, G., Laird, N., & Ware, J. (2004). Applied longitudinal analysis. Hoboken, NJ: Wiley-IEEE.
Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and the square root. The Annals of Mathematical Statistics, 21, 607-611.
Goldstein, H. (2010). Multilevel statistical models. West Sussex, UK: Wiley.
Harwell, M. R., Post, T. R., Medhanie, A., Dupuis, D. N., & LeBeau, B. (2013). A multi-institutional study of high school mathematics curricula and college mathematics achievement and course taking. Journal for Research in Mathematics Education, 44, 742-774.
Harwell, M. R., Rubinstein, E. N., Hayes, W. S., & Olds, C. C. (1992). Summarizing Monte Carlo results in methodological research: The one- and two-factor fixed effects ANOVA cases. Journal of Educational Statistics, 17, 315-339.
Hoaglin, D. C., & Andrews, D. F. (1975). The reporting of computation-based results in statistics. The American Statistician, 29, 122-126.
Kwok, O., West, S., & Green, S. (2007). The impact of misspecifying the within-subject covariance structure in multiwave longitudinal multilevel models: A Monte Carlo study. Multivariate Behavioral Research, 42, 557-592.
*Kwon, H. (2011). A Monte Carlo study of missing data treatments for an incomplete level-2 variable in hierarchical linear models (Doctoral dissertation). The Ohio State University, Columbus.
Laird, N., & Ware, J. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963-974.
*LeBeau, B. (2012). The impact of ignoring serial correlation in longitudinal data analysis with the linear mixed model: A Monte Carlo study (Doctoral dissertation). University of Minnesota, Twin Cities.
*LeBeau, B. (2013, April). Impact of non-normal level one and two residuals on the linear mixed model. Paper presented at the American Educational Research Association Conference, San Francisco, CA.
*Maas, C., & Hox, J. (2004). The influence of violations of assumptions on multilevel parameter estimates and their standard errors. Computational Statistics & Data Analysis, 46, 427-440.
*Maas, C., & Hox, J. (2005). Sufficient sample sizes for multilevel modeling. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 1(3), 86-92.
*Mallinckrodt, C. H., Clark, W. S., & David, S. R. (2001). Type I error rates from mixed effects model repeated measures versus fixed effects ANOVA with missing values imputed via last observation carried forward. Drug Information Journal, 35, 1215-1225.
Miller, J. J. (1978). The inverse of the Freeman-Tukey double arcsine transformation. The American Statistician, 32, 138.
*Murphy, D., & Pituch, K. (2009). The performance of multilevel growth curve models under an autoregressive moving average process. The Journal of Experimental Education, 77, 255-284.
*Overall, J. E., & Tonidandel, S. (2010). The case for use of simple difference scores to test the significance of differences in mean rates of change in controlled repeated measurements designs. Multivariate Behavioral Research, 45, 806-827.
Paxton, P., Curran, P. J., Bollen, K. A., Kirby, J., & Chen, F. (2001). Monte Carlo experiments: Design and implementation. Structural Equation Modeling, 8, 287-312.
R Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Available from https://www.R-project.org/
Raudenbush, S. W. (2009). Analyzing effect sizes: Random-effects models. The Handbook of Research Synthesis and Meta-Analysis, 2, 295-316.
Raudenbush, S. W., & Bryk, A. (2002). Hierarchical linear models: Applications and data analysis methods. Thousand Oaks, CA: Sage.
Rubinstein, R. Y., & Kroese, D. P. (2016). Simulation and the Monte Carlo method (Vol. 10). Hoboken, NJ: John Wiley.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. Oxford, UK: Oxford University Press.
Skrondal, A. (2000). Design and analysis of Monte Carlo experiments: Attacking the conventional wisdom. Multivariate Behavioral Research, 35, 137-167.
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1-48. Retrieved from http://www.jstatsoft.org/v36/i03/
Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. New York, NY: Springer.
Yee, T. W. (2010). The VGAM package for categorical data analysis. Journal of Statistical Software, 32(10), 1-34.
Yee, T. W. (2015). Vector generalized linear and additive models: With an implementation in R. New York, NY: Springer.

Author Biographies

Brandon LeBeau is an assistant professor of Educational Statistics and Measurement at the University of Iowa. His research interests include longitudinal data, reproducible workflows, and research software development.

Yoon Ah Song is a graduate student at the University of Iowa in the Educational Measurement and Statistics program. She is interested in item response theory and linear mixed models.

Wei Cheng Liu is a graduate of the Educational Measurement and Statistics program at the University of Iowa. He is interested in program evaluation.

SAGE Open

Published: Dec 22, 2018