The Yasuyo building, Shinjuku, March 2025

Introduction

Analysing repeated observations on the same units over time (i.e. panel data/longitudinal data) is a problem that comes up frequently. Examples I’ve encountered include visitors returning to a website, the sale prices of a group of historic buildings over time, people’s patterns of TV viewing, and individuals’ stated well-being in response to repeated survey questions.

Below are some notes to refresh my knowledge of the basics of the topic. What follows assumes some knowledge of regression and linear algebra. It is based on the treatment of the topic in Davidson & MacKinnon and Johnston & DiNardo.

Panel data and the estimation issues it creates

We assume there are T observations over time on m different units, where y_it is the dependent variable for unit i at time t and x_it is the explanatory variable for unit i at time t.

panel_1

The error for unit i at time t consists of a unit specific effect η_i, which does not vary over time, and an effect ε_it that is specific to unit i at time t.

panel_2

The standard Ordinary Least Squares (OLS) estimator for the vector β of n coefficients is given by the following equation, where the dependent variable y is an (m x T, 1) vector and the n explanatory variables are represented by a matrix X of dimensions (m x T, n).

panel_3

This gives y_hat, the predicted value of y for a given X.

panel_4

A geometric interpretation of this is that y_hat is the projection of y onto the space spanned by the columns of X, i.e. the explanatory variables. The vector of coefficients β that determines y_hat is the one that minimises the difference (the sum of squared errors) between y and y_hat.
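
As a rough illustration of the OLS estimator and its projection interpretation, here is a minimal sketch on simulated data. The panel sizes, the true coefficients and the variable names are all assumptions made for the example, and the regression deliberately ignores the panel structure.

```python
import numpy as np

rng = np.random.default_rng(0)
m, T = 50, 6                        # illustrative number of units and periods
unit = np.repeat(np.arange(m), T)   # unit index for each of the m*T rows

x = rng.normal(size=m * T)
eta = rng.normal(size=m)            # unit specific effect eta_i
eps = rng.normal(size=m * T)        # effect specific to unit i at time t
y = 1.0 + 2.0 * x + eta[unit] + eps

# X is (m*T, n) with n = 2: a constant and one explanatory variable.
X = np.column_stack([np.ones(m * T), x])

# OLS via the normal equations: beta_hat = (X'X)^{-1} X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# y_hat is the projection of y onto the column space of X,
# so the residuals are orthogonal to every column of X.
y_hat = X @ beta_hat
resid = y - y_hat
print(beta_hat, X.T @ resid)
```

The printed `X.T @ resid` should be zero up to floating point error, which is the orthogonality condition behind the projection picture.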

If we estimate a regression on panel data using a standard OLS estimator without taking account that the data comes from different units (i.e. we run a regression as if we have m*T independent observations, rather than T repeated observations on m different units) then this is likely to create a series of problems:

-Biased estimators of the coefficients from omitted variable bias: Running a regression without adjusting for the fact that repeated observations come from the same unit is likely to result in a form of omitted variable bias. This is because there are likely to be unit specific effects that affect the outcome variable but which won’t be captured by the explanatory variables. If these unit specific effects are correlated with the variables whose influence we are trying to estimate, then the OLS estimates will be biased.

-Inefficient estimators due to heteroscedasticity of errors: Each unit in the regression data will probably have a different pattern of unit specific errors, so the errors are likely to vary across the members of the panel, and errors for the same unit will be correlated with each other because they share the same η_i. The errors therefore no longer have the constant variance and zero correlation that OLS needs to be efficient, which is likely to result in inefficient estimates.
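
The omitted variable bias point can be sketched with a small simulation. Everything here (sample sizes, the true slope of 2, the way x is made to depend on η_i) is an assumption chosen for illustration only.

```python
import numpy as np

def pooled_slope(x, y):
    """Slope from a pooled OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ X, X.T @ y)[1]

rng = np.random.default_rng(1)
m, T, beta = 500, 5, 2.0
unit = np.repeat(np.arange(m), T)
eta = rng.normal(size=m)            # unobserved unit specific effect
eps = rng.normal(size=m * T)

# Case 1: x is unrelated to the unit effect.
x_uncorr = rng.normal(size=m * T)
y_uncorr = beta * x_uncorr + eta[unit] + eps

# Case 2: x is correlated with the unit effect.
x_corr = eta[unit] + rng.normal(size=m * T)
y_corr = beta * x_corr + eta[unit] + eps

slope_uncorr = pooled_slope(x_uncorr, y_uncorr)
slope_corr = pooled_slope(x_corr, y_corr)
print(slope_uncorr, slope_corr)
```

In the first case the pooled slope should land near the true value of 2. In the second it should be pulled upwards, because the omitted η_i is picked up by x.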

Two standard approaches to estimating panel data to address these:

  1. Fixed effects (which helps remove the bias)
  2. Random effects (which helps improve the efficiency)

are now described. In what follows, we refer to the units as individuals or groups.

1. Fixed effects: Using the variation in the individual units over time (and removing the cross-sectional variation) to estimate the coefficients

In the simplest form of fixed effects estimation we assume that each unit’s identity has a specified effect on the dependent variable which does not vary over time e.g. if y is spending and X is a set of data on personal characteristics then person i will always spend η_i more (or less) depending on who they are.

We can adjust for these effects by estimating the following equation, which adds a set of dummy variables D to the explanatory variables X, where X covers all units and time periods. η is a vector of individual specific effects which do not vary over time. D is an (m*T, m) matrix of group/individual indicators: column i is 1 for person/unit i’s data in all time periods and zero otherwise.

panel_5

We do not necessarily need to estimate the regression using dummy variables to control for the fixed effects, as there are other ways to adjust for these. A standard approach is to take the dependent variable y and each variable in X and, for each unit, subtract the mean value of that variable for that unit.
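
A small numerical check of the two routes, assuming a single explanatory variable and simulated data (the sizes, coefficients and helper name `demean` are illustrative): the coefficient from the dummy variable regression and the coefficient from the demeaned regression should coincide.

```python
import numpy as np

rng = np.random.default_rng(2)
m, T = 30, 4
unit = np.repeat(np.arange(m), T)
eta = rng.normal(size=m)
x = eta[unit] + rng.normal(size=m * T)          # x correlated with eta
y = 2.0 * x + eta[unit] + rng.normal(size=m * T)

# Route 1: dummy variable regression. D is (m*T, m); column i flags unit i.
D = np.eye(m)[unit]
coef = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0]
beta_lsdv, eta_hat = coef[0], coef[1:]

# Route 2: subtract each unit's mean from x and y, then run OLS.
def demean(v):
    means = np.bincount(unit, weights=v) / T
    return v - means[unit]

x_w, y_w = demean(x), demean(y)
beta_within = (x_w @ y_w) / (x_w @ x_w)

print(beta_lsdv, beta_within)
```

The dummy variable route also returns an estimate of each η_i, which the demeaning route does not produce directly.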

To see why this works we split the explanatory variables into two sets X_1 and X_2. We will later specialise to X_2 being the dummy variables that identify which unit each observation relates to.

panel_6

This means that we first regress y and the set of variables X_1 on the other variables X_2 and obtain as residuals the variation in y and X_1 that can’t be explained by variation in X_2. If we then estimate an OLS regression on this transformed y and X_1 data, we recover the coefficients on X_1.
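
The partialling-out argument can be checked numerically. This is a sketch on made-up data with arbitrary X_1 and X_2 (not yet the dummy variable case), and the helper name `residualise` is my own.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs = 200
X2 = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, 3))])
X1 = rng.normal(size=(n_obs, 2)) + 0.5 * X2[:, 1:3]   # X1 correlated with X2
y = (X1 @ np.array([1.0, -2.0])
     + X2 @ np.array([0.5, 1.0, 0.0, 3.0])
     + rng.normal(size=n_obs))

def residualise(v, Z):
    """Residuals from regressing v (a vector or matrix) on Z."""
    coef = np.linalg.lstsq(Z, v, rcond=None)[0]
    return v - Z @ coef

# Coefficients on X1 from the full regression on [X1, X2].
full = np.linalg.lstsq(np.column_stack([X1, X2]), y, rcond=None)[0]
beta1_full = full[:2]

# Coefficients on X1 after partialling X2 out of both y and X1.
y_res = residualise(y, X2)
X1_res = residualise(X1, X2)
beta1_partial = np.linalg.lstsq(X1_res, y_res, rcond=None)[0]

print(beta1_full, beta1_partial)
```

Both routes should give the same coefficients on X_1 up to floating point error.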

In the special case of adjusting for the unit specific effects, the X_2 variables are the dummy variables in the matrix D. As the dummy variables are specified at the level of the group/individual, i.e. they are 1 for individual or group i across all time periods, this is equivalent to taking each independent and dependent variable and subtracting its unit level mean.

panel_7

Where the following properties of M_D are used:

panel_8
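
To make the role of M_D concrete, the sketch below builds it explicitly for a tiny panel, assuming it is the usual residual-maker matrix I - D(D'D)^{-1}D'. It then checks the properties I would expect to be relevant (symmetry, idempotency, annihilating D) and that applying it subtracts unit means. The panel size is arbitrary.

```python
import numpy as np

m, T = 4, 3
unit = np.repeat(np.arange(m), T)
D = np.eye(m)[unit]                         # (m*T, m) unit dummies

# Projection onto the dummies, and the matrix that removes that projection.
P_D = D @ np.linalg.inv(D.T @ D) @ D.T
M_D = np.eye(m * T) - P_D

rng = np.random.default_rng(4)
y = rng.normal(size=m * T)
unit_means = y.reshape(m, T).mean(axis=1)   # rows of the reshape are units

print(np.allclose(M_D, M_D.T))                      # symmetric
print(np.allclose(M_D @ M_D, M_D))                  # idempotent
print(np.allclose(M_D @ D, 0.0))                    # annihilates the dummies
print(np.allclose(M_D @ y, y - unit_means[unit]))   # subtracts unit means
```

In practice one would never form the (m*T, m*T) matrix for a large panel. Subtracting group means directly gives the same result at far lower cost.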

We have shown that we can control for individual specific effects by subtracting the group level means from each variable, even though we may not have data on the individual level fixed effects. This means that only variation relative to an individual’s mean affects the estimate. The level of a unit’s mean relative to those of other units does not matter. As a result, fixed effects estimators are often referred to as within groups estimators, as they only use the variation within groups. The OLS estimator is a mixture of within and between groups estimation, as shown in Appendix 1, where the between groups estimator is based on running a regression on the mean of each unit’s values, removing any time variation.

The benefits of fixed effects estimation

In most cases it is highly likely that there will be factors that affect the dependent variable we are interested in, but which we do not have information on, e.g. people may live in the same street and have similar incomes, but spend money on consumption in quite different ways. This may, for example, relate to differences in wealth, genetic variation or an individual’s upbringing, but we are unlikely to have complete information on this. Fixed effects offer us the promise of being able to adjust for these differences even if we don’t have information on them.

As the errors from the fixed effects estimator are likely to vary across the different units there will be heteroscedasticity in the errors resulting in fixed-effects estimators being less efficient i.e. having a higher variance.

Conceptually, the ability of fixed effects to remove the time-invariant unit effects is attractive. However, the fixed effects estimator is also more sensitive to measurement error in the explanatory variables, as it relies on their variation over time, and a larger share of that variation is likely to be driven by measurement error.
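
The measurement error point can be illustrated with a simulation. The setup is an assumption built to make the effect visible: most of the true variation in x is between units, there is no unit effect in y at all, and the observed x carries independent noise.

```python
import numpy as np

rng = np.random.default_rng(5)
m, T, beta = 400, 5, 2.0
unit = np.repeat(np.arange(m), T)

# True x: large persistent unit component, small variation over time.
x_true = 2.0 * rng.normal(size=m)[unit] + 0.5 * rng.normal(size=m * T)
y = beta * x_true + rng.normal(size=m * T)

# Observed x is the true x plus measurement error.
x_obs = x_true + rng.normal(size=m * T)

def demean(v):
    return v - (np.bincount(unit, weights=v) / T)[unit]

# Pooled OLS slope (with a constant) on the mismeasured x.
X = np.column_stack([np.ones(m * T), x_obs])
slope_pooled = np.linalg.solve(X.T @ X, X.T @ y)[1]

# Within (fixed effects) slope on the mismeasured x.
x_w, y_w = demean(x_obs), demean(y)
slope_within = (x_w @ y_w) / (x_w @ x_w)

print(slope_pooled, slope_within)
```

Both slopes should fall below the true value of 2, but the within slope should be attenuated much more. Demeaning removes most of the genuine variation in x while leaving most of the noise in place.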

2. Random effects: Treating the unit specific effects as random variation and adjusting for the heteroscedasticity with Generalised Least Squares (GLS)

In the random effects model we treat the individual level effects as random variables. We therefore estimate a form of GLS which corrects the estimator for the varying error terms across units. In this instance we make a series of assumptions about the error term:

panel_9

The variance parameters in lambda are estimated by using the variance of the:

-within groups regression errors to estimate the unit specific variance
-between groups regression errors to estimate the variance of errors across units
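
One common way of turning the two sets of residuals into variance estimates can be sketched as follows. This is my assumption about the details, not necessarily the exact recipe intended above. The within residuals estimate the variance of ε_it. The between residuals estimate the variance of η_i plus the variance of ε_it divided by T, so the unit effect variance is obtained by subtraction. The degrees of freedom corrections and the simulated data are also illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
m, T, k = 500, 5, 1                     # k = number of slope coefficients
sigma_eta2, sigma_eps2 = 4.0, 1.0       # true variances used to simulate
unit = np.repeat(np.arange(m), T)

x = rng.normal(size=m * T)
eta = np.sqrt(sigma_eta2) * rng.normal(size=m)
eps = np.sqrt(sigma_eps2) * rng.normal(size=m * T)
y = 1.0 + 2.0 * x + eta[unit] + eps

def unit_mean(v):
    return np.bincount(unit, weights=v) / T

# Within groups regression on demeaned data.
x_w, y_w = x - unit_mean(x)[unit], y - unit_mean(y)[unit]
b_w = (x_w @ y_w) / (x_w @ x_w)
rss_w = np.sum((y_w - b_w * x_w) ** 2)
s2_eps = rss_w / (m * T - m - k)

# Between groups regression on the m unit means (with a constant).
Xb = np.column_stack([np.ones(m), unit_mean(x)])
yb = unit_mean(y)
coef_b = np.linalg.lstsq(Xb, yb, rcond=None)[0]
rss_b = np.sum((yb - Xb @ coef_b) ** 2)
s2_between = rss_b / (m - k - 1)        # estimates sigma_eta2 + sigma_eps2 / T

# Back out the unit effect variance, truncating at zero.
s2_eta = max(s2_between - s2_eps / T, 0.0)
print(s2_eps, s2_eta)
```

With this many units the two estimates should land reasonably close to the simulated values of 1 and 4.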

If there is no variation in unit specific effects then lambda collapses to 0 and we revert to a standard OLS estimator. As the number of observations over time T becomes large, lambda tends to 1 and the random effects estimator converges on the fixed effects (within groups) estimator, because the full unit mean is then subtracted from every variable. For values of lambda between 0 and 1 the random effects estimator uses a mixture of the within groups and between groups variation (see Appendix 1 for the corresponding decomposition of OLS).

We then estimate OLS on the transformed data: panel_10
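
A sketch of the transformation, assuming the common quasi-demeaning form in which a fraction lambda of each unit mean is subtracted from every variable, with lambda = 1 - sqrt(σ_ε² / (σ_ε² + T σ_η²)). This parametrisation is an assumption and may differ from the definition of lambda in the equations above. The variances are treated as known here to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(7)
m, T = 200, 5
sigma_eta2, sigma_eps2 = 4.0, 1.0
unit = np.repeat(np.arange(m), T)

x = rng.normal(size=m * T)
y = (1.0 + 2.0 * x
     + np.sqrt(sigma_eta2) * rng.normal(size=m)[unit]
     + np.sqrt(sigma_eps2) * rng.normal(size=m * T))

def unit_mean(v):
    return (np.bincount(unit, weights=v) / T)[unit]

def quasi_demeaned_slope(lam):
    """OLS slope after subtracting lam times the unit mean from each variable."""
    y_t = y - lam * unit_mean(y)
    x_t = x - lam * unit_mean(x)
    const_t = np.full(m * T, 1.0 - lam)     # the constant is transformed too
    Z = np.column_stack([const_t, x_t])
    # lstsq copes with lam = 1, where the transformed constant is all zeros.
    return np.linalg.lstsq(Z, y_t, rcond=None)[0][1]

lam = 1.0 - np.sqrt(sigma_eps2 / (sigma_eps2 + T * sigma_eta2))
beta_re = quasi_demeaned_slope(lam)
beta_pooled = quasi_demeaned_slope(0.0)     # no transformation: pooled OLS
beta_within = quasi_demeaned_slope(1.0)     # full demeaning: within groups
print(lam, beta_re, beta_pooled, beta_within)
```

Under this parametrisation, setting `lam` to 0 leaves the data untouched and setting it to 1 subtracts the full unit mean, so the random effects slope sits between the pooled and within groups slopes.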

Where we have used: panel_11 Calculating the variance covariance matrix: panel_12

3. Choosing between fixed and random effects:

In principle if the errors of the random effects model are not correlated with the explanatory variables then the random effects estimates of the coefficients should be better than in the fixed effects model as they will be unbiased and have lower variance.

However, at the same time it feels very plausible that the unit specific factors which we don’t have data on a) affect the dependent variable and b) will often be correlated with the explanatory variables, biasing the coefficients that we estimate using random effects.

A standard test to distinguish the two is the Wu-Hausman test, which is based on the difference between the vectors of estimated coefficients from fixed and random effects estimation and the two corresponding variance-covariance matrices. panel_13

Where, under the null hypothesis, H follows a Chi squared distribution with degrees of freedom equal to the number of coefficients being estimated. The null hypothesis is that the random effects estimator is right. If we do not reject the null, that implies there is no systematic difference between the random effects and the fixed effects estimates, and we should use the lower variance random effects estimator. However, if the fixed effects estimates are substantively different, that suggests random effects may be biased and we should use fixed effects instead.
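
A minimal sketch of the test statistic, assuming the usual quadratic form: the difference in coefficients, weighted by the inverse of the difference between the fixed effects and random effects variance-covariance matrices. The coefficient vectors and matrices below are made-up numbers for illustration, not estimates from any data, and the function name `hausman` is my own.

```python
import numpy as np

def hausman(b_fe, b_re, V_fe, V_re):
    """Quadratic form in the difference between fixed and random effects estimates."""
    diff = np.asarray(b_fe) - np.asarray(b_re)
    V_diff = np.asarray(V_fe) - np.asarray(V_re)
    return float(diff @ np.linalg.solve(V_diff, diff))

# Illustrative inputs only (two coefficients).
b_fe = np.array([1.0, 2.0])
b_re = np.array([0.9, 2.1])
V_fe = np.diag([0.02, 0.03])
V_re = np.diag([0.01, 0.02])

H = hausman(b_fe, b_re, V_fe, V_re)      # approximately 2.0 for these inputs

# 5% critical value of a Chi squared distribution with 2 degrees of freedom.
CHI2_CRIT_2DF_5PCT = 5.991
reject_random_effects = H > CHI2_CRIT_2DF_5PCT
print(H, reject_random_effects)
```

With these inputs H is well below the critical value, so the null would not be rejected. In finite samples the difference of the two variance matrices is not guaranteed to be positive definite, which is a known practical wrinkle that this sketch does not handle.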

If the random effects estimator passes the test, this may indicate that it is a better estimator, but it is also consistent with there not being enough variation in the explanatory variables to distinguish the two types of estimator.

Appendix 1: The OLS estimator is a mixture of between and within groups estimators

panel_14
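
The decomposition can be checked numerically. The sketch below assumes the regression includes a constant, which is handled by centring on the grand mean, and uses simulated data with two explanatory variables. It compares the OLS coefficients with a matrix-weighted average of the within and between groups coefficients, where the weights are the within and between cross-product matrices of X.

```python
import numpy as np

rng = np.random.default_rng(8)
m, T, k = 40, 5, 2
unit = np.repeat(np.arange(m), T)

X = rng.normal(size=(m * T, k)) + rng.normal(size=(m, k))[unit]
y = X @ np.array([1.0, -1.0]) + rng.normal(size=m)[unit] + rng.normal(size=m * T)

def unit_mean(A):
    """Unit means repeated for each row; works for vectors and matrices."""
    return A.reshape(m, T, -1).mean(axis=1)[unit].reshape(A.shape)

# Within variation: deviations from unit means.
X_w, y_w = X - unit_mean(X), y - unit_mean(y)
# Between variation: unit means as deviations from the grand mean.
X_b, y_b = unit_mean(X) - X.mean(axis=0), unit_mean(y) - y.mean()

S_w, S_b = X_w.T @ X_w, X_b.T @ X_b
b_within = np.linalg.solve(S_w, X_w.T @ y_w)
b_between = np.linalg.solve(S_b, X_b.T @ y_b)

# OLS with a constant, via grand-mean centring.
X_c, y_c = X - X.mean(axis=0), y - y.mean()
b_ols = np.linalg.solve(X_c.T @ X_c, X_c.T @ y_c)

# Matrix-weighted average of the within and between estimates.
b_mix = np.linalg.solve(S_w + S_b, S_w @ b_within + S_b @ b_between)
print(b_ols, b_mix)
```

The two printed vectors should agree, because the within and between components of X are orthogonal to each other and together add up to the centred X.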

Appendix 2: Showing that we can invert the errors variance-covariance matrix

panel_15 panel_16
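
A numerical check of the inversion for a single unit, assuming the usual random effects error structure in which the (T, T) covariance matrix for one unit is σ_ε² I + σ_η² ii', with i a vector of ones. The full matrix for the panel is then block diagonal with one such block per unit. The variance values are arbitrary.

```python
import numpy as np

T = 4
sigma_eps2, sigma_eta2 = 1.5, 2.0
i = np.ones((T, 1))

# Error covariance matrix for one unit.
Omega = sigma_eps2 * np.eye(T) + sigma_eta2 * (i @ i.T)

# Closed-form inverse.
Omega_inv = (np.eye(T)
             - sigma_eta2 / (sigma_eps2 + T * sigma_eta2) * (i @ i.T)) / sigma_eps2

# Closed-form inverse square root: I - lam * P, scaled by 1 / sigma_eps,
# where P = ii'/T averages over time within the unit.
P = (i @ i.T) / T
lam = 1.0 - np.sqrt(sigma_eps2 / (sigma_eps2 + T * sigma_eta2))
Omega_inv_half = (np.eye(T) - lam * P) / np.sqrt(sigma_eps2)

print(np.allclose(Omega @ Omega_inv, np.eye(T)))
print(np.allclose(Omega_inv_half @ Omega_inv_half, Omega_inv))
```

The second check is what justifies running OLS on data from which lam times the unit mean has been subtracted: multiplying by the inverse square root is, up to a scale factor, exactly that transformation.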