Factor Based Imputation of Missing Data

Awardees

Serena Ng
Edwin W. Rickert Professor of Economics

$246,111

Missing observations occur throughout physical and social science research. They may arise from people not responding to questions, changes in data definitions and in the technology of data collection, and natural disasters and wars, among other causes. Many methods have been proposed to impute the missing observations; these methods often impose restrictive assumptions about the nature of the missing values, with missing at random being the most common. Though they work well in practice, the theoretical properties of the imputed data are not well understood. Without a distribution theory, the sampling uncertainty around the imputed values cannot be measured. Furthermore, the assumption that observations are missing at random is often not appropriate for economic data. This project will develop procedures that use information in the time and cross-section dimensions to recover the missing values in a data set. The framework has applications in many areas, but the project uses three to illustrate the usefulness of these methods. These methods allow researchers to work with data that have missing observations, allowing for more efficient evaluation of programs and better advice to businesses. This will also help establish the US as the global leader in big data econometrics.

This project will develop a suite of factor-based imputation (FBI) procedures that use information in the time and cross-section dimensions to recover the missing values. The main insight is to organize the data into blocks and then exploit the overlapping information between blocks. In place of the missing-at-random assumption, the analysis assumes that the data admit a strong factor structure. In the simplest case, when the missing data appear in an organized manner, the solution is an algorithm that involves no more than two applications of principal components. When the missing data are disorganized, one application of principal components coupled with a series of projections will also yield consistent estimates. The project will establish the convergence rates of the imputed values, characterize their sampling error, and document their finite sample properties via simulations. This framework has broad applications, but the project focuses on three: (i) estimation of the individual treatment effect and the average treatment effect on the treated; (ii) covariance structure estimation (e.g., of earnings dynamics) when missing data affect the computation of second moments; and (iii) mixed frequency data and the use of imputed values in forecasting. A set of computer programs, written as an R package, will be made publicly available. This research will establish the US as the global leader in big data analysis.
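To make the block idea concrete, the following is a minimal sketch of one way the "organized" case could work. It is an illustration under stated assumptions, not the project's own procedure or software. It assumes the missing entries form a single rectangular block, so that some columns are observed in every period (a "tall" block) and some rows are observed for every series (a "wide" block). Principal components are applied once to each block, and the series common to both blocks are used to put the two sets of estimates on the same rotation. The function names pca_factors and impute_tall_wide, and the choice of Python rather than R, are assumptions made for illustration only.

```python
import numpy as np


def pca_factors(X, r):
    """Rank-r principal components of X via the SVD, so that X ~ F @ L.T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    F = U[:, :r] * s[:r]   # factor estimates, one row per row of X
    L = Vt[:r].T           # loading estimates, one row per column of X
    return F, L


def impute_tall_wide(X, r):
    """Fill NaNs in a T x N array X using two applications of principal components.

    Hypothetical helper: assumes the fully observed columns and the fully
    observed rows each carry enough information to identify r factors.
    """
    miss = np.isnan(X)
    full_cols = ~miss.any(axis=0)   # series observed in every period
    full_rows = ~miss.any(axis=1)   # periods in which every series is observed

    # First application: factors for all T periods from the tall block.
    F_tall, L_tall = pca_factors(X[:, full_cols], r)
    # Second application: loadings for all N series from the wide block.
    _, L_wide = pca_factors(X[full_rows, :], r)

    # The two blocks identify the factor space only up to a rotation.
    # Series in both blocks have two loading estimates; regressing one
    # on the other recovers the r x r matrix that aligns them.
    H, *_ = np.linalg.lstsq(L_wide[full_cols], L_tall, rcond=None)

    C_hat = F_tall @ (L_wide @ H).T   # estimated common component, T x N
    return np.where(miss, C_hat, X)   # observed entries are left untouched


# Small demonstration on simulated data with an exact two-factor structure.
rng = np.random.default_rng(0)
common = rng.normal(size=(50, 2)) @ rng.normal(size=(20, 2)).T
data = common.copy()
data[30:, 12:] = np.nan             # one rectangular block of missing values
filled = impute_tall_wide(data, r=2)
print("max abs imputation error:", np.abs(filled - common).max())
```

In this noise-free demonstration the imputed block should match the true values up to numerical precision. With idiosyncratic noise the imputed values would only approximate the common component, which is where the convergence rates and sampling error discussed above become relevant. The sketch also takes the number of factors r as known and does not demean or standardize the data.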