Improving Representativeness in Non-Probability Surveys and Causal Inference with Regularized Regression and Post-Stratification

Awardees

Andrew Gelman
Higgins Professor of Statistics and Professor of Political Science
Qixuan Chen
Associate Professor

The broad aim of the proposed project is to address the growing complexity of survey statistics as response rates decline. We focus specifically on non-probability samples (samples of convenience) because of their increasing popularity, but note that a non-probability sample is simply an extreme case of a probability-based survey with high non-response, so our methods could be expected to generalize. In the long term, we hope to find methods and techniques for safely adjusting non-probability samples to a wider population, whilst concurrently developing methods for critiquing the resulting estimates, to increase the confidence of researchers, policy makers and the public in them. Our specific aims focus on developing the tools and techniques to make this possible. We focus primarily on a regularized regression and poststratification methodology that has already shown some success with non-representative and even convenience samples. Using this methodology, we focus on adaptations that make the technique useful in public health settings. Specifically, we take a three-pronged approach.

First, we aim to adapt current state-of-the-art modelling techniques to better suit the unique challenges posed by public health datasets and questions. Our approach is to focus on partial pooling with more structured adjustment variables and, more broadly, to consider high-dimensional variables with continuous and non-continuous components. We also consider uncertainty in poststratification, namely when adjusting for variables whose population distribution is not known. In a complementary approach, we aim to assess coverage by combining raw survey data while assuming differences between the samples.

Second, we note that much of our central methodology could be extended to questions of a causal nature. This is particularly relevant to public health challenges because causal estimates are often desired.
Our approach is to extend the model-based approach to assume heterogeneity of effect within demographic subgroups. Using regularization, the effect within each subgroup is then estimated and poststratified to the population. Effects in groups with relatively few treated or untreated individuals would be estimated with greater uncertainty, which is an innovative approach to accounting for balance.

Third and finally, we note that the regularized regression and prediction technique is particularly reliant on model assumptions. Our final aim is to consider methods of testing and validating models with non-representative data in order to obtain better and more trustworthy population-based estimates.
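
As a rough illustration of the regularize-then-poststratify idea described above, the following minimal sketch shrinks each demographic cell's sample mean toward the overall mean and then weights the cell estimates by population counts. It is not the project's method: the pseudo-count shrinkage is a simple stand-in for a fitted regularized (multilevel) regression, and the function names, the `prior_strength` parameter, and the toy data are all assumptions made for this example.

```python
import numpy as np


def partial_pool_means(y, cell, n_cells, prior_strength=5.0):
    """Shrink each cell's sample mean toward the overall sample mean.

    prior_strength is a hypothetical pseudo-count: larger values pull
    sparsely observed cells harder toward the overall mean. This is a
    crude stand-in for a regularized regression, not a fitted model.
    """
    y = np.asarray(y, dtype=float)
    cell = np.asarray(cell)
    grand_mean = y.mean()
    counts = np.bincount(cell, minlength=n_cells)            # respondents per cell
    sums = np.bincount(cell, weights=y, minlength=n_cells)   # outcome total per cell
    return (sums + prior_strength * grand_mean) / (counts + prior_strength)


def poststratify(cell_estimates, pop_counts):
    """Population estimate: cell estimates weighted by population cell sizes."""
    pop_counts = np.asarray(pop_counts, dtype=float)
    return float(np.sum(np.asarray(cell_estimates) * pop_counts) / pop_counts.sum())


# Toy convenience sample (made-up data): cell 0 is heavily over-represented
# relative to its assumed share of the population.
y = [1, 1, 1, 1, 0, 0]
cell = [0, 0, 0, 0, 1, 2]
pop_counts = [100, 450, 450]  # assumed known population cell sizes

cell_estimates = partial_pool_means(y, cell, n_cells=3)
print("raw sample mean:", np.mean(y))
print("poststratified estimate:", poststratify(cell_estimates, pop_counts))
```

The same poststratification step could in principle be applied to per-cell treatment-effect estimates rather than per-cell means, which is the shape of the causal extension described above. Note that this sketch treats the population cell sizes as known and so ignores the uncertainty in poststratification that the project aims to address.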