Recent Award

Flexible Log-Likelihood Functions

Scientific models often predict poorly because they underfit or overfit the data. Just as the radio in an old car has only two adjustable dials, one for frequency and one for volume, the probability distributions used in scientific models, such as the bell-shaped normal distribution, typically have only one or two adjustable dials. Underfitting occurs when no combination of dial settings ever produces a good signal. Overfitting occurs when the dials produce an extremely good signal in particular circumstances that becomes poor when circumstances change slightly, such as when the car moves to a new location or a scientist collects a new group of observations. Scientists should be able to add as many adjustable dials to their models as they need to avoid underfitting, but they also need a way to avoid overfitting when adjusting more dials simultaneously. Having both will promote the progress of science by allowing scientists to make reasonable predictions that generalize readily when circumstances change moderately. Because scientists in all fields face these problems, providing software to help combat them will have a broad impact.

Those metaphorical dials are the parameters of the likelihood function that scientists use in maximum likelihood or Bayesian estimation. Almost all likelihood functions are taken from the exponential family of probability distributions and have only one or two unknown parameters. In the past few years, a new continuous probability distribution from outside the exponential family has been derived for which scientists can specify any fixed number of parameters. This metalog(istic) distribution is difficult to use as a likelihood function, however, because its density lacks an explicit expression. Moreover, although the set of parameters that produce a valid probability distribution is convex, some combinations of parameters fall outside it, and it is presumably impossible to characterize the admissible set explicitly. Nevertheless, it is quite possible to evaluate a metalog likelihood function numerically while imposing the required constraints on the unknown parameters, which would allow scientists to use it in many situations. The investigators will implement the metalog likelihood function in Stan, free and open-source software whose algorithms have become the workhorse for Bayesian analysis in many scientific fields. The Bayesian approach would also provide a principled way to choose a model with the appropriate number of likelihood parameters in order to avoid overfitting future data. The project will also advance the interdisciplinary professional development of the next generation of researchers in statistics, data science, and other STEM disciplines.
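To make the idea of numerical evaluation concrete, here is a minimal Python sketch. It is not the investigators' Stan implementation. It assumes the commonly stated four-term unbounded metalog quantile function, M(y) = a1 + a2 ln(y/(1-y)) + a3 (y - 0.5) ln(y/(1-y)) + a4 (y - 0.5), and it relies on the fact that the density at the outcome x = M(y) equals 1 / M'(y). All function names are hypothetical, the inversion uses plain bisection, and the validity check only tests M'(y) > 0 on a grid, which is a heuristic rather than a proof that the parameters are admissible.

```python
import math


def metalog_quantile(y, a):
    """Four-term unbounded metalog quantile function M(y) for 0 < y < 1."""
    a1, a2, a3, a4 = a
    logit = math.log(y / (1.0 - y))
    return a1 + a2 * logit + a3 * (y - 0.5) * logit + a4 * (y - 0.5)


def metalog_quantile_deriv(y, a):
    """Derivative M'(y); the density at x = M(y) is 1 / M'(y)."""
    _, a2, a3, a4 = a
    logit = math.log(y / (1.0 - y))
    inv_var = 1.0 / (y * (1.0 - y))
    return a2 * inv_var + a3 * ((y - 0.5) * inv_var + logit) + a4


def is_feasible(a, grid_size=999):
    """Heuristic check: M'(y) > 0 at every point of an interior grid."""
    return all(
        metalog_quantile_deriv(i / (grid_size + 1), a) > 0.0
        for i in range(1, grid_size + 1)
    )


def metalog_cdf(x, a, iterations=200):
    """Solve M(y) = x for y by bisection (assumes M is increasing)."""
    lo, hi = 1e-15, 1.0 - 1e-15
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if metalog_quantile(mid, a) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def metalog_log_likelihood(data, a):
    """Sum of log densities; -inf when the parameters fail the grid check."""
    if not is_feasible(a):
        return -math.inf
    total = 0.0
    for x in data:
        y = metalog_cdf(x, a)
        total -= math.log(metalog_quantile_deriv(y, a))
    return total
```

With a = (0, 1, 0, 0) the quantile function reduces to that of the standard logistic distribution, which gives a convenient check of the sketch against a density with a known closed form.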

Principal Investigator: 

Benjamin Goodrich

Associate Research Scholar; Lecturer in the Department of Political Science

Andrew Gelman

Higgins Professor of Statistics and Professor of Political Science

Date: 

Sunday, May 1, 2022 to Tuesday, April 30, 2024

Amount: 

$150,000
