Data Science Seminar
Hosted by the Department of Mathematics and Statistics

Abstract


In this talk, we show that the available information about unknown parameters in rare events data is tied only to the relatively small number of positive cases, which justifies the use of negative sampling. However, if the negative instances are subsampled to the same level as the positive cases, information is lost. We derive an optimal sampling probability for the inverse probability weighted (IPW) estimator that minimizes this information loss. We then propose a likelihood-based estimator that further improves estimation efficiency, and show that it has the smallest asymptotic variance among a large class of estimators. It is also more robust to pilot misspecification, and it generalizes to a class of models beyond binary response models. We validate our approach on simulated data, the MNIST data, and a real click-through rate dataset with more than 0.3 trillion instances.
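
As a rough illustration of the negative sampling and IPW idea mentioned in the abstract, the sketch below simulates rare events data, keeps every positive case, keeps each negative with a fixed probability, and fits a weighted logistic regression. It is only a minimal sketch under stated assumptions: the uniform sampling rate `rho`, the simulated model, and the helper `fit_weighted_logistic` are all hypothetical choices made for this example. It does not implement the optimal sampling probability or the likelihood-based estimator presented in the talk.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=25):
    """Weighted logistic regression fitted by Newton's method (illustrative helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))                      # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)

# Simulated rare events data (assumed model: intercept -5, slope 1).
n = 200_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-5.0, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))

# Negative sampling: keep all positives, keep each negative with probability rho.
# A uniform rho is an assumption here, not the optimal probability from the talk.
rho = 0.05
keep = (y == 1) | (rng.random(n) < rho)
X_sub, y_sub = X[keep], y[keep]

# IPW: weight each kept observation by the inverse of its inclusion probability.
w_sub = np.where(y_sub == 1, 1.0, 1.0 / rho)

beta_full = fit_weighted_logistic(X, y, np.ones(n))
beta_ipw = fit_weighted_logistic(X_sub, y_sub, w_sub)

print("positives:", int(y.sum()), "subsample size:", int(keep.sum()))
print("full-data estimate:", beta_full)
print("IPW subsample estimate:", beta_ipw)
```

In this toy setting the subsample is a small fraction of the full data, yet the IPW estimate stays close to the full-data estimate, which is the intuition behind subsampling negatives when positives are rare.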

Biography of the speaker: Dr. Wang is an Associate Professor in the Department of Statistics at the University of Connecticut. He obtained his Ph.D. from the Department of Statistics at the University of Missouri in 2013, and his M.S. from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences in 2006. His research interests include informative subdata selection for big data, model selection, model averaging, measurement error models, and semi-parametric regression.