On Saturday, April 20, the Department of Statistics hosted the “New Challenges and Journeys for Statistics in Science Discovery” workshop. Professor Annie Qu organized the third annual event, which featured prominent statistical researchers from across the country. A diverse set of topics was discussed, including the design of large studies, product sales forecasting, network analysis, personalized medicine, individualized variable selection, and methods for analyzing nonignorable missing data.
The workshop also addressed the new challenges facing the growing field of data science: the need to think critically about how scholarship is evaluated, how to recruit and retain faculty, and how to provide relevant education to students. Speakers emphasized how important it is to stay relevant in the new data science era, in both research and education.
Thanks to the many researchers who traveled to the University of Illinois to present cutting-edge research in data science. The talk titles and abstracts of the workshop’s speakers can be viewed below:
Workshop on “New Challenges and Journeys for Statistics in Science Discovery”
Abstracts
A Temporal Latent Factor Model for Product Sales Forecasting
Xuan Bi (xbi@umn.edu), U. of Minnesota
With the advent of the big-data era, personalization is becoming increasingly important for electronic commerce, as it assists users in finding customized content, products, and services among the ever-increasing set of available alternatives. One of the most effective predictive methodologies to achieve such personalized recommendation is the latent factor model, whose applications have focused largely on modeling individual, subjective, taste-driven consumer preferences in domains of “experience” goods (e.g., movies, music, or books). In this article, we apply the latent factor model to the domain of sales forecasting, which achieves individualized sales forecasting for brick-and-mortar stores and product distributors. The major contribution of our work is that we incorporate local sales competition into the model, which formulates local market demand for different product categories, in addition to improving prediction accuracy. Technically, the proposed method applies a tensor factorization approach for time-aware, simultaneous modeling of past and current sales across multiple products and multiple stores, utilizes an additional set of negatively correlated region- or product-specific latent factors to formulate sales competition, and uses a seasonal time-series model to extend prediction results to future time points. The advantages of the proposed method are demonstrated by comparing its performance to a number of techniques from prior literature on a large dataset of sales transactions from more than 2,000 grocery stores across 47 U.S. markets, in terms of both product sales forecasting accuracy and decision support (e.g., for store managers or product distributors) on new product introduction.
Individualized Treatment Recommendation (ITR) for Survival Outcomes
Haoda Fu, Eli Lilly
ITR is a method to recommend treatment based on individual patient characteristics to maximize clinical benefit. During the past few years, we have developed and published methods on this topic with various applications, including comprehensive search algorithms, tree methods, benefit-risk algorithms, and multiple treatment and multiple ordinal treatment algorithms. In this talk, we propose a new ITR method to handle survival outcomes for multiple treatments. This new model enjoys the following practical and theoretical features. Instead of fitting the data, our method directly searches for the optimal treatment policy, which improves efficiency. To adjust for censoring, we propose a doubly robust estimator. Our method requires only that either the censoring model or the survival model be correct, not both. When both are correct, our method enjoys better efficiency. Our method handles multiple treatments with intuitive geometric explanations. Our method is Fisher consistent even when either the censoring model or the survival model (but not both) is misspecified.
Growing Pains for Statistics Research and Education
Xuming He, U. of Michigan
We live in a golden age as statistics meets big data. Our graduates are in high demand, and statistics programs are expanding with no end in sight. However, statistics programs across the country must be aware that we need a new generation of statisticians to revitalize our field and to flourish in the long run. We need to think critically about how we evaluate scholarship in a growing field, how we recruit and retain faculty, and how we provide relevant education to our students. I will share some of my thoughts about what we need to do to stay ahead in the new data science era, both in research and education.
How to Design Big Comparative Studies?
Feifang Hu, George Washington University
Covariate balance is one of the most important concerns for successful comparative studies, such as causal inference, online A/B testing and clinical trials, because it reduces bias and improves the accuracy of inference. However, chance imbalance may still exist in traditional randomized experiments, and it increases substantially in big data settings. To address this issue, the proposed method allocates the units sequentially and adaptively, using information on the current level of imbalance and the incoming unit's covariates. With a large number of covariates or a large number of units, the proposed method shows substantial advantages over the traditional methods in terms of covariate balance and computational time, making it an ideal technique in the era of big data. Furthermore, the proposed method improves the accuracy of the estimated average treatment effect by achieving a minimum variance asymptotically. Numerical studies and real data analysis provide further evidence of the advantages of the proposed method.
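As a rough illustration of sequential, adaptive allocation, the Python sketch below assigns each incoming unit to the arm that would reduce the current covariate imbalance, with probability q. It is not the speaker's method: the imbalance measure (squared Euclidean norm of the treatment-minus-control covariate sums), the biased-coin probability q = 0.85, and the function names are all assumptions made for illustration.

```python
import numpy as np

def sequential_allocation(X, q=0.85, seed=0):
    """Assign units one at a time to treatment (+1) or control (-1).

    Hypothetical sketch: each unit goes to the arm that would lower
    the running imbalance with probability q (a biased coin).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    assign = np.zeros(n, dtype=int)
    diff = np.zeros(p)  # running covariate sum: treatment minus control
    for i in range(n):
        x = X[i]
        imb_if_treat = np.sum((diff + x) ** 2)
        imb_if_control = np.sum((diff - x) ** 2)
        if imb_if_treat < imb_if_control:
            prob_treat = q
        elif imb_if_treat > imb_if_control:
            prob_treat = 1.0 - q
        else:
            prob_treat = 0.5
        a = 1 if rng.random() < prob_treat else -1
        assign[i] = a
        diff += a * x
    return assign

def imbalance(X, assign):
    """Squared norm of the treatment-minus-control covariate sums."""
    return float(np.sum((assign @ X) ** 2))
```

Comparing imbalance(X, sequential_allocation(X)) with the same quantity under coin-flip assignment on simulated covariates gives a quick sense of how much chance imbalance the adaptive rule removes.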
Minimalist G-modeling: A comment on Efron
Roger Koenker, UIUC Emeritus
Abstract: Efron’s elegant approach to g-modeling for empirical Bayes problems is contrasted with an implementation of the Kiefer-Wolfowitz nonparametric maximum likelihood estimator for mixture models for several examples. The latter approach has the advantage that it is free of tuning parameters and consequently provides a relatively simple complementary method.
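For readers unfamiliar with the Kiefer-Wolfowitz estimator, the Python sketch below approximates the nonparametric maximum likelihood estimate of a mixing distribution for a Gaussian location mixture with unit variance. It uses a simple EM iteration on a fixed grid of support points, which is an assumption chosen for brevity and not necessarily the implementation discussed in the talk; the grid, the unit-variance Gaussian kernel, and the function name are likewise illustrative.

```python
import numpy as np

def npmle_em(y, grid, n_iter=500):
    """Grid-based EM for the mixing weights of a N(theta, 1) mixture.

    Returns the weights on `grid` and the log-likelihood evaluated
    at the start of each iteration.
    """
    # A[i, j] is the N(grid[j], 1) density evaluated at y[i].
    A = np.exp(-0.5 * (y[:, None] - grid[None, :]) ** 2) / np.sqrt(2 * np.pi)
    w = np.full(grid.size, 1.0 / grid.size)  # start from uniform weights
    loglik = []
    for _ in range(n_iter):
        mix = A @ w  # mixture density at each observation
        loglik.append(np.sum(np.log(mix)))
        # EM update: average posterior probability of each grid point.
        w = w * (A.T @ (1.0 / mix)) / y.size
    return w, np.array(loglik)
```

The only inputs beyond the data are the grid and the iteration count, which control the numerical approximation and do not act as smoothing parameters.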
Statistical inference on high-dimensional generalized linear models: a refined de-biased approach
Bin Nan, UC Irvine
Abstract: In the existing literature, "de-biasing" or "de-sparsifying" the L_1-norm penalized estimator represents a very important line of methods for drawing inference in high-dimensional linear models, and has been extended to generalized linear models (GLMs). However, we found that the de-sparsified approach in GLMs may not completely remove the bias or deliver reliable confidence intervals. In this work, we primarily consider the case of n > p with p diverging and provide an alternative modification to the original de-sparsified lasso, based on directly inverting the Hessian matrix, that further reduces bias and results in improved confidence interval coverage. Theoretical justification for drawing inference on linear combinations of the regression coefficients is provided. Extensive simulations are conducted to show the improvement. This is joint work with Lu Xia and Yi Li.
Individualized Multi-directional Variable Selection
Xiwei Tang, U. of Virginia
In this paper we propose a heterogeneous modeling framework which achieves individual-wise feature selection and covariate-wise subgrouping simultaneously. In contrast to conventional model selection approaches, the key component of the new approach is to construct a separation penalty with multi-directional shrinkages, which facilitates individualized modeling to distinguish strong signals from noisy ones and selects different relevant variables for different individuals. Meanwhile, the proposed model identifies subgroups among which individuals share similar covariate effects, and thus improves individualized estimation efficiency and feature selection accuracy. Moreover, the proposed model also incorporates within-individual correlation for longitudinal data. We provide a general theoretical foundation under a double-divergence modeling framework where the number of individuals and the number of individual-wise measurements can both diverge, which enables inference on both an individual level and a population level. In particular, we establish the population-wise oracle property for the individualized estimator to ensure its optimal large sample property under various conditions. Simulation studies and an application to HIV longitudinal data are presented to compare the new approach to existing variable selection methods.
Collaborative ranking for personalized prediction
Junhui Wang, City University of Hong Kong
Abstract: Personalized prediction arises as an important yet challenging task, which predicts user-specific preferences on a large number of items given limited information. It is often modeled with recommender systems that focus on ordinal or continuous ratings. In this talk, I will present a new collaborative ranking system to predict most-preferred items for each user given search queries. In particular, a psi-ranker is proposed based on ranking functions incorporating information on users, items, and search queries through latent factor models. Its probabilistic error bound is established, showing that its ranking error has a sharp rate of convergence in the general framework of bipartite ranking, even when the dimension of the model parameters diverges with the sample size. Consequently, this result also indicates that the psi-ranker outperforms two major approaches in bipartite ranking: pairwise ranking and scoring. Finally, the proposed psi-ranker is applied to analyze the data from the Mobike big data challenge, consisting of three million bicycle-sharing records.
Conditional inference of a large undirected graph
Peng Wang, U. of Cincinnati
Abstract: Inference of the structure of an undirected graph has an array of applications in network analysis, particularly when it links a network's structures to covariates of interest. For instance, in gene network analysis of a certain lung cancer, the network structures may vary over clinical attributes differentiating different subtypes of the cancer. In this article, we infer a network's structures, defined by an undirected graph. To increase the power of hypothesis testing, we de-correlate the structural equation models and develop a combined constrained likelihood ratio test, combining independent marginal likelihoods and leaving hypothesized parameters unregularized while regularizing nuisance parameters through $L_0$-constraints controlling the individual degree of sparseness. On this ground, we derive asymptotic distributions of the combined constrained likelihood ratio, which is chi-square or normal depending on whether the co-dimension of a test is finite or increases with the sample size. This leads to likelihood-based tests in a high-dimensional situation, permitting a network's size to increase with the sample size. Numerically, we demonstrate that the proposed method performs well in various situations. Finally, we apply the proposed method to infer a structural change of a gene network of a lung cancer with respect to four subtypes and other covariates of interest.
Random thoughts on statistics
Heping Zhang, Yale University
Abstract: Statistics originated from solving practical problems using mainly mathematical and probabilistic methods, and now makes more use of computing tools. Because it encompasses a broad spectrum of tasks from postulating hypotheses, to study design, data collection and quality control, data analysis and methodology development, and interpretation, the identity of statistics hasn’t been completely settled or understood within and outside the statistical community. We almost reached the point where people seem to see enough differences between mathematics and statistics, but we get into another identity crisis quicker than we got out of an old one. What was, is and will be statistics? Why do we face an identity crisis? I’ll try to offer my own humble perspective.
A Journey of Analyzing Nonignorable Missing Data with a Flexible and Robust Approach
Jiwei Zhao, State University of New York at Buffalo
Abstract: Nonignorable missing data exist in various biomedical studies and social sciences, e.g., aging research, metabolomics data analysis, electronic medical records, and health surveys. A major hurdle of rigorous nonignorable missing data analysis is how to model or estimate the missingness mechanism. Since this model depends on some unobserved data, its model fitting and model diagnostics are generally regarded as difficult, if not impossible. In this talk, I will briefly discuss some estimation procedures developed in recent years where the modeling of the nonignorable missing data mechanism can be completely avoided. These procedures are robust to misspecification of the mechanism model; hence, they can be widely applied to different problems under a broad spectrum of situations. Some potential future research topics will also be explored at the end of the talk.
Matrix Completion for Network Analysis
Ji Zhu, University of Michigan
Abstract: Matrix completion is an active area of research in itself, and a natural tool to apply to network data, since many real networks are observed incompletely and/or with noise. However, developing matrix completion algorithms for networks requires taking into account the network structure. This talk will discuss two examples of matrix completion used for network tasks. First, we discuss the use of matrix completion for cross-validation or non-parametric bootstrap on network data, a long-standing problem in network analysis. The second example focuses on reconstructing incompletely observed networks, with structured missingness resulting from the egocentric sampling mechanism, where a set of nodes is selected first and then their connections to the entire network are observed. We show that matrix completion can generally be very helpful in solving network problems, as long as the network structure is taken into account. This talk is based on joint work with Elizaveta Levina, Tianxi Li and Yun-Jhong Wu.
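To make the basic idea of matrix completion concrete, the Python sketch below fills in the unobserved entries of a matrix by alternating between a truncated SVD and restoring the observed entries. This is a generic low-rank completion scheme and does not account for network structure, which the talk stresses is essential; the function name, the fixed target rank, and the iteration count are assumptions made for illustration.

```python
import numpy as np

def hard_impute(M, observed, rank, n_iter=200):
    """Generic low-rank matrix completion by iterated truncated SVD.

    M        : matrix whose entries are trusted only where observed is True
    observed : boolean mask with the same shape as M
    rank     : assumed rank of the underlying matrix (hypothetical input)
    """
    filled = np.where(observed, M, 0.0)  # start with zeros in the gaps
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed entries fixed; update only the missing ones.
        filled = np.where(observed, M, low_rank)
    return low_rank
```

For an adjacency matrix, M would hold the observed edges and observed would mark which node pairs were actually sampled; symmetry and the sampling mechanism would then need to be handled explicitly.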