MU-RES Research Archives

Explore the innovative research led by our students over the past five years, showcasing the breadth and depth of inquiry across statistics, data science, and interdisciplinary applications. Projects range from Bayesian modeling of psychometric data, deep learning for stock price prediction, and environmental modeling of climate change impacts, to sports analytics, public health studies, and recommendation systems for student engagement.

Each project represents a collaboration between students and faculty mentors, blending statistical theory, computational methods, and real-world data to address complex problems across disciplines. Browse the archive below to discover the creativity, technical rigor, and curiosity that define our students’ research, and get inspired for your own future project.

Selections of past projects

Spring 2025 Projects

Zheer Wang, Mohit Singh, Idrees Kudaimi - Era-adjusted baseball statistics: website, software, tech report, and interesting findings 

Mentor: Daniel Eck 

Siddhant Gupta - Dimensionality Reduction in Neural Activity Simulations: A Computational Approach Using HNN-Core and Synaptic Weight Modulation 

Mentor: Matthew Singh 

Sanjana Addanki - How Course Modality and COVID-19 Impacted GPA Trends in LAS STEM and Humanities at UIUC 

Mentor: Christopher Kinson 

This study explores how class format (online versus in-person) correlates with grade outcomes in the College of Liberal Arts & Sciences at UIUC. Using over 50,000 course records from 2016 to 2024, we investigate GPA patterns across STEM and Humanities disciplines before and after the COVID-19 pandemic. By transforming letter-grade distributions into average GPA scores and conducting t-tests, we find statistically significant increases in GPAs post-COVID for both STEM and Humanities. Interestingly, online courses in STEM showed a notably higher average GPA than their in-person counterparts post-COVID. Box plots further highlight how certain departments experienced sharper GPA increases than others. In particular, we highlight these differences in disciplines such as Chemistry, Italian, and Spanish. These findings suggest that course modality and pandemic-era changes in instruction may have long-lasting effects on academic performance. This research contributes to understanding how instructional shifts can differentially affect academic outcomes across fields.
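As a rough illustration of the two steps named in the abstract (turning letter-grade counts into an average GPA, then comparing groups with a t-test), here is a minimal sketch. The 4.0-scale grade mapping, the use of Welch's unequal-variance statistic, and all course records are assumptions made for illustration; they are not taken from the study.

```python
from math import sqrt
from statistics import mean, variance

# Assumed 4.0-scale mapping; the study's exact mapping (e.g. plus/minus grades) is not stated.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def average_gpa(grade_counts):
    """Convert one course's letter-grade counts into an average GPA."""
    total_students = sum(grade_counts.values())
    total_points = sum(GRADE_POINTS[g] * n for g, n in grade_counts.items())
    return total_points / total_students


def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    se = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se


# Made-up course records, for illustration only.
pre_covid = [average_gpa(c) for c in (
    {"A": 10, "B": 12, "C": 6, "D": 1, "F": 1},
    {"A": 8, "B": 10, "C": 8, "D": 2, "F": 2},
    {"A": 12, "B": 10, "C": 5, "D": 2, "F": 1},
)]
post_covid = [average_gpa(c) for c in (
    {"A": 15, "B": 10, "C": 4, "D": 1, "F": 0},
    {"A": 14, "B": 11, "C": 4, "D": 0, "F": 1},
    {"A": 16, "B": 9, "C": 4, "D": 1, "F": 0},
)]

# A positive statistic means the post-COVID courses have the higher mean GPA.
t_stat = welch_t(post_covid, pre_covid)
print(f"Welch t statistic: {t_stat:.2f}")
```

A p-value would additionally require the Welch-Satterthwaite degrees of freedom and a t distribution, which a statistics library would normally supply.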

Rong Xie, Kejun Sun, Yutao Rao - Bitcoin price prediction using various learning methods 

Mentor: Hyoeun Lee 

This project evaluates the predictability of Bitcoin's returns using daily and weekly frequency observations through the application of traditional methods of forecasting and advanced machine learning techniques. Specifically, the traditional method utilizes the ARMA(2,1)-GARCH(1,1) model under the Student's t distribution to examine the dynamics of daily and weekly logarithmic returns through mean reversion and fat-tail volatility clustering. Although the accuracy of numerical predictions is limited, converting the one-step-ahead conditional mean into a directional signal may significantly improve predictive ability. In addition, more advanced machine learning methods such as random forest, XGBoost, and LSTM models also confirm that while precise numerical predictions achieve limited success, forecasting market direction significantly improves when using lower-frequency aggregation. These results suggest the existence of considerable noise in the returns of Bitcoin that makes precise numerical forecasts problematic, but highlight the utility of trend classification models, especially at a weekly frequency, for tactical purposes of trading and risk management.
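The idea of converting a one-step-ahead forecast of the return into a directional signal can be sketched as follows. This is only an illustration under stated assumptions: the forecasts are taken as given (no ARMA-GARCH model is fitted here), the function names are hypothetical, the numbers are made up, and a forecast of exactly zero is arbitrarily treated as "down".

```python
def directional_signal(returns):
    """Map each return to +1 (up) or -1 (down); zero is treated as down (an arbitrary choice)."""
    return [1 if r > 0 else -1 for r in returns]


def hit_rate(predicted_returns, realized_returns):
    """Share of periods in which the predicted direction matches the realized direction."""
    predicted = directional_signal(predicted_returns)
    realized = directional_signal(realized_returns)
    hits = sum(p == r for p, r in zip(predicted, realized))
    return hits / len(predicted)


# Made-up one-step-ahead forecasts and realized log returns.
predicted = [0.004, -0.002, 0.001, -0.006, 0.003]
realized = [0.020, -0.015, -0.008, -0.030, 0.011]

rate = hit_rate(predicted, realized)
print(f"Directional hit rate: {rate:.0%}")
```

In this toy series the forecast magnitudes are far from the realized returns, yet the sign agrees in four of five periods, which is the contrast the abstract draws between numerical and directional accuracy.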

Leah Decatus-Haddad - Police Budgets and Crime Rates in Urbana 

Mentor: Christopher Kinson   

Rachel Zhou, Arseniy Titov - Simulation Toolbox for Epidemic Control Models with Mean Field Games

Mentor: Gökçe Dayanıklı 

Jeffrey Huang, Yunxi Zeng - Matching Methods for Observational Factorial Studies 

 Mentor: Ruoqi Yu  

Carrie Song - Understanding the influence of geographical and environmental factors on respiratory disease infections

Mentor: Pamela Martinez  

Alyssa Anastasi - Analyzing health seeking behavior in response to respiratory diseases in Illinois 

Mentor: Pamela Martinez

Luke Thorell, Elen Huang - Measuring Student Comprehension of the Simulation Process and its Product: An Analysis Contrasting Ownership of Modeling Simulations Versus Using Pre-Built Applets 

Mentor: Kit Clement 

Many studies have advocated for using simulation-based inference (SBI) in introductory statistics courses due to its power in helping students understand statistical inference at a deeper conceptual level. However, little research has investigated the various implementations of SBI in terms of the curricular design and the simulation technology that students use. Applet-based simulations are primarily focused on presenting the product of simulation to students, but their “black box” nature may obscure the process of simulation, thereby limiting students’ understanding. This study aims to compare the effectiveness of two different curricula: one using pre-built applet simulations and the other engaging students in modeling simulations from the ground-up. Students from both curricula responded to an open-ended survey at the end of the semester as part of the Simulation Understanding in Statistical Inference and Estimation (SUSIE) instrument. Qualitative analysis was conducted on survey responses, which were double-coded according to this instrument. Results revealed similarities in the understanding of the products of the simulation; however, students who engaged in building simulations were more likely to have a stronger grasp of the simulation process. These results suggest that building and interacting with the simulation procedure may aid in developing students’ statistical reasoning. 

Madeline Hunt, Samin Hemani - Is Tolerance of Uncertainty Related to Students' Understanding of Statistics? 

Mentor: V.N. Vimal Rao 

Xiaoshan Huang - Bayesian Modeling of Rusty Blackbird Winter Counts in Arkansas 

 Mentor: Weijia Jia       

We analyzed long-term (1965–2020) Christmas Bird Count (CBC) data to assess Rusty Blackbird (Euphagus carolinus) population trends in Arkansas. To address variable survey effort, we applied a Bayesian hierarchical model with an effort-adjustment term, exp(B·(effort^p−1)/p), within a Markov Chain Monte Carlo (MCMC) framework. Our analysis employed a zero-inflated negative binomial regression to account for both excess zeros and overdispersion in count data. Spatial effects were modeled using latitude and longitude to assess geographic distribution shifts, while controlling for environmental covariates including temperature and forest cover. This approach provides robust trend estimates of bird count by accounting for major sources of uncertainty in wildlife count data.
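The effort-adjustment term quoted in the abstract can be evaluated directly. The sketch below assumes that effort is a positive, already scaled quantity (for example, effort divided by its mean) and that B and p are given numbers rather than sampled parameters; the function name is hypothetical. It also handles the limit p → 0, where the term reduces to effort^B.

```python
from math import exp, log


def effort_adjustment(effort, B, p):
    """Multiplier exp(B * (effort**p - 1) / p); as p -> 0 this tends to effort**B."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    if abs(p) < 1e-12:
        # Box-Cox limit: (effort**p - 1) / p -> log(effort) as p -> 0.
        return exp(B * log(effort))
    return exp(B * (effort ** p - 1.0) / p)


# Illustrative values only: counts expected at half, equal, and double the reference effort.
for scaled_effort in (0.5, 1.0, 2.0):
    print(scaled_effort, round(effort_adjustment(scaled_effort, B=0.7, p=0.5), 3))
```

At the reference effort of 1 the multiplier is exactly 1, so the term only rescales expected counts for surveys with more or less effort than the reference.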

William Wang, Zitao Zhang - Estimating Causal Effects of Cover Crop Implementation on Crop Yield Using an Instrumental Variable 

Mentor: Chan Park 

Many real-world causal inference questions are answered under the assumption of no unmeasured confounding, meaning that all common causes of both treatment and outcome are accounted for. For example, recent studies have examined the causal effect of cover crop implementation on crop yield in Midwest states using satellite data under this assumption. Their findings suggest that cover crop implementation leads to yield losses. However, because these approaches do not account for potential unmeasured confounders (e.g., farmers' skills or management practices), the conclusions may be biased or even indicate an effect opposite to the true causal effect. To better estimate the causal effect, we employ an instrumental variable (IV) approach, a widely used method in economics, epidemiology, and statistics. In our study, we hand-collected county-level cover crop incentives to serve as an IV, as we believe they reasonably meet the criteria for a valid IV. Using this IV, we reanalyzed the satellite data to estimate the average and heterogeneous treatment effects of cover crop implementation on crop yield while accounting for potential unmeasured confounding. Our analysis finds no statistical evidence that cover crop implementation causes yield losses, contrasting with previous studies that relied on the assumption of no unmeasured confounding. Moreover, we find that this effect is strongly related to temperature and solar radiation, indicating substantial regional variation in the effectiveness of cover cropping practices.
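To show why an unmeasured confounder can distort a naive estimate while an instrument recovers the effect, here is a small simulation sketch. Everything in it is an assumption for illustration: the data are synthetic, the treatment is continuous rather than a cover crop indicator, the true effect is set to 0.5, and the estimator is the simple single-instrument ratio cov(Z, Y) / cov(Z, X), not the method used in the project.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def ols_slope(x, y):
    """Naive regression slope of y on x; biased when a confounder drives both."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """Single-instrument IV (Wald ratio) estimate: cov(z, y) / cov(z, x)."""
    return covariance(z, y) / covariance(z, x)


rng = random.Random(0)
TRUE_EFFECT = 0.5  # assumed effect of x on y in this synthetic example
z, x, y = [], [], []
for _ in range(20000):
    u = rng.gauss(0, 1)             # unmeasured confounder (affects both x and y)
    zi = rng.gauss(0, 1)            # instrument: affects x, unrelated to u
    xi = zi + u + rng.gauss(0, 1)   # treatment
    yi = TRUE_EFFECT * xi + 2.0 * u + rng.gauss(0, 1)  # outcome
    z.append(zi)
    x.append(xi)
    y.append(yi)

ols_estimate = ols_slope(x, y)
iv_estimate = iv_slope(z, x, y)
print(f"true effect {TRUE_EFFECT}, OLS {ols_estimate:.2f}, IV {iv_estimate:.2f}")
```

In this setup the naive slope is pulled well above the true effect because the confounder raises both treatment and outcome, while the IV ratio stays close to 0.5 since the instrument is independent of the confounder by construction.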

Tyler Hetch, Baoyuan Zhou - Extreme Heat Events impacts on Health: An analysis of Emergency Department Visits across the US during years 2018-2025

Mentor: Lelys Bravo de Guenni 

Extreme heat and cold events are among the most important causes of climate-related deaths worldwide. According to Chen et al. (2024), five million deaths were attributed to extreme heat and cold globally between 2000 and 2019. Future climate projections indicate that heat-related deaths will increase, and cold-related deaths will decrease, under warmer climates. Understanding the impacts of extreme heat on health and on health services demand would provide a better estimate of the future climate-related illness burden. In this project we use data from the Heat and Health Tracker on the Centers for Disease Control and Prevention (CDC) website and other related data sources to investigate the impact of heat waves and extreme heat events on Emergency Department Visits (EDVs) associated with heat-related illnesses after standardizing by population. Data were available on a daily basis at a regional level and were spatially aggregated to the 10 Health Department Regions (HDRs) across the US using an area-weighted average. Maximum daily temperature and daily heat index, calculated using the National Weather Service approach, were analyzed to understand their seasonal cycle and variability across the HDRs. The potential association between extreme heat events, as determined by the peak times in maximum temperature and heat index, and the EDV time series was investigated and used in the analysis. Log-linear mixed effect models were fitted to the EDV time series, accounting for seasonal effects, the impact of the climate variables, and vulnerability factors related to socio-economic conditions. Variability across regions in the dependence between the response and the predictors was accounted for through random effects in the proposed models. Prediction errors and goodness of fit were also assessed to evaluate model performance.

Jaehoon Jung - Assessing the Impact of Weather Constraints on Drone Flyability Using Geostatistical Methods

Mentor: Lelys Bravo de Guenni 

Aerial drone operations have become increasingly crucial across industries such as logistics, agriculture, and the military, yet no established system exists for verifying that weather conditions are appropriate for those operations. In this study, we mainly refer to Weather Constraints on Global Drone Flyability (Mozhou Gao, 2021) and apply its findings to a specific local region, North Korea. We passed daily meteorological data from the Korea Meteorological Administration (2015-2024) to a decision tree model that classifies a day as flyable when temperature, wind speed, and precipitation fall within drone operating conditions. We then interpolate the values between 27 observation posts across the region. During the spatial interpolation, or kriging, process, we set the mean trend as a linear combination of longitude, latitude, and altitude and identify the variance of the differences in the values. The results of this study show how to transform meteorological data such as temperature and wind speed into useful information on drone flyability, even in areas that do not have nearby weather stations. We expect this research to provide a practical index for decision-making when planning drone operations in previously unexplored areas.

Mingqian Wang - Bridging Deep Learning and Symbolic Regression: A Hybrid Approach for Interpretability and Expressiveness 

Mentor: Matthew Singh 

Shreyas Talluri - The creation of individualized brain models to control brain dynamics in response to TMS 

Mentor: Matthew Singh

Jerry Liang - Queue-Based Load Modeling and Detection of DoS Attacks 

Mentor: Georgios Fellouris 

Karena Liang - Time Series Modeling of Malaria Cases: Integrating climate and mosquito data through machine learning 

Mentor: Lelys Bravo de Guenni 

Entomological surveillance is very important in remote tropical areas where malaria is endemic. However, data collection on mosquito abundance is costly and demanding, especially in remote regions, due to logistical considerations and a lack of consistency in collection methods. The availability of long time series on the number of different mosquito species present in a particular location would improve understanding of the seasonal fluctuations and biting behavior of the different mosquito vector species transmitting the malaria parasites. In this work, we use machine learning approaches for missing data imputation in the estimated mean number of mosquitoes over a data collection period from 2010 to 2016. We compare the different imputation methods using a leave-one-out cross-validation approach. A correction method based on the significance of seasonal effects on mosquito populations was implemented to maintain balanced data in mosquito population counts between trap and human-caught methods. We propose a generalized time series model for predicting the number of malaria cases as a function of climate drivers and mosquito abundance after data imputation. Model predictions aim to provide an analytic tool to estimate the incidence of the disease conditioned on several environmental factors.

Vinayak Bagdi - Assessing trends in heat waves intensity, frequency, duration, and season length for present and future climate in two locations: Boston metropolitan area (BMA) and Chicago metropolitan area (CMA)  

Mentor: Lelys Bravo de Guenni 

Fall 2024 Projects

Qianhua Zhou, Tailei Liu - A Computationally Efficient Matching Method via Sparse Mixed Integer Programming

Mentor: Ruoqi Yu

Kehuan Wang - Probabilistic Assessment of Extreme Rainfall Projections for Champaign (Illinois) under Climate Change using Daily Climate Simulations

Mentor: Lelys Bravo

Chengyun Jiang, Tiancheng Guo - Predicting Drought Conditions in the Champaign area using Machine Learning

Mentor: Lelys Bravo

Ricky Gong, Mengxuan Wei - Forecasting Housing Rental Prices in Champaign

Mentor: Hyoeun Lee

Jerry Liang - Detection of a Fast Decaying Gaussian Signal

Mentor: Georgios Fellouris

Senuvi Jayasinghe, Jessica Gong - DINA Model on Rational Number Knowledge

Mentor: Steven Culpepper

Ziyi Gao - Effects of Policies on Large Societies: Application to Epidemic Control

Mentor: Gökçe Dayanıklı

Spring 2024 Projects

Jaewon Kim, Ziyi Gao - Decoding Global Financial Crisis: A Statistical Analysis of Market Crash Precursors and Dynamics 

Mentor: Hyoeun Lee 

Huizhu Jia - Repeatability and Reproducibility of Transvaginal Quantitative Ultrasound Measurements of the Cervix for Women at Risk of Preterm Birth 

Mentor: Doug Simpson 

Jiepeng Yuan - The Influence of Missing Data Imputing Procedures on Evaluating Diagnostic Assessments 

Mentor: Susu Zhang 

Bhuvan Kala, Julianna Drew, Sanjana Gongati - Computational Thinking with Data 

Mentor: Vimal Rao 

Rachel Selvaraj - Reconsidering the Role of the Basic Reproductive Number when Evaluating Competition among Different Pathogen Strains  

Mentor: Pamela Martinez  

Chloe Yang - Disentangling the Seroconversion and Seroreversion Rates of Seasonal Coronaviruses using Age-Stratified Seroprevalence Data 

Mentor: Pamela Martinez 

Dhruv Borda - Visualizing Change: Descriptive Analytics of Illinois Climate Variability 

Mentor: Lelys Bravo De Guenni 

Yiqian Zhang, Aoyang Li - Approximate Bayesian Computation for Fitting Preferential Attachment Models 

Mentors: Yuexi Wang, Yuguo Chen 

Yi Yang, Yayan Jiang - Detecting Temperature and Precipitation Trends in the Illinois Climate Network Station 

Mentor: Lelys Bravo De Guenni 

Daniel Hogan - Mind Wandering when Learning Statistics 

Mentor: Vimal Rao 

Barret Li - Factors Contributing to the Risk of Nurse Burnout 

Mentor: Alexandra Chronopoulou 

Linzhe Teng, Shuzhen Zhang - Geomagnetic Storm Real-time Detection of Solar Wind Data via Neural Network CuSum 

Mentor: Georgios Fellouris 

Fall 2023 Projects

Ishaan Bandari, Kehuan Wang, April Wu, Haiyue Zhang, Wendy Zheng - Predicting cardiovascular disease: A comparison of Bayesian and frequentist methods 

Mentor: Steve Culpepper 

Alyssa Anastasi, Sandra Garcia Lopez - Disentangling the impact of environmental drivers on rotavirus transmission in Bangladesh 

Mentor: Pamela Martinez 

Shubh Goyal - Individual differences in human clustering 

Mentor: Vimal Rao 

Jiepeng Yuan - Assessment of multiple imputation methods in latent variable modeling 

Mentor: Susu Zhang 

Huizhu Jia, Ziyi Gao - Repeatability and reproducibility of prenatal quantitative ultrasound measurements used to assess pre-term birth risk 

Mentor: Doug Simpson  

Vishnu Sadhu - Determinants of life insurance purchase 

Mentor: Vimal Rao  

Chloe Yang - Why come to class? A qualitative look at students' comparison of video vs. in-person class experiences 

Mentor: Kelly Findley 

Jiyang Xu - Penalized conditional risk regression for ordinal and multinomial response data 

Mentor: Doug Simpson 

Baoyi Chen, Huizhu Jia - Probabilistic assessment of extreme rainfall and air temperature projections under climate change 

Mentor: Lelys Bravo De Guenni 

Zean Li, Katherine Zeng - The illusion of randomness: Evaluating student performance on the disk task 

Mentors: Kelly Findley, Stephen Portnoy 

Haotian Ju, Xiangxuan Yu - A review and strategies of different models on the forecast of daily stock price 

 Mentor: Hyoeun Lee 

Linzhe Teng - Earth geomagnetic storm real-time detection by astronomical solar wind data 

Mentor: Georgios Fellouris 

Ashrith Anumala - Forecasting lightning with a log Gaussian Cox Process 

 Mentor: Daniel Ries 

Rohan Gumaste - Comparing covariance structures of log Gaussian Cox Process applied to lightning strike data 

Mentor: Daniel Ries 

Khatija Syeda - Measuring seasonal spatial dependencies of lightning strikes in upper Midwest 

Mentor: Daniel Ries 

Spring 2023 Projects

Junan Jiang, Alice Cao - Probabilistic changes of extreme rainfall projections under different Climate Change scenarios 

Mentor: Professor Lelys Bravo De Guenni 

Given the urgency of the effects of climate change, the goal of our project is to analyze the possible impacts of greenhouse gases on extreme rainfall in Urbana-Champaign, Illinois under two different greenhouse gas scenarios as well as historical rainfall conditions for comparison. To assess future precipitation projections, we used data from 33 Global Climate Models (GCMs) for the years 2070 to 2099 under two greenhouse gas scenarios (Representative Concentration Pathways), RCP4.5 and RCP8.5, as well as model data from a historical reference period from 1950 to 2005. The data were analyzed by fitting a Generalized Extreme Value (GEV) distribution to annual monthly maxima, and model parameters were estimated by Maximum Likelihood (ML). The collected estimates were used to assess differences in the probability of extreme rainfall in Urbana-Champaign under potential climate change conditions.
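Once GEV parameters have been estimated, exceedance probabilities and return levels follow from the distribution's closed-form CDF and quantile function. The sketch below assumes the standard parameterization with location mu, scale sigma, and shape xi; the parameter values are invented for illustration and are not estimates from the project, and no fitting is performed.

```python
from math import exp, log


def gev_cdf(x, mu, sigma, xi):
    """GEV distribution function with location mu, scale sigma > 0, and shape xi."""
    if abs(xi) < 1e-12:
        # Gumbel limit as xi -> 0.
        return exp(-exp(-(x - mu) / sigma))
    t = 1.0 + xi * (x - mu) / sigma
    if t <= 0:
        # Outside the support: below the lower bound (xi > 0) or above the upper bound (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return exp(-t ** (-1.0 / xi))


def gev_return_level(T, mu, sigma, xi):
    """Level exceeded with probability 1/T in any one block (the T-block return level)."""
    y = -log(1.0 - 1.0 / T)
    if abs(xi) < 1e-12:
        return mu - sigma * log(y)
    return mu + sigma / xi * (y ** (-xi) - 1.0)


# Hypothetical parameters (e.g. rainfall maxima in mm), chosen only for illustration.
mu, sigma, xi = 50.0, 10.0, 0.1
level_100 = gev_return_level(100, mu, sigma, xi)
print(f"100-block return level: {level_100:.1f}")
print(f"P(maximum > 80): {1.0 - gev_cdf(80.0, mu, sigma, xi):.3f}")
```

Comparing such exceedance probabilities between a historical fit and a scenario fit is one simple way to express a change in the probability of extreme rainfall.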

Hye Rim Ahn - Quantifying sources of uncertainty in future precipitation projections under climate change for Champaign County 

Mentor: Professor Lelys Bravo De Guenni 

The NASA Earth Exchange Downscaled Climate Projections (NEX-DCP30) data are used to evaluate future projections of precipitation for the period 2069-2099 from 31 different General Circulation Models (GCMs) under four different Representative Concentration Pathways (RCPs) depicting seasonal future greenhouse gas concentration trajectories. The objective of the analysis is to quantify how uncertainty in the GCMs and RCPs contributes to the estimated change in future precipitation relative to a historical reference period (1950-2005), using linear mixed effect models. Different climate projections produced by the 31 GCMs account for the internal climate variability under the four RCP scenarios. Further analysis of the variance contributions of the random and fixed effects in the linear mixed effect models was carried out and demonstrated through calculation.

Mingrui Xu - Assessing the impact of environmental factors on malaria infections in a Brazilian location 

Mentor: Professor Lelys Bravo De Guenni 

The goal of this project is to investigate the relationship between environmental factors and malaria infections, as well as several species of malaria-carrying mosquitoes, in a specific location in Brazil. To achieve this, we calculated cross-correlations between environmental and mosquito variables to measure their association at different time lags. We used generalized linear models (GLMs) to predict the number of mosquitoes based on the lagged versions of the environmental variables. We also considered mosquito abundance for different species as a predictor variable in a GLM to predict the number of malaria cases. Quasi-Poisson and negative binomial families were used to account for overdispersion, and we analyzed residuals and goodness of fit to evaluate model performance. Our findings suggest that environmental variables (rainfall, river level, and relative humidity) significantly impact the abundance of all mosquito types, but these factors have a weaker association with the number of malaria cases.

Zean Li, Wenqi Zeng - Exploring college students' understanding of randomness 

Mentors: Professor Kelly Findley, Professor Stephen Portnoy

The concept of randomness is ubiquitous in the field of science, especially in statistics. By understanding randomness and its characteristics, students can develop a sense of the uncertainty and unpredictability of a specific event. However, misconceptions about randomness are pervasive among students, and few experiments investigating students’ understanding of randomness have been conducted with college students. Therefore, we want to design an experiment to see how college students across different majors and school years define and understand randomness. We specifically wish to test their understanding in several areas: 1) their ability to distinguish random events, 2) their ability to distinguish between a random sequence and a non-random sequence, 3) their understanding of the variation of random samples, 4) their understanding of randomness in uniform and non-uniform distributions. We designed a questionnaire for students to complete, which includes both content questions about randomness and tasks where students are asked to choose points on a line or on a shape as randomly as they can. In our presentation, we will present our findings about how the students performed and whether we found associations between their content understanding and their performance on the clicking tasks.

Heqi Yin - A comparative study of differential abundance tests in microbiome compositional data analysis 

Mentor: Professor Shulei Wang 

Differential abundance analysis in compositional data is one of the most important tools in microbiome data analysis. However, the presence of compositional constraints and zero counts poses significant challenges to existing methods. We design simulation experiments to compare several popular methods, including ZicoSeq and ANCOM-BC.

Austin Shwatal - Nowcasting lightning strikes through convolutional neural networks 

Mentor: Dr. Daniel Ries, Sandia National Laboratories 

Lightning is one of the most common weather events in the United States, causing significant damage to life and property every year. The National Oceanic and Atmospheric Administration (NOAA) has tracked the locations and intensity of lightning, as well as monitored general atmospheric conditions across the United States. Recent work, including by Cintineo et al. (2022), has been devoted to the goal of immediate weather prediction using this real-time data, often referred to as “nowcasting”. The data used for this project includes multi-spectral bands from the GOES-16 satellite’s Advanced Baseline Imager (ABI), and lightning strike information from the National Lightning Detection Network. We focus our analysis on the upper Midwest. We apply U-nets, originally developed for biomedical image segmentation, to predict lightning from radiance data. The U-net is designed to rapidly deconstruct a complex image into identifiable features without the loss of image resolution, a key feature that allows precise geographic prediction accuracy. The U-net takes multispectral images as inputs and outputs images quantifying the probability of lightning occurring in a particular region. The trained model produces predicted probabilities of lightning strikes, allowing preparation for severe weather events.

Jack Banks, Michael Escobedo - Simulating the 2023 MLB season 

Mentors: Professor Daniel Eck, Professor David Dalpiaz

Full season simulators play a crucial role in assisting the day-to-day operations of a Major League Baseball organization. In this project, we used data from the 2015-2022 seasons to construct a simulator for the Chicago Cubs. Individual player talents are derived from a justified regression model to predict weighted on-base average (wOBA). The baseball statistic wOBA is one of the most accurate measurements of a player’s true value in runs, combining a player’s outcomes with their mean expected run values. These individual player talents are then aggregated based on lineups for a specific matchup, and an Elo system is run to simulate the outcome and changes in team performance. An Elo system allows us to measure the relative skill of teams over time. As a result, our expected outcome of the 2023 MLB season is based upon a simulation system, dependent upon wOBA, that adjusts the talent of each team over time. This project is formatted into an R package, allowing for a smooth transition from our computers to the front office of the Chicago Cubs.
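For readers unfamiliar with Elo ratings, the standard update can be sketched in a few lines. This is the generic Elo formula, not the project's R package: the 400-point scale and the K-factor of 20 are conventional defaults assumed here, and the wOBA-based talent aggregation described in the abstract is not modeled.

```python
def expected_score(rating_a, rating_b):
    """Probability that team A beats team B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, a_won, k=20.0):
    """Return both teams' ratings after one game; k (assumed 20) sets the step size."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


# Two evenly matched teams: the winner gains k / 2 points.
print(update_elo(1500.0, 1500.0, a_won=True))

# An upset moves ratings more than an expected result does.
print(update_elo(1400.0, 1600.0, a_won=True))
```

In a season simulation, each game's outcome could be drawn at random with probability expected_score and the ratings updated afterwards, which is how relative team strength drifts over time.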

Mingli Xu, Harsh Patel - Sector-specific stock price forecasting: A comparative analysis of time series models for S&P 500 industries       

Mentor: Professor Hyoeun Lee 

The stock market is a complex system that is constantly changing, and accurately predicting stock prices is a difficult task. In this study, we aim to identify the best time series model for predicting stock prices of different sectors in the S&P 500. We compare traditional models, including ARIMA, ARCH, and GARCH models, to the more recent learning methods such as LSTM, and determine which model performs the best for each sector. Our findings provide insights into effective modeling techniques for predicting stock prices of different business sectors, which can be useful for investors and analysts in making informed investment decisions. 

Spencer Bauer - Nowcasting lightning strikes through generalized linear mixed model and binary hurdle model

Mentor: Dr. Daniel Ries, Sandia National Laboratories 

We are using a hurdle model to account for the large number of zeros and the overdispersion in lightning counts, since zeros make up 94.4% of the data. The hurdle model consists of two parts: a binary hurdle part that models the probability of lightning events and a truncated count part that models the number of flashes given that a lightning event occurs. The binary hurdle part uses a Bernoulli distribution for the probability of non-zero lightning event occurrence. The truncated count part uses a zero-truncated negative binomial distribution with additive predictors for location (μ>0) and dispersion (θ>0) to model the number of flashes. The modeling and statistical analysis will be done in RStudio (2021 version). The main packages we will use are sp and countreg. We will use the sp package for creating spatial objects, or polygons to be specific, and we will use the countreg package for the zero-truncated negative binomial distribution and plotting.
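The two-part structure of a hurdle model can be written out as a probability mass function. The sketch below is a plain Python illustration rather than the countreg implementation; it assumes the common negative binomial parameterization by mean `mu` and dispersion `theta` (names chosen here), uses fixed parameter values instead of regression predictors, and takes the zero share of 94.4% from the abstract.

```python
from math import exp, lgamma, log


def nb_log_pmf(k, mu, theta):
    """Log pmf of a negative binomial with mean mu and dispersion theta (variance mu + mu**2 / theta)."""
    return (lgamma(k + theta) - lgamma(theta) - lgamma(k + 1)
            + theta * log(theta / (theta + mu))
            + k * log(mu / (theta + mu)))


def hurdle_pmf(k, p_nonzero, mu, theta):
    """Hurdle pmf: a Bernoulli part decides zero vs. non-zero, then a zero-truncated NB gives the count."""
    if k == 0:
        return 1.0 - p_nonzero
    p_zero_nb = exp(nb_log_pmf(0, mu, theta))
    # Renormalize the NB over k >= 1 so the truncated part sums to one.
    return p_nonzero * exp(nb_log_pmf(k, mu, theta)) / (1.0 - p_zero_nb)


# 94.4% zeros as reported in the abstract; mu and theta are made-up values.
p_nonzero, mu, theta = 0.056, 3.0, 0.8
total = sum(hurdle_pmf(k, p_nonzero, mu, theta) for k in range(0, 2000))
print(f"P(0 flashes) = {hurdle_pmf(0, p_nonzero, mu, theta):.3f}, total mass = {total:.6f}")
```

Because the zero probability is modeled separately, the share of zeros is controlled entirely by the Bernoulli part, whatever the count distribution looks like.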

Fall 2022 Projects

Katherine Ann Christensen and Emily Page Hasson – Using behavior to predict latent personality states with a Hidden Markov Model

Mentor: Steven Culpepper

The premise of this project is to model human personality using behavioral data. We introduced a hidden Markov model to predict transitions between different latent personality states based on the Objective Personality System framework. We used simulations to illustrate how the model could be applied to real sampled data.

Eliana Chandra and Tom Shin – Host Heterogeneity and SARS‐CoV‐2: From Vaccine Inequity to Variants of Concern

Mentor: Pamela Martinez

Using publicly available data from dozens of countries with thousands of sub‐national locations, we examined temporal trends of vaccination rates across high and low socioeconomic groups at the sub‐national level. Our analyses show two distinct vaccination strategies: one characterized by a rapid initial roll‐out, quickly reaching half of the vaccination potential, then slowing, and a second strategy that is slow to begin but reaches a steady state more rapidly. Informed by these observed patterns, we implemented a model incorporating socioeconomic groups, with heterogeneity in the force of infection and vaccination rates, to track the immune histories of individuals exposed to different variants of concern. This second part of the project aimed to characterize the impact of bivalent boosters in low‐ and middle‐income countries with limited access to the SARS‐CoV‐2 vaccine.

Shangyun Zhangliang – A comparison of statistical procedures for post‐marketing safety surveillance

Mentor: Georgios Fellouris

We consider various statistical signal detection methods for the monitoring of spontaneously reported adverse events of medical products. We generate synthetic data in order to design and compare these methods, and we apply them to data from the FDA Adverse Event Reporting System (FAERS) database.

Jonathan Kang and Zijun Yu – Analysis of Heat Wave Trends in Illinois

Mentor: Lelys Bravo De Guenni

With global warming becoming a crucial issue worldwide, the goal of our project is to analyze the nature and current trends of heat waves in Illinois over a historical period (1989‐2021). To determine the occurrence of a heat wave, we use air temperature and relative humidity data from the Illinois Water and Atmospheric Resources Monitoring (WARM) network to calculate a heat wave index and its excess values over a local threshold for two locations: Champaign and the Chicago area. Next, we describe the frequency, duration, and intensity of heat waves using a continuous Rectangular Pulses Poisson Process model (RPPM). The RPPM is fitted to the historical data by preserving the first and second order moments of the observed daily heat wave index during the summer periods. A Moving Block Bootstrap was implemented to investigate the variability of the RPP process parameters over several decades during the study period.

Yixin Zhou and Elina Mehra – Stock price prediction using various time series models

Mentor: Hyoeun Lee

In this study, we focus on the comparison of time series models for Google stock price prediction. We use various models including linear regression, ARIMA, ARCH, dynamic regression, and Exponential Smoothing (Holt‐Winters). We use a train‐test split to evaluate forecast accuracy.

Heqi Yin and Ziyang Zheng – Differential Abundance Analysis of Microbial Compositional Data in Wine Grapes over Different Regions and Cultivars

Mentor: Shulei Wang

The microbes harbored on wine grape surfaces play an essential role in grape growth and wine quality. However, there remains limited understanding of how grape‐surface microbiota interact with geographical patterns and environmental factors. In this project, we apply several differential abundance analysis methods, such as ZicoSeq and ANCOM-BC, to identify microbial species associated with geographical regions and cultivars. These findings help deepen understanding of the grape‐surface microbiome community and may lead to new strategies for improving wine quality.

Joey Shallat – Establishing a Design of Experiments (DoE) Workflow

Mentor: James Balamuta

When Rokwire began in 2019, the set of features available for interaction within the Rokwire framework was limited. After further development, new features were added that augment existing interactions or introduce new ones entirely. Under this project, we sought to establish a way to understand, at a fundamental level, how effective new features are based on engagement. We developed a framework following design of experiments (DoE) principles to evaluate how features are received upon introduction and what ripple effects occur with existing features, allowing for better allocation of development time and resources.

Spring 2022 Projects

Xiaowei Wu and Alex Murvine - Optimal selection of geomorphometric parameters to estimate flash flood peak flows in coastal mountainous watersheds. 
Zewei Deng - Areal to Point Rainfall estimation in Tropical regions 
Molly Hu and Ajay Raygaga - Understanding Global Oceans Health through the Global Health Index 
Greg Gu and Tyler Barton - Recommendation System for Student Events 
Tyler Barton - Organizing your Organization in the Cloud 

Fall 2021 Projects

Daniel Huang, Adrian Pizano, Scott Turro, Hanling Zhang - Analysis of psychometric data using Bayesian inference of latent classes
Mentor: Steve Culpepper

Abstract: We consider latent class models (LCMs) for multivariate binary response data, which are common in psychometric research. LCMs are useful for identifying groups of individuals who respond in similar ways. We employ Gibbs sampling, a typical technique for Bayesian algorithms, to estimate the posterior distribution of the parameters. We provide a Python package and R functions that implement our algorithm, and we demonstrate its accuracy with a Monte Carlo simulation on test data. We apply our technique to various psychometric datasets and analyze the results.
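
To make the idea concrete, here is a minimal sketch of a Gibbs sampler for a latent class model with binary items. It is not the authors' package: the function name, the flat Dirichlet(1, ..., 1) and Beta(1, 1) priors, and the simulated data are all assumptions chosen for illustration, and label switching is ignored.

```python
import numpy as np


def gibbs_latent_class(Y, n_classes, n_iter=400, burn_in=200, seed=0):
    """Gibbs sampler for a latent class model with binary responses.

    Assumed priors (illustrative): class weights pi ~ Dirichlet(1, ..., 1)
    and item probabilities theta[k, j] ~ Beta(1, 1).
    Returns posterior means of pi and theta, plus the last membership draw.
    """
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=int)
    n, n_items = Y.shape
    z = rng.integers(0, n_classes, size=n)          # random initial memberships
    pi_sum = np.zeros(n_classes)
    theta_sum = np.zeros((n_classes, n_items))

    for it in range(n_iter):
        # Sufficient statistics given the current memberships.
        onehot = np.eye(n_classes, dtype=int)[z]    # shape (n, K)
        counts = onehot.sum(axis=0)                 # class sizes
        successes = onehot.T @ Y                    # positive responses per class and item

        # Conjugate draws for class weights and item probabilities.
        pi = rng.dirichlet(1.0 + counts)
        theta = rng.beta(1.0 + successes, 1.0 + counts[:, None] - successes)
        theta = np.clip(theta, 1e-10, 1 - 1e-10)    # guard against log(0)

        # Membership probabilities for each respondent, computed on the log scale.
        log_post = np.log(pi) + Y @ np.log(theta).T + (1 - Y) @ np.log1p(-theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        prob = np.exp(log_post)
        prob /= prob.sum(axis=1, keepdims=True)

        # Inverse-CDF draw of one class per respondent.
        u = rng.random(n)
        z = (prob.cumsum(axis=1) < u[:, None]).sum(axis=1)
        z = np.minimum(z, n_classes - 1)

        if it >= burn_in:
            pi_sum += pi
            theta_sum += theta

    kept = n_iter - burn_in
    return pi_sum / kept, theta_sum / kept, z


# Simulated data: two well-separated classes answering eight binary items.
sim_rng = np.random.default_rng(1)
true_theta = np.array([[0.9] * 8, [0.1] * 8])
true_z = sim_rng.integers(0, 2, size=400)
Y = (sim_rng.random((400, 8)) < true_theta[true_z]).astype(int)

pi_hat, theta_hat, z_last = gibbs_latent_class(Y, n_classes=2)
print("estimated class weights:", np.round(pi_hat, 2))
print("estimated item probabilities:\n", np.round(theta_hat, 2))
```

Class labels are arbitrary in this model, so the estimated rows may appear in either order relative to the simulated ones.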

Sicong He - Parallel computing for variance reduction of estimates of expected Darwinian fitness
Mentor: Daniel Eck

Abstract: Precise estimation of expected Darwinian fitness, the expected lifetime number of offspring of an organism, is a central component of life history analysis. The aster model serves as a defensible statistical model for distributions of Darwinian fitness. The aster model is equipped to incorporate the major life stages an organism travels through, each of which may separately affect Darwinian fitness. Envelope methodology reduces asymptotic variability by establishing a link between unknown parameters of interest and the asymptotic covariance matrices of their estimators. It is known both theoretically and in applications that incorporating envelope methodology reduces asymptotic variability. Current software that incorporates envelope methodology into the aster estimation framework is too slow and too complicated for users. We develop new software that addresses these shortcomings.

Abe Sun - Application of Bayesian variable selection to college admission data
Mentor: Steve Culpepper

 

Julia Wapner - Improvements to SEAM methodology for player matchup evaluations
Mentor: Daniel Eck

Abstract: We further develop the SEAM (synthetic estimated average matchup) method for describing batter versus pitcher matchups in baseball, both numerically and visually. We first estimate the distribution of balls put into play by a batter facing a pitcher, called the spray chart distribution. This distribution is conditional on batter and pitcher characteristics. These characteristics are a better expression of talent than any conventional statistics. Many individual matchups have a sample size that is too small to be reliable. Synthetic versions of the batter and pitcher under consideration are constructed to alleviate these concerns. Weights governing how much influence these synthetic players have on the overall spray chart distribution are constructed to minimize expected mean square error. The current implementation of the SEAM method has shortcomings on several fronts. Through testing and evaluation, several enhancements and issues were discovered. Improvements were made to validation of results, computational speed, code dependency, and conceptual implementation.

Baoyu Li - Sales price prediction using deep learning methods
Mentor: Hyoeun Lee

Abstract: In this research, we focus on time series prediction using deep learning methods. First, we predict the sales price using several deep learning models, such as MLP, RNN, and LSTM. In addition, we predict the sales price using a hybrid model, which combines a traditional time series model (ARIMA) with a deep learning model. Finally, we compare the single deep learning method and the hybrid model method, focusing on whether the hybrid model can improve the prediction of the sales price.

Jiaqi Hu, Yanyi Lu - Bayesian analysis of NBA team performance
Mentor: Xinran Li

Abstract: According to the SGMA’s U.S. Trends in Team Sports research, over 26 million people in the United States play basketball, making it one of the most popular sports in the United States and even the world. Since the establishment of the National Basketball Association (NBA), public enthusiasm for basketball has continued to grow, and its popularity has remained high. At the same time, the public has become more concerned about the performance of their favorite teams each season. These circumstances motivated us to study team performance in each season. Previous studies have identified factors that might affect team performance, such as home-court advantage, coaching changes, and rest time. In our study, we use a Kaggle dataset covering the complete history of the NBA to build a model of team performance using statistical methods such as Bayesian hierarchical modeling. Our findings suggest that the major factor contributing to significant drops or rises in team performance each season is player turnover, such as players retiring or leaving the game because of injuries.

Yijia Gao - Impact of COVID-19 on the stock market
Mentor: Hyoeun Lee

Abstract: In this project, we investigate the impact of COVID-19 on technology company stock prices. We present a set of stylized empirical facts using statistical analysis of normalized returns and volatility of normalized returns based on the stock data of 34 US technology companies. We investigate the impact of COVID-19 on the statistical properties of those companies and we are interested in determining if those stylized facts remain true in light of the events surrounding COVID-19.

Songyuan Wang, Carol Song - Analyze environmental drivers of the biting behavior for several species of malaria parasite mosquitoes in a remote Amerindian region in South America
Mentor: Lelys Bravo de Guenni

Abstract: Our project goal is to analyze the impact of environmental drivers on the biting behavior of several species of malaria parasite mosquitoes in a remote Amerindian region in South America. This project has been developed in different phases. In phase 1, we analyzed the effectiveness of mosquito traps baited with two chemical attractants (Octenol and Lurex), based on data collected during 2015, using a non-parametric ANOVA. In phase 2, we analyzed the impact of different climate predictors on mosquito abundance, including the implementation of different imputation methods for missing data on the number of mosquitoes collected. The next phase of the project attempts to model the impact of climate variables and mosquito abundance on the number of malaria cases using a generalized time series modeling approach.

Weain Yin - Spatial modelling of water quality from multiple pollution sources in the BCE Cambodia wetland using a Bayesian approach
Mentor: Lelys Bravo de Guenni

Abstract: This project is a continuation of the project 'Assessing water quality spatial heterogeneity from multiple pollution sources in the BCE Cambodia wetland'. The objective of the project is to characterize the spatial variability of contaminants in the wetland region using Bayesian kriging methods. The water quality data is grouped by rain season and water type (surface water or ground water). Empirical and theoretical semi-variograms are computed for each contaminant to assess the strength of the spatial autocorrelation. It is found that each contaminant exhibits a differentiated spatial dependence structure. Spatial prediction using Bayesian kriging models is being implemented using the R package 'spBayes'.
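
As a rough illustration of the semi-variogram step, the sketch below computes the classical (Matheron) empirical semi-variogram and evaluates an exponential theoretical model. The function names, the bin edges, and the toy coordinates are assumptions for illustration; the Bayesian kriging with 'spBayes' is not reproduced here.

```python
import numpy as np


def empirical_semivariogram(coords, values, bin_edges):
    """Classical (Matheron) estimator: half the mean squared difference
    between all pairs of points whose separation falls in each distance bin."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)          # every unordered pair once
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diff = (values[i] - values[j]) ** 2
    which_bin = np.digitize(dist, bin_edges) - 1      # -1 or n_bins means out of range

    n_bins = len(bin_edges) - 1
    lags = np.full(n_bins, np.nan)
    gamma = np.full(n_bins, np.nan)
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        mask = which_bin == b
        counts[b] = mask.sum()
        if counts[b] > 0:
            lags[b] = dist[mask].mean()
            gamma[b] = 0.5 * sq_diff[mask].mean()
    return lags, gamma, counts


def exponential_model(h, nugget, partial_sill, range_param):
    """Theoretical exponential semi-variogram evaluated at distances h."""
    h = np.asarray(h, dtype=float)
    return nugget + partial_sill * (1.0 - np.exp(-h / range_param))


# Toy example: four sampling sites on a line with a linear concentration trend.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
values = np.array([0.0, 1.0, 2.0, 3.0])
lags, gamma, counts = empirical_semivariogram(coords, values, [0.5, 1.5, 2.5, 3.5])
print(lags, gamma, counts)
```

Comparing the empirical points with a fitted theoretical curve for each contaminant is one way to judge how strong the spatial autocorrelation is before moving on to kriging.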

Linjia Feng, Tymon Duchnicki - Quantifying climate change uncertainty in Champaign County
Mentor: Lelys Bravo de Guenni

Abstract: The NASA Earth Exchange Downscaled Climate Projections (NEX-DCP30) data is used to evaluate future projections of precipitation for the period 2069-2099 from 31 different General Circulation Models (GCMs), under two different Representative Concentration Pathways (RCPs) depicting future greenhouse gas concentration trajectories. The aim of the analysis is to quantify the contributions of GCM model uncertainty and of the RCPs to the estimated change in future precipitation relative to a historical reference period (1950-2005), using linear mixed-effects models. The different climate projections produced by the 31 GCMs account for the internal climate variability under the two RCP scenarios.

Lilac (Yinmeng) Lai - Analysis of Midwest drought conditions vs. El Niño
Mentor: Lelys Bravo de Guenni

Abstract: With farming being a crucial industry in Illinois, understanding drought behavior could help increase foresight in this critical sector of Illinois’ economy. Inspired by previous work on El Niño and drought episodes, we wanted to analyze the relationship between Urbana’s drought data and El Niño events. Using the El Niño indicators Niño 3.4 and Oceanic Niño Index (ONI), found on noaa.gov, and the drought indicators Palmer Drought Severity Index (PDSI) and Self-calibrated Palmer Drought Severity Index (SC-PDSI), we analyzed the relationship between drought and El Niño in both the frequency domain and the time domain. By performing coherence analysis, we found a weak coherence between El Niño data (ONI) and drought data (PDSI). We also produced a one-month-ahead prediction model of drought data (PDSI) from El Niño data (ONI) using transfer function models.
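
A coherence analysis of two monthly series can be sketched with `scipy.signal.coherence`. The series below are synthetic stand-ins that share a four-year cycle, not ONI or PDSI data, and the segment length of 96 months is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.signal import coherence

# Two synthetic monthly series (40 years) sharing a 48-month cycle plus
# independent noise. These are hypothetical stand-ins, not ONI or PDSI data.
rng = np.random.default_rng(0)
n_months = 480
t = np.arange(n_months)
shared_cycle = np.sin(2 * np.pi * t / 48.0)
oni_like = shared_cycle + 0.5 * rng.standard_normal(n_months)
pdsi_like = 0.8 * shared_cycle + 0.5 * rng.standard_normal(n_months)

# fs=12 samples per year, so frequencies are reported in cycles per year.
# Magnitude-squared coherence is estimated by Welch averaging over segments.
freqs, coh = coherence(oni_like, pdsi_like, fs=12.0, nperseg=96)

cycle_idx = int(np.argmin(np.abs(freqs - 0.25)))   # 0.25 cycles/year = 48-month period
print("coherence at the 4-year cycle:", round(float(coh[cycle_idx]), 2))
print("median coherence above 1 cycle/year:", round(float(np.median(coh[freqs > 1.0])), 2))
```

Values near 1 at a given frequency indicate that the two series move together at that time scale, while values near 0 indicate little linear association there. A weak coherence, as reported in the abstract, would show up as uniformly low values.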

Jennings Cheng - Comparison of differential abundance tests in compositional data
Mentor: Shulei Wang

Abstract: Compositional data naturally arise in a wide range of biomedical applications, such as microbiome studies. The differential abundance test is an essential tool in these applications. In this project, we compare several popular differential abundance tests on simulated data sets.

Lauren Bome Jang, Elina Mehra - Power optimization for knockoff filters
Mentor: Jingbo Liu

Abstract: The knockoff filter is a variable selection method developed by Barber and Candès (2016) with the purpose of controlling the false discovery rate at an exact threshold. It has since been the basis for several other knockoff methods. We specifically looked at effective signal deficiency (ESD), introduced by Liu and Rigollet (2019). ESD shows that the conditional independence structure of the predictors plays an important role in consistency and hence predicts the consistency of the different variable selection methods. We attempted to optimize the function in the correlation matrix of the predictive variables to control the power of the method.

Spring 2021 Projects

Chen Song and Songyuan Wang - Entomological surveillance for the control of malaria in remote Amerindian village

Mentor: Lelys Bravo

Lilac (Yinmeng) Lai - Analysis of Midwest Drought Conditions vs El Niño

Mentor: Lelys Bravo

Rebecca Chen - Exploration of the Latent Structure of Personnel Selection Assessment Response Data

Mentor: Susu Zhang