This marks the 17th year of the SFU/UBC Joint Statistics Seminar. The goal of the seminar is to bring graduate students from SFU and UBC together to socialize and present their new research. The event concludes with a talk by a faculty member from one of the universities.
Things will be a little different this year, as the seminar will be held online. We still hope that the seminar gives you a great opportunity to chat with friends from both universities and to learn more about the exciting research being done!
Asymptomatic and paucisymptomatic presentations of COVID-19, along with restrictive testing protocols, result in undetected COVID-19 cases. Estimating undetected cases is crucial for understanding the true severity of the outbreak. We introduce new hierarchical disease dynamics models based on the N-mixtures hidden population framework. The models make use of three sets of disease count data per region: reported cases, recoveries, and deaths. They are applied to estimate the level of under-reporting of COVID-19 in the Northern Health Authority region of British Columbia, Canada, during thirty weeks of the provincial recovery plan. Parameter covariates are used to improve model estimates. When accounting for changes in weekly testing volumes, we find under-reporting rates varying from 60.2% to 84.2%.
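To make the N-mixture idea concrete, here is a minimal sketch of this style of likelihood: true case counts are latent Poisson draws, reported counts are binomial thinnings, and the latent counts are marginalized by truncated summation. This is illustrative only; the talk's hierarchical model also uses recoveries, deaths, and covariates such as weekly testing volume, and the parameter values below are invented.

```python
# Minimal N-mixture-style likelihood sketch (not the talk's full model).
import numpy as np
from scipy.stats import poisson, binom

def nmixture_loglik(counts, lam, p, n_max=2000):
    """Log-likelihood of reported counts when true cases N_t ~ Poisson(lam)
    are thinned by detection probability p: counts_t ~ Binomial(N_t, p).
    The latent N_t is marginalized by summing up to a truncation n_max."""
    n = np.arange(n_max + 1)
    log_prior = poisson.logpmf(n, lam)                  # P(N = n)
    ll = 0.0
    for c in counts:
        log_obs = binom.logpmf(c, n, p)                 # P(C = c | N = n)
        ll += np.logaddexp.reduce(log_prior + log_obs)  # marginalize N
    return ll

# Hypothetical weekly reported counts with heavy under-detection (p = 0.3)
reported = np.array([12, 18, 25, 30, 22])
print(nmixture_loglik(reported, lam=80, p=0.3))
```

In this toy version, an under-reporting rate corresponds directly to 1 - p; in the actual model the detection probability is linked to covariates such as testing volume.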
A novel framework for statistical learning is introduced that combines ideas from regularization and ensembling. The framework is applied to learn an ensemble of logistic regression models for high-dimensional binary classification. In the new framework, the models in the ensemble are learned simultaneously by optimizing a multi-convex objective function. To enforce diversity, the objective function penalizes overlap between the models in the ensemble. Measures of diversity in classifier ensembles are used to show how our method learns the ensemble by exploiting the accuracy-diversity trade-off. In contrast to other ensembling approaches, the resulting ensemble model is fully interpretable as a logistic regression model and asymptotically consistent, and at the same time it yields excellent prediction accuracy, as demonstrated in an extensive simulation study and in gene expression data applications. The models found by the proposed methodology can also reveal alternative mechanisms that explain the relationship between the predictors and the response variable.
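The following is a hedged sketch of the general idea of diversity-penalized ensembles of L1-penalized logistic regressions, fit by blockwise updates: each model is refit with per-feature penalties inflated wherever the other models already have large coefficients, discouraging overlap. The penalty weighting scheme, the function name fit_diverse_ensemble, and the hyperparameters are assumptions for illustration, not the talk's actual objective.

```python
# Sketch: diversity via reweighted L1 penalties, updated blockwise.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_diverse_ensemble(X, y, n_models=3, n_rounds=5, C=1.0, gamma=5.0):
    """Refit each model with feature scaling so that the effective L1
    penalty on feature j grows with the other models' usage of j."""
    p = X.shape[1]
    coefs = np.zeros((n_models, p))
    for _ in range(n_rounds):
        for k in range(n_models):
            others = np.abs(coefs).sum(axis=0) - np.abs(coefs[k])
            scale = 1.0 / (1.0 + gamma * others)   # weighted-lasso trick
            clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
            clf.fit(X * scale, y)
            coefs[k] = clf.coef_[0] * scale        # back to original scale
    return coefs

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(200) > 0).astype(int)
coefs = fit_diverse_ensemble(X, y)
print((np.abs(coefs) > 1e-8).sum(axis=1))          # support size per model
```

Because each member remains a sparse logistic regression, the ensemble stays interpretable, and the disjoint supports are what can surface alternative explanatory mechanisms.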
The scale of genome-wide association studies (GWAS) is approaching the population level, with population genetics resources such as the UK Biobank being used in research efforts for many unsolved diseases. As the size of these databases grows, the computational time and space required for GWAS grow with it, necessitating new and more efficient methods of analysis. This work improves the efficiency of population genetic file formats and GWAS computation by leveraging properties of the distribution of samples in population-level genetic data that are exploited by a family of compression methods known as finite state entropy algorithms. This results in efficiency gains over existing dictionary compression methods often used for population genetic data, such as Zstd and Zlib. We provide open-source prototype software for multi-phenotype GWAS that implements finite state entropy compression.
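The key property being exploited is that genotype columns at population scale are highly skewed (most samples carry the major allele), so their empirical entropy is far below 8 bits per byte, a bound that entropy coders such as FSE can approach. The sketch below only illustrates that skew with simulated genotypes and stdlib zlib; it does not implement FSE itself, and the allele frequency is an assumption.

```python
# Illustration: skewed genotype data has low entropy per symbol.
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Simulated biallelic genotypes for 100k samples, minor allele freq 5%
genotypes = rng.choice([0, 1, 2], size=100_000, p=[0.9025, 0.095, 0.0025])
raw = genotypes.astype(np.uint8).tobytes()

# Empirical Shannon entropy in bits/symbol (lower bound for any coder)
_, counts = np.unique(genotypes, return_counts=True)
freq = counts / counts.sum()
entropy = -(freq * np.log2(freq)).sum()

compressed = zlib.compress(raw, level=9)
print(f"entropy bound: {entropy:.3f} bits/symbol")
print(f"zlib achieved: {8 * len(compressed) / len(raw):.3f} bits/symbol")
```

The gap between the achieved rate and the entropy bound is the headroom that a near-optimal entropy coder can recover.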
Variational inference is a popular alternative to Markov chain Monte Carlo methods that constructs a Bayesian posterior approximation by minimizing a discrepancy to the true posterior within a pre-specified family. This converts Bayesian inference into an optimization problem, enabling the use of simple and scalable stochastic optimization algorithms. However, a key limitation of variational inference is that the optimal approximation is typically not tractable to compute; even in simple settings the problem is nonconvex. Thus, recently developed statistical guarantees, all of which involve the data-asymptotic properties of the optimal variational distribution, are not reliably obtained in practice. In this work, we provide two major contributions: a theoretical analysis of the asymptotic convexity properties of variational inference in the popular setting with a Gaussian family, and consistent stochastic variational inference (CSVI), an algorithm that exploits these properties to find the optimal approximation in the asymptotic regime. CSVI consists of a tractable initialization procedure that finds the local basin of the optimal solution and a scaled gradient descent algorithm that stays locally confined to that basin. Experiments on nonconvex synthetic and real-data examples show that, compared with standard stochastic gradient descent, CSVI improves the likelihood of obtaining the globally optimal posterior approximation.
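A minimal sketch of the baseline being improved on: stochastic Gaussian variational inference via the reparameterization trick on a toy multimodal target, where the initialization determines which basin plain SGD lands in. CSVI's smoothed initialization and scaled gradient steps are omitted here, and the target density and step sizes are assumptions for illustration.

```python
# Reparameterized SGD for Gaussian VI on a nonconvex toy target.
import numpy as np

def log_target(x):
    """Toy bimodal (hence nonconvex) unnormalized log-density."""
    return np.log(0.7 * np.exp(-0.5 * (x - 2.0) ** 2)
                  + 0.3 * np.exp(-0.5 * (x + 2.0) ** 2))

def gaussian_svi(m=0.0, log_s=0.0, steps=5000, lr=1e-2, mc=16, seed=1):
    """Minimize KL(q || pi) over q = N(m, s^2) with reparameterized SGD."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        eps = rng.standard_normal(mc)
        s = np.exp(log_s)
        x = m + s * eps                               # reparameterized draws
        h = 1e-4                                      # finite-difference step
        dlogp = (log_target(x + h) - log_target(x - h)) / (2 * h)
        grad_m = -dlogp.mean()                        # d/dm of E[-log pi]
        grad_log_s = -(dlogp * eps * s).mean() - 1.0  # entropy term gives -1
        m -= lr * grad_m
        log_s -= lr * grad_log_s
    return m, np.exp(log_s)

print(gaussian_svi(m=0.0))   # tends toward the dominant mode near +2
print(gaussian_svi(m=-3.0))  # a poor start can get stuck near the minor mode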
State-space models, or hidden Markov models (HMMs), are used in many contexts such as object tracking, finance, and electrical engineering. These models include latent variables that affect the observables of interest. Moreover, nonlinear and non-Gaussian models are often necessary to accurately capture the changes in the latent variables as well as their relation to the observables. We provide a methodology for the Bayesian estimation of general state-space models that combines the discrete nonlinear filter with Markov chain Monte Carlo (MCMC). Using an example from jump-diffusion models in finance, we demonstrate that, compared with existing approaches, the methodology is particularly effective when the number of latent factors is low or when certain observables do not depend on all hidden variables (e.g., in joint estimation).
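For intuition, here is a minimal grid-based filter for a scalar state-space model: the latent state is marginalized on a fixed grid rather than with particles, and the resulting log-likelihood is what could be plugged into an MCMC sampler over the model parameters. The linear-Gaussian densities below are placeholders; the method's point is that any transition and observation densities can be substituted.

```python
# Grid-based filtering sketch: predict, update, renormalize, accumulate.
import numpy as np
from scipy.stats import norm

def grid_filter_loglik(y, grid, phi=0.9, trans_sd=0.3, obs_sd=0.5):
    """AR(1) latent state x_t = phi * x_{t-1} + eta, observed y_t = x_t + eps.
    Marginalizes the state over a fixed grid of K points."""
    K = len(grid)
    # Transition matrix: P[i, j] ~ p(x_t = grid[j] | x_{t-1} = grid[i])
    P = norm.pdf(grid[None, :], loc=phi * grid[:, None], scale=trans_sd)
    P /= P.sum(axis=1, keepdims=True)
    w = np.full(K, 1.0 / K)                   # prior over grid points
    loglik = 0.0
    for yt in y:
        w = w @ P                             # predict step
        w = w * norm.pdf(yt, loc=grid, scale=obs_sd)  # update step
        c = w.sum()
        loglik += np.log(c)                   # incremental likelihood
        w /= c                                # renormalize
    return loglik

y = np.array([0.1, 0.4, 0.2, -0.3, 0.0])
grid = np.linspace(-3, 3, 200)
print(grid_filter_loglik(y, grid))
```

The cost grows with the grid size per latent dimension, which is consistent with the method being most effective when the number of latent factors is low.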
Edge-exchangeable probabilistic network models generate edges as an i.i.d. sequence from a discrete measure, providing a simple means for statistical inference of latent network properties. The measure is often constructed using the self-product of a realization from a Bayesian nonparametric (BNP) discrete prior; but unlike in standard BNP models, the self-product measure prior is not conjugate to the likelihood, hindering the development of exact simulation and inference algorithms. Approximation via finite truncation of the discrete measure is a straightforward alternative, but it incurs an unknown approximation error. In this paper, we develop methods for forward simulation and posterior inference in random self-product-measure models based on truncation, and we provide theoretical guarantees on the quality of the results as a function of the truncation level. The techniques we present are general and extend to the broader class of discrete Bayesian nonparametric models.
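A hedged sketch of truncated forward simulation, assuming a Dirichlet process prior for concreteness: draw stick-breaking weights truncated at K atoms, and sample edges i.i.d. from the self-product of the truncated weights. The truncation level K governs the approximation error that the talk's guarantees quantify; the DP choice and the parameter values here are illustrative assumptions.

```python
# Truncated forward simulation of an edge-exchangeable graph.
import numpy as np

def simulate_edges(n_edges=50, K=100, alpha=5.0, seed=0):
    rng = np.random.default_rng(seed)
    # Stick-breaking weights for DP(alpha), truncated at K atoms:
    # w_k = v_k * prod_{l < k} (1 - v_l), with v_l ~ Beta(1, alpha)
    v = rng.beta(1.0, alpha, size=K)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    w /= w.sum()                      # renormalize after truncation
    # Self-product measure: P(edge = (i, j)) proportional to w_i * w_j,
    # so both endpoints are drawn i.i.d. from the truncated weights.
    return rng.choice(K, size=(n_edges, 2), p=w)

print(simulate_edges(n_edges=10))
```

Because the discarded tail mass vanishes as K grows, the simulated edge sequence approaches the untruncated model, which is the sense in which truncation-level guarantees are stated.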