Annie Qu

Department of Statistics, UC Irvine

Title: Representation Retrieval Learning for Heterogeneous Data Integration
Date: Friday, March 21st, 2025
Time: 1:30PM (PDT)
Location: ASB 10900

Abstract:

In this presentation, I will showcase advanced statistical machine learning techniques and tools designed for the seamless integration of information from multi-source datasets. These datasets may originate from various sources, encompass distinct studies with different variables, and exhibit distinct dependence structures. One of the greatest challenges in investigating research findings is the systematic heterogeneity across individuals, which can significantly undermine the power of existing machine learning methods to identify the underlying true signals. This talk will examine the advantages and drawbacks of current data integration methods such as multi-task learning, optimal transport, missing data imputation, matrix completion, and transfer learning. Additionally, we will introduce a new representation retriever learning approach that maps heterogeneous observed data to a latent space, facilitating the extraction of shared information and knowledge and the disentanglement of source-specific information and knowledge. The key idea is to project heterogeneous raw observations onto a representation retriever library, and the novelty of our method is that we can retrieve partial representations from the library for a target study. The main advantage of the proposed method is that it can increase statistical power by borrowing partially shared representation retrievers from multiple sources of data. This approach ultimately allows one to extract information from heterogeneous data sources, transfer generalizable knowledge beyond the observed data, and enhance the accuracy of prediction and statistical inference.