NIRODHA EPASINGHEGE DONA

Title: Big Data Applications in Genetics and Sports
Date: Monday, October 23rd, 2023
Time: 10:00 AM
Location: Hybrid, over Zoom and in LIB 2020
Supervised by: Dr. Tim Swartz

Abstract: This thesis consists of five distinct chapters, each considering an application within the domain of big data. The opening chapter presents an application in genetics. It discusses how to simulate exome-sequencing data for 150 families from a North American admixed population, each containing at least four members affected with lymphoid cancer. These data include details of the ascertained families, along with information about single-nucleotide variants found in the exomes of the affected family members. The subsequent chapters focus on sports analytics through the lens of big data applications. In the second chapter, the expected goals concept is extended to limited overs cricket, where the ideas are illustrated using the economy rate statistic. The approach is based on the estimation of batting outcome probabilities given detailed data on each ball that is bowled in a match. These probabilities are estimated using machine learning techniques. The analysis reveals distinctions between men’s and women’s T20 cricket. One such finding is that sixes occur more frequently in the men’s game than in the women’s game. The third chapter examines pace of play in soccer. The key question is whether a fast-paced playing style is an advantageous strategy. This question remains insufficiently addressed in both soccer and hockey. The investigation is enabled by tracking data, which provide the locations of players measured at frequent intervals (i.e., 10 times per second). The chapter begins by formulating a definition of pace. Methods of causal inference are then used to investigate the relationship between pace and shots. The analysis reveals that maintaining a higher pace than the opponent throughout a match results in an advantage of approximately two additional shots per game. The fourth chapter assesses the optimal locations for throw-ins in soccer. This investigation is likewise enabled by tracking data, which provide the locations of players measured at frequent intervals (i.e., 10 times per second). The methods are necessarily causal, since there are confounding variables that affect both the throw-in location and the result of the throw-in. A simple causal analysis indicates that, on average, backwards throw-ins are beneficial and lead to an extra 2.5 shots per 100 throw-ins. There is also a benefit to long throw-ins, which on average result in roughly 4.0 more shots per 100 throw-ins. These results are confirmed by a more complex causal analysis that relies on the spatial structure of throw-ins. The last chapter proposes increasingly complex models of rally length in tennis based on publicly available data. The models provide insights into player characteristics involving the ability to extend rallies, and relate these characteristics to performance measures. The analysis highlights some important features that make a difference between winning and losing, and therefore provides feedback on how players may improve. Bayesian models are introduced, with posterior estimation carried out using Markov chain Monte Carlo methods.

Keywords: Family Studies; Exome Sequencing; Lymphoid Cancer; Ascertained Pedigrees; Sports Analytics; Player Tracking Data; Causal Inference; Machine Learning.