Invited sessions

CNC23 hosts three invited sessions: Challenges in genomic prediction, Advances in (network) meta-analysis, and Bayesian machine learning & optimal design.

 

INVITED SESSION I: Challenges in Genomic Prediction

On the challenges of multi-omics data and method comparisons: Insights from a systematic benchmark study on survival prediction

Moritz Herrmann (Institute of Statistics, Ludwig-Maximilians-Universität München)

Predicting disease outcomes, such as patients' survival times, using genomic data poses several challenges. This talk will focus on the challenges arising from the presence of multiple groups of molecular data types, so-called multi-omics data, on the one hand, and the challenges posed by the complexity of benchmark studies, on the other hand. With regard to the first aspect, the presence of multiple feature groups in multi-omics datasets requires effective handling strategies. For example, while high-dimensional sets of molecular features may provide rich information, there is a risk that important but low-dimensional clinical features may be lost in the prediction process. The effective incorporation of different types of high-dimensional molecular features alongside a small number of often highly informative clinical variables is critical to predictive performance. In contrast, the second aspect addresses the challenges of comparing prediction methods and selecting an appropriate method for the problem at hand. Identifying the most appropriate methods for a given dataset and outcome from the plethora of prediction methods available requires careful consideration of their strengths, limitations and ability to effectively use multi-omics data. Neutral benchmark studies, designed to systematically compare and evaluate different prediction methods on a large number of datasets, play a crucial role in this selection process. However, the variety of design and analysis options available can lead to biased interpretations and overly optimistic conclusions in favor of a particular method. Addressing these issues involves navigating the complexities of benchmark study design, performance measures, handling missing values in benchmarking results, and aggregating results across datasets. By recognizing and addressing these challenges, we can improve predictive performance, harness multi-omics data while maintaining clinical relevance, and ensure rigorous and reliable evaluation and comparison of prediction methods. Ultimately, these efforts will help advance personalized medicine and improve patient outcomes.
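
For illustration only (not the study's actual pipeline), the following minimal sketch shows one way benchmark results for several survival-prediction methods could be aggregated across datasets, including a simple way of handling missing results; method names and numbers are made up.

```python
# Minimal sketch (illustration only): aggregating benchmark results for
# survival-prediction methods across datasets. "results" is assumed to be a
# long-format table with one row per (dataset, method) pair and a performance
# measure such as the C-index; values and method names are made up.
import pandas as pd

results = pd.DataFrame({
    "dataset": ["D1", "D1", "D2", "D2", "D3", "D3"],
    "method":  ["blockForest", "lasso", "blockForest", "lasso", "blockForest", "lasso"],
    "cindex":  [0.68, 0.64, 0.71, None, 0.63, 0.66],  # None = failed/missing run
})

# One conservative way to handle a missing result: impute the value of a
# random prediction (C-index 0.5), so a failed run is not silently dropped.
results["cindex_imputed"] = results["cindex"].fillna(0.5)

# Rank methods within each dataset, then average ranks across datasets;
# ranks are less sensitive to dataset-specific scales of the performance measure.
results["rank"] = results.groupby("dataset")["cindex_imputed"].rank(ascending=False)

summary = results.groupby("method").agg(
    mean_cindex=("cindex_imputed", "mean"),
    mean_rank=("rank", "mean"),
)
print(summary)
```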

 

Leveraging ancestral recombination graphs to store genome-wide data and enable origin-aware genomic predictions

Gregor Gorjanc (The Roslin Institute and Royal (Dick) School of Veterinary Medicine, The University of Edinburgh)

We now have abundant genome-wide data that fuels quantitative genetic applications and biological discovery. The amount of this data is growing rapidly, challenging data storage and computation. However, the established statistical models do not seem to be able to leverage this data growth fully to increase the accuracy of genomic predictions. In this talk, I will describe our work on leveraging ancestral recombination graphs, encoded with the tree sequence data format. We are working with ancestral recombination graphs for two reasons. First, to store mega-scale whole-genome sequence data. I will present two ongoing projects in our lab: encoding data from the globally diverse set of 1000 bull genomes and from the more narrowly diverse set of 1 million pig genomes. Second, to build a richer statistical model that can leverage the full description of recombinations and mutations to enable origin-aware genomic predictions.
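
As background, a minimal sketch (not the lab's actual pipeline) of how a tree sequence encodes genome-wide variation, using the open-source msprime/tskit libraries on simulated data:

```python
# Minimal sketch (background only): simulate a small population with msprime
# and inspect how the resulting tree sequence stores genome-wide variation.
# Libraries: msprime / tskit (https://tskit.dev).
import msprime

# Simulate ancestry (with recombination) and then overlay mutations.
ts = msprime.sim_ancestry(
    samples=100,              # 100 diploid individuals
    sequence_length=1e6,      # 1 Mb
    recombination_rate=1e-8,
    population_size=1_000,
    random_seed=42,
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=42)

print(f"{ts.num_trees} marginal trees, {ts.num_sites} variant sites")

# The genotype matrix can be decoded on demand; the tree sequence itself
# stores only nodes, edges and mutations, which is what makes it compact.
G = ts.genotype_matrix()      # shape: (num_sites, num_sample_genomes)
print("Decoded genotype matrix shape:", G.shape)
```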

 

New strategies for old problems: should we integrate Deep Learning in plant breeding programs?

Laura Zingaretti (BASF)

The primary goal of any breeding program is to improve complex traits in a cost-effective way. Since Lush's time, breeders have strongly relied on the breeding equation (BE) to guide plant breeding decisions. In essence, the BE expresses genetic gain as a function of the selection intensity, the genetic variation, the selection accuracy, and the generation interval.
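
For reference, a standard form of the breeding equation, written with the conventional symbols (not taken from the abstract itself):

```latex
% Breeder's equation: expected genetic gain per unit time
% i        : selection intensity
% r        : accuracy of selection
% \sigma_A : additive genetic standard deviation (genetic variation)
% L        : generation interval
\Delta G = \frac{i \, r \, \sigma_A}{L}
```
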
Genomic Selection (GS) consists of using a dense set of markers to select superior genotypes without identifying the QTL individually. GS adds value by increasing the accuracy of breeding value predictions in comparison with pedigree-based models, helps to increase the intensity of selection, and reduces the generation interval. Although it is a well-established technology, its application in many plant breeding programs is still immature. GS is typically implemented with linear mixed models or equivalent Bayesian approaches. Although powerful, these are impractical for modelling non-additive genetic effects (such as dominance or epistasis) and assume a homogeneous set of input variables.
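
As a generic illustration of this standard linear approach (not the speaker's implementation), a minimal ridge-type marker-effects sketch on simulated SNP data:

```python
# Minimal sketch (illustration only): a ridge-type marker-effects model
# (RR-BLUP-like) on simulated SNP data, the standard linear baseline for GS.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n_train, n_test, n_snps = 500, 100, 2_000

# Simulate SNP genotypes coded 0/1/2 and additive marker effects.
X = rng.binomial(2, 0.3, size=(n_train + n_test, n_snps)).astype(float)
beta = rng.normal(0.0, 0.05, size=n_snps)
g = X @ beta                                   # true breeding values
y = g + rng.normal(0.0, g.std(), size=len(g))  # phenotypes (heritability ~ 0.5)

X_tr, X_te = X[:n_train], X[n_train:]
y_tr, g_te = y[:n_train], g[n_train:]

# Ridge regression on markers; the penalty plays the role of the variance ratio.
model = Ridge(alpha=n_snps * 0.5).fit(X_tr, y_tr)
gebv = model.predict(X_te)                     # genomic estimated breeding values

# Predictive accuracy: correlation between GEBV and true breeding values.
print("accuracy:", np.corrcoef(gebv, g_te)[0, 1].round(2))
```
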
In recent years, the use of Deep Learning (DL) in GS has become a popular alternative. Although promising, the application of DL to GS in both plants and animals has not given clear signs of outperforming the standard methods. One reason that has been argued is that the variable to be predicted (the genetic value) is not actually observed. Another may be that the input variables (i.e., SNPs) have a highly leptokurtic distribution. Some scenarios, though, may be more favourable: in plants, DL has been shown to be powerful in capturing complex (non-additive) patterns.
Importantly, applications of DL in breeding can span a broader spectrum than just GS: generative adversarial networks (GANs) or variational autoencoders hold promise for conducting more realistic (data-driven) simulations, as they can learn complex distributions and can be used to produce synthetic DNA or images. DL also provides powerful techniques for combining different data sources; e.g., Long Short-Term Memory (LSTM) networks can be used to model a temporal dimension with minimal preprocessing, allowing climate and environmental information to be easily added to GS or phenotypic models. Many of these applications are still in their infancy, so rapid development of these technologies is expected, and breeders should take advantage of it.
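
A minimal, purely hypothetical sketch of the kind of architecture described, combining an LSTM over an environmental time series with a dense branch for SNPs (not any specific published model):

```python
# Hypothetical architecture (illustration only): an LSTM over a climate time
# series combined with a dense branch for SNPs to predict a phenotype, as one
# possible way of adding environmental data to a genomic prediction model.
import torch
import torch.nn as nn

class GxELSTM(nn.Module):
    def __init__(self, n_snps: int, n_env_features: int, hidden: int = 32):
        super().__init__()
        self.snp_branch = nn.Sequential(nn.Linear(n_snps, hidden), nn.ReLU())
        self.env_branch = nn.LSTM(n_env_features, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, snps: torch.Tensor, env_series: torch.Tensor) -> torch.Tensor:
        # snps: (batch, n_snps); env_series: (batch, time, n_env_features)
        g = self.snp_branch(snps)
        _, (h_n, _) = self.env_branch(env_series)   # h_n: (1, batch, hidden)
        e = h_n[-1]                                  # final hidden state per sample
        return self.head(torch.cat([g, e], dim=1)).squeeze(-1)

# Toy forward pass: 8 genotypes, 500 SNPs, 120 days of 4 weather variables.
model = GxELSTM(n_snps=500, n_env_features=4)
snps = torch.randint(0, 3, (8, 500)).float()
weather = torch.randn(8, 120, 4)
print(model(snps, weather).shape)  # torch.Size([8])
```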

 

 

INVITED SESSION II: Advances in (Network) Meta-Analysis

Generalisability in surrogate endpoint evaluation and borrowing information across diverse sources when evaluating novel health technologies: meta-analytic approaches

Sylwia Bujkiewicz (Biostatistics Research Group, Department of Population Health Sciences, University of Leicester)

Precision medicine research identifies subgroups of the population, for example defined by genetic biomarkers in oncology, to which targeted therapies can be delivered successfully. As target populations become small, such novel therapies are increasingly evaluated in trials with short-term surrogate endpoints (such as progression-free survival as a surrogate for overall survival) or in single-arm studies. As data on the final clinical outcome are either limited or not available at early stages of drug development, regulatory agencies grant accelerated licensing approval conditional on a surrogate marker. It is important that a surrogate endpoint is a reliable predictor of clinical benefit, not only to ensure robust licensing approvals, but also to allow health technology assessment (HTA) agencies (such as the Belgian KCE, Dutch ZIN, French HAS or NICE in the UK) to draw inferences from such limited evidence for reimbursement decisions, that is, whether the new treatment offers good value for money and should be recommended for use by health services. With small patient populations, synthesis of diverse sources of data and generalisability of surrogate relationships across treatments or even indications (disease areas) will play an increasingly important role in such decisions. We will discuss meta-analytic methods for borrowing information from indirect sources of data, for example across treatment classes or indications, when evaluating surrogate endpoints and novel therapies.
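
For background, one widely used formulation in this area, shown here in standard notation and not necessarily the exact model of the talk, is the bivariate meta-analytic model relating treatment effects on the surrogate and final outcomes:

```latex
% Bivariate meta-analytic model for surrogate endpoint evaluation
% (a standard formulation, shown as background only).
% d_{1i}, d_{2i}: estimated treatment effects on the surrogate and on the
% final outcome in study i;  \delta_{1i}, \delta_{2i}: the true effects;
% \Sigma_i: within-study covariance; \lambda_0, \lambda_1, \psi^2: the
% surrogate relationship (intercept, slope, conditional variance).
\begin{pmatrix} d_{1i} \\ d_{2i} \end{pmatrix}
  \sim \mathcal{N}\!\left(
  \begin{pmatrix} \delta_{1i} \\ \delta_{2i} \end{pmatrix}, \Sigma_i \right),
\qquad
\delta_{2i} \mid \delta_{1i}
  \sim \mathcal{N}\!\left(\lambda_0 + \lambda_1 \delta_{1i},\ \psi^2\right)
```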

 

The assessment of replicability with an application to meta-analysis

Leonhard Held (Epidemiology, Biostatistics and Prevention Institute, University of Zurich)

Replicability of experiments is a cornerstone of the scientific method. A standard way to provide evidence of replicability is the two-trials rule of the US FDA, which requires "at least two adequate and well controlled studies, each convincing on its own, to establish effectiveness" for drug approval. I will describe some alternatives based on p-value combination methods and compare the different methods with respect to relevant operating characteristics (Held, 2023). The corresponding p-value functions (Infanger and Schmidt-Trucksäss, 2019) can then be used to provide novel meta-analytic confidence intervals for the combined treatment effect with some interesting properties. A comparison with more standard approaches will be provided, based on results from an extensive simulation study.
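
As background only (a minimal sketch, not the methods developed in the talk), the two-trials rule can be contrasted with two classical p-value combination methods on a pair of one-sided p-values:

```python
# Minimal sketch (background only): the two-trials rule versus two classical
# p-value combination methods, applied to one-sided p-values from two trials.
from math import log, sqrt
from scipy import stats

p1, p2 = 0.01, 0.04          # one-sided p-values from two independent trials
alpha = 0.025                # conventional one-sided significance level

# Two-trials rule: each trial must be significant on its own,
# i.e. max(p1, p2) <= alpha.
two_trials_pass = max(p1, p2) <= alpha

# Fisher's method: -2 * (log p1 + log p2) ~ chi-square with 4 df under H0.
p_fisher = stats.chi2.sf(-2 * (log(p1) + log(p2)), df=4)

# Stouffer's method: combine the z-scores with equal weights.
z1, z2 = stats.norm.isf(p1), stats.norm.isf(p2)
p_stouffer = stats.norm.sf((z1 + z2) / sqrt(2))

print(f"two-trials rule passes: {two_trials_pass}")
print(f"Fisher combined p:      {p_fisher:.4f}")
print(f"Stouffer combined p:    {p_stouffer:.4f}")
```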

 

 

INVITED SESSION III: Bayesian Machine Learning & Optimal Design

Extensions to probabilistic tree-based machine learning algorithms

Estevao Prado (Department of Mathematics & Statistics, Lancaster University)

Bayesian additive regression trees (BART) is a tree-based machine learning method that has been successfully applied to regression and classification problems. BART assumes regularisation priors on a set of trees that work as weak learners and is very flexible for predicting in the presence of non-linearities and low-order interactions. In this talk, we present two extensions to semi-parametric models based on BART. First, we propose a new class of models for the estimation of genotype-by-environment interactions in plant-based genetics. Our approach uses semi-parametric BART to accurately estimate marginal genotypic and environment effects along with their interaction in a cut Bayesian framework. We demonstrate that our approach is competitive or superior to similar models widely used in the literature via both simulation and a real-world dataset. Second, we extend semi-parametric BART models with a view to analysing data from an international education assessment, where certain predictors of students' achievements in mathematics are of particular interpretational interest. Through additional simulation studies and another application to a well-known benchmark dataset, we also show competitive performance when compared to regression models, alternative formulations of semi-parametric BART, and other tree-based methods.
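
For orientation, in standard notation (not necessarily the talk's exact formulation), BART models the response as a sum of regression trees, and semi-parametric variants add a linear component for the covariates of direct interpretational interest:

```latex
% BART sum-of-trees model (standard formulation, shown for orientation).
% g(x; T_t, M_t): the t-th regression tree with structure T_t and
% terminal-node parameters M_t; m: number of trees.
y_i = \sum_{t=1}^{m} g(x_i;\, T_t, M_t) + \varepsilon_i,
  \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

% Semi-parametric extension: a linear part for covariates z_i whose effects
% \beta are of interpretational interest, plus a BART part for the rest.
y_i = z_i^{\top}\beta + \sum_{t=1}^{m} g(x_i;\, T_t, M_t) + \varepsilon_i
```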

 

Optimal design of experiments using amortized Bayesian inference and conditional normalizing flows

Matthias Bruckner (Janssen Pharmaceutica)

Bayesian optimal design of experiments involves choosing a design that maximizes the expected utility, where the utility function depends on the posterior distribution of the model parameters. Analytically calculating the expected utility is infeasible in most cases. Estimates of the expected utility can be obtained with Monte Carlo methods by averaging over a large number of datasets sampled from the prior predictive distribution. Evaluating the utility function in turn requires Markov chain Monte Carlo (MCMC) methods to obtain posterior samples, except in simple models with conjugate priors. Repeating this for thousands of simulated datasets for every design option can be computationally prohibitive. Recent advances in amortized Bayesian inference offer a potential solution to this problem. Here, the Bayesian inference process is divided into a costly training phase and a much cheaper inference phase. During the training phase, a neural network learns an estimator for the probabilistic mapping from data to underlying model parameters. The trained neural network can then, without additional training or optimization, efficiently sample from the posterior distributions of arbitrarily many datasets involving the same model family. We explore the computational advantages and possible applications in the design of non-clinical experiments.
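
A minimal, self-contained sketch of the resulting design-evaluation loop (illustration only): in the real workflow the posterior draws would come from the trained conditional normalizing flow; here a toy conjugate-normal model stands in for it so the example runs.

```python
# Minimal sketch (illustration only): estimating the expected utility of
# candidate designs by Monte Carlo. The posterior_sample() function is a
# stand-in for one cheap forward pass of a trained amortized network; in this
# toy model (normal mean with known noise) the posterior is available in
# closed form, which keeps the sketch runnable.
import numpy as np

rng = np.random.default_rng(0)
PRIOR_MU, PRIOR_SD, NOISE_SD = 0.0, 1.0, 2.0   # toy model: y_j ~ N(theta, NOISE_SD^2)

def posterior_sample(y, n_draws):
    """Stand-in for the amortized inference step (here: exact conjugate posterior)."""
    n = len(y)
    prec = 1 / PRIOR_SD**2 + n / NOISE_SD**2
    mean = (PRIOR_MU / PRIOR_SD**2 + y.sum() / NOISE_SD**2) / prec
    return rng.normal(mean, np.sqrt(1 / prec), size=n_draws)

def expected_utility(n_obs, n_outer=2_000, n_draws=500):
    """Expected utility of the design 'collect n_obs observations'."""
    utilities = []
    for _ in range(n_outer):
        theta = rng.normal(PRIOR_MU, PRIOR_SD)        # draw parameters from the prior
        y = rng.normal(theta, NOISE_SD, size=n_obs)   # prior predictive dataset
        draws = posterior_sample(y, n_draws)          # amortized inference step
        utilities.append(-np.var(draws))              # utility: negative posterior variance
    return float(np.mean(utilities))

# Compare a few candidate designs; the largest estimate would be selected.
for n_obs in (2, 5, 10):
    print(n_obs, round(expected_utility(n_obs), 3))
```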

 

Bayesian Optimization approaches for optimal dose combination identification in early phase dose finding trials

James Willard (School of Population and Global Health, McGill University)

Identification of optimal dose combinations in early phase dose-finding trials is challenging due to the trade-off between precisely estimating the large number of parameters required to flexibly model the dose-response surfaces and the small sample sizes of early phase trials. Existing methods often restrict the search to pre-defined dose combinations, which may fail to identify regions of optimality in the dose combination space. These difficulties are even more pertinent in the context of personalized dose-finding, where patient characteristics are used to identify tailored optimal dose combinations. To overcome these challenges, we propose the use of Bayesian optimization for finding optimal dose combinations in standard ("one size fits all") and personalized multi-agent dose-finding trials. Bayesian optimization is a method for estimating the global optimum of expensive-to-evaluate objective functions. The objective function is approximated by a surrogate model, commonly a Gaussian process, paired with a sequential design strategy that selects the next point via an acquisition function. This work is motivated by an industry-sponsored problem, where the focus is on optimizing a dual-agent therapy in a setting featuring minimal toxicity. To illustrate and compare the performance of the standard and personalized methods in this setting, simulation studies are performed under a variety of scenarios.
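
As a generic illustration of this machinery (not the proposed trial design), one Bayesian-optimization step with a Gaussian-process surrogate and an expected-improvement acquisition function might look like this:

```python
# Minimal sketch (generic illustration, not the proposed trial design):
# one Bayesian-optimization step with a Gaussian-process surrogate and an
# expected-improvement acquisition over a 2-D dose-combination grid.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(3)

def utility(doses):
    """Toy dose-combination utility (unknown in practice, evaluated per cohort)."""
    d1, d2 = doses[:, 0], doses[:, 1]
    return np.exp(-((d1 - 0.6) ** 2 + (d2 - 0.4) ** 2) / 0.1)

# A few evaluated dose combinations (scaled to [0, 1]^2) and their noisy utilities.
X_obs = rng.uniform(0, 1, size=(6, 2))
y_obs = utility(X_obs) + rng.normal(0, 0.02, size=6)

# Fit the GP surrogate to the observed (dose combination, utility) pairs.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_obs, y_obs)

# Expected improvement over a candidate grid of dose combinations.
grid = np.array([[a, b] for a in np.linspace(0, 1, 25) for b in np.linspace(0, 1, 25)])
mu, sd = gp.predict(grid, return_std=True)
best = y_obs.max()
z = (mu - best) / np.maximum(sd, 1e-9)
ei = (mu - best) * norm.cdf(z) + sd * norm.pdf(z)

next_dose = grid[np.argmax(ei)]   # dose combination to evaluate in the next cohort
print("next dose combination:", next_dose.round(2))
```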

 
