My research interests encompass two main directions: 1) Bayesian model development for next-generation sequencing (NGS) data, and 2) Bayesian adaptive designs, subgroup analysis, and causal inference for clinical trials. Below I summarize related current and future research projects.

MAD Bayes for Tumor Heterogeneity - Feature Allocation with Non-Normal Sampling

Tumor cell populations are composed of different subclones of cells, each of which is defined by a unique genome. The phenomenon of having heterogeneous subclones within or across tumors is called tumor heterogeneity (TH). Understanding TH provides opportunities for developing effective and precise treatment strategies for cancer patients, and challenges the traditional one-size-fits-all approach to cancer care. While innovative models for TH based on NGS data have been rapidly introduced in the literature, most existing methods cannot scale up to accommodate the large size of NGS data. To fill this gap, we propose small-variance asymptotic approximations for inference on TH using NGS data. Specifically, we develop a hierarchical model with an exponential family likelihood and a feature allocation prior based on the Indian buffet process (IBP). The IBP defines a prior probability model for a binary matrix whose columns define the (latent) subclones by a subset of mutations. Our results show that the proposed algorithm can successfully recover latent structures of different haplotypes and subclones. More importantly, compared with available Markov chain Monte Carlo samplers, which cannot scale to NGS data, the proposed algorithm is orders of magnitude faster and scalable. Given this computational advantage, we have developed a software tool that scientists can apply to real-life sequencing data.
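
To make the role of the IBP prior concrete, the following is a minimal sketch of a single prior draw of the binary mutation-by-subclone matrix, using the standard sequential construction of the IBP. It illustrates the prior only, not the small-variance asymptotic algorithm itself; the function name sample_ibp, the number of mutations, and the value of alpha are illustrative assumptions.

```python
import numpy as np


def sample_ibp(num_mutations, alpha, rng):
    """Draw a binary matrix Z from an Indian buffet process prior.

    Rows index mutations and columns index latent subclones, so
    Z[i, k] == 1 means mutation i is present in subclone k.
    (Hypothetical helper for illustration only.)
    """
    Z = np.zeros((num_mutations, 0), dtype=int)
    for i in range(num_mutations):
        if Z.shape[1] > 0:
            # Existing column k is chosen with probability m_k / (i + 1),
            # where m_k counts the earlier rows that already use column k.
            counts = Z[:i].sum(axis=0)
            Z[i] = rng.random(Z.shape[1]) < counts / (i + 1)
        # Row i also opens Poisson(alpha / (i + 1)) brand-new columns.
        num_new = rng.poisson(alpha / (i + 1))
        if num_new > 0:
            block = np.zeros((num_mutations, num_new), dtype=int)
            block[i] = 1
            Z = np.hstack([Z, block])
    return Z


rng = np.random.default_rng(0)
Z = sample_ibp(num_mutations=20, alpha=2.0, rng=rng)  # assumed toy sizes
print(Z.shape)  # (number of mutations, number of sampled subclones)
```

The number of columns is random, which is what lets the number of subclones be inferred rather than fixed in advance.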

Complex Network Analysis: Integration of Multiple Data Sources

The Cancer Genome Atlas (TCGA) project has generated abundant high-throughput molecular profiling data from a large number of patient samples across multiple cancer types, including calls of somatic mutations, measurements of DNA copy number variations, methylation, and expression quantifications of mRNAs, microRNAs, and proteins. Taking advantage of the availability of multiple modalities in the TCGA data, we perform a cross-platform integration of genomic features using a Bayesian graphical model and assemble a large-scale database and information system, called Zodiac. Zodiac reports computational results on the genetic interactions between features of 19,304 genes and 186,312,556 gene pairs, from a genome-wide large-scale data analysis based on the proposed Bayesian graphical models. As a web resource, Zodiac provides a user-friendly interface that allows genetic interactions to be visualized as graphs. Because inference under the Bayesian model is fully probabilistic, false discovery rates of reported graphs can be controlled using posterior probabilities, thus ensuring the quality of reported interactions. As a unique resource of analytic inferences based on TCGA data, Zodiac will be useful to a variety of cancer researchers.
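
One common way to control a false discovery rate with posterior probabilities is to rank candidate interactions by posterior probability and report the largest set whose average posterior probability of being a false positive stays below a target. The sketch below assumes this generic rule; it is not taken from the Zodiac implementation, and the function name and example probabilities are hypothetical.

```python
import numpy as np


def select_by_bayesian_fdr(post_prob, target_fdr):
    """Flag the largest top-ranked set whose expected FDR is <= target_fdr.

    post_prob[j] is the posterior probability that interaction j is real,
    so 1 - post_prob[j] is its posterior probability of being a false
    discovery. (Generic rule, assumed here for illustration.)
    """
    post_prob = np.asarray(post_prob, dtype=float)
    order = np.argsort(-post_prob)  # most probable interactions first
    false_prob = 1.0 - post_prob[order]
    # Expected FDR of reporting the top n interactions, for n = 1, 2, ...
    running_fdr = np.cumsum(false_prob) / np.arange(1, len(order) + 1)
    # running_fdr is non-decreasing, so this count is the largest valid n.
    n_keep = int(np.sum(running_fdr <= target_fdr))
    selected = np.zeros(len(post_prob), dtype=bool)
    selected[order[:n_keep]] = True
    return selected


# Made-up posterior probabilities for four candidate gene pairs.
flags = select_by_bayesian_fdr([0.99, 0.95, 0.60, 0.20], target_fdr=0.10)
print(flags)
```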

Nonparametric Bayesian Bi-Clustering for ChIP-Seq data

Histone modifications (HMs) play important roles in transcription through post-translational modifications. Combinations of HMs, known as chromatin signatures, encode specific messages for gene regulation. Inference on possible clustering of HMs and an annotation of genomic locations on the basis of such clustering can provide new insights about the functions of regulatory elements and their relationships to HM clusters. To this end, we propose a nonparametric Bayesian local clustering Poisson model (NoB-LCP) to facilitate posterior inference on two-dimensional clustering of HMs and genomic locations. A zero-enriched Polya urn prior is used to model random partitions of HMs and genomic locations. The NoB-LCP clusters HMs into HM sets and allows each HM set to define its own clustering of genomic locations. Furthermore, it probabilistically excludes HMs and genomic locations that are irrelevant to clustering. By doing so, the proposed model effectively identifies important sets of HMs and groups regulatory elements with similar functionality based on HM patterns. This work has been published in Bayesian Analysis. In the current model, the partitioning subsets are not allowed to overlap. One possible extension is to introduce overlapping clusters into the probability model, reflecting that one HM can feature in multiple biological processes.

A Bayesian Nonparametric Model to Evaluate Dynamic Treatment Regimes for Acute Leukemia

Dynamic treatment regimes in oncology and other disease areas are often characterized by an alternating sequence of treatments or other actions and transition times between disease states. The sequence of transition states may vary substantially from patient to patient, depending on how the regime plays out, and in practice there often are many possible counterfactual outcome sequences. For evaluating the regimes, the mean final overall time may be expressed as a weighted average of the means of all possible sums of successive transition times. A common example arises in cancer therapies where the transition times between various sequences of treatments, disease remission, disease progression, and death characterize overall survival time. For the general setting, I propose estimating mean overall outcome time by assuming a nonparametric Bayesian survival regression for the transition times. I construct a dependent Dirichlet process prior with Gaussian process base measure (DDP-GP). I summarize the joint posterior distribution by Markov chain Monte Carlo (MCMC) posterior simulation. Then I use likelihood-based G-estimation under the DDP-GP model to estimate causal effects by accounting for all possible outcome paths, the transition times between successive states, and the effects of covariates and previous outcomes on each transition time. Simulation studies suggest that the DDP-GP method yields more reliable estimates than the inverse probability of treatment weighting (IPTW) method.
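
The weighted-average representation of the mean overall time can be illustrated with a small arithmetic sketch. The outcome paths, their probabilities, and the mean transition times (in months) below are invented numbers for illustration only; in the proposed approach these quantities would come from the posterior under the DDP-GP model.

```python
# Each hypothetical outcome path has a probability under the regime and a
# list of mean transition times (months) between successive disease states.
paths = {
    "induction -> death": (0.2, [3.0]),
    "induction -> remission -> progression -> death": (0.5, [1.0, 12.0, 6.0]),
    "induction -> resistance -> salvage -> death": (0.3, [2.0, 8.0]),
}


def mean_overall_time(paths):
    """Weighted average, over paths, of the summed mean transition times."""
    total_prob = sum(prob for prob, _ in paths.values())
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("path probabilities must sum to one")
    return sum(prob * sum(times) for prob, times in paths.values())


# 0.2 * 3 + 0.5 * (1 + 12 + 6) + 0.3 * (2 + 8)
print(mean_overall_time(paths))
```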

Subgroup-Based Adaptive (SUBA) Designs for Multi-Arm Biomarker Trials

Targeted therapies based on biomarker profiling are becoming a mainstream direction of cancer research and treatment. Depending on the expression of specific prognostic biomarkers, targeted therapies assign different cancer drugs to subgroups of patients even if they are diagnosed with the same type of cancer by traditional means, such as tumor location. For example, Herceptin is indicated only for the subgroup of patients with HER2+ breast cancer, and not for other types of breast cancer. However, subgroups like HER2+ breast cancer with effective targeted therapies are rare, and most cancer drugs are still applied to large patient populations that include many patients who might not respond or benefit. To address these issues, we propose the SUBA design to simultaneously search for prognostic subgroups and allocate patients adaptively to the best subgroup-specific treatments throughout the course of the trial. SUBA uses a tree-type random partition on the biomarker space to define biomarker subgroups, which provides a flexible and simple mechanism to realize subgroup exploration as posterior inference on the partition. The main features of SUBA include the continuous reclassification of patient subgroups based on a random partition model and the adaptive allocation of patients to the best treatment arm based on posterior predictive probabilities. SUBA can also be applied outside a trial setting and formalized as an individualized treatment strategy for patient care.
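
As a rough illustration of allocation by posterior predictive probabilities, the sketch below considers a single, fixed biomarker subgroup with binary responses and independent Beta(1, 1) priors on each arm's response rate. This is a deliberate simplification and an assumption on my part: it omits the random partition that SUBA places on the biomarker space, and the function names and response counts are hypothetical.

```python
import numpy as np


def predictive_response_prob(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior predictive probability that the next patient responds.

    Under a Beta(prior_a, prior_b) prior and a binomial likelihood, this is
    the posterior mean (prior_a + successes) / (prior_a + prior_b + trials).
    """
    successes = np.asarray(successes, dtype=float)
    trials = np.asarray(trials, dtype=float)
    return (prior_a + successes) / (prior_a + prior_b + trials)


def allocate_next_patient(successes, trials):
    """Return the index of the arm with the highest predictive probability."""
    probs = predictive_response_prob(successes, trials)
    return int(np.argmax(probs)), probs


# Hypothetical responder counts for three arms within one subgroup.
arm, probs = allocate_next_patient(successes=[3, 7, 5], trials=[10, 10, 10])
print(arm, probs)
```

In the full design the subgroup membership itself is uncertain, so the predictive probabilities would be averaged over the posterior of the partition rather than computed for one fixed subgroup.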

A Latent Gaussian Process Model with Application to Clinical Trials

In many clinical trials, treatments need to be applied repeatedly because diseases relapse frequently after remission over a long period of time. Most research in statistics focuses on the overall trial design, such as sample size and power calculation, or on the data analysis after trials are completed. Little has been done to improve the efficiency of trial monitoring, such as early termination of trials due to futility. The challenge in such trial monitoring stems mostly from the need to properly model repeated outcomes from patients. We propose a Bayesian trial monitoring scheme for clinical trials with repeated and potentially cyclic binary outcomes. Statistical inference is based on a latent Gaussian process (LGP) model. The LGP describes the underlying latent process that gives rise to the observed longitudinal binary outcomes. Using the LGP, we propose efficient steps for trial monitoring, which include 1) forecasting future outcomes for each patient and 2) comparing overall patterns between different conditions (treatment vs. control). We also study the posterior consistency of the proposed model.
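
To show how a latent Gaussian process can give rise to repeated, cyclic binary outcomes, the sketch below simulates one patient's data. The periodic covariance kernel, the logistic link, and all parameter values are assumptions chosen for illustration; the source does not specify the kernel or link used in the LGP model.

```python
import numpy as np


def periodic_kernel(times, variance, length_scale, period):
    """Periodic covariance: nearby phases of the cycle are highly correlated."""
    diff = np.abs(times[:, None] - times[None, :])
    return variance * np.exp(
        -2.0 * np.sin(np.pi * diff / period) ** 2 / length_scale ** 2
    )


def simulate_binary_outcomes(times, rng, variance=1.0, length_scale=1.0,
                             period=6.0):
    """Simulate a latent GP path and binary outcomes via a logistic link."""
    cov = periodic_kernel(times, variance, length_scale, period)
    cov += 1e-8 * np.eye(len(times))  # small jitter for numerical stability
    latent = rng.multivariate_normal(np.zeros(len(times)), cov)
    prob = 1.0 / (1.0 + np.exp(-latent))  # assumed logistic link
    outcomes = rng.binomial(1, prob)  # 1 = event observed at that visit
    return latent, prob, outcomes


rng = np.random.default_rng(1)
visit_times = np.arange(24.0)  # hypothetical monthly visits over two years
latent, prob, outcomes = simulate_binary_outcomes(visit_times, rng)
print(outcomes)
```

Forecasting future outcomes for a patient then amounts to predicting the latent path at unobserved visit times, conditional on the outcomes seen so far.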