Te Tari Pāngarau me te Tatauranga
Department of Mathematics & Statistics

Archived seminars in Statistics

Seminars 1 to 50

Making Better Use of Genotyping-by-Sequencing Data

Jie Kang

Mathematics and Statistics, University of Otago

Date: Thursday 12 March 2020

Advances in sequencing technologies enable us to characterise variation in the genome of non-model but agriculturally important species. Approaches such as Genotyping-by-Sequencing (GBS) can produce abundant markers at relatively low cost. This has encouraged implementation of Genomic Selection (GS) to accelerate genetic gains in plant breeding, but has also raised the challenge of how to make better use of the genomic information. Unlike animal breeding, where high-quality reference genomes and well-developed modelling strategies already exist, a versatile analysis pipeline is needed for out-breeding plant species, such as perennial ryegrass (Lolium perenne). In addition, we want approaches that take the highly polymorphic nature of ryegrass into consideration when analysing (low-depth) GBS data. We thus hypothesise that existing GS models can be enhanced by accounting for short haplotypes or 'ShortHaps', that is, multiple variants in small genomic segments such as those captured within a GBS read. In this talk, I will (1) describe the bioinformatics workflows associated with ShortHaps calling, and (2) discuss why ShortHaps should work better than SNPs in terms of breeding value predictions and relatedness estimation.
Count Data Regression Models: Properties, Applications and Extensions

John Hinde

National University of Ireland, Galway

Date: Thursday 10 October 2019

The basis of regression models for count data is the Poisson log-linear model that can be applied to both raw counts and aggregate rates. In practice, many observed counts exhibit overdispersion, where the count variance is greater than the mean, and this can arise in many different ways. One specific source of overdispersion is the occurrence of excess zero counts, a situation often referred to as zero-inflation. Over the last 30 years or so many models for overdispersed and zero-inflated count data have been developed, although, in practice, distinguishing between these two aspects can be difficult. A less common phenomenon is that of underdispersion, where the count variance is less than the mean. Underdispersion has received little attention in the literature, although there are various simple ways in which it can arise. In this talk we will consider some families of count regression models that can incorporate both under- and overdispersion; these include extended Poisson–Tweedie models, the COM-Poisson model, and Gamma and Weibull-count models. I will discuss some of the possible causes of over- and underdispersion, the nature and basis of the various models, their estimation from a likelihood perspective, and software implementations (typically in R). The use of these models will be illustrated with examples from different application areas. I will also discuss the merits of different models and estimation approaches, implications for inference on covariates of interest, and a simple graphical approach for checking model adequacy and comparing competing models.
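As a rough illustration of the over- and underdispersion described above, the sketch below computes the sample dispersion index (variance divided by mean), which is close to 1 for Poisson counts. The function name and the example counts are made up for illustration; they are not taken from the talk, and the sketch is not one of the regression models it discusses.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 for Poisson data."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(counts) / m

# Hypothetical counts, chosen only to illustrate the two cases.
overdispersed = [0, 0, 0, 1, 0, 9, 0, 12, 0, 8]   # many zeros plus a few large values
underdispersed = [3, 4, 3, 4, 3, 4, 3, 4, 3, 4]   # counts packed tightly around the mean

print(dispersion_index(overdispersed))   # well above 1
print(dispersion_index(underdispersed))  # well below 1
```

An index far from 1 only suggests that a plain Poisson model may be inadequate; it does not say whether excess zeros or some other mechanism is responsible.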
Whakatipu te Mohiotanga o te Ira: Growing Māori capability and content in genetics-related education

Phillip Wilcox

Mathematics and Statistics, University of Otago

Date: Thursday 3 October 2019

This seminar will focus on recent efforts at the University of Otago to increase (a) Māori content in statistics, genetics and biochemistry courses, and (b) Māori involvement in genetics-based research and applications.
What on Earth is this? Applying deep learning for species recognition

Varvara Vetrova

University of Canterbury

Date: Thursday 19 September 2019

Can we use a smartphone camera, take a picture of an animal or a plant in the wild and identify its species automatically using convolutional neural networks? What about very similar-looking or rare organisms? How many images are enough? This talk will try to reveal some answers to these questions. This talk is based on the MBIE-funded research project "BioSecure-ID".
Instrumental Limit of Detection, Non-linear Sensors, and Statistics

Peter Dillingham

Mathematics and Statistics, University of Otago

Date: Thursday 12 September 2019

For more than 20 years, the International Union of Pure and Applied Chemistry (IUPAC) has recommended a probabilistic approach to defining the limit of detection (LOD) of analytical instruments. In this talk, I will describe the background to the recommendation, its link to Neyman-Pearson hypothesis testing, and practical implementation. Particularly, the process of estimating LOD and reporting its uncertainty is as necessary as correctly defining it. Finally, calculation, estimation, and scientific importance of LOD for ion-selective electrodes will be described. The detection threshold is determined by the distribution of the blank signal and an acceptable false positive rate (FP, set to 0.05). The limit of detection (LOD) is the smallest non-blank with power equal to 1 minus the false negative (FN) rate.
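The two definitions at the end of the abstract can be sketched numerically in the simplest possible setting. The code below assumes, purely for illustration, that the signal is Gaussian with a known, constant standard deviation and works in the signal domain; the non-linear sensors covered in the talk would need more care. The function names are hypothetical.

```python
from statistics import NormalDist

def detection_threshold(blank_mean, blank_sd, fp_rate=0.05):
    """Signal above which a reading is declared a detection.

    Assumes the blank signal is Gaussian with the given mean and SD.
    """
    z = NormalDist().inv_cdf(1 - fp_rate)
    return blank_mean + z * blank_sd

def limit_of_detection(blank_mean, blank_sd, fp_rate=0.05, fn_rate=0.05):
    """Smallest mean signal detected with probability 1 - fn_rate.

    Assumes the same Gaussian SD at the LOD as for the blank.
    """
    threshold = detection_threshold(blank_mean, blank_sd, fp_rate)
    z = NormalDist().inv_cdf(1 - fn_rate)
    return threshold + z * blank_sd

# With a standardised blank (mean 0, SD 1) and both error rates at 0.05,
# the LOD sits roughly 3.29 blank standard deviations above the blank mean.
print(detection_threshold(0.0, 1.0))
print(limit_of_detection(0.0, 1.0))
```

Converting the signal-domain LOD to a concentration requires the calibration curve, which this sketch leaves out.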
Single-Fit Bootstrapping

David Fletcher

Mathematics and Statistics, University of Otago

Date: Thursday 5 September 2019

The bootstrap is a useful tool for assessing the uncertainty associated with a frequentist parameter estimate. There are many variations of the basic idea, which involves simulation of new data, fitting the model to these data, and obtaining a new estimate. In this talk I will consider two settings in which we can avoid refitting the model. Use of such a "single-fit" bootstrap clearly has advantages when fitting the model is time-consuming.
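For reference, the conventional bootstrap loop described above (simulate new data, refit, re-estimate) might look like the sketch below. This is the baseline that refits on every replicate, not the single-fit approach of the talk; the data values, the function name and the choice of a nonparametric resampling scheme are assumptions made for illustration.

```python
import random
from statistics import mean, stdev

def bootstrap_se(data, estimator, n_boot=2000, seed=1):
    """Bootstrap standard error of estimator(data) by resampling with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=len(data))  # simulate a new data set
        estimates.append(estimator(resample))      # "refit" and store the new estimate
    return stdev(estimates)

# Hypothetical measurements; the estimator here is simply the sample mean.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]
print(bootstrap_se(data, mean))
```

When `estimator` is an expensive model fit rather than a sample mean, the loop runs that fit `n_boot` times, which is the cost a single-fit scheme aims to avoid.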
An overview of respondent driven sampling

Lisa Avery

University of Otago Mathematics and Statistics Department

Date: Thursday 22 August 2019

Respondent driven sampling is essentially glorified snowball sampling. However, it is arguably the best method we have of measuring the health of difficult-to-reach populations, and in particular disease prevalence (HIV is commonly studied using RDS). I will review the sampling method and some of the most popular prevalence estimators, and highlight some of the difficulties in drawing inferences from these samples. In particular, I’ll talk about difficulties we’ve encountered applying regression methods to these samples.
NeighborNet, with plans for a sequel

David Bryant

Mathematics and Statistics, University of Otago

Date: Thursday 15 August 2019

NeighborNet is an unsupervised clustering algorithm developed mainly for evolutionary genetics. It is hierarchical; however, unlike standard classification algorithms, it infers clusters that can overlap. The method has proved quite popular across a range of disciplines, with several thousand citations, but hasn't yet been picked up much by statisticians. We are currently working on a sequel which will be more computationally efficient and, hopefully, a bit more elegant. I'll spend most of this seminar introducing the method, then go on to talk about the current research directions.
Autoencoders, archetypal analysis and an application

Matthew Parry

Department of Mathematics and Statistics

Date: Thursday 8 August 2019

The idea of an autoencoder comes from machine learning but it is implicit in a number of statistical techniques. I give a brief review of autoencoders, focusing mainly on their use as a dimension reduction technique and how one can construct probabilistic versions of autoencoders. I then pivot to the archetypal analysis of Cutler and Breiman (1994), which is a form of cluster analysis given in terms of extremal points, i.e. archetypes. Following Bauckhage et al. (2015), I show how archetypal analysis can be viewed as an autoencoder. I finish with an application of archetypal analysis to data imputation. This is based on joint work with Pórtya Piscitelli Cavalcanti.
Computer vision for culture and heritage

Steven Mills

Department of Computer Science

Date: Tuesday 30 July 2019

In this talk I will present some of our recent and ongoing work, with an emphasis on cultural and heritage applications. These include historic document analysis, 3D modelling for archaeology and recording the built environment, and tracking for augmented spectator experiences. I will also outline some of the outstanding issues we have where collaboration with mathematicians and statisticians might be valuable.
The reliability of latent class models for diagnostic testing with no gold standard

Matthew Schofield

Department of Mathematics and Statistics

Date: Thursday 18 July 2019

Latent class models are commonly used for diagnostic testing in situations where there is no gold standard. Our motivating example is a Leptospirosis study in Tanzania, where four possible testing procedures were considered. A two-state latent class model appears to fit the data well but returns estimates that do not conform to prior expectations. The diagnostic test that was believed to be most reliable was estimated as the worst of the four. In this talk we attempt to understand this problem. We show using simulation that the assumption that the latent class corresponds to disease status can be problematic. This can lead to large bias in the estimated sensitivities while having minimal effect on the fit of the model.
Managing sensitive data within distributed software systems

David Eyers

Department of Computer Science

Date: Thursday 23 May 2019

Cloud computing and the Internet of Things are making distributed software systems increasingly commonplace. Within these systems, an increasing volume of sensitive data is being transferred, such as personally identifiable information. This talk examines some of the mechanisms I have explored with collaborators that aim to assist software developers to build systems that can handle sensitive data in a more secure and accountable manner.
David has broad research interests in computer science topics, including distributed systems and information security. One theme of his research has been seeking security techniques that are usable and accessible to end users and software developers.
The life of a consulting biometrician

Assoc. Prof. Darryl MacKenzie

Proteus & Department of Mathematics and Statistics

Date: Thursday 16 May 2019

In this Statchat-style presentation, I will talk about being a consulting biometrician/statistician. I shall cover a range of topics including: what led me to becoming one, the types of projects that I’ve been involved with, statistically interesting applications, skills I’ve learnt and what skills I’ve found most useful, and the highs and lows of the job. I’ll also talk about practical aspects such as frequency of work, charge-out rates, etc. Students who are considering life after study are encouraged to come along to hear more about a non-academic career option.
Better understanding the effect of using poorly imputed genotypes in genetic evaluations

Michael Lee

University of Otago Statistics

Date: Thursday 9 May 2019

The metric for selection of animals in a breeding program is generally based on breeding values, which are random effects predicted via Best Linear Unbiased Prediction (BLUP). Increasingly, genomic information from individual animals is also included to better predict breeding values, termed genomic breeding values (GBVs), with Single Step Genomic BLUP (ssGBLUP). In the NZ sheep industry, imputation is used to make the prediction of GBVs more cost effective by allowing a lower density of markers to be used. This seminar will describe the process used to predict GBVs and, in particular, some results associated with the inclusion of genotypes that are imputed inaccurately.
A tale of two paradigms: A case study of frequentist and Bayesian modelling for genetic linkage maps

Timothy Bilton

Department of Mathematics and Statistics

Date: Thursday 2 May 2019

A genetic linkage map shows the relative position of and genetic distance between genetic markers, positions of the genome which exhibit variation, and underpins the study of species' genomes in a number of scientific applications. Genetic maps are constructed by tracking the transmission of genetic information from individuals to their offspring, which is frequently modelled using a hidden Markov model (HMM) since only the expression and not the transmission of genetic information is observed. Typically, HMMs for genetic maps are fitted using maximum likelihood. However, the uncertainty associated with genetic map estimates is rarely presented, and construction of confidence intervals using traditional frequentist methods is difficult, as many of the parameter estimates lie on the boundary of the parameter space. We investigate Bayesian approaches for fitting HMMs of genetic maps to facilitate characterizing uncertainty, and consider including a hierarchical component to improve estimation. Focus is given to constructing genetic maps using high-throughput sequencing data. Using simulated and real data, we compare the frequentist and Bayesian approaches and examine some of their properties. Lastly, the advantages/disadvantages of the two procedures and some issues encountered are discussed.
Biostatistics in nutrition-related research

Dr Jill Haszard

Division of Sciences Biostatistician

Date: Thursday 18 April 2019

Working as a biostatistician in the Department of Human Nutrition has exposed me to a wide variety of study designs and data. In particular, I handle a large amount of dietary data and am familiar with many of the statistical methods that are used to overcome the difficulties inherent when investigating dietary intake and nutritional status. As well as nutrition studies, I am also involved in studies exploring the influence of physical activity, sedentary behaviour, and sleep – all of which co-exist in a constrained space (the 24-hour day). This type of data requires compositional data analysis. However, using compositional data analysis needs careful interpretation of the statistical output. This is also an issue when analysing studies that assess associations with the gut microbiota.
Time-inhomogeneous hidden Markov models for incompletely observed point processes

Amina Shahzadi

Department of Mathematics and Statistics

Date: Thursday 11 April 2019

Natural phenomena such as earthquakes and volcanic eruptions can be modelled using point processes with the primary aim of predicting future hazard based on past data. However, this is complicated and potentially biased by the problem of missing data in the records. The degree of completeness of the records varies dramatically over time. Often the older the record is, the more incomplete it is. We developed different types of time-inhomogeneous hidden Markov models (HMMs) to tackle the problem of time-varying missing data in volcanic eruption records. In these models, the hidden process has states of completeness and incompleteness. The state of completeness represents no missing events between each pair of consecutively observed events. The states of incompleteness represent different mean numbers of missing events between each pair of consecutively observed events. We apply the proposed models to a global volcanic eruption record to analyze the time-dependent incompleteness and demonstrate how we estimate the completeness of the record and the future hazard rate.
What everyone who uses and teaches confidence intervals should know

Richard Barker

PVC, Division of Sciences

Date: Thursday 4 April 2019

The meaning of a confidence interval is one of those things that everyone thinks they know until they are asked to explain what it is. Confidence intervals have some surprising properties that call into question their value as an inferential tool. Using a couple of simple examples I discuss these and related foundational issues.
Non-linear economic value functions in breeding objectives

Cheryl Quinton

AbacusBio Limited, Dunedin

Date: Thursday 28 March 2019

Genetic improvement programs typically include a breeding objective that describes the traits of interest in the program and their importance. A breeding objective function is built that calculates the overall genetic value for each individual in a population based on the aggregate of trait predictions of genetic merit and trait economic values (EV). Most breeding objective functions are built as linear functions. Linear EVs have several advantages for ease of implementation and common quantitative genetics calculations, but they may be over-simplifications for diverse populations that span a wide range of economic and biological conditions. We have been helping an increasing number of breeding programs by applying non-linear EV functions. In these cases, more complex functions such as quadratic, exponential, and combinations are built to calculate the contribution of an individual's trait value to the overall aggregate merit combining multiple traits. Although breeding objectives with non-linear EV functions are more complex to implement, they can provide more specific and more robust valuation of traits and therefore of each individual's overall genetic value. In this presentation, we describe some non-linear EV functions for prolificacy, wool quality, dystocia, and maternal ability in sheep and cattle breeding objectives.
CEBRA: mathematical and statistical solutions to biosecurity risk challenges

Andrew Robinson

University of Melbourne

Date: Thursday 21 March 2019

CEBRA is the Centre of Excellence for Biosecurity Risk Analysis, jointly funded by the Australian and New Zealand governments. Our problem-based research focuses on developing and implementing quantitative tools to assist in the management of biosecurity risk at national and international levels. I will describe a few showcase mathematical and statistical projects, underline some of our soaring successes, underplay our dismal failures, and underscore the lessons that we've learned.
Bayesian inference and model selection for stochastic epidemics

Simon Spencer

University of Warwick

Date: Thursday 14 March 2019

Model fitting for epidemics is challenging because not all of the information needed to write down the likelihood function is observable, for example the times of infection and recovery are not usually observed. Furthermore, the data that are available from diagnostic tests may not be perfectly accurate. These considerations are typically overcome by applying computationally intensive data augmentation techniques such as Markov chain Monte Carlo. To make things even more difficult, most of the interesting epidemiological questions are best expressed as model selection problems and so fitting just one model is not sufficient to answer them. Instead we must fit a range of different models, each representing an important epidemiological hypothesis, and then make meaningful comparisons between them. I will describe how to overcome (most of) these difficulties to learn about the epidemiology of Escherichia coli O157:H7 in cattle. This is joint work with Panayiota Touloupou, Bärbel Finkenstädt Rand, Pete Neal and TJ McKinley.
Fast evaluation of study designs for spatially explicit capture-recapture

Murray Efford

Department of Mathematics and Statistics

Date: Thursday 7 March 2019

The density of some animal populations is routinely estimated by the method of spatially explicit capture–recapture using data from automatic cameras, traps or DNA hair snags. However, data collection is expensive and most studies do not meet minimum standards for precision. Improved study design is the key to improved power. Simulation is often recommended for evaluating study designs, but it can be painfully slow. Another approach for evaluating novel designs is to compute intermediate variables such as the expected number of detected individuals E(n) and the expected number of recapture events E(r). Computation of E(n) and E(r) is deterministic and much faster than simulation. Intriguingly, the relative standard error of estimated density is closely approximated by the reciprocal of the square root of whichever is smaller, and for maximum precision E(n) is approximately equal to E(r). I show how these findings can be applied in interactive software for designing ecological studies.
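The rule of thumb stated above can be written down directly. The sketch below takes E(n) and E(r) as given; computing them requires a detection model and a detector layout, which are not shown here. The function name and the example values are hypothetical.

```python
from math import sqrt

def approx_rse(expected_n, expected_r):
    """Approximate relative standard error of estimated density.

    Uses the approximation quoted in the abstract: the reciprocal of the
    square root of the smaller of E(n) and E(r).
    """
    smaller = min(expected_n, expected_r)
    if smaller <= 0:
        raise ValueError("E(n) and E(r) must both be positive")
    return 1 / sqrt(smaller)

# Hypothetical design expected to detect 40 individuals with 25 recaptures.
print(approx_rse(40, 25))  # 0.2, i.e. a relative standard error of about 20%
```

Because the smaller of the two quantities drives the approximation, a design that raises E(n) while leaving E(r) low gains little precision, which fits the remark that precision is greatest when the two are roughly equal.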
Accounting for read depth in analyses of genotypes from low-depth sequencing

Ken Dodds


Date: Thursday 28 February 2019

Sequencing technology provides information on genomes including an individual’s genotype at a site with variation, such as single nucleotide polymorphisms (SNPs). To reduce costs, the sequencing protocol may be designed to interrogate only a subset of the genome (but spread across the genome). One such method is known as genotyping-by-sequencing (GBS). A genotype consists of the pair of genetic types (alleles) at a particular position. Each sequencing result delivers a read from one member of the pair, and so does not guarantee that both alleles are seen, even when there are two or more reads at the position. Methods for accounting for this issue will be described for several different analyses including the estimation of relatedness, parentage assignment, testing for Hardy-Weinberg equilibrium and the description of the genetic diversity of a population. These methods have applications in plant and animal breeding, ecology and conservation.
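To see why low read depth matters, consider a heterozygous individual. The sketch below assumes that each read is drawn independently from either allele with probability 1/2 and that there are no sequencing errors; these are simplifying assumptions for illustration, not the methods of the talk, and the function name is hypothetical.

```python
def prob_both_alleles_seen(depth):
    """Probability that both alleles of a heterozygote appear among `depth` reads.

    Assumes independent reads, each equally likely to come from either allele.
    """
    if depth < 0:
        raise ValueError("read depth cannot be negative")
    if depth == 0:
        return 0.0
    # All reads show the same allele with probability 2 * (1/2)**depth.
    return 1 - 2 * 0.5 ** depth

for depth in range(1, 5):
    print(depth, prob_both_alleles_seen(depth))
```

Under these assumptions a heterozygote with two reads is seen as a heterozygote only half the time, so treating observed genotypes as true genotypes would overstate homozygosity at low depth.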
Modeling high-dimensional intermittent hypoxia

Abdis Sattar

Department of Population and Quantitative Health Sciences, Case Western Reserve University, USA

Date: Tuesday 19 February 2019

Many remarkable advances have been made in nonparametric and semiparametric methods for high-dimensional longitudinal data. However, these methods lack a way of addressing missing data. Motivated by the oxygenation of retinopathy of prematurity (ROP) study, we developed a penalized spline mixed effects model for a high-dimensional nonlinear longitudinal continuous response variable using the Bayesian approach. The ROP study is complicated by the fact that there are non-ignorable missing response values. To address the non-ignorable missing data in the Bayesian penalized spline model, we applied a selection model. Properties of the estimators are studied using Markov Chain Monte Carlo (MCMC) simulation. In the simulation study, data were generated with three different percentages of non-ignorable missing values, and three different sample sizes. Parameters were estimated under various scenarios. The proposed new approach performed better than the semiparametric mixed effects model with non-ignorable missing values under the missing at random (MAR) assumption in terms of bias and percent bias in all scenarios of non-ignorable missing longitudinal data. We performed sensitivity analysis for the hyper-prior distribution choices for the variance parameters of spline coefficients on the proposed joint model. The results indicated that the half-t distribution with three different degrees of freedom did not influence the posterior distribution. However, the inverse-gamma distribution as a hyperprior density did influence the posterior distribution. We applied our novel method to the sample entropy data in the ROP study for handling nonlinearity and the non-ignorable missing response variable. We also analyzed the sample entropy data under missing at random.
Using functional data analysis to exploit high-resolution “Omics” data

Marzia Cremona

Penn State University

Date: Wednesday 30 January 2019

Recent progress in sequencing technology has revolutionized the study of genomic and epigenomic processes, by allowing fast, accurate and cheap whole-genome DNA sequencing, as well as other high-throughput measurements. Functional data analysis (FDA) can be broadly and effectively employed to exploit the massive, high-dimensional and complex “Omics” data generated by these technologies. This approach involves considering “Omics” data at high resolution, representing them as “curves” of measurements over the DNA sequence. I will demonstrate the effectiveness of FDA in this setting with two applications. In the first one, I will present a novel method, called probabilistic K-mean with local alignment, to locally cluster misaligned curves and to address the problem of discovering functional motifs, i.e. typical “shapes” that may recur several times along and across a set of curves, capturing important local characteristics of these curves. I will demonstrate the performance of the method on simulated data, and I will apply it to discover functional motifs in “Omics” signals related to mutagenesis and genome dynamics. In the second one, I will show how a recently developed functional hypothesis test, IWTomics, and multiple functional logistic regression can be employed to characterize the genomic landscape surrounding transposable elements, and to detect local changes in the speed of DNA polymerization due to the presence of non-canonical 3D structures.
Bayesian Latent Class Analysis for Diagnostic Test Evaluation

Geoff Jones

Massey University

Date: Thursday 25 October 2018

Evaluating the performance of diagnostic tests for infection or disease is of crucial importance, both for the treatment of individuals and for the monitoring of populations. In many situations there is no “gold standard” test that can be relied upon to give 100% accuracy, and the use of a particular test will typically lead to false positive or false negative outcomes. The performance characteristics of an imperfect test are summarized by its sensitivity, i.e. the probability of correct diagnosis for a diseased individual, and its specificity, i.e. the probability of a correct diagnosis when disease-free. When these parameters are known, valid statistical inference can be made for the disease status of tested individuals and the prevalence of disease in a monitored population. In the absence of a “gold standard”, true disease status is unobservable so the sensitivity and specificity cannot be reliably determined in the absence of additional information. In some circumstances, information from a number of imperfect tests allows estimation of the prevalence, sensitivities and specificities even in the absence of gold standard data. Latent class analysis in a Bayesian framework gives a flexible and comprehensive way of doing this which has become common in the epidemiology literature. This talk will give an introduction and review of the basic ideas, and highlight some of the current research in this area.
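The known-parameter case mentioned above can be made concrete. The sketch below relates true prevalence to the proportion testing positive and inverts that relation, a correction often called the Rogan-Gladen estimator. It covers only the simple situation where sensitivity and specificity are known, not the Bayesian latent class analysis that is the subject of the talk; the function names and numbers are illustrative assumptions.

```python
def apparent_prevalence(prevalence, sensitivity, specificity):
    """Expected proportion testing positive with an imperfect test."""
    return sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

def true_prevalence(apparent, sensitivity, specificity):
    """Invert apparent_prevalence, truncating the result to [0, 1]."""
    denom = sensitivity + specificity - 1
    if denom <= 0:
        raise ValueError("the test must satisfy sensitivity + specificity > 1")
    estimate = (apparent + specificity - 1) / denom
    return min(1.0, max(0.0, estimate))

# Hypothetical test: 90% sensitive, 95% specific, true prevalence 10%.
observed = apparent_prevalence(0.10, 0.90, 0.95)
print(observed)                                # proportion expected to test positive
print(true_prevalence(observed, 0.90, 0.95))   # recovers roughly 0.10
```

Without a gold standard the sensitivity and specificity passed in here are themselves unknown, which is the gap that combining several imperfect tests is meant to fill.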
Applied Data Science: A Small Business Perspective

Benoit Auvray

Iris Data Science & Department of Mathematics and Statistics

Date: Thursday 27 September 2018

Iris Data Science is a small Dunedin company established in 2013 providing data science solutions (predictive analytics) for clients in a range of areas, particularly in the agricultural and health sectors. In this talk, we will briefly describe deep learning, a machine learning tool we use extensively at Iris Data Science, and give a few examples of our work for some of our clients. We will also discuss the term “data scientist” and share our experiences running a small business using data science, statistics and machine learning as part of our core service offering. Finally, we will outline some of the practical aspects of developing a predictive tool for commercial use, from data collection and storage to timely and convenient delivery of the predictive model outputs to a client.
Bayesian Hierarchical Modelling

Matt Schofield

Department of Mathematics and Statistics

Date: Thursday 20 September 2018

Bayesian hierarchical modelling is an increasingly popular approach for data analysis. This talk is intended to introduce Bayesian hierarchical models with the aid of examples from genetics, anatomy and ecology. We will discuss various advantages to using such models, including improved estimation and a better description of the underlying scientific process. If time permits, we will also consider situations where hierarchical models may lead to misleading conclusions and a healthy dose of skepticism is required.
A missing value approach for breeding value estimation

Alastair Lamont

Department of Mathematics and Statistics

Date: Thursday 13 September 2018

For a particular trait, an individual’s breeding value is the genetic value it has for its progeny. Accurate breeding value estimation is a critical component of selective breeding, necessary to identify which animals will have the best offspring. As technology has improved, genetic data is often available, and can be utilised for improved breeding value estimation. While it is cost efficient to genotype some animals, it is unfeasible to genotype every individual in most populations of interest, due to either cost or logistical issues. This missing data creates challenges in the estimation of breeding values. Most modern approaches tend to impute or average over the missing data in some fashion, rather than fully incorporating it into the model. I will discuss how statistical models that account for inheritance can be specified and fitted, in work done jointly with Matthew Schofield and Richard Barker. Including inheritance allows missing genotype data to be natively included within the model, while directly working with scientific theory.
A 2D hidden Markov model with extra zeros for spatiotemporal recurrence patterns of tremors

Ting Wang

Department of Mathematics and Statistics

Date: Thursday 6 September 2018

Tectonic tremor activity was observed to accompany slow slip events in some regions. Slow slip events share a similar occurrence style to that of megathrust earthquakes and have been reported to have occurred within the source region of some large megathrust earthquakes. Finding the relationship among the three types of seismic activities may therefore aid forecasts of large destructive earthquakes. Before examining their relationship, it is essential to understand quantitatively the spatiotemporal migration patterns of tremors.

We developed a 2D hidden Markov model to automatically analyse and forecast the spatiotemporal behaviour of tremor activity in the regions Kii and Shikoku, southwest Japan. This new automated procedure classifies the tremor source regions into distinct segments in 2D space and infers a clear hierarchical structure of tremor activity, where each region consists of several subsystems and each subsystem contains several segments. The segments can be quantitatively categorized into three different types according to their occurrence patterns: episodic, weak concentration, and background. Moreover, a significant increase in the proportion of tremor occurrence was detected in a segment in southwest Shikoku before the 2003 and 2010 long-term slow slip events in the Bungo channel. This highlights the possible correlation between tectonic tremor and slow slip events.
Developing forage cultivars for the grazing industries of New Zealand

Zulfi Jahufer

AgResearch and Massey University

Date: Thursday 23 August 2018

Grass and legume based swards play a key role in forage dry matter production for the grazing industries of New Zealand. The genetic merit of this feed base is a primary driver in the profitability, production and environmental footprint of our pastoral systems. A significant challenge to sustainability of this dynamic ecosystem will be climate change. Elevation of ambient temperature and increases in the occurrence of moisture stress events will be a major constraint to forage plant vegetative persistence and seasonal dry matter production. Successful animal breeding has resulted in developing breeds that have higher feed requirements, resulting in increased grazing pressure on swards. The forage science group at AgResearch is actively focused on developing high merit forage grass, legume and herb cultivars. The aim is to optimise plant breeding systems and maximise rates of genetic gain applying conventional plant breeding methods, high throughput phenotyping and new molecular research tools.

~~Dr Zulfi Jahufer is a senior research scientist in quantitative genetics and forage plant breeding. He also conducts the Massey University course in plant breeding. His seminar will focus on the development of novel forage grass and legume cultivars; he will also introduce the new plant breeding software tool DeltaGen.~~
Lattice polytope samplers for statistical inverse problems

Martin Hazelton

Massey University

Date: Thursday 16 August 2018

Statistical inverse problems occur when we wish to learn about some random process that is observed only indirectly. Inference in such situations typically involves sampling possible values for the latent variables of interest conditional on the indirect observations. This talk is concerned with linear inverse problems for count data, for which the latent variables are constrained to lie on the integer lattice within a convex polytope (a bounded multidimensional polyhedron). An illustrative example arises in transport engineering where we observe vehicle counts entering or leaving each zone of the network, then want to sample possible inter-zonal patterns of traffic flow consistent with those entry/exit counts. Other areas of application include inference for contingency tables, and capture-recapture modelling in ecology.

In principle such sampling can be conducted using Markov chain Monte Carlo methods, through a random walk on the lattice polytope. However, it is challenging to design algorithms for doing so that are both computationally efficient and have guaranteed theoretical properties. In this talk I will describe some current work that seeks to combine methods from algebraic statistics with geometric insights in order to develop and study new polytope samplers that address these issues.
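To make the idea of a random walk on a lattice polytope concrete, here is a minimal Python sketch for the contingency-table case mentioned above. It is not the sampler described in the talk: it uses the classical +1/-1 moves on 2x2 sub-tables, which keep every row and column sum fixed, and rejects any proposal that would leave the polytope (a negative cell). The function names and the example table are illustrative assumptions.

```python
import random


def margins(table):
    """Return (row sums, column sums) of a table stored as a list of lists."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols


def random_walk_step(table, rng):
    """One step of a random walk over non-negative integer tables with fixed margins.

    Two distinct rows (i, j) and two distinct columns (k, l) are chosen at
    random, and +sign / -sign is added in a 2x2 checkerboard pattern. The
    move leaves all row and column sums unchanged. A proposal that would
    create a negative cell is rejected and the chain stays where it is.
    The proposal is symmetric, so the chain targets the uniform
    distribution on the set of tables it can reach.
    """
    n_rows, n_cols = len(table), len(table[0])
    i, j = rng.sample(range(n_rows), 2)
    k, l = rng.sample(range(n_cols), 2)
    sign = rng.choice((1, -1))
    new = [row[:] for row in table]
    new[i][k] += sign
    new[j][l] += sign
    new[i][l] -= sign
    new[j][k] -= sign
    if min(new[i][k], new[j][l], new[i][l], new[j][k]) < 0:
        return table  # proposal left the polytope: reject
    return new


# Illustrative (made-up) 3x3 table of counts.
rng = random.Random(1)
state = [[3, 1, 2], [0, 4, 1], [2, 2, 5]]
for _ in range(1000):
    state = random_walk_step(state, rng)
print(state, margins(state))
```

Walks of this simple kind can mix slowly when many cells are near zero, which is one reason the design of efficient samplers with guaranteed properties is hard.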
A faster algorithm for updating the likelihood of a phylogeny

David Bryant

Department of Mathematics and Statistics

Date: Thursday 9 August 2018

##Note day and time. A joint Mathematics and Statistics seminar taking place in the usual slot for Statistics seminars##
Both Bayesian and Maximum Likelihood approaches to phylogenetic inference depend critically on a dynamic programming algorithm developed by Joe Felsenstein over 35 years ago. The algorithm computes the probability of sequence data conditional on a given tree. It is executed for every site, every set of parameters, every tree, and is the bottleneck of phylogenetic inference. This computation comes at a cost: Herve Philippe estimated that his research-associated computing (most of which would have been running Felsenstein's algorithm) resulted in an emission of over 29 tons of $CO_2$ in just one year. In the talk I will introduce the problem and describe an updating algorithm for likelihood calculation which runs in worst case O(log ~~n~~) time instead of O(~~n~~) time, where ~~n~~ is the number of leaves/species. This is joint work with Celine Scornavacca.
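For readers unfamiliar with Felsenstein's algorithm, the following Python sketch computes the likelihood of a single alignment column on a rooted binary tree. It is only the standard full recomputation, not the faster updating algorithm of the talk. The Jukes-Cantor substitution model, the uniform root frequencies, and the nested-tuple tree encoding are assumptions chosen to keep the example short.

```python
import math

BASES = "ACGT"


def jc69_prob(same, t):
    """Jukes-Cantor transition probability along a branch of length t
    (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial_likelihoods(node):
    """Return L[x] = P(data below node | node is in state x), for x in BASES.

    A leaf is a one-character string such as "A"; an internal node is a
    tuple (left, t_left, right, t_right) holding the two children and the
    lengths of the branches leading to them.
    """
    if isinstance(node, str):
        return [1.0 if b == node else 0.0 for b in BASES]
    left, t_left, right, t_right = node
    l_left = partial_likelihoods(left)
    l_right = partial_likelihoods(right)
    out = []
    for x in range(4):
        s_left = sum(jc69_prob(x == y, t_left) * l_left[y] for y in range(4))
        s_right = sum(jc69_prob(x == y, t_right) * l_right[y] for y in range(4))
        out.append(s_left * s_right)
    return out


def site_likelihood(tree):
    """Likelihood of one alignment column, with uniform root frequencies."""
    return sum(0.25 * value for value in partial_likelihoods(tree))


# Hypothetical three-species tree: ((A:0.1, C:0.2):0.05, A:0.3)
tree = (("A", 0.1, "C", 0.2), 0.05, "A", 0.3)
print(site_likelihood(tree))
```

Each call visits every node once, so recomputing the likelihood from scratch after a change costs time linear in the number of leaves.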
Sequential Inference with the Finite Volume Method

Richard Norton

Department of Mathematics and Statistics

Date: Thursday 2 August 2018

Filtering or sequential inference aims to determine the time-dependent probability distribution of the state of a dynamical system from noisy measurements at discrete times. At measurement times the distribution is updated via Bayes' rule, and between measurements the distribution evolves according to the dynamical system. The operator that maps the density function forward in time between measurements is called the Frobenius-Perron operator. I will show how to compute the action of the Frobenius-Perron operator with the finite volume method, a method more commonly used in fluid dynamics to solve PDEs.
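The two alternating steps of filtering can be sketched on a grid of cells as follows. This is a generic discrete Bayes filter in Python, not the finite volume scheme of the talk: the transition matrix simply stands in for a discretised Frobenius-Perron operator, and all numbers are made up for illustration.

```python
def bayes_update(prior, likelihood):
    """Measurement step: posterior cell probabilities from prior cell
    probabilities and the likelihood of the observation in each cell."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def propagate(prob, transition):
    """Prediction step: push cell probabilities forward in time, where
    transition[i][j] is the probability of moving from cell i to cell j."""
    n = len(prob)
    return [sum(prob[i] * transition[i][j] for i in range(n)) for j in range(n)]


# Hypothetical three-cell example.
prior = [0.25, 0.5, 0.25]
likelihood = [0.1, 0.2, 0.7]          # assumed P(observation | cell)
transition = [[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]]        # each row sums to 1

posterior = bayes_update(prior, likelihood)
predicted = propagate(posterior, transition)
print(posterior, predicted)
```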
Adaptive sequential MCMC for combined state and parameter estimation

Zhanglong Cao

Mathematics and Statistics Department University of Otago

Date: Thursday 19 July 2018

Most algorithms for combined state and parameter estimation in state space models either estimate the states and parameters by incorporating the parameters in the state space, or marginalize out the parameters through sufficient statistics. In the case of a linear state space model and starting with a joint distribution over states, observations and parameters, we implement an MCMC sampler with two phases. In the learning phase, a self-tuning sampler is used to learn the parameter mean and covariance structure. In the estimation phase, the parameter mean and covariance structure informs the proposal mechanism and is also used in a delayed-acceptance algorithm, which greatly improves sampling efficiency. Information on the resulting state of the system is given by a Gaussian mixture. In on-line mode, the algorithm is adaptive and uses a sliding window approach by cutting off historical data to accelerate sampling speed and to maintain appropriate acceptance rates. We apply the algorithm to joint state and parameter estimation in the case of irregularly sampled GPS time series data.
Modelling multilevel spatial behaviour in binary-mark muscle fibre configurations

Tilman Davies

Mathematics and Statistics Department University of Otago

Date: Thursday 12 July 2018

The functional properties of skeletal muscles depend on the spatial arrangements of fast and slow muscle fibre types. Qualitative assessment of muscle configurations suggests that muscle disease and normal ageing are associated with visible changes in the spatial pattern, though a lack of statistical modelling hinders our ability to formally assess such trends. We design a nested Gaussian CAR model to quantify spatial features of dichotomously-marked muscle fibre networks, and implement it within a Bayesian framework. Our model is applied to data from a human skeletal muscle, and results reveal spatial variation at multiple levels across the muscle. The model provides the foundation for future research in describing the extent of change to normal muscle fibre type parameters under experimental or pathological conditions. Joint work with Matt Schofield (Maths & Stats); Jon Cornwall (School of Medicine); and Phil Sheard (Physiology).
Where does your food really come from?

Georgia Anderson


Date: Thursday 31 May 2018

Oritain is a scientific traceability company that verifies the origin of food, fibre and pharmaceutical products by analysing the presence of trace elements and isotopes in the product. Born in the research labs at the Chemistry Department in the University of Otago, Oritain has grown to become a multinational company with offices in Dunedin, London, and Sydney, and with clients from around the globe.

Oritain measures a product's origin using 'chemical fingerprints' derived from the compositions of plants and animals. These compounds vary naturally throughout the environment. Multivariate statistical methods such as principal component analysis and linear discriminant analysis are used to extract information and determine this fingerprint from the trace element and isotopic data.

This talk will present the science used at Oritain and explore how statistics is used in a commercial environment.
Project presentations

Honours and PGDip students

Department of Mathematics and Statistics

Date: Friday 25 May 2018

Qing Ruan : ~~Bootstrap selection in kernel density estimation with edge correction~~
Willie Huang : ~~Autoregressive hidden Markov model - an application to tremor data~~

Tom Blennerhassett : ~~Modelling groundwater flow using Finite Elements in FEniCS~~
Peixiong Kang : ~~Numerical solution of the geodesic equation in cosmological spacetimes with acausal regions~~
Lydia Turley : ~~Modelling character evolution using the Ornstein Uhlenbeck process~~
Ben Wilks : ~~Analytic continuation of the scattering function in water waves~~
Shonaugh Wright : ~~Hilbert spaces and orthogonality~~
Jay Bhana : ~~Visualising black holes using MATLAB~~
Modelling the evolution of sex-specific dominance in response to sexually antagonistic selection

Hamish Spencer

Department of Zoology

Date: Thursday 24 May 2018

Arguments about the evolutionary modification of genetic dominance have a long history in genetics, dating back over 100 years. Mathematical investigations have shown that modifiers of the level of dominance at the locus of interest can only spread at a reasonable rate if heterozygotes at that locus are common. One hitherto neglected scenario is that of sexually antagonistic selection, which is ubiquitous in sexual species and can also generate the stable high frequencies of heterozygotes that would be expected to facilitate the spread of such modifiers. I will present a recursion-equation model that shows that sexually specific dominance modification is a potential outcome of sexually antagonistic selection.

The model predicts that loci with higher levels of sexual conflict should exhibit greater differentiation between males and females in levels of dominance and that the strength of antagonistic selection experienced by one sex should be proportional to the level of dominance modification. These predictions match the recent discovery of a gene in Atlantic salmon, in which sex-dependent dominance leads to earlier maturation of males than females, a difference that is strongly favoured by selection. Finally, I suggest that empiricists should be alert to the possibility of there being numerous cases of sex-specific dominance.
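As a rough illustration of what a recursion-equation model for sexually antagonistic selection looks like, the Python sketch below iterates the standard one-locus, two-allele recursion with separate fitnesses in females and males. It is a simplified stand-in rather than the model of the talk: any dominance modifier is omitted, and the fitness values are hypothetical.

```python
def next_generation(p_f, p_m, w_f, w_m):
    """Advance allele frequencies by one generation.

    p_f, p_m : frequency of allele A in eggs and in sperm.
    w_f, w_m : fitnesses (w_AA, w_Aa, w_aa) in females and in males.
    Returns the frequency of A in the eggs and sperm of the next generation.
    """
    # Offspring genotype frequencies under random union of gametes.
    f_AA = p_f * p_m
    f_Aa = p_f * (1 - p_m) + p_m * (1 - p_f)
    f_aa = (1 - p_f) * (1 - p_m)

    def after_selection(w):
        mean_w = w[0] * f_AA + w[1] * f_Aa + w[2] * f_aa
        return (w[0] * f_AA + 0.5 * w[1] * f_Aa) / mean_w

    return after_selection(w_f), after_selection(w_m)


# Hypothetical antagonistic fitnesses: A favoured in females, a in males,
# with the favoured allele partly dominant within each sex.
w_females = (1.0, 0.95, 0.8)
w_males = (0.8, 0.95, 1.0)

p_f, p_m = 0.1, 0.1
for _ in range(500):
    p_f, p_m = next_generation(p_f, p_m, w_females, w_males)
print(p_f, p_m)
```

Making the heterozygote fitness differ between the sexes, as here, is the simplest way to represent sex-specific dominance in such a recursion.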
Designing randomised trials to estimate the benefits and harms of patient choice of treatment

Robin Turner

Biostatistics Unit, Dunedin School of Medicine

Date: Thursday 17 May 2018

With the increased use of shared decision making, it is increasingly important to provide evidence on the impact that patient treatment choice has on outcomes. Two-stage randomised trials, incorporating participant choice, offer the opportunity to determine the effects of choice, which is not estimable in standard trials. In order to answer important questions about the effect of choice, these trials need to be adequately powered. This talk will cover the design issues for this type of trial and the situations in which this design may be most beneficial.
Using administrative data to improve public health: examples of research using the Integrated Data Infrastructure (IDI) and other data sources

Gabrielle Davie and Rebecca Lilley

Department of Preventive and Social Medicine

Date: Thursday 10 May 2018

Electronically available administrative data are increasingly being used by researchers. Distinct routinely collected administrative datasets are often combined using linkage techniques to enhance the utility of separate data sources for research purposes. Recently in New Zealand, administrative data from a range of government agencies, Statistics NZ surveys, and non-government organisations have been linked at the person-level generating the Integrated Data Infrastructure (IDI). Statistics NZ manage the IDI making it available for ‘approved research projects that are in the public interest’. This presentation will describe our recent experiences with using the IDI for public health research and discuss some learnings applicable to researchers considering using the IDI. We will also present research of ours utilising novel applications of other administrative data (non-IDI) to inform public health.
Spatially explicit capture-recapture for open populations

Murray Efford

Department of Mathematics and Statistics

Date: Thursday 3 May 2018

In this century, capture–recapture methods for animal populations have developed on two tracks. Estimation of abundance has focussed on robust spatially explicit models for data from closed populations, where turnover during sampling may be ignored. Estimation of turnover (survival, recruitment and population trend) has relied on non-spatial models for data from open populations, where mortality etc. may occur between samples. Multiple benefits flow from combining the two approaches, but this has so far been attempted only in one-off applications using Bayesian models, which are slow to fit. I outline a maximum likelihood approach that combines the strengths of Schwarz and Arnason (1996 ~~Biometrics~~ 52:860) and Borchers and Efford (2008 ~~Biometrics~~ 64:377). The methods are now available in the R package ~~openCR~~, which will be demonstrated with data on Louisiana black bears identified from DNA collected at hair snags. Naive spatial implementations of non-spatial methods can perform poorly, but in simulations the present methods appear robust.
Modelling spatial-temporal processes with applications to hydrology and wildfires

Valerie Isham, NZMS 2018 Forder Lecturer

University College London

Date: Tuesday 24 April 2018

Mechanistic stochastic models aim to represent an underlying physical process (albeit in highly idealised form, and using stochastic components to reflect uncertainty) via analytically tractable models, in which interpretable parameters relate directly to physical phenomena. Such models can be used to gain understanding of the process dynamics and thereby to develop control strategies.

In this talk, I will review some stochastic point process-based models constructed in continuous time and continuous space using spatial-temporal examples from hydrology such as rainfall (where flood control is a particular application) and soil moisture. By working with continuous spaces, consistent properties can be obtained analytically at any spatial and temporal resolutions, as required for fitting and applications. I will start by covering basic model components and properties, and then go on to discuss model construction, fitting and validation, including ways to incorporate nonstationarity and climate change scenarios. I will also describe some thoughts about using similar models for wildfires.
Epidemic modelling: successes and challenges

Valerie Isham, NZMS 2018 Forder Lecturer

University College London

Date: Monday 23 April 2018

##Note time and venue of this public lecture##
Epidemic models are developed as a means of gaining understanding about the dynamics of the spread of infection (human and animal pathogens, computer viruses etc.) and of rumours and other information. This understanding can then inform control measures to limit spread, or in some cases enhance it (e.g., viral marketing). In this talk, I will give an introduction to simple generic epidemic models and their properties, the role of stochasticity and the effects of population structure (metapopulations and networks) on transmission dynamics, illustrating some past successes and outlining some future challenges.
Estimating dated phylogenetic trees with applications in epidemiology, immunology, and macroevolution

Alexandra Gavryushkina

Department of Biochemistry

Date: Monday 23 April 2018

##Note day, time and venue for this seminar##
Newly available data require developing new approaches to reconstructing dated phylogenetic trees. In this talk, I will present new methods that employ birth-death-sampling models to reconstruct dated phylogenetic trees in a Bayesian framework. These methods have been successfully applied in epidemiology and macroevolution. Dated phylogenetic histories can be informative about past events; for example, we can learn from a reconstructed transmission tree which individuals were likely to infect other individuals. By reconstructing dated phylogenetic trees, we can also learn about the tree generating process parameters. For example, we can estimate and predict how fast epidemics spread or how fast new species arise or go extinct. In immunology, dating HIV antibody lineages can be important for vaccine design.
Confidence distributions

David Fletcher

Department of Mathematics and Statistics

Date: Thursday 19 April 2018

In frequentist statistics, it is common to summarise inference about a parameter using a point estimate and confidence interval. A useful alternative is a confidence distribution, first suggested by David Cox sixty years ago. This provides a visual summary of the set of confidence intervals obtained when we allow the confidence level to vary, and can be thought of as the frequentist analogue of a Bayesian posterior distribution. I will discuss the potential benefits of using confidence distributions and their link with Fisher's controversial concept of a fiducial distribution. I will also outline current work with Peter Dillingham and Jimmy Zeng on the calculation of a model-averaged confidence distribution.
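A standard textbook case (not taken from the talk) shows how a confidence distribution collects all confidence intervals at once. Assume $x_1, \dots, x_n$ are independent $N(\mu, \sigma^2)$ observations with $\sigma$ known, and write $\bar{x}$ for the sample mean, $\Phi$ for the standard normal distribution function and $z_q$ for its $q$-quantile. The confidence distribution for $\mu$, and the equal-tailed $100(1-\alpha)\%$ interval read off from its quantiles, are then:

```latex
C(\mu) = \Phi\!\left( \frac{\sqrt{n}\,(\mu - \bar{x})}{\sigma} \right),
\qquad
\left[\, C^{-1}(\alpha/2),\; C^{-1}(1 - \alpha/2) \,\right]
= \left[\, \bar{x} - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}},\;
          \bar{x} + z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} \,\right].
```

Varying $\alpha$ from 0 to 1 sweeps out every such interval, which is the visual summary referred to above.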
A statistics-related seminar in Preventive and Social Medicine: Meta-analysis and its implications for public health policy decisions

Andrew Anglemyer

Naval Postgraduate School, California

Date: Wednesday 4 April 2018

When recommending policies, clinical guidelines, and treatment decisions, policy makers and practitioners alike can benefit greatly from clear evidence obtained from available empirical data. Methods for synthesizing these data that have been developed for use in clinical environments may prove to be a powerful tool in evidence-based decision making in other fields as well. In this discussion, I will give an overview of examples of how meta-analysis techniques have provided guidance in public health policy decisions (e.g., HIV treatment guidelines), methods for synthesizing data, and possible limitations of these approaches. Additionally, I will apply meta-analysis techniques to a uniquely Kiwi question to illustrate possible ways to provide guidance in health decisions.

~~Dr. Andrew Anglemyer is an epidemiologist who specializes in infectious diseases and study design methodology at Naval Postgraduate School (and previously at University of California, San Francisco). Since 2009 he has been a member of the World Health Organization’s HIV Treatment Guidelines development committee and was the statistics and methods editor for the HIV/AIDS Cochrane Review Group at UC San Francisco until 2014. Dr. Anglemyer has co-authored dozens of public health and clinical peer-reviewed papers with a wide range of topics including HIV prevention and treatment in high-risk populations, firearms-related injury, paediatric encephalitis and hyponatremia. He received an MPH in Epidemiology/Biostatistics and a PhD in Epidemiology from University of California, Berkeley.~~
A statistics-related seminar in Public Health - Mapping for public health: Effective use of spatial analysis to communicate epidemiological information

Jason Gilliland

Western University, Canada

Date: Thursday 29 March 2018

In this seminar I will present some background and lessons on the use of mapping and spatial analytical methods for public health. With practical examples from my own research, I will cover some important considerations for public health researchers wanting to bring GIS-based analyses into their own projects. The presentation will focus on key methodological issues related to using spatial data which are often overlooked by epidemiologists and other health researchers. Discussion will revolve around opportunities for using qualitative data in Health GIS projects and some other future directions and challenges.
~~Professor Jason Gilliland is Director of the Urban Development Program and Professor in the Dept of Geography, Dept of Paediatrics, School of Health Studies and Dept of Epidemiology & Biostatistics at Western University in Canada. He is also a Scientist with the Children's Health Research Institute and Lawson Health Research Institute, two of Canada's leading hospital-based research institutes. His research is primarily focused on identifying environmental influences on children’s health issues such as poor nutrition, physical inactivity, obesity, and injury. He is also Director of the Human Environments Analysis Lab, an innovative research and training environment which specializes in community-based research and identifying interventions to inform public policy and neighbourhood design to promote the health and quality of life of children and youth.~~
Genetic linkage map construction in the next generation sequencing era: do old frameworks work with new challenges?

Phil Wilcox

Department of Mathematics and Statistics

Date: Thursday 29 March 2018

The low cost and high throughput of new DNA sequencing technologies have led to a data ‘revolution’ in genomics: two-to-three orders of magnitude more data can be generated for the same cost compared to previous technologies. This has facilitated genome-wide investigations in non-model species at scales not previously possible. However, these new technologies also present new challenges, particularly with genetic linkage mapping, where error due to sequencing and heterozygote undercalling upwardly biases estimates of linkage map lengths and creates difficulties in reliably ordering clustered loci. In this talk I will describe the application of an exome capture based genotyping panel to genetic linkage map construction in ~~Pinus radiata D.Don~~. I will show that previously applied approaches first proposed in the mid-1990s still provide a suitable analytical framework for constructing robust linkage maps even in this modern data rich era.
Case-control logistic regression is more complicated than you think

Thomas Lumley

University of Auckland

Date: Thursday 22 March 2018

It is a truth universally acknowledged that logistic regression gives consistent and fully efficient estimates of the regression parameter under case-control sampling, so we can often ignore the distinction between retrospective and prospective sampling. I will talk about two issues that are more complicated than this. First, the behaviour of pseudo-$r^2$ statistics under case-control sampling: most of these are not consistently estimated. Second, the question of when and why unweighted logistic regression is much more efficient than survey-weighted logistic regression: the traditional answers of 'always' and 'because of variation in weights' are wrong.
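The textbook result behind the opening sentence can be written in one line. Assume a prospective logistic model with intercept $\beta_0$ and slope vector $\beta$, and suppose cases and controls are sampled with probabilities $\pi_1$ and $\pi_0$ that do not depend on the covariates $x$ (the symbols $\pi_1$, $\pi_0$ and the sampling indicator $S$ are notation introduced here, not taken from the talk). Then, among sampled individuals:

```latex
\operatorname{logit} P(Y = 1 \mid x, S = 1)
  = \beta_0 + \log\frac{\pi_1}{\pi_0} + \beta^{\top} x .
```

Only the intercept is shifted, so the slopes can be estimated as if the data were prospective; quantities that depend on the intercept or on the marginal distribution of $Y$ do not carry over in the same way.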