Statistics
Department of Mathematics & Statistics
Te Tari Pāngarau me te Tatauranga

## Archived seminars in Statistics

Project presentations

### Honours and PGDip students

Department of Mathematics and Statistics

Date: Friday 2 June 2017

STATISTICS
Jodie Buckby : Model checking for hidden Markov models
Jie Kang : Model averaging for renewal process
Yu Yang : Robustness of temperature reconstruction for the past 500 years

MATHEMATICS
Fergus O'Leary : Stochastic spatial population models
Rachael Young : SIR epidemic models on networks
Sam Bremer : An effective model for particle distribution in groundwaters
Joshua Mills : Hyperbolic equations and finite difference schemes
Yiwen Qi : The heat equation and Brownian motion
Vijay Surada : Modelling the self-thinning rule
Twists and trends in exercise science

### Jim Cotter

School of Physical Education, Sport and Exercise Sciences

Date: Thursday 1 June 2017

From my perspective, exercise science is entering an age of enlightenment, but misuse of statistics remains a serious limitation to its contributions and progress for human health, performance, and basic knowledge. This seminar will summarise our recent and current work in hydration, heat stress and patterns of dosing/prescribing exercise, and the implications for human health and performance. These contexts will be used to discuss methodological issues, including those of research design, analysis and interpretation.
Hidden Markov models for incompletely observed point processes

Department of Mathematics and Statistics

Date: Thursday 25 May 2017

Natural phenomena such as earthquakes and volcanic eruptions can cause catastrophic damage. Such phenomena can be modelled using point processes. However, this is complicated and potentially biased by the problem of missing data in the records. The degree of completeness of volcanic records varies dramatically over time. Often the older the record is, the more incomplete it is. One way to handle such records with missing data is to use hidden Markov models (HMMs). An HMM is a two-layered process based on an observed process and an unobserved first-order stationary Markov chain with geometrically distributed state durations. This limits the application of HMMs in the field of volcanology, where the processes leading to missed observations do not necessarily behave in a memoryless and time-independent manner. We propose inhomogeneous hidden semi-Markov models (IHSMMs) to investigate the time-inhomogeneity of the completeness of volcanic eruption catalogues and so obtain a reliable hazard estimate.
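The geometric state duration mentioned above follows directly from the Markov property. As a brief sketch, write $a_{ii}$ for the probability that the hidden chain stays in state $i$ at the next step and $D_i$ for the length of a visit to state $i$ (both symbols are introduced here for illustration and do not come from the abstract):

$$P(D_i = d) = a_{ii}^{\,d-1}(1 - a_{ii}), \qquad d = 1, 2, \dots$$

$$P(D_i > m + n \mid D_i > m) = \frac{a_{ii}^{\,m+n}}{a_{ii}^{\,m}} = a_{ii}^{\,n} = P(D_i > n).$$

The second identity is the memoryless property: the chance of leaving a state does not depend on how long the chain has already been there. A semi-Markov model relaxes this by allowing duration distributions other than the geometric.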
The CBOE SKEW

### Jin Zhang

Department of Accountancy and Finance

Date: Thursday 18 May 2017

The CBOE SKEW is an index launched by the Chicago Board Options Exchange (CBOE) in February 2011. Its term structure tracks the risk-neutral skewness of the S&P 500 (SPX) index for different maturities. In this paper, we develop a theory for the CBOE SKEW by modelling SPX using a jump-diffusion process with stochastic volatility and stochastic jump intensity. With the term structure data of VIX and SKEW, we estimate model parameters and obtain the four processes of variance, jump intensity and their long-term mean levels. Our results can be used to describe SPX risk-neutral distribution and to price SPX options.
Finding true identities in a sample using MCMC methods

### Paula Bran

Department of Mathematics and Statistics

Date: Thursday 11 May 2017

Uncertainty about the true identities behind observations is known in statistics as a misidentification problem. The observations may be duplicated, wrongly reported or missing, which results in error-prone data collection. This error can seriously affect the inferences and conclusions. A wide variety of MCMC algorithms have been developed for simulating the latent identities of individuals in a dataset using Bayesian inference. In this talk, the DIU (Direct Identity Updater) algorithm is introduced. It is a Metropolis-Hastings sampler with an application-specific proposal density. Its performance and efficiency are compared with those of two other algorithms solving similar problems. The convergence to the correct stationary distribution is discussed using a toy example in which the data comprise genotypes that include uncertainty. As the state space is small, the behaviour of the chains is easily visualized. Interestingly, while they converge to the same stationary distribution, the transition matrices for the different algorithms have little in common.
Correlated failures in multicomponent systems

### Richard Arnold

Victoria University Wellington

Date: Thursday 4 May 2017

Multicomponent systems may experience failures with correlations amongst failure times of groups of components, and some subsets of components may experience common cause, simultaneous failures. We present a novel, general approach to model construction and inference in multicomponent systems incorporating these correlations in an approach that is tractable even in very large systems. In our formulation the system is viewed as being made up of Independent Overlapping Subsystems (IOS). In these systems components are grouped together into overlapping subsystems, and further into non-overlapping subunits. Each subsystem has an independent failure process, and each component's failure time is the time of the earliest failure in all of the subunits of which it is a part.

This is joint work with Stefanka Chukova (VUW) and Yu Hayakawa (Waseda University, Tokyo)
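The rule in the abstract, that a component fails at the earliest failure among the groups it belongs to, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the exponential failure times, the function name `simulate_component_failures`, and the toy membership map are all hypothetical and are not taken from the talk.

```python
import random


def simulate_component_failures(membership, rates, rng):
    """Draw one failure time per component.

    membership: dict mapping component name -> set of subsystem names
    rates: dict mapping subsystem name -> failure rate
           (exponential failure times are an assumption of this sketch)
    rng: a random.Random instance
    """
    # Each subsystem has its own independent failure time.
    subsystem_times = {s: rng.expovariate(rate) for s, rate in rates.items()}
    # A component fails at the earliest failure among the subsystems containing it.
    return {
        c: min(subsystem_times[s] for s in subs)
        for c, subs in membership.items()
    }


rng = random.Random(1)
# Hypothetical layout: component B sits in both subsystems, A and C in one each.
membership = {"A": {"s1"}, "B": {"s1", "s2"}, "C": {"s2"}}
rates = {"s1": 1.0, "s2": 0.5}
times = simulate_component_failures(membership, rates, rng)
```

Because B shares a subsystem with A and another with C, its failure time coincides with one of theirs. This is how overlapping membership induces correlated and simultaneous failures even though the subsystems themselves fail independently.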
Integration of IVF technologies with genomic selection to generate high merit AI bulls: a simulation study

### Fiona Hely

AbacusBio

Date: Thursday 27 April 2017

New reproductive technologies such as genotyping of embryos prior to cloning and IVF allow the possibility of targeting elite AI bull calves from high merit sires and dams. A stochastic simulation model was set up to replicate both progeny testing and genomic selection dairy genetic improvement schemes with and without the use of IVF to generate bull selection candidates. The reproductive process was simulated using a series of random variates to assess the likelihood of a given cross between a selected sire and dam producing a viable embryo, and the superiority of these viable bulls was assessed from the perspective of a commercial breeding company.
Recovery and recolonisation by New Zealand southern right whales: making the most of limited sampling opportunities

### Will Rayment

Department of Marine Science

Date: Thursday 13 April 2017

Studies of marine megafauna are often logistically challenging, thus limiting our ability to gain robust insights into the status of populations. This is especially true for southern right whales, a species which was virtually extirpated in New Zealand waters by commercial whaling in the 19th century, and restricted to breeding around the remote sub-Antarctic Auckland Islands. We have gathered photo-ID and distribution data during annual 3-week duration trips to study right whales at the Auckland Islands since 2006. Analysis of the photo-ID data has yielded estimates of demographic parameters including survival rate and calving interval, essential for modelling the species’ recovery, while species-distribution models have been developed to reveal the specific habitat preferences of calving females. These data have been supplemented by visual and acoustic autonomous monitoring, in order to investigate seasonal occurrence of right whales in coastal habitats. Understanding population recovery, and potential recolonization of former habitats around mainland New Zealand, is essential if the species is to be managed effectively in the future.
Ion-selective electrode sensor arrays: calibration, characterisation, and estimation

### Peter Dillingham

Department of Mathematics and Statistics

Date: Thursday 6 April 2017

Ion-selective electrodes (ISEs) have undergone a renaissance over the last 20 years. New fabrication techniques, which allow mass production, have led to their increasing use in demanding environmental and health applications. These deployable low-cost sensors are now capable of measuring sub-micromolar concentrations in complex and variable solutions, including blood, sweat, and soil. However, these measurement challenges have highlighted the need for modern calibration techniques to properly characterise ISEs and report measurement uncertainty. In this talk, our group’s developments will be discussed, with a focus on modelling ISEs, properly defining the limit of detection, and extensions to sensor arrays.
What in the world caused that? Statistics of sensory spike trains and neural computation for inference

### Mike Paulin

Department of Zoology

Date: Thursday 30 March 2017

Before the “Cambrian explosion” 542 million years ago, animals without nervous systems reacted to environmental signals mapped onto the body surface. Later animals constructed internal maps from noisy partial observations gathered at the body surface. Considering the energy costs of data acquisition and inference versus the costs of not doing this in late Precambrian ecosystems leads us to model spike trains recorded from sensory neurons (in sharks, frogs and other animals) as samples from a family of Inverse Gaussian-censored Poisson, a.k.a. Exwald, point-processes. Neurons that evolved for other reasons turn out to be natural mechanisms for generating samples from Exwald processes, and natural computers for inferring the posterior density of their parameters. This is a consequence of a curious correspondence between the likelihood function for sequential inference from a censored Poisson process and the impulse response function of a neuronal membrane. We conclude that modern animals, including humans, are natural Bayesians because when neurons evolved 560 million years ago they provided our ancestors with a choice between being Bayesian or being dead.
This is joint work with recent Otago PhD students Kiri Pullar and Travis Monk, honours student Ethan Smith, and UCLA neuroscientist Larry Hoffman.
Brewster Glacier - a benchmark for investigating glacier-climate interactions in the Southern Alps of New Zealand

### Nicolas Cullen

Department of Geography

Date: Thursday 23 March 2017

The advance of some fast-responding glaciers in the Southern Alps of New Zealand at the end of the 20th and beginning of the 21st century during three of the warmest decades of the instrumental era provides clear evidence that changes in large-scale atmospheric circulation in the Southern Hemisphere can act as a counter-punch to global warming. The Southern Alps are surrounded by vast areas of ocean and are strongly influenced by both subtropical and polar air masses, with the interaction of these contrasting air masses in the prevailing westerly airflow resulting in weather systems having a strong influence on glacier mass balance. Until recently, one of the challenges in assessing how large-scale atmospheric circulation influences glacier behaviour has been the lack of observational data from high-elevation sites in the Southern Alps. However, high-quality meteorological and glaciological observations from Brewster Glacier allow us to now assess in detail how atmospheric processes at different scales influence glacier behaviour. This talk will provide details about the observational programme on Brewster Glacier, which has been continuous for over a decade, and then target how weather systems influence daily ablation and precipitation (snowfall).
Estimating overdispersion in sparse multinomial data

### Farzana Afroz

Department of Mathematics and Statistics

Date: Thursday 16 March 2017

When overdispersion is present in a data set, ignoring it may lead to serious underestimation of standard errors and potentially misleading model comparisons. Generally we estimate the overdispersion parameter $\phi$ by dividing the Pearson's goodness of fit statistic $X^2$ by the residual degrees of freedom. But when the data are sparse, that is when there are many zero or small counts, it may not be reasonable to use this statistic since $X^2$ is unlikely to be $\chi^2$-distributed. This study presents a comparison of four estimators of the overdispersion parameter $\phi$, in terms of bias, root mean squared error and standard deviation, when the data are sparse and multinomial. Dead recovery data on Herring gulls from Kent Island, Canada are used to provide a practical example of sparse multinomial data. In a simulation study, we consider Dirichlet-multinomial distribution and finite mixture distribution, which are widely used to model extra variation in multinomial data.
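The usual estimator described above, Pearson's $X^2$ divided by the residual degrees of freedom, can be sketched in a few lines. The function name `pearson_overdispersion` and the example counts are hypothetical; the residual degrees of freedom are passed in by the caller because their value depends on the fitted model.

```python
def pearson_overdispersion(observed, expected, residual_df):
    """Estimate the overdispersion parameter phi as X^2 / residual df.

    observed: observed cell counts
    expected: fitted (expected) cell counts under the model, all positive
    residual_df: residual degrees of freedom of the fitted model
    """
    if residual_df <= 0:
        raise ValueError("residual_df must be positive")
    # Pearson's goodness-of-fit statistic.
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return x2 / residual_df


# Made-up counts, for illustration only.
phi_hat = pearson_overdispersion([10, 20, 30], [15.0, 20.0, 25.0], residual_df=2)
```

With sparse data many expected counts are small, so individual terms $(o - e)^2 / e$ can dominate $X^2$. This is one reason the abstract questions the estimator in that setting.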
Fast computation of spatially adaptive kernel smooths

### Tilman Davies

Department of Mathematics and Statistics

Date: Thursday 9 March 2017

Kernel smoothing of spatial point data can often be improved using an adaptive, spatially-varying bandwidth instead of a fixed bandwidth. However, computation with a varying bandwidth is much more demanding, especially when edge correction and bandwidth selection are involved. We propose several new computational methods for adaptive kernel estimation from spatial point pattern data. A key idea is that a variable-bandwidth kernel estimator for d-dimensional spatial data can be represented as a slice of a fixed-bandwidth kernel estimator in (d+1)-dimensional "scale space", enabling fast computation using discrete Fourier transforms. Edge correction factors have a similar representation. Different values of global bandwidth correspond to different slices of the scale space, so that bandwidth selection is greatly accelerated. Potential applications include estimation of multivariate probability density and spatial or spatiotemporal point process intensity, relative risk, and regression functions. The new methods perform well in simulations and real applications.
Detection and replenishment of missing data in the observation of point processes with independent marks

### Jiancang Zhuang

Institute of Statistical Mathematics, Tokyo

Date: Thursday 2 March 2017

Records of geophysical events such as earthquakes and volcanic eruptions, which are usually modelled as marked point processes, often have missing data that result in underestimation of the corresponding hazards. This study presents a fast approach for replenishing missing data in the record of a temporal point process with time-independent marks. The basis of this method is that, if such a point process is completely observed, it can be transformed into a homogeneous Poisson process on the unit square $[0,1]^2$ by a biscale empirical transformation. This method is tested on a synthetic dataset and applied to the record of volcanic eruptions at the Hakone Volcano, Japan, and to several datasets of aftershock sequences following large earthquakes. In particular, by comparing the analysis results from the original and the replenished datasets of aftershock sequences, we have found that both the Omori-Utsu formula and the ETAS model are stable, and that the variations in the estimated parameters with different magnitude thresholds reported in past studies are caused by short-term gaps in the recording of small events.
A new multidimensional stress release statistical model based on coseismic stress transfer

### Shiyong Zhou

Peking University

Date: Tuesday 14 February 2017

NOTE venue is not our usual
Following the stress release model (SRM) proposed by Vere-Jones (1978), we developed a new multidimensional SRM, which is a space-time-magnitude version based on multidimensional point processes. First, we interpreted the exponential hazard functional of the SRM as the mathematical expression of static fatigue failure caused by stress corrosion. Then, we reconstructed the SRM in multidimensions through incorporating four independent submodels: the magnitude distribution function, the space weighting function, the loading rate function and the coseismic stress transfer model. Finally, we applied the new model to analyze the historical earthquake catalogues in North China. An expanded catalogue, which contains the information of origin time, epicentre, magnitude, strike, dip angle, rupture length, rupture width and average dislocation, is composed for the new model. The estimated model can simulate the variations of seismicity with space, time and magnitude. Compared with the previous SRMs with the same data, the new model yields much smaller values of Akaike information criterion and corrected Akaike information criterion. We compared the predicted rates of earthquakes at the epicentres just before the related earthquakes with the mean spatial seismic rate. Among all 37 earthquakes in the expanded catalogue, the epicentres of 21 earthquakes are located in the regions of higher rates.
Next generation ABO blood type genetics and genomics

### Keolu Fox

University of San Diego

Date: Wednesday 1 February 2017

The ABO gene encodes a glycosyltransferase, which adds sugars (N-acetylgalactosamine for A and α-D-galactose for B) to the H antigen substrate. Single nucleotide variants in the ABO gene affect the function of this glycosyltransferase at the molecular level by altering the specificity and efficiency of this enzyme for these specific sugars. Characterizing variation in ABO is important in transfusion and transplantation medicine because variants in ABO have significant consequences with regard to recipient compatibility. Additionally, variation in the ABO gene has been associated with cardiovascular disease risk (e.g., myocardial infarction) and quantitative blood traits (von Willebrand factor (VWF), Factor VIII (FVIII) and Intercellular Adhesion Molecule 1 (ICAM-1)). Relating ABO genotypes to actual blood antigen phenotype requires the analysis of haplotypes. Here we will explore variation (single nucleotide, insertions and deletions, and structural variation) in blood cell trait gene loci (ABO) using multiple datasets enriched for heart, lung and blood-related diseases (including both African-Americans and European-Americans) from multiple NGS datasets (e.g. the NHLBI Exome Sequencing Project (ESP) dataset). I will also describe the use of a new ABO haplotyping method, ABO-seq, to increase the accuracy of ABO blood type and subtype calling using variation in multiple NGS datasets. Finally, I will describe the use of multiple read-depth based approaches to discover previously unsuspected structural variation (SV) in genes not shown to harbor SV, such as the ABO gene, by focusing on understudied populations, including individuals of Hispanic and African ancestry.

Keolu has a strong background in using genomic technologies to understand human variation and disease. Throughout his career he has made it his priority to focus on the interface of minority health and genomic technologies. Keolu earned a Ph.D. in Debbie Nickerson's lab in the University of Washington's Department of Genome Sciences (August 2016). In collaboration with experts at Bloodworks Northwest (Seattle, WA), he focused on the application of next-generation genome sequencing to increase compatibility for blood transfusion therapy and organ transplantation. Currently Keolu is a postdoc in Alan Saltiel's lab at the University of California San Diego (UCSD) School of Medicine, Division of Endocrinology and Metabolism and the Institute for Diabetes and Metabolic Health. His current project focuses on using genome editing technologies to investigate the molecular events involved in chronic inflammatory states resulting in obesity and catecholamine resistance.
To be or not to be (Bayesian) Non-Parametric: A tale about Stochastic Processes

### Roy Costilla

Victoria University Wellington

Date: Tuesday 24 January 2017

Thanks to advances in theory and computation over the last decades, Bayesian Non-Parametric (BNP) models are now used in many fields, including Biostatistics, Bioinformatics, Machine Learning, Linguistics and many others.

Despite their name, however, BNP models are actually massively parametric. A parametric model uses a function with a finite dimensional parameter vector as its prior. Bayesian inference then proceeds to approximate the posterior of these parameters given the observed data. In contrast, a BNP model is defined on an infinite dimensional probability space thanks to the use of a stochastic process as a prior. In other words, the prior for a BNP model is a space of functions with an infinite dimensional parameter vector. Therefore, instead of avoiding parametric forms, BNP inference uses a large number of them to gain more flexibility.

To illustrate this, we present simulations and also a case study where we use life satisfaction in NZ over 2009-2013. We estimate the models using a finite Dirichlet Process Mixture (DPM) prior. We show that this BNP model is tractable, i.e. is easily computed using Markov Chain Monte Carlo (MCMC) methods, allowing us to handle data with big sample sizes and correctly estimate the model parameters. Coupled with a post-hoc clustering of the DPM locations, the BNP model also allows an approximation of the number of mixture components, a very important parameter in mixture modelling.
Computational methods and statistical modelling in the analysis of co-occurrences: where are we now?

### Jorge Navarro Alberto

Date: Wednesday 9 November 2016

NOTE day and time of this seminar
The subject of the talk is statistical methods (both theoretical and applied) and computational algorithms for the analysis of binary data, which have been applied in ecology in the study of species composition in systems of patches with the ultimate goal to uncover ecological patterns. As a starting point, I review Gotelli and Ulrich's (2012) six statistical challenges in null model analysis in Ecology. Then, I exemplify the most recent research carried out by me and other statisticians and ecologists to face those challenges, and applications of the algorithms outside the biological sciences. Several topics of research are proposed, seeking to motivate statisticians and computer scientists to venture and, eventually, to specialize in the subject of the analysis of co-occurrences.
Reference: Gotelli, NJ and Ulrich, W, 2012. Statistical challenges in null model analysis. Oikos 121: 171-180
Extensions of the multiset sampler

### Scotland Leman

Virginia Tech, USA

Date: Tuesday 8 November 2016

NOTE day and time of this seminar
In this talk I will primarily discuss the Multiset Sampler (MSS): a general ensemble based Markov Chain Monte Carlo (MCMC) method for sampling from complicated stochastic models. After which, I will briefly introduce the audience to my interactive visual analytics based research.

Proposal distributions for complex structures are essential for virtually all MCMC sampling methods. However, such proposal distributions are difficult to construct so that their probability distribution matches that of the true target distribution, which in turn hampers the efficiency of the overall MCMC scheme. The MSS entails sampling from an augmented distribution that has more desirable mixing properties than the original target model, while utilizing simple independent proposal distributions that are easily tuned. I will discuss applications of the MSS for sampling from tree based models (e.g. Bayesian CART; phylogenetic models), and for general model selection, model averaging and predictive sampling.

In the final 10 minutes of the presentation I will discuss my research interests in interactive visual analytics and the Visual To Parametric Interaction (V2PI) paradigm. I'll discuss the general concepts in V2PI with an application of Multidimensional Scaling, its technical merits, and the integration of such concepts into core statistics undergraduate and graduate programs.
New methods for estimating spectral clustering change points for multivariate time series

### Ivor Cribben

University of Alberta

Date: Wednesday 19 October 2016

NOTE day and time of this seminar
Spectral clustering is a computationally feasible and model-free method widely used in the identification of communities in networks. We introduce a data-driven method, namely Network Change Points Detection (NCPD), which detects change points in the network structure of a multivariate time series, with each component of the time series represented by a node in the network. Spectral clustering allows us to consider high dimensional time series where the number of time series is greater than the number of time points. NCPD allows for estimation of both the time of change in the network structure and the graph between each pair of change points, without prior knowledge of the number or location of the change points. Permutation and bootstrapping methods are used to perform inference on the change points. NCPD is applied to various simulated high dimensional data sets as well as to a resting state functional magnetic resonance imaging (fMRI) data set. The new methodology also allows us to identify common functional states across subjects and groups. Extensions of the method are also discussed. Finally, the method promises to offer a deep insight into the large-scale characterisations and dynamics of the brain.
Inverse prediction for paleoclimate models

### John Tipton

Date: Tuesday 18 October 2016

NOTE day and time of this seminar
Many scientific disciplines have strong traditions of developing models to approximate nature. Traditionally, statistical models have not included scientific models and have instead focused on regression methods that exploit correlation structures in data. The development of Bayesian methods has generated many examples of forward models that bridge the gap between scientific and statistical disciplines. The ability to fit forward models using Bayesian methods has generated interest in paleoclimate reconstructions, but there are many challenges in model construction and estimation that remain.

I will present two statistical reconstructions of climate variables using paleoclimate proxy data. The first example is a joint reconstruction of temperature and precipitation from tree rings using a mechanistic process model. The second reconstruction uses microbial species assemblage data to predict peat bog water table depth. I validate predictive skill using proper scoring rules in simulation experiments, providing justification for the empirical reconstruction. Results show forward models that leverage scientific knowledge can improve paleoclimate reconstruction skill and increase understanding of the latent natural processes.
Ultrahigh dimensional variable selection for interpolation of point referenced spatial data

### Benjamin Fitzpatrick

Queensland University of Technology

Date: Monday 17 October 2016

NOTE day and time of this seminar
When making inferences concerning the environment, ground truthed data will frequently be available as point referenced (geostatistical) observations accompanied by a rich ensemble of potentially relevant remotely sensed and in-situ observations.
Modern soil mapping is one such example characterised by the need to interpolate geostatistical observations from soil cores and the availability of data on large numbers of environmental characteristics for consideration as covariates to aid this interpolation.

In this talk I will outline my application of Least Absolute Shrinkage and Selection Operator (LASSO) regularized multiple linear regression (MLR) to build models for predicting full cover maps of soil carbon when the number of potential covariates greatly exceeds the number of observations available (the p > n or ultrahigh dimensional scenario). I will outline how I have applied LASSO regularized MLR models to data from multiple (geographic) sites and discuss investigations into treatments of site membership in models and the geographic transferability of the models developed. I will also present novel visualisations of the results of ultrahigh dimensional variable selection and briefly outline some related work in ground cover classification from remotely sensed imagery.

Key references:
Fitzpatrick, B. R., Lamb, D. W., & Mengersen, K. (2016). Ultrahigh Dimensional Variable Selection for Interpolation of Point Referenced Spatial Data: A Digital Soil Mapping Case Study. PLoS ONE, 11(9): e0162489.
Fitzpatrick, B. R., Lamb, D. W., & Mengersen, K. (2016). Assessing Site Effects and Geographic Transferability when Interpolating Point Referenced Spatial Data: A Digital Soil Mapping Case Study. https://arxiv.org/abs/1608.00086
New Zealand master sample using balanced acceptance sampling

### Paul van Dam-Bates

Department of Conservation

Date: Thursday 13 October 2016

Environmental monitoring for management organisations like the Department of Conservation is critical. Without good information about outcomes, poor management actions may persist much longer than they should or initial intervention may occur too late. The Department currently conducts focused research at key natural heritage sites (Tier 3) as well as long-term national monitoring (Tier 1). The link between the two tiers of investigation, which would assess the impact of management across New Zealand (Tier 2), is yet to be implemented and faces unique challenges in working at many different spatial scales and coordinating with multiple agencies. The solution is to implement a Master Sample using Balanced Acceptance Sampling (BAS). To do this, some practical aspects of the sample design are addressed, such as stratification, unequal probability sampling, rotating panel designs and regional intensification. Incorporating information from Tier 1 monitoring directly is also discussed.

Authors: Paul van Dam-Bates[1], Ollie Gansell[1] and Blair Robertson[2]
1 Department of Conservation, New Zealand
2 University of Canterbury, Department of Mathematics and Statistics
How robust are capture–recapture estimators of animal population density?

### Murray Efford

Department of Mathematics and Statistics

Date: Thursday 6 October 2016

Data from passive detectors (traps, automatic cameras etc.) may be used to estimate animal population density, especially if individuals can be distinguished. However, the spatially explicit capture–recapture (SECR) models used for this purpose rest on specific assumptions that may or may not be justified, and uncertainty regarding the robustness of SECR methods has led some to resist their use. I consider the robustness of SECR estimates to deviations from key spatial assumptions – uniform spatial distribution of animals, circularity of home ranges, and the shape of the radial detection function. The findings are generally positive, although variance estimates are sensitive to over-dispersion. The method is also somewhat robust to transience and other misspecifications of the detection model, but it is not foolproof, as I show with a counter example.
Bootstrapped model-averaged confidence intervals

### Jimmy Zeng

Department of Preventive and Social Medicine

Date: Thursday 29 September 2016

Model-averaging is commonly used to allow for model uncertainty in parameter estimation. In the frequentist setting, a model-averaged estimate of a parameter is a weighted mean of the estimates from the individual models, with the weights being based on an information criterion, such as AIC. A Wald confidence interval based on this estimate will often perform poorly, as its sampling distribution will generally be distinctly non-normal and estimation of the standard error is problematic. We propose a new method that uses a studentized bootstrap approach. We illustrate its use with a lognormal example, and perform a simulation study to compare its coverage properties with those of existing intervals.
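The weighted mean described above can be sketched as follows. This assumes the widely used Akaike weights, proportional to $\exp(-\Delta_i/2)$ where $\Delta_i$ is each model's AIC minus the smallest AIC; the abstract only says the weights are "based on" an information criterion, so this particular form and the function names are assumptions. The sketch covers the point estimate only, not the studentized bootstrap interval proposed in the talk.

```python
import math


def akaike_weights(aics):
    """Convert AIC values into model weights that sum to one."""
    best = min(aics)
    # Subtracting the smallest AIC keeps the exponentials numerically stable.
    raw = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(raw)
    return [r / total for r in raw]


def model_averaged_estimate(estimates, aics):
    """Weighted mean of per-model estimates of the same parameter."""
    weights = akaike_weights(aics)
    return sum(w * e for w, e in zip(weights, estimates))


# Made-up estimates and AIC values, for illustration only.
weights = akaike_weights([100.0, 102.0, 110.0])
averaged = model_averaged_estimate([1.0, 3.0], [10.0, 10.0])
```

Models with equal AIC receive equal weight, so the second call returns the simple mean of the two estimates.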
N-mixture models vs Poisson regression

### Richard Barker

Department of Mathematics and Statistics

Date: Thursday 22 September 2016

N-mixture models describe count data replicated in time and across sites in terms of abundance N and detectability p. They are popular because they allow inference about N while controlling for factors that influence p without the need for marking animals. Using a capture-recapture perspective we show that the loss of information that results from not marking animals is critical, making reliable statistical modeling of N and p problematic using just count data. We are unable to fit a model in which the detection probabilities are distinct among repeat visits as this model is overspecified. This makes uncontrolled variation in p problematic. By counterexample we show that even if p is constant after adjusting for covariate effects (the 'constant p' assumption), scientifically plausible alternative models in which N (or its expectation) is non-identifiable or does not even exist lead to data that are practically indistinguishable from data generated under an N-mixture model. This is particularly the case for sparse data, as is commonly seen in applications. We conclude that under the constant p assumption reliable inference is only possible for relative abundance in the absence of questionable and/or untestable assumptions or with better quality data than seen in typical applications. Relative abundance models for counts can be readily fitted using Poisson regression in standard software such as R and are sufficiently flexible to allow controlling for p through the use of covariates while simultaneously modeling variation in relative abundance. If users require estimates of absolute abundance they should collect auxiliary data that help with estimation of p.
Single-step genomic evaluation of New Zealand's sheep

Department of Mathematics and Statistics

Date: Thursday 15 September 2016

Quantitative genetics is the study of inheritance of quantitative traits, which are generally continuously distributed. It uses biometry to study the expression of quantitative differences among individuals and considers genetic relatedness and environment. In the past, determining the genetic structure of individuals was too expensive to be used commercially. However, in the last decade, the price of genotyping has fallen rapidly, and there are now commercial genotype chips available for most livestock species. Currently, dense marker maps are used to predict the genetic merit of animals early in life. There are methods available for genomic evaluation. However, because they do not consider all the available information at the same time, bias or accuracy loss may occur. Single-step GBLUP is a method that uses all the genomic, pedigree and phenotypic data on all animals simultaneously, and is reported to limit bias and in some cases increase accuracy of prediction. Preliminary results of this approach on New Zealand sheep will be presented.
Clinical trial Data Monitoring Committees - aiding science

### Katrina Sharples

Department of Mathematics and Statistics

Date: Thursday 8 September 2016

The goal of a clinical trial is to obtain reliable evidence regarding the benefits and risks of a treatment while minimising the harm to patients. Recruitment and follow-up may take place over several years, accruing information over time, which allows the option of stopping the trial early if the trial objectives have been met or the risks to patients become too great. It has become standard practice for trials with significant risk to be overseen by an independent Data Monitoring Committee (DMC). These DMCs have sole access to the accruing trial data; they are responsible for safeguarding the rights of the patients in the trial, and for making recommendations to those running the trial regarding trial conduct and possible early termination. However, interpreting the accruing evidence and making optimal recommendations is challenging. As the number of trials having DMCs has grown, there has been increasing discussion of how to train new DMC members. Some DMCs have published papers describing their decision-making processes for specific trials, and workshops are held fairly frequently. However, it is recognised that DMC expertise is best acquired through apprenticeship. Opportunities for this are rare internationally, but in New Zealand, in 1996, the Health Research Council established a unique system for monitoring clinical trials which incorporates apprenticeship positions. This talk will describe our system, discuss some of the issues and insights that have arisen along the way, and the effects it has had on the NZ clinical trial environment.
A statistics-related guest seminar in Preventive and Social Medicine: A researcher's guide to understanding modern statistics

### Sander Greenland

University of California

Date: Monday 5 September 2016

Note day, time and venue of this special seminar
Sander Greenland is Research Professor and Emeritus Professor of Epidemiology and Statistics at the University of California, Los Angeles. He is a leading contributor to epidemiological statistics, theory, and methods, with a focus on the limitations and misuse of statistical methods in observational studies. He has authored or co-authored over 400 articles and book chapters in epidemiology, statistics, and medical publications, and co-authored the textbook Modern Epidemiology.

Professor Greenland has played an important role in the recent discussion following the American Statistical Association’s statement on the use of p values.[1-3] He will discuss lessons he took away from the process and how they apply to properly interpreting what is ubiquitous but rarely interpreted correctly by researchers: Statistical tests, P-values, power, and confidence intervals.

1. Ronald L. Wasserstein & Nicole A. Lazar (2016): The ASA's statement on p-values: context, process, and purpose, The American Statistician, 70, 129-133, DOI: 10.1080/00031305.2016.1154108
2. Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.C., Poole, C., Goodman, S.N., and Altman, D.G. (2016). Statistical tests, confidence intervals, and power: A guide to misinterpretations. The American Statistician, 70, online supplement 1 at http://www.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108; reprinted in the European Journal of Epidemiology, 31, 337-350.
3. Greenland, S. (2016). The ASA guidelines and null bias in current teaching and practice. The American Statistician, 70, online supplement 10 at http://www.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108
Sugars are not all the same to us! Empirical investigation of inter- and intra-individual variabilities in responding to common sugars

### Mei Peng

Department of Food Science

Date: Thursday 25 August 2016

Given the collective interests in sugar, from food scientists, geneticists, neurophysiologists, and many others (e.g., health professionals, food journalists, and YouTube experimenters), one would expect the picture of human sweetness perception should be reasonably complete by now. This is unfortunately not the case. Some seemingly fundamental questions have not yet been answered – is one’s sweetness sensitivity generalisable across different sugars? Can people discriminate sugars when they are equally sweet? Do common sugars have similar effects on people’s cognitive processing?

Answers to these questions have a close relevance to illuminating the sensory physiology of sugar metabolism, as well as to practical research of sucrose substitution. In this seminar, I would like to present findings from a few behavioural experiments focused on inter-individual and intra-individual differences in responding to common sugars, using methods from sensory science and cognitive psychology. Overall, our findings challenged some of the conventional beliefs about sweetness perception, and provided some insights into future research about sugar.
New models for symbolic data

### Scott Sisson

University of New South Wales

Date: Thursday 18 August 2016

Symbolic data analysis is a fairly recent technique for the analysis of large and complex datasets based on summarising the data into a number of "symbols" prior to analysis. Inference is then based on the analysis of the data at the symbol level (modelling symbols, predicting symbols etc). In principle this idea works; however, it would be more advantageous and natural to fit models at the level of the underlying data, rather than the symbol. Here we develop a new class of models for the analysis of symbolic data that fit directly to the data underlying the symbol, allowing for a more intuitive and flexible approach to analysis using this technique.
Estimation of relatedness using low-depth sequencing data

### Ken Dodds

AgResearch

Date: Thursday 11 August 2016

Estimates of relatedness are used for traceability, parentage assignment, estimating genetic merit and for elucidating the genetic structure of populations. Relatedness can be estimated from large numbers of markers spread across the genome. A relatively new method of obtaining genotypes is to derive these directly from sequencing data. Often the sequencing protocol is designed to interrogate only a subset of the genome (but spread across the genome). One such method is known as genotyping-by-sequencing (GBS). A genotype consists of the pair of genetic types (alleles) at a particular position. Each sequencing read captures only one allele of the pair, so there is no guarantee that both alleles are seen, even when there are two or more reads at the position. A method of estimating relatedness which accounts for this feature of GBS data is given. The method depends on the number of reads (the depth) at a particular position and also accommodates zero reads (missing). The theory for the method, simulations and some applications to real data are presented, along with further related research questions.
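
To see why depth matters, the short Python sketch below computes the chance that both alleles of a heterozygous genotype are observed. It assumes, purely for illustration, that each read independently samples either allele with probability 1/2. The function name is hypothetical and this is not the relatedness estimator presented in the talk.

```python
def prob_both_alleles_seen(depth):
    """P(both alleles of a heterozygote appear among `depth` reads).

    Assumes each read independently shows either allele with probability 1/2,
    so the chance that all reads show the same allele is 2 * 0.5 ** depth.
    """
    if depth < 2:
        # With zero reads or one read, at most one allele can be observed.
        return 0.0
    return 1.0 - 0.5 ** (depth - 1)

# Probability of seeing both alleles at depths 0 to 5.
probs_by_depth = {k: prob_both_alleles_seen(k) for k in range(6)}
```

Under this assumption a heterozygote read at depth 2 looks homozygous half the time, which is why an estimator that ignores depth would overstate homozygosity.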
The replication "crisis" in psychology, medicine, etc.: what should we do about it?

### Jeff Miller

Department of Psychology

Date: Thursday 4 August 2016

Recent large-scale replication studies and meta-analyses suggest that about 50–95% of the positive “findings” reported in top scientific journals are false positives, and that this is true across a range of fields including Psychology, Medicine, Neuroscience, Genetics, and Physical Education. Some causes of this alarmingly high percentage are easily identified, but what is the appropriate cure? In this talk I describe a simple model of the research process that researchers can use to identify the optimal attainable percentage of false positives and to plan their experiments accordingly.
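
The abstract does not spell out the speaker's model. As a rough guide to how such percentages arise, the sketch below uses a standard textbook calculation: the share of significant results that are false positives, given the proportion of tested hypotheses that are true, the power of the tests, and the significance level. The function name and the input values are assumptions made for illustration only.

```python
def false_positive_proportion(prior_true, power, alpha=0.05):
    """Share of significant results that are false positives.

    prior_true: proportion of tested hypotheses that are actually true
    power:      probability of a significant result when the effect is real
    alpha:      probability of a significant result when there is no effect
    """
    true_positives = prior_true * power
    false_positives = (1.0 - prior_true) * alpha
    total = true_positives + false_positives
    if total == 0.0:
        raise ValueError("no significant results are possible with these inputs")
    return false_positives / total

# Hypothetical scenario: 10% of tested effects are real and power is 50%.
fpp = false_positive_proportion(prior_true=0.1, power=0.5, alpha=0.05)
```

In this calculation the proportion falls as power rises or as alpha is lowered, which is the kind of trade-off a researcher can weigh when planning an experiment.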
Do density-dependent processes structure biodiversity?

### Jon Waters

Department of Zoology

Date: Thursday 28 July 2016

New Zealand’s marine ecosystems have experienced rapid and dramatic changes over recent centuries. Notably, DNA comparisons of New Zealand’s archaeological versus modern pinniped and penguin assemblages have revealed sudden spatio-temporal genetic shifts, apparently in response to human-mediated extirpation events. These rapid biological changes in our marine environment apparently underscore the role of ‘founder takes all’ processes in shaping biogeographic distributions. Specifically, established high-density populations are seemingly able to exclude individuals dispersing from distant sources. Conversely, extirpation events, including those driven by human pressure and environmental change, can provide opportunities for range expansion of surviving lineages. The recent self-introductions of penguins and pinniped lineages from trans-oceanic sources highlight the dynamic biological history of coastal New Zealand.
Musings of a statistical consultant

### Tim Jowett

Department of Mathematics and Statistics

Date: Thursday 21 July 2016

I will talk about my role as a consultant statistician and briefly discuss some interesting applied statistics projects that I have been working on.
Quantitative genetics in the modern genomics era: new challenges, new opportunities

### Phil Wilcox

Department of Mathematics and Statistics

Date: Thursday 14 July 2016

Modern genomics technologies have led to vast amounts of data being generated on an increasingly wide range of species, particularly in non-human and non-model organisms. In this talk I will describe how these technological advances affect the training of students, and how we’ve responded by developing a new course offering in quantitative genetics. I will also describe some of the analytical challenges in my current research projects, and from the wider Virtual Institute of Statistical Genetics research programme, including opportunities for further statistical method development.
Project presentations

### Honours and PGDip students

Department of Mathematics and Statistics

Date: Friday 27 May 2016

STATISTICS
Michel de Lange : Deep learning
Georgia Anderson : Probabilistic linear discriminant analysis
Nick Gelling : Automatic differentiation in R

15-MINUTE BREAK 2.40-2.55

MATHEMATICS
Alex Blennerhassett : Toeplitz algebra of a directed graph
Zoe Luo : Wavelet models for evolutionary distance
Xueyao Lu : Making sense of the λ-coalescent
Terry Collins-Hawkins : Reactive diffusion in systems with memory
Josh Ritchie : Linearisation of hyperbolic constraint equations

Also
CJ Marland : Extending matchings of graphs: a survey
This one mathematics project presentation takes place at 12 noon on Thursday 26 May, room 241
The US obesity epidemic: evidence from the Economic Security Index

### Trent Smith

Department of Economics

Date: Thursday 26 May 2016

A growing body of research supports the "economic insecurity" theory of obesity, which posits that uncertainty with respect to one's material well-being may be an important root cause of the modern obesity epidemic. This literature has been limited in the past by a lack of reliable measures of economic insecurity. In this paper we use the newly developed Economic Security Index to explain changes in US adult obesity rates as measured by the National Health and Nutrition Examination Surveys (NHANES) from 1988-2012, a period capturing much of the recent rapid rise in obesity. We find a robust positive and statistically significant relationship between obesity and economic insecurity that holds for nearly every age, gender, and race/ethnicity group in our data, both in cross-section and over time.
Spline-based approach to infer farm vehicle trajectories

### Jerome Cao

Department of Mathematics and Statistics

Date: Thursday 19 May 2016

GPS units mounted on a vehicle record its position, speed and bearing. The time series of positions then represents the trajectory of the vehicle. Noisy measurements and infrequent sampling, however, mean simplistic trajectory reconstruction will have unrealistic features, like sharp turns. Smoothing spline methods can efficiently build smoother, more realistic trajectories. In a conventional smoothing spline, the objective function includes a term for errors in position and also a penalty term, which has a single parameter that controls the smoothness of reconstruction. An adaptive smoothing spline extends the single parameter to a function that varies in different domains and performs local smoothing. In this talk, I will introduce a tractor spline that incorporates both position and velocity information but penalizes excessive accelerations. The penalty term is also dependent on the operational status of the tractor. The objective function now includes a term for errors in velocity that is controlled by a new parameter and an adjusted penalty term for better control of trajectory curvature. We develop cross validation techniques to find three parameters of interest. A short discussion of the relationship between spline methods and Gaussian process regression will be given. A simulation study and real data example are presented to demonstrate the effectiveness of this new method.
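
For reference, the conventional smoothing spline mentioned in this abstract is usually written in the following textbook form. The symbols are introduced here for illustration and are not taken from the talk: $y_i$ is the position observed at time $t_i$, $f$ is the fitted trajectory, and $\lambda$ is the single smoothing parameter.

$$
\sum_{i=1}^{n} \bigl(y_i - f(t_i)\bigr)^2 + \lambda \int \bigl(f''(t)\bigr)^2 \, dt
$$

An adaptive smoothing spline replaces the constant $\lambda$ with a function $\lambda(t)$ placed inside the integral:

$$
\sum_{i=1}^{n} \bigl(y_i - f(t_i)\bigr)^2 + \int \lambda(t) \bigl(f''(t)\bigr)^2 \, dt
$$

The abstract does not give the exact tractor spline objective. From its description, it adds a term for errors in velocity, weighted by a further parameter, to a criterion of this kind and adjusts the penalty according to the operational status of the tractor.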
Bayes of Thrones: Bayesian prediction for Game of Thrones

### Richard Vale

WRUG

Date: Tuesday 17 May 2016

I will describe an analysis from 2014 of the appearances of characters in a popular series of novels. Treating the number of appearances of each character as longitudinal data, a mixed model can be used to make probabilistic predictions of their appearances in future novels. I will discuss how the model is fitted and how its predictions should be evaluated.
Designing a pilot study using adaptive DP-optimality

### Stephen Duffull

School of Pharmacy

Date: Thursday 12 May 2016

Managing the balance between developing a life-threatening blood clot and the risk of major bleeding is a complicated clinical problem. When patients are at risk of a blood clot it is usual clinical practice to administer an anticoagulant, in this case enoxaparin, to reduce this risk. Anticoagulants, however, increase the risk of a major bleed. To help reduce the risks it has become common clinical practice to measure a biomarker that provides a measure of the overall risk profile. In the absence of a suitable biomarker the clinician is essentially wearing a blindfold. Recently a mathematical model has been developed that provides a description of the coagulation processes in the human body. This model was used to identify a target that may provide a suitable biomarker of the risk for enoxaparin treatment. While the model is reasonably complicated (77 ODEs) it is not able to describe the exact analytical conditions that are necessary to establish the experimental conditions from which to develop a suitable biomarker. In this work, an adaptive DP-optimal design method is used in conjunction with a simplified coagulation model to define the experimental conditions that can be used in clinical practice. These conditions were then later tested in a clinical study and found to perform well.
Decoupled shrinkage and selection for Gaussian graphical models

### Beatrix Jones

Massey University

Date: Thursday 5 May 2016

Even when a Bayesian analysis has been carefully constructed to encourage sparsity, conventional posterior summaries with good predictive properties (e.g. the posterior mean) are typically not sparse. An approach called Decoupled Shrinkage and Selection (DSS), which uses a loss function that penalizes both poor fit and model complexity, has been used to address this problem in regression. This talk extends that approach to Gaussian graphical models. In this context, DSS not only provides a sparse summary of a posterior sample of graphical models, but allows us to obtain a sparse graphical structure that summarises the posterior even when the (inverse) covariance model fit is not a graphical model at all. This potentially offers huge computational advantages. We will examine simulated cases where the true graph is non-decomposable, a posterior is computed over decomposable graphical models, and DSS is able to recover the true non-decomposable structure. We will also consider computing the initial posterior based on a Bayesian factor model, and then recovering the graph structure using DSS. Finally, we illustrate the approach by creating a graphical model of dependencies across metabolites in a metabolic profile: in this case, a data set from the literature containing simultaneous measurements of 151 metabolites (small molecules in the blood), for 1020 subjects.
Joint work with Gideon Bistricer (Massey Honors Student), Carlos Carvalho (U Texas) and Richard Hahn (U Chicago)
Is research methodology a latent subject of data science?

### Ben Daniel

HEDC

Date: Thursday 28 April 2016

As a field of study, research methodology is concerned with the utilisation of systematic approaches and procedures to investigate well-defined problems, underpinned by particular epistemological and ontological orientations. For many years, research methodology has occupied a central role in postgraduate education, with courses taught at all levels and across a wide range of disciplinary contexts.

In this seminar, I will first present findings from a large scale research project examining the concept of research methodology among academic staff involved in teaching methods courses from 139 universities in 9 countries. I will then discuss how this has ultimately influenced the way academics relate to and approach teaching of the subject.

Secondly, I will share key findings from another study aimed at exploring postgraduate students’ views on the value of research methodology and outline the challenges they face in learning the subject. To conclude, I will address the question of whether the recognition of research methodology as an independent field of study within data science can contribute to a better understanding of current and future challenges associated with the increasing availability of data from vast interconnected and loosely coupled systems within the higher education sector.
Focussed model averaging in GLMs

### Chuen Yen Hong

Department of Mathematics and Statistics

Date: Thursday 21 April 2016

Parameter estimation is often based on a single model, usually chosen by a model selection process. However, this ignores the uncertainty in model selection. Model averaging takes this uncertainty into account. In the frequentist framework, the model-averaged point estimate is a weighted mean of the estimates obtained from each model. The weights are often based on an information criterion such as AIC or BIC, but can also be chosen to minimize an estimate of the mean squared error of the model-averaged estimator. The latter type of weight is focussed on the parameter of interest. We present an approach for deriving focussed weights for generalised linear models (GLMs), and compare it with existing approaches.
Development of a next generation genetic evaluation system for the New Zealand sheep industry

### Benoit Auvray

Department of Mathematics and Statistics

Date: Thursday 14 April 2016

The Otago Quantitative Genetics Group, part of the Mathematics and Statistics Department of the University of Otago, is working in collaboration with Beef+Lamb New Zealand Genetics and other organisations to develop a new genetic evaluation system for New Zealand sheep. The new system will seamlessly combine traditional genetic evaluation data, typically millions of pedigree and phenotype records, along with new DNA data, that may include billions of data points, to give estimates of animal genetic merit for a wide variety of economically important traits, for selection and breeding purposes.

In this seminar, we will present the work undertaken by our group and compare it with the existing genetic evaluation system.
Real-time updating for decision-making in emergency response to outbreaks of foot-and-mouth disease

### Will Probert

University of Nottingham

Date: Thursday 7 April 2016

During infectious disease outbreaks there may be uncertainty regarding both the extent of the outbreak and the optimality of alternative control interventions. As an outbreak progresses, and information accrues, so too does the level of confidence upon which decisions regarding control response are based. However, the longer the delay in making a decision the larger the potential opportunity cost of inaction.

We examine this trade-off using data from the UK 2001 outbreak of foot-and-mouth disease (FMD) by fitting a dynamic epidemic model to the observed infection data available at several points throughout each outbreak and compare forward simulations of the impact of alternative culling and vaccination interventions. For comparison, we repeat these forward simulations at each time point using the model fitted to data from the complete outbreak.

Results illustrate the impact of the accrual of knowledge on both model predictions and on the evaluation of candidate control actions, and highlight the importance of control policies that permit both rapid response and adaptive updating of control actions in response to additional information.
Mixed graphical models with applications to integrative cancer genomics

### Genevera Allen

Rice University, Texas

Date: Thursday 24 March 2016

"Mixed data" comprising a large number of heterogeneous variables (e.g. count, binary, continuous, skewed continuous, among others) is prevalent in varied areas such as imaging genetics, national security, social networking, Internet advertising, and our particular motivation - high-throughput integrative genomics. There have been limited efforts at statistically modeling such mixed data jointly. In this talk, we address this by introducing several new classes of Markov Random Fields (MRFs), or graphical models, that yield joint densities which directly parameterize dependencies over mixed variables. To begin, we present a novel class of MRFs arising when all node-conditional distributions follow univariate exponential family distributions that, for instance, yield novel Poisson graphical models. Next, we present several new classes of Mixed MRF distributions built by assuming each node-conditional distribution follows a potentially different exponential family distribution. Fitting these models and using them to select the mixed graph in high-dimensional settings can be achieved via penalized conditional likelihood estimation that comes with strong statistical guarantees. Simulations as well as an application to integrative cancer genomics demonstrate the versatility of our methods.
Joint work with Eunho Yang, Pradeep Ravikumar, Zhandong Liu, Yulia Baker, and Ying-Wooi Wan
Some interesting challenges in finite mixture and extreme modelling

### Kate Lee

Auckland University of Technology

Date: Wednesday 23 March 2016

Note day and time; not the usual
The goal of Bayesian inference is to infer a parameter and a model in a Bayesian setup. In this talk I will discuss some well-known problems in finite mixture and extreme modelling, and I will present my recent work.
The finite mixture model is a flexible tool for modelling multimodal data and has been used in many applications in statistical analysis. The model evidence is often approximated, which is computationally demanding because of a well-known lack of identifiability. I will present the dual importance sampling scheme to meet the demands of evidence approximation and show how to reduce the computational workload. Lastly, an extreme event is often described by modelling exceedances over a threshold, and the threshold value plays a key role in the statistical inference. I will demonstrate that a suitable threshold can be determined using the Bayesian measure of surprise and that this approach is easily implemented for both univariate and multivariate extremes.
Improved estimation of intrinsic growth $r_{max}$ for long-lived species: mammals, birds, and sharks

### Peter Dillingham

University of New England, New South Wales

Date: Tuesday 22 March 2016

Note day and time; not the usual
Intrinsic population growth rate ($r_{max}$) is an important parameter for many ecological applications, such as population risk assessment and harvest management. However, $r_{max}$ can be a difficult parameter to estimate, particularly for long-lived species, for which appropriate life table data or abundance time series are typically not obtainable. We developed a method for improving estimates of $r_{max}$ for long-lived species by integrating life-history theory (allometric models) and population-specific demographic data (life table models). Broad allometric relationships, such as those between life history traits and body size, have long been recognized by ecologists. These relationships are useful for deriving theoretical expectations for $r_{max}$, but $r_{max}$ for real populations varies from simple allometric estimators for “archetypical” species of a given taxon or body mass. Meanwhile, life table approaches can provide population-specific estimates of $r_{max}$ from empirical data, but these may have poor precision because of imprecise and missing vital rate parameter estimates. By integrating the two approaches, we provide estimates that are consistent with both life-history theory and population-specific empirical data. Ultimately, this yields estimates of $r_{max}$ that are likely to be more robust than estimates provided by either method alone.
How to put together data science and sports science to understand expertise in sport

### Ludovic Seifert

University of Rouen

Date: Thursday 17 March 2016

Expertise is mostly analysed in terms of performance outcomes through speed, accuracy, and economy criteria. But understanding expertise goes beyond the questions “how fast can you swim?”, “how far can you jump?” or “how fluently can you climb?”. Adaptability - the capacity of an expert to modify their behaviour in response to subtle modifications in the constraints acting on them - might also be a key concept to investigate (Seifert, Button, & Davids, 2013). According to Newell (1986), individuals interact with three types of constraints: environmental (e.g. wind, wave, temperature), task (e.g. required speed or frequency) and organismic (e.g. impact of size, shape and density of the body and its segments). By artificially generating perturbation during practice, we can explore how performers adapt their limb movements and limb coordination pattern to constraints, brought about by a subtle blend between behavioural stability and flexibility. Stability corresponds to the capability and the time an individual takes to resist a perturbation or to recover his initial motor behaviour after perturbation (Seifert et al., 2014). Flexibility relates to the fluctuations within a coordinative pattern to continually adapt to a given set of constraints (Davids, Araújo, Seifert, & Orth, 2015). Adaptability corresponds to the ratio between behavioural stability and flexibility, in the sense that an adaptive performer is stable and flexible when required, supporting functional movement and coordination variability. The aim of this talk is to present how data sciences such as data mining and machine learning can help to examine behavioural dynamics and variability within and between individuals in order to understand expertise in sport.

Davids, K., Araújo, D., Seifert, L., & Orth, D. (2015). Expert performance in sport: An ecological dynamics perspective. In J. Baker & D. Farrow (Eds.), Handbook of Sport Expertise (pp. 273–303). London, UK: Taylor & Francis.
Newell, K. M. (1986). Constraints on the development of coordination. In M. G. Wade & H. T. A. Whiting (Eds.), Motor development in children. Aspects of coordination and control (pp. 341–360). Dordrecht, Netherlands: Martinus Nijhoff.
Seifert, L., Button, C., & Davids, K. (2013). Key properties of expert movement systems in sport: an ecological dynamics perspective. Sports Medicine, 43(3), 167–78.
Seifert, L., Komar, J., Barbosa, T., Toussaint, H., Millet, G., & Davids, K. (2014). Coordination pattern variability provides functional adaptations to constraints in swimming performance. Sports Medicine, 44(10), 1333–45.