Te Tari Pāngarau me te Tatauranga
Department of Mathematics & Statistics

Archived seminars in Statistics

Thinking statistically when constructing genetic maps

Timothy Bilton

Department of Mathematics and Statistics

Date: Thursday 14 September 2017

A genetic linkage map shows the relative position of and genetic distance between genetic markers, positions of the genome which exhibit variation, and underpins the study of species' genomes in a number of scientific applications. Genetic maps are constructed by tracking the transmission of genetic information from individuals to their offspring, which is frequently modelled using a hidden Markov model (HMM) since only the expression and not the transmission of genetic information is observed. However, data generated using the latest sequencing technology often results in partially observed information, which, if unaccounted for, typically results in inflated estimates. Most approaches to circumvent this issue involve a combination of filtering and correcting individual data points using ad-hoc methods. Instead, we develop a new methodology that models the partially observed information by incorporating an additional layer of latent variables into the HMM. Results show that our methodology is able to produce accurate genetic map estimates, even in situations where a large proportion of the data is only partially observed.
Network tomography for integer valued traffic

Martin Hazelton

Massey University

Date: Thursday 7 September 2017

Volume network tomography is concerned with inference about traffic flow characteristics based on traffic measurements at fixed locations on the network. The quintessential example is estimation of the traffic volume between any pair of origin and destination nodes using traffic counts obtained from a subset of the links of the network. The data provide only indirect information about the target variables, generating a challenging type of statistical linear inverse problem.

In this talk I will discuss network tomography for a rather general class of traffic models. I will describe some recent progress on model identifiability. I will then discuss the development of effective MCMC samplers for simulation-based inference, based on insight provided by an examination of the geometry of the space of feasible route flows.
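As a toy illustration of why link counts give only indirect information about route flows, the sketch below enumerates every non-negative integer flow vector x that satisfies y = Ax for a tiny network. The incidence matrix `A`, the counts `y`, and the brute-force search are all invented for illustration and are not taken from the talk.

```python
from itertools import product

# Hypothetical link-route incidence matrix: rows are monitored links,
# columns are routes. A[i][j] = 1 if route j uses link i.
A = [
    [1, 1, 0],  # link 0 carries routes 0 and 1
    [0, 1, 1],  # link 1 carries routes 1 and 2
]
y = [5, 4]  # hypothetical observed link counts


def feasible_flows(A, y):
    """Enumerate non-negative integer x with A x = y by brute force."""
    n_routes = len(A[0])
    # Valid bound here because every route uses at least one monitored link.
    upper = max(y)
    solutions = []
    for x in product(range(upper + 1), repeat=n_routes):
        if all(sum(a * xi for a, xi in zip(row, x)) == yi
               for row, yi in zip(A, y)):
            solutions.append(x)
    return solutions


flows = feasible_flows(A, y)
for x in flows:
    print(x)
```

Two link counts are consistent with several distinct flow vectors, which is the identifiability issue in miniature. A simulation-based sampler would need to move among such feasible points; brute-force enumeration is only workable at this toy scale.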
TensorFlow: a short intro

Lech Szymanski

Department of Computer Science

Date: Thursday 31 August 2017

TensorFlow is an open source software library for numerical computation. Its underlying paradigm of computation uses data flow graphs, which allow for automatic differentiation and effortless deployment that parallelises across CPUs or GPUs. I have been working in TensorFlow for about a year now, using it to build and train deep learning models for image classification. In this talk I will give a brief introduction to TensorFlow as well as share some of my experiences of working with it. I will try to make this talk not about deep learning with TensorFlow, but rather about TensorFlow itself, which I happen to use for deep learning.
Theory and application of latent variable models for multivariate binomial data

John Holmes

Department of Mathematics and Statistics

Date: Thursday 24 August 2017

A large body of work has been devoted to developing latent variable models for exponential family distributed multivariate data exhibiting interdependencies. For the binomial case, however, extensions of models beyond the analysis of binary data are almost entirely missing. Focusing on principal component/factor analysis representations, we will show that under the canonical logit link, latent variable models can be fitted in closed form, via Gibbs sampling, to multivariate binomial data of arbitrary trial size, by applying Pólya-gamma augmentation to the binomial likelihood. In this talk, the properties of binomial latent variable models under Pólya-gamma data augmentation will be discussed from both a theoretical perspective and through application to a range of simulated and real demographic datasets.
Māori student success: Findings from the Graduate Longitudinal Study New Zealand

Moana Theodore

Department of Psychology

Date: Thursday 17 August 2017

Māori university graduates are role models for educational success and important for the social and economic wellbeing of Māori whānau (extended family), communities and society in general. Describing their experiences can help to build an evidence base to inform practice, decision-making and policy. I will describe findings for Māori graduates from all eight New Zealand universities who are participants in the Graduate Longitudinal Study New Zealand. Data were collected when the Māori participants were in their final year of study in 2011 (n=626) and two years post-graduation in 2014 (n=455). First, I will focus on what Māori graduates describe as helping or hindering the completion of their qualifications, including external (e.g. family), institutional (e.g. academic support) and student/personal (e.g. persistence) factors. Second, I will describe Māori graduate outcomes at 2 years post-graduation. In particular, I will describe the private benefits of higher education, such as labour market outcomes (e.g. employment and income), as well as the social benefits such as civic participation and volunteerism. Overall, our findings suggest that boosting higher education success for Māori may reduce ethnic inequalities in New Zealand labour market outcomes and may impart substantial social benefits as a result of Māori graduates’ contribution to society.
Bayes factors, priors and mixtures

Matthew Schofield

Department of Mathematics and Statistics

Date: Thursday 10 August 2017

It is well known that Bayes factors are sensitive to the prior distribution chosen on the parameters. This has led to comments such as “Diffuse prior distributions ... must be used with care” (Robert 2014) and “We do not see Bayesian methods as generally useful for giving the posterior probability that a model is true, or the probability for preferring model A over model B” (Gelman and Shalizi 2013). We consider the calculation of Bayes factors for nested models. We show this is equivalent to a model with a mixture prior distribution, where the weights on the resulting posterior are related to the Bayes factor. These results allow us to directly compare Bayes factors to shrinkage priors, such as the Laplace prior used in the Bayesian lasso. We use these results as the basis for offering practical suggestions for estimation and selection in nested models.
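The prior sensitivity described above can be seen in a minimal sketch, assuming a textbook normal-mean setting that is not taken from the talk: a sample mean `ybar` with known variance `sigma2`, a null model mu = 0, and an alternative mu ~ N(0, tau2). The function names and the data summary are hypothetical. The second function gives the standard posterior weight on the point mass when the two models are combined into a single mixture prior.

```python
import math


def normal_pdf(x, var):
    """Density of a zero-mean normal with variance `var` at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def bf01(ybar, n, sigma2, tau2):
    """Bayes factor for H0: mu = 0 against H1: mu ~ N(0, tau2)."""
    se2 = sigma2 / n  # sampling variance of the mean
    # Under H1 the marginal variance of ybar is se2 + tau2.
    return normal_pdf(ybar, se2) / normal_pdf(ybar, se2 + tau2)


def posterior_weight_null(bf, prior_null=0.5):
    """Posterior weight on the point mass at zero in the mixture prior."""
    odds = bf * prior_null / (1.0 - prior_null)
    return odds / (1.0 + odds)


ybar, n, sigma2 = 0.4, 25, 1.0  # hypothetical data summary
for tau2 in (1.0, 10.0, 100.0, 1000.0):
    bf = bf01(ybar, n, sigma2, tau2)
    w = posterior_weight_null(bf)
    print(f"tau2={tau2:8.1f}  BF01={bf:8.3f}  weight on null={w:.3f}")
```

With the data held fixed, making the prior under the alternative more diffuse pushes the Bayes factor towards the null, which is the behaviour behind the warning about diffuse priors.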
Development and implementation of culturally informed guidelines for medical genomics research involving Māori communities

Phil Wilcox

Department of Mathematics and Statistics

Date: Thursday 3 August 2017

Medical genomic research is usually conducted within a ‘mainstream’ cultural context. Māori communities have been underrepresented in such research despite being impacted by heritable diseases and other conditions that could potentially be unravelled via modern genomic technologies. Reasons for low participation of Māori communities include negative experiences of genomics and genetics researchers, such as the notorious ‘Warrior Gene’ saga, and an unease with technologies that are often implemented by non-Māori researchers in a manner inconsistent with Māori values. In my talk I will describe recently developed guidelines for ethically appropriate genomics research with Māori communities; how these guidelines were informed by my iwi, Ngāti Rakaipaaka, who had previously been involved in a medical genomics investigation; and current efforts to complete that research via a partnership with Te Tari Pāngarau me Tātauranga ki Te Whare Wānaka o Otakou (Department of Mathematics and Statistics at the University of Otago).
Who takes Statistics? A look at student composition, 2000-2016

Peter Dillingham

Department of Mathematics and Statistics

Date: Thursday 27 July 2017

In this blended seminar and discussion, we will examine how student data can help inform curriculum development and review, focussing on the Statistics programme as an example. Currently, the Statistics academic staff are reviewing our programme to ensure that we continue to provide a high quality and modern curriculum that meets the needs of students. An important component of this process is to understand who our students are and what they are interested in, from first-year service teaching through to students majoring in statistics. As academics, we often have a reasonable answer to these questions, but we can be more specific by poring over student data. While not glamorous, this sort of data can help confirm those things we think we know, identify opportunities or risks, and help answer specific questions where we know that we don’t know the answer.
A missing value approach for breeding value estimation

Alastair Lamont

Department of Mathematics and Statistics

Date: Thursday 20 July 2017

A key goal in quantitative genetics is the identification and selective breeding of individuals with high economic value. For a particular trait, an individual’s breeding value is the genetic worth it has for its progeny. While methods for estimating breeding values have existed since the middle of last century, the march of technology now allows the genotypes of individuals to be directly measured. This additional information allows for improved breeding value estimation, supplementing observed measurements and known pedigree information. However, while it can be cost efficient to genotype some animals, it is infeasible to genotype every individual in most populations of interest, due to either cost or logistical issues. As such, any approach must be able to accommodate missing data, while also managing computational efficiency, as the dimensionality of data can be immense. Most modern approaches tend to impute or average over the missing data in some fashion, rather than fully incorporating it into the model. These approximations lead to a loss in estimation accuracy. Similar models are used within human genetics, but for different purposes. With different data and different goals to quantitative genetics, these approaches natively include missing data within the model. We are developing an approach which utilises a human genetics framework, but adapted so as to estimate breeding values.
Assessing and dealing with imputation inaccuracy in genomic predictions

Michael Lee

Department of Mathematics and Statistics

Date: Thursday 13 July 2017

Genomic predictions rely on having genotypes from high density SNP Chips from many individuals. Many national animal evaluations, to predict breeding values, may include millions of animals, where an increasing proportion of these have genotype information. Imputation can be used to make genomic predictions more cost effective. For example, in the NZ Sheep industry genomic predictions can be done by genotyping animals with a SNP Chip of lower density (e.g. 5-15K) and imputing the genotypes for a given animal to a density of about 50K, where the imputation process needs a reference panel of 50K genotypes. The imputed genotypes are used in genomic predictions and the accuracy of imputation is a function of the quality of the reference panel. A study to assess the imputation accuracy of a wide range of animals was undertaken. The goal was to quantify the levels of inaccuracy and to determine a best strategy to deal with this inaccuracy in the context of single step genomic best linear unbiased prediction (ssGBLUP).
Project presentations

Honours and PGDip students

Department of Mathematics and Statistics

Date: Friday 2 June 2017

Jodie Buckby : Model checking for hidden Markov models
Jie Kang : Model averaging for renewal process
Yu Yang : Robustness of temperature reconstruction for the past 500 years

Fergus O'Leary : Stochastic spatial population models
Rachael Young : SIR epidemic models on networks
Sam Bremer : An effective model for particle distribution in groundwaters
Joshua Mills : Hyperbolic equations and finite difference schemes
Yiwen Qi : The heat equation and Brownian motion
Vijay Surada : Modelling the self-thinning rule
Twists and trends in exercise science

Jim Cotter

School of Physical Education, Sport and Exercise Sciences

Date: Thursday 1 June 2017

From my perspective, exercise science is entering an age of enlightenment, but misuse of statistics remains a serious limitation to its contributions and progress for human health, performance, and basic knowledge. This seminar will summarise our recent and current work in hydration, heat stress and patterns of dosing/prescribing exercise, and the implications for human health and performance. These contexts will be used to discuss methodological issues, including those of research design, analysis and interpretation.
Hidden Markov models for incompletely observed point processes

Amina Shahzadi

Department of Mathematics and Statistics

Date: Thursday 25 May 2017

Natural phenomena such as earthquakes and volcanic eruptions can cause catastrophic damage. Such phenomena can be modelled using point processes. However, this is complicated and potentially biased by the problem of missing data in the records. The degree of completeness of volcanic records varies dramatically over time. Often the older the record is, the more incomplete it is. One way to handle such records with missing data is to use hidden Markov models (HMMs). An HMM is a two-layered process based on an observed process and an unobserved first-order stationary Markov chain with the state duration geometrically distributed. This limits the application of HMMs in the field of volcanology, where the processes leading to missed observations do not necessarily behave in a memoryless and time-independent manner. We propose inhomogeneous hidden semi-Markov models (IHSMMs) to investigate the time-inhomogeneity of the completeness of volcanic eruption catalogues and to obtain reliable hazard estimates.
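The geometric state duration mentioned above follows directly from the Markov property. A minimal sketch, assuming a hypothetical self-transition probability `p_stay` (the value is illustrative, not from the talk): the chain stays k steps in a state with probability p_stay^(k-1) (1 - p_stay), and the chance of staying one more step never depends on how long the state has already lasted.

```python
def sojourn_pmf(p_stay, k):
    """P(a state is occupied for exactly k consecutive steps), k >= 1."""
    return p_stay ** (k - 1) * (1.0 - p_stay)


def survival(p_stay, k):
    """P(duration > k): the chain has stayed for k steps in a row."""
    return p_stay ** k


p_stay = 0.9  # hypothetical self-transition probability
ks = range(1, 2001)
pmf = [sojourn_pmf(p_stay, k) for k in ks]
total = sum(pmf)
mean_duration = sum(k * q for k, q in zip(ks, pmf))

# Memorylessness: P(duration > k + 1 | duration > k) equals p_stay for all k.
conditional = [survival(p_stay, k + 1) / survival(p_stay, k) for k in (1, 5, 50)]
print(total, mean_duration, conditional)
```

A semi-Markov model replaces this implied geometric duration with an explicit duration distribution, which is what removes the memoryless restriction.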

Jin Zhang

Department of Accountancy and Finance

Date: Thursday 18 May 2017

The CBOE SKEW is an index launched by the Chicago Board Options Exchange (CBOE) in February 2011. Its term structure tracks the risk-neutral skewness of the S&P 500 (SPX) index for different maturities. In this paper, we develop a theory for the CBOE SKEW by modelling SPX using a jump-diffusion process with stochastic volatility and stochastic jump intensity. With the term structure data of VIX and SKEW, we estimate model parameters and obtain the four processes of variance, jump intensity and their long-term mean levels. Our results can be used to describe SPX risk-neutral distribution and to price SPX options.
Finding true identities in a sample using MCMC methods

Paula Bran

Department of Mathematics and Statistics

Date: Thursday 11 May 2017

Uncertainty about the true identities behind observations is known in statistics as a misidentification problem. The observations may be duplicated, wrongly reported or missing, which results in error-prone data collection. This error can seriously affect the inferences and conclusions. A wide variety of MCMC algorithms have been developed for simulating the latent identities of individuals in a dataset using Bayesian inference. In this talk, the DIU (Direct Identity Updater) algorithm is introduced. It is a Metropolis-Hastings sampler with an application-specific proposal density. Its performance and efficiency are compared with those of two other algorithms solving similar problems. Convergence to the correct stationary distribution is discussed using a toy example in which the data comprise genotypes that include uncertainty. As the state space is small, the behaviour of the chains is easily visualized. Interestingly, while they converge to the same stationary distribution, the transition matrices for the different algorithms have little in common.
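The closing observation, that different samplers can share a stationary distribution while having very different transition matrices, can be checked by hand on a small state space. The sketch below is not the DIU algorithm: it compares a generic Metropolis-Hastings sampler with a uniform proposal against a sampler that proposes directly from the target, using a made-up three-state target `pi`.

```python
def mh_matrix(pi):
    """Transition matrix of Metropolis-Hastings with a uniform proposal
    over the other states (a symmetric proposal)."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Propose j with probability 1/(n-1), accept with min(1, ratio).
                P[i][j] = min(1.0, pi[j] / pi[i]) / (n - 1)
        P[i][i] = 1.0 - sum(P[i])  # rejected proposals stay put
    return P


def independence_matrix(pi):
    """Transition matrix of a sampler that proposes directly from pi,
    so every proposal is accepted and every row equals pi."""
    return [list(pi) for _ in pi]


def step(dist, P):
    """Advance a distribution one step: returns dist multiplied by P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


pi = [0.5, 0.3, 0.2]  # hypothetical target over three latent identities
P_mh = mh_matrix(pi)
P_ind = independence_matrix(pi)
print(P_mh)
print(P_ind)
print(step(pi, P_mh), step(pi, P_ind))
```

Both matrices leave `pi` unchanged after one step, yet their entries differ throughout, which is the kind of comparison a small state space makes easy to visualise.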
Correlated failures in multicomponent systems

Richard Arnold

Victoria University Wellington

Date: Thursday 4 May 2017

Multicomponent systems may experience failures with correlations amongst failure times of groups of components, and some subsets of components may experience common cause, simultaneous failures. We present a novel, general approach to model construction and inference in multicomponent systems incorporating these correlations in an approach that is tractable even in very large systems. In our formulation the system is viewed as being made up of Independent Overlapping Subsystems (IOS). In these systems components are grouped together into overlapping subsystems, and further into non-overlapping subunits. Each subsystem has an independent failure process, and each component's failure time is the time of the earliest failure in all of the subunits of which it is a part.

This is joint work with Stefanka Chukova (VUW) and Yu Hayakawa (Waseda University, Tokyo)
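The earliest-failure rule in the abstract can be sketched with a small simulation. This is a deliberate simplification rather than the authors' formulation: it collapses the distinction between subsystems and subunits, and assumes exponential failure times. The membership map `subsystems` and the `rates` are hypothetical.

```python
import random

# Hypothetical system: which components belong to which subsystem.
subsystems = {
    "S1": {"a", "b"},
    "S2": {"b", "c"},
    "S3": {"c"},
}
rates = {"S1": 1.0, "S2": 0.5, "S3": 2.0}  # assumed exponential failure rates


def simulate_component_failures(subsystems, rates, rng):
    """Draw one independent failure time per subsystem, then give each
    component the earliest failure time among the subsystems containing it."""
    sub_times = {s: rng.expovariate(rates[s]) for s in subsystems}
    components = set().union(*subsystems.values())
    comp_times = {
        c: min(t for s, t in sub_times.items() if c in subsystems[s])
        for c in sorted(components)
    }
    return sub_times, comp_times


rng = random.Random(1)
sub_times, comp_times = simulate_component_failures(subsystems, rates, rng)
print(sub_times)
print(comp_times)
```

Because components "b" and "c" share subsystem S2, they fail at exactly the same instant whenever S2 is the first of their subsystems to fail, which is how overlap produces correlated and simultaneous failures from independent processes.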
Integration of IVF technologies with genomic selection to generate high merit AI bulls: a simulation study

Fiona Hely


Date: Thursday 27 April 2017

New reproductive technologies such as genotyping of embryos prior to cloning and IVF allow the possibility of targeting elite AI bull calves from high merit sires and dams. A stochastic simulation model was set up to replicate both progeny testing and genomic selection dairy genetic improvement schemes with and without the use of IVF to generate bull selection candidates. The reproductive process was simulated using a series of random variates to assess the likelihood of a given cross between a selected sire and dam producing a viable embryo, and the superiority of these viable bulls was assessed from the perspective of a commercial breeding company.
Recovery and recolonisation by New Zealand southern right whales: making the most of limited sampling opportunities

Will Rayment

Department of Marine Science

Date: Thursday 13 April 2017

Studies of marine megafauna are often logistically challenging, thus limiting our ability to gain robust insights into the status of populations. This is especially true for southern right whales, a species which was virtually extirpated in New Zealand waters by commercial whaling in the 19th century, and restricted to breeding around the remote sub-Antarctic Auckland Islands. We have gathered photo-ID and distribution data during annual 3-week duration trips to study right whales at the Auckland Islands since 2006. Analysis of the photo-ID data has yielded estimates of demographic parameters including survival rate and calving interval, essential for modelling the species’ recovery, while species-distribution models have been developed to reveal the specific habitat preferences of calving females. These data have been supplemented by visual and acoustic autonomous monitoring, in order to investigate seasonal occurrence of right whales in coastal habitats. Understanding population recovery, and potential recolonization of former habitats around mainland New Zealand, is essential if the species is to be managed effectively in the future.
Ion-selective electrode sensor arrays: calibration, characterisation, and estimation

Peter Dillingham

Department of Mathematics and Statistics

Date: Thursday 6 April 2017

Ion-selective electrodes (ISEs) have undergone a renaissance over the last 20 years. New fabrication techniques, which allow mass production, have led to their increasing use in demanding environmental and health applications. These deployable low-cost sensors are now capable of measuring sub-micromolar concentrations in complex and variable solutions, including blood, sweat, and soil. However, these measurement challenges have highlighted the need for modern calibration techniques to properly characterise ISEs and report measurement uncertainty. In this talk, our group’s developments will be discussed, with a focus on modelling ISEs, properly defining the limit of detection, and extensions to sensor arrays.
What in the world caused that? Statistics of sensory spike trains and neural computation for inference

Mike Paulin

Department of Zoology

Date: Thursday 30 March 2017

Before the “Cambrian explosion” 542 million years ago, animals without nervous systems reacted to environmental signals mapped onto the body surface. Later animals constructed internal maps from noisy partial observations gathered at the body surface. Considering the energy costs of data acquisition and inference versus the costs of not doing this in late Precambrian ecosystems leads us to model spike trains recorded from sensory neurons (in sharks, frogs and other animals) as samples from a family of Inverse Gaussian-censored Poisson, a.k.a. Exwald, point-processes. Neurons that evolved for other reasons turn out to be natural mechanisms for generating samples from Exwald processes, and natural computers for inferring the posterior density of their parameters. This is a consequence of a curious correspondence between the likelihood function for sequential inference from a censored Poisson process and the impulse response function of a neuronal membrane. We conclude that modern animals, including humans, are natural Bayesians because when neurons evolved 560 million years ago they provided our ancestors with a choice between being Bayesian or being dead.
This is joint work with recent Otago PhD students Kiri Pullar and Travis Monk, honours student Ethan Smith, and UCLA neuroscientist Larry Hoffman.
Brewster Glacier - a benchmark for investigating glacier-climate interactions in the Southern Alps of New Zealand

Nicolas Cullen

Department of Geography

Date: Thursday 23 March 2017

The advance of some fast-responding glaciers in the Southern Alps of New Zealand at the end of the 20th and beginning of the 21st century during three of the warmest decades of the instrumental era provides clear evidence that changes in large-scale atmospheric circulation in the Southern Hemisphere can act as a counter-punch to global warming. The Southern Alps are surrounded by vast areas of ocean and are strongly influenced by both subtropical and polar air masses, with the interaction of these contrasting air masses in the prevailing westerly airflow resulting in weather systems having a strong influence on glacier mass balance. Until recently, one of the challenges in assessing how large-scale atmospheric circulation influences glacier behaviour has been the lack of observational data from high-elevation sites in the Southern Alps. However, high-quality meteorological and glaciological observations from Brewster Glacier allow us to now assess in detail how atmospheric processes at different scales influence glacier behaviour. This talk will provide details about the observational programme on Brewster Glacier, which has been continuous for over a decade, and then target how weather systems influence daily ablation and precipitation (snowfall).
Estimating overdispersion in sparse multinomial data

Farzana Afroz

Department of Mathematics and Statistics

Date: Thursday 16 March 2017

When overdispersion is present in a data set, ignoring it may lead to serious underestimation of standard errors and potentially misleading model comparisons. Generally we estimate the overdispersion parameter $\phi$ by dividing Pearson's goodness-of-fit statistic $X^2$ by the residual degrees of freedom. But when the data are sparse, that is, when there are many zero or small counts, it may not be reasonable to use this statistic since $X^2$ is unlikely to be $\chi^2$-distributed. This study presents a comparison of four estimators of the overdispersion parameter $\phi$, in terms of bias, root mean squared error and standard deviation, when the data are sparse and multinomial. Dead recovery data on Herring gulls from Kent Island, Canada are used to provide a practical example of sparse multinomial data. In a simulation study, we consider the Dirichlet-multinomial and finite mixture distributions, which are widely used to model extra variation in multinomial data.
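The usual estimator described above is short enough to write out. In this minimal sketch the observed and expected counts and the residual degrees of freedom are made up for illustration; in practice the expected counts come from the fitted model and the degrees of freedom from its parameter count.

```python
def pearson_x2(observed, expected):
    """Pearson's goodness-of-fit statistic: sum of (O - E)^2 / E over cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def pearson_phi(observed, expected, resid_df):
    """Estimate the overdispersion parameter phi as X^2 / residual df."""
    return pearson_x2(observed, expected) / resid_df


# Hypothetical sparse cell counts and fitted expected counts.
observed = [12, 3, 0, 1, 4]
expected = [10.0, 4.0, 2.0, 2.0, 2.0]
resid_df = 3  # assumed residual degrees of freedom for this illustration

x2 = pearson_x2(observed, expected)
phi_hat = pearson_phi(observed, expected, resid_df)
print(x2, phi_hat)
```

Note how the cells with small expected counts dominate the statistic: the two cells with expected count 2 and observed counts 0 and 4 contribute most of its value. That instability is why this estimator is questionable for sparse data.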
Fast computation of spatially adaptive kernel smooths

Tilman Davies

Department of Mathematics and Statistics

Date: Thursday 9 March 2017

Kernel smoothing of spatial point data can often be improved using an adaptive, spatially-varying bandwidth instead of a fixed bandwidth. However, computation with a varying bandwidth is much more demanding, especially when edge correction and bandwidth selection are involved. We propose several new computational methods for adaptive kernel estimation from spatial point pattern data. A key idea is that a variable-bandwidth kernel estimator for d-dimensional spatial data can be represented as a slice of a fixed-bandwidth kernel estimator in (d+1)-dimensional "scale space", enabling fast computation using discrete Fourier transforms. Edge correction factors have a similar representation. Different values of global bandwidth correspond to different slices of the scale space, so that bandwidth selection is greatly accelerated. Potential applications include estimation of multivariate probability density and spatial or spatiotemporal point process intensity, relative risk, and regression functions. The new methods perform well in simulations and real applications.
Joint work with Professor Adrian Baddeley, Curtin University, Perth.
Detection and replenishment of missing data in the observation of point processes with independent marks

Jiancang Zhuang

Institute of Statistical Mathematics, Tokyo

Date: Thursday 2 March 2017

Records of geophysical events such as earthquakes and volcanic eruptions, which are usually modeled as marked point processes, often have missing data that result in underestimation of the corresponding hazards. This study presents a fast approach for replenishing missing data in the record of a temporal point process with time-independent marks. The basis of this method is that, if such a point process is completely observed, it can be transformed into a homogeneous Poisson process on the unit square $[0,1]^2$ by a biscale empirical transformation. This method is tested on a synthetic dataset and applied to the record of volcanic eruptions at the Hakone Volcano, Japan, and to several datasets of the aftershock sequences following some large earthquakes. In particular, by comparing the analysis results from the original and the replenished datasets of aftershock sequences, we have found that both the Omori-Utsu formula and the ETAS model are stable, and that the variations in the estimated parameters with different magnitude thresholds in past studies are caused by the short-term missingness of small events.
A new multidimensional stress release statistical model based on coseismic stress transfer

Shiyong Zhou

Peking University

Date: Tuesday 14 February 2017

NOTE: the venue is not our usual one
Following the stress release model (SRM) proposed by Vere-Jones (1978), we developed a new multidimensional SRM, which is a space-time-magnitude version based on multidimensional point processes. First, we interpreted the exponential hazard functional of the SRM as the mathematical expression of static fatigue failure caused by stress corrosion. Then, we reconstructed the SRM in multidimensions through incorporating four independent submodels: the magnitude distribution function, the space weighting function, the loading rate function and the coseismic stress transfer model. Finally, we applied the new model to analyze the historical earthquake catalogues in North China. An expanded catalogue, which contains the information of origin time, epicentre, magnitude, strike, dip angle, rupture length, rupture width and average dislocation, is composed for the new model. The estimated model can simulate the variations of seismicity with space, time and magnitude. Compared with the previous SRMs with the same data, the new model yields much smaller values of Akaike information criterion and corrected Akaike information criterion. We compared the predicted rates of earthquakes at the epicentres just before the related earthquakes with the mean spatial seismic rate. Among all 37 earthquakes in the expanded catalogue, the epicentres of 21 earthquakes are located in the regions of higher rates.
Next generation ABO blood type genetics and genomics

Keolu Fox

University of San Diego

Date: Wednesday 1 February 2017

The ABO gene encodes a glycosyltransferase, which adds sugars (N-acetylgalactosamine for A and α-D-galactose for B) to the H antigen substrate. Single nucleotide variants in the ABO gene affect the function of this glycosyltransferase at the molecular level by altering the specificity and efficiency of this enzyme for these specific sugars. Characterizing variation in ABO is important in transfusion and transplantation medicine because variants in ABO have significant consequences with regard to recipient compatibility. Additionally, variation in the ABO gene has been associated with cardiovascular disease risk (e.g., myocardial infarction) and quantitative blood traits (von Willebrand factor (VWF), Factor VIII (FVIII) and Intercellular Adhesion Molecule 1 (ICAM-1)). Relating ABO genotypes to actual blood antigen phenotype requires the analysis of haplotypes. Here we will explore variation (single nucleotide, insertions and deletions, and structural variation) in blood cell trait gene loci (ABO) using multiple datasets enriched for heart, lung and blood-related diseases (including both African-Americans and European-Americans) from multiple NGS datasets (e.g. the NHLBI Exome Sequencing Project (ESP) dataset). I will also describe the use of a new ABO haplotyping method, ABO-seq, to increase the accuracy of ABO blood type and subtype calling using variation in multiple NGS datasets. Finally, I will describe the use of multiple read-depth based approaches to discover previously unsuspected structural variation (SV) in genes not shown to harbor SV, such as the ABO gene, by focusing on understudied populations, including individuals of Hispanic and African ancestry.

Keolu has a strong background in using genomic technologies to understand human variation and disease. Throughout his career he has made it his priority to focus on the interface of minority health and genomic technologies. Keolu earned a Ph.D. in Debbie Nickerson's lab in the University of Washington's Department of Genome Sciences (August, 2016). In collaboration with experts at Bloodworks Northwest, (Seattle, WA) he focused on the application of next-generation genome sequencing to increase compatibility for blood transfusion therapy and organ transplantation. Currently Keolu is a postdoc in Alan Saltiel's lab at the University of California San Diego (UCSD) School of Medicine, Division of Endocrinology and Metabolism and the Institute for Diabetes and Metabolic Health. His current project focuses on using genome editing technologies to investigate the molecular events involved in chronic inflammatory states resulting in obesity and catecholamine resistance.
To be or not to be (Bayesian) Non-Parametric: A tale about Stochastic Processes

Roy Costilla

Victoria University Wellington

Date: Tuesday 24 January 2017

Thanks to the advances in the last decades in theory and computation, Bayesian Non-Parametric (BNP) models are now use in many fields including Biostatistics, Bioinformatics, Machine Learning, Linguistics and many others.

Despite its name however, BNP models are actually massively parametric. A parametric model uses a function with finite dimensional parameter vector as prior. Bayesian inference then proceeds to approximate the posterior of these parameters given the observed data. In contrast to that, a BNP model is defined on an infinite dimensional probability space thanks to the use of a stochastic process as a prior. In other words, the prior for a BNP model is a space of functions with an infinite dimensional parameter vector. Therefore, instead of avoiding parametric forms, BNP inference uses a large number of them to gain more flexibility.

To illustrate this, we present simulations and also a case study where we use life satisfaction in NZ over 2009-2013. We estimate the models using a finite Dirichlet Process Mixture (DPM) prior. We show that this BNP model is tractable, i.e. it is easily computed using Markov Chain Monte Carlo (MCMC) methods, allowing us to handle data with big sample sizes and to correctly estimate the model parameters. Coupled with a post-hoc clustering of the DPM locations, the BNP model also allows an approximation of the number of mixture components, a very important parameter in mixture modelling.
Computational methods and statistical modelling in the analysis of co-occurrences: where are we now?

Jorge Navarro Alberto

Universidad Autónoma de Yucatán (UADY)

Date: Wednesday 9 November 2016

NOTE day and time of this seminar
The subject of the talk is statistical methods (both theoretical and applied) and computational algorithms for the analysis of binary data, which have been applied in ecology in the study of species composition in systems of patches with the ultimate goal to uncover ecological patterns. As a starting point, I review Gotelli and Ulrich's (2012) six statistical challenges in null model analysis in Ecology. Then, I exemplify the most recent research carried out by me and other statisticians and ecologists to face those challenges, and applications of the algorithms outside the biological sciences. Several topics of research are proposed, seeking to motivate statisticians and computer scientists to venture and, eventually, to specialize in the subject of the analysis of co-occurrences.
Reference: Gotelli, NJ and Ulrich, W, 2012. Statistical challenges in null model analysis. Oikos 121: 171-180
Extensions of the multiset sampler

Scotland Leman

Virginia Tech, USA

Date: Tuesday 8 November 2016

NOTE day and time of this seminar
In this talk I will primarily discuss the Multiset Sampler (MSS): a general ensemble based Markov Chain Monte Carlo (MCMC) method for sampling from complicated stochastic models. After which, I will briefly introduce the audience to my interactive visual analytics based research.

Proposal distributions for complex structures are essential for virtually all MCMC sampling methods. However, such proposal distributions are difficult to construct so that their probability distribution matches that of the true target distribution, in turn hampering the efficiency of the overall MCMC scheme. The MSS entails sampling from an augmented distribution that has more desirable mixing properties than the original target model, while utilizing simple independent proposal distributions that are easily tuned. I will discuss applications of the MSS for sampling from tree based models (e.g. Bayesian CART; phylogenetic models), and for general model selection, model averaging and predictive sampling.

In the final 10 minutes of the presentation I will discuss my research interests in interactive visual analytics and the Visual To Parametric Interaction (V2PI) paradigm. I'll discuss the general concepts in V2PI with an application of Multidimensional Scaling, its technical merits, and the integration of such concepts into core statistics undergraduate and graduate programs.
New methods for estimating spectral clustering change points for multivariate time series

Ivor Cribben

University of Alberta

Date: Wednesday 19 October 2016

NOTE day and time of this seminar
Spectral clustering is a computationally feasible and model-free method widely used in the identification of communities in networks. We introduce a data-driven method, namely Network Change Points Detection (NCPD), which detects change points in the network structure of a multivariate time series, with each component of the time series represented by a node in the network. Spectral clustering allows us to consider high dimensional time series where the number of time series is greater than the number of time points. NCPD allows for estimation of both the time of change in the network structure and the graph between each pair of change points, without prior knowledge of the number or location of the change points. Permutation and bootstrapping methods are used to perform inference on the change points. NCPD is applied to various simulated high dimensional data sets as well as to a resting state functional magnetic resonance imaging (fMRI) data set. The new methodology also allows us to identify common functional states across subjects and groups. Extensions of the method are also discussed. Finally, the method promises to offer a deep insight into the large-scale characterisations and dynamics of the brain.
Inverse prediction for paleoclimate models

John Tipton

Colorado State University

Date: Tuesday 18 October 2016

NOTE day and time of this seminar
Many scientific disciplines have strong traditions of developing models to approximate nature. Traditionally, statistical models have not included scientific models and have instead focused on regression methods that exploit correlation structures in data. The development of Bayesian methods has generated many examples of forward models that bridge the gap between scientific and statistical disciplines. The ability to fit forward models using Bayesian methods has generated interest in paleoclimate reconstructions, but there are many challenges in model construction and estimation that remain.

I will present two statistical reconstructions of climate variables using paleoclimate proxy data. The first example is a joint reconstruction of temperature and precipitation from tree rings using a mechanistic process model. The second reconstruction uses microbial species assemblage data to predict peat bog water table depth. I validate predictive skill using proper scoring rules in simulation experiments, providing justification for the empirical reconstruction. Results show forward models that leverage scientific knowledge can improve paleoclimate reconstruction skill and increase understanding of the latent natural processes.
Ultrahigh dimensional variable selection for interpolation of point referenced spatial data

Benjamin Fitzpatrick

Queensland University of Technology

Date: Monday 17 October 2016

NOTE day and time of this seminar
When making inferences concerning the environment, ground truthed data will frequently be available as point referenced (geostatistical) observations accompanied by a rich ensemble of potentially relevant remotely sensed and in-situ observations.
Modern soil mapping is one such example characterised by the need to interpolate geostatistical observations from soil cores and the availability of data on large numbers of environmental characteristics for consideration as covariates to aid this interpolation.

In this talk I will outline my application of Least Absolute Shrinkage and Selection Operator (LASSO) regularized multiple linear regression (MLR) to build models for predicting full cover maps of soil carbon when the number of potential covariates greatly exceeds the number of observations available (the p > n or ultrahigh dimensional scenario). I will outline how I have applied LASSO regularized MLR models to data from multiple (geographic) sites and discuss investigations into treatments of site membership in models and the geographic transferability of models developed. I will also present novel visualisations of the results of ultrahigh dimensional variable selection and briefly outline some related work in ground cover classification from remotely sensed imagery.

Key references:
Fitzpatrick, B. R., Lamb, D. W., & Mengersen, K. (2016). Ultrahigh Dimensional Variable Selection for Interpolation of Point Referenced Spatial Data: A Digital Soil Mapping Case Study. PLoS ONE, 11(9): e0162489.
Fitzpatrick, B. R., Lamb, D. W., & Mengersen, K. (2016). Assessing Site Effects and Geographic Transferability when Interpolating Point Referenced Spatial Data: A Digital Soil Mapping Case Study.
New Zealand master sample using balanced acceptance sampling

Paul van Dam-Bates

Department of Conservation

Date: Thursday 13 October 2016

Environmental monitoring for management organisations like the Department of Conservation is critical. Without good information about outcomes, poor management actions may persist much longer than they should or initial intervention may occur too late. The Department currently conducts focused research at key natural heritage sites (Tier 3) as well as long-term national monitoring (Tier 1). The link between the two tiers of investigation to assess the impact of management across New Zealand (Tier 2) is yet to be implemented but faces unique challenges in working at many different spatial scales and coordinating with multiple agencies. The solution is to implement a Master Sample using Balanced Acceptance Sampling (BAS). To do this, some practical aspects of the sample design are addressed, such as stratification, unequal probability sampling, rotating panel designs and regional intensification. Incorporating information from Tier 1 monitoring directly is also discussed.

Authors: Paul van Dam-Bates[1], Ollie Gansell[1] and Blair Robertson[2]
1 Department of Conservation, New Zealand
2 University of Canterbury, Department of Mathematics and Statistics
How robust are capture–recapture estimators of animal population density?

Murray Efford

Department of Mathematics and Statistics

Date: Thursday 6 October 2016

Data from passive detectors (traps, automatic cameras etc.) may be used to estimate animal population density, especially if individuals can be distinguished. However, the spatially explicit capture–recapture (SECR) models used for this purpose rest on specific assumptions that may or may not be justified, and uncertainty regarding the robustness of SECR methods has led some to resist their use. I consider the robustness of SECR estimates to deviations from key spatial assumptions – uniform spatial distribution of animals, circularity of home ranges, and the shape of the radial detection function. The findings are generally positive, although variance estimates are sensitive to over-dispersion. The method is also somewhat robust to transience and other misspecifications of the detection model, but it is not foolproof, as I show with a counter example.
Bootstrapped model-averaged confidence intervals

Jimmy Zeng

Department of Preventive and Social Medicine

Date: Thursday 29 September 2016

Model-averaging is commonly used to allow for model uncertainty in parameter estimation. In the frequentist setting, a model-averaged estimate of a parameter is a weighted mean of the estimates from the individual models, with the weights being based on an information criterion, such as AIC. A Wald confidence interval based on this estimate will often perform poorly, as its sampling distribution will generally be distinctly non-normal and estimation of the standard error is problematic. We propose a new method that uses a studentized bootstrap approach. We illustrate its use with a lognormal example, and perform a simulation study to compare its coverage properties with those of existing intervals.
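The weighted mean described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the common Akaike-weight form, in which each weight is proportional to exp(-ΔAIC/2); the abstract only says the weights are based on an information criterion, so the exact weighting, the function names and the numbers below are all hypothetical, and the studentized bootstrap interval itself is not shown.

```python
import math


def akaike_weights(aic_values):
    """Return weights proportional to exp(-0.5 * (AIC - min AIC)), summing to 1."""
    best = min(aic_values)
    raw = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(raw)
    return [r / total for r in raw]


def model_averaged_estimate(estimates, aic_values):
    """Weighted mean of per-model estimates, using Akaike weights."""
    weights = akaike_weights(aic_values)
    return sum(w * e for w, e in zip(weights, estimates))


# Hypothetical AIC values and parameter estimates from three candidate models.
aic = [100.0, 102.0, 110.0]
est = [1.0, 2.0, 3.0]
weights = akaike_weights(aic)
averaged = model_averaged_estimate(est, aic)
```

The model with the lowest AIC receives the largest weight, and the averaged estimate always lies between the smallest and largest single-model estimates.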
N-mixture models vs Poisson regression

Richard Barker

Department of Mathematics and Statistics

Date: Thursday 22 September 2016

N-mixture models describe count data replicated in time and across sites in terms of abundance N and detectability p. They are popular because they allow inference about N while controlling for factors that influence p without the need for marking animals. Using a capture-recapture perspective we show that the loss of information that results from not marking animals is critical, making reliable statistical modeling of N and p problematic using just count data. We are unable to fit a model in which the detection probabilities are distinct among repeat visits as this model is overspecified. This makes uncontrolled variation in p problematic. By counter example we show that even if p is constant after adjusting for covariate effects (the 'constant p' assumption), scientifically plausible alternative models in which N (or its expectation) is non-identifiable or does not even exist lead to data that are practically indistinguishable from data generated under an N-mixture model. This is particularly the case for sparse data as is commonly seen in applications. We conclude that under the constant p assumption reliable inference is only possible for relative abundance in the absence of questionable and/or untestable assumptions or with better quality data than seen in typical applications. Relative abundance models for counts can be readily fitted using Poisson regression in standard software such as R and are sufficiently flexible to allow controlling for p through the use of covariates while simultaneously modeling variation in relative abundance. If users require estimates of absolute abundance they should collect auxiliary data that help with estimation of p.
Single-step genomic evaluation of New Zealand's sheep

Mohammad Ali Nilforooshan

Department of Mathematics and Statistics

Date: Thursday 15 September 2016

Quantitative genetics is the study of inheritance of quantitative traits, which are generally continuously distributed. It uses biometry to study the expression of quantitative differences among individuals and considers genetic relatedness and environment. In the past, determining the genetic structure of individuals was too expensive to be used commercially. However, in the last decade, the price of genotyping has fallen rapidly, and there are now commercial genotype chips available for most livestock species. Currently, dense marker maps are used to predict the genetic merit of animals early in life. There are methods available for genomic evaluation. However, because they do not consider all the available information at the same time, bias or accuracy loss may occur. Single-step GBLUP is a method that uses all the genomic, pedigree and phenotypic data on all animals simultaneously, and is reported to limit bias and, in some cases, increase accuracy of prediction. Preliminary results of this approach on New Zealand sheep will be presented.
Clinical trial Data Monitoring Committees - aiding science

Katrina Sharples

Department of Mathematics and Statistics

Date: Thursday 8 September 2016

The goal of a clinical trial is to obtain reliable evidence regarding the benefits and risks of a treatment while minimising the harm to patients. Recruitment and follow-up may take place over several years, accruing information over time, which allows the option of stopping the trial early if the trial objectives have been met or the risks to patients become too great. It has become standard practice for trials with significant risk to be overseen by an independent Data Monitoring Committee (DMC). These DMCs have sole access to the accruing trial data; they are responsible for safeguarding the rights of the patients in the trial, and for making recommendations to those running the trial regarding trial conduct and possible early termination. However, interpreting the accruing evidence and making optimal recommendations is challenging. As the number of trials having DMCs has grown, there has been increasing discussion of how to train new DMC members. Some DMCs have published papers describing their decision-making processes for specific trials, and workshops are held fairly frequently. However, it is recognised that DMC expertise is best acquired through apprenticeship. Opportunities for this are rare internationally but in New Zealand, in 1996, the Health Research Council established a unique system for monitoring clinical trials which incorporates apprenticeship positions. This talk will describe our system, discuss some of the issues and insights that have arisen along the way, and the effects it has had on the NZ clinical trial environment.
A statistics-related guest seminar in Preventive and Social Medicine: A researcher's guide to understanding modern statistics

Sander Greenland

University of California

Date: Monday 5 September 2016

Note day, time and venue of this special seminar
Sander Greenland is Research Professor and Emeritus Professor of Epidemiology and Statistics at the University of California, Los Angeles. He is a leading contributor to epidemiological statistics, theory, and methods, with a focus on the limitations and misuse of statistical methods in observational studies. He has authored or co-authored over 400 articles and book chapters in epidemiology, statistics, and medical publications, and co-authored the textbook Modern Epidemiology.

Professor Greenland has played an important role in the recent discussion following the American Statistical Association’s statement on the use of p values.[1-3] He will discuss lessons he took away from the process and how they apply to properly interpreting what is ubiquitous but rarely interpreted correctly by researchers: Statistical tests, P-values, power, and confidence intervals.

1. Ronald L. Wasserstein & Nicole A. Lazar (2016): The ASA's statement on p-values: context, process, and purpose, The American Statistician, 70, 129-133, DOI: 10.1080/00031305.2016.1154108
2. Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.C., Poole, C., Goodman, S.N., and Altman, D.G. (2016). Statistical tests, confidence intervals, and power: A guide to misinterpretations. The American Statistician, 70, online supplement 1 at suppl/10.1080/00031305.2016.1154108; reprinted in the European Journal of Epidemiology, 31, 337-350.
3. Greenland, S. (2016). The ASA guidelines and null bias in current teaching and practice. The American Statistician, 70, online supplement 10 at suppl/10.1080/00031305.2016.1154108
Sugars are not all the same to us! Empirical investigation of inter- and intra-individual variabilities in responding to common sugars

Mei Peng

Department of Food Science

Date: Thursday 25 August 2016

Given the collective interests in sugar, from food scientists, geneticists, neurophysiologists, and many others (e.g., health professionals, food journalists, and YouTube experimenters), one would expect the picture of human sweetness perception should be reasonably complete by now. This is unfortunately not the case. Some seemingly fundamental questions have not yet been answered – is one’s sweetness sensitivity generalisable across different sugars? Can people discriminate sugars when they are equally sweet? Do common sugars have similar effects on people’s cognitive processing?

Answers to these questions have a close relevance to illuminating the sensory physiology of sugar metabolism, as well as to practical research of sucrose substitution. In this seminar, I would like to present findings from a few behavioural experiments focused on inter-individual and intra-individual differences in responding to common sugars, using methods from sensory science and cognitive psychology. Overall, our findings challenged some of the conventional beliefs about sweetness perception, and provided some insights into future research about sugar.
New models for symbolic data

Scott Sisson

University of New South Wales

Date: Thursday 18 August 2016

Symbolic data analysis is a fairly recent technique for the analysis of large and complex datasets based on summarising the data into a number of "symbols" prior to analysis. Inference is then based on the analysis of the data at the symbol level (modelling symbols, predicting symbols etc). In principle this idea works, however it would be more advantageous and natural to fit models at the level of the underlying data, rather than the symbol. Here we develop a new class of models for the analysis of symbolic data that fit directly to the data underlying the symbol, allowing for a more intuitive and flexible approach to analysis using this technique.
Estimation of relatedness using low-depth sequencing data

Ken Dodds


Date: Thursday 11 August 2016

Estimates of relatedness are used for traceability, parentage assignment, estimating genetic merit and for elucidating the genetic structure of populations. Relatedness can be estimated from large numbers of markers spread across the genome. A relatively new method of obtaining genotypes is to derive these directly from sequencing data. Often the sequencing protocol is designed to interrogate only a subset of the genome (but spread across the genome). One such method is known as genotyping-by-sequencing (GBS). A genotype consists of the pair of genetic types (alleles) at a particular position. Each sequencing delivers a read from one of the pairs, and so does not guarantee that both alleles are seen, even when there are two or more reads at the position. A method of estimating relatedness which accounts for this feature of GBS data is given. The method depends on the number of reads (the depth) at a particular position and also accommodates zero reads (missing). The theory for the method, simulations and some applications to real data are presented, along with further related research questions.
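The point that reads do not guarantee both alleles are seen can be made concrete with a small calculation. The sketch below assumes that each read samples one of the two alleles of a heterozygous individual independently with probability 1/2, with no sequencing error; this is a simplifying assumption for illustration, not necessarily the model used in the talk, and the function name is hypothetical.

```python
def prob_both_alleles_seen(depth):
    """Probability that a heterozygote shows both alleles among `depth` reads.

    Assumes each read is an independent draw of one of the two alleles
    with probability 1/2 each (illustrative assumption, no read errors).
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if depth == 0:
        return 0.0  # zero reads: genotype is missing
    # All reads show the same allele with probability 2 * (1/2)^depth.
    return 1.0 - 2.0 * 0.5 ** depth


# Probability of observing both alleles at read depths 0 to 5.
by_depth = {k: prob_both_alleles_seen(k) for k in range(6)}
```

Under this assumption a single read can never reveal a heterozygote, and at depth 2 a heterozygote still appears homozygous half of the time, which is why depth has to enter the relatedness estimator.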
The replication "crisis" in psychology, medicine, etc.: what should we do about it?

Jeff Miller

Department of Psychology

Date: Thursday 4 August 2016

Recent large-scale replication studies and meta-analyses suggest that about 50–95% of the positive “findings” reported in top scientific journals are false positives, and that this is true across a range of fields including Psychology, Medicine, Neuroscience, Genetics, and Physical Education. Some causes of this alarmingly high percentage are easily identified, but what is the appropriate cure? In this talk I describe a simple model of the research process that researchers can use to identify the optimal attainable percentage of false positives and to plan their experiments accordingly.
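For orientation, the proportion of false positives among significant results follows from three quantities: the share of tested hypotheses that are true effects, the significance level, and the power. The sketch below is the standard textbook calculation, not the speaker's model, and the function name and input values are hypothetical.

```python
def false_positive_proportion(prior_true, alpha, power):
    """Share of significant results that are false positives.

    prior_true: proportion of tested hypotheses with a real effect
    alpha:      significance level (false positive rate under the null)
    power:      probability of detecting a real effect
    """
    false_pos = alpha * (1.0 - prior_true)  # significant nulls
    true_pos = power * prior_true           # significant real effects
    if false_pos + true_pos == 0.0:
        raise ValueError("no significant results are possible with these inputs")
    return false_pos / (false_pos + true_pos)


# Hypothetical scenario: 10% of tested effects are real, alpha = 0.05, power = 0.5.
fpp = false_positive_proportion(prior_true=0.1, alpha=0.05, power=0.5)
```

With these illustrative inputs just under half of all significant results are false positives; raising the power or testing more plausible hypotheses lowers the proportion.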
Do density-dependent processes structure biodiversity?

Jon Waters

Department of Zoology

Date: Thursday 28 July 2016

New Zealand’s marine ecosystems have experienced rapid and dramatic changes over recent centuries. Notably, DNA comparisons of New Zealand’s archaeological versus modern pinniped and penguin assemblages have revealed sudden spatio-temporal genetic shifts, apparently in response to human-mediated extirpation events. These rapid biological changes in our marine environment apparently underscore the role of ‘founder takes all’ processes in shaping biogeographic distributions. Specifically, established high-density populations are seemingly able to exclude individuals dispersing from distant sources. Conversely, extirpation events, including those driven by human pressure and environmental change, can provide opportunities for range expansion of surviving lineages. The recent self-introductions of penguins and pinniped lineages from trans-oceanic sources highlight the dynamic biological history of coastal New Zealand.
Musings of a statistical consultant

Tim Jowett

Department of Mathematics and Statistics

Date: Thursday 21 July 2016

I will talk about my role as a consultant statistician and briefly discuss some interesting applied statistics projects that I have been working on.
Quantitative genetics in the modern genomics era: new challenges, new opportunities

Phil Wilcox

Department of Mathematics and Statistics

Date: Thursday 14 July 2016

Modern genomics technologies have led to vast amounts of data being generated on an increasingly wide range of species, particularly in non-human and non-model organisms. In this talk I will describe how these technological advances impact the training of students, and how we’ve responded by developing a new course offering in quantitative genetics. I will also describe some of the analytical challenges in my current research projects, and from the wider Virtual Institute of Statistical Genetics research programme, including opportunities for further statistical method development.
Project presentations

Honours and PGDip students

Department of Mathematics and Statistics

Date: Friday 27 May 2016

Michel de Lange : Deep learning
Georgia Anderson : Probabilistic linear discriminant analysis
Nick Gelling : Automatic differentiation in R

15-MINUTE BREAK 2.40-2.55

Alex Blennerhassett : Toeplitz algebra of a directed graph
Zoe Luo : Wavelet models for evolutionary distance
Xueyao Lu : Making sense of the λ-coalescent
Terry Collins-Hawkins : Reactive diffusion in systems with memory
Josh Ritchie : Linearisation of hyperbolic constraint equations

CJ Marland : Extending matchings of graphs: a survey
This one mathematics project presentation takes place at 12 noon on Thursday 26 May, room 241
The US obesity epidemic: evidence from the Economic Security Index

Trent Smith

Department of Economics

Date: Thursday 26 May 2016

A growing body of research supports the "economic insecurity" theory of obesity, which posits that uncertainty with respect to one's material well-being may be an important root cause of the modern obesity epidemic. This literature has been limited in the past by a lack of reliable measures of economic insecurity. In this paper we use the newly developed Economic Security Index to explain changes in US adult obesity rates as measured by the National Health and Nutrition Examination Surveys (NHANES) from 1988-2012, a period capturing much of the recent rapid rise in obesity. We find a robust positive and statistically significant relationship between obesity and economic insecurity that holds for nearly every age, gender, and race/ethnicity group in our data, both in cross-section and over time.
Spline-based approach to infer farm vehicle trajectories

Jerome Cao

Department of Mathematics and Statistics

Date: Thursday 19 May 2016

GPS units mounted on a vehicle record its position, speed and bearing. The time series of positions then represents the trajectory of the vehicle. Noisy measurements and infrequent sampling, however, mean simplistic trajectory reconstruction will have unrealistic features, like sharp turns. Smoothing spline methods can efficiently build smoother, more realistic trajectories. In a conventional smoothing spline, the objective function includes a term for errors in position and also a penalty term, which has a single parameter that controls the smoothness of reconstruction. An adaptive smoothing spline extends the single parameter to a function that varies in different domains and performs local smoothing. In this talk, I will introduce a tractor spline that incorporates both position and velocity information but penalizes excessive accelerations. The penalty term is also dependent on the operational status of the tractor. The objective function now includes a term for errors in velocity that is controlled by a new parameter and an adjusted penalty term for better control of trajectory curvature. We develop cross validation techniques to find three parameters of interest. A short discussion of the relationship between spline methods and Gaussian process regression will be given. A simulation study and real data example are presented to demonstrate the effectiveness of this new method.
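The objective functions described above can be written out as follows. The first line is the conventional smoothing spline criterion; the second is only one plausible reading of the extended objective, since the abstract does not give its exact form. All symbols are assumptions introduced for illustration: y_i and v_i are the observed positions and velocities at times t_i, f is the fitted trajectory, λ is the smoothing parameter (a function of t in the adaptive case), and γ is the weight on velocity errors.

```latex
% Conventional smoothing spline: position errors plus a roughness penalty
\min_{f} \; \sum_{i=1}^{n} \bigl(y_i - f(t_i)\bigr)^2
  + \lambda \int \bigl(f''(t)\bigr)^2 \, dt

% Hypothetical extension with a velocity term and a locally varying penalty
\min_{f} \; \sum_{i=1}^{n} \bigl(y_i - f(t_i)\bigr)^2
  + \gamma \sum_{i=1}^{n} \bigl(v_i - f'(t_i)\bigr)^2
  + \int \lambda(t) \, \bigl(f''(t)\bigr)^2 \, dt
```

Since f'' is the acceleration along the trajectory, the penalty term directly discourages excessive accelerations, and letting λ(t) vary allows the amount of smoothing to depend on the operational status of the vehicle.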
Bayes of Thrones: Bayesian prediction for Game of Thrones

Richard Vale


Date: Tuesday 17 May 2016

I will describe an analysis from 2014 of the appearances of characters in a popular series of novels. Treating the number of appearances of each character as longitudinal data, a mixed model can be used to make probabilistic predictions of their appearances in future novels. I will discuss how the model is fitted and how its predictions should be evaluated.