Statistics
Te Tari Pāngarau me te Tatauranga
Department of Mathematics & Statistics

Archived seminars in Statistics

Seminars 1 to 50

Combined Statistics Talks

Joint Statistics Seminar: Joel Carman, Jessica Allen, Lara Najim

University of Otago

Date: Thursday 6 October 2022

Forecasting the next VEI ≥ 3 eruption at Mt Taranaki
Joel Carman
Forecasting volcanic eruptions can be challenging due to the small amount of data available in eruption records. Data from volcanoes that share similar physical properties and statistical behaviour (statistical analogues) have been used to help estimate model parameters and forecast future eruptions at a target volcano. This project uses a series of hierarchical renewal processes and trend renewal processes to forecast the next VEI ≥ 3 eruption at Mt Taranaki, using the eruption records from Mt Taranaki and different sets of statistical analogue volcanoes. Model averaging is used to combine the posterior distributions of the forecast times from the considered models to allow for model uncertainty.
Spatiotemporal variation in low frequency earthquake recurrence along the San Andreas fault
Jessica Allen
Major earthquakes have devastating impacts on both human and wildlife activity, and are relatively unpredictable with current seismic monitoring technology. Modelling other forms of persistent fault activity, such as low frequency earthquakes (LFEs), provides an opportunity to better understand the unobservable processes underlying large earthquakes. My Masters project uses hidden Markov models to analyse patterns of LFE activity detected at a wide range of positions along the San Andreas Fault. This will allow better understanding of the evolution and migration of activity, provide clues about changes in underlying fault composition, and enable us to link LFE activity patterns to slow slip events and thus large earthquakes.
Weathering the storm: Space weather forecasting using Hidden Markov models
Lara Najim
The study of space weather concerns interactions between the Sun and Earth. Charged particles originating from the Sun entering Earth’s atmosphere (called solar wind) interact with the Earth’s magnetic field. Strong solar wind can lead to perturbations of the magnetic field on the surface of the Earth called geomagnetic storms. Extreme geomagnetic storms can damage energy infrastructure, causing power outages and danger to human life. This project develops statistical models to categorise the activity of geomagnetic storms to understand the temporal occurrence patterns of storms with different magnitudes, with the aim to forecast large geomagnetic storms.
Estimating abundance with capture-recapture: the importance of model, estimator, and prior choice

Matthew Schofield

Statistics Department University of Otago

Date: Thursday 15 September 2022

This talk is motivated by a mark-recapture distance sampling analysis. We found unexpectedly large differences between Bayesian and frequentist estimates of abundance despite a moderately large number of observations (~600). Further exploration revealed similar sensitivity to estimator choice when focusing on frequentist estimation. To understand these differences, we consider abundance estimation from general mark-recapture models with three estimation strategies (maximum likelihood estimation, conditional maximum likelihood estimation, and Bayesian estimation) for both binomial and Poisson capture-recapture models. We find that assuming the data have a binomial or multinomial distribution introduces implicit and unnoticed assumptions that are not addressed when fitting with maximum likelihood estimation. This can have an important effect, particularly if our data arise from multiple populations. We compare our results to those of restricted maximum likelihood in linear mixed effects models.
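As a hedged illustration (not the specific distance-sampling models from the talk), the contrast between full and conditional likelihood can be sketched for the simplest closed-population model M0: T capture occasions, constant capture probability p, n distinct animals caught, y captures in total, and unknown abundance N:

    L_{\mathrm{full}}(N,p) \propto \frac{N!}{(N-n)!}\, p^{y} (1-p)^{TN-y}, \qquad
    L_{\mathrm{cond}}(p) \propto \prod_{i=1}^{n} \frac{p^{y_i}(1-p)^{T-y_i}}{1-(1-p)^{T}}, \qquad
    \hat{N}_{\mathrm{cond}} = \frac{n}{1-(1-\hat{p})^{T}}.

Maximum likelihood works with the first expression, conditional maximum likelihood with the second (recovering N afterwards), and a Bayesian analysis places a prior on N and p; differences among the three can persist even with several hundred observations, which is the sensitivity explored in the talk.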
On computationally efficient methods for testing multivariate distributions with unknown parameters

Sara Algeri

University of Minnesota

Date: Thursday 14 July 2022

Despite the popularity of classical goodness-of-fit tests such as Pearson's chi-squared and Kolmogorov-Smirnov, their applicability often faces serious challenges in practical applications. For instance, in a binned data regime, low counts may affect the validity of the asymptotic results. Excessively large bins, on the other hand, may lead to loss of power. In the unbinned data regime, tests such as Kolmogorov-Smirnov and Cramer-von Mises do not enjoy distribution-freeness if the models under study are multivariate and/or involve unknown parameters. As a result, one needs to simulate the distribution of the test statistic on a case-by-case basis. In this talk, I will discuss a testing strategy that allows us to overcome these shortcomings and equips experimentalists with a novel tool to perform goodness-of-fit testing while substantially reducing the computational costs.
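For context, a minimal sketch (not the strategy proposed in the talk) of the case-by-case simulation referred to above: a parametric bootstrap of the Kolmogorov-Smirnov statistic when the null model's parameters are estimated from the data, illustrated for a univariate normal null on simulated data.

    import numpy as np
    from scipy import stats

    def ks_stat_with_fit(x):
        # Fit the null model (normal, unknown mean/sd), then compute the KS statistic
        mu, sd = np.mean(x), np.std(x, ddof=1)
        return stats.kstest(x, cdf=stats.norm(mu, sd).cdf).statistic

    def parametric_bootstrap_pvalue(x, n_boot=2000, seed=None):
        # Simulate the null distribution of the KS statistic with re-estimated parameters
        rng = np.random.default_rng(seed)
        mu, sd = np.mean(x), np.std(x, ddof=1)
        observed = ks_stat_with_fit(x)
        boot = np.empty(n_boot)
        for b in range(n_boot):
            xb = rng.normal(mu, sd, size=len(x))
            boot[b] = ks_stat_with_fit(xb)
        return np.mean(boot >= observed)

    x = np.random.default_rng(1).normal(0.2, 1.1, size=200)
    print(parametric_bootstrap_pvalue(x))

The reference distribution must be rebuilt whenever the model, sample size or estimated parameters change, which is the computational cost the seminar addresses.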
Pseudoreplication in experiments is, or is not, a sin

Peter Dillingham

University of Otago

Date: Thursday 27 May 2021

Pseudoreplication was introduced in a classic paper by S. A. Hurlbert (Ecological Monographs, 1984), describing a common flaw in ecological studies where treatments were not replicated, or replicates were treated as independent even when they weren’t. Ever since, there has been much discussion around pseudoreplication in field- and laboratory-based studies.
We approach this discussion through the lens of multi-driver experiments, focusing on split-plot designs. Split-plot experiments manipulate and replicate factors at different levels, usually due to logistical constraints such as the number of available fields in an agricultural experiment, or header tanks available for an ocean global change experiment. However, the split-plot nature of experiments is commonly ignored, leading to charges of pseudoreplication.
Rather than echoing others’ criticism of pseudoreplication, we examine when it is, and is not, an issue. Importantly, there are instances where an ‘incorrect’ analysis with pseudoreplication substantially outperforms a ‘correct’ split-plot analysis; in other instances, the incorrect analysis performs abysmally. Here, we describe a model-averaging approach we developed, explain why many laboratory-based experiments may benefit from it, and how this work informs the discussion around pseudoreplication.
This is joint work with Chuen Yen Hong, Christopher Cornwall, David Fletcher, Christina McGraw, and Jiaxu Zeng.
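A hedged sketch of the structural issue described above, on simulated data: a naive analysis treats all subunits as independent replicates, while a split-plot style analysis adds a random effect for the whole-plot unit (here a hypothetical header tank). The variable names and effect sizes are illustrative only, and this is not the model-averaging approach developed by the authors.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    # Hypothetical split-plot layout: temperature applied to whole tanks,
    # pH applied to subunits within each tank
    tanks = np.repeat(np.arange(12), 4)
    temp = np.repeat(rng.choice(["ambient", "elevated"], size=12), 4)
    ph = np.tile(["low", "high"], 24)
    y = (0.5 * (temp == "elevated") + 0.3 * (ph == "low")
         + rng.normal(0, 0.4, size=12)[tanks]      # shared tank effect
         + rng.normal(0, 0.3, size=48))            # subunit error
    df = pd.DataFrame({"y": y, "tank": tanks, "temp": temp, "ph": ph})

    # Naive analysis: ignores the tank level, so the whole-plot factor looks over-replicated
    naive = smf.ols("y ~ temp * ph", data=df).fit()

    # Split-plot style analysis: random intercept for each tank
    split_plot = smf.mixedlm("y ~ temp * ph", data=df, groups=df["tank"]).fit()
    print(naive.summary())
    print(split_plot.summary())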
PIGS IN SPACE! Estimating wild pig abundance in Hong Kong using camera trapping data when individuals are not uniquely identifiable.

Darryl I. MacKenzie

Proteus

Date: Thursday 20 May 2021

Wild pigs are native to Hong Kong, but are a public nuisance in some areas due to pigs scavenging food from rubbish, or rooting in gardens and other public spaces. The Hong Kong government has implemented camera trapping programmes in multiple country parks to monitor the wild pig population.
Camera traps are widely used throughout the globe to study a broad range of species in many different ecosystems. When individuals of the species are uniquely identifiable from the images, conventional, or spatially explicit, mark-recapture methods may be used to estimate abundance, or density, and related parameters. When individuals are not uniquely identifiable, the data has often been used in occupancy-style analyses where images of the target species are regarded as a detection of the species presence at the camera location, which is sometimes an unsatisfactory use of the data. For many years it has been a commonly held viewpoint that abundance estimation for these types of species is difficult without making restrictive assumptions due to concerns about potential double counting of the same individuals, and imperfect detection due to motion sensors not triggering.
However in recent years there have been a few applications where it has been demonstrated that it is possible to obtain unbiased estimates of abundance for such species, with relatively few assumptions, provided that: 1) the field of view for each camera can be reliably determined; and 2) detection probability can be estimated (when necessary).
In this talk I shall summarise the underlying concepts behind the approaches, how they could lead to more flexible modelling of the populations, and important study design requirements that likely differ from how many camera trapping studies are undertaken at present.
Foiling foul freeloaders: modeling to manage the risk of marine invaders

Prof Andrew Robinson

University of Melbourne

Date: Thursday 13 May 2021

Biofouling is the accumulation of organisms on surfaces immersed in water. It is of particular concern to the international shipping industry because it increases fuel costs and presents a biosecurity risk by providing a pathway for non-indigenous marine species to establish in new areas. Marine biosecurity is coming under increased scrutiny as trade increases. I will discuss two projects germane to the management of the biosecurity risk of biofouling, namely (i) anticipation of biofouling risk exposure for the purposes of planning post-border surveillance investment, and (ii) efficient monitoring for biofouling using imagery and machine-learning algorithms.
Statistical Modelling and Machine Learning, and their conflicting philosophical bases

Professor Murray Aitkin

University of Melbourne

Date: Thursday 6 May 2021

The Data Science/Big Data/Machine Learning era is upon us. Some Statistics departments are morphing into Data Science departments. The new era is focussed on flexibility and innovation. In this process the history of these developments has become obscure.
This talk traces these developments back to the arguments between Fisher and Neyman over the roles of models and likelihood in statistical inference. For many flexible model-free analyses there is a model-based analysis in the background. We illustrate with examples of the bootstrap and smoothing.
Revisiting the sins of my past

James Curran

University of Auckland

Date: Thursday 29 April 2021

In this talk, I will discuss how some of the work that I did as a PhD student has eventually (or in some cases continually) been revisited by other researchers. My PhD was centred on problems arising from the statistical interpretation of forensic glass evidence. This is not something that most people are familiar with, so I will start with an actual case to introduce the general methodology, and then highlight some issues that have arisen from my research on that methodology.
A friend of mine once said, ''Anything new in this talk is a typo''. I add to this my corollary, ''But you might not have heard it before''.
Smooth nonparametric regression under shape restrictions

Hongbin Guo and Yong Wang

University of Otago and University of Auckland

Date: Thursday 25 March 2021

Estimation of a function under shape restrictions is of considerable interest in many practical applications. In many fields, researchers have strong presumptions that certain relationships satisfy qualitative restrictions such as monotonicity and convexity (or concavity). Typical examples include the study of utility, cost and profit functions in economics (Gallant, 1984; Terrell, 1996), dose-response curves in medicine, growth curves of animals and plants in ecology, and the estimation of the hazard rate in survival analysis (Chang et al., 2007). Imposing shape restrictions can improve predictive performance and reduce overfitting if the underlying regression function does take the specified form. The classic least squares solutions for shape-restricted estimation are typically neither smooth nor parsimonious, and much recent research has pursued smooth shape-restricted regression (Wang & Ghosh, 2012; Meyer, 2008). We propose a new nonparametric estimator for univariate regression subject to monotonicity, convexity and concavity constraints with simple structure, obtained by replacing the discrete measures in the non-smooth least squares solutions with continuous ones. Our estimator is a linear combination of constructed component functions that satisfy the corresponding shape constraints, with smoothness controlled by a single tuning parameter. A fast gradient-based iterative algorithm is used to find the least squares estimate efficiently (Wang, 2007). Asymptotic properties, including consistency of both the estimator and its derivatives, have been investigated. Numerical studies show that our estimator has better predictive performance than other shape-restricted estimators in most scenarios.
Dependencies within and among Forensic Match Probabilities

Bruce Weir

Department of Biostatistics, University of Washington

Date: Friday 5 February 2021

DNA profiling has become an integral tool in forensic science, with widespread public acceptance of the power of matching between an evidence sample and a person of interest. The rise of direct to consumer genetic profiling has extended this acceptance to findings of matches to distant relatives of the perpetrator of a crime. Along with the greater discriminating power of profiles as forensic scientists have moved from Alec Jeffreys’ "DNA fingerprinting" to next-generation sequencing, has come the need to re-examine the usual assumptions of independence among the components of forensic profiles. It may still be appropriate to regard variants at a single marker as being inherited independently, but it is doubtful that all 40 components in a 20-locus STR profile are independent, let alone neighbouring sites in an NGS profile. The very basis for forensic genealogy is that human populations contain many pairs of distant relatives, whose profile probabilities are not independent. The expansion of forensic typing to include Y-chromosome and mitochondrial markers and protein variants raises even further questions of independence. These issues will be discussed and illustrated with forensic and other genetic data, all within a re-examination of the concept of identity by descent and current work to estimate measures of identity within and between individuals, and within and between populations.
Trans-dimensional Bayesian inference for gravitational lens substructures

David Huijser

Department of Statistics, University of Auckland

Date: Monday 7 December 2020

Parameter estimation can be very challenging for problems with a variable number of parameters, the so-called trans-dimensional problems. For gravitational lenses in particular, inferring the density profile of strong gravitational lenses when the lens galaxy may contain multiple dark or faint substructures poses a big problem. This research attempts to resolve this gravitational lensing problem by applying a combination of Diffusive Nested Sampling, Reversible Jump, and Metropolis-Hastings methods. The model applied consisted of a main galaxy model augmented with an unknown number of satellite galaxies and sources. The model was then applied to three datasets, and the posterior distributions agreed with the literature. In summary, plotting the posterior distribution as a two-dimensional surface plot enables us to identify areas of higher object density. With the application of a simple clustering algorithm I was able to group objects potentially associated with the same galaxies using the posterior distribution.
Stochastic degradation modelling - from products to systems

Dr Xun Xiao

School of Fundamental Sciences, Massey University

Date: Monday 7 December 2020

In this talk, I will start from the basic concept of stochastic degradation modelling in the context of reliability engineering by reviewing some popular degradation models and their generalizations, e.g. the Wiener process, Gamma process and inverse Gaussian process. In particular, I will discuss the properties of a class of models with random initial degradation values and their applications to industrial and agricultural products. Furthermore, I will present a spatiotemporal degradation model for railway track systems. Finally, I will briefly discuss some recent advances in modelling control systems with ageing actuators.
Bayesian sequential inference (filtering) in a functional tensor train representation.

Colin Fox

Physics University of Otago

Date: Tuesday 17 November 2020

Colin Fox (Physics, Otago), joint work with Sergey Dolgov (Mathematics, Bath). Bayesian sequential inference, a.k.a. optimal continuous-discrete filtering, over a nonlinear system requires evolving the forward Kolmogorov equation, that is a Fokker-Planck equation, in alternation with Bayes’ conditional updating. We present a numerical grid-method that represents density functions on a mesh, or grid, in a tensor train (TT) representation, equivalent to the matrix product states in quantum mechanics. By utilizing an efficient implicit PDE solver and Bayes' update step in TT format, we develop a provably optimal filter with cost that scales linearly with problem size and dimension. This ability to overcome the 'curse of dimensionality' is a remarkable feature of the TT representation, and is why the recent introduction of low-rank hierarchical tensor methods, such as TT, is a significant development in scientific computing for multi-dimensional problems. The only other work that gets close to the scaling we demonstrate in high-dimensional settings is due to Stephen S.T. Yau and his Fields-medallist brother Shing-Tung Yau. We give a gentle introduction to filtering, functional tensor train representations and computation, present some examples of filtering in up to 80 dimensions, and explain why we can do better than a Fields medallist.
Improving predictability while maintaining statistical inferences for individual differences in task-based fMRI using Elastic Net and permutation

Narun Pat

Department of Psychology University of Otago

Date: Thursday 1 October 2020

The tantalizing possibility of predicting individual differences in specific cognitive processes using brain activity would transform research and clinical applications. For 20+ years, neuroscientists have attempted to develop task-based functional Magnetic Resonance Imaging (fMRI) to do exactly this. Conventionally, neuroscientists use a mass-univariate approach to analyse task-based fMRI data, by treating every single brain area as independent from each other and drawing information from each area separately. However, the mass-univariate approach has recently been questioned for its predictivity. In this talk, I will show that using a regularized, multivariate ‘Elastic Net’ approach that draws information across brain regions, as opposed to a single region, can markedly enhance out-of-sample prediction. As a proof of concept, I applied Elastic Net to predict individual differences in working memory from fMRI data during a working memory task in a large sample of children (n = 4,350). Moreover, combining Elastic Net with permutation, a technique called eNetXplorer, allows us to compute empirical p-value estimates for individual features. Thus, using Elastic Net along with permutation enables us to statistically infer which brain regions contribute to individual difference variables we are predicting, thus adding to scientific knowledge regarding the brain regions underlying the model as much as to prediction.
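A simplified sketch of the general Elastic-Net-plus-permutation idea on simulated data (this is not the eNetXplorer implementation, and the dimensions and variable names are illustrative only): fit the penalized model on the real outcome, refit on permuted outcomes to build a null distribution, and compute an empirical p-value per feature.

    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    rng = np.random.default_rng(0)
    n, p = 300, 50                       # e.g. subjects x brain regions (simulated)
    X = rng.normal(size=(n, p))
    y = 0.8 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

    # Fit Elastic Net with a cross-validated penalty on the real outcome
    fit = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
    observed = np.abs(fit.coef_)

    # Build a permutation null by refitting on shuffled outcomes
    n_perm = 200
    null = np.zeros((n_perm, p))
    for b in range(n_perm):
        y_perm = rng.permutation(y)
        null[b] = np.abs(ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y_perm).coef_)

    # Empirical p-value per feature: how often a permuted coefficient is as large
    p_values = (1 + np.sum(null >= observed, axis=0)) / (1 + n_perm)
    print(p_values[:5])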
Fishy business: linking foraging behaviour to breeding success in Adélie penguins

Taylor Hamlin

Mathematics and Statistics Department University of Otago

Date: Thursday 24 September 2020

The movement of animals has repeatedly been shown to be a crucial element of a range of different phenomena. This includes individual-level processes like foraging, competition, and breeding, as well as ecosystem services such as pollination, fertilization, and biocontrol. Despite this, our ability to model the link between these external outcomes and the movement patterns of animals has lagged. This seminar will discuss the preliminary findings of work done on the foraging movements and breeding success of Adélie penguins in the Ross Sea, specifically our attempts to mechanistically link foraging behaviour to demographic outcomes such as adult survival and chick fledging success through the use of hidden Markov and state space models. As a bonus, there will be plenty of cute penguin photos, so please waddle along.
Drug delivery for neonates - Modelling and dose calculators

Natalie Medlicott

University of Otago

Date: Thursday 13 August 2020

Drug administration in babies less than one-month old presents a number of challenges - some of which arise from the patient’s small size and their immature and developing drug handling systems. The medications used for neonates have typically been developed for older children or adults, and dosages (mg) are scaled down to suit the younger patients. This may be done using a mg/kg approach, however, it doesn’t really account for the developmental changes and variability in drug handling that occurs in the first few weeks and months of life. For some drugs, pharmacokinetic models are available and/or are being researched to guide dosing during the neonatal period. Their use could improve patient outcomes if information derived from such models could be readily incorporated into dosing decisions. An example where there has been considerable pharmacokinetic research in neonates is the dosing of aminoglycoside antibiotics e.g. gentamicin – and this will be used to illustrate the principles of dosage adjustment from pharmacokinetic models for neonates and to highlight a potential for research to use models to understand dose and variability and improve dosing decisions.
National seismic hazard model for New Zealand: 2022 update

Mark Stirling

Department of Geology

Date: Thursday 30 July 2020

New Zealand's national seismic hazard model (NSHM) is being updated for the first time since 2010. In the interim, several major, well-instrumented and well-studied earthquakes have occurred in the country. These events increased interest and awareness of seismic hazards, and necessitated urgent updates of the NSHM at the regional scale. The current national update is embracing these regional advances, but also incorporating new ideas and state-of-the-art methods in seismic hazard analysis. In my talk I will provide an overview of the history of the NSHM, lessons learned from the major earthquakes, and the present efforts to update the model at the national scale. The work being carried out by our Otago earthquake science group will of course be emphasised.
Making Better Use of Genotyping-by-Sequencing Data

Jie Kang

Mathematics and Statistics, University of Otago

Date: Thursday 12 March 2020

Advances in sequencing technologies enable us to characterise variation in the genome of non-model but agriculturally important species. Approaches such as Genotyping-by-Sequencing (GBS) can produce abundant markers at relatively low cost. This has encouraged implementation of Genomic Selection (GS) to accelerate genetic gains in plant breeding, but has also raised the challenge of how to make better use of the genomic information. Unlike animal breeding, where high-quality reference genomes and well-developed modelling strategies already exist, a versatile analysis pipeline is needed for out-breeding plant species, such as perennial ryegrass (Lolium perenne). In addition, we want approaches that take the highly polymorphic nature of ryegrass into consideration when analysing (low-depth) GBS data. We thus hypothesise that existing GS models can be enhanced by accounting for short haplotypes or 'ShortHaps', that is, multiple variants in small genomic segments such as those captured within a GBS read. In this talk, I will (1) describe the bioinformatics workflows associated with ShortHaps calling, and (2) discuss why ShortHaps should work better than SNPs in terms of breeding value predictions and relatedness estimation.
Count Data Regression Models: Properties, Applications and Extensions

John Hinde

National University of Ireland, Galway

Date: Thursday 10 October 2019

The basis of regression models for count data is the Poisson log-linear model, which can be applied to both raw counts and aggregate rates. In practice, many observed counts exhibit overdispersion, where the count variance is greater than the mean, and this can arise in many different ways. One specific source of overdispersion is the occurrence of excess zero counts, a situation often referred to as zero-inflation. Over the last 30 years or so many models for overdispersed and zero-inflated count data have been developed, although, in practice, distinguishing between these two aspects can be difficult. A less common phenomenon is that of underdispersion, where the count variance is less than the mean. Underdispersion has received little attention in the literature, although there are various simple ways in which it can arise. In this talk we will consider some families of count regression models that can incorporate both under- and over-dispersion; these include extended Poisson–Tweedie models, the COM-Poisson model, and Gamma- and Weibull-count models. I will discuss some of the possible causes of over- and under-dispersion, the nature and basis of the various models, their estimation from a likelihood perspective, and software implementations (typically in R). The use of these models will be illustrated with examples from different application areas. I will also discuss the merits of different models and estimation approaches, implications for inference on covariates of interest, and a simple graphical approach for checking model adequacy and comparing competing models.
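A minimal sketch of how overdispersion relative to the Poisson log-linear baseline can be detected and handled, using simulated counts and one of the remedies named above (a negative binomial GLM); the simulation settings are illustrative only, not an example from the talk.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    x = rng.normal(size=n)
    mu = np.exp(0.5 + 0.7 * x)
    # Negative-binomial counts: variance exceeds the mean, i.e. overdispersion
    y = rng.negative_binomial(n=2, p=2 / (2 + mu))

    X = sm.add_constant(x)
    poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

    # Crude dispersion check: Pearson chi-square / df should be near 1 for Poisson data
    dispersion = poisson_fit.pearson_chi2 / poisson_fit.df_resid
    print("Pearson dispersion:", round(dispersion, 2))

    # One of several remedies mentioned in the abstract: a negative binomial GLM
    nb_fit = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=0.5)).fit()
    print(nb_fit.summary())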
Whakatipu te Mohiotanga o te Ira: Growing Māori capability and content in genetics-related education

Phillip Wilcox

Mathematics and Statistics, University of Otago

Date: Thursday 3 October 2019

This Seminar will focus on recent efforts at the University of Otago to increase (a) Māori content in statistics, genetics and biochemistry courses, and (b) Māori involvement in genetics-based research and applications.
What on Earth is this? Applying deep learning for species recognition

Varvara Vetrova

University of Canterbury

Date: Thursday 19 September 2019

Can we use a smartphone camera, take a picture of an animal or a plant in the wild and identify its species automatically using convolutional neural networks? What about very similar-looking or rare organisms? How many images are enough? This talk will try to reveal some answers to these questions. This talk is based on the MBIE-funded research project "BioSecure-ID".
Instrumental Limit of Detection, Non-linear Sensors, and Statistics

Peter Dillingham

Mathematics and Statistics, University of Otago

Date: Thursday 12 September 2019

For more than 20 years, the International Union of Pure and Applied Chemistry (IUPAC) has recommended a probabilistic approach to defining the limit of detection (LOD) of analytical instruments. In this talk, I will describe the background to the recommendation, its link to Neyman-Pearson hypothesis testing, and practical implementation. In particular, the process of estimating the LOD and reporting its uncertainty is as necessary as correctly defining it. Finally, the calculation, estimation, and scientific importance of the LOD for ion-selective electrodes will be described. The detection threshold is determined by the distribution of the blank signal and an acceptable false positive rate (FP, set to 0.05). The limit of detection (LOD) is the smallest non-blank signal that is detected with power equal to 1 minus the false negative (FN) rate.
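A worked sketch of those last two definitions, assuming normally distributed blank and analyte signals with a common standard deviation and the 0.05 rates quoted above; under these assumptions the familiar closed-form value of about 3.29 times the blank standard deviation is recovered.

    from scipy.stats import norm

    sigma = 1.0          # standard deviation of the blank (and, assumed, analyte) signal
    alpha = 0.05         # acceptable false positive rate
    beta = 0.05          # acceptable false negative rate

    # Decision threshold: exceeded by a true blank with probability alpha
    threshold = norm.ppf(1 - alpha, loc=0, scale=sigma)

    # LOD: smallest non-blank signal detected with power 1 - beta,
    # i.e. whose distribution exceeds the threshold with probability 1 - beta
    lod = threshold + norm.ppf(1 - beta, loc=0, scale=sigma)

    print(round(threshold, 3), round(lod, 3))   # ~1.645 and ~3.290 when sigma = 1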
Single-Fit Bootstrapping

David Fletcher

Mathematics and Statistics, University of Otago

Date: Thursday 5 September 2019

The bootstrap is a useful tool for assessing the uncertainty associated with a frequentist parameter estimate. There are many variations of the basic idea, which involves simulation of new data, fitting the model to these data, and obtaining a new estimate. In this talk I will consider two settings in which we can avoid refitting the model. Use of such a "single-fit" bootstrap clearly has advantages when fitting the model is time-consuming.
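For context, a minimal sketch of the basic refitting parametric bootstrap that the talk takes as its starting point, illustrated for an exponential rate parameter on simulated data; the single-fit variants discussed in the seminar are designed to avoid the refit inside the loop.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=100)   # observed data (simulated here)

    # Fit: the MLE of the exponential rate is 1 / sample mean
    rate_hat = 1.0 / np.mean(x)

    # Standard parametric bootstrap: simulate new data, refit, collect the new estimates
    boot = []
    for b in range(2000):
        x_new = rng.exponential(scale=1.0 / rate_hat, size=len(x))
        boot.append(1.0 / np.mean(x_new))      # the refitting step a single-fit bootstrap avoids

    se = np.std(boot, ddof=1)
    print(rate_hat, se)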
An overview of respondent driven sampling

Lisa Avery

University of Otago Mathematics and Statistics Department

Date: Thursday 22 August 2019

Respondent driven sampling is essentially glorified snowball sampling. However, it is arguably the best method we have of measuring the health of difficult to reach populations, and in particular disease prevalence (HIV is commonly studied using RDS). I will review the sampling method, some of the most popular prevalence estimators and highlight some of the difficulties in drawing inferences from these samples. In particular, I’ll talk about difficulties we’ve encountered applying regression methods to these samples.
NeighborNet, with plans for a sequel

David Bryant

Mathematics and Statistics, University of Otago

Date: Thursday 15 August 2019

NeighborNet is an unsupervised clustering algorithm developed mainly for evolutionary genetics. It is hierarchical; however, unlike standard classification algorithms, it infers clusters which can overlap. The method has proved quite popular across a range of disciplines, with several thousand citations, but hasn't yet been picked up much by statisticians. We are currently working on a sequel which will be more computationally efficient and, hopefully, a bit more elegant. I'll spend most of this seminar introducing the method, then go on to talk about the current research directions.
Autoencoders, archetypal analysis and an application

Matthew Parry

Department of Mathematics and Statistics

Date: Thursday 8 August 2019

The idea of an autoencoder comes from machine learning but it is implicit in a number of statistical techniques. I give a brief review of autoencoders, focusing mainly on their use as a dimension reduction technique and how one can construct probabilistic versions of autoencoders. I then pivot to the archetypal analysis of Cutler and Breiman (1994), which is a form of cluster analysis given in terms of extremal points, i.e. archetypes. Following Bauckhage et al. (2015), I show how archetypal analysis can be viewed as an autoencoder. I finish with an application of archetypal analysis to data imputation. This is based on joint work with Pórtya Piscitelli Cavalcanti.
Computer vision for culture and heritage

Steven Mills

Department of Computer Science

Date: Tuesday 30 July 2019

In this talk I will present some of our recent and ongoing work, with an emphasis on cultural and heritage applications. These include historic document analysis, 3D modelling for archaeology and recording the built environment, and tracking for augmented spectator experiences. I will also outline some of the outstanding issues we have where collaboration with mathematicians and statisticians might be valuable.
The reliability of latent class models for diagnostic testing with no gold standard

Matthew Schofield

Department of Mathematics and Statistics

Date: Thursday 18 July 2019

Latent class models are commonly used for diagnostic testing in situations where there is no gold standard. Our motivating example is a Leptospirosis study in Tanzania, where four possible testing procedures were considered. A two-state latent class model appears to fit the data well but returns estimates that do not conform to prior expectations. The diagnostic test that was believed to be most reliable was estimated as the worst of the four. In this talk we attempt to understand this problem. We show using simulation that the assumption that the latent class corresponds to disease status can be problematic. This can lead to large bias in the estimated sensitivities while having minimal effect on the fit of the model.
Managing sensitive data within distributed software systems

David Eyers

Department of Computer Science

Date: Thursday 23 May 2019

Cloud computing and the Internet of Things are making distributed software systems increasingly commonplace. Within these systems, an increasing volume of sensitive data is being transferred, such as personally identifiable information. This talk examines some of the mechanisms I have explored with collaborators that aim to assist software developers to build systems that can handle sensitive data in a more secure and accountable manner.
David has broad research interests in computer science topics, including distributed systems and information security. One theme of his research has been seeking security techniques that are usable and accessible to end users and software developers.
The life of a consulting biometrician

Assoc. Prof. Darryl MacKenzie

Proteus & Department of Mathematics and Statistics

Date: Thursday 16 May 2019

In this Statchat-style presentation, I will talk about being a consulting biometrician/statistician. I shall cover a range of topics including: what led me to becoming one, the types of projects that I’ve been involved with, statistically interesting applications, skills I’ve learnt and what skills I’ve found most useful, and the highs and lows of the job. I’ll also talk about practical aspects such as frequency of work, charge-out rates, etc. Students that are considering life after study are encouraged to come along to hear more about a non-academic career option.
Better understanding the effect of using poorly imputed genotypes in genetic evaluations

Michael Lee

University of Otago Statistics

Date: Thursday 9 May 2019

The metric for selection of animals in a breeding program is generally based on breeding values, which are random effects predicted via Best Linear Unbiased Prediction (BLUP). Increasingly, genomic information from individual animals is also included to better predict breeding values, termed genomic breeding values (GBVs), using Single-Step Genomic BLUP (ssGBLUP). In the NZ sheep industry, imputation is used to make the prediction of GBVs more cost-effective by allowing a lower density of markers to be used. This seminar will describe the process used to predict GBVs and, in particular, some results associated with the inclusion of inaccurately imputed genotypes.
A tale of two paradigms: A case study of frequentist and Bayesian modelling for genetic linkage maps

Timothy Bilton

Department of Mathematics and Statistics

Date: Thursday 2 May 2019

A genetic linkage map shows the relative position of, and genetic distance between, genetic markers (positions of the genome which exhibit variation), and underpins the study of species' genomes in a number of scientific applications. Genetic maps are constructed by tracking the transmission of genetic information from individuals to their offspring, which is frequently modelled using a hidden Markov model (HMM) since only the expression and not the transmission of genetic information is observed. Typically, HMMs for genetic maps are fitted using maximum likelihood. However, the uncertainty associated with genetic map estimates is rarely presented, and construction of confidence intervals using traditional frequentist methods is difficult, as many of the parameter estimates lie on the boundary of the parameter space. We investigate Bayesian approaches for fitting HMMs of genetic maps to facilitate characterizing uncertainty, and consider including a hierarchical component to improve estimation. Focus is given to constructing genetic maps using high-throughput sequencing data. Using simulated and real data, we compare the frequentist and Bayesian approaches and examine some of their properties. Lastly, the advantages and disadvantages of the two procedures and some issues encountered are discussed.
Biostatistics in nutrition-related research

Dr Jill Haszard

Division of Sciences Biostatistician

Date: Thursday 18 April 2019

Working as a biostatistician in the Department of Human Nutrition has exposed me to a wide variety of study designs and data. In particular, I handle a large amount of dietary data and am familiar with many of the statistical methods that are used to overcome the difficulties inherent when investigating dietary intake and nutritional status. As well as nutrition studies, I am also involved in studies exploring the influence of physical activity, sedentary behaviour, and sleep – all of which co-exist in a constrained space (the 24-hour day). This type of data requires compositional data analysis. However, using compositional data analysis needs careful interpretation of the statistical output. This is also an issue when analysing studies that assess associations with the gut microbiota.
Time-inhomogeneous hidden Markov models for incompletely observed point processes

Amina Shahzadi

Department of Mathematics and Statistics

Date: Thursday 11 April 2019

Natural phenomena such as earthquakes and volcanic eruptions can be modelled using point processes with the primary aim of predicting future hazard based on past data. However, this is complicated and potentially biased by the problem of missing data in the records. The degree of completeness of the records varies dramatically over time. Often the older the record is, the more incomplete it is. We developed different types of time-inhomogeneous hidden Markov models (HMMs) to tackle the problem of time-varying missing data in volcanic eruption records. In these models, the hidden process has states of completeness and incompleteness. The state of completeness represents no missing events between each pair of consecutively observed events. The states of incompleteness represent different mean numbers of missing events between each pair of consecutively observed events. We apply the proposed models to a global volcanic eruption record to analyze the time-dependent incompleteness and demonstrate how we estimate the completeness of the record and the future hazard rate.
What everyone who uses and teaches confidence intervals should know

Richard Barker

PVC, Division of Sciences

Date: Thursday 4 April 2019

The meaning of a confidence interval is one of those things that everyone thinks they know until they are asked to explain what it is. Confidence intervals have some surprising properties that call into question their value as an inferential tool. Using a couple of simple examples I discuss these and related foundational issues.
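A small illustrative simulation of the repeated-sampling property a 95% confidence interval does have, which is easily confused with a probability statement about one particular computed interval; the data model and t-interval here are assumptions for illustration only, not examples from the talk.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_mean, n, reps = 5.0, 20, 10000
    covered = 0
    for _ in range(reps):
        x = rng.normal(true_mean, 2.0, size=n)
        half_width = stats.t.ppf(0.975, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
        covered += (x.mean() - half_width <= true_mean <= x.mean() + half_width)

    # Roughly 95% of intervals cover the fixed true mean;
    # any single computed interval either covers it or it doesn't
    print(covered / reps)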
Non-linear economic value functions in breeding objectives

Cheryl Quinton

AbacusBio Limited, Dunedin

Date: Thursday 28 March 2019

Genetic improvement programs typically include a breeding objective that describes the traits of interest in the program and their importance. A breeding objective function is built that calculates the overall genetic value for each individual in a population based on the aggregate of trait predictions of genetic merit and trait economic values (EVs). Most breeding objective functions are built as linear functions. Linear EVs have several advantages for ease of implementation and common quantitative genetics calculations, but they may be over-simplifications for diverse populations that span a wide range of economic and biological conditions. We have been helping an increasing number of breeding programs by applying non-linear EV functions. In these cases, more complex functions such as quadratic, exponential, and combinations are built to calculate the contribution of an individual's trait value to the overall aggregate merit combining multiple traits. Although breeding objectives with non-linear EV functions are more complex to implement, they can provide more specific and more robust valuation of traits and therefore of each individual's overall genetic value. In this presentation, we describe some non-linear EV functions for prolificacy, wool quality, dystocia, and maternal ability in sheep and cattle breeding objectives.
CEBRA: mathematical and statistical solutions to biosecurity risk challenges

Andrew Robinson

University of Melbourne

Date: Thursday 21 March 2019

CEBRA is the Centre of Excellence for Biosecurity Risk Analysis, jointly funded by the Australian and New Zealand governments. Our problem-based research focuses on developing and implementing quantitative tools to assist in the management of biosecurity risk at national and international levels. I will describe a few showcase mathematical and statistical projects, underline some of our soaring successes, underplay our dismal failures, and underscore the lessons that we've learned.
Bayesian inference and model selection for stochastic epidemics

Simon Spencer

University of Warwick

Date: Thursday 14 March 2019

Model fitting for epidemics is challenging because not all of the information needed to write down the likelihood function is observable, for example the times of infection and recovery are not usually observed. Furthermore, the data that are available from diagnostic tests may not be perfectly accurate. These considerations are typically overcome by applying computationally intensive data augmentation techniques such as Markov chain Monte Carlo. To make things even more difficult, most of the interesting epidemiological questions are best expressed as model selection problems and so fitting just one model is not sufficient to answer them. Instead we must fit a range of different models, each representing an important epidemiological hypothesis, and then make meaningful comparisons between them. I will describe how to overcome (most of) these difficulties to learn about the epidemiology of Escherichia coli O157:H7 in cattle. This is joint work with Panayiota Touloupou, Bärbel Finkenstädt Rand, Pete Neal and TJ McKinley.
Fast evaluation of study designs for spatially explicit capture-recapture

Murray Efford

Department of Mathematics and Statistics

Date: Thursday 7 March 2019

The density of some animal populations is routinely estimated by the method of spatially explicit capture–recapture using data from automatic cameras, traps or DNA hair snags. However, data collection is expensive and most studies do not meet minimum standards for precision. Improved study design is the key to improved power. Simulation is often recommended for evaluating study designs, but it can be painfully slow. Another approach for evaluating novel designs is to compute intermediate variables such as the expected number of detected individuals E(n) and the expected number of recapture events E(r). Computation of E(n) and E(r) is deterministic and much faster than simulation. Intriguingly, the relative standard error of estimated density is closely approximated by the reciprocal of the square root of whichever of these is smaller, and for maximum precision E(n) is approximately equal to E(r). I show how these findings can be applied in interactive software for designing ecological studies.
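In symbols, the approximation described above can be written as

    \mathrm{RSE}(\hat{D}) \approx \frac{1}{\sqrt{\min\{E(n),\, E(r)\}}},

with near-optimal designs having E(n) approximately equal to E(r).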
Accounting for read depth in analyses of genotypes from low-depth sequencing

Ken Dodds

AgResearch

Date: Thursday 28 February 2019

Sequencing technology provides information on genomes including an individual’s genotype at a site with variation, such as single nucleotide polymorphisms (SNPs). To reduce costs, the sequencing protocol may be designed to interrogate only a subset of the genome (but spread across the genome). One such method is known as genotyping-by-sequencing (GBS). A genotype consists of the pair of genetic types (alleles) at a particular position. Each sequencing result delivers a read from one of the pairs, and so does not guarantee that both alleles are seen, even when there are two or more reads at the position. Methods for accounting for this issue will be described for several different analyses including the estimation of relatedness, parentage assignment, testing for Hardy Weinberg equilibrium and the description of the genetic diversity of a population. These methods have applications in plant and animal breeding, ecology and conservation.
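A worked illustration of the issue, under the simple assumption that each read from a heterozygous individual samples one of the two alleles independently with probability 1/2:

    P(\text{only one allele seen} \mid \text{heterozygote, read depth } k) \;=\; 2 \times (1/2)^{k} \;=\; 2^{\,1-k},

so a single read can never confirm heterozygosity, two reads miss an allele half the time, and even five reads miss one about 6% of the time. The methods described above account for this uncertainty rather than treating the observed genotype as known.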
Modeling high-dimensional intermittent hypoxia

Abdis Sattar

Department of Population and Quantitative Health Sciences, Case Western Reserve University, USA

Date: Tuesday 19 February 2019

Many remarkable advances have been made in nonparametric and semiparametric methods for high-dimensional longitudinal data. However, these important methods lack a way of addressing missing data. Motivated by an oxygenation study of retinopathy of prematurity (ROP), we developed a penalized spline mixed effects model for a high-dimensional nonlinear longitudinal continuous response variable using the Bayesian approach. The ROP study is complicated by the fact that there are non-ignorable missing response values. To address the non-ignorable missing data in the Bayesian penalized spline model, we applied a selection model. Properties of the estimators are studied using Markov chain Monte Carlo (MCMC) simulation. In the simulation study, data were generated with three different percentages of non-ignorable missing values and three different sample sizes, and parameters were estimated under various scenarios. In terms of bias and percent bias, the proposed new approach performed better in all scenarios of non-ignorable missing longitudinal data than the semiparametric mixed effects model that handles the missing values under a missing at random (MAR) assumption. We performed sensitivity analysis for the hyper-prior distribution choices for the variance parameters of the spline coefficients in the proposed joint model. The results indicated that a half-t distribution with three different degrees of freedom did not influence the posterior distribution, whereas an inverse-gamma hyper-prior density did influence the posterior distribution. We applied our novel method to the sample entropy data in the ROP study, handling nonlinearity and the non-ignorable missing response variable. We also analyzed the sample entropy data under missing at random.
Using functional data analysis to exploit high-resolution “Omics” data

Marzia Cremona

Penn State University

Date: Wednesday 30 January 2019

Recent progress in sequencing technology has revolutionized the study of genomic and epigenomic processes, by allowing fast, accurate and cheap whole-genome DNA sequencing, as well as other high-throughput measurements. Functional data analysis (FDA) can be broadly and effectively employed to exploit the massive, high-dimensional and complex “Omics” data generated by these technologies. This approach involves considering “Omics” data at high resolution, representing them as “curves” of measurements over the DNA sequence. I will demonstrate the effectiveness of FDA in this setting with two applications. In the first one, I will present a novel method, called probabilistic K-mean with local alignment, to locally cluster misaligned curves and to address the problem of discovering functional motifs, i.e. typical “shapes” that may recur several times along and across a set of curves, capturing important local characteristics of these curves. I will demonstrate the performance of the method on simulated data, and I will apply it to discover functional motifs in “Omics” signals related to mutagenesis and genome dynamics. In the second one, I will show how a recently developed functional hypothesis test, IWTomics, and multiple functional logistic regression can be employed to characterize the genomic landscape surrounding transposable elements, and to detect local changes in the speed of DNA polymerization due to the presence of non-canonical 3D structures.
Bayesian Latent Class Analysis for Diagnostic Test Evaluation

Geoff Jones

Massey University

Date: Thursday 25 October 2018

Evaluating the performance of diagnostic tests for infection or disease is of crucial importance, both for the treatment of individuals and for the monitoring of populations. In many situations there is no “gold standard” test that can be relied upon to give 100% accuracy, and the use of a particular test will typically lead to false positive or false negative outcomes. The performance characteristics of an imperfect test are summarized by its sensitivity, i.e. the probability of a correct diagnosis for a diseased individual, and its specificity, i.e. the probability of a correct diagnosis when disease-free. When these parameters are known, valid statistical inference can be made for the disease status of tested individuals and the prevalence of disease in a monitored population. In the absence of a “gold standard”, true disease status is unobservable, so the sensitivity and specificity cannot be reliably determined in the absence of additional information. In some circumstances, information from a number of imperfect tests allows estimation of the prevalence, sensitivities and specificities even in the absence of gold standard data. Latent class analysis in a Bayesian framework gives a flexible and comprehensive way of doing this which has become common in the epidemiology literature. This talk will give an introduction and review of the basic ideas, and highlight some of the current research in this area.
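As a sketch of the basic setup reviewed in the talk: with prevalence \pi and, for test j, sensitivity Se_j and specificity Sp_j, and assuming the J tests are conditionally independent given true disease status, the probability of an observed pattern of binary results y = (y_1, \dots, y_J) in a two-class latent class model is

    P(y) \;=\; \pi \prod_{j=1}^{J} Se_j^{\,y_j}(1-Se_j)^{1-y_j} \;+\; (1-\pi) \prod_{j=1}^{J} (1-Sp_j)^{\,y_j}\, Sp_j^{\,1-y_j},

and a Bayesian analysis places priors on \pi and on each Se_j and Sp_j.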
Applied Data Science: A Small Business Perspective

Benoit Auvray

Iris Data Science & Department of Mathematics and Statistics

Date: Thursday 27 September 2018

Iris Data Science is a small Dunedin company established in 2013 providing data science solutions (predictive analytics) for clients in a range of areas, particularly in the agricultural and health sectors. In this talk, we will briefly describe deep learning, a machine learning tool we use extensively at Iris Data Science, and give a few examples of our work for some of our clients. We will also discuss the term “data scientist” and share our experiences running a small business using data science, statistics and machine learning as part of our core service offering. Finally, we will outline some of the practical aspects of developing a predictive tool for commercial use, from data collection and storage to timely and convenient delivery of the predictive model outputs to a client.
Bayesian Hierarchical Modelling

Matt Schofield

Department of Mathematics and Statistics

Date: Thursday 20 September 2018

Bayesian hierarchical modelling is an increasingly popular approach for data analysis. This talk is intended to introduce Bayesian hierarchical models with the aid of examples from genetics, anatomy and ecology. We will discuss various advantages to using such models, including improved estimation and a better description of the underlying scientific process. If time permits, we will also consider situations where hierarchical models may lead to misleading conclusions and a healthy dose of skepticism is required.
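As a generic illustration (not one of the genetics, anatomy or ecology examples from the talk), a normal-normal hierarchical model for group means can be written as

    y_{ij} \mid \theta_i \sim N(\theta_i, \sigma^2), \qquad
    \theta_i \mid \mu, \tau \sim N(\mu, \tau^2), \qquad
    \mu, \tau, \sigma \sim \text{priors},

where each group estimate is shrunk toward the overall mean by an amount governed by \tau; this partial pooling is the kind of improved estimation referred to above.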
A missing value approach for breeding value estimation

Alastair Lamont

Department of Mathematics and Statistics

Date: Thursday 13 September 2018

For a particular trait, an individual’s breeding value is the genetic value it has for its progeny. Accurate breeding value estimation is a critical component of selective breeding, necessary to identify which animals will have the best offspring. As technology has improved, genetic data is often available, and can be utilised for improved breeding value estimation. While it is cost efficient to genotype some animals, it is unfeasible to genotype every individual in most populations of interest, due to either cost or logistical issues. This missing data creates challenges in the estimation of breeding values. Most modern approaches tend to impute or average over the missing data in some fashion, rather than fully incorporating it into the model. I will discuss how statistical models that account for inheritance can be specified and fitted, in work done jointly with Matthew Schofield and Richard Barker. Including inheritance allows missing genotype data to be natively included within the model, while directly working with scientific theory.
A 2D hidden Markov model with extra zeros for spatiotemporal recurrence patterns of tremors

Ting Wang

Department of Mathematics and Statistics

Date: Thursday 6 September 2018

Tectonic tremor activity was observed to accompany slow slip events in some regions. Slow slip events share a similar occurrence style to that of megathrust earthquakes and have been reported to have occurred within the source region of some large megathrust earthquakes. Finding the relationship among the three types of seismic activities may therefore aid forecasts of large destructive earthquakes. Before examining their relationship, it is essential to understand quantitatively the spatiotemporal migration patterns of tremors.

We developed a 2D hidden Markov model to automatically analyse and forecast the spatiotemporal behaviour of tremor activity in the regions Kii and Shikoku, southwest Japan. This new automated procedure classifies the tremor source regions into distinct segments in 2D space and infers a clear hierarchical structure of tremor activity, where each region consists of several subsystems and each subsystem contains several segments. The segments can be quantitatively categorized into three different types according to their occurrence patterns: episodic, weak concentration, and background. Moreover, a significant increase in the proportion of tremor occurrence was detected in a segment in southwest Shikoku before the 2003 and 2010 long-term slow slip events in the Bungo channel. This highlights the possible correlation between tectonic tremor and slow slip events.
Developing forage cultivars for the grazing industries of New Zealand

Zulfi Jahufer

AgResearch and Massey University

Date: Thursday 23 August 2018

Grass and legume based swards play a key role in forage dry matter production for the grazing industries of New Zealand. The genetic merit of this feed base is a primary driver in the profitability, production and environmental footprint of our pastoral systems. A significant challenge to sustainability of this dynamic ecosystem will be climate change. Elevation of ambient temperature and increases in the occurrence of moisture stress events will be a major constraint to forage plant vegetative persistence and seasonal dry matter production. Successful animal breeding has resulted in developing breeds that have higher feed requirements, resulting in increased grazing pressure on swards. The forage science group at AgResearch is actively focused on developing high merit forage grass, legume and herb cultivars. The aim is to optimise plant breeding systems and maximise rates of genetic gain applying conventional plant breeding methods, high throughput phenotyping and new molecular research tools.

Dr Zulfi Jahufer is a senior research scientist in quantitative genetics and forage plant breeding. He also conducts the Massey University course in plant breeding. His seminar will focus on the development of novel forage grass and legume cultivars; he will also introduce the new plant breeding software tool DeltaGen.
Lattice polytope samplers for statistical inverse problems

Martin Hazelton

Massey University

Date: Thursday 16 August 2018

Statistical inverse problems occur when we wish to learn about some random process that is observed only indirectly. Inference in such situations typically involves sampling possible values for the latent variables of interest conditional on the indirect observations. This talk is concerned with linear inverse problems for count data, for which the latent variables are constrained to lie on the integer lattice within a convex polytope (a bounded multidimensional polyhedron). An illustrative example arises in transport engineering where we observe vehicle counts entering or leaving each zone of the network, then want to sample possible inter-zonal patterns of traffic flow consistent with those entry/exit counts. Other areas of application include inference for contingency tables, and capture-recapture modelling in ecology.

In principle such sampling can be conducted using Markov chain Monte Carlo methods, through a random walk on the lattice polytope. However, it is challenging to design algorithms for doing so that are both computationally efficient and have guaranteed theoretical properties. In this talk I will describe some current work that seeks to combine methods from algebraic statistics with geometric insights in order to develop and study new polytope samplers that address these issues.
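In symbols, a sketch of the setting described above: with A a known matrix (for the traffic example, the structure linking inter-zonal flows to the observed entry/exit counts) and y the observed counts, the sampler must explore the set of lattice points

    \{\, x \in \mathbb{Z}_{\ge 0}^{n} : A x = y \,\},

that is, the integer points of the convex polytope \{x \ge 0 : Ax = y\}; a random-walk sampler proposes moves z satisfying Az = 0, so that proposals x + z remain consistent with the observations.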
A faster algorithm for updating the likelihood of a phylogeny

David Bryant

Department of Mathematics and Statistics

Date: Thursday 9 August 2018

Note day and time: a joint Mathematics and Statistics seminar taking place in the usual slot for Statistics seminars. Both Bayesian and Maximum Likelihood approaches to phylogenetic inference depend critically on a dynamic programming algorithm developed by Joe Felsenstein over 35 years ago. The algorithm computes the probability of sequence data conditional on a given tree. It is executed for every site, every set of parameters, every tree, and is the bottleneck of phylogenetic inference. This computation comes at a cost: Herve Philippe estimated that his research-associated computing (most of which would have been running Felsenstein's algorithm) resulted in an emission of over 29 tons of CO2 in just one year. In the talk I will introduce the problem and describe an updating algorithm for likelihood calculation which runs in worst case O(log n) time instead of O(n) time, where n is the number of leaves/species. This is joint work with Celine Scornavacca.
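For context, the core recursion of Felsenstein's algorithm in its standard form (the talk's contribution is an updating scheme on top of this, not shown here): writing L_v(s) for the probability of the observed data below node v given state s at v, an internal node v with children u and w and branch lengths t_u, t_w satisfies

    L_v(s) \;=\; \Bigl[\sum_{s'} P_{ss'}(t_u)\, L_u(s')\Bigr] \Bigl[\sum_{s''} P_{ss''}(t_w)\, L_w(s'')\Bigr],

where P(t) is the substitution model's transition matrix along a branch of length t, and the site likelihood is \sum_s \pi_s L_{\mathrm{root}}(s). Recomputing this from scratch after a local change to the tree costs O(n), which is the per-update cost the new algorithm reduces to O(log n).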