Statistics
Te Tari Pāngarau me te Tatauranga
Department of Mathematics & Statistics

## Archived seminars in Statistics

## Why does the stochastic gradient method work?

### Matthew Parry

Department of Mathematics and Statistics

Date: Tuesday 24 October 2017

The stochastic gradient (SG) method seems particularly suited to the numerical optimization problems that arise in large-scale machine learning applications. In a recent paper, Bottou et al. give a comprehensive theory of the SG algorithm and make some suggestions as to how it can be further improved. In this talk, I will briefly give the background to the optimization problems of interest and contrast the batch and stochastic approaches to optimization. I will then give the mathematical basis for the success of the SG method. If time allows, I will discuss how the SG method can also be applied to sampling algorithms.
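The contrast between batch and stochastic updates can be sketched on a least-squares toy problem; this is a minimal illustration, not the analysis of Bottou et al., and the data, step size and iteration count below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

def batch_grad(w):
    # full-data gradient of the mean squared error: one pass over all n rows
    return 2 * X.T @ (X @ w - y) / n

def stoch_grad(w, i):
    # gradient from a single randomly chosen observation: O(d) per step
    return 2 * X[i] * (X[i] @ w - y[i])

w = np.zeros(d)
for t in range(5000):
    i = rng.integers(n)
    w -= 0.01 * stoch_grad(w, i)   # one cheap, noisy update per sample

print(np.round(w, 1))
```

Each stochastic step costs O(d) rather than the O(nd) of a batch step, which is the basic reason the method scales to large n.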

## The changing face of undergraduate mathematics education: a US perspective

### Rachel Weir

Allegheny College, Pennsylvania

Date: Monday 16 October 2017

**Note day and time of this seminar**

A common theme in the United States in recent years has been a call to increase the number of graduates in STEM (science, technology, engineering, and mathematics) fields and to enhance the scientific literacy of students in other disciplines. For example, in the 2012 report Engage to Excel, the Obama administration announced a goal of "producing, over the next decade, 1 million more college graduates in STEM fields than expected under current assumptions." Achieving these types of goals will require us to harness the potential of all students, forcing us to identify and acknowledge the barriers encountered by students from traditionally underrepresented groups. Over the past few years, I have been working to understand these barriers to success, particularly in mathematics. In this talk, I will share what I have learned so far and how it has influenced my teaching.

## What is n?

### David Fletcher

Department of Mathematics and Statistics

Date: Thursday 12 October 2017

In some settings, the definition of "sample size" will depend on the purpose of the analysis. I will consider several examples that illustrate this issue, and point out some of the problems that can arise if we are not clear about what we mean by "n".

## Project presentations

### Honours and PGDip students

Department of Mathematics and Statistics

Date: Friday 6 October 2017

STATISTICS
Jodie Buckby: *Model checking for hidden Markov models*
Jie Kang: *Model averaging for renewal process*
Yu Yang: *Robustness of temperature reconstruction for the past 500 years*

MATHEMATICS
Sam Bremer: *An effective model for particle distribution in waterways*
Joshua Mills: *Hyperbolic equations and finite difference schemes*

## Gems of Ramanujan and their lasting impact on mathematics

### Ken Ono

Emory University; 2017 NZMS/AMS Maclaurin Lecturer

Date: Thursday 5 October 2017

**Note venue of this public lecture**

Ramanujan’s work has had a truly transformative effect on modern mathematics, and continues to do so as we come to understand further lines from his letters and notebooks. This lecture will present some of Ramanujan’s findings that are most accessible to the general public, and discuss how they fundamentally changed modern mathematics and influenced the lecturer’s own work. The speaker is an Associate Producer of the film *The Man Who Knew Infinity* (starring Dev Patel and Jeremy Irons) about Ramanujan, and will share several clips from the film during the lecture.

Biography: Ken Ono is the Asa Griggs Candler Professor of Mathematics at Emory University. He is considered to be an expert in the theory of integer partitions and modular forms. He has been invited to speak to audiences all over North America, Asia and Europe. His contributions include several monographs and over 150 research and popular articles in number theory, combinatorics and algebra. He received his Ph.D. from UCLA and has received many awards for his research in number theory, including a Guggenheim Fellowship, a Packard Fellowship and a Sloan Fellowship. He was awarded a Presidential Early Career Award for Science and Engineering (PECASE) by Bill Clinton in 2000 and he was named the National Science Foundation’s Distinguished Teaching Scholar in 2005. In addition to being a thesis advisor and postdoctoral mentor, he has also mentored dozens of undergraduates and high school students. He serves as Editor-in-Chief for several journals and is an editor of The Ramanujan Journal. He is also a member of the US National Committee for Mathematics at the National Academy of Science.

## A statistics-related seminar in Physics: Where do your food and clothes come from? Oritain finds the answer in chemistry and statistics

### Katie Jones and Olya Shatova

Oritain Dunedin

Date: Monday 2 October 2017

**A statistics-related seminar in the Physics Department**
**Note day, time and venue**

Oritain Global Ltd is a scientific traceability company that verifies the origin of food, fibre, and pharmaceutical products by combining trace element and isotope chemistry with statistics. Born in the research labs of the Chemistry Department at the University of Otago, Oritain has grown to become a multinational company with offices in Dunedin, London, and Sydney, and with clients from around the globe. Dr Katie Jones and Dr Olya Shatova are Otago alumni working as scientists at Oritain Dunedin. They will provide an overview of the science behind Oritain and discuss their transition from academic research to commercialised science.

## Quantitative genetics in forest tree breeding

### Mike and Sue Carson

Carson Associates Ltd

Date: Thursday 28 September 2017

Forest tree breeding, utilising quantitative genetic (QG) methods, is employed across a broad range of plant species for improvement of a wide diversity of products, or ‘breeding objectives’. Examples of breeding objectives range from the traditional sawn timber and pulpwood products desired largely from pines and eucalypts, to antibiotic factors in honey obtained from NZ manuka, and to plant oil products from oil palms. The standard population breeding approach recognises a hierarchy of populations (the ‘breeding triangle’) with a broad and diverse gene resource population at the base, and a highly-improved but less diverse deployment population at the peak. With the constraint that the deployment population must contain a ‘safe’ amount of genetic diversity, the main goal for any tree improvement program is to use selection and recombination to maximise deployment population gains in the target traits. The key QG tools used in tree improvement programs for trial data analysis, estimation of breeding values, index ranking and selection, and mating and control of pedigree are shared with most other plant and livestock breeding programs. However, the perennial nature of most tree crops requires tree breeders to place a greater emphasis on the use of well-designed, long-term field trials, in combination with efficient and secure databases like Gemview. Recent advances using factor analytic models are providing useful tools for examining and interpreting genotype and site effects and their interaction on breeding values. Genomic selection is expected to enhance, rather than replace, conventional field screening methods for at least the medium term.

## Genomic data analysis: bioinformatics, statistics or data science?

### Mik Black

Department of Biochemistry

Date: Thursday 21 September 2017

Analysis of large-scale genomic data has become a core component of modern genetics, with public data repositories providing enormous opportunities for both exploratory and confirmatory studies. To take advantage of these opportunities, however, potential data analysts need to possess a range of skills, including those drawn from the disciplines of bioinformatics, data science and statistics, as well as domain-specific knowledge about their biological area of interest. While traditional biology-based teaching programmes provide an excellent foundation in the latter skill set, relatively little time is spent equipping students with the skills required for genomic data analysis, despite high demand for graduates with this knowledge. In this talk I will work through a fairly typical analysis of publicly accessible genomic data, highlighting the various bioinformatics, statistical and data science concepts and techniques being utilized. I will also discuss current efforts being undertaken at the University of Otago to provide training in these areas, both inside and outside the classroom.

## Thinking statistically when constructing genetic maps

### Timothy Bilton

Department of Mathematics and Statistics

Date: Thursday 14 September 2017

A genetic linkage map shows the relative position of, and genetic distance between, genetic markers (positions of the genome that exhibit variation), and underpins the study of species' genomes in a number of scientific applications. Genetic maps are constructed by tracking the transmission of genetic information from individuals to their offspring, which is frequently modelled using a hidden Markov model (HMM), since only the expression and not the transmission of genetic information is observed. However, data generated using the latest sequencing technology often contain only partially observed information, which, if unaccounted for, typically results in inflated estimates. Most approaches to circumvent this issue involve a combination of filtering and correcting individual data points using ad hoc methods. Instead, we develop a new methodology that models the partially observed information by incorporating an additional layer of latent variables into the HMM. Results show that our methodology is able to produce accurate genetic map estimates, even in situations where a large proportion of the data is only partially observed.
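The HMM machinery referred to above can be sketched with the standard scaled forward algorithm; the two hidden states, transition and emission probabilities below are invented for illustration, not values from the talk:

```python
import numpy as np

# Hypothetical 2-state HMM: states might represent which parental chromosome
# is transmitted; emissions the observed (possibly erroneous) marker symbol.
init = np.array([0.5, 0.5])          # initial state distribution
trans = np.array([[0.9, 0.1],        # transition (recombination) probabilities
                  [0.1, 0.9]])
emit = np.array([[0.8, 0.2],         # P(observed symbol | hidden state)
                 [0.3, 0.7]])

def forward_loglik(obs):
    """Log-likelihood of an observation sequence via the scaled forward pass."""
    alpha = init * emit[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()             # rescale to avoid numerical underflow
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

print(forward_loglik([0, 0, 1, 1, 0]))
```

Adding a further latent layer for partial observation, as the abstract proposes, amounts to replacing the emission matrix with one that itself marginalises over unobserved genotype states.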

## Network tomography for integer valued traffic

### Martin Hazelton

Massey University

Date: Thursday 7 September 2017

Volume network tomography is concerned with inference about traffic flow characteristics based on traffic measurements at fixed locations on the network. The quintessential example is estimation of the traffic volume between any pair of origin and destination nodes using traffic counts obtained from a subset of the links of the network. The data provide only indirect information about the target variables, generating a challenging type of statistical linear inverse problem.

In this talk I will discuss network tomography for a rather general class of traffic models. I will describe some recent progress on model identifiability. I will then discuss the development of effective MCMC samplers for simulation-based inference, based on insight provided by an examination of the geometry of the space of feasible route flows.
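The feasible set such samplers must traverse can be made concrete with a toy example; the two-link, three-route network and its routing matrix below are invented for illustration:

```python
from itertools import product

# Toy network: 3 origin-destination routes share 2 monitored links.
# Hypothetical routing matrix A: A[l][r] = 1 if route r uses link l.
A = [[1, 1, 0],
     [0, 1, 1]]
y = [5, 4]        # observed link counts

# The feasible set {x >= 0 integer : A x = y} of route flows consistent
# with the link counts; an MCMC sampler would move within this lattice.
feasible = [x for x in product(range(max(y) + 1), repeat=3)
            if all(sum(A[l][r] * x[r] for r in range(3)) == y[l]
                   for l in range(2))]
print(feasible)
```

Here the set is small enough to enumerate outright; in realistic networks its size explodes, which is why the geometry-informed MCMC samplers discussed in the talk are needed.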

## TensorFlow: a short intro

### Lech Szymanski

Department of Computer Science

Date: Thursday 31 August 2017

TensorFlow is an open source software library for numerical computation. Its underlying paradigm of computation uses data flow graphs, which allow for automatic differentiation and effortless deployment that parallelises across CPUs or GPUs. I have been working in TensorFlow for about a year now, using it to build and train deep learning models for image classification. In this talk I will give a brief introduction to TensorFlow as well as share some of my experiences of working with it. I will try to make this talk not about deep learning with TensorFlow, but rather about TensorFlow itself, which I happen to use for deep learning.
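The data-flow-graph-with-automatic-differentiation paradigm can be illustrated without the library itself; the toy reverse-mode graph below is a pure-Python sketch, not TensorFlow's API:

```python
class Node:
    """A node in a tiny data-flow graph supporting reverse-mode autodiff."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # upstream nodes in the graph
        self.local_grads = local_grads  # d(this node)/d(each parent)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

def backward(out):
    """Propagate gradients from `out` in reverse topological order."""
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                visit(p)
            order.append(n)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, g in zip(node.parents, node.local_grads):
            parent.grad += g * node.grad

x, y = Node(3.0), Node(4.0)
z = x * y + x                 # z = x*y + x
backward(z)
print(x.grad, y.grad)         # dz/dx = y + 1, dz/dy = x
```

TensorFlow builds the same kind of graph, but additionally compiles it for parallel execution across CPUs and GPUs.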

## Theory and application of latent variable models for multivariate binomial data

### John Holmes

Department of Mathematics and Statistics

Date: Thursday 24 August 2017

A large body of work has been devoted to developing latent variable models for exponential-family distributed multivariate data exhibiting interdependencies. For the binomial case, however, extensions beyond the analysis of binary data are almost entirely missing. Focusing on principal component/factor analysis representations, we will show that, under the canonical logit link, latent variable models can be fitted in closed form, via Gibbs sampling, to multivariate binomial data of arbitrary trial size by applying Pólya-gamma augmentation to the binomial likelihood. In this talk, the properties of binomial latent variable models under Pólya-gamma data augmentation will be discussed from both a theoretical perspective and through application to a range of simulated and real demographic datasets.

## Māori student success: Findings from the Graduate Longitudinal Study New Zealand

### Moana Theodore

Department of Psychology

Date: Thursday 17 August 2017

Māori university graduates are role models for educational success and important for the social and economic wellbeing of Māori whānau (extended family), communities and society in general. Describing their experiences can help to build an evidence base to inform practice, decision-making and policy. I will describe findings for Māori graduates from all eight New Zealand universities who are participants in the Graduate Longitudinal Study New Zealand. Data were collected when the Māori participants were in their final year of study in 2011 (n=626) and two years post-graduation in 2014 (n=455). First, I will focus on what Māori graduates describe as helping or hindering the completion of their qualifications, including external (e.g. family), institutional (e.g. academic support) and student/personal (e.g. persistence) factors. Second, I will describe Māori graduate outcomes at 2 years post-graduation. In particular, I will describe the private benefits of higher education, such as labour market outcomes (e.g. employment and income), as well as the social benefits such as civic participation and volunteerism. Overall, our findings suggest that boosting higher education success for Māori may reduce ethnic inequalities in New Zealand labour market outcomes and may impart substantial social benefits as a result of Māori graduates’ contribution to society.

## Bayes factors, priors and mixtures

### Matthew Schofield

Department of Mathematics and Statistics

Date: Thursday 10 August 2017

It is well known that Bayes factors are sensitive to the prior distribution chosen on the parameters. This has led to comments such as “Diffuse prior distributions ... must be used with care” (Robert 2014) and “We do not see Bayesian methods as generally useful for giving the posterior probability that a model is true, or the probability for preferring model A over model B” (Gelman and Shalizi 2013). We consider the calculation of Bayes factors for nested models. We show this is equivalent to a model with a mixture prior distribution, where the weights on the resulting posterior are related to the Bayes factor. These results allow us to directly compare Bayes factors to shrinkage priors, such as the Laplace prior used in the Bayesian lasso. We use these results as the basis for offering practical suggestions for estimation and selection in nested models.
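For a nested normal-mean model the prior sensitivity mentioned above is easy to demonstrate with the standard Savage-Dickey density ratio (a textbook sketch, not the mixture-prior construction of the talk); the data summary and prior variances below are invented:

```python
import math

def bf01_savage_dickey(ybar, n, tau2):
    """Savage-Dickey Bayes factor for H0: theta = 0 nested in
    H1: theta ~ N(0, tau2), with data y_i ~ N(theta, 1)."""
    v = 1.0 / (n + 1.0 / tau2)        # posterior variance of theta under H1
    m = v * n * ybar                  # posterior mean of theta under H1
    post0 = math.exp(-m**2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    prior0 = 1.0 / math.sqrt(2 * math.pi * tau2)
    return post0 / prior0             # posterior/prior density at theta = 0

# Same data, two priors: the more diffuse prior favours the null more.
print(bf01_savage_dickey(ybar=0.3, n=20, tau2=1.0))
print(bf01_savage_dickey(ybar=0.3, n=20, tau2=100.0))
```

Making the alternative's prior more diffuse inflates the Bayes factor in favour of the null, the Lindley-Bartlett behaviour that motivates the cautionary quotes above.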

## Development and implementation of culturally informed guidelines for medical genomics research involving Māori communities

### Phil Wilcox

Department of Mathematics and Statistics

Date: Thursday 3 August 2017

Medical genomic research is usually conducted within a ‘mainstream’ cultural context. Māori communities have been underrepresented in such research despite being impacted by heritable diseases and other conditions that could potentially be unravelled via modern genomic technologies. Reasons for low participation of Māori communities include negative experiences of genomics and genetics researchers – such as the notorious ‘Warrior Gene’ saga – and an unease with technologies that are often implemented by non-Māori researchers in a manner inconsistent with Māori values. In my talk I will describe recently developed guidelines for ethically appropriate genomics research with Māori communities; how these guidelines were informed by my iwi, Ngāti Rakaipaaka, who had previously been involved in a medical genomics investigation; and current efforts to complete that research via a partnership with Te Tari Pāngarau me Tātauranga ki Te Whare Wānaka o Otakou (Department of Mathematics and Statistics at the University of Otago).

## Who takes Statistics? A look at student composition, 2000-2016

### Peter Dillingham

Department of Mathematics and Statistics

Date: Thursday 27 July 2017

In this blended seminar and discussion, we will examine how student data can help inform curriculum development and review, focussing on the Statistics programme as an example. Currently, the Statistics academic staff are reviewing our programme to ensure that we continue to provide a high quality and modern curriculum that meets the needs of students. An important component of this process is to understand who our students are and what they are interested in, from first-year service teaching through to students majoring in statistics. As academics, we often have a reasonable answer to these questions, but we can be more specific by poring over student data. While not glamorous, this sort of data can help confirm those things we think we know, identify opportunities or risks, and help answer specific questions where we know that we don’t know the answer.

## A missing value approach for breeding value estimation

### Alastair Lamont

Department of Mathematics and Statistics

Date: Thursday 20 July 2017

A key goal in quantitative genetics is the identification and selective breeding of individuals with high economic value. For a particular trait, an individual’s breeding value is the genetic worth it has for its progeny. While methods for estimating breeding values have existed since the middle of last century, the march of technology now allows the genotypes of individuals to be directly measured. This additional information allows for improved breeding value estimation, supplementing observed measurements and known pedigree information. However, while it can be cost efficient to genotype some animals, it is infeasible to genotype every individual in most populations of interest, due to either cost or logistical issues. As such, any approach must be able to accommodate missing data, while also managing computational efficiency, as the dimensionality of the data can be immense. Most modern approaches tend to impute or average over the missing data in some fashion, rather than fully incorporating it into the model. These approximations lead to a loss in estimation accuracy. Similar models are used within human genetics, but for different purposes. Because the data and goals differ from those of quantitative genetics, these approaches natively include missing data within the model. We are developing an approach that utilises a human genetics framework, adapted so as to estimate breeding values.

## Assessing and dealing with imputation inaccuracy in genomic predictions

### Michael Lee

Department of Mathematics and Statistics

Date: Thursday 13 July 2017

Genomic predictions rely on having genotypes from high-density SNP chips for many individuals. National animal evaluations, which predict breeding values, may include millions of animals, an increasing proportion of which have genotype information. Imputation can be used to make genomic predictions more cost effective. For example, in the NZ sheep industry genomic predictions can be done by genotyping animals with a SNP chip of lower density (e.g. 5-15K) and imputing the genotypes for a given animal to a density of about 50K, where the imputation process needs a reference panel of 50K genotypes. The imputed genotypes are used in genomic predictions, and the accuracy of imputation is a function of the quality of the reference panel. A study to assess the imputation accuracy of a wide range of animals was undertaken. The goal was to quantify the levels of inaccuracy and to determine the best strategy for dealing with this inaccuracy in the context of single-step genomic best linear unbiased prediction (ssGBLUP).

## Twists and trends in exercise science

### Jim Cotter

School of Physical Education, Sport and Exercise Sciences

Date: Thursday 1 June 2017

From my perspective, exercise science is entering an age of enlightenment, but misuse of statistics remains a serious limitation to its contributions and progress for human health, performance, and basic knowledge. This seminar will summarise our recent and current work in hydration, heat stress and patterns of dosing/prescribing exercise, and the implications for human health and performance. These contexts will be used to discuss methodological issues, including research design, analysis and interpretation.

## Hidden Markov models for incompletely observed point processes

Department of Mathematics and Statistics

Date: Thursday 25 May 2017

Natural phenomena such as earthquakes and volcanic eruptions can cause catastrophic damage. Such phenomena can be modelled using point processes. However, this is complicated, and potentially biased, by the problem of missing data in the records. The degree of completeness of volcanic records varies dramatically over time: often the older the record is, the more incomplete it is. One way to handle such records with missing data is to use hidden Markov models (HMMs). An HMM is a two-layered process based on an observed process and an unobserved first-order stationary Markov chain, in which state durations are geometrically distributed. This memoryless duration assumption limits the application of HMMs in the field of volcanology, where the processes leading to missed observations do not necessarily behave in a memoryless and time-independent manner. We propose inhomogeneous hidden semi-Markov models (IHSMMs) to investigate the time-inhomogeneity of the completeness of volcanic eruption catalogues and so obtain reliable hazard estimates.

## The CBOE SKEW

### Jin Zhang

Department of Accountancy and Finance

Date: Thursday 18 May 2017

The CBOE SKEW is an index launched by the Chicago Board Options Exchange (CBOE) in February 2011. Its term structure tracks the risk-neutral skewness of the S&P 500 (SPX) index for different maturities. In this paper, we develop a theory for the CBOE SKEW by modelling SPX using a jump-diffusion process with stochastic volatility and stochastic jump intensity. With the term structure data of VIX and SKEW, we estimate model parameters and obtain the four processes of variance, jump intensity and their long-term mean levels. Our results can be used to describe SPX risk-neutral distribution and to price SPX options.
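The published convention for the index is SKEW = 100 − 10S, where S is the risk-neutral skewness of 30-day SPX log returns. The sketch below applies that convention to the sample skewness of simulated left-skewed returns; the jump mixture is invented for illustration and is not the paper's jump-diffusion model:

```python
import math, random

random.seed(1)

def sample_skewness(xs):
    """Standardised third sample moment."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean)**2 for x in xs) / n)
    return sum(((x - mean) / sd)**3 for x in xs) / n

# Left-skewed "returns": mostly small moves, occasional large losses.
returns = [random.gauss(0.0005, 0.01) - (0.08 if random.random() < 0.02 else 0.0)
           for _ in range(10000)]

S = sample_skewness(returns)
skew_index = 100 - 10 * S    # the SKEW convention: 100 - 10 * skewness
print(round(skew_index, 1))
```

Because index return distributions are left-skewed under the risk-neutral measure, S is negative and the SKEW index sits above 100, rising as tail risk grows.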

## Finding true identities in a sample using MCMC methods

### Paula Bran

Department of Mathematics and Statistics

Date: Thursday 11 May 2017

Uncertainty about the true identities behind observations is known in statistics as a misidentification problem. The observations may be duplicated, wrongly reported or missing, resulting in error-prone data collection. These errors can seriously affect inferences and conclusions. A wide variety of MCMC algorithms have been developed for simulating the latent identities of individuals in a dataset using Bayesian inference. In this talk, the DIU (Direct Identity Updater) algorithm is introduced. It is a Metropolis-Hastings sampler with an application-specific proposal density. Its performance and efficiency are compared with two other algorithms solving similar problems. Convergence to the correct stationary distribution is discussed using a toy example in which the data comprise genotypes observed with uncertainty. As the state space is small, the behaviour of the chains is easily visualised. Interestingly, while they converge to the same stationary distribution, the transition matrices for the different algorithms have little in common.
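The flavour of such samplers can be conveyed by a generic Metropolis-Hastings chain on a tiny discrete state space; the three-state target below is hypothetical and the proposal is not the DIU's application-specific density:

```python
import random

random.seed(42)

# Hypothetical target distribution over three "identity assignments".
target = {0: 0.5, 1: 0.3, 2: 0.2}

def mh_frequencies(n_steps, proposal):
    """Run Metropolis-Hastings and return empirical state frequencies."""
    counts = {s: 0 for s in target}
    state = 0
    for _ in range(n_steps):
        cand = proposal(state)
        # Symmetric proposal, so the acceptance ratio is a density ratio.
        if random.random() < min(1.0, target[cand] / target[state]):
            state = cand
        counts[state] += 1
    return {s: c / n_steps for s, c in counts.items()}

# Propose uniformly among the other two states (a symmetric proposal).
uniform_proposal = lambda s: random.choice([t for t in target if t != s])

freq = mh_frequencies(200000, uniform_proposal)
print({s: round(p, 2) for s, p in freq.items()})
```

Any valid proposal yields the same stationary distribution, yet, as the abstract notes for the DIU comparison, the transition matrices of different samplers can look entirely different.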

## Correlated failures in multicomponent systems

### Richard Arnold

Victoria University Wellington

Date: Thursday 4 May 2017

Multicomponent systems may experience failures with correlations amongst failure times of groups of components, and some subsets of components may experience common cause, simultaneous failures. We present a novel, general approach to model construction and inference in multicomponent systems incorporating these correlations in an approach that is tractable even in very large systems. In our formulation the system is viewed as being made up of Independent Overlapping Subsystems (IOS). In these systems components are grouped together into overlapping subsystems, and further into non-overlapping subunits. Each subsystem has an independent failure process, and each component's failure time is the time of the earliest failure in all of the subunits of which it is a part.

This is joint work with Stefanka Chukova (VUW) and Yu Hayakawa (Waseda University, Tokyo)

## Integration of IVF technologies with genomic selection to generate high merit AI bulls: a simulation study

### Fiona Hely

AbacusBio

Date: Thursday 27 April 2017

New reproductive technologies, such as genotyping of embryos prior to cloning and IVF, allow the possibility of targeting elite AI bull calves from high merit sires and dams. A stochastic simulation model was set up to replicate both progeny testing and genomic selection dairy genetic improvement schemes, with and without the use of IVF to generate bull selection candidates. The reproductive process was simulated using a series of random variates to assess the likelihood of a given cross between a selected sire and dam producing a viable embryo, and the superiority of these viable bulls was assessed from the perspective of a commercial breeding company.

## Recovery and recolonisation by New Zealand southern right whales: making the most of limited sampling opportunities

### Will Rayment

Department of Marine Science

Date: Thursday 13 April 2017

Studies of marine megafauna are often logistically challenging, thus limiting our ability to gain robust insights into the status of populations. This is especially true for southern right whales, a species which was virtually extirpated in New Zealand waters by commercial whaling in the 19th century, and restricted to breeding around the remote sub-Antarctic Auckland Islands. We have gathered photo-ID and distribution data during annual 3-week duration trips to study right whales at the Auckland Islands since 2006. Analysis of the photo-ID data has yielded estimates of demographic parameters including survival rate and calving interval, essential for modelling the species’ recovery, while species-distribution models have been developed to reveal the specific habitat preferences of calving females. These data have been supplemented by visual and acoustic autonomous monitoring, in order to investigate seasonal occurrence of right whales in coastal habitats. Understanding population recovery, and potential recolonization of former habitats around mainland New Zealand, is essential if the species is to be managed effectively in the future.

## Ion-selective electrode sensor arrays: calibration, characterisation, and estimation

### Peter Dillingham

Department of Mathematics and Statistics

Date: Thursday 6 April 2017

Ion-selective electrodes (ISEs) have undergone a renaissance over the last 20 years. New fabrication techniques, which allow mass production, have led to their increasing use in demanding environmental and health applications. These deployable low-cost sensors are now capable of measuring sub-micromolar concentrations in complex and variable solutions, including blood, sweat, and soil. However, these measurement challenges have highlighted the need for modern calibration techniques to properly characterise ISEs and report measurement uncertainty. In this talk, our group’s developments will be discussed, with a focus on modelling ISEs, properly defining the limit of detection, and extensions to sensor arrays.
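The classical calibration step assumes a Nernstian response E = E0 + S·log10(a) and fits slope and intercept by least squares; the activities and potentials below are synthetic numbers for illustration, not the group's data:

```python
import math

# Hypothetical calibration: measured potentials (mV) at known activities (M).
activities = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
potentials = [102.0, 161.5, 220.9, 280.2, 339.8]   # synthetic, near-Nernstian

# Fit E = E0 + S * log10(a) by ordinary least squares.
xs = [math.log10(a) for a in activities]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(potentials) / n
S = sum((x - xbar) * (y - ybar) for x, y in zip(xs, potentials)) / \
    sum((x - xbar)**2 for x in xs)
E0 = ybar - S * xbar
print(round(S, 1), round(E0, 1))   # slope near 59.4 mV per decade
```

A fitted slope near 59.2 mV per decade at 25 °C is consistent with a monovalent ion; the modern calibration methods discussed in the talk extend this basic fit with proper measurement-uncertainty and detection-limit estimation.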

## What in the world caused that? Statistics of sensory spike trains and neural computation for inference

### Mike Paulin

Department of Zoology

Date: Thursday 30 March 2017

Before the “Cambrian explosion” 542 million years ago, animals without nervous systems reacted to environmental signals mapped onto the body surface. Later animals constructed internal maps from noisy partial observations gathered at the body surface. Considering the energy costs of data acquisition and inference versus the costs of not doing this in late Precambrian ecosystems leads us to model spike trains recorded from sensory neurons (in sharks, frogs and other animals) as samples from a family of inverse Gaussian-censored Poisson (a.k.a. Exwald) point processes. Neurons that evolved for other reasons turn out to be natural mechanisms for generating samples from Exwald processes, and natural computers for inferring the posterior density of their parameters. This is a consequence of a curious correspondence between the likelihood function for sequential inference from a censored Poisson process and the impulse response function of a neuronal membrane. We conclude that modern animals, including humans, are natural Bayesians because when neurons evolved 560 million years ago they provided our ancestors with a choice between being Bayesian or being dead.
This is joint work with recent Otago PhD students Kiri Pullar and Travis Monk, honours student Ethan Smith, and UCLA neuroscientist Larry Hoffman.
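An Exwald interval is the sum of an inverse Gaussian (Wald) first-passage time and an exponential waiting time, so sampling one is straightforward; the parameters below are illustrative only, not fitted to any spike train:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate Exwald interspike intervals as Wald + exponential components.
# Parameter values are invented for illustration (units: ms).
n = 100000
wald_part = rng.wald(mean=5.0, scale=20.0, size=n)   # first-passage times
exp_part = rng.exponential(scale=3.0, size=n)        # censored-Poisson waits
isi = wald_part + exp_part                           # Exwald intervals

print(round(isi.mean(), 1))   # expectation is 5.0 + 3.0 = 8.0
```

The additive structure is what makes the sequential-inference correspondence in the abstract possible: each component contributes a tractable factor to the interval likelihood.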

## Brewster Glacier – a benchmark for investigating glacier-climate interactions in the Southern Alps of New Zealand

### Nicolas Cullen

Department of Geography

Date: Thursday 23 March 2017

The advance of some fast-responding glaciers in the Southern Alps of New Zealand at the end of the 20th and beginning of the 21st century during three of the warmest decades of the instrumental era provides clear evidence that changes in large-scale atmospheric circulation in the Southern Hemisphere can act as a counter-punch to global warming. The Southern Alps are surrounded by vast areas of ocean and are strongly influenced by both subtropical and polar air masses, with the interaction of these contrasting air masses in the prevailing westerly airflow resulting in weather systems having a strong influence on glacier mass balance. Until recently, one of the challenges in assessing how large-scale atmospheric circulation influences glacier behaviour has been the lack of observational data from high-elevation sites in the Southern Alps. However, high-quality meteorological and glaciological observations from Brewster Glacier allow us to now assess in detail how atmospheric processes at different scales influence glacier behaviour. This talk will provide details about the observational programme on Brewster Glacier, which has been continuous for over a decade, and then target how weather systems influence daily ablation and precipitation (snowfall).

## Estimating overdispersion in sparse multinomial data

### Farzana Afroz

Department of Mathematics and Statistics

Date: Thursday 16 March 2017

When overdispersion is present in a data set, ignoring it may lead to serious underestimation of standard errors and potentially misleading model comparisons. Generally we estimate the overdispersion parameter $\phi$ by dividing Pearson's goodness-of-fit statistic $X^2$ by the residual degrees of freedom. But when the data are sparse, that is, when there are many zero or small counts, it may not be reasonable to use this statistic, since $X^2$ is unlikely to be $\chi^2$-distributed. This study presents a comparison of four estimators of the overdispersion parameter $\phi$, in terms of bias, root mean squared error and standard deviation, when the data are sparse and multinomial. Dead-recovery data on herring gulls from Kent Island, Canada are used to provide a practical example of sparse multinomial data. In a simulation study, we consider the Dirichlet-multinomial distribution and a finite mixture distribution, which are widely used to model extra variation in multinomial data.
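The basic estimator described above can be shown in a few lines; the counts and cell probabilities are invented, and the residual degrees of freedom here assume no parameters were estimated from the data:

```python
# Estimate the overdispersion parameter phi = X^2 / df for multinomial counts.
observed = [18, 30, 12, 25, 15]                  # hypothetical cell counts
n = sum(observed)
probs = [0.2, 0.25, 0.15, 0.25, 0.15]            # hypothesised cell probabilities
expected = [n * p for p in probs]

X2 = sum((o - e)**2 / e for o, e in zip(observed, expected))
df = len(observed) - 1        # residual df (no parameters estimated here)
phi = X2 / df
print(round(phi, 2))
```

With many near-zero expected counts the $\chi^2$ approximation for $X^2$ breaks down, which is exactly the sparse setting where the talk's alternative estimators are needed.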
Fast computation of spatially adaptive kernel smooths

### Tilman Davies

Department of Mathematics and Statistics

Date: Thursday 9 March 2017

Kernel smoothing of spatial point data can often be improved using an adaptive, spatially-varying bandwidth instead of a fixed bandwidth. However, computation with a varying bandwidth is much more demanding, especially when edge correction and bandwidth selection are involved. We propose several new computational methods for adaptive kernel estimation from spatial point pattern data. A key idea is that a variable-bandwidth kernel estimator for d-dimensional spatial data can be represented as a slice of a fixed-bandwidth kernel estimator in (d+1)-dimensional "scale space", enabling fast computation using discrete Fourier transforms. Edge correction factors have a similar representation. Different values of global bandwidth correspond to different slices of the scale space, so that bandwidth selection is greatly accelerated. Potential applications include estimation of multivariate probability density and spatial or spatiotemporal point process intensity, relative risk, and regression functions. The new methods perform well in simulations and real applications.
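A toy one-dimensional sketch of the scale-space idea, with invented data and without the FFT acceleration or edge correction of the actual method: a variable-bandwidth estimate is assembled from a small set of fixed-bandwidth "slices", each data point contributing through the slice whose global bandwidth is nearest its local bandwidth.

```python
import math

def gauss_kde(points, grid, h):
    """Fixed-bandwidth Gaussian kernel density estimate on a grid."""
    c = 1.0 / (math.sqrt(2 * math.pi) * h * len(points))
    return [c * sum(math.exp(-0.5 * ((g - p) / h) ** 2) for p in points)
            for g in grid]

def adaptive_kde(points, grid, bandwidths, h_slices):
    """Variable-bandwidth estimate assembled from fixed-bandwidth
    'slices': each point contributes through the slice whose global
    bandwidth is nearest its own local bandwidth."""
    est = [0.0] * len(grid)
    for p, h in zip(points, bandwidths):
        h_near = min(h_slices, key=lambda s: abs(s - h))
        c = 1.0 / (math.sqrt(2 * math.pi) * h_near * len(points))
        for i, g in enumerate(grid):
            est[i] += c * math.exp(-0.5 * ((g - p) / h_near) ** 2)
    return est

pts = [0.1, 0.2, 0.25, 0.8]    # clustered points get small bandwidths
bw = [0.05, 0.05, 0.05, 0.3]   # the isolated point gets a large one
grid = [i / 10 for i in range(11)]
density = adaptive_kde(pts, grid, bw, h_slices=[0.05, 0.1, 0.2, 0.4])
```

In the actual method each "slice" is a fixed-bandwidth smooth that can be computed with discrete Fourier transforms, which is where the speed comes from.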
Detection and replenishment of missing data in the observation of point processes with independent marks

### Jiancang Zhuang

Institute of Statistical Mathematics, Tokyo

Date: Thursday 2 March 2017

Records of geophysical events such as earthquakes and volcanic eruptions, which are usually modeled as marked point processes, often have missing data that result in underestimation of the corresponding hazards. This study presents a fast approach for replenishing missing data in the record of a temporal point process with time-independent marks. The basis of this method is that, if such a point process is completely observed, it can be transformed into a homogeneous Poisson process on the unit square $[0,1]^2$ by a biscale empirical transformation. The method is tested on a synthetic dataset and applied to the record of volcanic eruptions at the Hakone Volcano, Japan, and to several aftershock sequences following large earthquakes. In particular, by comparing the analysis results from the original and the replenished aftershock datasets, we find that both the Omori-Utsu formula and the ETAS model are stable, and that the variation in the estimated parameters with different magnitude thresholds in past studies was caused by short-term missing records of small events.
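A schematic of the transformation idea, with invented data; the actual method transforms the time coordinate through the fitted cumulative intensity, whereas this sketch uses plain empirical CDFs for both coordinates.

```python
import bisect

def empirical_cdf(values):
    """Return the empirical CDF of a sample as a callable."""
    srt = sorted(values)
    n = len(srt)
    return lambda x: bisect.bisect_right(srt, x) / n

# Invented example: event times and magnitudes (marks).
times = [1.2, 3.5, 4.1, 7.8, 9.0, 12.4]
marks = [2.1, 3.0, 2.4, 5.2, 2.8, 3.9]

F_t = empirical_cdf(times)
F_m = empirical_cdf(marks)

# For a completely observed process with i.i.d. marks independent of
# time, these points should look uniform on the unit square; regions
# with too few points flag missing data.
unit_square = [(F_t(t), F_m(m)) for t, m in zip(times, marks)]
```

Detecting a sparse region of the square, and filling it in proportionately, is the essence of the replenishment step.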
A new multidimensional stress release statistical model based on coseismic stress transfer

### Shiyong Zhou

Peking University

Date: Tuesday 14 February 2017

NOTE venue is not our usual
Following the stress release model (SRM) proposed by Vere-Jones (1978), we developed a new multidimensional SRM: a space-time-magnitude version based on multidimensional point processes. First, we interpreted the exponential hazard functional of the SRM as the mathematical expression of static fatigue failure caused by stress corrosion. Then, we reconstructed the SRM in multidimensions by incorporating four independent submodels: the magnitude distribution function, the space weighting function, the loading rate function and the coseismic stress transfer model. Finally, we applied the new model to historical earthquake catalogues in North China. An expanded catalogue, containing information on origin time, epicentre, magnitude, strike, dip angle, rupture length, rupture width and average dislocation, was compiled for the new model. The estimated model can simulate the variation of seismicity with space, time and magnitude. Compared with previous SRMs fitted to the same data, the new model yields much smaller values of the Akaike information criterion and the corrected Akaike information criterion. We compared the predicted rates of earthquakes at the epicentres just before the related earthquakes with the mean spatial seismic rate: among all 37 earthquakes in the expanded catalogue, the epicentres of 21 are located in regions of higher predicted rates.
Next generation ABO blood type genetics and genomics

### Keolu Fox

University of California, San Diego

Date: Wednesday 1 February 2017

The ABO gene encodes a glycosyltransferase, which adds sugars (N-acetylgalactosamine for A and α-D-galactose for B) to the H antigen substrate. Single-nucleotide variants in the ABO gene affect the function of this glycosyltransferase at the molecular level by altering the specificity and efficiency of the enzyme for these specific sugars. Characterizing variation in ABO is important in transfusion and transplantation medicine because variants in ABO have significant consequences for recipient compatibility. Additionally, variation in the ABO gene has been associated with cardiovascular disease risk (e.g., myocardial infarction) and quantitative blood traits (von Willebrand factor (VWF), Factor VIII (FVIII) and Intercellular Adhesion Molecule 1 (ICAM-1)). Relating ABO genotypes to actual blood antigen phenotype requires the analysis of haplotypes. Here we will explore variation (single-nucleotide variants, insertions and deletions, and structural variation) in blood cell trait gene loci (ABO) using multiple NGS datasets enriched for heart, lung and blood-related diseases (including both African-Americans and European-Americans), e.g. the NHLBI Exome Sequencing Project (ESP) dataset. I will also describe the use of a new ABO haplotyping method, ABO-seq, to increase the accuracy of ABO blood type and subtype calling using variation in multiple NGS datasets. Finally, I will describe the use of multiple read-depth based approaches to discover previously unsuspected structural variation (SV) in genes not previously shown to harbor SV, such as the ABO gene, by focusing on understudied populations, including individuals of Hispanic and African ancestry.

Keolu has a strong background in using genomic technologies to understand human variation and disease. Throughout his career he has made it his priority to focus on the interface of minority health and genomic technologies. Keolu earned a Ph.D. in Debbie Nickerson's lab in the University of Washington's Department of Genome Sciences (August, 2016). In collaboration with experts at Bloodworks Northwest, (Seattle, WA) he focused on the application of next-generation genome sequencing to increase compatibility for blood transfusion therapy and organ transplantation. Currently Keolu is a postdoc in Alan Saltiel's lab at the University of California San Diego (UCSD) School of Medicine, Division of Endocrinology and Metabolism and the Institute for Diabetes and Metabolic Health. His current project focuses on using genome editing technologies to investigate the molecular events involved in chronic inflammatory states resulting in obesity and catecholamine resistance.
To be or not to be (Bayesian) Non-Parametric: A tale about Stochastic Processes

### Roy Costilla

Victoria University Wellington

Date: Tuesday 24 January 2017

Thanks to advances in theory and computation over the last decades, Bayesian Non-Parametric (BNP) models are now used in many fields, including Biostatistics, Bioinformatics, Machine Learning and Linguistics.

Despite the name, however, BNP models are actually massively parametric. A parametric model uses a function with a finite-dimensional parameter vector as a prior; Bayesian inference then proceeds to approximate the posterior of these parameters given the observed data. In contrast, a BNP model is defined on an infinite-dimensional probability space through the use of a stochastic process as a prior. In other words, the prior for a BNP model is a space of functions with an infinite-dimensional parameter vector. Rather than avoiding parametric forms, then, BNP inference uses a large number of them to gain flexibility.

To illustrate this, we present simulations and a case study of life satisfaction in NZ over 2009-2013. We estimate the models using a finite Dirichlet Process Mixture (DPM) prior. We show that this BNP model is tractable, i.e. easily fitted using Markov chain Monte Carlo (MCMC) methods, allowing us to handle data with large sample sizes and to estimate the model parameters correctly. Coupled with a post-hoc clustering of the DPM locations, the BNP model also allows an approximation of the number of mixture components, a very important parameter in mixture modelling.
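The infinite-dimensional prior behind a DPM can be seen in its stick-breaking construction; below is a minimal truncated sketch (illustrative only, not the estimation scheme used in the talk).

```python
import random

def stick_breaking(alpha, n_atoms):
    """Truncated stick-breaking construction of Dirichlet-process
    weights: break off a Beta(1, alpha) fraction of the remaining
    stick length at each step."""
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        b = random.betavariate(1, alpha)
        weights.append(b * remaining)
        remaining *= 1.0 - b
    return weights

random.seed(42)
w = stick_breaking(alpha=2.0, n_atoms=25)
# The weights are positive and sum to just under 1; the untruncated
# sequence defines a full (infinite-dimensional) random distribution.
```

Truncating at a finite number of atoms, as here and as in the finite DPM prior the talk mentions, yields a tractable but very high-dimensional parametric approximation.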
Computational methods and statistical modelling in the analysis of co-occurrences: where are we now?

### Jorge Navarro Alberto

Date: Wednesday 9 November 2016

NOTE day and time of this seminar
The subject of this talk is statistical methods (both theoretical and applied) and computational algorithms for the analysis of binary data, which have been applied in ecology to study species composition in systems of patches, with the ultimate goal of uncovering ecological patterns. As a starting point, I review Gotelli and Ulrich's (2012) six statistical challenges in null model analysis in ecology. I then illustrate the most recent research carried out by me and by other statisticians and ecologists to address those challenges, as well as applications of the algorithms outside the biological sciences. Several topics for research are proposed, seeking to motivate statisticians and computer scientists to venture into, and eventually specialize in, the analysis of co-occurrences.
Reference: Gotelli, N.J. and Ulrich, W. (2012). Statistical challenges in null model analysis. Oikos 121: 171-180.
Extensions of the multiset sampler

### Scotland Leman

Virginia Tech, USA

Date: Tuesday 8 November 2016

NOTE day and time of this seminar
In this talk I will primarily discuss the Multiset Sampler (MSS): a general ensemble-based Markov chain Monte Carlo (MCMC) method for sampling from complicated stochastic models. After that, I will briefly introduce my interactive visual analytics research.

Proposal distributions for complex structures are essential for virtually all MCMC sampling methods. However, such proposal distributions are difficult to construct so that their probability distribution matches that of the true target distribution, which hampers the efficiency of the overall MCMC scheme. The MSS entails sampling from an augmented distribution that has more desirable mixing properties than the original target model, while utilizing simple independent proposal distributions that are easily tuned. I will discuss applications of the MSS to sampling from tree-based models (e.g. Bayesian CART; phylogenetic models), and to general model selection, model averaging and predictive sampling.

In the final 10 minutes of the presentation I will discuss my research interests in interactive visual analytics and the Visual To Parametric Interaction (V2PI) paradigm. I'll discuss the general concepts of V2PI with an application to multidimensional scaling, its technical merits, and the integration of these concepts into core undergraduate and graduate statistics programs.
New methods for estimating spectral clustering change points for multivariate time series

### Ivor Cribben

University of Alberta

Date: Wednesday 19 October 2016

NOTE day and time of this seminar
Spectral clustering is a computationally feasible and model-free method widely used in the identification of communities in networks. We introduce a data-driven method, namely Network Change Points Detection (NCPD), which detects change points in the network structure of a multivariate time series, with each component of the time series represented by a node in the network. Spectral clustering allows us to consider high dimensional time series where the number of time series is greater than the number of time points. NCPD allows for estimation of both the time of change in the network structure and the graph between each pair of change points, without prior knowledge of the number or location of the change points. Permutation and bootstrapping methods are used to perform inference on the change points. NCPD is applied to various simulated high dimensional data sets as well as to a resting state functional magnetic resonance imaging (fMRI) data set. The new methodology also allows us to identify common functional states across subjects and groups. Extensions of the method are also discussed. Finally, the method promises to offer a deep insight into the large-scale characterisations and dynamics of the brain.
Inverse prediction for paleoclimate models

### John Tipton

Date: Tuesday 18 October 2016

NOTE day and time of this seminar
Many scientific disciplines have strong traditions of developing models to approximate nature. Traditionally, statistical models have not included scientific models and have instead focused on regression methods that exploit correlation structures in data. The development of Bayesian methods has generated many examples of forward models that bridge the gap between scientific and statistical disciplines. The ability to fit forward models using Bayesian methods has generated interest in paleoclimate reconstructions, but there are many challenges in model construction and estimation that remain.

I will present two statistical reconstructions of climate variables using paleoclimate proxy data. The first example is a joint reconstruction of temperature and precipitation from tree rings using a mechanistic process model. The second reconstruction uses microbial species assemblage data to predict peat bog water table depth. I validate predictive skill using proper scoring rules in simulation experiments, providing justification for the empirical reconstruction. Results show forward models that leverage scientific knowledge can improve paleoclimate reconstruction skill and increase understanding of the latent natural processes.
Ultrahigh dimensional variable selection for interpolation of point referenced spatial data

### Benjamin Fitzpatrick

Queensland University of Technology

Date: Monday 17 October 2016

NOTE day and time of this seminar
When making inferences concerning the environment, ground truthed data will frequently be available as point referenced (geostatistical) observations accompanied by a rich ensemble of potentially relevant remotely sensed and in-situ observations.
Modern soil mapping is one such example characterised by the need to interpolate geostatistical observations from soil cores and the availability of data on large numbers of environmental characteristics for consideration as covariates to aid this interpolation.

In this talk I will outline my application of Least Absolute Shrinkage and Selection Operator (LASSO) regularized multiple linear regression (MLR) to build models for predicting full-cover maps of soil carbon when the number of potential covariates greatly exceeds the number of observations available (the p > n, or ultrahigh dimensional, scenario). I will outline how I have applied LASSO regularized MLR models to data from multiple (geographic) sites, and discuss investigations into the treatment of site membership in models and the geographic transferability of the models developed. I will also present novel visualisations of the results of ultrahigh dimensional variable selection and briefly outline some related work in ground cover classification from remotely sensed imagery.

Key references:
Fitzpatrick, B. R., Lamb, D. W., & Mengersen, K. (2016). Ultrahigh Dimensional Variable Selection for Interpolation of Point Referenced Spatial Data: A Digital Soil Mapping Case Study. PLoS ONE, 11(9): e0162489.
Fitzpatrick, B. R., Lamb, D. W., & Mengersen, K. (2016). Assessing Site Effects and Geographic Transferability when Interpolating Point Referenced Spatial Data: A Digital Soil Mapping Case Study. https://arxiv.org/abs/1608.00086
New Zealand master sample using balanced acceptance sampling

### Paul van Dam-Bates

Department of Conservation

Date: Thursday 13 October 2016

Environmental monitoring is critical for management organisations like the Department of Conservation. Without good information about outcomes, poor management actions may persist much longer than they should, or initial intervention may occur too late. The Department currently conducts focused research at key natural heritage sites (Tier 3) as well as long-term national monitoring (Tier 1). The link between the two tiers, which would assess the impact of management across New Zealand (Tier 2), is yet to be implemented and faces unique challenges in working at many different spatial scales and coordinating with multiple agencies. The proposed solution is to implement a Master Sample using Balanced Acceptance Sampling (BAS). To do this, some practical aspects of the sample design are addressed, such as stratification, unequal probability sampling, rotating panel designs and regional intensification. Incorporating information from Tier 1 monitoring directly is also discussed.

Authors: Paul van Dam-Bates[1], Ollie Gansell[1] and Blair Robertson[2]
1 Department of Conservation, New Zealand
2 University of Canterbury, Department of Mathematics and Statistics
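BAS draws spatially balanced points from a (random-start) Halton sequence; here is a minimal sketch of the underlying sequence on the unit square, omitting the random start and any stratification.

```python
def halton(index, base):
    """Radical-inverse (van der Corput) value of `index` in `base`."""
    f, r = 1.0, 0.0
    while index > 0:
        f /= base
        r += f * (index % base)
        index //= base
    return r

def halton_points(n, bases=(2, 3)):
    """First n points of the 2-D Halton sequence on the unit square."""
    return [tuple(halton(i, b) for b in bases) for i in range(1, n + 1)]

pts = halton_points(10)  # evenly spread, quasi-random points
```

Mapping such quasi-random points onto the study region gives samples that are well spread in space, which is the balance property BAS exploits.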
How robust are capture–recapture estimators of animal population density?

### Murray Efford

Department of Mathematics and Statistics

Date: Thursday 6 October 2016

Data from passive detectors (traps, automatic cameras etc.) may be used to estimate animal population density, especially if individuals can be distinguished. However, the spatially explicit capture–recapture (SECR) models used for this purpose rest on specific assumptions that may or may not be justified, and uncertainty regarding the robustness of SECR methods has led some to resist their use. I consider the robustness of SECR estimates to deviations from key spatial assumptions – uniform spatial distribution of animals, circularity of home ranges, and the shape of the radial detection function. The findings are generally positive, although variance estimates are sensitive to over-dispersion. The method is also somewhat robust to transience and other misspecifications of the detection model, but it is not foolproof, as I show with a counterexample.
Bootstrapped model-averaged confidence intervals

### Jimmy Zeng

Department of Preventive and Social Medicine

Date: Thursday 29 September 2016

Model-averaging is commonly used to allow for model uncertainty in parameter estimation. In the frequentist setting, a model-averaged estimate of a parameter is a weighted mean of the estimates from the individual models, with the weights being based on an information criterion, such as AIC. A Wald confidence interval based on this estimate will often perform poorly, as its sampling distribution will generally be distinctly non-normal and estimation of the standard error is problematic. We propose a new method that uses a studentized bootstrap approach. We illustrate its use with a lognormal example, and perform a simulation study to compare its coverage properties with those of existing intervals.
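The AIC-weighted averaging step described above can be sketched as follows; the studentized bootstrap layer, which is the new method's contribution, is omitted for brevity.

```python
import math

def aic_weights(aics):
    """Akaike weights: proportional to exp(-(AIC_i - min AIC) / 2)."""
    amin = min(aics)
    raw = [math.exp(-(a - amin) / 2) for a in aics]
    total = sum(raw)
    return [r / total for r in raw]

def model_averaged(estimates, aics):
    """Weighted mean of per-model estimates using Akaike weights."""
    return sum(w * e for w, e in zip(aic_weights(aics), estimates))

# Hypothetical estimates of the same parameter from three models.
theta = model_averaged([2.1, 2.4, 3.0], [100.0, 101.5, 104.0])
```

The sampling distribution of such a weighted mean is generally non-normal, which is why a Wald interval around it performs poorly and a bootstrap approach is attractive.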
N-mixture models vs Poisson regression

### Richard Barker

Department of Mathematics and Statistics

Date: Thursday 22 September 2016

N-mixture models describe count data replicated in time and across sites in terms of abundance N and detectability p. They are popular because they allow inference about N, while controlling for factors that influence p, without the need for marking animals. Using a capture-recapture perspective, we show that the loss of information that results from not marking animals is critical, making reliable statistical modeling of N and p problematic using count data alone. We are unable to fit a model in which the detection probabilities are distinct among repeat visits, as this model is overspecified; this makes uncontrolled variation in p problematic. By counterexample, we show that even if p is constant after adjusting for covariate effects (the 'constant p' assumption), scientifically plausible alternative models, in which N (or its expectation) is non-identifiable or does not even exist, lead to data that are practically indistinguishable from data generated under an N-mixture model. This is particularly the case for sparse data, as commonly seen in applications. We conclude that under the constant p assumption, reliable inference is only possible for relative abundance, unless one makes questionable and/or untestable assumptions or has better-quality data than seen in typical applications. Relative abundance models for counts can be readily fitted using Poisson regression in standard software such as R, and are sufficiently flexible to allow controlling for p through the use of covariates while simultaneously modeling variation in relative abundance. If users require estimates of absolute abundance, they should collect auxiliary data that help with the estimation of p.
Single-step genomic evaluation of New Zealand's sheep

Department of Mathematics and Statistics

Date: Thursday 15 September 2016

Quantitative genetics is the study of the inheritance of quantitative traits, which are generally continuously distributed. It uses biometry to study the expression of quantitative differences among individuals, taking account of genetic relatedness and environment. In the past, determining the genetic make-up of individuals was too expensive for commercial use. In the last decade, however, the price of genotyping has fallen rapidly, and commercial genotype chips are now available for most livestock species. Currently, dense marker maps are used to predict the genetic merit of animals early in life. Several methods are available for genomic evaluation; however, because they do not consider all the available information at the same time, bias or loss of accuracy may occur. Single-step GBLUP is a method that uses all the genomic, pedigree and phenotypic data on all animals simultaneously, and is reported to limit bias and in some cases increase the accuracy of prediction. Preliminary results of this approach for New Zealand sheep will be presented.
Clinical trial Data Monitoring Committees - aiding science

### Katrina Sharples

Department of Mathematics and Statistics

Date: Thursday 8 September 2016

The goal of a clinical trial is to obtain reliable evidence regarding the benefits and risks of a treatment while minimising harm to patients. Recruitment and follow-up may take place over several years, accruing information over time, which allows the option of stopping the trial early if the trial objectives have been met or the risks to patients become too great. It has become standard practice for trials with significant risk to be overseen by an independent Data Monitoring Committee (DMC). These DMCs have sole access to the accruing trial data; they are responsible for safeguarding the rights of the patients in the trial, and for making recommendations to those running the trial regarding trial conduct and possible early termination. However, interpreting the accruing evidence and making optimal recommendations is challenging. As the number of trials having DMCs has grown, there has been increasing discussion of how to train new DMC members. Some DMCs have published papers describing their decision-making processes for specific trials, and workshops are held fairly frequently. However, it is recognised that DMC expertise is best acquired through apprenticeship. Opportunities for this are rare internationally, but in New Zealand in 1996 the Health Research Council established a unique system for monitoring clinical trials which incorporates apprenticeship positions. This talk will describe our system, discuss some of the issues and insights that have arisen along the way, and outline the effects it has had on the NZ clinical trial environment.
A statistics-related guest seminar in Preventive and Social Medicine: A researcher's guide to understanding modern statistics

### Sander Greenland

University of California

Date: Monday 5 September 2016

Note day, time and venue of this special seminar
Sander Greenland is Research Professor and Emeritus Professor of Epidemiology and Statistics at the University of California, Los Angeles. He is a leading contributor to epidemiological statistics, theory, and methods, with a focus on the limitations and misuse of statistical methods in observational studies. He has authored or co-authored over 400 articles and book chapters in epidemiology, statistics, and medical publications, and co-authored the textbook Modern Epidemiology.

Professor Greenland has played an important role in the recent discussion following the American Statistical Association's statement on the use of p-values [1-3]. He will discuss lessons he took away from that process and how they apply to properly interpreting what is ubiquitous but rarely interpreted correctly by researchers: statistical tests, p-values, power, and confidence intervals.

1. Wasserstein, R.L. and Lazar, N.A. (2016). The ASA's statement on p-values: context, process, and purpose. The American Statistician, 70, 129-133. DOI: 10.1080/00031305.2016.1154108
2. Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.C., Poole, C., Goodman, S.N., and Altman, D.G. (2016). Statistical tests, confidence intervals, and power: a guide to misinterpretations. The American Statistician, 70, online supplement 1 at http://www.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108; reprinted in the European Journal of Epidemiology, 31, 337-350.
3. Greenland, S. (2016). The ASA guidelines and null bias in current teaching and practice. The American Statistician, 70, online supplement 10 at http://www.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108
Sugars are not all the same to us! Empirical investigation of inter- and intra-individual variabilities in responding to common sugars

### Mei Peng

Department of Food Science

Date: Thursday 25 August 2016

Given the collective interest in sugar from food scientists, geneticists, neurophysiologists, and many others (e.g., health professionals, food journalists, and YouTube experimenters), one would expect the picture of human sweetness perception to be reasonably complete by now. Unfortunately, this is not the case. Some seemingly fundamental questions have not yet been answered: is one's sweetness sensitivity generalisable across different sugars? Can people discriminate sugars when they are equally sweet? Do common sugars have similar effects on people's cognitive processing?

Answers to these questions are closely relevant to illuminating the sensory physiology of sugar metabolism, as well as to practical research on sucrose substitution. In this seminar, I will present findings from a few behavioural experiments focused on inter-individual and intra-individual differences in responses to common sugars, using methods from sensory science and cognitive psychology. Overall, our findings challenge some conventional beliefs about sweetness perception and provide some insights for future research on sugar.
New models for symbolic data

### Scott Sisson

University of New South Wales

Date: Thursday 18 August 2016

Symbolic data analysis is a fairly recent technique for the analysis of large and complex datasets, based on summarising the data into a number of "symbols" prior to analysis. Inference is then based on the analysis of the data at the symbol level (modelling symbols, predicting symbols, etc.). In principle this idea works; however, it would be more advantageous and natural to fit models at the level of the underlying data rather than at the level of the symbol. Here we develop a new class of models for the analysis of symbolic data that fit directly to the data underlying the symbol, allowing for a more intuitive and flexible approach to analysis using this technique.
Estimation of relatedness using low-depth sequencing data

### Ken Dodds

AgResearch

Date: Thursday 11 August 2016

Estimates of relatedness are used for traceability, parentage assignment, estimating genetic merit and elucidating the genetic structure of populations. Relatedness can be estimated from large numbers of markers spread across the genome. A relatively new way of obtaining genotypes is to derive them directly from sequencing data. Often the sequencing protocol is designed to interrogate only a subset of the genome (but one spread across the genome); one such method is known as genotyping-by-sequencing (GBS). A genotype consists of the pair of genetic types (alleles) at a particular position. Each sequencing read reports only one of the pair, so even two or more reads at a position do not guarantee that both alleles are seen. A method of estimating relatedness which accounts for this feature of GBS data is given. The method depends on the number of reads (the depth) at a particular position and also accommodates zero reads (missing data). The theory for the method, simulations and some applications to real data are presented, along with further related research questions.
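One depth calculation underlying such methods (a standard result, not necessarily the talk's exact model): for a true heterozygote, each read shows either allele with probability 1/2, so the chance that a given depth reveals both alleles is easy to write down.

```python
def p_both_alleles_seen(depth):
    """For a heterozygote, each read shows one of the two alleles with
    probability 1/2, so both alleles appear among `depth` reads with
    probability 1 - 2^(1 - depth)."""
    if depth < 1:
        return 0.0
    return 1.0 - 2.0 ** (1 - depth)

for d in (1, 2, 4, 8):
    print(d, p_both_alleles_seen(d))  # 0.0, 0.5, 0.875, 0.984375
```

At low depth a heterozygote often looks homozygous, which is exactly the feature a depth-aware relatedness estimator must account for.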
The replication "crisis" in psychology, medicine, etc.: what should we do about it?

### Jeff Miller

Department of Psychology

Date: Thursday 4 August 2016

Recent large-scale replication studies and meta-analyses suggest that about 50–95% of the positive "findings" reported in top scientific journals are false positives, and that this is true across a range of fields including psychology, medicine, neuroscience, genetics, and physical education. Some causes of this alarmingly high percentage are easily identified, but what is the appropriate cure? In this talk I describe a simple model of the research process that researchers can use to identify the optimal attainable percentage of false positives and to plan their experiments accordingly.
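A standard back-of-the-envelope calculation in this vein (the talk's model may differ): the share of false positives among reported positives follows from the prior probability that a tested hypothesis is true, the significance level, and the power.

```python
def false_positive_share(prior_true, alpha=0.05, power=0.8):
    """Fraction of 'positive' findings that are false, given the prior
    probability a tested hypothesis is true, the significance level,
    and the power of the test."""
    fp = alpha * (1 - prior_true)   # false positives per hypothesis
    tp = power * prior_true         # true positives per hypothesis
    return fp / (fp + tp)

# If only 10% of tested hypotheses are true, over a third of
# significant results are false positives even at alpha = 0.05.
print(round(false_positive_share(0.1), 3))  # → 0.36
```

Low prior plausibility and low power push this share toward the high end of the range quoted in the abstract.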