Predictive modelling is the classic tool of artificial intelligence and machine learning. Predictive modelling is most often classification (automatically putting data into known categories) or regression (automatically estimating some target quantity for each data point). There are several challenges in predictive modelling relating to how to best exploit information to predict a target, how to measure how well a system predicts and whether it is successful, and how to measure the confidence of the system in its predictions.
Below we showcase several projects in which SIH has used predictive modelling.
Wheat yield prediction with uncertainty estimates
Predicting crop yield using a range of proximal and remote sensor measurements is an area of active research. Such predictions are important for optimising crop management (e.g. nitrogen application), and robust associated uncertainty estimates help to improve this process and to understand its limitations. We wrote code implementing a Bayesian regression model with spatially correlated residuals for application to wheat crop yield forecasting using a range of sensor data. We used this to generate predictive maps of wheat yield with robust uncertainty bounds.
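As a rough illustration of how a Bayesian regression yields uncertainty bounds alongside each prediction, the sketch below fits a one-feature Bayesian linear regression in closed form. It is a simplification and not the project's model: it assumes independent residuals with a known noise level (no spatial correlation), and the function names, the sensor index and the yield values are all hypothetical.

```python
import math

def fit_bayes_linear(xs, ys, noise_sd, prior_sd=10.0):
    """Posterior mean and covariance of (intercept, slope).

    Assumes y = w0 + w1 * x + noise, with known Gaussian noise (noise_sd)
    and an independent zero-mean Gaussian prior (prior_sd) on w0 and w1.
    """
    n = len(xs)
    beta = 1.0 / noise_sd ** 2   # noise precision
    alpha = 1.0 / prior_sd ** 2  # prior precision
    # Posterior precision matrix A = alpha * I + beta * X^T X, rows of X are (1, x).
    a00 = alpha + beta * n
    a01 = beta * sum(xs)
    a11 = alpha + beta * sum(x * x for x in xs)
    det = a00 * a11 - a01 * a01
    # Posterior covariance is the inverse of the 2x2 precision matrix.
    cov = [[a11 / det, -a01 / det], [-a01 / det, a00 / det]]
    # Posterior mean = cov @ (beta * X^T y).
    b0 = beta * sum(ys)
    b1 = beta * sum(x * y for x, y in zip(xs, ys))
    mean = [cov[0][0] * b0 + cov[0][1] * b1,
            cov[1][0] * b0 + cov[1][1] * b1]
    return mean, cov

def predict(x_new, mean, cov, noise_sd):
    """Predictive mean and standard deviation at a new input."""
    mu = mean[0] + mean[1] * x_new
    # Predictive variance = noise variance + parameter uncertainty at x_new.
    var = (noise_sd ** 2 + cov[0][0]
           + 2 * cov[0][1] * x_new + cov[1][1] * x_new ** 2)
    return mu, math.sqrt(var)

# Hypothetical data: a sensor-derived index (x) and measured yield in t/ha (y).
sensor_index = [0.2, 0.4, 0.5, 0.7, 0.9]
yield_t_ha = [2.1, 3.0, 3.4, 4.4, 5.2]

mean, cov = fit_bayes_linear(sensor_index, yield_t_ha, noise_sd=0.3)
for x_new in (0.5, 3.0):
    mu, sd = predict(x_new, mean, cov, noise_sd=0.3)
    print(f"x={x_new}: predicted yield {mu:.2f} +/- {1.96 * sd:.2f}")
```

The predictive interval widens for inputs far from the observed data, which is the behaviour that makes uncertainty estimates useful for judging where a yield map can be trusted.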
GemPy MCMC Sampling
GemPy is a Python-based, open-source library for implicitly generating 3D structural geological models. It was designed from the ground up to support easy embedding in probabilistic frameworks for the uncertainty analysis of subsurface structures.
SIH investigated the code’s viability for the implementation of various additional sensor models, as well as the implementation of a data science-friendly Bayesian inference wrapper for easy experimentation with priors, likelihoods and sampling schemes.
Grant application support for cardiovascular AI impact
We assisted the Westmead Applied Research Centre (WARC) to apply for the Google AI Impact Challenge, and are involved in the ongoing design and engineering to make use of the $1M prize. WARC called on our expertise in data science and natural language processing to revise their application and to support their case at interview for the award. Our involvement ensured that the proposal amounted to a realistic, operationalised application of artificial intelligence technologies.
Bayesian Updating for Childhood Obesity Grant Proposal
SIH supported a grant proposal by the Centre for Translational Data Science by demonstrating the value of Bayesian modelling when collecting and analysing longitudinal data on childhood obesity. We built cross-sectional Bayesian variable selection models to select important factors and models for predicting children’s BMI, mental health and sleep quality across multiple ages, for each child in the Longitudinal Study of Australian Children (LSAC). A vector-autoregressive model was then applied to visualise the unexplained variation in the preceding models. We constructed visualisations to demonstrate the importance of understanding uncertainty over the course of data collection, and the potential for using Bayesian adaptive trials during collection.
Labelling Clause Type at Scale for LCT
LCT studies how knowledge is built through teaching and, in order to determine the trajectory of knowledge building, proposes to categorise each clause in a teaching transcript. SIH made this process of labelling clauses much faster and more scalable. Firstly, SIH developed software that uses natural language processing to convert a lesson transcript into a spreadsheet where each row contains a clause to be categorised. Secondly, SIH developed a machine learning classifier that learns from these spreadsheets and predicts the labels of future clauses. Finally, SIH developed techniques to visualise the trajectory of knowledge building through a lesson where clauses have been categorised.
Identifying ram mating behaviour
Monitoring livestock has historically been labour intensive. The advent of on-animal sensors means this monitoring can be conducted remotely, continuously, and accurately. The ability to identify the precise time when sheep are mating using ram-mounted accelerometer data would unlock unprecedented information on the reproductive performance of these animals. We fit a classifier model to data from collar accelerometers labelled by videoing rams in the presence of ewes in oestrus. We then wrote code to detect change points in new acceleration data and to predict the occurrence of mating events.
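To make the idea of a change point concrete, here is a minimal sketch of one generic approach: choosing the single split that best divides a signal into two segments with different means (a least-squares criterion). The project's actual detection method is not described here, so this is an assumption for illustration only, and the function name and acceleration values are hypothetical.

```python
def best_change_point(signal, min_size=2):
    """Return the index k that best splits signal into signal[:k] and signal[k:].

    "Best" means the smallest total squared deviation of each segment from
    its own mean. Returns None if the signal is too short to split.
    """
    def sse(segment):
        m = sum(segment) / len(segment)
        return sum((v - m) ** 2 for v in segment)

    best_k, best_cost = None, float("inf")
    for k in range(min_size, len(signal) - min_size + 1):
        cost = sse(signal[:k]) + sse(signal[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Hypothetical acceleration magnitudes: quiet behaviour, then a burst of activity.
accel = [0.1, 0.2, 0.1, 0.15, 0.2, 1.1, 1.0, 1.2, 0.9, 1.1]
change_at = best_change_point(accel)
print(change_at)  # 5: the activity level shifts at the sixth sample
```

In a pipeline of the kind described above, segments delimited by change points could then be summarised (for example by mean and variance) and passed to a classifier trained on the video-labelled data.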
1000-fold speedup in Dynamic Bayesian network model
A Bayesian network is a series of linear models fit to describe the relationships between different variables in a time series. If there are change points in how these variables are related, then the network is dynamic.
SIH helped the researcher by speeding up the R package used to fit the dynamic Bayesian network model by a factor of 1000. The R package is now available at https://github.com/FrankD/EDISON/tree/MultipleTimeSeries
Where can deep-sea iron nodules be found?
Potato-sized nodules of iron ore found on the ocean floor are of commercial mining interest. However, the negative ecological effects of mining these nodules are of concern. SIH constructed a global predictive model of nodule occurrence by combining data from thousands of ocean floor samples with global maps of oceanic variables. The environments in which these deposits do and do not occur could then be characterised to generate insight into the potential consequences of proposed mining.
Predicting unnecessary CT scans
- Professor Jonathan Morris, Kolling Institute of Medical Research and Sydney Medical School; Dr Felicity Gallimore
- The University of Sydney Medical School
- Data Science (Dr Aldo Saavedra, Dr Madhura Killedar, Dr Joel Nothman and Mr Peter Thiem)
- Predictive modelling, Inferential modelling, Description and basic visualization, Language as data
Diagnostic imaging in hospitals is costly due to expensive machines and their operators, as well as the cost of moving patients in and out of radiography. Published studies of emergency presentations have shown that the number of brain computed tomography (CT-Brain) scans performed is increasing over time, while the proportion of scans giving no cause for concern remains the same and represents the largest category.
We sought to determine whether a substantial portion of CT scans performed in North Sydney LHD were unnecessary. We translated this research question into something determinable from data: identify CT-Brain cases where the unconcerning outcome of a scan could be predicted from clinical knowledge available prior to the scan. By first constructing a text classifier to label CT scan reports as unconcerning, we were able to use clustering and predictive modelling to weakly identify some patient features that predicted unconcerning CT results.
While the project had the potential to impact clinical policy surrounding the application of CT scans in Emergency Departments, the weak results suggest that if any excessive expenditure problem exists, it is not simple to resolve. At the same time, we have developed methodologies for performing similar studies towards rationalising diagnostic scan expenditure.
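The first step described above, labelling free-text scan reports as unconcerning, can be sketched with a small bag-of-words classifier. The example below uses multinomial naive Bayes with add-one smoothing, which is an assumption chosen for brevity; the classifier actually used in the project is not stated. The report snippets, labels and function names are invented for illustration and are not project data.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count words per class for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(docs, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict_nb(text, model):
    """Return the class with the highest log posterior score."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented report snippets, for illustration only.
reports = [
    "no acute intracranial abnormality",
    "normal study no haemorrhage",
    "acute subdural haemorrhage with midline shift",
    "large infarct with mass effect",
]
labels = ["unconcerning", "unconcerning", "concerning", "concerning"]

model = train_nb(reports, labels)
print(predict_nb("no acute abnormality", model))
print(predict_nb("subdural haemorrhage with shift", model))
```

Labels produced this way can then serve as the target for a second model that uses only information available before the scan, which is the question the study set out to answer.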
Predicting Crime using a Spatial-Demographic Framework
Responding to domestic violence-related assaults dominates much of the NSW Police’s resources. We try to understand the relationships that drive social-demographic change and cause the occurrence of crime using a complex modelling framework. The social-demographic-crime network and its inter-dependencies were modelled using a Bayesian vector autoregression model. We built a collaboration with BOCSAR to obtain the crime database of all offences in NSW over the last 20 years, and sourced demographic data for multiple census years. The results of this study will help inform policy decision-making by government and police.
Optimal Image Reconstruction for the SAMI Galaxy Survey
The SAMI Galaxy Survey is a large-scale observational program to target several thousand galaxies with the University of Sydney-built Sydney-AAO Multi-object Integral field spectrograph (SAMI). A key data challenge is to optimally reconstruct a data cube from ~500 spectra taken at different spatial locations across a galaxy. The previous method resulted in undesirable artefacts due to under-sampling and to the astronomical sources changing spatial location within the data because of differential atmospheric refraction. We have developed a novel method using probabilistic image fusion that delivers an optimal combination of the spectral fibre bundle data into a cube with uniform image quality while maintaining spectral details. This innovative technology has further demonstrated capabilities to achieve super-resolution and is implemented as a flexible software framework that can eventually be used by a wide range of telescopes worldwide.
Automating information curation in the OMIA knowledge base
Online Mendelian Inheritance in Animals (OMIA) is an online knowledge base of inherited disorders in animals. It offers a wide range of search and curation functionalities on the animal genetics database created and maintained by Prof. Frank Nicholas. Frank maintained an annotated bibliography in OMIA by manually searching for the latest articles (~150 per day), but this approach was not sustainable. SIH automated this process to emulate Frank’s existing work. A text-mining pipeline now automatically downloads and shortlists recent publications predicted to have high relevance for OMIA. We developed an interface in which Frank can annotate these publications or exclude them from the knowledge base. This project enables OMIA to continue contributing to the genetic science community as a user-friendly online platform.
Identifying Nerve Function Profiles in Motor Neurodegenerative Disorders
Nerve excitability measurements can identify patterns of nerve dysfunction associated with many diseases of the nervous system. The researchers manage a database containing around 20 years of peripheral nerve excitability studies. A software package, QTRAC, is used to generate ~35 properties that are analysed in a research context. Additional information, such as clinical survey data and the temperature of the nerve at the time of the test, is incorporated to help make a diagnosis. Importantly, diagnosis of the disorder is not always 100% accurate. SIH used machine learning to predict the likelihood of motor neuron disease (MND) for a patient given nerve excitability measurements. The model had reasonable ability to rank individual cases in order of increasing MND risk. SIH delivered this model in a software package for future use in research as well as in a clinical setting, with the intention of improving the speed and accuracy of MND diagnosis to improve treatment outcomes for patients.
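One common way to quantify how well a model ranks cases by risk is the area under the ROC curve (AUC): the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition. Whether AUC was the measure used in this project is an assumption, and the scores and labels are invented.

```python
def auc(scores, labels):
    """Probability that a positive case outranks a negative case (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented risk scores; label 1 marks a confirmed case, 0 a non-case.
risk_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
confirmed = [1, 1, 1, 0, 0, 0]

print(round(auc(risk_scores, confirmed), 3))  # 0.889 (8 of 9 pairs ranked correctly)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering, so a score between the two matches the idea of a "reasonable" ability to rank.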
Predictive Project Profile
Judging the likelihood of project success is a difficult and important aspect of project management. Projects can fail in many ways, such as overruns in cost or duration, failure to deliver benefit, and failure to satisfy the project goals.
This project used survey data collected over the lifecycle of 1000 projects to build a machine learning model and prototype web application that would indicate the likelihood of success of the project.
This tool would be used for further demonstration and discussion of the idea with expert groups of project managers, so that they can develop better, data-driven approaches to successful project management.