This is a small sample of over 500 completed projects which Sydney Informatics Hub has worked on.
If you are a University of Sydney researcher and would like us to work on your project, please fill out the assistance form, attend our training courses, or contact firstname.lastname@example.org for more information.
Wheat yield prediction with uncertainty estimates
Predicting crop yield using a range of proximal and remote sensor measurements is an area of active research. Such predictions are important for optimisation of crop management (e.g. nitrogen application), and robust associated uncertainty estimates help to improve this process and understand its limitations. We wrote code implementing a Bayesian regression model with spatially correlated residuals for application to wheat crop yield forecasting using a range of sensor data. We used this to generate predictive maps of wheat yield with robust uncertainty bounds.
GemPy MCMC Sampling
GemPy is a Python-based, open-source library for implicitly generating 3D structural geological models. It was designed from the ground up to support easy embedding in probabilistic frameworks for the uncertainty analysis of subsurface structures.
SIH investigated the code’s viability for the implementation of various additional sensor models, as well as the implementation of a data science-friendly Bayesian inference wrapper for easy experimentation with priors, likelihoods and sampling schemes.
SPEED-EXTRACT: Data Science for Electronic Medical Records
- Dr Charmaine Tam, Dr Aldo Saavedra
- Faculty of Engineering and Information Technologies
- Data Science (Madhura Killedar, Di Lu, Gordon McDonald, Vijay Raghunath)
The Centre for Translational Data Science’s SPEED-EXTRACT project led by Dr Charmaine Tam and Dr Aldo Saavedra aims to harness insights from electronic medical record data associated with patients that present to hospital with symptoms associated with acute coronary syndrome. This project is supported by Sydney Health Partners and NSW Health and engages a multidisciplinary team of cardiology and data experts across local health districts. Sydney Informatics Hub has contributed to components of this project including developing the large-scale data pipeline, exploratory data analysis, consulting on study design, and a data audit tool. This project has just been successful in securing additional funding from NSW Health.
Grant application support for cardiovascular AI impact
We assisted the Westmead Applied Research Centre to apply for the Google Impact AI Challenge, and are involved in the ongoing design and engineering to make use of the $1M prize. WARC called on our expertise in data science and natural language processing to revise their application and to support their case in interview for the award. Our involvement ensured that the proposal amounted to a realistic, operationalised application of artificial intelligence technologies.
ProteHome: Proteomics Experimental Results Database
The Metabolic Cybernetics Lab at the Charles Perkins Centre has generated disparate Mass Spectrometry datasets for protein metabolism studies. With datasets stored in various formats and places, it has been difficult to search and compare experimental results.
SIH developed ProteHome, a bioinformatics system providing a centralised data repository, with a web-based interface to facilitate the:
- Standardisation of quantitative analysis results, with common specification of experimental metadata and formatting of analysis data.
- Retrieval of those results for protein(s) or modification(s) of interest, regardless of the version of protein identification number used in the stored experiment.
- Management of submitted datasets using a comprehensive hierarchical storage structure.
Bayesian Updating for Childhood Obesity Grant Proposal
SIH supported a grant proposal by the Centre for Translational Data Science by demonstrating the value of using Bayesian modelling when collecting and analysing longitudinal data on childhood obesity. We built cross-sectional Bayesian variable selection models to select important factors and models for predicting children’s BMI, mental health and sleep quality across multiple ages, for each child in the Longitudinal Study of Australian Children (LSAC). A vector-autoregressive model was then applied to visualise the unexplained variation in the preceding models. We constructed visualisations to demonstrate the importance of understanding uncertainty over the course of data collection, and the potential for using Bayesian adaptive trials during collection.
Milk yield optimisation using Bayesian regression
Dairy cows can have their pasture feeding supplemented with concentrate, which can help to manage deficits in pasture nutrition and ultimately increase milk production. Traditionally, all cows are allocated the same amount of concentrate. However, past studies have shown some success with tailoring the allocation of concentrate to the individual cow. This project aimed to optimise individual allocation of concentrate supplement to improve the amount of milk produced and the overall production of a pasture-based dairy system. It deployed a multifactorial model developed previously by the Centre for Translational Data Science, Sydney Informatics Hub and Camden within the University of Sydney.
Stoichiometry Table Widget for LabArchives eNotebooks
A stoichiometry table is useful for calculating and recording quantities of reagents. SIH coded a flexible and reliable widget for entering this data in a lab notebook, supporting the University’s use of LabArchives notebooks in chemistry. Use it: labarchives-stoichiometry-widget!
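To illustrate the kind of calculation a stoichiometry table records, here is a minimal Python sketch. It is not the widget's code: the `Reagent` class, the `stoichiometry_table` function and the example reaction quantities are all hypothetical, chosen only to show how amounts follow from the limiting reagent and the molar equivalents.

```python
from dataclasses import dataclass


@dataclass
class Reagent:
    name: str
    molar_mass: float   # g/mol
    equivalents: float  # molar ratio relative to the limiting reagent


def stoichiometry_table(limiting_mmol, reagents):
    """Return one row per reagent with the amount (mmol) and mass (mg) needed."""
    rows = []
    for r in reagents:
        mmol = limiting_mmol * r.equivalents
        mass_mg = mmol * r.molar_mass  # mmol x g/mol = mg
        rows.append({"name": r.name,
                     "mmol": round(mmol, 3),
                     "mass_mg": round(mass_mg, 1)})
    return rows


# Hypothetical example: 2 mmol of benzoic acid with 5 equivalents of methanol.
reagents = [
    Reagent("benzoic acid", 122.12, 1.0),
    Reagent("methanol", 32.04, 5.0),
]
table = stoichiometry_table(2.0, reagents)
for row in table:
    print(row)
```

Keeping the equivalents on each reagent means that changing the scale of the reaction only requires changing the single `limiting_mmol` value.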
Labelling Clause Type at Scale for LCT
LCT studies how knowledge is built through teaching, and in order to determine the trajectory of knowledge building, proposes to categorise each clause in a teaching transcript. SIH made this process of labelling clauses much faster and more scalable. They did so firstly by developing software with natural language processing technology that converts a lesson transcript into a spreadsheet where each row contains a clause to be categorised. Secondly, they developed a machine learning classifier to learn from these spreadsheets and predict the labels of future clauses. Finally, SIH developed techniques to visualise the trajectory of knowledge building through a lesson where clauses have been categorised.
cBioPortal Melanoma Data
cBioPortal is the Westmead Institute for Medical Research’s cancer genomics database and web platform. It provides visualization, analysis and download of large-scale cancer genomics data sets, so that users can find altered genes and networks within the genomic studies on the database. SIH extended the app’s capabilities to be able to incorporate melanoma images, genetic data and studies.
Identifying ram mating behaviour
Monitoring livestock has historically been labour intensive. The advent of on-animal sensors means this monitoring can be conducted remotely, continuously, and accurately. The ability to identify the precise time when sheep are mating using ram-mounted accelerometer data would unlock unprecedented information on the reproductive performance of these animals. We fit a classifier model to data from collar accelerometers labelled by videoing rams in the presence of ewes in oestrus. We then wrote code to detect change points in new acceleration data and to predict the occurrence of mating events.
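As a rough illustration of what detecting change points in an acceleration trace can look like, the Python sketch below compares the mean of a sliding window before and after each sample. This is a generic technique and not necessarily the one used in the project; the function name, window size, threshold and signal values are all assumptions made for the example.

```python
from statistics import mean


def detect_change_points(signal, window=5, threshold=1.0):
    """Return indices where the mean of the next `window` samples differs from
    the mean of the previous `window` samples by more than `threshold`.
    Runs of adjacent detections are merged, keeping the largest jump."""
    candidates = []
    for i in range(window, len(signal) - window + 1):
        jump = abs(mean(signal[i:i + window]) - mean(signal[i - window:i]))
        if jump > threshold:
            candidates.append((i, jump))

    change_points = []
    group = []
    for idx, jump in candidates:
        # A gap between detections closes the current run.
        if group and idx - group[-1][0] > 1:
            change_points.append(max(group, key=lambda c: c[1])[0])
            group = []
        group.append((idx, jump))
    if group:
        change_points.append(max(group, key=lambda c: c[1])[0])
    return change_points


# Hypothetical acceleration magnitudes with a shift in level at index 8.
signal = [0.1, 0.0, -0.1, 0.1, 0.0, 0.0, 0.1, -0.1,
          2.0, 2.1, 1.9, 2.0, 2.2, 2.0, 1.9, 2.1]
print(detect_change_points(signal))
```

In a pipeline like the one described, the segments between change points would then be passed to the trained classifier; that step is omitted here.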
Meta-analysis into Adolescent Oral Health Interventions
Oral health promotion for younger-aged children is more widely conducted and better understood than that directed at adolescents. The aim of this systematic review was to evaluate the effectiveness of oral health interventions in improving the knowledge, attitudes, behaviour and oral health status of healthy adolescents. We looked at gingival health, plaque levels, and dental caries within randomized controlled trials, as reported in the literature. The interventions reported ranged from single-session interventions to community-wide programs, with many also including clinical preventive procedures and take-home products. Half of the programs used a health behaviour change theory to inform their intervention. The meta-analysis showed an improvement in all three of gingival score, plaque score and the number of decayed, missing and filled tooth surfaces after an oral health promotion intervention, relative to a control group that did not receive the intervention.
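For readers unfamiliar with how a meta-analysis combines trial results, the following Python sketch shows inverse-variance pooling under a fixed-effect model. The source does not state which pooling model the review used, and the effect sizes and standard errors below are invented purely for illustration.

```python
import math


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance weighted pooled effect, its standard error,
    and an approximate 95% confidence interval."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    pooled_se = math.sqrt(1.0 / total)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


# Hypothetical standardised mean differences in plaque score
# (intervention minus control) from three trials.
effects = [-0.30, -0.10, -0.20]
std_errors = [0.10, 0.20, 0.10]

pooled, pooled_se, ci = pool_fixed_effect(effects, std_errors)
print(f"pooled effect {pooled:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```

More precise trials (smaller standard errors) receive more weight. A random-effects model would additionally account for variation between studies.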
eSCAPE parallel landscape evolution benchmarking
eSCAPE is a parallel landscape evolution model, built to simulate topography dynamics at various space and time scales. SIH benchmarked eSCAPE’s performance across multiple CPUs and nodes on the University of Sydney’s Artemis HPC, visualizing the program’s runtimes as well as the runtimes of specific functions within the program. SIH created reusable scripts to allow the researcher to easily assess eSCAPE’s performance in the future as code development continues.
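One simple way to record the runtimes of specific functions is a timing decorator, sketched below in Python. This is a single-process illustration only and does not reproduce the project's benchmarking scripts; the `timed` decorator and the `erode` and `deposit` functions are hypothetical stand-ins, not part of eSCAPE.

```python
import time
from collections import defaultdict
from functools import wraps

# Function name -> list of wall-clock runtimes in seconds.
timings = defaultdict(list)


def timed(func):
    """Record the wall-clock runtime of every call to `func`."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper


@timed
def erode(grid):  # hypothetical model step
    return [h * 0.99 for h in grid]


@timed
def deposit(grid):  # hypothetical model step
    return [h + 0.001 for h in grid]


grid = [float(i) for i in range(10_000)]
for _ in range(5):
    grid = deposit(erode(grid))

for name, runs in timings.items():
    print(f"{name}: {len(runs)} calls, mean {sum(runs) / len(runs):.6f} s")
```

Repeating such a run with different CPU counts and plotting the collected timings gives a basic scaling picture.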
Video Tracking Predator-Prey Interactions in Fish
By video-tracking the interaction between prey mosquitofish, Gambusia holbrooki, and their predator, jade perch, Scortum barcoo, under controlled conditions, we provide some of the first fine-scale characterisation of how prey adapt their behaviour according to their continuous assessment of risk based on both predator behaviour and angular distance to the predator’s mouth. When these predators were inactive and posed less of an immediate threat, prey were often found within the attack cone of the predator showing reductions in speed and acceleration, characteristic of predator-inspection behaviour. However, when predators became active, prey swam faster with greater acceleration and were closer together within the attack cone of predators. Most importantly, this study provides evidence that prey do not adopt a uniform response to the presence of a predator. Instead, we demonstrate that prey are capable of rapidly and dynamically updating their assessment of risk and showing fine-scale adjustments to their behaviour.
Paper: “Fine-scale behavioural adjustments of prey on a continuum of risk”. M.I.A. Kent, J.E. Herbert-Read, G.D. McDonald, A.J. Wood, A.J.W. Ward. Proceedings of the Royal Society B. 2019
Discharge Against Medical Advice in Culturally and Linguistically Diverse Patients
In this study we examined discharge against medical advice (DAMA) and its relation to the cultural and linguistic diversity (CALD) of 600,000 patients over 9 years in the Sydney Children’s Hospital Network. Using a Bayesian logistic regression framework, we found CALD status to be significantly positively correlated with DAMA rates. Identification of this link provides opportunities for intervention at a practice and policy level in order to prevent adverse outcomes for CALD patients.
1000-fold speedup in Dynamic Bayesian network model
A Bayesian network is a series of linear models fit to describe the relationships between different variables in a time series. If there are change points in how these variables are related, then the network is dynamic.
SIH helped the researcher by speeding up the R package used to fit the dynamic Bayesian network model by a factor of 1000. The R package is now available at https://github.com/FrankD/EDISON/tree/MultipleTimeSeries
Repackaging software for modelling topic structure in language
Altmann with Martin Gerlach wanted other researchers to try out their new language modelling technique, so they made it open-source. SIH made their work more accessible by following software best practices: restructuring the code so that it conformed to the established scikit-learn estimator API; adding automated software testing and continuous integration; extending and publishing documentation; and releasing version 0.1 of the software to the Python Package Index. See http://topsbm.readthedocs.io
Where can deep-sea iron nodules be found?
Potato-sized nodules of iron ore found on the ocean floor are of commercial mining interest. However, negative ecological effects from mining these nodules are of concern. SIH constructed a global predictive model of nodule occurrence by combining data from thousands of ocean floor samples with global maps of oceanic variables. The environments in which these deposits do and do not occur could then be characterised to generate insight into potential consequences of proposed mining.
Predicting unnecessary CT scans
- Professor Jonathan Morris, Kolling Institute of Medical Research and Sydney Medical School; Dr Felicity Gallimore
- The University of Sydney Medical School
- Data Science (Dr Aldo Saavedra, Dr Madhura Killedar, Dr Joel Nothman and Mr Peter Thiem)
- Predictive modelling, Inferential modelling, Description and basic visualization, Language as data
Diagnostic imaging in hospitals is costly due to expensive machines and their operators, as well as the cost of moving patients in and out of radiography. Published studies of emergency presentations have shown that the number of brain computed tomography (CT-Brain) scans performed is increasing with time while the proportion of scans giving no cause for concern remains the same and represents the largest category.
We sought to determine whether a substantial portion of CT Scans performed in North Sydney LHD were unnecessary. We translated this research question into something determinable from data: identify CT-Brain cases where the unconcerning outcome of scans could be predicted from clinical knowledge available prior to the scan. By first constructing a text classifier to label CT Scan reports as unconcerning, we were able to use clustering and predictive modelling to weakly identify some patient features that predicted unconcerning CT results.
While the project had the potential to impact clinical policy surrounding the application of CT Scans in Emergency Departments, the weak results suggest that if any excessive expenditure problem exists, it is not simple to resolve. At the same time, we have developed methodologies for performing similar studies towards rationalising diagnostic scan expenditure.
Predicting Crime using a Spatial-Demographic Framework
Responding to domestic violence-related assaults dominates much of the NSW Police’s resources. We try to understand the relationships that drive social-demographic change and cause the occurrence of crime using a complex modelling framework. The social-demographic-crime network and its inter-dependencies were modelled using a Bayesian vector autoregression model. We built a collaboration with BOCSAR, custodian of the crime database of all offences in NSW over the last 20 years, and sourced demographic data for multiple census years. The results of this study will help inform policy decision-making by government and police.
Optimal Image Reconstruction for the SAMI Galaxy Survey
The SAMI Galaxy Survey is a large-scale observational program to target several thousand galaxies with the University of Sydney-built Sydney-AAO Multi-object Integral field spectrograph (SAMI). A key data challenge is to optimally reconstruct a data cube from ~500 spectra taken at different spatial locations across a galaxy. The previous method resulted in undesirable artefacts due to under-sampling and the astronomical sources changing spatial location within the data due to differential atmospheric refraction. We have developed a novel method using probabilistic image fusion that delivers optimal combination of the spectral fibre bundle data into a cube with uniform image quality while maintaining spectral details. This innovative technology has further demonstrated capabilities to achieve super-resolution and is implemented as a flexible software framework that can eventually be used by a wide range of worldwide telescopes.
Understanding Transgenerational Welfare Dependence
The Transgenerational Dataset 2 Extended (TDS2-e) is an important investment by the Commonwealth in understanding the factors contributing to life outcomes, including the reliance of people on income support. The data contains welfare payments to recipients born between 1987 and 1988, their families, parents, children and siblings. The raw data was difficult to work with because it was subject to extensive security requirements, was large in volume, and had an inconvenient shape. SIH engineered software to convert the data into forms that made it accessible to the end user while complying with security and licence requirements. This rich dataset is now available for researchers to explore, and will contribute to the understanding and improvement of the Commonwealth income support systems and life outcomes for all Australians.
Automating information curation in the OMIA knowledge base
Online Mendelian Inheritance in Animals (OMIA) is an online knowledge base of inherited disorders in animals. It offers a wide range of search and curation functionalities on the animal genetics database created and maintained by Prof. Frank Nicholas. Frank maintained an annotated bibliography in OMIA by manually searching for the latest articles (~150 per day), but this approach was not sustainable. SIH automated this process to emulate Frank’s existing work. A text-mining pipeline now automatically downloads and shortlists recent publications predicted to have high relevance for OMIA. We developed an interface in which Frank can annotate or exclude these publications from the knowledge base. This project enables OMIA to continue contributing to the genetic science community as a user-friendly online platform.
Clustering Light Sources: Scaling Up to the Whole Sky
The Murchison Widefield Array is a state-of-the-art telescope in Western Australia. Over the last four years, researchers have collected an exceptionally large time-series dataset on 300,000 bright objects in the sky, such as supernovae. Analysing the brightness of light sources over time requires matching each across pictures from different times and locations in the sky. The astrophysicists had built processing software, a database and a web app to analyse similar datasets, but had never tried to scale it to this size of dataset. An SIH engineer was able to debug and optimise the software involved, so that the data loading process took 8 hours instead of around 15 hours, and web app load times were reduced from multiple minutes to a few seconds. This enabled further research and analysis of this unique and enormous dataset.
Which treatment might patients with relapsed ovarian cancer respond to?
Molecular markers measured within the primary tumour are used to determine if patients who have relapsed ovarian cancer will respond to a particular treatment. SIH helped to identify subsets of genes that are overexpressed / underexpressed in response to treatments, by developing statistical methods including dimensionality reduction and hypothesis testing.
Breast Cancer Dashboard
Medical data is under-used for its potential to inform clinical practice. SIH developed a visual dashboard to display information about lymphoedema in breast cancer patients. A prototype web application with an easy-to-use interactive dashboard was developed to help understand a patient’s journey and assess the results of different cohorts of patients. User and expert workshops helped optimise the design.
Identifying Nerve Function Profiles in Motor Neurodegenerative Disorders
Nerve excitability measurements can identify patterns of nerve dysfunction associated with many diseases of the nervous system. The researchers manage a database containing around 20 years of peripheral nerve excitability studies. A software package, QTRAC, is used to generate ~35 properties that are analysed in a research context. Additional information is incorporated to help make a diagnosis, such as clinical survey data, and the temperature of the nerve at the time of the test. Importantly, diagnosis of the disorder is not always 100% accurate. SIH used machine learning to predict the likelihood of motor neuron disease (MND) for a patient given nerve excitability measurements. The model had reasonable ability to rank individual cases in order of increasing MND risk. SIH delivered this model in a software package for future use in research as well as in a clinical setting, with the intention of improving the speed and accuracy of MND diagnosis to improve treatment outcomes for patients.
Scopus Data Preparation
The University’s research output is evaluated, in part, on the basis of publication and citation networks derived from publication metadata archives like Elsevier’s Scopus. While the University has subscribed to Scopus snapshot data for a few years, it lacked an efficient way to load and query the data.
We analysed the snapshot, stored as a collection of XML files, and developed a relational database schema to represent useful portions of the data for efficient access. We developed a script to efficiently load the data from the XML snapshot into a relational database.
This resource now allows the Research Portfolio to calculate metrics over the publication record. The database is also more accessible to researchers, and the loading script is now accessible to external Scopus Snapshot users.
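The general pattern of loading XML records into a relational schema can be sketched in Python with the standard library alone. The element names, table names and sample records below are hypothetical and much simpler than the real Scopus snapshot format; the sketch only shows the parse, insert and query steps.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified records; not the Scopus snapshot schema.
SAMPLE_XML = """
<records>
  <record id="1">
    <title>First paper</title>
    <year>2018</year>
    <author>A. Example</author>
    <author>B. Example</author>
  </record>
  <record id="2">
    <title>Second paper</title>
    <year>2019</year>
    <author>A. Example</author>
  </record>
</records>
"""


def load_records(xml_text, conn):
    """Parse publication records and insert them into two related tables."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS publication (
            id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
        CREATE TABLE IF NOT EXISTS authorship (
            publication_id INTEGER REFERENCES publication(id), author TEXT);
    """)
    root = ET.fromstring(xml_text.strip())
    for rec in root.iter("record"):
        pub_id = int(rec.get("id"))
        conn.execute(
            "INSERT INTO publication VALUES (?, ?, ?)",
            (pub_id, rec.findtext("title"), int(rec.findtext("year"))),
        )
        conn.executemany(
            "INSERT INTO authorship VALUES (?, ?)",
            [(pub_id, a.text) for a in rec.findall("author")],
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_records(SAMPLE_XML, conn)

# Example metric: number of publications per author.
rows = conn.execute(
    "SELECT author, COUNT(*) FROM authorship GROUP BY author ORDER BY author"
).fetchall()
print(rows)
```

Splitting authors into their own table is what makes per-author metrics a single SQL query rather than a scan over nested XML.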
Applying Machine Learning to Criminology
The incidence of crime and the impacts of societal and individual characteristics on criminal behaviour can be explored using modern machine learning methods, answering important questions about crime, such as:
- What is the probability of a crime occurring at a location?
- What are the characteristics of the population that affect the incidence of crime?
Our work implements novel Bayesian machine learning techniques to model the dependency between offence data, demographic characteristics and spatial location. This provides a fully probabilistic approach to modelling crime which reflects all uncertainties in the prediction of offences as well as the uncertainties surrounding model parameters. By using Bayesian updating, these predictions and inferences are dynamic in the sense that they change as new information becomes available. Our model has been applied to offence data, such as domestic violence-related assaults, burglary and motor vehicle theft, in New South Wales (NSW), Australia. The results highlight the strength of the technique by validating the factors that are associated with high and low criminal activity.
Research Environment for Ancient Documents (READ) efficiency
Research Environment for Ancient Documents (READ) is an integrated Open Source web platform for epigraphical and manuscript research. It allows digital images of texts to be annotated, and for multiple annotations to be maintained for critical analysis.
The READ research and development team consulted with SIH because they had trouble getting their software to perform well with long texts. SIH Research Engineers helped them to identify parts of the system that were slower than was reasonable. Their software engineers were able to then resolve these bottlenecks, enabling the READ system to be more widely adopted in archaeology, history and manuscript studies.
Predictive Project Profile
Judging the likelihood of project success is a difficult and important aspect of project management. Projects can fail in many ways, such as overruns in cost or duration, failure to deliver benefit and failure to satisfy the project goals.
This project used survey data collected over the lifecycle of 1000 projects to build a machine learning model and prototype web application that would indicate the likelihood of success of the project.
This tool would be used for further demonstration and discussion of the idea with expert groups of project managers, so that they can develop better, data-driven approaches that lead to successful project management.
Discharge against medical advice in the Sydney Children's Hospital Network
Patients who discharge against medical advice (DAMA) from hospital carry a significant risk of readmission and have increased rates of morbidity and mortality. Using five years of admissions and diagnosis data, we sought to identify the demographic, clinical and administrative characteristics of DAMA patients in the Sydney Children’s Hospital Network. Using a Bayesian logistic regression framework, we found that statistically significant predictors of DAMA in a given admission were hospital site, a mental health/behavioural diagnosis, Aboriginality, emergency rather than elective admission, a gastrointestinal diagnosis and a history of previous DAMA. Identification of these predictors of DAMA provides opportunities for intervention at a practice and policy level in order to prevent adverse outcomes for patients.
Transforming IMPALA: International Migration Law and Policy Assessment Database
The IMPALA database (http://www.impaladatabase.org/) contains migration and citizenship law and policy across countries and through time in a form that allows legislation, policy and some statistical data to be easily compared and measured. With one record for each visa type in each year, thousands of records of content have been entered, mostly manually, into a Qualtrics survey. SIH transformed the unwieldy manually-entered database to improve data exploration. We wrote software to ingest the Qualtrics responses (https://github.com/Sydney-Informatics-Hub/qualtrics-pandas), and to generate a more usable output with consistent metadata.
Disease spectrum and management of children admitted with acute respiratory infection in Viet Nam
- Nguyen Thi Kim Phuong, Respiratory Department, Da Nang Hospital for Women and Children; Professor Ben Marais, The Children’s Hospital at Westmead Clinical School and Deputy Director, Marie Bashir Institute for Infectious Diseases and Biosecurity
- Faculty of Health Sciences
- Data Science (Dr Maryam Montazerolghaem)
- Description and basic visualization
This study aimed to assess the acute respiratory infection (ARI) disease spectrum, duration of hospitalisation and outcome in children hospitalised with an ARI in Viet Nam. The results indicate that acute respiratory infection is a major cause of paediatric hospitalisation in Viet Nam, characterised by prolonged hospitalisation for relatively mild disease. There is huge potential to reduce unnecessary hospital admission and cost.