Projects

This is a selection of projects we have worked on in the past.

1000-fold speedup in Dynamic Bayesian network model

  • The University of Sydney Medical School
  • Data Science (Di Lu, Gordon McDonald)
  • 2019
  • Software
  • Predictive modelling, Inferential modelling, Time series

A Bayesian network is a series of linear models fit to describe the relationships between different variables in a time series. If there are change points in how these variables are related, then the network is dynamic.

SIH helped the researcher by speeding up the R package used to fit the dynamic Bayesian network model by a factor of 1000. The R package is now available at https://github.com/FrankD/EDISON/tree/MultipleTimeSeries
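To make the idea of a change point concrete, here is a minimal sketch in Python. It is not the EDISON algorithm and is not taken from the R package: it assumes a single regulator variable, a single change point, and an exhaustive least-squares search, and the function names `fit_line` and `best_change_point` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, sum of squared errors).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Guard against a segment where x never varies.
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def best_change_point(xs, ys, min_seg=3):
    """Return the index k that best splits the series into two linear regimes.

    Points [0, k) are fitted with one line and points [k, n) with another;
    the k with the smallest combined squared error wins.
    """
    best_k, best_sse = None, None
    for k in range(min_seg, len(xs) - min_seg + 1):
        _, _, left_sse = fit_line(xs[:k], ys[:k])
        _, _, right_sse = fit_line(xs[k:], ys[k:])
        total = left_sse + right_sse
        if best_sse is None or total < best_sse:
            best_k, best_sse = k, total
    return best_k


# Illustrative data (made up): the relationship between x and y
# switches from y = 1 + 2x to y = 5 - x at time step 10.
xs = [(t * 7) % 5 for t in range(20)]
ys = [1 + 2 * x if t < 10 else 5 - x for t, x in enumerate(xs)]
change_point = best_change_point(xs, ys)
```

A full dynamic Bayesian network would consider many variables and many possible change points at once, which is where the cost of fitting, and the value of a speedup, comes from.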

Where can deep-sea iron nodules be found?

  • Dr. Adriana Dutkiewicz, School of Geosciences; Prof. Dietmar Müller, School of Geosciences
  • Faculty of Science
  • Data Science (Dr. Alexander Judge)
  • 2018
  • Software, Transformed data, Report
  • Predictive modelling, Inferential modelling, Description and basic visualization

Potato-sized nodules of iron ore found on the ocean floor are of commercial mining interest. However, the negative ecological effects of mining these nodules are of concern. SIH constructed a global predictive model of nodule occurrence by combining data from thousands of ocean floor samples with global maps of oceanic variables. The environments in which these deposits do and do not occur could then be characterised to generate insight into the potential consequences of proposed mining.

Predicting unnecessary CT scans

  • Professor Jonathan Morris, Kolling Institute of Medical Research and Sydney Medical School; Dr Felicity Gallimore
  • The University of Sydney Medical School
  • Data Science (Dr Aldo Saavedra, Dr Madhura Killedar, Dr Joel Nothman and Mr Peter Thiem)
  • 2018
  • Report
  • Predictive modelling, Inferential modelling, Description and basic visualization, Language as data

Diagnostic imaging in hospitals is costly due to expensive machines and their operators, as well as the cost of moving patients in and out of radiography. Published studies of emergency presentations have shown that the number of brain computed tomography (CT-Brain) scans performed is increasing with time, while the proportion of scans giving no cause for concern remains the same and represents the largest category.

We sought to determine whether a substantial portion of CT scans performed in North Sydney LHD were unnecessary. We translated this research question into something determinable from data: identify CT-Brain cases where the unconcerning outcome of scans could be predicted from clinical knowledge available prior to the scan. By first constructing a text classifier to label CT scan reports as unconcerning, we were able to use clustering and predictive modelling to weakly identify some patient features that predicted unconcerning CT results.

While the project had the potential to impact clinical policy surrounding the application of CT scans in Emergency Departments, the weak results suggest that if any excessive expenditure problem exists, it is not simple to resolve. At the same time, we have developed methodologies for performing similar studies towards rationalising diagnostic scan expenditure.

Predicting Crime using a Spatial-Demographic Framework

  • Dr. Roman Marchant, Centre for Translational Data Science
  • Faculty of Engineering and Information Technologies
  • Data Science (Dr. Sebastian Haan)
  • 2018
  • Verbal advice, Software
  • Data collection, Predictive modelling, Inferential modelling, Data linkage, Description and basic visualization, Time series

Responding to domestic violence-related assaults takes up much of the NSW Police's resources. We sought to understand the relationships that drive social-demographic change and cause the occurrence of crime using a complex modelling framework. The social-demographic-crime network and its inter-dependencies were modelled using a Bayesian vector autoregression model. We built a collaboration with BOCSAR, which holds the crime database of all offences in NSW over the last 20 years, and sourced demographic data for multiple census years. The results of this study will help inform policy decision-making by government and police.

Optimal Image Reconstruction for the SAMI Galaxy Survey

  • Prof. Scott Croom, School of Physics, and the SAMI team; Dr. Richard Scalzo, Centre for Translational Data Science
  • Faculty of Science
  • Data Science (Sebastian Haan)
  • 2018
  • Software, Transformed data
  • Predictive modelling

The SAMI Galaxy Survey is a large-scale observational program to target several thousand galaxies with the University of Sydney-built Sydney-AAO Multi-object Integral field spectrograph (SAMI). A key data challenge is to optimally reconstruct a data cube from ~500 spectra taken at different spatial locations across a galaxy. The previous method resulted in undesirable artefacts due to under-sampling and to the astronomical sources changing spatial location within the data because of differential atmospheric refraction. We have developed a novel method using probabilistic image fusion that delivers an optimal combination of the spectral fibre bundle data into a cube with uniform image quality while maintaining spectral details. This innovative technology has further demonstrated capabilities to achieve super-resolution and is implemented as a flexible software framework that can eventually be used by a wide range of telescopes worldwide.

Understanding Transgenerational Welfare Dependence

  • Professor Deborah Cobb-Clark, School of Economics; Dr Sarah Dahmann
  • Faculty of Arts and Social Sciences
  • Data Science (Mr Peter Thiem)
  • 2018
  • Software, Transformed data
  • Data-store development

The Transgenerational Dataset 2 Extended (TDS2-e) is an important investment by the Commonwealth in understanding the factors contributing to life outcomes, including the reliance of people on income support. The data contains welfare payments to recipients born between 1987 and 1988, and to their families, parents, children and siblings. The raw data was difficult to work with because it was subject to extensive security requirements, was large in volume, and had an inconvenient shape. SIH engineered software to convert the data into forms that made it accessible to the end user while complying with security and licence requirements. This rich dataset is now available for researchers to explore, and will contribute to understanding and improving the Commonwealth income support systems and life outcomes for all Australians.

Automating information curation in the OMIA knowledge base

  • Prof. Frank Nicholas, Faculty of Science at The University of Sydney
  • Faculty of Science
  • Data Science (Joshua Stretton, Di Lu)
  • 2018
  • Software
  • Data collection, Data-store development, Predictive modelling, Language as data

Online Mendelian Inheritance in Animals (OMIA) is an online knowledge base of inherited disorders in animals. It offers a wide range of search and curation functions on the animal genetics database created and maintained by Prof. Frank Nicholas. Frank maintained an annotated bibliography in OMIA by manually searching for the latest articles (~150 per day), but this approach was not sustainable. SIH automated this process to emulate Frank’s existing work. A text-mining pipeline now automatically downloads and shortlists recent publications predicted to have high relevance for OMIA. We developed an interface in which Frank can annotate these publications or exclude them from the knowledge base. This project enables OMIA to continue contributing to the genetic science community as a user-friendly online platform.

Clustering Light Sources: Scaling Up to the Whole Sky

  • Associate Professor Tara Murphy, School of Physics
  • Faculty of Science
  • Data Science (Joel Nothman)
  • 2018
  • Software
  • Data-store development, Time series

The Murchison Widefield Array is a state-of-the-art telescope in Western Australia. Over the last four years, researchers have collected an exceptionally large time-series dataset on 300,000 bright objects in the sky, such as supernovae. Analysing the brightness of light sources over time requires matching each source across images taken at different times and locations in the sky. The astrophysicists had built processing software, a database and a web app to analyse similar datasets, but had never tried to scale them to a dataset of this size. An SIH engineer was able to debug and optimise the software involved, so that the data loading process took 8 hours instead of around 15 hours, and web app load times were reduced from multiple minutes to a few seconds. This enabled further research and analysis of this unique and enormous dataset.

Which treatment might patients with relapsed ovarian cancer respond to?

  • Cristina Mapagu, Westmead Clinical School
  • The University of Sydney Medical School
  • Data Science (Dr Maryam Montazerolghaem)
  • 2018
  • Transformed data

Molecular markers measured within the primary tumour are used to determine if patients who have relapsed ovarian cancer will respond to a particular treatment. SIH helped to identify subsets of genes that are overexpressed / underexpressed in response to treatments, by developing statistical methods including dimensionality reduction and hypothesis testing.

Breast Cancer Dashboard

  • Professor Tim Shaw, Director Research in Implementation Science and eHealth, Charles Perkins Centre; Anna Janssen; Candice Kielly-Carroll
  • Faculty of Health Sciences
  • Data Science (Dr Aldo Saavedra, Joshua Stretton and Peter Thiem)
  • 2017
  • Software

Medical data is under-used relative to its potential to inform clinical practice. SIH developed a visual dashboard to display information about lymphoedema in breast cancer patients. A prototype web application with an easy-to-use interactive dashboard was developed to help understand a patient’s journey and assess the results of different cohorts of patients. User and expert workshops helped optimise the design.

Identifying Nerve Function Profiles in Motor Neurodegenerative Disorders

  • Dr Susanna Parks; Tiffany Li
  • The University of Sydney Medical School
  • Data Science (Alex Judge)
  • 2017
  • Software, Report
  • Predictive modelling, Inferential modelling

Nerve excitability measurements can identify patterns of nerve dysfunction associated with many diseases of the nervous system. The researchers manage a database containing around 20 years of peripheral nerve excitability studies. A software package, QTRAC, is used to generate ~35 properties that are analysed in a research context. Additional information is incorporated to help make a diagnosis, such as clinical survey data and the temperature of the nerve at the time of the test. Importantly, diagnosis of the disorder is not always 100% accurate. SIH used machine learning to predict the likelihood of motor neuron disease (MND) for a patient given nerve excitability measurements. The model had reasonable ability to rank individual cases in order of increasing MND risk. SIH delivered this model in a software package for future use in research as well as in a clinical setting, with the intention of improving the speed and accuracy of MND diagnosis to improve treatment outcomes for patients.

Scopus Data Preparation

The University’s research output is evaluated, in part, on the basis of publication and citation networks derived from publication metadata archives like Elsevier’s Scopus. While the University has subscribed to Scopus snapshot data for a few years, it lacked an efficient way to load and query the data.

We analysed the snapshot, stored as a collection of XML files, and developed a relational database schema to represent useful portions of the data for efficient access. We developed a script to efficiently load the data from the XML snapshot into a relational database.

This resource now allows the Research Portfolio to calculate metrics over the publication record. The database is also more accessible to researchers, and the loading script is available to external Scopus snapshot users.
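The loading step described above can be sketched roughly as follows. This is an illustration only, not the SIH script: the XML element names, the two-table schema and the functions `load_records` and `citation_counts` are all hypothetical, and the real Scopus snapshot format is considerably richer. SQLite stands in for whichever relational database was actually used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record layout, invented for this example.
SAMPLE_XML = """
<records>
  <record id="r1">
    <title>First paper</title>
    <year>2015</year>
    <cites ref="r2"/>
  </record>
  <record id="r2">
    <title>Second paper</title>
    <year>2012</year>
  </record>
</records>
"""


def load_records(xml_text, conn):
    """Parse publication records from XML and insert them into two tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS publication "
        "(id TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation (citing_id TEXT, cited_id TEXT)"
    )
    root = ET.fromstring(xml_text)
    for rec in root.iter("record"):
        conn.execute(
            "INSERT INTO publication VALUES (?, ?, ?)",
            (rec.get("id"), rec.findtext("title"), int(rec.findtext("year"))),
        )
        # One row per outgoing citation, so citation networks can be queried.
        for cite in rec.iter("cites"):
            conn.execute(
                "INSERT INTO citation VALUES (?, ?)",
                (rec.get("id"), cite.get("ref")),
            )
    conn.commit()


def citation_counts(conn):
    """Return {publication id: number of times it is cited}."""
    rows = conn.execute(
        "SELECT p.id, COUNT(c.citing_id) "
        "FROM publication p LEFT JOIN citation c ON c.cited_id = p.id "
        "GROUP BY p.id ORDER BY p.id"
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
load_records(SAMPLE_XML, conn)
counts = citation_counts(conn)
```

Once the records are in relational form, a metric such as citations per publication becomes a single SQL query rather than a pass over every XML file.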

Applying Machine Learning to Criminology

  • Dr. Roman Marchant
  • Faculty of Engineering and Information Technologies
  • Data Science (Dr. Sebastian Haan)
  • https://github.com/sebhaan/GPplus
  • 2017
  • Software, Transformed data, Paper

The incidence of crime and the impacts of societal and individual characteristics on criminal behaviour can be explored using modern machine learning methods, answering important questions about crime, such as:

  • What is the probability of a crime occurring at a location?
  • What are the characteristics of the population that affect the incidence of crime?

Our work implements novel Bayesian machine learning techniques to model the dependency between offence data, demographic characteristics and spatial location. This provides a fully probabilistic approach to modelling crime which reflects all uncertainties in the prediction of offences as well as the uncertainties surrounding model parameters. By using Bayesian updating, these predictions and inferences are dynamic in the sense that they change as new information becomes available. Our model has been applied to offence data, such as domestic violence-related assaults, burglary and motor vehicle theft, in New South Wales (NSW), Australia. The results highlight the strength of the technique by validating the factors that are associated with high and low criminal activity.

Research Environment for Ancient Documents (READ) efficiency

  • Ian McCrab; Dr Mark Allon, School of Languages and Cultures
  • Faculty of Arts and Social Sciences
  • Data Science (Joel Nothman)
  • 2017
  • Verbal advice

Research Environment for Ancient Documents (READ) is an integrated Open Source web platform for epigraphical and manuscript research. It allows digital images of texts to be annotated, and for multiple annotations to be maintained for critical analysis.

The READ research and development team consulted with SIH because they had trouble getting their software to perform well with long texts. SIH Research Engineers helped them to identify parts of the system that were slower than was reasonable. The READ team's software engineers were then able to resolve these bottlenecks, enabling the READ system to be more widely adopted in archaeology, history and manuscript studies.

Predictive Project Profile

  • Professor Lynn Crawford, School of Civil Engineering; Dr Terry Cooke-Davies; Dr Mike Steele
  • Faculty of Engineering and Information Technologies
  • Data Science (Mr Peter Thiem)
  • 2017
  • Software
  • Predictive modelling

Judging the likelihood of project success is a difficult and important aspect of project management. Projects can fail in many ways, such as overruns in cost or duration, failure to deliver benefit, and failure to satisfy the project goals.

This project used survey data collected over the lifecycle of 1000 projects to build a machine learning model and a prototype web application that would indicate the likelihood of success of a project.

This tool would be used for further demonstration and discussion of the idea with expert groups of project managers, so that they can develop better, data-driven approaches to successful project management.

Transforming IMPALA: International Migration Law and Policy Assessment Database

  • Professor Mary Crock, Sydney Law School
  • The University of Sydney Law School
  • Data Science (Joel Nothman)
  • 2017
  • Software
  • Description and basic visualization

The IMPALA database (http://www.impaladatabase.org/) contains migration and citizenship law and policy across countries and through time in a form that allows legislation, policy and some statistical data to be easily compared and measured. With one record for each visa type in each year, thousands of records of content have been entered, mostly manually, into a Qualtrics survey. SIH transformed the unwieldy manually-entered database to improve data exploration. We wrote software to ingest the Qualtrics responses (https://github.com/Sydney-Informatics-Hub/qualtrics-pandas), and to generate a more usable output with consistent metadata.

Disease spectrum and management of children admitted with acute respiratory infection in Viet Nam

  • Nguyen Thi Kim Phuong, Respiratory Department, Da Nang Hospital for Women and Children; Professor Ben Marais, The Children's Hospital at Westmead Clinical School and Deputy Director, Marie Bashir Institute for Infectious Diseases and Biosecurity
  • Faculty of Health Sciences
  • Data Science (Dr Maryam Montazerolghaem)
  • 2016
  • Paper
  • Description and basic visualization

This study aimed to assess the acute respiratory infection (ARI) disease spectrum, duration of hospitalisation and outcome in children hospitalised with an ARI in Viet Nam. The results indicate that acute respiratory infection is a major cause of paediatric hospitalisation in Viet Nam, characterised by prolonged hospitalisation for relatively mild disease. There is huge potential to reduce unnecessary hospital admissions and costs.
