Projects : Software

Our projects often involve developing and delivering software.

This may be a web app, such as a data management tool. It may be a data collection or analysis tool that runs repeatedly. Or it might be a script that the client can run several times to perform modelling, prediction, simulation or visualisation.

We mostly develop software in Python or R, together with backend technologies such as SQL and frontend technologies like JavaScript/HTML and even Excel! We may employ one of several software frameworks such as R Shiny, Django, Scientific Python and AngularJS.

Below we showcase several projects in which SIH has delivered software.

Wheat yield prediction with uncertainty estimates

Predicting crop yield using a range of proximal and remote sensor measurements is an area of active research. Such predictions are important for optimising crop management (e.g. nitrogen application), and robust uncertainty estimates help to improve this process and to understand its limitations. We wrote code implementing a Bayesian regression model with spatially correlated residuals for application to wheat crop yield forecasting using a range of sensor data. We used this to generate predictive maps of wheat yield with robust uncertainty bounds.
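
The project code itself is not reproduced here; purely as an illustrative sketch, a Bayesian regression with exponentially decaying spatially correlated residuals could be set up in PyMC as below (the data, covariates, priors and kernel are all placeholders, and the project's actual implementation may differ).

    import numpy as np
    import pymc as pm

    # Illustrative data only: sample coordinates, sensor covariates, observed yield
    rng = np.random.default_rng(0)
    coords = rng.random((60, 2))          # spatial locations of yield samples
    X = rng.random((60, 3))               # proximal/remote sensor covariates
    y = rng.random(60)                    # observed wheat yield

    # Pairwise distances drive the spatial correlation of the residuals
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

    with pm.Model():
        intercept = pm.Normal("intercept", 0.0, 1.0)
        beta = pm.Normal("beta", 0.0, 1.0, shape=X.shape[1])
        sigma = pm.HalfNormal("sigma", 1.0)
        ell = pm.HalfNormal("ell", 1.0)   # spatial correlation length scale
        # Exponential covariance: residuals at nearby locations co-vary
        cov = sigma**2 * pm.math.exp(-d / ell) + 1e-6 * np.eye(len(y))
        mu = intercept + pm.math.dot(X, beta)
        pm.MvNormal("yield", mu=mu, cov=cov, observed=y)
        idata = pm.sample(1000, tune=1000)   # posterior draws give the uncertainty bounds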

GemPy MCMC Sampling

GemPy is a Python-based, open-source library for implicitly generating 3D structural geological models. It was designed from the ground up to support easy embedding in probabilistic frameworks for the uncertainty analysis of subsurface structures.

SIH investigated the code's viability for the implementation of various additional sensor models, as well as the implementation of a data science-friendly Bayesian inference wrapper for easy experimentation with priors, likelihoods and sampling schemes.

Stoichiometry Table Widget for LabArchives eNotebooks

  • Samuel Banister
  • Faculty of Science
  • Data Science (Joel Nothman and Vijay Raghunath)
  • 2019
  • Software
  • Web app

A stoichiometry table is useful for calculating and recording quantities of reagents. SIH coded a flexible and reliable widget for entering this data in a lab notebook, supporting the University's use of LabArchives notebooks in chemistry. Use it: labarchives-stoichiometry-widget!

Labelling Clause Type at Scale for LCT

LCT studies how knowledge is built through teaching and, to determine the trajectory of knowledge building, proposes categorising each clause in a teaching transcript. SIH made this labelling process much faster and more scalable. Firstly, we developed software using natural language processing technology to convert a lesson transcript into a spreadsheet in which each row contains a clause to be categorised. Secondly, we developed a machine learning classifier that learns from these spreadsheets and predicts the labels of future clauses. Finally, we developed techniques to visualise the trajectory of knowledge building through a lesson once its clauses have been categorised.
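
The blurb above does not name the tools involved; purely as an illustration of the second step, a simple clause classifier could be built with scikit-learn along these lines (the clauses and labels below are invented and are not the real LCT categories).

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # A stand-in for a spreadsheet of clauses with manually assigned labels
    labelled = pd.DataFrame({
        "clause": ["so we add the acid to the base",
                   "what do you think will happen",
                   "last week we looked at neutralisation",
                   "now write down your prediction"],
        "label": ["doing", "questioning", "recapping", "doing"],
    })

    # Word n-gram features feeding a linear classifier
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(labelled["clause"], labelled["label"])

    # Predict labels for clauses extracted from a new transcript
    print(clf.predict(["now we heat the mixture gently"]))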

cBioPortal Melanoma Data

  • Professor Graham Mann, Professor Anna DeFazio, Dr James Wilmott, Natalie Bouantoun, Pamela Provan
  • The University of Sydney Medical School
  • Data Science (David Kohn and Vijay Raghunath)
  • 2019
  • Software
  • Data-store development

cBioPortal is the Westmead Institute for Medical Research's cancer genomics database and web platform. It provides visualisation, analysis and download of large-scale cancer genomics data sets, so that users can find altered genes and networks within the genomic studies in the database. SIH extended the app's capabilities to incorporate melanoma images, genetic data and studies.

Identifying ram mating behaviour

Monitoring livestock has historically been labour intensive. The advent of on-animal sensors means this monitoring can be conducted remotely, continuously and accurately. The ability to identify the precise time when sheep are mating using ram-mounted accelerometer data would unlock unprecedented information on the reproductive performance of these animals. We fit a classifier model to data from collar accelerometers, labelled by videoing rams in the presence of ewes in oestrus. We then wrote code to detect change points in new acceleration data and to predict the occurrence of mating events.
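
The libraries used are not named above; one hedged sketch of the change point step, using the ruptures package on synthetic accelerometer data, might look like this (the real pipeline, features and penalty settings may differ).

    import numpy as np
    import ruptures as rpt

    # Synthetic tri-axial accelerometer trace: quiet, a burst of activity, quiet again
    rng = np.random.default_rng(0)
    signal = np.concatenate([rng.normal(0, 1, (500, 3)),
                             rng.normal(3, 1, (100, 3)),
                             rng.normal(0, 1, (400, 3))])

    # PELT search for points where the signal's behaviour changes
    algo = rpt.Pelt(model="rbf").fit(signal)
    change_points = algo.predict(pen=10)
    print(change_points)   # candidate segment boundaries to feed the classifier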

1000-fold speedup in Dynamic Bayesian network model

A Bayesian network is a series of linear models fit to describe the relationships between different variables in a time series. If there are change points in how these variables are related, then the network is dynamic.

SIH helped the researcher by speeding up the R package used to fit the dynamic Bayesian network model by 1000x. The package is now available at https://github.com/FrankD/EDISON/tree/MultipleTimeSeries

Repackaging software for modelling topic structure in language

  • Eduardo Altmann, Mathematics and Statistics
  • Faculty of Science
  • Data Science (Vijay Raghunath and Dr Joel Nothman)
  • 2019
  • Software

Altmann and Martin Gerlach wanted other researchers to try out their new language modelling technique, so they made it open source. SIH made their work more accessible by following software best practices: restructuring the code so that it conformed to the established scikit-learn estimator API; adding automated software testing and continuous integration; extending and publishing documentation; and releasing version 0.1 of the software to the Python Package Index. See http://topsbm.readthedocs.io
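
For readers unfamiliar with what conforming to the scikit-learn estimator API involves, the toy transformer below illustrates the conventions: hyperparameters as keyword constructor arguments, fitted attributes suffixed with an underscore, and fit() returning self. It is not topsbm's code; see the documentation linked above for the real package.

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.utils.validation import check_array, check_is_fitted

    class ToyTopicModel(BaseEstimator, TransformerMixin):
        """Toy estimator following the scikit-learn conventions."""

        def __init__(self, n_components=5):
            self.n_components = n_components

        def fit(self, X, y=None):
            X = check_array(X)
            # A real topic model would infer these from X; here they are random stand-ins
            self.components_ = np.random.default_rng(0).random((self.n_components, X.shape[1]))
            return self

        def transform(self, X):
            check_is_fitted(self, "components_")
            # Project documents onto the learned "topics"
            return check_array(X) @ self.components_.T

    # Behaves like any other scikit-learn transformer, e.g. inside a Pipeline
    docs = np.random.default_rng(1).integers(0, 5, size=(4, 20))
    print(ToyTopicModel(n_components=3).fit_transform(docs).shape)   # (4, 3)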

Where can deep-sea iron nodules be found?

Potato-sized nodules of iron ore found on the ocean floor are of commercial mining interest. However, the potential negative ecological effects of mining these nodules are of concern. SIH constructed a global predictive model of nodule occurrence by combining data from thousands of ocean-floor samples with global maps of oceanic variables. The environments in which these deposits do and do not occur could then be characterised to generate insight into the potential consequences of proposed mining.
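
The blurb does not say which model was used; as a generic illustration of the approach (fit to point samples, predict over gridded covariates), a stand-in occurrence classifier might look like the following, with invented covariates and a random forest chosen only for the sketch.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Invented ocean-floor samples: environmental covariates and nodule presence/absence
    rng = np.random.default_rng(0)
    X_samples = rng.random((1000, 4))                    # e.g. depth, slope, sediment, flux
    y_samples = (X_samples[:, 0] + rng.normal(0, 0.3, 1000) > 0.6).astype(int)

    # Fit an occurrence model to the labelled samples
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_samples, y_samples)

    # Predict occurrence probability over a grid of the same covariates (the map)
    X_grid = rng.random((5000, 4))                       # stand-in for global gridded variables
    p_nodule = model.predict_proba(X_grid)[:, 1]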

Predicting Crime using a Spatial-Demographic Framework

Responding to domestic violence-related assaults dominates much of the NSW Police's resources. We used a complex modelling framework to understand the relationships that drive social-demographic change and the occurrence of crime. The social-demographic-crime network and its inter-dependencies were modelled using a Bayesian vector autoregression model. We collaborated with BOCSAR, which maintains the database of all offences recorded in NSW over the last 20 years, and sourced demographic data for multiple census years. The results of this study will help inform policy decision-making by government and police.

Optimal Image Reconstruction for the SAMI Galaxy Survey

The SAMI Galaxy Survey is a large-scale observational program targeting several thousand galaxies with the University of Sydney-built Sydney-AAO Multi-object Integral field spectrograph (SAMI). A key data challenge is to optimally reconstruct a data cube from ~500 spectra taken at different spatial locations across a galaxy. The previous method produced undesirable artefacts due to under-sampling and to the astronomical sources changing spatial location within the data because of differential atmospheric refraction. We have developed a novel method using probabilistic image fusion that delivers an optimal combination of the spectral fibre-bundle data into a cube with uniform image quality while maintaining spectral detail. This innovative technology has further demonstrated capabilities to achieve super-resolution and is implemented as a flexible software framework that could eventually be used by a wide range of telescopes worldwide.

Understanding Transgenerational Welfare Dependence

The Transgenerational Dataset 2 Extended (TDS2-e) is an important investment by the Commonwealth in understanding the factors contributing to life outcomes, including people's reliance on income support. The data contains welfare payments to recipients born in 1987 or 1988 and to their family members: parents, children and siblings. The raw data was difficult to work with because it was subject to extensive security requirements, was large in volume, and came in an inconvenient shape. SIH engineered software to convert the data into forms that made it accessible to end users while complying with security and licence requirements. This rich dataset is now available for researchers to explore, and will contribute to understanding and improving the Commonwealth income support systems and life outcomes for all Australians.

Automating information curation in the OMIA knowledge base

Online Mendelian Inheritance in Animals (OMIA) is an online knowledge base of inherited disorders in animals. It offers a wide range of search and curation functionalities on the animal genetics database created and maintained by Prof. Frank Nicholas. Frank maintained an annotated bibliography in OMIA by manually searching the latest articles (~150 per day), but this approach was not sustainable. SIH automated this process to emulate Frank's existing workflow. A text-mining pipeline now automatically downloads and shortlists recent publications predicted to have high relevance for OMIA, and we developed an interface in which Frank can annotate these publications or exclude them from the knowledge base. This project enables OMIA to continue contributing to the genetic science community as a user-friendly online platform.

Clustering Light Sources: Scaling Up to the Whole Sky

The Murchison Widefield Array is a state-of-the-art telescope in Western Australia. Over the last four years, researchers have collected an exceptionally large time-series dataset on 300,000 bright objects in the sky, such as supernovae. Analysing the brightness of light sources over time requires matching each source across images taken at different times and locations in the sky. The astrophysicists had built processing software, a database and a web app to analyse similar datasets, but had never tried to scale them to a dataset of this size. An SIH engineer debugged and optimised the software involved, so that the data loading process took 8 hours instead of around 15, and web app load times were reduced from multiple minutes to a few seconds. This enabled further research and analysis of this unique and enormous dataset.

Breast Cancer Dashboard

  • Professor Tim Shaw, Director Research in Implementation Science and eHealth, Charles Perkins Centre; Anna Janssen; Candice Kielly-Carroll
  • Faculty of Health Sciences
  • Charles Perkins Centre
  • Data Science (Dr Aldo Saavedra, Joshua Stretton and Peter Thiem)
  • 2017
  • Software

Medical data is under-used given its potential to inform clinical practice. SIH developed a visual dashboard to display information about lymphoedema in breast cancer patients. A prototype web application with an easy-to-use interactive dashboard was developed to help understand a patient's journey and to assess the results of different cohorts of patients. User and expert workshops helped optimise the design.

Identifying Nerve Function Profiles in Motor Neurodegenerative Disorders

Nerve excitability measurements can identify patterns of nerve dysfunction associated with many diseases of the nervous system. The researchers manage a database containing around 20 years of peripheral nerve excitability studies. A software package, QTRAC, is used to generate ~35 properties that are analysed in a research context. Additional information, such as clinical survey data and the temperature of the nerve at the time of the test, is incorporated to help make a diagnosis. Importantly, diagnosis of the disorder is not always 100% accurate. SIH used machine learning to predict the likelihood of motor neuron disease (MND) for a patient given their nerve excitability measurements. The model had reasonable ability to rank individual cases in order of increasing MND risk. SIH delivered this model in a software package for future use in research as well as clinical settings, with the intention of improving the speed and accuracy of MND diagnosis and thereby treatment outcomes for patients.

Scopus Data Preparation

The University’s research output is evaluated, in part, on the basis of publication and citation networks derived from publication metadata archives like Elsevier’s Scopus. While the University has subscribed to Scopus snapshot data for a few years, it lacked an efficient way to load and query the data.

We analysed the snapshot, stored as a collection of XML files, and developed a relational database schema to represent useful portions of the data for efficient access. We then developed a script to efficiently load the data from the XML snapshot into the relational database.

This resource now allows the Research Portfolio to calculate metrics over the publication record; the database is also more accessible to researchers, and the loading script is available to external Scopus snapshot users.
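
Neither the schema nor the loading script is reproduced here; a minimal sketch of the loading step, assuming hypothetical element names and a much smaller table than the real schema, could look like:

    import sqlite3
    import xml.etree.ElementTree as ET

    conn = sqlite3.connect("scopus.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS publication (
                        eid   TEXT PRIMARY KEY,
                        title TEXT,
                        year  INTEGER)""")

    def load_snapshot_file(path):
        # Pull one row per <record> element (element and attribute names are hypothetical)
        root = ET.parse(path).getroot()
        rows = [(rec.get("eid"), rec.findtext("title"), rec.findtext("year"))
                for rec in root.iter("record")]
        conn.executemany("INSERT OR REPLACE INTO publication VALUES (?, ?, ?)", rows)
        conn.commit()

    load_snapshot_file("snapshot_part_001.xml")   # repeat for each XML file in the snapshot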

Applying Machine Learning to Criminology

The incidence of crime and the impacts of societal and individual characteristics on criminal behaviour can be explored using modern machine learning methods, answering important questions about crime, such as:
  • What is the probability of a crime occurring at a location?
  • What are the characteristics of the population that affect the incidence of crime?

Our work applies novel Bayesian machine learning techniques to model the dependency between offence data, demographic characteristics and spatial location. This provides a fully probabilistic approach to modelling crime which reflects all uncertainties in the prediction of offences as well as the uncertainties surrounding model parameters. By using Bayesian updating, these predictions and inferences are dynamic in the sense that they change as new information becomes available. Our model has been applied to offence data, such as domestic violence-related assaults, burglary and motor vehicle theft, in New South Wales (NSW), Australia. The results highlight the strength of the technique by validating the factors that are associated with high and low criminal activity.

Predictive Project Profile

  • Professor Lynn Crawford, School of Civil Engineering; Dr Terry Cooke-Davies; Dr Mike Steele
  • Faculty of Engineering and Information Technologies
  • Data Science (Mr Peter Thiem)
  • 2017
  • Software
  • Predictive modelling

Judging the likelihood of project success is a difficult and important aspect of project management. Projects can fail in many ways, such as overruns in cost or duration, failure to deliver benefits and failure to satisfy project goals.

This project used survey data collected over the lifecycle of 1000 projects to build a machine learning model and prototype web application that would indicate the likelihood of success of the project.

This tool will be used for further demonstration and discussion of the idea with expert groups of project managers, so that they can develop better, data-driven approaches that lead to successful project management.

Discharge against medical advice in the Sydney Children's Hospital Network

Patients who discharge against medical advice (DAMA) from hospital carry a significant risk of readmission and have increased rates of morbidity and mortality. Using five years of admissions and diagnosis data, we sought to identify the demographic, clinical and administrative characteristics of DAMA patients in the Sydney Children's Hospital Network. Using a Bayesian logistic regression framework, we found that statistically significant predictors of DAMA in a given admission were hospital site, a mental health/behavioural diagnosis, Aboriginality, emergency rather than elective admission, a gastrointestinal diagnosis and a history of previous DAMA. Identification of these predictors provides opportunities for intervention at a practice and policy level in order to prevent adverse outcomes for patients.
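
As a rough sketch of the modelling framework only (simulated data, not the study's predictors or code), a Bayesian logistic regression can be expressed in PyMC as:

    import numpy as np
    import pymc as pm

    # Simulated admission-level data: three binary predictors and a DAMA indicator
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(500, 3)).astype(float)
    y = rng.integers(0, 2, size=500)

    with pm.Model():
        intercept = pm.Normal("intercept", 0.0, 2.5)
        beta = pm.Normal("beta", 0.0, 2.5, shape=X.shape[1])   # log-odds per predictor
        p = pm.math.invlogit(intercept + pm.math.dot(X, beta))
        pm.Bernoulli("dama", p=p, observed=y)
        idata = pm.sample(1000, tune=1000)   # posteriors indicate which predictors matter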

Transforming IMPALA: International Migration Law and Policy Assessment Database

The IMPALA database (http://www.impaladatabase.org/) contains migration and citizenship law and policy across countries and through time in a form that allows legislation, policy and some statistical data to be easily compared and measured. With one record for each visa type in each year, thousands of records have been entered, mostly manually, into a Qualtrics survey. SIH transformed this unwieldy, manually entered database to improve data exploration. We wrote software to ingest the Qualtrics responses (https://github.com/Sydney-Informatics-Hub/qualtrics-pandas) and to generate a more usable output with consistent metadata.
