Below we showcase several projects in which SIH has developed data stores. See all projects.
ProteHome: Proteomics Experimental Results Database
The Metabolic Cybernetics Lab at the Charles Perkins Centre has generated disparate Mass Spectrometry datasets for protein metabolism studies. With datasets stored in various formats and places, it has been difficult to search and compare experimental results.
SIH developed ProteHome, a bioinformatics system providing a centralised data repository, with a web-based interface to facilitate the:
- Standardisation of quantitative analysis results, with common specification of experimental metadata and formatting of analysis data.
- Retrieval of those results for protein(s) or modification(s) of interest, regardless of the version of protein identification number used in the stored experiment.
- Management of submitted datasets using a comprehensive hierarchical storage structure.
cBioPortal Melanoma Data
cBioPortal is the Westmead Institute for Medical Research’s cancer genomics database and web platform. It provides visualisation, analysis and download of large-scale cancer genomics data sets, so that users can find altered genes and networks within the genomic studies in the database. SIH extended the app’s capabilities to incorporate melanoma images, genetic data and studies.
Understanding Transgenerational Welfare Dependence
The Transgenerational Dataset 2 Extended (TDS2-e) is an important investment by the Commonwealth in understanding the factors contributing to life outcomes, including people’s reliance on income support. The data contains welfare payments to recipients born between 1987 and 1988, and to their families, parents, children and siblings. The raw data was difficult to work with because it was subject to extensive security requirements, large in volume, and inconveniently structured. SIH engineered software to convert the data into forms accessible to the end user while complying with security and licence requirements. This rich dataset is now available for researchers to explore, and will contribute to understanding and improving the Commonwealth income support systems and life outcomes for all Australians.
Automating information curation in the OMIA knowledge base
Online Mendelian Inheritance in Animals (OMIA) is an online knowledge base of inherited disorders in animals. It offers a wide range of search and curation functions over the animal genetics database created and maintained by Prof. Frank Nicholas. Frank maintained an annotated bibliography in OMIA by manually searching for the latest articles (~150 per day), but this approach was not sustainable. SIH automated this process to emulate Frank’s existing work. A text-mining pipeline now automatically downloads and shortlists recent publications predicted to have high relevance for OMIA. We developed an interface in which Frank can annotate these publications or exclude them from the knowledge base. This project enables OMIA to continue contributing to the genetic science community as a user-friendly online platform.
Clustering Light Sources: Scaling Up to the Whole Sky
The Murchison Widefield Array is a state-of-the-art telescope in Western Australia. Over the last four years, researchers have collected an exceptionally large time-series dataset on 300,000 bright objects in the sky, such as supernovae. Analysing the brightness of light sources over time requires matching each source across images taken at different times and locations in the sky. The astrophysicists had built processing software, a database and a web app to analyse similar datasets, but had never tried to scale them to a dataset of this size. An SIH engineer debugged and optimised the software involved, so that the data loading process took 8 hours instead of around 15 hours, and web app load times were reduced from multiple minutes to a few seconds. This enabled further research and analysis of this unique and enormous dataset.
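To illustrate what matching light sources across images involves, here is a minimal Python sketch of positional cross-matching. It is not the project's software: the function names, the tuple layouts and the matching radius are all assumptions made for the example, and the brute-force search shown here would need a spatial index to work at the scale described above.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(catalogue, detections, radius_deg):
    """Attach each detection to the nearest catalogue source within radius_deg.

    catalogue:  list of (source_id, ra_deg, dec_deg)     -- hypothetical layout
    detections: list of (time, ra_deg, dec_deg, flux)    -- hypothetical layout
    Returns {source_id: [(time, flux), ...]}; unmatched detections are dropped.
    """
    light_curves = {source_id: [] for source_id, _, _ in catalogue}
    for time, ra, dec, flux in detections:
        # Brute-force nearest neighbour: fine for a demo, too slow for a whole sky.
        nearest = min(
            catalogue,
            key=lambda s: angular_separation_deg(ra, dec, s[1], s[2]),
        )
        if angular_separation_deg(ra, dec, nearest[1], nearest[2]) <= radius_deg:
            light_curves[nearest[0]].append((time, flux))
    return light_curves


catalogue = [("A", 10.0, -30.0), ("B", 200.0, 5.0)]
detections = [
    (0, 10.001, -30.001, 1.5),  # close to source A
    (1, 199.999, 5.0, 2.0),     # close to source B
    (2, 50.0, 50.0, 9.9),       # far from every catalogue source
]
curves = cross_match(catalogue, detections, radius_deg=0.01)
```

Each entry of `curves` is then a brightness time series for one source, which is the form needed for the kind of analysis described above.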
Scopus Data Preparation
The University’s research output is evaluated, in part, on the basis of publication and citation networks derived from publication metadata archives like Elsevier’s Scopus. While the University has subscribed to Scopus snapshot data for a few years, it lacked an efficient way to load and query the data.
We analysed the snapshot, stored as a collection of XML files, and designed a relational database schema that represents the useful portions of the data for efficient access. We then developed a script to load the data from the XML snapshot into a relational database.
This resource now allows the Research Portfolio to calculate metrics over the publication record. The database is also more accessible to researchers, and the loading script to external Scopus snapshot users.
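The XML-to-relational loading step described above can be sketched in Python as follows. This is an illustration only, not the SIH script: the `<record>` layout, the table names and the function names are hypothetical, and real Scopus snapshot XML follows a different and much richer schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record layout, invented for this example.
SAMPLE_XML = """
<records>
  <record id="r1">
    <title>First paper</title>
    <year>2015</year>
    <cites ref="r2"/>
  </record>
  <record id="r2">
    <title>Second paper</title>
    <year>2012</year>
  </record>
</records>
"""


def load_records(xml_text, conn):
    """Parse publication records and bulk-insert them into two tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS publication "
        "(id TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation (citing_id TEXT, cited_id TEXT)"
    )
    root = ET.fromstring(xml_text.strip())
    publications, citations = [], []
    for record in root.iter("record"):
        record_id = record.get("id")
        publications.append(
            (record_id, record.findtext("title"), int(record.findtext("year")))
        )
        citations.extend((record_id, c.get("ref")) for c in record.findall("cites"))
    # One transaction with batched inserts is much faster than row-by-row commits.
    with conn:
        conn.executemany("INSERT INTO publication VALUES (?, ?, ?)", publications)
        conn.executemany("INSERT INTO citation VALUES (?, ?)", citations)
    return len(publications)


def citation_counts(conn):
    """Return {publication_id: number of times it is cited}."""
    rows = conn.execute(
        "SELECT cited_id, COUNT(*) FROM citation GROUP BY cited_id"
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
loaded = load_records(SAMPLE_XML, conn)
counts = citation_counts(conn)
```

Once the records are in relational tables, a metric such as a citation count becomes a single SQL query rather than a pass over every XML file.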