Below we showcase several projects in which SIH has worked with language as data. These projects draw on techniques such as natural language processing (NLP), text mining, speech recognition, large-scale content analysis, corpus linguistics, web scraping and linguistic corpus annotation.
Grant application support for cardiovascular AI impact
We assisted the Westmead Applied Research Centre (WARC) in applying for the Google AI Impact Challenge, and are involved in the ongoing design and engineering work to make use of the $1M prize. WARC called on our expertise in data science and natural language processing to revise their application and to support their case in the interview for the award. Our involvement ensured that the proposal described a realistic, operationalised application of artificial intelligence technologies.
Labelling Clause Type at Scale for LCT
LCT studies how knowledge is built through teaching. To determine the trajectory of knowledge building, it proposes categorising each clause in a teaching transcript. SIH made this process of labelling clauses much faster and more scalable. Firstly, SIH developed software that uses natural language processing to convert a lesson transcript into a spreadsheet in which each row contains a clause to be categorised. Secondly, SIH developed a machine learning classifier that learns from these spreadsheets and predicts the labels of future clauses. Finally, SIH developed techniques to visualise the trajectory of knowledge building through a lesson whose clauses have been categorised.
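The first step above, turning a transcript into one row per clause, can be sketched roughly as follows. This is a minimal illustration and not SIH's software: the `split_clauses` and `transcript_to_csv` functions, the column names and the punctuation-and-conjunction splitting rule are all assumptions made for the example. A real system would rely on syntactic parsing to find clause boundaries.

```python
import csv
import io
import re

# Assumption: a deliberately naive boundary rule standing in for real NLP.
# It splits on sentence punctuation, and on a comma followed by a few
# common conjunctions. Proper clause segmentation needs a syntactic parser.
CLAUSE_BOUNDARY = re.compile(
    r"[.?!;]+\s*|,\s+(?=(?:and|but|because|so|which)\b)"
)


def split_clauses(utterance):
    """Split one utterance into rough clause-like segments."""
    parts = CLAUSE_BOUNDARY.split(utterance)
    return [part.strip() for part in parts if part and part.strip()]


def transcript_to_csv(turns):
    """Convert (speaker, utterance) pairs into CSV text, one clause per row.

    The 'label' column is left empty for a human annotator to fill in.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["turn", "speaker", "clause", "label"])
    for turn_number, (speaker, utterance) in enumerate(turns, start=1):
        for clause in split_clauses(utterance):
            writer.writerow([turn_number, speaker, clause, ""])
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented toy transcript, for illustration only.
    toy_turns = [
        ("Teacher", "Energy is stored in bonds, and it is released when they break."),
        ("Student", "Why?"),
    ]
    print(transcript_to_csv(toy_turns))
```

Once annotators have filled in the empty `label` column, a spreadsheet of this shape could serve as training data for a clause classifier.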
Predicting unnecessary CT scans
- Professor Jonathan Morris, Kolling Institute of Medical Research and Sydney Medical School; Dr Felicity Gallimore
- The University of Sydney Medical School
- Data Science (Dr Aldo Saavedra, Dr Madhura Killedar, Dr Joel Nothman and Mr Peter Thiem)
- Predictive modelling, Inferential modelling, Description and basic visualization, Language as data
Diagnostic imaging in hospitals is costly because of the expensive machines and their operators, as well as the cost of moving patients in and out of radiography. Published studies of emergency presentations have shown that the number of brain computed tomography (CT-Brain) scans performed is increasing over time, while scans giving no cause for concern remain the largest category and their proportion has not changed.
We sought to determine whether a substantial portion of the CT scans performed in North Sydney LHD were unnecessary. We translated this research question into something determinable from data: identifying CT-Brain cases where the unconcerning outcome of the scan could have been predicted from clinical knowledge available before the scan. We first constructed a text classifier to label CT scan reports as unconcerning. We then used clustering and predictive modelling, which weakly identified some patient features that predicted unconcerning CT results.
While the project had the potential to influence clinical policy on the use of CT scans in Emergency Departments, the weak results suggest that, if a problem of excessive expenditure exists, it is not simple to resolve. At the same time, we have developed methodologies for performing similar studies aimed at rationalising diagnostic scan expenditure.
Automating information curation in the OMIA knowledge base
Online Mendelian Inheritance in Animals (OMIA) is an online knowledge base of inherited disorders in animals. It offers a wide range of search and curation functions over the animal genetics database created and maintained by Prof. Frank Nicholas. Frank maintained an annotated bibliography in OMIA by manually searching for the latest articles (~150 per day), but this approach was not sustainable. SIH automated this process to emulate Frank’s existing work. A text-mining pipeline now automatically downloads recent publications and shortlists those predicted to have high relevance for OMIA. We developed an interface in which Frank can annotate these publications or exclude them from the knowledge base. This project enables OMIA to continue contributing to the genetic science community as a user-friendly online platform.
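The shortlisting step can be illustrated with a small sketch. It is not the OMIA pipeline: the `Publication` record, the `TERM_WEIGHTS` table, the `relevance` score and the threshold are all hypothetical. A hand-weighted keyword score stands in here for whatever predictive model the real pipeline uses.

```python
from dataclasses import dataclass

# Assumption: hand-picked term weights standing in for a learned relevance
# model. The terms and weights are invented for illustration.
TERM_WEIGHTS = {
    "inherited": 2.0,
    "mutation": 2.0,
    "variant": 1.5,
    "breed": 1.0,
    "dog": 0.5,
    "cattle": 0.5,
}


@dataclass
class Publication:
    identifier: str
    title: str
    abstract: str


def relevance(publication):
    """Sum the weights of the known terms that appear in the title or abstract."""
    words = (publication.title + " " + publication.abstract).lower().split()
    tokens = {word.strip(".,;:()") for word in words}
    return sum(weight for term, weight in TERM_WEIGHTS.items() if term in tokens)


def shortlist(publications, threshold=3.0):
    """Return identifiers of publications scoring at least `threshold`,
    ordered from most to least relevant."""
    scored = [(relevance(pub), pub) for pub in publications]
    kept = [(score, pub) for score, pub in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [pub.identifier for _, pub in kept]


if __name__ == "__main__":
    # Invented toy records, for illustration only.
    candidates = [
        Publication("a", "A novel mutation in an inherited disorder of cattle", ""),
        Publication("b", "Weather patterns in 2019", "rainfall"),
        Publication("c", "Variant discovery in a dog breed", "an inherited condition"),
    ]
    print(shortlist(candidates))
```

Shortlisted items would still go to a human curator, matching the annotate-or-exclude interface described above.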