Galileo, a novel search engine turning unstructured text into actionable insights

Data sources in the medical field

The healthcare system produces a large amount of data. Even a short hospitalization for an interventional procedure generates a considerable volume of information: operative reports, therapy administration records, medical imaging, discharge letters, laboratory and diagnostic tests, daily annotations and other written prescriptions.

Most of these data are still stored in hard copy. Even with the current trend toward digitization, the explosion of health-related data, the increasing speed at which they are produced, the variability of their sources, and the effort needed to manage them with traditional software and databases make it difficult even to store and retrieve them without losing the broader view, that is, their interrelation and complexity.

Woven through these issues are those of data processing and cleansing, together with data integration. Indeed, healthcare data are often disordered, fragmented, and generated by different operators (clinicians, surgeons, nurses and administrative personnel) in legacy IT systems with incompatible formats.

Currently, producing statistically sound data in healthcare requires intensive digging through a vast array of non-integrated sources: going over a clinical folder with a fine-tooth comb, or looking for patients in internal spreadsheets filled in by physicians or clinical data managers who collect data for their own department or for a specific clinical study, and then integrating these with the information available from the institute’s IT system.

The lack of integration and the need to cleanse data on their way from raw text into database columns are crucial challenges that must be addressed to improve clinical and financial outcomes and to boost research projects in healthcare.

In this scenario, steps forward are being made. Medtronic, one of the largest companies in the biomedical device field, is developing a mobile application that ingests data from the sensors of its devices for diabetes treatment (insulin pumps) to better manage patients outside the hospital setting, and IBM has recently introduced Watson Health, a technology that adds cognitive features to the data-driven approach in healthcare.

Here, we present a novel platform, named Galileo, able to ingest unstructured data generated during a patient’s hospitalization (in the setting of an Electrophysiology Department dedicated to arrhythmia management), combine them, and deliver them to users through a single access point.

Facing the lag between data production and data collection/processing

The rationale behind Galileo is to address the lag between data production, when a physician writes a clinical report, and data collection/processing, when the clinical data manager or statistician reads the plain text from the clinical folders, designs suitable databases and performs data entry.

Galileo was designed to process data at the very source, namely the hospital’s discharge letters, laboratory exams and preoperative exams, all in .txt format (.docx or .pdf may be applicable as well), before any data collection or cleansing. It retrieves data by applying text analytics algorithms and provides a single application that connects all the sources related to a patient’s hospitalization and makes them accessible.

With just one click.
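
To make the idea of retrieving data straight from raw text more concrete, here is a minimal Python sketch. It is not Galileo's actual text analytics (which were written in AQL inside Watson Explorer); the field labels, the sample letter and the function name extract_fields are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical field labels; real discharge letters will differ in wording and layout.
FIELD_PATTERNS = {
    "patient_id": re.compile(r"^Patient ID:[ \t]*(.+)$", re.MULTILINE),
    "diagnosis": re.compile(r"^Diagnosis:[ \t]*(.+)$", re.MULTILINE),
    "procedure": re.compile(r"^Procedure:[ \t]*(.+)$", re.MULTILINE),
}


def extract_fields(text):
    """Return a dict of field name -> extracted value (None when the field is absent)."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1).strip() if match else None
    return record


# Invented sample text, standing in for a plain-text discharge letter.
letter = """Patient ID: 0042
Diagnosis: ventricular tachycardia
Procedure: catheter ablation
"""

record = extract_fields(letter)
print(record)
```

Fields that are not found come back as None rather than being guessed, so gaps in the source text stay visible downstream.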

How Galileo has been developed

The application was developed using IBM Watson Explorer (WEX). In particular, two WEX modules were used:

• WEX Engine: the backbone of the final application. It was used to ingest, process and index plain text (discharge letters, operative reports, preoperative exams and Wikipedia) with custom Text Analytics converters written in AQL (Annotation Query Language), with further refinements made by other custom converters in plain XSLT.

• WEX AppBuilder: Galileo’s front end, a search engine delivered as a user-friendly, responsive web application. It allows searches through the ingested medical records, supports filters and refinements on the search results, and connects the pages related to a patient’s hospitalization, such as registry information, procedural data, preoperative exams, medical therapy and information on the disease extracted from Wikipedia.
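
The ingest, index and search steps described in the two bullets above can be illustrated with a toy inverted index. This is only a sketch of the general technique in Python, not the Watson Explorer API; the document ids, the sample texts and the functions build_index and search are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Invented sample documents standing in for the ingested sources.
documents = {
    "discharge_001": "Discharge letter: ventricular tachycardia, catheter ablation performed.",
    "exam_001": "Preoperative exam: echocardiography within normal limits.",
    "report_002": "Operative report: atrial fibrillation, ablation performed.",
}

index = build_index(documents)
hits = search(index, "ablation performed")
print(sorted(hits))
```

A production search engine adds ranking, facets and linguistic processing on top of this; the sketch only shows why indexing once makes later searches across many sources cheap.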

Moreover, a simplified algorithm was implemented in Galileo that allows the user to predict the position of the arrhythmogenic substrate of a ventricular tachycardia on the patient’s heart by filling in an HTML form with a few questions about the electrocardiographic recording of the arrhythmia.

Galileo’s architecture is shown in Figure 1.

Figure 1. Galileo's architecture.

An example of a hospitalization page as it appears in the final application is shown in Figure 2: widgets with personal data, clinical history, operative procedure and medical therapy present the information as extracted from the raw text. From that page, further hospitalization-related pages can be easily accessed, providing further insight into the patient’s condition.

Figure 2. Hospitalization page in Galileo.

Benefits & Drawbacks

Galileo’s advantages include:

• Availability;

• Continuity;

• Ease of use;

• Ability to manipulate data at different levels of granularity;

• Privacy and security enablement.

In particular, privacy is paramount when manipulating sensitive data. IBM Watson Explorer allows the implementation of authentication and authorization rules down to the level of the single field. This makes the application usable by different types of users for different purposes, for example physicians and administrative personnel, who can be granted access to different levels of information.
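
Field-level authorization of this kind can be pictured with a short Python sketch. It does not reproduce how Watson Explorer configures its security rules; the roles, the field names and the function visible_record are hypothetical.

```python
# Hypothetical mapping from role to the record fields that role may read.
ROLE_FIELDS = {
    "physician": {"patient_name", "diagnosis", "procedure", "therapy"},
    "administrative": {"patient_name", "admission_date", "discharge_date"},
}


def visible_record(record, role):
    """Return only the fields of the record that the given role is allowed to see."""
    allowed = ROLE_FIELDS.get(role, set())  # unknown roles see nothing
    return {field: value for field, value in record.items() if field in allowed}


# Invented sample record.
record = {
    "patient_name": "Jane Doe",
    "diagnosis": "ventricular tachycardia",
    "procedure": "catheter ablation",
    "therapy": "beta blocker",
    "admission_date": "2016-03-01",
    "discharge_date": "2016-03-04",
}

physician_view = visible_record(record, "physician")
admin_view = visible_record(record, "administrative")
print(physician_view)
print(admin_view)
```

Defaulting unknown roles to an empty set means a misconfigured user sees nothing, which is the safer failure mode for sensitive data.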

Furthermore, since the crawling of the connected sources can be scheduled at very short intervals, Galileo can provide data at a frequency comparable to real-time acquisition, parsing new sources as soon as they are added or generated.
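
As a rough illustration of how a scheduled crawl can pick up only what is new, the Python sketch below compares file modification times between passes. It is an assumption-laden example, not the Watson Explorer crawler: the scan function, the folder layout and the file names are invented, and the scheduler that would call scan at fixed intervals is left out.

```python
import tempfile
from pathlib import Path


def scan(folder, seen):
    """Return the .txt files under folder that are new or modified since the last pass.

    seen maps each path to its last known modification time and is updated in place.
    """
    changed = []
    for path in sorted(Path(folder).rglob("*.txt")):
        mtime = path.stat().st_mtime_ns
        if seen.get(path) != mtime:
            seen[path] = mtime
            changed.append(path)
    return changed


# Simulate three crawl passes over a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    seen = {}
    (Path(folder) / "discharge_001.txt").write_text("first letter")
    first = scan(folder, seen)   # picks up the new file
    second = scan(folder, seen)  # nothing has changed since the last pass
    (Path(folder) / "exam_001.txt").write_text("new exam")
    third = scan(folder, seen)   # only the newly added file

print([p.name for p in first], second, [p.name for p in third])
```

The shorter the interval between passes, the closer the index stays to the state of the sources, at the cost of more frequent file system checks.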

Still, data quality remains a key point: text mining relies on the unavoidable assumption that the information is actually reported in the text, and a lack of text integrity, incomplete paragraphs or incomplete records will inevitably result in missing information.


In conclusion, we have discussed the benefits of Galileo, a novel big data application that integrates different, often non-standardized sources, extracts information from raw text and turns it into actionable insights in near real time.

Galileo may help physicians manage large volumes of unstructured medical data easily and quickly from a single transparent and secure application. It could also help hospital administrators and managers assess the cost-effectiveness of an index clinical practice, and support health institutes in population-based medicine analysis.

Finally, the question of how to dig up the hidden treasure of data in the medical field is still open. Active cooperation between different professional figures is a cornerstone, and Business Intelligence companies can bring unique competencies and knowledge to this thrilling challenge.

Nicolò Albanese
Junior Business Analyst