Cardiomyopathies (CMPs) are internationally defined as heart diseases with structurally and functionally abnormal myocardium not explained by coronary artery disease, hypertension or valvular heart disease [1]. Many CMP patients have a family history of disease, which typically follows an autosomal dominant inheritance pattern. In the Netherlands, it is estimated that 1 in 200 individuals carries a genetic predisposition for a CMP [3]. However, penetrance is incomplete and clinical expression of CMPs is heterogeneous, ranging from overt heart failure and lethal arrhythmias to an asymptomatic course [2]. Despite major advances in our understanding of the genetics of these diseases, our knowledge of the pathophysiological substrate of CMPs is limited, and CMPs remain a leading cause of premature sudden cardiac death and end-stage heart failure in persons below the age of 60 years [7].
Integrating electronic health records (EHRs) with research data platforms (RDPs) can yield new insights into disease penetrance, risk assessment and disease pathophysiology. In their current format, EHRs comprise both structured and unstructured electronic data that have been gathered, captured and assessed during routine clinical care [8]. Major opportunities lie in the standardisation of unstructured data, such as clinical notes and investigations [8]. Integrating these data with other data sources, including outcome registries, imaging, wearables and research measurements (-omics), has the potential to offer higher-resolution data on disease epidemiology, onset and progression.
In this article, we present the design of the UNRAVEL RDP, in which a large dataset of CMP patients is enriched by text mining and linked to biomaterials. The UNRAVEL RDP aims to improve the daily care of CMP patients and their family members by (1) providing a standardised database with routine health care data linked to research-generated data that are easily accessible for big data analytics; (2) facilitating the harmonisation of data and clinical care protocols and the sharing of algorithms on www.unravelrdp.nl; and (3) providing the basis for approaching patients for in-depth biological research through the generation of induced pluripotent stem cells.
An overview of the preliminary results is provided in Tab. 1. By October 2018, 1928 individuals had been asked to participate in the UNRAVEL RDP. Of these, 828 individuals provided consent, of whom 58% are male. Median current age is 57 years (interquartile range (IQR) 45–67). Overall, the available data comprise 18,565 ECGs with a median of 74 per patient (IQR 32–105), 3619 echocardiograms with a median of 12 per patient (IQR 5–18), data from over 20,000 radiological examinations, including 389 cardiac MRI scans, and 650,000 individual laboratory results. Data from non-cardiac examinations, e.g. orthopaedic MRI or endoscopy, are also available. In 356 participants, a diagnosis of heart failure had been registered according to the diagnosis thesaurus described earlier: 222 have dilated CMP and 38 hypertrophic CMP. Blood from 267 patients has thus far been stored in the biobank according to protocol. To date, 323 mutations have been identified, primarily in PKP2 (17%) and TTN
Tab. 1 Clinical characteristics and available tests of 828 patients included in UNRAVEL. Data are presented as number (median, IQR)
Median age: 57 years (IQR 45–67)
Diagnosis as registered in EHR
Cardiac ultrasound images: 3619 (12, IQR 5–18)
ECGs: 18,565 (74, IQR 32–105)
274 (7, IQR 3–15)
Cardiac MRI scans: 389 (2, IQR 1–3)
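The "number (median, IQR)" summaries reported above can be derived from a list of per-patient test counts. The following is a minimal sketch in Python, not the UNRAVEL code: the function name `summarise_counts` and the example counts are hypothetical, and the quartile convention (the "inclusive" method of the standard library) is an assumption, since the article does not state which one was used.

```python
from statistics import median, quantiles

def summarise_counts(tests_per_patient):
    """Return (total, median, q1, q3) for a list of per-patient test counts."""
    total = sum(tests_per_patient)
    # Quartiles with the "inclusive" convention (an assumption, see lead-in).
    q1, _, q3 = quantiles(tests_per_patient, n=4, method="inclusive")
    return total, median(tests_per_patient), q1, q3

# Invented counts for five hypothetical patients (not UNRAVEL data).
ecg_counts = [2, 4, 6, 8, 10]
total, med, q1, q3 = summarise_counts(ecg_counts)
print(f"{total} ({med}, IQR {q1:g}–{q3:g})")
```

Different quartile conventions can give slightly different IQR bounds on small samples, so the convention should be reported alongside the summary.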
Knowledge of the aetiology, the diagnostic performance of clinical investigations and disease modifiers in CMPs is still limited, complicating the clinical care of these patients [2]. Research databases based on large numbers of patients provide the infrastructure for new insights into these diseases. To date, patient registries have typically had fixed time points at which data are entered manually, with data entry at the discretion of the researcher, so that a vast amount of (meta)data gathered during routine clinical care is disregarded. Current advanced EHR systems provide opportunities to access all data gathered in routine clinical care, which can be linked to research data. The resulting datasets will have higher resolution and may provide new insights into disease penetrance, risk assessment and disease pathophysiology [8]. The UNRAVEL RDP incorporates these large automated and standardised datasets of CMP patients, enriched with language processing and text retrieval. Advantages include (1) automation and efficiency, (2) temporal or sequential data, (3) the possibility of EHR-embedded trials and (4) mining of unstructured data using text analysis.
EHR data are extracted and standardised in the UNRAVEL RDP, which has thus far led to a dataset comprising 828 patients with a total of 18,565 ECGs, 3619 echocardiograms, 389 cardiac MRI scans and 323 patients with mutated genes (Tab. 1). The RDP automatically provides these raw (meta)data. This obviates the need for laborious, manually maintained registries, saving the precious time of (medical) experts and reducing transcription errors. Furthermore, since outcomes such as admission, heart transplantation and (cardiac) death are automatically extracted from the EHR, obtaining follow-up will be less time-consuming, thereby reducing costs [11].
With the RDP, these data can be integrated into a detailed longitudinal picture of the clinical course of a patient, a “human phenome sequence” [8]. In previous studies, (semi-)supervised and unsupervised machine learning on linked EHR data was able to solve problems in prediction and pattern recognition [8]. However, routine clinical records can be sparsely filled and (ontological) definitions of disease may differ over time. To counter these issues, Beaulieu-Jones et al. [18] proposed a semi-supervised machine learning method to analyse these high-dimensional EHR data: phenotypes are constructed by unsupervised learning, patients are then clustered into sub-phenotypes and survival analyses are performed. Furthermore, large datasets such as the UNRAVEL RDP are prone to generating associations with uncertain causal relevance. To address causality, the addition of our stem-cell informed consent serves as a stepping stone for functional follow-up studies using induced pluripotent stem cells. Additional statistical frameworks such as instrumental variables and Mendelian randomisation, or further research in randomised clinical trials, may also provide further support for observed associations [19].
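The final step of the approach described above, survival analysis per sub-phenotype, can be illustrated with a Kaplan-Meier estimate computed separately for each cluster. The following is a minimal sketch only and is not the pipeline of Beaulieu-Jones et al.: it assumes that cluster labels have already been assigned by some unsupervised method, and the function names, the labels "A" and "B" and the follow-up data are all hypothetical.

```python
from collections import defaultdict

def kaplan_meier(durations, events):
    """Return [(time, survival)] at each time with at least one event.

    durations: follow-up time per patient
    events:    1 if the outcome occurred at that time, 0 if censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        occurred = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        leaving = sum(1 for d in durations if d == t)  # events plus censored at t
        if occurred:
            survival *= 1 - occurred / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

def survival_by_cluster(records):
    """records: iterable of (cluster_label, duration, event) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for label, duration, event in records:
        grouped[label][0].append(duration)
        grouped[label][1].append(event)
    return {label: kaplan_meier(d, e) for label, (d, e) in grouped.items()}

# Invented follow-up data in years for two hypothetical sub-phenotypes.
records = [("A", 1, 1), ("A", 2, 0), ("A", 3, 1), ("A", 4, 1),
           ("B", 2, 0), ("B", 5, 1), ("B", 6, 0)]
for label, curve in survival_by_cluster(records).items():
    print(label, curve)
```

A real analysis would add confidence intervals and a formal comparison between clusters (for example a log-rank test), which this sketch omits.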
To embed clinical trials, data in the UNRAVEL RDP can be used for trial feasibility and patient recruitment, but also for remote data monitoring, potentially reducing clinical trial costs and selection bias (pragmatic trials). Using the UNRAVEL RDP, it is possible to perform interventions and measure outcomes during routine health care, ranging from lifestyle interventions to logistical questions on how often a patient should be followed up. EHRs can be an alternative to electronic case registration forms, provided that data are consistently collected in routine clinical care, including data on (adverse) events [20].
Structured EHR data such as encoded diagnoses and cardiac ultrasound are the easiest data sources to process, but advances in text mining have made it possible to also use unstructured clinical data, such as patient medical histories, discharge summaries and clinical notes [10]. Using a text-retrieval algorithm, we have developed a tool to extract standardised data from clinical notes. This tool is, however, still under development and has so far been implemented on clinical notes from the Department of Cardiology at the UMC Utrecht. Therefore, in other centres the tool should be used with caution and under the supervision of a medical expert.
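To make the idea of extracting standardised data from free text concrete, the following minimal sketch pulls left ventricular ejection fraction (LVEF) percentages out of a note with a regular expression. It is an illustration only and not the UNRAVEL text-retrieval tool: the pattern, the function name `extract_lvef` and the English example note are assumptions, and issues such as negation, abbreviations and other languages are not handled.

```python
import re

# Hypothetical pattern: matches e.g. "LVEF 35%" or "ejection fraction: 40 %".
LVEF_PATTERN = re.compile(
    r"\b(?:LVEF|ejection fraction)\b\s*(?:is|of|:|=)?\s*(\d{1,2})\s*%",
    re.IGNORECASE,
)

def extract_lvef(note):
    """Return all LVEF percentages mentioned in a free-text note as integers."""
    return [int(match.group(1)) for match in LVEF_PATTERN.finditer(note)]

# Invented example note (not patient data).
note = "Echo today: LVEF 35%. Previous ejection fraction: 40 %."
print(extract_lvef(note))
```

Rule-based patterns of this kind are easy to audit but tend to be specific to the phrasing used in one department, which is consistent with the caution advised above for use in other centres.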
EHR data that are subjected to robust pre-processing and cleaning have been shown to offer a common scaffold upon which research questions can be built and linked to datasets, enabling new areas of research [9]. With these “big” EHR data, however, great challenges and responsibilities arise: data governance, data access, public trust, definitions of disease and development of replicable scientific tools. Furthermore, these large datasets are prone to generating associations with great uncertainty regarding causality. Therefore, analysis and interpretation of data must be performed by a multidisciplinary team including medical experts, epidemiologists and data scientists. Only if the data are understood and carefully evaluated can new models explaining onset and progression of disease be developed [8].
In conclusion, the UNRAVEL RDP is an enriched data platform for CMPs that combines EHR data with a standardised blood biobank and text-mining tools. This integration of EHR data into the RDP allows novel analysis of the onset and progression of disease and can embed performance measures in clinical practice. Laboratory protocols, informed consent forms and algorithms are available on www.unravelrdp.nl. Protocols have been shared thus far with the University Medical Centre Groningen, Amsterdam University Medical Centre and Bergman Clinics, and we explicitly welcome national and international cooperation with the UNRAVEL team to harmonise protocols.