The data necessary to account for every aspect of our human complexity poses significant challenges to health AI systems. There’s certainly no way around data science to get a hold of it – but don’t you count physicians out too soon!

Precision Medicine holds the promise of bringing highly individualized treatment to every patient. This is a major step forward from traditional medical practice, where the treatment patients receive is determined solely by the characteristics of their disease and possibly a rather coarse set of population-level features (such as age or sex), but often without fully taking the patient’s individual parameters into account.

After all, the patient is a complex bio-psycho-socio-cultural entity: Our health is influenced by our genetically determined biological predispositions (genotype) and their healthy and pathological manifestations within our body (phenotype), and by our psychological preconditions, our social environment, as well as our cultural background. At the same time, these facets determine which treatment options are available, applicable, and acceptable for us. And vice versa, our disease and the medical treatment we receive often impact each of these facets.

Recent advancements in genetic sequencing have fueled Precision Medicine based on genotype data, such as targeted cancer treatment after molecular profiling of the mutations driving cancer in the individual patient (1). At the same time, several highly specialized machine learning approaches have set remarkable landmarks in the automated interpretation of selected areas of medical data (2, 3). However, much of the data describing the other facets of a patient is still largely unused for systematic individualization of treatment – and this data is vast and manifold.

Where we leave our health dataprints and what they look like

Our health-related data intuitively includes electronic documentation generated about us by the health care system, sensor data recorded by wearable devices that track our body signals and fitness parameters, or the data we provide to specialized health platforms. But it also spans well into areas such as our social media footprints where we, for instance, blog, tweet, or post about our symptoms or convey detailed information about our lifestyle (4). It covers our housing and travel record with geospatially determined pollutants, our employment history, information about our fitness club memberships, and even our bare online search history (5).

Beyond where this data is generated and stored, our health data has several other highly heterogeneous properties: it can be structured or unstructured (i.e., free-form text), it can follow standard data models or terminologies (6) (or not), and it can be easily accessible to us or locked in a data silo (7). Each of these properties comes with its very own set of challenges for automated utilisation (8).

Note that in terms of precision medicine, the definition of the individual by their data is largely given by how that individual relates to and differs from the rest of us. Only when enough comparative data about others (9) is available can sound individual treatment decisions be made, and only then can computational methods such as machine learning or similarity search be employed to provide computerized support for medical decision making.
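To make the idea of similarity search over comparative data a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the patient identifiers, the three features (age, BMI, systolic blood pressure), the invented values, and the choice of z-score scaling with Euclidean distance. Real patient similarity would need far richer data and clinically validated measures.

```python
import math

# Hypothetical, invented cohort: (age in years, BMI, systolic blood pressure in mmHg)
COHORT = {
    "patient_a": (34.0, 22.5, 118.0),
    "patient_b": (67.0, 31.0, 150.0),
    "patient_c": (36.0, 23.0, 121.0),
    "patient_d": (70.0, 29.5, 145.0),
}


def feature_stats(vectors):
    """Return per-feature mean and standard deviation across the cohort."""
    columns = list(zip(*vectors))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        # Fall back to 1.0 so a constant feature does not cause division by zero
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return means, stds


def scale(vector, means, stds):
    """z-score one vector so that no single unit (years, mmHg) dominates."""
    return tuple((x - m) / s for x, m, s in zip(vector, means, stds))


def most_similar(query, cohort, k=2):
    """Return the ids of the k cohort members closest to the query patient."""
    means, stds = feature_stats(list(cohort.values()))
    scaled_query = scale(query, means, stds)
    distances = [
        (math.dist(scaled_query, scale(vec, means, stds)), pid)
        for pid, vec in cohort.items()
    ]
    return [pid for _, pid in sorted(distances)[:k]]


# A new (equally hypothetical) patient to compare against the cohort
neighbours = most_similar((35.0, 22.8, 120.0), COHORT, k=2)
print(neighbours)
```

The sketch also shows why comparative data matters: the scaling and the notion of "close" are both derived from the cohort, so with too few or unrepresentative others, the individual cannot be placed meaningfully.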

Obviously, the three Vs of Big Data (10) apply, and we need computers to grasp these vast amounts of heterogeneous data: to ingest, process, combine, analyze, and aggregate the data to allow its interpretation. While this requires full-stack data processing capabilities that cover data engineering, data integration, data analysis, and data science skill sets, it also requires a deep interdisciplinary understanding of the biomedical domain and of our psychological, social, and cultural facets.

Current medical AI is far from eating the whole patient data cake but makes for an indispensable icing

Next to these technical and domain-related challenges, another inherent hurdle with the analysis of patient data (and especially with the accumulation and analysis of large amounts of it from various sources) is the law, which often, and for good reasons, prohibits integration and joint analysis of a broad range of individual data without explicit consent. Addressing this problem, some recent approaches are starting to provide centralized platforms for individual (health) data storage (11), looking to give individuals control of and access to their own data. Such platforms may at some point offer more comprehensive starting points for holistic medical data analysis than we currently have available, provided that data coverage, patient adoption, and algorithmic accessibility are in place.

Given these premises, it can certainly be expected that Doctor AI replacing your very human physician (12) (who is trained to see the bio-psycho-socio-cultural whole of you and may even have a knack for showing it) is still more than a handful of petaflops away. However, every single one of those Vs makes it inevitable to use computational methods wherever we can to gain information from the data we all leave behind faster and more copiously than ever before, to assist medical diagnostics, to enable precision medicine, and to improve patient care.


1. Garraway, L. A., Verweij, J., & Ballman, K. V. (2013). Precision oncology: an overview. J Clin Oncol, 31(15), 1803-1805.

2. e.g., Haenssle, H. A., Fink, C., Schneiderbauer, R., Toberer, F., Buhl, T., Blum, A., … & Uhlmann, L. (2018). Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology, 29(8), 1836-1842.  

3. e.g., Meyer, A., Zverinski, D., Pfahringer, B., Kempfert, J., Kuehne, T., Sündermann, S. H., … & Eickhoff, C. (2018). Machine learning for real-time prediction of complications in critical care: a retrospective study. The Lancet Respiratory Medicine.

4. Paul, M. J., & Dredze, M. (2011). You are what you Tweet: Analyzing Twitter for public health. ICWSM, 20, 265-272.

5. Did you know that, next to Google being able to track infection outbreak and spread from the search terms people enter in certain geographical regions, Microsoft can predict cancer in single users based on their – often presumably unspecific – Bing search profile?

6. e.g., , , ,

7. Highly recommended and with a comprehensive overview graphic: Weber, G. M., Mandl, K. D., & Kohane, I. S. (2014). Finding the missing link for big biomedical data. JAMA, 311(24), 2479-2480.

8. Some of which are highly language specific and only starting to be addressed, see e.g.: Starlinger, J., Kittner, M., Blankenstein, O., & Leser, U. (2017). How to improve information extraction from German medical records. it-Information Technology, 59(4), 171-179.

9. In a balanced way, see also

10. Volume, Variety, Velocity, plus any amendment you may see fit.

11. None of which covers all of the aspects named above, but some do include blockchain-based data transaction control. e.g.,,,

12. Check your favourite social media feed for #digitalhealth to find at least 3 such predictions per week

Johannes Starlinger will be speaking at Data Natives 2018 – the data-driven conference of the future, hosted in Dataconomy’s hometown of Berlin. On the 22nd & 23rd November, 110 speakers and 1,600 attendees will come together to explore the tech of tomorrow. As well as two days of inspiring talks, Data Natives will also bring informative workshops, satellite events, art installations and food to our data-driven community, promising an immersive experience in the tech of tomorrow.
