Prof Valentina Escott-Price, UK DRI Group Leader at Cardiff, studied mathematics at St Petersburg University in Russia before embarking on her PhD in statistics at Cardiff University. She has helped discover 11 new susceptibility loci for Alzheimer’s disease. Here she talks to us about big data, working in high-dimensional space and the challenge of persuading others to share their datasets.
You started out as a mathematician, then became a statistician. What first got you interested in applying maths to the study of disease?
My first research contract after my PhD in statistics was with Cardiff University’s School of Medicine. My task was to fix and maintain a database for a schizophrenia research team. Fixing the database took about a week and maintenance didn’t require much time, so I volunteered to do some genetic data analysis. I’m from the Soviet Union, where genetics and cybernetics were two areas of research that were suppressed, so for me, working in these two fields at the same time was just fascinating. I then got interested in genetic epidemiology, which struck me as being entirely logical and mathematical. I worked for about ten years in schizophrenia, bipolar disorder and depression, before becoming more and more involved in Alzheimer’s disease and dementia.
What is the biggest challenge you face when trying to make sense of large datasets?
Actually, working with large datasets in high-dimensional space isn’t so much the challenge; for me this is the fun bit! A point in a dataset in high-dimensional space behaves very differently from one in low-dimensional space, and this is a mathematical challenge, but a really interesting one. One of the main challenges is actually to explain to biologists and clinicians that things aren’t as simple as they might expect when looking at large, complex datasets. We sometimes need to encourage our colleagues to step outside the comfort zone of their common sense.
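As an editorial aside, one well-known way in which points in high-dimensional space behave differently is that distances tend to "concentrate": the nearest and farthest neighbours of a point end up at almost the same distance. The sketch below is only an illustration of that general effect, not the analysis used in Prof Escott-Price's work. The function name `relative_contrast`, the uniform random data and the choice of 2 versus 1,000 dimensions are all assumptions made for the example.

```python
import math
import random


def relative_contrast(dim, n_points=500, seed=0):
    """Return (farthest - nearest) / nearest distance from one query point.

    Points are drawn uniformly from the unit cube; this is an assumed toy
    dataset, not real genetic data.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


low = relative_contrast(dim=2)
high = relative_contrast(dim=1000)
print(f"2 dimensions:    relative contrast = {low:.2f}")
print(f"1000 dimensions: relative contrast = {high:.2f}")
```

In two dimensions the nearest point is typically far closer than the farthest one, so the contrast is large; in a thousand dimensions the two distances are nearly equal, which is one reason everyday intuition about "near" and "far" can mislead in large datasets.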
Another big challenge is data sharing. We need to trust each other, link our datasets together, harmonise them, then move outside the standard analysis to come up with something new and better. We mathematicians can’t come up with any new ways of analysing data if it is not shared with us.
What does it mean to say that Alzheimer’s has a ‘significant polygenic component’?
When we say that a complex neurological disorder has a high polygenic component, we simply mean that multiple small effects of genes combine, along with environmental factors, to trigger disease. All of us carry some risk genes, and people carry different genes in different chromosomal positions. So it’s often impossible to say one gene is responsible for a disease; it’s much more complex than that. What we try to do is use datasets that combine genetic and environmental factors for the same people, as with the UK Biobank, so we have genetic information but also know about people’s lifestyle, exercise, medical conditions and so on. We can then model these together to find the best prediction model of a disease.
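To make the idea of many small genetic effects combining with environmental factors concrete, here is a minimal sketch of one common way such a model can be written: a weighted sum of risk alleles (often called a polygenic score) added to weighted lifestyle factors inside a logistic model. This is an assumed, simplified form, not the specific model described in the interview. Every variant name, weight and the baseline value below is invented for illustration and is not taken from any study or from the UK Biobank.

```python
import math

# Hypothetical per-variant effect sizes (log odds ratios), invented for illustration.
EFFECT_SIZES = {"variant_a": 0.10, "variant_b": 0.05, "variant_c": -0.03}

# Hypothetical weights for lifestyle factors, also invented.
ENVIRONMENT_WEIGHTS = {"smoker": 0.4, "weekly_exercise_hours": -0.05}

# Assumed baseline log odds of disease when all inputs are zero.
INTERCEPT = -3.0


def polygenic_score(allele_counts):
    """Sum of effect size times the number of alleles carried (0, 1 or 2)."""
    return sum(
        EFFECT_SIZES[variant] * allele_counts.get(variant, 0)
        for variant in EFFECT_SIZES
    )


def predicted_risk(allele_counts, environment):
    """Combine genetic and environmental terms in a logistic model."""
    log_odds = INTERCEPT + polygenic_score(allele_counts)
    log_odds += sum(
        ENVIRONMENT_WEIGHTS[factor] * environment.get(factor, 0)
        for factor in ENVIRONMENT_WEIGHTS
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


person_1 = predicted_risk(
    {"variant_a": 2, "variant_b": 1},
    {"smoker": 1, "weekly_exercise_hours": 0},
)
person_2 = predicted_risk(
    {"variant_a": 0, "variant_c": 2},
    {"smoker": 0, "weekly_exercise_hours": 5},
)
print(f"Person 1 predicted risk: {person_1:.3f}")
print(f"Person 2 predicted risk: {person_2:.3f}")
```

No single variant dominates here: each contributes a small amount, and the prediction only shifts noticeably once several genetic and lifestyle terms are added together. In practice the weights would be estimated from data rather than chosen by hand.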