In my last blog post, I described the general MapReduce paradigm, which allows us to create big data processing systems in a fairly simple way. In this post, I will describe in more detail what I'm doing in my project, which big data framework I have chosen, and what the motivations for the project are.
We live in the era of big data. More than 50 million tweets are sent every day, 20 hours of video are uploaded to YouTube every minute, and almost 3 million emails are sent every second. The big data revolution has not bypassed scientists and researchers. For instance, the Large Synoptic Survey Telescope, which is under construction in Chile, will take high-resolution photos of the sky. The quality and size of each photo will be so high that we will need 30 TB of disk space to store such an image. How much data is that? Well, the genomes of 30 thousand people or 60 million books are of similar size. Does that sound like a lot? The Square Kilometre Array, which is planned to be activated in 2020, will generate 1,048,576 TB of data every second. It is impossible for a scientist to analyze such huge amounts of data by hand, which is why there is increasing interest in automatic knowledge discovery techniques for scientists.
One of the ideas for making scientists' lives easier is the concept of Literature-based Discovery (LBD). In LBD we assume that a scientist is not aware of every single paper that appears in their discipline. What's more, if a discovery could be made at the intersection of several disciplines, a researcher is usually aware of the research in only one of them. So it seems possible (and research practice has confirmed it) that discoveries are already there in the huge databases of scientific papers, but they go unnoticed. Let me give you a very general example. One researcher claims in a paper that A is related to B, and a second researcher claims that B is somehow associated with C. However, no one is aware of the apparent connection between A and C. Such discoveries are just waiting to be found: no stroke of genius is required!
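The A, B, C example above can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions: the relations are hypothetical, every relation is treated as symmetric, and the function name `find_hidden_links` is made up for this post.

```python
from collections import defaultdict


def find_hidden_links(relations):
    """Return {(a, c): {b, ...}} for pairs that are never linked directly
    but share at least one intermediate concept b."""
    # Build a symmetric neighbour map from the published relations.
    neighbours = defaultdict(set)
    for x, y in relations:
        neighbours[x].add(y)
        neighbours[y].add(x)

    hidden = defaultdict(set)
    for b, linked in neighbours.items():
        ordered = sorted(linked)
        # Every pair of concepts linked to b is a candidate A-C pair.
        for i, a in enumerate(ordered):
            for c in ordered[i + 1:]:
                if c not in neighbours[a]:  # no paper links a and c directly
                    hidden[(a, c)].add(b)
    return dict(hidden)


# Hypothetical toy relations, each standing for a claim made in some paper.
relations = [("A", "B"), ("B", "C"), ("C", "D")]
hidden_links = find_hidden_links(relations)

for (a, c), via in sorted(hidden_links.items()):
    print(f"{a} may be related to {c} via {sorted(via)}")
```

With real literature the hard part is, of course, extracting reliable relations in the first place and ranking the many candidate pairs.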
In my project, we concentrate on exploring MEDLINE, a database of medical scientific papers. MEDLINE is the largest bibliographic database in the world and it's freely available. Each MEDLINE entry is annotated with around 12 Medical Subject Headings (MeSH) terms, which aim to describe the content of the article in a concise way. We use MeSH terms to construct our knowledge graph, and then we use link prediction techniques to find new, promising connections between terms in the graph. A medical researcher who sees, for example, the system proposing a connection between a disease and a medicine may be inspired to investigate further and check whether the proposed drug really cures the illness. In this way, hopefully, our system will help medicine make progress.
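To make the idea of a knowledge graph with link prediction more concrete, here is a minimal Python sketch. It rests on assumptions that are not part of the project description: two MeSH terms are connected when they annotate the same record, and candidate links are scored with the Jaccard coefficient, which is just one common link prediction measure. The records are toy data loosely modelled on the classic fish oil and Raynaud's disease example from the LBD literature, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(records):
    """Connect two MeSH terms whenever they annotate the same record."""
    graph = defaultdict(set)
    for terms in records:
        for a, b in combinations(sorted(set(terms)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def jaccard_scores(graph):
    """Score every unconnected pair of terms by the Jaccard overlap
    of their neighbourhoods (shared neighbours / all neighbours)."""
    scores = {}
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            continue  # already connected, nothing to predict
        shared = graph[a] & graph[b]
        if shared:
            scores[(a, b)] = len(shared) / len(graph[a] | graph[b])
    return scores


# Hypothetical records; real MEDLINE entries carry around 12 MeSH terms each.
records = [
    ["Fish Oils", "Blood Viscosity"],
    ["Blood Viscosity", "Raynaud Disease"],
    ["Fish Oils", "Platelet Aggregation"],
    ["Platelet Aggregation", "Raynaud Disease"],
]
graph = build_graph(records)
scores = jaccard_scores(graph)

# Highest-scoring candidate connections first.
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 2))
```

Comparing all pairs of terms like this is quadratic in the number of terms, which is fine for a toy graph but quickly becomes expensive on real data.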
Since there are millions of possible connections in the graph, I implement the calculation in Apache Flink. Flink is an open source big data processing engine that greatly extends the MapReduce paradigm. A characteristic feature of Flink is that it treats all data as a data stream. Even if you work with a normal, static dataset, Flink sees it as a finite data stream. This view of data processing allows Flink to perform more optimizations than standard big data frameworks. In particular, Flink does not need to wait for the previous stage of a calculation to finish before starting the next one. Since the output of the previous stage is a data stream, the next stage can process the first result as soon as it is produced, without waiting for the rest of the results.
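The pipelining idea can be illustrated with plain Python generators. To be clear, this is only an analogy and not Flink's API: the stage names `read_records` and `square` are made up, and everything runs in a single process. What it shows is that the second stage starts working on the first record before the first stage has produced the second one.

```python
def read_records(n, log):
    """First stage: emit records one at a time."""
    for i in range(n):
        log.append(f"read {i}")
        yield i


def square(stream, log):
    """Second stage: process each record as soon as it arrives."""
    for x in stream:
        log.append(f"square {x}")
        yield x * x


log = []
results = list(square(read_records(3, log), log))

print(results)
print(log)
```

The log interleaves the two stages (read, square, read, square, and so on), whereas a strictly staged execution would first perform all reads and only then all the squaring.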
In our system, Flink performs all the calculations, and the resulting file is then imported into a web app called LBDream. LBDream is written in Django, a Python-based web framework, and it uses a number of other technologies such as D3.js, Twitter Bootstrap and PostgreSQL. The main feature of the application is a search mechanism that allows users to explore the results of our calculations. We hope to make LBDream accessible to medical researchers around the globe soon.