Link prediction in large-scale networks with Hadoop framework
Project reference: 1621
In this project, we will investigate how link prediction algorithms benefit from parallel computing in Apache Hadoop ecosystem. We will implement parallelized methodology for link prediction algorithms using MapReduce model. We will examine two types of topological features, namely neighbourhood based (e.g., common neighbors, Jaccard coefficient, Adamic/Adar coefficient) and random walk based (e.g., Katz, rooted PageRank, SimRank). We will assess the performance and validity of MapReduce implementation on several large-scale co-occurrence based biomedical networks. In addition, we will explore influence of network properties on the performance of link prediction by varying network parameters (size, density, degree distribution, clustering coefficient, etc.). Finally, we will provide a prototype application for visualization of predicted links in MEDLINE co-occurrence network.
The corpus of biomedical papers in online repositories is of enormous size. A large chunk of potentially useful information in literature remains undiscovered. It is laborious if not impossible for a researcher to keep abreast of all this information, even in her narrow research domain. In this regard, development of automated text mining tools is of utmost importance. Literature-based discovery (LBD) is a mature text mining methodology for automatically generating hypotheses for scientific research by uncovering hidden, previously unknown relationships from existing knowledge. Knowledge in one domain may be related to knowledge in the other domain, without the relationship being known. For example, suppose a researcher has found a relationship between disease X and a gene Y. Further, suppose that a different researcher has studied the effect of substance Z on gene Y. The use of LBD may suggest an XZ relationship, indicating that substance Z may potentially treat disease X. Such a latent link may represent a hypothesis for a potential, yet undiscovered relationship.
Knowledge of a particular biomedical domain can be viewed as a complex network, which is composed of a set of biomedical concepts and the relationships among them. Complex networks are evolving structures that change in time. One of the fundamental problems underlying complex network research is understanding the process underlying link formation. Link prediction is novel research direction at the intersection of network analysis and machine learning. Link prediction aims to estimate the likelihood that a link exist between two nodes, based on the observation of existing links of the nodes. In this work, we will examine link prediction from the novel perspective of LBD. We will mimic the process of LBD as a link prediction problem, where features are represented by various topological similarity indices between biomedical concepts. Each node of the network will represent a biomedical concept, while each edge will represent co-occurrence of the two nodes.
The major challenge of link prediction for LBD stems from massive nature of biomedical data. Therefore, a scalable and efficient approach is needed to accurately predict future connection between entities. Biomedical networks are sparse networks. This means that the set of edges is only a small proportion of all possible edges, which in turn implies an enormous imbalance between positive and negative instances in machine learning application. This problem is often addressed using a simple undersampling strategy. However, it is well known that reducing the majority class results in information loss, which may significantly overly generalize learning rules. Parallelization is therefore the only feasible and meaningful approach for studying link prediction in large-scale biomedical networks.
Project Mentor: Andrej Kastrin
Site Co-ordinator: Leon Kos
Student: Mateusz Lango
- Familiarize with state-of-the-art methods and techniques in literature-based discovery;
- Learn the functionality of big data processing with Apache Hadoop ecosystem;
- Learn to implement computational algorithms in MapReduce;
- Learn to write a scientific paper.
Student Prerequisites (compulsory):
Proficiency in Bash and strong programming background in Python and R.
Student Prerequisites (desirable):
Experience with MapReduce technology and Apache Hadoop framework.
- Bruza, P. & Weeber, M. (2008). Literature-based discovery. Berlin: Springer.
- White, T. (2015). Hadoop: The definitive guide. Beijing: O’Reilly.
- Liben-Nowell, D. & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology. 58(7):1019-1031.
- Lu, L. & Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications. 390(6):1150-1170.
Week 1: Training week
Week 2: Literature review (plan writing)
Week 3-6: Project development
Week 7: Optimization, validation, and visualization
Week 8: Final report write-up
Final Product Description:
The final product will allow predicting links in large-scale complex networks using Apache Hadoop ecosystem. We will publish the developed code on GitHub.com as open-source project. Project results may lead to a publication in high impact biomedical journal (e.g., BMC Bioinformatics, PLoS ONE).
Adapting the Project: Increasing the Difficulty:
The proposed measures for link prediction (neighbourhood based and random walk based) only evaluate network topology. However, each node, which represent biomedical concept, is also associated with plethora of semantic information. A particularly able student may work on development of semantic-based features for link prediction.
The student will have his own desk in an office (4 desks in total). The student will need access to standard computing resources (PC, internet connection) as well as account for the Rudolf (supercomputer at the Faculty of Information Studies in Novo mesto, Slovenija).