Parallel algorithm for non-negative matrix tri-factorization

Project reference: 1720
In computational biology we typically have p ≥ 3 sets of data points P1, P2, …, Pp (e.g., sets of different biological objects such as proteins, genes, drugs, and diseases) with |Pi| = ni. For each pair i < j we have a data matrix Dij containing relations between data points from Pi and Pj (inter-type connections, e.g., drugs-genes), while for i = j the data matrices Dii contain intra-type connections (e.g., gene-gene or drug-drug interactions).
One way to mine all these data sets simultaneously is to solve the non-negative matrix tri-factorization problem, which can be formulated as follows:
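The formula itself is not reproduced in this text. A commonly used form of the problem is sketched below as an assumption, not as the project's exact formulation. The factor matrices Gi (of size ni × ki), the middle matrices Sij (of size ki × kj) and the ranks ki are notation introduced here for illustration; the intra-type matrices Dii often enter through additional terms that are not shown.

```latex
\min_{G_i \ge 0,\; S_{ij}} \;
  \sum_{1 \le i < j \le p}
  \left\lVert D_{ij} - G_i \, S_{ij} \, G_j^{T} \right\rVert_F^{2},
\qquad
G_i \in \mathbb{R}^{n_i \times k_i}, \quad
S_{ij} \in \mathbb{R}^{k_i \times k_j}.
```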
Considerable effort has been devoted to solving this hard optimization problem, mainly using the fixed point method. We will take the best existing academic code, parallelize it using C++ and OpenMPI, and test it on a local HPC machine.
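To give a feeling for what a fixed point method looks like here, the following is a minimal serial sketch for a single data matrix D ≈ G1 S G2^T, assuming the standard multiplicative update rules for the squared Frobenius error with non-negative D, G1, S and G2. It is not the academic code referred to above; all names (Matrix, nmtf_sweep, filled, and so on) are hypothetical, and the small constant eps is an assumption added to avoid division by zero.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Dense product A * B (assumes compatible, non-empty matrices).
Matrix matmul(const Matrix& A, const Matrix& B) {
    std::size_t n = A.size(), m = B.size(), p = B[0].size();
    Matrix C(n, std::vector<double>(p, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < m; ++k)
            for (std::size_t j = 0; j < p; ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

Matrix transpose(const Matrix& A) {
    Matrix T(A[0].size(), std::vector<double>(A.size()));
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[0].size(); ++j)
            T[j][i] = A[i][j];
    return T;
}

// Element-wise fixed point step: X <- X .* num ./ (den + eps).
void multiplicative_step(Matrix& X, const Matrix& num, const Matrix& den) {
    const double eps = 1e-12;  // assumption: guards against division by zero
    for (std::size_t i = 0; i < X.size(); ++i)
        for (std::size_t j = 0; j < X[0].size(); ++j)
            X[i][j] *= num[i][j] / (den[i][j] + eps);
}

// Squared Frobenius error ||D - G1 S G2^T||_F^2.
double objective(const Matrix& D, const Matrix& G1, const Matrix& S,
                 const Matrix& G2) {
    Matrix R = matmul(matmul(G1, S), transpose(G2));
    double err = 0.0;
    for (std::size_t i = 0; i < D.size(); ++i)
        for (std::size_t j = 0; j < D[0].size(); ++j) {
            double d = D[i][j] - R[i][j];
            err += d * d;
        }
    return err;
}

// One sweep of multiplicative updates over G1, G2 and S.
void nmtf_sweep(const Matrix& D, Matrix& G1, Matrix& S, Matrix& G2) {
    // G1 <- G1 .* (D G2 S^T) ./ (G1 S G2^T G2 S^T)
    Matrix G2St = matmul(G2, transpose(S));
    multiplicative_step(G1, matmul(D, G2St),
                        matmul(G1, matmul(transpose(G2St), G2St)));
    // G2 <- G2 .* (D^T G1 S) ./ (G2 S^T G1^T G1 S)
    Matrix G1S = matmul(G1, S);
    multiplicative_step(G2, matmul(transpose(D), G1S),
                        matmul(G2, matmul(transpose(G1S), G1S)));
    // S <- S .* (G1^T D G2) ./ (G1^T G1 S G2^T G2)
    Matrix G1t = transpose(G1), G2t = transpose(G2);
    multiplicative_step(S, matmul(matmul(G1t, D), G2),
                        matmul(matmul(matmul(G1t, G1), S), matmul(G2t, G2)));
}

// Deterministic positive starting point (hypothetical helper for the sketch).
Matrix filled(std::size_t rows, std::size_t cols, double base) {
    Matrix M(rows, std::vector<double>(cols));
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            M[i][j] = base + 0.1 * static_cast<double>((i + 2 * j) % 3);
    return M;
}
```

Calling nmtf_sweep repeatedly from a positive starting point keeps all factors non-negative. In such a scheme the dense matrix products dominate the cost, which makes them a natural candidate for distribution across processes, although how the project's code is actually parallelized is not specified here.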
Results will be tested on real data from biomedicine.
Non-negative matrix tri-factorization (NMTF) applied to data associating genes, diseases and drugs can help with (i) reconstructing the data matrices Dij, (ii) co-clustering the data sets, and (iii) detecting new associations (triangles of connections) between the data sets.
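As an illustration of how the factors might be used for these three tasks, the sketch below assumes a factorization D ≈ G1 S G2^T is already available. It assigns each data point to the cluster with the largest entry in its row of the factor matrix (one common convention, assumed here), and lists pairs that have no recorded relation in D but a high reconstructed score. The names, the threshold parameter and the restriction to pairwise associations (not full triangles) are all assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Cluster label of each data point: index of the largest entry in its row of G.
std::vector<std::size_t> assign_clusters(const Matrix& G) {
    std::vector<std::size_t> labels(G.size(), 0);
    for (std::size_t i = 0; i < G.size(); ++i)
        for (std::size_t k = 1; k < G[i].size(); ++k)
            if (G[i][k] > G[i][labels[i]]) labels[i] = k;
    return labels;
}

// Reconstructed entry (G1 S G2^T)[a][b].
double reconstructed(const Matrix& G1, const Matrix& S, const Matrix& G2,
                     std::size_t a, std::size_t b) {
    double s = 0.0;
    for (std::size_t k = 0; k < S.size(); ++k)
        for (std::size_t l = 0; l < S[k].size(); ++l)
            s += G1[a][k] * S[k][l] * G2[b][l];
    return s;
}

struct Candidate {
    std::size_t row, col;
    double score;
};

// Pairs with D[a][b] == 0 whose reconstructed score reaches the threshold.
std::vector<Candidate> candidate_associations(const Matrix& D, const Matrix& G1,
                                              const Matrix& S, const Matrix& G2,
                                              double threshold) {
    std::vector<Candidate> out;
    for (std::size_t a = 0; a < D.size(); ++a)
        for (std::size_t b = 0; b < D[a].size(); ++b) {
            double s = reconstructed(G1, S, G2, a, b);
            if (D[a][b] == 0.0 && s >= threshold) out.push_back({a, b, s});
        }
    return out;
}
```

Candidates found this way are only hypotheses ranked by the model; whether they correspond to real biological associations has to be checked against independent data.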
Project Mentor: Prof. Janez Povh, Ph.D.
Site Co-ordinator: Leon Kos, Ph.D.
Learning Outcomes:
- Using the fixed point method to solve optimization problems;
- C++ and OpenMPI programming on HPC systems.
Student Prerequisites (compulsory):
C++ coding
Student Prerequisites (desirable):
Basics from mathematical optimization or data science
Training Materials:
See attached paper:
- N. Pržulj, N. Malod-Dognin, Network analytics in the age of big data. Science, 353(6295):123–124, 2016
Workplan:
- Week 2: study of the method and design of the development environment
- Week 3: programming the method
- Week 4: testing and improving the code
- Week 5: testing the method on real data
- Weeks 6-7: writing the final report and the presentation
- Week 8: wrapping up
Final Product Description:
The final result will be used within a research project that is already running at the hosting institution and will be incorporated into the dissemination activities of this project.
Adapting the Project: Increasing the Difficulty:
There are several extensions of the underlying optimization problem which we can start working on.
Resources:
The student needs basic knowledge of C++. During the project the student will get access to the local HPC machine.