Re-engineering and optimizing software for the discovery of gene sets related to disease

Project reference: 2114
Recent technological advances allow us to assess millions of genetic data points (>30 million) to identify their involvement in disease. This is an increase of two orders of magnitude compared to the data available a few years ago. In biology, pathway analysis aims to find genes that work together, to improve our understanding of the genetic basis of diseases. This project will focus on re-engineering genomicper (https://CRAN.R-project.org/package=genomicper), a pathway analysis package, to allow efficient analyses at the scale we need today and to future-proof it as genomic data grows further. Genomicper is an R package, currently available from the CRAN repository.
This project aims to re-engineer genomicper so that it can analyse both existing and larger data sets quickly. Applying the algorithm to large-scale data sets is currently hindered by two factors:
- the software is a collection of functions written in R, and
- it performs thousands of independent permutations on the data set sequentially.
Thus, the objectives of this project are:
- Understand the circular permutation algorithm (https://pubmed.ncbi.nlm.nih.gov/22973544/) that underlies the genomicper analysis
- Profile and baseline the code performance to identify bottlenecks and opportunities for software improvement in performance and functionality
- Re-write the base algorithm in C/C++ and embed the code in R
- Benchmark and test the performance of any new algorithm with varying input sizes (e.g. 10/20/30 million data points)
Any resulting code improvements should be contributed back to the existing CRAN R package.
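To give a flavour of the kind of C++ kernel the objectives above point towards, the following is a minimal sketch of a circular permutation test. It is not taken from genomicper: it assumes that p-values are stored in genomic order, that a gene set is given as a list of positions, and that the statistic is simply the number of set members below a significance threshold. The names `count_hits` and `empirical_pvalue` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Count how many members of a gene set fall below `threshold` after the
// p-values have been rotated by `offset` positions around a circular genome.
// Assumption: pvalues[i] is the p-value at position i in genomic order and
// set_idx holds the positions belonging to the set.
std::size_t count_hits(const std::vector<double>& pvalues,
                       const std::vector<std::size_t>& set_idx,
                       std::size_t offset, double threshold) {
    assert(!pvalues.empty());
    const std::size_t n = pvalues.size();
    std::size_t hits = 0;
    for (std::size_t idx : set_idx) {
        assert(idx < n);
        // Index arithmetic stands in for physically rotating the vector.
        if (pvalues[(idx + offset) % n] < threshold) ++hits;
    }
    return hits;
}

// Empirical p-value for the set: the fraction of random rotations whose hit
// count is at least as large as the observed one, with the usual +1
// correction so that the estimate is never exactly zero.
double empirical_pvalue(const std::vector<double>& pvalues,
                        const std::vector<std::size_t>& set_idx,
                        double threshold, std::size_t n_perm, unsigned seed) {
    assert(pvalues.size() > 1);
    std::mt19937_64 rng(seed);
    // Offsets 1 .. n-1: every non-trivial rotation of the genome.
    std::uniform_int_distribution<std::size_t> pick(1, pvalues.size() - 1);
    const std::size_t observed = count_hits(pvalues, set_idx, 0, threshold);
    std::size_t as_extreme = 0;
    for (std::size_t p = 0; p < n_perm; ++p) {
        if (count_hits(pvalues, set_idx, pick(rng), threshold) >= observed) {
            ++as_extreme;
        }
    }
    return (1.0 + static_cast<double>(as_extreme)) /
           (1.0 + static_cast<double>(n_perm));
}
```

Two design points are worth noting. The rotation is expressed as an index offset, so none of the tens of millions of p-values is copied per permutation. Each iteration of the permutation loop is independent of the others, so the loop is a natural candidate for being split across threads or processes, provided each worker uses its own random number stream.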

Factors related to particular types of disease (from DOI: 10.1534/g3.112.002618).
Project Mentor: Dr. Mario Antonioletti
Project Co-mentor: Dr. Pau Navarro
Site Co-ordinator: Catherine Inglis
Participants: İrem Okur, Aybüke Özçelik
Learning Outcomes:
- Learn and implement the process of optimising a real-world piece of code.
- Learn about tool sets that can be used to achieve the first outcome.
Student Prerequisites (compulsory):
The student should have skills in C or C++.
Student Prerequisites (desirable):
Knowledge of parallel programming techniques. Some experience with R.
Training Materials:
Useful links:
https://rstudio.github.io/profvis/
https://cran.r-project.org/web/views/HighPerformanceComputing.html
https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html
Workplan:
The work will essentially consist of slight variants of the following cycle: baseline the code performance for different data sizes and construct tests for correctness; profile the code; plan changes; apply optimisations; go back to the baselines and tests.
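One iteration of this cycle can be sketched in C++ as follows. This is an illustration only: `reference_sum` and `optimised_sum` are hypothetical stand-ins for the routine before and after a change, and `best_time_ms` is a hypothetical helper that reports the best wall-clock time over a few repeats.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical baseline routine: the behaviour every later version must keep.
double reference_sum(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v;
    return s;
}

// Hypothetical candidate replacement for the baseline routine.
double optimised_sum(const std::vector<double>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0);
}

// Run `work` `repeats` times and return the fastest run in milliseconds.
// Taking the minimum reduces the influence of noise from other processes.
template <typename F>
double best_time_ms(F&& work, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    double best = 0.0;
    for (int r = 0; r < repeats; ++r) {
        const auto start = clock::now();
        work();
        const std::chrono::duration<double, std::milli> elapsed =
            clock::now() - start;
        if (r == 0 || elapsed.count() < best) best = elapsed.count();
    }
    return best;
}
```

In use, the correctness check comes first (for example, asserting that `reference_sum(x) == optimised_sum(x)` on a fixed input), and only then are both versions timed with `best_time_ms` at each input size of interest. The result of the timed call should be stored in a variable that is read afterwards, so the compiler cannot optimise the work away.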
Final Product Description:
An R package that can process existing problems faster and be able to tackle bigger problems than is currently possible.
Adapting the Project: Increasing the Difficulty:
For a weaker student, the project can be restricted to understanding the performance bottlenecks and suggesting improvements, while a stronger student can implement actual improvements such as converting R code to C/C++.
Adapting the Project: Decreasing the Difficulty:
As above: a weaker student can baseline the code, understand what changes would need to be made and possibly implement some of the simpler cases. The same process could still be carried out, but the level of change implemented would be more restricted.
Resources:
All the codes and required tools are open source and/or freely available. For bigger problems, access to EPCC systems can be given; Cirrus, for example, should suffice. Suitable data sets for benchmarking and profiling will be provided.
Organisation:
EPCC