Leveraging HPC to test quality and scalability of a genetic analysis tool

Project reference: 2201
Loss of heterozygosity (LOH) is an evolutionary dynamic whereby heterozygous genomes progressively lose allele diversity under natural selection. The result is a set of genomic blocks (LOH blocks) in which multiple homozygous alleles are found in series. LOH blocks are often extracted from a genome to study its adaptation strategy to a specific environment. The block size, distribution, gene content and retained alleles are very informative about how the species is adapting to the environment where it lives. The study of LOH blocks has gained momentum over the last decade because of its importance in genome evolution, but this has not translated into a large amount of software developed for the purpose. Most studies in this area rely on self-developed scripts that never become general software that the rest of the community can use. To fill this gap, our group is developing an algorithm named “JLOH”, a general tool to extract LOH blocks from single-nucleotide polymorphism (SNP) data and a reference genome sequence. The core algorithm has been developed and has already returned promising results on a few test sets, but extensive testing under different conditions must be carried out before it is opened to the whole scientific community. The testing will cover the code’s: 1) scalability (CPU, memory), 2) ability to cope with different genetic properties (GC content, repeat content), 3) ability to cope with different data properties (coverage depth, SNP quality, SNP density). The testing will be conducted on the “MareNostrum4” HPC cluster at the BSC in Barcelona, providing the student with a state-of-the-art infrastructure to work with.
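To make the idea of an LOH block concrete, the sketch below reports every stretch of a chromosome that contains no heterozygous SNP and is at least a minimum length. It is a deliberately simplified illustration and not JLOH's actual algorithm, which is not described here. The function name, the input format (a plain list of heterozygous SNP positions) and the 5000 bp default are assumptions made for this example only; a real tool would also have to account for properties such as coverage depth and SNP quality.

```python
from typing import List, Tuple


def find_loh_blocks(
    het_positions: List[int],
    chrom_length: int,
    min_len: int = 5000,  # assumed threshold, for illustration only
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) intervals that contain no
    heterozygous SNP and span at least `min_len` bases.

    Hypothetical helper: candidate LOH blocks are simply the gaps
    between consecutive heterozygous SNPs.
    """
    blocks = []
    prev = 0  # position of the previous heterozygous SNP (0 = before the start)
    # A sentinel one base past the chromosome end closes the last gap.
    for pos in sorted(het_positions) + [chrom_length + 1]:
        start, end = prev + 1, pos - 1
        if end - start + 1 >= min_len:
            blocks.append((start, end))
        prev = pos
    return blocks


# Toy input: heterozygous SNP positions on a 30 kb chromosome.
het = [100, 250, 400, 9000, 9100, 20000]
blocks = find_loh_blocks(het, chrom_length=30000, min_len=5000)
print(blocks)  # [(401, 8999), (9101, 19999), (20001, 30000)]
```

In this toy input, the three dense clusters of heterozygous SNPs are kept out of the result, and only the long SNP-free stretches between and after them are reported as candidate blocks.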

From Pryszcz et al. (2015) The Genomic Aftermath of Hybridization in the Opportunistic Pathogen Candida metapsilosis. PLoS Genet 11(10): e1005626. https://doi.org/10.1371/journal.pgen.1005626.
Heterozygous regions are in white, while homozygous blocks of at least 5 kb are depicted in grey. Loss of heterozygosity (LOH) detection is described in the paper’s methods. The method applied in Pryszcz et al. (2015) is the basis of JLOH, the software that will be tested within the scope of this internship.
Figure copyright is not infringed, as Toni Gabaldón is one of the authors of this paper.
Project Mentor: Toni Gabaldón
Project Co-mentor: Matteo Schiavinato
Site Co-ordinator: Toni Gabaldón
Learning Outcomes:
At the end of the internship, the student will have gained substantial knowledge of how to use an HPC cluster, both in day-to-day usage and in good practices for organizing data and workflows. The student will have become familiar with the standard procedures involved in developing and testing software, and will have learned how to fully leverage the power of HPC in data analysis.
Student Prerequisites (compulsory):
- Basic Linux shell skills
- Basic knowledge of genomics
Student Prerequisites (desirable):
- Understanding of the Python programming language
- Familiarity with the concept of DNA variation (even if only from classes)
Training Materials:
The student is encouraged to use the many free online resources available at websites such as “tutorialspoint” or “codeschool” to get familiar with:
- The GitHub environment
- The basics of an HPC cluster, e.g. the SLURM queuing system
Prior knowledge is not needed, and these topics will be addressed together during the first week. It will, however, be very useful for the student to become familiar with the concepts before the internship.
Workplan:
- Week 1 (training week): the student will learn the basic functioning of the algorithm and become familiar with the MareNostrum4 HPC cluster, with close support from the research group.
- Weeks 2-3-4: the student will test the scalability of the software in terms of computational resources using pre-existing data, submitting a workplan by week 3.
- Weeks 5-6-7: the student will scan the literature for available datasets that can be used to test the tool, simulate further data to add to this collection (if time permits), and test the software with these datasets.
- Week 8: summary of the experience and wrap-up, with a final report submitted by the end of the week.
If two online students are selected, since online supervision may be less effective, one student will take care of the scalability testing and the other of the dataset testing for the duration of the internship.
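As a rough illustration of what the scalability testing in weeks 2-3-4 could measure, the sketch below times one command at several thread counts and derives speedup and parallel efficiency relative to the smallest thread count. It is only a sketch under stated assumptions: the helper names are hypothetical, and `build_cmd` runs a placeholder Python command because the tool's real command line is not given in this description. On a cluster, each timing would normally come from a separate job submitted through the queuing system rather than from a loop on a login node.

```python
import subprocess
import sys
import time
from typing import Dict, List, Tuple


def time_command(cmd: List[str]) -> float:
    """Run `cmd` once and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def scaling_table(timings: Dict[int, float]) -> List[Tuple[int, float, float, float]]:
    """Turn {threads: seconds} into rows of
    (threads, seconds, speedup, efficiency), relative to the run
    with the fewest threads."""
    base_threads = min(timings)
    base_time = timings[base_threads]
    rows = []
    for threads in sorted(timings):
        speedup = base_time / timings[threads]
        # Efficiency 1.0 means perfect scaling from the baseline.
        efficiency = speedup * base_threads / threads
        rows.append((threads, timings[threads], speedup, efficiency))
    return rows


def build_cmd(threads: int) -> List[str]:
    """Placeholder command (assumption): replace with the real tool
    invocation and its thread option once these are known."""
    return [sys.executable, "-c", "pass"]


timings = {n: time_command(build_cmd(n)) for n in (1, 2, 4)}
for threads, seconds, speedup, efficiency in scaling_table(timings):
    print(f"{threads:>3} threads  {seconds:8.3f} s  "
          f"speedup {speedup:5.2f}  efficiency {efficiency:5.2f}")
```

Because the placeholder command does no real work, the printed numbers are meaningless until it is replaced. With real timings, an efficiency that drops well below 1.0 as threads increase indicates where adding more cores stops paying off.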
Final Product Description:
This study is a necessary step in software development. Its results will attest to the core algorithm’s ability to perform its function under multiple conditions, on different data, and while making full use of the available computational resources. If the results are incorporated into a scientific publication, the student will be considered a co-author.
Adapting the Project: Increasing the Difficulty:
If the difficulty needs to be raised, the student(s) will be asked to simulate testing data under a larger number of variables, finding potential bugs in the software and increasing the scientific value of the project results.
Adapting the Project: Decreasing the Difficulty:
If the students need more training than expected, an extra week will be dedicated to teaching them more of the HPC basics and good practices. The work will then be limited to scalability testing and non-simulated data, securing an outcome for the student(s) and for the group.
Resources:
The student will need a computer with a Linux-based operating system or, at a minimum, access to a terminal. If the student attends in person, this can be provided by the institute or the research group. The student will be granted access to MareNostrum4 at the BSC. The data and computing hours used by the student will be placed under the projects of Prof. Gabaldón.
Organisation:
BSC – Computer Science – European Exascale Accelerator