Divide and Conquer
Hello and welcome back!
In my previous blog post, I gave an overview of the file types used to handle large genomic datasets for analysis (which we use as the inputs for the toolkit JLOH), the resulting output files, and how these output files look when visualized. As promised, I will now go into how JLOH does what it does, why it is efficient at what it does, and what I have been doing to test this efficiency and scalability.
JLOH, the algorithm
Armed with the input VCF, FASTA, and BAM files, JLOH first separates the homozygous regions of the hybrid genome from the heterozygous ones and writes them to two separate VCF files. Each variant in the input VCF carries a genotype in its FORMAT field (the GT tag) that shows whether it is heterozygous or homozygous. JLOH then calculates the heterozygous and homozygous SNP densities along both parental genomes and clusters together the regions of high homozygosity and high heterozygosity. The remaining regions, with low SNP density, are where some information has been lost through LOH, and these are the ones we are interested in. They are labeled REF and ALT blocks, depending on whether they retain the reference or the alternative allele, and are compared against the heterozygous regions, trimming any overlaps.
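To illustrate how such a genotype-based split can work, here is a small sketch (not JLOH's actual code) that classifies a VCF record as heterozygous or homozygous from the GT tag in its FORMAT column:

```python
# Sketch: classify a VCF record as heterozygous or homozygous
# from the GT tag in the FORMAT field. Not JLOH's actual code.

def genotype_class(vcf_line):
    """Return 'het' or 'hom' for a single tab-separated VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt_keys = fields[8].split(":")      # FORMAT column, e.g. "GT:DP:AD"
    sample = fields[9].split(":")        # first sample column
    gt = sample[fmt_keys.index("GT")]    # genotype, e.g. "0/1" or "1|1"
    alleles = gt.replace("|", "/").split("/")
    # two different alleles -> heterozygous; identical -> homozygous
    return "het" if len(set(alleles)) > 1 else "hom"

record = "chr1\t1042\t.\tA\tG\t60\tPASS\t.\tGT:DP\t0/1:35"
print(genotype_class(record))  # → het
```

A full tool would of course stream through the whole file and route each record to the appropriate output VCF, but the decision per record boils down to this comparison of the genotype's alleles.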
At this point, JLOH can analyze these REF and ALT blocks for their read coverage. Read coverage (or depth) is the number of reads that align over a given position in the sequence. Blocks whose read coverage is above a specified threshold are discarded, along with any uncovered regions. Using these read coverage profiles, JLOH then determines each block's zygosity, whether 'hemi' or 'homo': homozygous regions display a read coverage uniform with their surroundings, while hemizygous regions show reduced coverage compared to their neighboring regions.
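A coverage-based zygosity call can be sketched roughly as follows; the 0.75 ratio threshold here is my own illustrative assumption, not an actual JLOH parameter:

```python
# Illustrative sketch, not JLOH's actual heuristic: call a block
# hemizygous when its mean coverage drops well below that of its
# flanking regions. The 0.75 ratio is an assumed example threshold.

def call_zygosity(block_cov, flank_cov, ratio=0.75):
    """Return 'hemi' if block coverage falls below `ratio` times the
    flanking coverage, 'homo' if it is uniform with its surroundings."""
    if flank_cov == 0:
        return "uncovered"
    return "hemi" if block_cov / flank_cov < ratio else "homo"

print(call_zygosity(18.0, 40.0))  # roughly half the neighboring coverage
print(call_zygosity(38.0, 40.0))  # uniform with the neighboring coverage
```

The first block, with roughly half the coverage of its neighborhood, would be called hemizygous; the second, with near-uniform coverage, homozygous.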
JLOH then outputs a table of all the LOH blocks found, their zygosity (homo or hemi), and their allele type (REF or ALT), along with the other statistics established along the way. With this information, I bet the visualization of the output files that I shared in the previous article now makes sense.
Given the scale of information that we are dealing with, this analysis should take a long time to complete. However, JLOH carries out most of its operations in parallel, cutting down on the overall run time.
If the computing problem you are trying to solve is too big for a single computer, or would take too long on one, one way to solve it is to split the problem into several sub-tasks that can be executed concurrently on many processors to save time. This, in essence, is how supercomputing works. Cloud computing, for those of you familiar with it, can be thought of as a commercial offering built on the same idea.
The steps followed by JLOH, as described above, are carried out by functions in JLOH's extract module. How fast these steps run depends on how fast the data can be processed. To leverage the power of multiple processors, JLOH uses Python's built-in multiprocessing module to analyze the data. Multiprocessing opens a 'pool' of processes that can run simultaneously on separate processor cores. JLOH splits some of its more resource-intensive tasks, runs them on separate cores, and when they are complete, the main process resumes.
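A minimal sketch of this pattern with Python's multiprocessing module, using a stand-in per-chromosome task rather than JLOH's actual functions:

```python
# Sketch of the pool pattern: distribute independent chunks of work
# (here, one list of mock records per chromosome) across worker
# processes. The count_snps task is a placeholder, not JLOH code.
from multiprocessing import Pool

def count_snps(chromosome_records):
    # stand-in for a resource-intensive per-chromosome analysis
    return sum(1 for rec in chromosome_records if rec == "SNP")

if __name__ == "__main__":
    chromosomes = [["SNP", "REF", "SNP"], ["REF"], ["SNP"] * 4]
    with Pool(processes=3) as pool:
        # each chromosome is analyzed in a separate process;
        # map() blocks until all workers finish, then the main
        # process resumes with the collected results
        results = pool.map(count_snps, chromosomes)
    print(results)  # [2, 0, 4]
```

Because `pool.map` preserves input order, the main process gets back one result per chromosome, ready for the next stage of the analysis.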
To test this, I split the genome samples that we are using into subsets of incremental size, with 1, 4, 8, 12, and 17 chromosomes respectively. I then ran JLOH on these sets as job scripts on a single node of the MareNostrum4 cluster, keeping the allocated memory and processing cores constant. For each run, I recorded the time JLOH took to complete the analysis. The results are graphed as shown.
The red line represents the real (wall-clock) time: the actual time the program takes from the moment we run the command until we get the output, as if timed with a stopwatch. The blue line, on the other hand, is the user time: the cumulative CPU time the processors spend running the program to completion. This is roughly how long JLOH would take to complete its analysis had it been running serially on a single processor, which, thanks to parallelization, is not the case. Notice that the overall time taken increases steadily as more input data is passed, showing JLOH's scalability with increasing input size.
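The distinction between the two can be demonstrated with a short script: two worker processes each burn one CPU-second, so user time adds up to about two seconds while the wall clock advances by only about one. The busy-loop task is purely illustrative:

```python
# Illustration of real (wall-clock) vs. user (cumulative CPU) time,
# the same distinction the `time` command reports. Two processes each
# spin for one CPU-second in parallel.
import os
import time
from multiprocessing import Pool

def spin(seconds):
    # busy-loop until this process has accumulated `seconds` of CPU time
    end = time.process_time() + seconds
    while time.process_time() < end:
        pass

if __name__ == "__main__":
    start = time.perf_counter()
    with Pool(processes=2) as pool:
        pool.map(spin, [1.0, 1.0])  # two CPU-seconds of work in parallel
    real = time.perf_counter() - start
    user = os.times().children_user  # CPU time used by the workers
    print(f"real: {real:.1f}s  user: {user:.1f}s")
```

On a machine with two free cores, real time comes out near one second and user time near two; on a single core, both would sit near two seconds, which is exactly the serial case the blue line describes.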
I took this further. Using a single data set, I ran JLOH again, this time varying the number of processor cores allocated for each run. The results for the time taken, graphed against the number of processor cores, are shown below. For this set, the real time decreases steadily from about 16 minutes on 1 core to about four minutes, flattening out at around 4 cores, which are sufficient to run the analysis on this set. User time stays roughly constant, but using multiple processors cuts down the real time. With its CPU-based parallelization strategy, JLOH consumes little memory to run. Overall, JLOH performed well in most of the test cases that I ran, proving scalable with both increasing input size and processor cores.
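Plugging the approximate timings above into the usual speedup and parallel-efficiency formulas (these are illustrative numbers read off the plot, not exact measurements):

```python
# Back-of-the-envelope scaling figures from the approximate timings
# quoted above: ~16 minutes on 1 core, ~4 minutes on 4 cores.

def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial one."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, cores):
    """Speedup per core; 1.0 means ideal linear scaling."""
    return speedup(t_serial, t_parallel) / cores

print(speedup(16.0, 4.0))        # 4.0 -> four times faster on 4 cores
print(efficiency(16.0, 4.0, 4))  # 1.0 -> near-ideal scaling up to 4 cores
```

That the curve flattens beyond 4 cores simply means that adding further cores no longer improves the speedup for an input of this size, so the efficiency per core starts to drop.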
That is it for this post and my Summer of HPC project. This has been an enriching, although fast-paced experience, where I have learned a lot about genomics and high-performance computing. I am grateful to my mentors from the BSC Comparative Genomics Lab for their guidance, to everyone at BSC for hosting me, and to PRACE for making all of this possible and allowing me to work on this project. I would highly recommend SoHPC to any university students interested in research computing.
Until next time, hoşça kalın (Turkish for "farewell")!