A video of my work has been published on YouTube. For anyone who is interested, it can also be viewed here.
The details of my project can be found in this GitHub repository: https://github.com/ljchan1/SoHPC_19. A magazine will also be published by PRACE, which will include a brief report on this project.
Now I am back in Malaysia. Looking back, it was definitely a worthwhile experience. My site coordinator was satisfied with my performance and offered me the opportunity to apply for a permanent job at CINECA, so I may be back in Bologna in the very near future. In other words, this Summer of HPC program may even land me a very exciting job. The experience and friendships I gained from this program are priceless.
Another side benefit of this program is that I am able to publish my own scientific paper. My supervisor told me that I can present the results of my research at the OpenFOAM conference next year. This is definitely a plus on my CV, as it can help me to secure funding for my future PhD program.
Last but not least, I would like to express my utmost gratitude to my site coordinator, Simone Bnà, for providing guidance and advice throughout the project. Without his supervision, this project would never have reached as far as it has. Furthermore, I would also like to thank PRACE for sponsoring me to work at CINECA.
This program has simply made me like supercomputers even more.
In the German movie Good Bye, Lenin!, a proud socialist woman falls into a coma in October 1989. When she wakes up again, the Wall has already fallen. To protect her health, her son decides to hide this historical episode from her: she would suffer huge anguish if she knew that her beloved old regime had disappeared. So the son builds a parallel reality inside their apartment, where nothing has changed, a bastion of the old German Democratic Republic. How is this related to my Summer of HPC project? We'll see…
Project recap
As we have learnt in the previous blog posts, drug discovery is a very expensive (~US $2.8 billion) and slow (12 to 15 years) process. Free Energy Perturbation (FEP) calculations coupled with Molecular Dynamics simulations may allow us to speed it up and reduce the cost and time of efficacy optimization (one of the goals of Lead Optimization, which in turn is the most expensive pre-clinical phase of drug discovery). These simulations allow us, in theory, to computationally screen many compounds, select those with better binding affinity for their target and synthesize only the most promising ones, thus saving valuable resources and time. The goals of this project have been 1) to compare FEP results with experimental data to assess whether the method is accurate enough to be used in industry, and 2) to determine whether High-Performance Computing (HPC) is necessary for FEP/MD calculations. All simulations were performed on ARIS, the Greek national supercomputer.
Project Results
Comparison between experimental and GAFF2 results for different analogs.
Our test case has been CK-666, a micromolar inhibitor of the Arp2/3 protein complex. This complex is involved in cell movement and in the migration of tumour cells; however, when CK-666 binds to it, the protein is inactivated. Different modifications of CK-666 (analogs) were studied.
After completing FEP/MD simulations for 11 analogs against CK-666, we could generate a correlation plot with the experimental results. How did our predictions correlate with the experiments? That's the big question.
When looking at the final correlation plots, it can be observed that, for example, FEP simulations using the Gromacs software and the GAFF2 force field correctly predicted that molecules ai003 and ai007 are favored over the reference ligand (CK-666), because their relative free energy is smaller than zero. For other analogs, such as ai015 or ai101, this was not the case. However, the Mean Absolute Error was 1.13 kcal/mol, which allows us to say that FEP can predict whether an analog will be better or worse than the originally synthesized molecule with a ΔΔG of ~1 kcal/mol. Thus, this technique has the potential to greatly reduce the cost and time of lead optimization for a drug to enter clinical trials.
Finally, we showed that HPC resources are essential for FEP/MD. On average, 13.5 hours and 1680 cores were needed for the complex phase of FEP, while 6.5 hours and 210 cores were necessary for the solvent phase (see the hardware specifications here). And that is for one single compound. Since the goal of FEP is to screen many analogs and select only a few of them to be synthesized, let's imagine a scenario in which we want to analyze 100 analogs in 1 day. This would require 5250 cores for the solvent phase of the simulations, and 84000 cores for the complex phase.
One of the last stages of the project included creating a popular-science video in which the entire work performed is explained, together with the results and conclusions. If you are eager to learn more about what I have done, please check it out here!
Farewell
Hopefully, the day when drug discovery costs are reduced thanks to the integration of computational tools into the pipeline will arrive in the near future. However, there is still work to be done by industry and academia in this direction. Meanwhile, from next week on, the only good times I will be missing will be my two months in Athens. Perhaps I will ask my family to pretend I am still there…
Hello dear readers! Today, we will write the ending of this story. A story that lasted for two months. TWO MONTHS. For me, July was full of surprises and new friends, and August was full of wonders and adventures!
But let's start with a summary of my project:
Last time I talked about Gradient Boosted Decision Trees (GBDT) and briefly touched on the parallelization methods. This time we will dig deeper into this aspect! So how can we parallelize this program?
Well, when it comes to creating a parallel version of a program, many aspects have to be taken into account, such as the dependencies between various parts of the code, balancing the workload across the workers, etc.
This means that we cannot parallelize the gradient boosting algorithm's loop directly, since each iteration requires the result obtained in the previous one. However, it is possible to create a parallel version of the decision tree building code.
The algorithm for building Decision Trees can be parallelized in many ways. Each of the approaches can potentially be the most efficient one depending on the size of data, number of features, number of parallel workers, data distribution, etc. Let us discuss three viable ways of parallelizing the algorithm, and explain their advantages and drawbacks.
1. Sharing the creation of tree nodes among the parallel processes at each level
The way we implemented the decision tree allows us to split the effort among the parallel tasks. This means that we would end up with the task distribution schematically depicted in the figure below:
You have already seen this figure in Episode 3!
Yet, we have a problem with this approach. We can imagine a case where 50 "individuals" go to the left node and 380 go to the right one. One processor would then process the data of 50 individuals while the other processes the data of 380. This is not a balanced distribution of the work: some processors would be doing almost nothing while others would be drowning in it. Furthermore, the number of tree nodes that can be processed in parallel limits the maximum number of usable parallel processes… So we thought about another way.
2. Sharing the best split search in each node
In our implementation of the decision tree algorithm, there is a function that finds the value that splits the data into two groups with minimum impurity. For a fixed column (feature), it iterates through all the distinct values and calculates the impurity of the corresponding split. The output is the value that achieves the minimum impurity.
Sharing the best split search in each node, for each feature
As can be seen in the figure above, this is the part of the code that can be parallelized. So every time a node has to find a split of individuals into (two) groups, several processes compute the best local splitting value, and we keep the minimum among the parallel tasks. The same calculations are then repeated for the right data on one side and the left data on the other.
In this case, a parallel process does its job for a node and, when done, can directly move on to another task on another node. So we get rid of the unbalanced workload, since (almost) all processes are constantly given tasks to do.
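As a rough illustration of this idea (a sketch only: mpi4py is assumed here purely for readability, the data is a random placeholder, and the real implementation may well use a different communication library), each process evaluates its own share of the candidate thresholds and the global minimum is then selected:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def weighted_impurity(values, labels, t):
    # weighted Gini impurity of the two groups induced by threshold t
    total = 0.0
    for group in (labels[values <= t], labels[values > t]):
        if len(group):
            _, counts = np.unique(group, return_counts=True)
            p = counts / counts.sum()
            total += len(group) / len(labels) * (1.0 - np.sum(p ** 2))
    return total

# placeholder data for the individuals that reached this node
values = np.random.rand(430)
labels = np.random.randint(0, 2, 430)

# each rank only evaluates every size-th candidate threshold
candidates = np.unique(values)[rank::size]
local_best = min(((weighted_impurity(values, labels, t), t) for t in candidates),
                 default=(np.inf, None))

# gather all local optima and keep the split with the smallest impurity
best_impurity, best_threshold = min(comm.allgather(local_best))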
Nevertheless, this approach also has a notable drawback: the cost of communication is not always worth the effort. Imagine the last tree level, where we would have only a few individuals in each node. Communication is needed in both directions (the data has to be handed to each process, and the output received at the end), and it will eventually slow down the global execution more than the parallelization of the workload speeds it up.
3. Parallelizing the best split search on each level by features
The existing literature helped us merge some features of the two aforementioned approaches into one that reduces or eliminates their drawbacks. This time, the idea is to parallelize the function that finds the best split, but for each tree level: each parallel process calculates the impurity for a particular variable across all nodes within that level of the tree.
This method is expected to work better because:
The workload is now balanced, because each parallel process evaluates the same amount of data for "its feature". As long as the number of features is the same as (or, ideally, much larger than) the number of parallel tasks, none of the processes will idle.
The impact of the problem we described in the second approach is reduced, since we are working on whole levels rather than on a single node. Each parallel task is loaded with much larger (and equal) amounts of data, so the communication overhead is less significant.
The global concept is shown in this figure:
Parallelizing the best split search on each level by features
And now let’s see one way to improve all of this. We will focus on the communication, particularly its timing.
While we cannot completely avoid losing some wall-clock time in sending and receiving data, i.e. in communication among the parallel processes, we can rearrange the algorithm (and use suitable libraries) to facilitate the overlap of computation and communication. As shown in the figures below, in the first case (a) the processor waits until the communication is done before launching the calculations, whereas in the second case (b) it starts computing before the end of the communication. The last case (c) shows different ways to achieve total computation-communication overlap.
Timing of communication and computation steps: (a) no overlap, (b) partial overlap, (c) cases of complete overlap.
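To illustrate case (b), here is a small sketch of how non-blocking calls let the computation start before the data exchange has finished (mpi4py is assumed here just for the sketch; run it with two processes, e.g. mpirun -n 2 python overlap.py):
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
partner = 1 - rank  # assumes exactly two ranks for simplicity

send_buf = np.full(1_000_000, rank, dtype='d')
recv_buf = np.empty(1_000_000, dtype='d')

# start the exchange, but do not wait for it yet
requests = [comm.Isend(send_buf, dest=partner),
            comm.Irecv(recv_buf, source=partner)]

# useful work that does not depend on the incoming data overlaps the transfer
local_result = np.sum(send_buf ** 2)

# only block when the received data is actually needed
MPI.Request.Waitall(requests)
total = local_result + np.sum(recv_buf)
print(f'rank {rank}: total = {total}')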
One of the libraries that allows for asynchronous communication patterns, and thus for the overlap of computation and communication, is GPI-2, an open-source reference implementation of the GASPI (Global Address Space Programming Interface) standard, which provides an API for C and C++. It offers non-blocking one-sided and collective operations with a strong focus on fault tolerance. Successful implementations of parallel matrix multiplication, K-means and TeraSort algorithms are described in Pitonak et al. (2019), "Optimization of Computationally and I/O Intense Patterns in Electronic Structure and Machine Learning Algorithms."
So we're done now! We have discussed several possible parallel versions of this algorithm and their efficiency. Unfortunately, we did not have enough time to finalize the implementations and compare them with the JVM-based technologies, or with the performance of XGBoost.
Future work could focus on improving and completing the implementations and comparing them to the usual tools. Including OpenMP in the parallelization could also be a very interesting approach and lead to better performance.
Another aspect that could be considered is using the fault-tolerance functions of GPI-2, which would ensure better reliability for the application.
Now what?
Now I am going back to Paris for a few days. Full of energy, full of knowledge and with a great will to come back here. This summer has been crazy. In two months I met some incredible people here and I am sooo grateful for all of this.
The first week in Bologna was a great introduction to the HPC world. We learned a lot and got the chance to meet and know more about each other. I really hope we will keep in touch in the future! I’d love to hear about what everyone will be working on, etc.
The TEAAAAAM in Bologna! <3
Then Irèn and I got the incredible chance to work with the CC SAS team here in Bratislava. People are so nice and the city is lovely! I’d like to take this opportunity to thank my mentors and the whole team for their help and kindness (and the trips!).
Living in Bratislava does not only mean living in a lovely city. It also means being in the middle of many other countries and big cities. So I was able to visit Vienna with Irèn (and we actually met 3 SoHPC participants who came from Ostrava and Ljubljana, we spent a great weekend together!), then I visited Prague (And fell in love with it) and Budapest!
Five HPC participants in Vienna! (Belvedere Palace) From left to right: Pablo (Ostrava), me, Khyati (Ljubljana), Irèn (Bratislava), and Arsenios (Ljubljana).
Well, well, well. I think that's all I have to say. You should apply and come discover it for yourself! I can promise that you won't regret it.
Check my LinkedIn if you want some more information and drop me a message there!
Last week concluded my (almost) two-month stay at CINECA in Bologna. I worked on an interesting project and got to experience life in Bologna.
Project
The first big part of my project was data preparation. The raw data, stored in a database, had to be transformed into a form suitable for machine learning. Due to the sheer volume of the data, this transformation had to be done in parallel and in batches.
Once the data transformation was completed, I focused on a subgroup of the dataset – data about high-level failure reporting. My task was to try to understand how faults relate to system availability (i.e. which faults cause system failures). The nature of this task requires an explainable model – a model that provides the reasoning behind its predictions.
The first step in creating such a model is to compare explainable models (decision trees) to black-box models (neural nets, SVMs, ensemble methods); a code sketch of such a comparison is shown below. It turned out that decision trees achieve accuracy as high as the black-box models (10-fold cross-validation with preprocessing included in the pipeline – see previous post). This suggests that the problem
is suitable for the construction of an explainable model, and
possesses informative features (attributes).
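As an illustration only (scikit-learn is assumed, the dataset below is a synthetic stand-in for the real fault data, and the models are simplified compared to the actual pipeline), such a comparison can look like this:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the fault/availability dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    'decision tree (explainable)': DecisionTreeClassifier(max_depth=5),
    'SVM (black box)': SVC(),
    'neural net (black box)': MLPClassifier(max_iter=1000),
    'random forest (black box)': RandomForestClassifier(),
}

for name, model in models.items():
    # preprocessing lives inside the pipeline, so it is re-fit in every fold
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=10)
    print(f'{name}: mean accuracy = {scores.mean():.3f}')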
The problem with decision trees, however, is that they are unstable: the structure of a decision tree changes with (relatively) small changes to the dataset. As such, they do not provide the most reliable explanation of the model.
The goal for the continuation of this work (outside the scope of Summer of HPC) is to adapt a consensus tree classifier (an explainable ensemble method) to work efficiently (in parallel) on HPC infrastructure with large quantities of data.
Additional details about my work are included in my presentation:
Life in Bologna
Work at CINECA and the University of Bologna was a great experience. My colleagues – my site supervisor Dr. Massimiliano Guarrasi (CINECA), my mentor Dr. Andrea Bartolini, and my co-mentors Dr. Andrea Borghesi and Francesco Beneventi (University of Bologna) – were extremely helpful and created a pleasant working environment.
Also, Li was a great roommate.
Li and I after we blew the fuses of our apartment (apparently you cannot use the washing machine and the dishwasher at the same time). Consulting my spirit guide Marko Mandic before the final presentation.
All in all, I am grateful for the opportunity to be a part of Summer of HPC 2019.
Summer of HPC, right? It's in the name of the program. So maybe it is time now to explain what it is about! If you are following this journey because you have an HPC class and want to join the program, you may already know about it, but if you're curious and want to know why you should apply too, follow me!
First of all, HPC stands for… High Performance Computing. It is the use of multiple computers/processors in order to run advanced and complex applications or process large amounts of data reliably, quickly and efficiently. So, I never said here that it is a “computer science field”. And I will never say that because it is not true. I mean it is not only for computer scientists, but it is everywhere!
Data processing, computational chemistry, biotechnology, physics… and so much more! So you really should read this!
Distributing the work across different processors means that your program is parallel. If the work is not distributed, then it is a serial code.
Here is how it works. It is exactly like having to wash 1000 plates: you might take 2 hours to finish if you do it by yourself, but you could instead recruit 3 friends, give each of them 250 plates and keep 250 for yourself. If you all wash dishes at the same speed and all work at the same time, you can divide the initial time by 4 (at most)!
Yes, but why am I saying "the best we can hope for is Ts/4"? This is where we have to introduce the communication time. It is one of the reasons we usually don't get exactly "the serial time divided by the number of processors". If your friends start washing the dishes at the same time as you, they are probably doing it at another sink. So you have to move the plates to where they are and take them (cleaned, hopefully) back at the end. These steps take a bit of time, so even if they finish their task at the EXACT same time as you, the "communication" process will make you lose some time, and you end up with a final time that is greater than Ts/4.
And now you understand what the speedup is! It is what you gain in time after parallelizing your program. So in the figure shown below, the maximum should be 4. Maybe you can get a speedup of around 3.9, which is not bad at all.
The speedup will help you calculate something else: The efficiency.
Efficiency = Speedup/Number of processors
So efficiency is a parameter ranging between 0 and 1. The best you can get is 1. If you get 0, it means that either none of the other processors is working or you should completely review your program.
When you measure this efficiency, you get some information about how well your program scales. The procedure is pretty simple: you launch your serial code, then execute the parallel version on 2, 4, 8, 16, 32 … processors (the "steps" depend on you and on how many processors you have available). If at each step you obtain an efficiency that stays close to 1 and does not drop sharply at some point, then you're probably on the right track!
But if, for example, you get an efficiency of 1 for 2 and 4 processors, then 0.9 for 16 processors, and it suddenly drops to 0.7 when using 64, there is probably something to improve! In many cases, it is a communication problem: if you use communications too much, they can end up consuming too much time, and you lose more than you gain by parallelizing your application.
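In practice the bookkeeping is just a couple of divisions; here is a tiny sketch with made-up timings, purely for illustration:
# hypothetical wall-clock times (seconds) for the same problem size
serial_time = 1000.0
parallel_times = {2: 510.0, 4: 260.0, 8: 135.0, 16: 75.0}

for procs, t in parallel_times.items():
    speedup = serial_time / t      # how much faster than the serial run
    efficiency = speedup / procs   # 1.0 would be perfect scaling
    print(f'{procs:>2} processors: speedup = {speedup:.2f}, efficiency = {efficiency:.2f}')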
Alright, maybe it is enough! This was a really quick introduction to HPC that, I hope, makes you want to dig into it. I can advise you to check these resources: https://insidehpc.com/, https://epcced.github.io/hpc-intro/010-hpc-concepts/ and of course check some courses about MPI and OpenMP to train!
The past two months have been a really great experience, and I would like to thank PRACE Summer of HPC for the opportunity it has given me to participate in this program and work with supercomputers. I would also like to thank my supervisors at the Julich center, who helped me reach the goal of my project and learn tons of new stuff.
The whole journey has been amazing. To begin with, during the first week in Bologna I learned a lot about the usefulness of parallel programming and I met all the incredible people who participate in this program, including the professors and staff who make Summer of HPC possible.
The program provided a good balance between having fun and education.
Then, with my arrival at Julich, I was introduced to many new concepts, such as computer architectures, the scalability of different networks, numerical methods that I wasn't aware of, and many other things.
Then my project began, and at that point I was not only absorbing new material but also implementing it. These past couple of months helped me a lot to increase my knowledge and my confidence as a programmer, since programming is not part of my usual routine as a mechanical engineering student.
Finally, I will describe once more how beautiful Julich is. Having a bike, I visited pretty cool places in and around Julich. The city has a really great history that can be studied in the castle, and there are also very nice events at the square almost every weekend: last weekend, for example, there were beach volleyball matches, which was pretty awesome. There are also very beautiful lakes for swimming and a zoo, which offers pretty much all kinds of entertainment for kids and grown-ups. I will add some pictures of some places that I have visited; consider it an update from my last post.
Firstly, you need a local virtual instance of Hadoop with R: download Oracle VM VirtualBox (or VMware) and the Mint-Hadoop.ova image. This ova file already contains a working version of Hadoop and RStudio.
You can also set up your Hadoop Cluster if you have a number of PCs on a domain by following the instructions in this link.
Future Learn: Digital Education Platform
Future Learn “Managing Big Data with R and Hadoop” by PRACE & University of Ljubljana is a step-by-step course to manage and analyze big data using the R programming language and Hadoop programming framework. It is one of the top courses to learn statistical learning with RHadoop.
Managing Big Data with R and Hadoop Course on Future Learn Platform
I have adapted some of the algorithms developed in my project "Predicting Electricity Consumption" to RHadoop, and the example has been made available in this course. The electricity consumption data has been made available in the Hadoop Distributed File System (HDFS) for you to practice with. The example explains how to do data formatting and aggregation, how to calculate the mean and standard deviation using MapReduce jobs, and then how to plot the results. Try the exercises at the end of the course and leave a comment there if you have any questions.
All the world’s a stage, and all the men and women merely players: they have their exits and their entrances…
William Shakespeare
I admit this is a bit pathetic, but maybe it's the end of these great two months which makes me a bit emotional! However, I think one can describe this summer as a three-act drama, with the acts represented by my blog posts:
The first one was the introduction, where you got to know all the important characters of the play.
The second one was the confrontation, where I had to fight against the code to get it to work!
And this is the third one, the resolution, where I'll reveal the final results and answer all open questions (okay, probably not all, but hopefully many).
Okay, to be honest, the main reason for seeing it this way is that I get to cite Shakespeare in an HPC blog post, which is quite unusual, I believe! But there is some truth in it…
Now it’s time to stop philosophizing and start answering questions. To simplify things, I will summarise my whole project to one question:
Is GASNet a valid alternative to MPI?
Allow me one last comment before finally showing you the results: replacing MPI with GASNet in this manner was quite an unusual way of working with GASNet's active messages. We don't really make use of the "active" nature of GASNet, so one could probably redesign the applications so that they make better use of it. But our goal was to find a simple drop-in replacement for MPI. So if the answer to the above question is no (spoiler alert: it is), I mean just that specific use case, not GASNet in general!
Below you see two plots of the performance of two very different applications:
I won't explain all the details of these applications here; for that, check out my video or my final report. There are only two things I want to point out here:
The scaling behaviour: On the x-axis of both plots you see the number of cores (once with a log scale, once linear). If we now increase the computing power via the number of cores, we have to decide whether to change the amount of work the application has to do:
If we don't change the amount of work, the runtime will hopefully decrease. In the best case, twice the number of cores results in half the runtime. Here we can see whether our application scales well. This is called strong scaling.
If we scale the amount of work together with the number of cores, the time should stay constant. If it does not, it is a sign that the underlying network is causing problems. This is the behaviour you can see above: in the left plot the time is pretty constant, while in the right one it increases a bit towards the end.
The communication pattern: this is what we call the kind and number of messages a program sends. The stencil usually sends few but big messages, whereas graph500 sends many small ones. And here we see the big performance difference: in the first case, MPI and GASNet are almost equal, which is totally different in the second one. We assume the reason for this lies in differences in latency, meaning that GASNet simply takes longer to react to a message send, which has much more impact with many small messages.
However, GASNet is not able to beat MPI in either of these cases, so in the end we must conclude: it is not worth using GASNet's active messages as a replacement for passive point-to-point communication via MPI. But that doesn't mean there are no cases in which GASNet can be used. For example, according to the plots on the official website, the RMA (remote memory access) component of GASNet performs better.
But it is not as if the project failed because of these results. Besides the fact that I learnt many new things and now feel very familiar with parallel programming, these results could also be interesting for the whole parallel computing community! So I'm leaving with the feeling that this wasn't useless at all!
Goodbye, Edinburgh!
Unfortunately, it's not just my time with GASNet that is ending now, but also my time in Edinburgh. I really fell in love with this city over the last weeks. And it was not so much the Fringe (which is very cool, too), but more the general atmosphere. We also explored the surroundings of the city in the last weeks; it is such a beautiful landscape!
I'm really sad that this time is over now, but I'm very grateful for this experience. I want to thank all the people who made this possible: PRACE and especially Leon Kos for organising this event, Ben Morse from EPCC for preparing everything in Edinburgh for our arrival, my mentors Nick Brown and Oliver Brown for this really cool project, and last but not least all the awesome people I spent my time with here, mainly my colleagues Caelen and Ebru, but also other young people like Sarah from our dormitory! Thank you all!
Hello, my friends. Today we will learn how to cook the dish shown below, in Python, using PyQt5. I don't know if you can see it yet, but below is a dish of pasta with tomato sauce and cheese.
Pasta with tomato sauce and cheese.
Still don't see it? Let's have a closer look at the recipe.
Ingredients
The ingredients will be provided by the (product placement) PyQt5 library. Specifically, from QtWidgets we will need
1 QApplication (table)
1 QWidget (serving platter)
1 QGridLayout (dish)
3 QPushButton (pasta, tomato and cheese)
and we will need the backends from matplotlib and pyplot, which we will call plt for short, the yum-yum numpy as np, and of course sys.
matplotlib.pyplot or just plt (pot)
just a tip from np (salt)
a slice of sys (bread)
Below, you can see a picture of those ingredients.
from PyQt5.QtWidgets import (QApplication, QWidget,
                             QGridLayout, QPushButton)
import numpy as np
import sys
import matplotlib
matplotlib.use('Qt5Agg')
import matplotlib.pyplot as plt
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from matplotlib.backends.backend_qt5agg import NavigationToolbar2QT as NavigationToolbar
Method
We will first cook the pasta, the basic component of our dish. Then we will prepare the tomato sauce, and afterwards our cheese. We will treat all of them in a very delicate way. Finally, we have to combine everything on our dish. We could leave it at that, but a proper chef takes care of the whole serving procedure, so we will also take care of the dish on its way to the table.
1 Boil water.
First, we have to boil some water. So, let’s grab some pre-boiled water from the fridge and let’s go.
self.figure = plt.figure() # figure out which pot you will use
self.canvas = FigureCanvas(self.figure) # convey water to pot
toolbar = NavigationToolbar(self.canvas, self) # navigate to hob
Just kidding. We grabbed it from the s(h)elf.
2 Get the pasta ready
Now, we will prepare the pasta. Let's choose the desired pasta, and let's think about what we'll do with it!
button_pasta = QPushButton('Pasta') # Choose what pasta we will use
button_pasta.clicked.connect(self.add_pasta) # think of preparation
But what is the actual preparation? Let's have a better look at it.
def add_pasta(self):
    # prepare new pot
    self.figure.clear()
    # add a little bit of salt
    chi = np.random.uniform(0, 10, 400)
    psi = np.random.uniform(0, 10, 400)
    # get boiling water
    ax = self.figure.add_subplot(111)
    plt.axis((0, 10, 0, 10))
    # put pasta in pot and wait
    plt.plot(chi, psi, color='gold')
    # get pasta ready/dry/drawn
    self.canvas.draw()
3 Get tomato sauce ready
Again, we choose the desired tomato sauce and think about how to incorporate it into our dish.
def add_sauce(self):
    # add just a small tip of salt to the tomato sauce
    chi = np.random.uniform(1, 9, 40)
    psi = np.random.uniform(1, 9, 40)
    # heat olive oil in a pot
    ax = self.figure.add_subplot(111)
    plt.axis((0, 10, 0, 10))
    # add tomato sauce to the heated olive oil
    plt.plot(chi, psi, 'o', color='red', ms=35, alpha=0.5)
    # get sauce out of pot
    self.canvas.draw()
4 Get the cheese ready
Finally, we have to prepare the cheese. It should be very easy for you by now.
The preparation is now simpler, but it is still very important.
def add_cheese(self):
    # taste how salty cheese is but you can't do much about it
    chi = np.random.uniform(2, 8, 40)
    psi = np.random.uniform(2, 8, 40)
    # grab a grate
    ax = self.figure.add_subplot(111)
    plt.axis((0, 10, 0, 10))
    # grate the cheese with a specific grate size
    plt.plot(chi, psi, '.', color='yellow', mec='black', mew=0.5)
    # get cheese ready in a bowl
    self.canvas.draw()
5 Get things on the dish
The last part is the dish preparation.
# choose a nice dish
layout = QGridLayout()
# start by adding pasta, tomato sauce and cheese in that order
layout.addWidget(button_pasta, 2, 0)
layout.addWidget(button_sauce, 2, 1)
layout.addWidget(button_cheese, 2, 2)
# now hope that pasta will be hot enough
# and cheese will melt
layout.addWidget(toolbar, 0, 0, 1, 3)
layout.addWidget(self.canvas, 1, 0, 1, 3)
# smell the result
self.setLayout(layout)
6 Serve dish on the table
A proper dish is only good when you have added it to the table. So here, in this final step, let's add this dish to our table, with a slice of bread too, for any customer/friend/partner who wants to try our food!
if __name__ == '__main__':
    # add a slice of bread to the table
    app = QApplication(sys.argv)
    # incorporate the dish recipe as we described it above
    main = PastaWithTomatoSauce()
    # present the dish on a serving plate
    main.show()
    sys.exit(app.exec_())
If you execute this recipe, you will get the following result.
An empty table
Don't worry, you didn't spend 10 minutes for nothing, even though you might see just a table. You simply have to click the buttons in the order you want to add the ingredients. If you want to add more sauce or cheese, feel free to click multiple times! And voilà!
A dish with only pasta
Pasta with a lot of tomato sauce. I like tomato sauce. Our full dish.
Bon appétit!
Here is the full recipe code.
from PyQt5.QtWidgets import (QApplication, QWidget,
                             QGridLayout, QPushButton)
import numpy as np
import sys
import matplotlib
# for full compatibility we use the specific render
matplotlib.use('Qt5Agg')
import matplotlib.pyplot as plt
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from matplotlib.backends.backend_qt5agg import NavigationToolbar2QT as NavigationToolbar


class PastaWithTomatoSauce(QWidget):
    def __init__(self, parent=None):
        super(PastaWithTomatoSauce, self).__init__(parent)
        # create figure, canvas and toolbar objects
        self.figure = plt.figure()
        self.canvas = FigureCanvas(self.figure)
        toolbar = NavigationToolbar(self.canvas, self)
        # add button objects and
        # connect the click signal to relevant functions
        button_pasta = QPushButton('Pasta')
        button_pasta.clicked.connect(self.add_pasta)
        button_sauce = QPushButton('Sauce')
        button_sauce.clicked.connect(self.add_sauce)
        button_cheese = QPushButton('Cheese')
        button_cheese.clicked.connect(self.add_cheese)
        # create the layout and add all widgets
        # on proper positions
        layout = QGridLayout()
        layout.addWidget(button_pasta, 2, 0)
        layout.addWidget(button_sauce, 2, 1)
        layout.addWidget(button_cheese, 2, 2)
        layout.addWidget(toolbar, 0, 0, 1, 3)
        layout.addWidget(self.canvas, 1, 0, 1, 3)
        # assign the layout to self, a QWidget
        self.setLayout(layout)

    def add_pasta(self):
        # clear the canvas for a new dish
        self.figure.clear()
        # generate random data
        chi = np.random.uniform(0, 10, 400)
        psi = np.random.uniform(0, 10, 400)
        # create axis with relevant lengths
        ax = self.figure.add_subplot(111)
        plt.axis((0, 10, 0, 10))
        # plot data
        # set relevant color
        plt.plot(chi, psi, color='gold')
        # update canvas with current figure
        self.canvas.draw()

    def add_sauce(self):
        # generate random data
        # tomato should be on pasta so we limit the boundaries
        chi = np.random.uniform(1, 9, 40)
        psi = np.random.uniform(1, 9, 40)
        # create axis with relevant lengths
        ax = self.figure.add_subplot(111)
        plt.axis((0, 10, 0, 10))
        # plot data
        # set relevant color, marker size, and transparency
        plt.plot(chi, psi, 'o', color='red', ms=35, alpha=0.5)
        # update canvas with current figure
        self.canvas.draw()

    def add_cheese(self):
        # generate random data
        # cheese should be on tomato so we limit the boundaries
        chi = np.random.uniform(2, 8, 40)
        psi = np.random.uniform(2, 8, 40)
        # create axis with relevant lengths
        ax = self.figure.add_subplot(111)
        plt.axis((0, 10, 0, 10))
        # plot data
        # set relevant color, marker edge color, and width
        plt.plot(chi, psi, '.', c='yellow', mec='black', mew=0.5)
        # update canvas with current figure
        self.canvas.draw()


if __name__ == '__main__':
    app = QApplication(sys.argv)
    main = PastaWithTomatoSauce()
    main.show()
    sys.exit(app.exec_())
It's hard to believe that I have less than a week left here in Luxembourg. The past few days have been quite busy between filming my presentation, working on my final report, tidying up my code, and fleshing out the documentation but, despite being kept busy, it's hard not to feel strange about the fact that Summer of HPC is almost over. However, I have been able to make a reasonable amount of progress on my main project. A number of additional experiments have been run to build on the results described in my last post. Many of these have examined whether my previous findings are also valid when using CPUs instead of GPUs for training (short answer: they are, but everything is quite a bit slower). I've also been able to add support for another framework, namely MXNet, to my code, although it doesn't seem to want to run on more than one node at the moment. However, as my final report will be available on this site soon, I thought I would take this opportunity to review what I have learned over the past two months and encourage anyone who wants to read about all the technical details to have a look at that when it's available.
Over the course of the last two months, I have gained a relatively large amount of experience in deep learning. This is an area in which I had a small amount of prior knowledge, but I feel that the chance to spend a summer working on one substantial project and to apply my knowledge on a machine far more powerful than a standard laptop was extremely beneficial in allowing me to become more experienced in what is becoming a very important topic. I also feel that I am becoming more proficient with the most common software packages used in this area. Indeed, the fact that my project was based around comparing different frameworks means that I now have at least some experience with many of the libraries used in the field. This might be helpful, given that machine learning will probably feature quite a bit in the master's program I'm starting in less than two weeks.
I also feel that I learned a lot about HPC in general this summer. While the training week in Bologna now seems like a long time ago, the tutorials in MPI and OpenMP, among other topics, were very interesting and informative and I plan to keep practicing and build on my knowledge of this material for any future HPC related projects I may undertake. Working on my project has also helped me to become familiar with the SLURM scheduler and generally get used to thinking about how simple pieces of code have to be adapted to work on a large scale. It’s also quite hard to spend two months developing code on a cluster without picking up a few new Linux and vim keyboard shortcuts. Overall, I feel that my knowledge of HPC has improved substantially this summer and I look forward to any opportunities that may arise to use what I have learned in the future.
View of Echternach cathedral in the distance from near the end of the hike.
Personal Update
Over the past eight weeks, I have had a number of opportunities to explore Luxembourg city, which is a very pleasant place to wander around, with many events happening over the summer. I also had the chance to go on two daytrips to neighbouring countries. The first was to Trier, the oldest city in Germany, birthplace of Karl Marx and home to an impressive amount of Roman ruins for somewhere this far North. The second was to Metz in France, another extremely old town with an extremely large cathedral as well as an impressive modern art gallery. Both trips were very enjoyable and, in general, I felt that this summer was a great opportunity to spend some time in an area which is not exactly a standard tourist destination.
Matteo and I spent our last full weekend in the country in Echternach, near the border with Germany, walking part of the Mullerthal Trail, Luxembourg's main hiking route. We only managed to do a small section of the trail, which is over 100 km long in total, but it was still a very good way to relax and enjoy the good weather after a busy week sorting out some of the final details of our projects.
My time in Amsterdam is coming to an end. Not only did Allison and I explore a lot, but we also learned a lot and are now finalising our projects. So, for one last time, I guess it’s time to tell you what I’ve been up to for the last few weeks.
I want to start off talking a bit about encryption. There are different encryption algorithms and encryption modes. The algorithms I looked at in my project were AES, Twofish, Serpent and CAST5. The differences between them are not too important, so instead, let’s take a look at their most significant similarities:
they all remain unbroken and are therefore secure, and
they are all block ciphers.
The term block cipher refers to the idea of splitting the plaintext that you want to encrypt into blocks of the same length and encrypting them separately. This is necessary because otherwise you would always need a key of the same length as the plaintext. If you're not sure why that is, take a look at the Vernam cipher. By encrypting blocks of a fixed length, you only need a key of that exact length and can use it for all of the blocks. There are different ways of applying a key to a block, i.e. different encryption modes. In my benchmarks, I looked at three such modes, namely ECB, CBC and XTS. The following is a short overview of these modes.
ECB mode
CBC mode
XTS mode
ECB: The Electronic Code Book mode is the simplest and fastest way to encrypt blocks of plaintext. Every block is encrypted separately with the chosen algorithm using the key provided. This can be done in parallel. However, it is highly deterministic: identical plaintexts have identical ciphertexts.
CBC: The Cipher Block Chaining mode is less deterministic, but slower. The encryption is randomised by using an initialisation vector to encrypt the first block. Every subsequent block’s encryption depends on the block before, i.e. also on the initialisation vector. This cannot be done in parallel.
And finally, XTS: The XEX-based tweaked-codebook mode with ciphertext stealing uses two keys. The first one is itself encrypted with an initialisation value and is used to diffuse both the plaintext and the ciphertext of every block; it is also changed slightly for every block. The second key is used for actually encrypting the block. XTS also uses ciphertext stealing, which is a technique for when the data length is not divisible by the block size: the plaintext of the last block is padded with ciphertext from the previous block. Again, this mode can work in parallel.
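To make the "deterministic" point tangible, here is a small sketch using Python's cryptography package (an assumption for illustration only; the actual benchmarks encrypted whole virtual disks rather than individual messages). With ECB, two identical plaintext blocks produce identical ciphertext blocks, while CBC chains the blocks so they differ:
import os
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)                  # 256-bit AES key
iv = os.urandom(16)                   # initialisation vector for CBC
plaintext = b'16 byte block...' * 2   # two identical 16-byte blocks

def encrypt(mode):
    encryptor = Cipher(algorithms.AES(key), mode, backend=default_backend()).encryptor()
    return encryptor.update(plaintext) + encryptor.finalize()

ecb = encrypt(modes.ECB())
cbc = encrypt(modes.CBC(iv))

# ECB: the two ciphertext blocks are identical; CBC: they are not
print('ECB blocks equal:', ecb[:16] == ecb[16:32])   # True
print('CBC blocks equal:', cbc[:16] == cbc[16:32])   # False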
Now you have a basic idea of what to consider when encrypting a disk: Not only are there different algorithms, but also different modes. Since they work quite differently, they might also have a different impact on the performance. To look into this, I set up a simple test environment out of two virtual machines: One acts as the host for the virtual machine to be benchmarked, and one acts as an NFS-Server which contains the encrypted disk. For more details on this, I suggest you have a look at my final report.
To benchmark the setup, I transcoded a 17-minute-long video of a German news programme. I did this on eight differently configured virtual disks: seven that were each encrypted differently, and one that was not encrypted. Five runs were performed for every configuration. I plotted the results in the two graphs below; the configurations were:
Configuration No.   Cipher    Mode   Key size (bits)   Hash function
1                   None      None   None              None
2                   AES       ECB    256               SHA256
3                   AES       XTS    256               SHA256
4                   AES       CBC    256               SHA256
5                   AES       CBC    128               SHA256
6                   Twofish   CBC    256               SHA256
7                   Serpent   CBC    256               SHA256
8                   CAST5     CBC    128               SHA256
Here you can see how the encryption impacted the user performance. The non-encrypted setup is the green bar at the bottom. The impact of encryption on the user time is very low; the time needed increases by only 0.5 to 3 percent.
User performance in test setup when benchmarking the transcoding of a video.
Things look different in system time: the performance is decreased by 13-30%. While this sounds like a lot, it is not completely crazy; if you really need to secure your data, you will be okay with it. Also, the system time makes up only a small part of the real time.
System performance in test setup when benchmarking the transcoding of a video.
Looking back at the process, I can say it is fairly simple to set up an encrypted disk for a virtual machine. The benchmark results show that the encryption does impact the performance, but only to a moderate extent. Also, they suggest that performance demands can be balanced with security demands since different configurations of the encryption lead to different timings. So basically, yes, you can have it all!
Sadly, this is my last post for the Summer of HPC. It’s been a great two months! Today, I’d like to talk about the final results of my project, namely the use of my animation framework for the coastal defence/wave demonstration, and the web framework I created for the simulation. First, a video of the coastal defence simulation in action!
Coastal Swapping
Haloswapping in coastal defence simulation
The main operation used in the coastal defence demonstration is what is known as a "haloswap". This is an operation at a slightly higher level of abstraction than typical MPI communications, but it is very simple. To start, you need a grid, or matrix, or array – whatever terminology you prefer.
In the wave simulation, you need to split this grid across many different Pis to run a simple differential equation solver on it (which I won't go into here). So far, so good. In the video above, we split this grid into 16 equal horizontal strips and give one strip to each Pi (the output from the simulator shows slightly different vertical lines). But how does a wave move from the top of the simulation to the bottom? There is some kind of local relationship in the grid that means changes in one area affect those nearby. How do we pass this local information to other Pis?
The size of this local area of influence is known as a halo, and, as I'm sure you've guessed from the video, we simply swap the halos between the Pis, an operation known as a "haloswap". Under the hood, this is a simple send and receive in both directions of an array of numbers (a buffer of data). Once the local information is properly exchanged, the simulation can transfer waves between strips. Wonderful!
Diagram showing a haloswap
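For the curious, here is roughly what a haloswap looks like in code. This is a sketch using mpi4py with a made-up strip size, not the actual simulator code (which, as mentioned below, is written in Fortran): each Pi sends its edge rows to its neighbours and receives their edge rows into its halo rows.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rows, cols = 64, 256                   # made-up strip size per Pi
strip = np.zeros((rows + 2, cols))     # interior rows plus one halo row above and below
strip[1:-1, :] = rank                  # fill the interior with dummy data

up = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# send my top interior row up and receive my bottom halo from below...
comm.Sendrecv(strip[1, :], dest=up, recvbuf=strip[-1, :], source=down)
# ...then send my bottom interior row down and receive my top halo from above
comm.Sendrecv(strip[-2, :], dest=down, recvbuf=strip[0, :], source=up)
# after these two calls every strip holds a copy of its neighbours' edge rows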
This does have to happen after every iteration (tick of time), so many thousands of haloswaps will occur during the simulation. This would be absolutely impractical to visualise, so there needs to be some kind of limitation on what is shown. For the ease of getting a nice visual output in this project, I’ve just shown every 100th haloswap, a number which would vary depending on the speed of the simulation. Better solutions exist – you could calculate the speed of showing the visualisation vs a sample speed of the visualisation, and show them at the correct rate to keep them in sync, for example, but this is a complex operation given how little it offers in return.
In order to show the haloswap, I added a custom function to my Wee MPI framework (now available in Fortran, as that’s what this and many other HPC simulations are written in), which allows a user to play any named animation on the server, allowing lots of easy flexibility for custom animations such as that shown above.
A Web Interface?
In my previous post, I discussed the creation of a web framework for running simulations. To recap on the current situation (the framework used for the current version of the wave simulator), all computations are managed by a web server (using Flask) on the top left Raspberry Pi. This Pi dispatches simulations and collects results using MPI, and then continuously returns these results over the network. Currently the client is also a Python application with an interface made using WxWidgets, but this doesn’t have to be the case!
Developed as a NodeJS package, the network code has been ported from the Python client to JavaScript – meaning I can now start and download simulations in the browser. I used React to build the user interface around this and the website it's embedded in (optimised and built by Gatsby). With this, I'm able to start all of my MPI tutorial demonstrations from the webpage directly and, in theory, also the wave demonstration.
Homepage
MPI Tutorial
Coastal Defence Demonstration
The website is mainly functional – I didn’t spend long perfecting looks. As the focus of my project is the visualisation of the parallel communications, I didn’t want to spend too long on web development! I developed a simple version of the whole site, with placeholders in place of some things. For example, while network communications are fully implemented for the MPI demonstrations, handling data in the browser is not the same as in Python!
As a result, I wouldn’t be able to fully re-implement the wave simulation without a bit more work than I had time for, though you can see an initial version, which allows you to place blocks and shows the benefit of using a webpage as the user interface. Many further features could be added from here, such as a detailed control panel for Wee Archie operable by a student or non-HPC expert, further tutorials and demonstrations, even a virtual version of Wee Archie – WebGL is getting pretty fast nowadays.
Alternatives include a more modern UI framework than WxWidgets, but that would demand more specialised knowledge from the average Wee Archie developer than web development does. I'll leave the final decision on the path to take to the maintainers of Wee Archie here at EPCC!
Goodbye!
I’d like to thank everybody here at EPCC and PRACE for such a wonderful experience, in particular my mentor, Dr. Gordon Gibb. I’ve learned a lot, from MPI to video editing in these two months, and I’m sure that I’ll never program in serial again if I can help it.
Before too long, all of my code will be available on the Wee Archie Github repository. If you’d like to see my final video presentation, I’ve embedded it below. My final report will be available on this site also. If you’d like to follow my work in future, check out my website!
I hope you’ve enjoyed my blog posts, goodbye for now!
In my last blog post I introduced the Intel Movidius Neural Compute Stick and sketched out a rough idea of what it is good for. In this post, I would like to build on that and describe another very important component that accompanies the Neural Compute Stick: the OpenVino toolkit.
To make use of the Neural Compute Stick, it is necessary to first install the Open Visual Inferencing and Neural Network Optimization (OpenVino) toolkit, which I briefly introduced in my last post. This toolkit aims at speeding up the deployment of neural networks for visual applications across different platforms, using a uniform API.
In my last blog post I also touched on the difference between training a Neural Network and deploying it. To recap, training a Deep Neural Network is like taking a "black box" with a lot of switches and then tuning those for as long as it takes for this "black box" to produce acceptable answers. During this process, such a network is fed with millions of data points. Using these data points, the switches of the network are adjusted systematically so that it gets as close as possible to the answers we expect. This process is computationally very expensive, as data has to be passed through the network millions of times. GPUs perform very well for such tasks and are the de facto standard when it comes to the hardware used to train large neural networks, especially for tasks such as computer vision. As for the frameworks used to train Deep Neural Networks, there are many that can be utilized, such as Tensorflow, PyTorch, MxNet or Caffe. Each of these frameworks yields a trained network in its own inherent file format.
After the training phase is completed, a network is ready to be used on yet unseen data to provide answers for the task it was trained for. Using a network to provide answers in a production environment is referred to as inference. In its simplest form, an application will just feed a network with data and wait for it to output the results. However, there are many steps in this process that can be optimized, which is what so-called inference engines do.
Now where does the OpenVino toolkit fit in with respect to both training and deploying a model? The issue is that algorithms like Deep Learning are not only computationally expensive during the training phase, but also upon deployment of a trained model in a production environment. Ultimately, the hardware on which an application utilizing Deep Learning runs is of crucial importance. Neural Networks, especially those used for visual applications, are usually trained on GPUs. However, using a GPU for inference in the field is very expensive and doesn't pay off when combined with an inexpensive edge device. It is definitely not a viable option to use a GPU that might cost a few hundred dollars to build a surveillance camera or a drone. When it comes to using Deep Neural Networks in the field, most of them are actually running on CPUs. Now, there are a lot of platforms out there using different CPU architectures, which of course adds to the complexity of developing an application that runs on a variety of these platforms.
That's where OpenVino comes into play: it provides a unified framework for development which abstracts away all of this complexity. All in all, OpenVino enables applications utilizing Neural Networks to run inference on a heterogeneous set of processor architectures. The OpenVino toolkit can be broken down into two major components, the "Model Optimizer" and the "Inference Engine". The former takes care of the transformation step that produces an optimized Intermediate Representation of a model, which is hardware-agnostic and usable by the Inference Engine. This implies that the transformation step is independent of the hardware that the model will eventually run on; it depends solely on the model to be transformed. Many pre-trained models contain layers that are important for the training process, such as dropout layers. These layers are useless during inference and might increase the inference time. In most cases, these layers can be automatically removed from the resulting Intermediate Representation.
Depiction of a workflow utilizing the Model Optimizer and Inference Engine.
Even more so, if a group of layers can be represented as one mathematical operation, and thus as a single layer, the Model Optimizer recognizes such patterns and replaces these layers with just one. The result is an Intermediate Representation that has fewer layers than the original model, which decreases the inference time. This Intermediate Representation comes in the form of an XML file containing the model architecture and a binary file containing the model's weights and biases.
Locality of different software components when using the OpenVino toolkit in combination with an application.
As shown in the picture above, we can split the locality of an application and the steps that utilize OpenVino into two parts. One part runs on a host machine, which can be an edge device or any other device "hosting" an accelerator such as the Neural Compute Stick. The other part runs on the accelerator itself.
After using the Model Optimizer to create an Intermediate Representation, the next step is to use this representation in combination with the Inference Engine to produce results. Broadly speaking, the Inference Engine is a set of utilities that allow Deep Neural Networks to run on different processor architectures. This way, a developer is capable of deploying an application on a whole host of different platforms while using a uniform API. This is made possible by so-called "plugins". These are software components that contain complete implementations for the inference engine to be used on a particular Intel device, be it a CPU, an FPGA or a VPU as in the case of the Neural Compute Stick. Eventually, these plugins take care of translating calls to the uniform API of OpenVino, which are platform-independent, into hardware-specific instructions. This API encompasses capabilities to read in the network's Intermediate Representation, manipulate network information and, most importantly, pass inputs to the network once it is loaded onto the target device and get outputs back again. A common workflow for using the inference engine includes the following steps (a short code sketch follows the list):
Read the Intermediate Representation – Read an Intermediate Representation file into an application which represents the network in the host memory. This host can be a Raspberry Pi or any other edge or computing device running an application utilizing the inference engine from OpenVino.
Prepare input and output format – After loading the network, specify input and output dimensions and the layout of the network.
Create Inference Engine Core object – This object allows an application to work with different devices and manages the plugins needed to communicate with the target device.
Compile and Load Network to device – In this step a network is compiled and loaded to the target device.
Set input data – With the network loaded on to the device, it is now ready to run inference. Now the application running on the host can send an “infer request” to the network in which it signals the memory locations for the inputs and outputs.
Execute – With the input and output memory locations now defined, there are two execution modes to choose from:
Synchronously to block until an inference request is completed.
Asynchronously to check the status of the inference request while continuing with other computations.
Get the output – After the inference is completed, get the output from the target device memory.
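To make these steps concrete, here is a minimal sketch using the pre-2022 OpenVino Python API (openvino.inference_engine). The model paths, input shape and device name (“MYRIAD” is the plugin name for the Neural Compute Stick) are placeholder assumptions, and exact attribute names vary slightly between OpenVino releases, so treat this as an illustration of the workflow rather than a drop-in snippet:

```python
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()                                                  # Inference Engine Core object
net = ie.read_network(model="model.xml", weights="model.bin")  # read the IR pair (XML + weights)

input_name = next(iter(net.input_info))                        # first (and here only) input
output_name = next(iter(net.outputs))

exec_net = ie.load_network(network=net, device_name="MYRIAD")  # compile and load onto the NCS

image = np.zeros((1, 3, 224, 224), dtype=np.float32)           # placeholder input (assumed NCHW shape)

# Synchronous mode: blocks until the inference request is completed.
result = exec_net.infer(inputs={input_name: image})
print(result[output_name].shape)

# Asynchronous mode: start the request, keep computing, then wait for it.
exec_net.start_async(request_id=0, inputs={input_name: image})
exec_net.requests[0].wait()                                    # outputs are then read as above
```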
After laying the foundations for working with the Intel Movidius Neural Compute Stick and the OpenVino toolkit in this and my last blog post, I will elaborate on what the “dynamic” in my project title stands for. So stay tuned for more!
First of all, I wanted to share the final presentation of my project, in which some of the concepts mentioned in previous articles are explained in a more visual way.
After two articles in which the basics of aerodynamics and CFD in HPC systems were discussed and a presentation in which everything was wrapped up, now it is the perfect occasion to build on top of all that and take a different approach. This article will get more technical than the previous ones and will basically cover how to have a good start in CFD for Formula Student (or any other similar project).
Motivation: Reproducibility of the Results
One of the motives for writing this article is to make it easier for anyone wishing to perform a study similar to the one I developed during this summer. The scripts and modified pieces of code used throughout the project will not be shared here, although they can be provided to anyone who is interested. This means that the article only covers the geometry preparation, not the CFD setup itself, for which information is available in many blogs such as CFD Online.
The list of software used along the project contains:
SolidWorks to handle the CAD files
Oracle VM VirtualBox Manager to set up a Linux-Ubuntu Virtual Machine
OpenFOAM to perform all the CFD meshes and simulations
ParaView and EnSight to carry out the post-processing
How to Start? Pre-processing of the Geometry
The basis for obtaining a sufficiently accurate result is building an adequate mesh. But an adequate mesh can never be achieved without good pre-processing of the solid (car) geometry.
The starting point of the project was a complex CAD version of the TU Ostrava Formula Student car in STP format. However, it must first be acknowledged that CAD formats cannot be used directly as input to generate a mesh: what the meshing algorithm needs is a cloud of points defining a surface. CAD files present two problems: (1) they do not contain a set of points, but rather a set of operations and mathematical functions that define the geometry of the solids; and (2) they contain unimportant information, such as the material or colour of the different parts, which is meaningless for the CFD process.
The geometry of the car should, therefore, be converted into a more suitable format such as STL. This was achieved by importing the STP file into SolidWorks and then exporting it as an STL file. The exporting process is also important:
The small details that are not relevant to the solution should be removed; overall, the geometry must be carefully cleaned. This includes all kinds of bolts, nuts and screws on the car, for example. It is true that they can have a small effect on the aerodynamic performance of the car, but it is not significant enough to justify the associated increase in computational time and resources.
The file must be exported as an ASCII STL (rather than binary) for functionality reasons: among other things, the named solids are preserved and the part files can later be merged with a simple script.
The reference coordinates system must be chosen carefully: it will define the absolute coordinates of the file and its location in the mesh.
The export quality should be high enough; otherwise it will be obvious, even at first sight, that the geometry is not continuous. The mesh will never improve a bad STL surface: at best, it will keep up with it. To get this right, one must analyse the maximum accuracy that will be achievable on the surfaces of the body in the meshing process, and then generate the surface file from the CAD in such a way that the characteristic size of its elements is smaller than that of the smallest mesh elements (so that the geometry is not the limiting factor), but not much smaller (to avoid a very large geometry file that will slow down the whole process). It could take several iterations to find the sweet spot.
The geometry assembly (in this case the car) should not be saved as a single STL file, since this would make it impossible to analyse the forces generated on each of the parts separately. The best approach is, therefore, to save the geometry in as many files as there are relevant subdivisions of the geometry. In the formula car case, the geometry was split into the following parts:
Front wheels
Rear wheels
Suspension assembly
Chassis assembly
Engine
Driveshaft
Driver
Front wing
Rear wing
Underbody
Splitting the geometry into different parts implies that they have to be merged to be input as a single file (the internal split into the different parts will be kept). This is done via a merging bash script.
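The bash script itself is not shared here, but the idea is straightforward: an ASCII STL file is a list of named solid … endsolid blocks, so the part files can be concatenated into one multi-solid file, with each block renamed after its part so the regions stay distinguishable later on. A rough Python sketch of the same idea (file names are placeholders) could look like this:

```python
from pathlib import Path

parts = ["front_wheels.stl", "rear_wheels.stl", "front_wing.stl",
         "rear_wing.stl", "underbody.stl"]          # ...plus the remaining part files

with open("car.stl", "w") as merged:
    for part in parts:
        name = Path(part).stem                      # e.g. "front_wheels"
        with open(part) as f:
            for line in f:
                stripped = line.strip()
                if stripped.startswith("solid"):    # rename the opening tag of the block
                    merged.write(f"solid {name}\n")
                elif stripped.startswith("endsolid"):
                    merged.write(f"endsolid {name}\n")
                else:
                    merged.write(line)
```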
Once all these steps have been fulfilled, the STL file is ready to be used.
Conclusions
If these steps are followed carefully, a base will have been built to start performing CFD simulations.
The next step would be to follow one of the numerous OpenFOAM tutorials to install the software. Subsequently, one can start running actual simulations and adjusting their parameters to improve the modelling of the problem until a satisfactory level of accuracy is reached. This article is just an introduction, but more details can be found both in the final presentation and the final report of the project.
The time has come to talk about parallelization and MPI, the heart of our topic and my project. First, the problem the parallel Radix Sort can solve will be presented. Then, we will look at the algorithm before talking about my implementation. Lastly, performance results will be presented.
Problem the parallel Radix Sort can solve
Every day we collect more and more data to be analyzed. With today’s large-scale data and large-scale problems, parallel computing is a powerful ally and an important tool. While processing data, one recurrent step is sorting. In my second blog post, I highlighted the importance of sorting algorithms. Parallel sorting is one of the important components of parallel computing. It allows us to reduce the sorting time and to sort more data, amounts that can’t be sorted serially. Indeed, when we want to sort a huge amount of data, it may not fit on one single computer, because computers are memory bound. Or it may take too much time using only one computer, time we can’t afford. Thus, for memory and time reasons, we can use distributed systems, have our data distributed across several computers and sort them that way. But how? Let’s first define the context and what we want.
What we have
P resources able to communicate between themselves, store and process our data. They can be CPUs, GPUs, computers… Each resource has a unique rank between 0 and P-1.
N data, distributed equally across our resources. This means that each resource has the same amount of data.
The data are too huge to be stored or processed by only one resource.
What we want
Sort the data across the resources, in a distributed way. After the sort, resource 0 should have the lower part of the sorted data, resource 1 the next and so on.
We want to be as fast as possible and to use as little memory as possible on each resource.
Like all distributed and parallel systems, one issue we have to be careful about is communication between the resources. It has to be kept as minimal as possible to make parallel algorithms efficient; otherwise, we spend too much time in communication instead of treating the problem itself. Through communication, we can exchange data and information between the resources. The more data we exchange, the more time it takes.
There are a number of parallel sorts, among them sample sort, bitonic sort, column sort, etc. Each of them has its pros and cons. But, as far as I know, not many of them satisfy all of the requirements above. Often, they need to gather all the data, or a huge part of it, on one resource at one point or another to be efficient. This is not suitable. They can be arranged so as not to gather the data on one resource, but then, most of the time, they are no longer efficient enough because of the communication overhead involved. The parallel Radix Sort is one of those that meet our requirements while being efficient. It is currently known as the fastest internal sorting method for distributed-memory multiprocessors.
Now, we will use the word processor instead of resource because it is the word often used in HPC.
Parallel Radix Sort
I recommend reading my two previous blog posts, where I detailed the serial Radix Sort, because the parallel version is entirely based on it. So if you don’t know the Radix Sort, it is better to read them first. The notations used here have been introduced in the previous posts.
In general, parallel sorts consist of multiple rounds of serial sort, called local sort, performed by each processor in parallel, followed by movement of keys among processors, called the redistribution step. Local sort and data redistribution may be interleaved and iterated a few times depending on the algorithms used. The parallel Radix Sort also follows this pattern.
We assume that each processor has the same amount of data; otherwise, the processors’ workload would be unbalanced because of the local sort. Indeed, if one processor had more data than the others, it would take more time to complete its local sort and the total sorting time would be greater. However, if the data are equally distributed, the processors will take the same time to complete their local sort and none of them will have to wait for another to finish. As a result, the total sorting time will be shorter.
The idea of the parallel Radix Sort is, for each key digit, to first sort locally the data on each processor according to the digit value. All the processors do that concurrently. Then, to compute which processors have to send which portions of their local data to which other processors in order to have the distributed list sorted across processors and according to the digit. After iterating for the last key digit, the distributed array will be sorted as we want. Below is the parallel algorithm.
Input: rank (rank of the processor), L (portion of the distributed data held by this processor)
Output: the distributed data sorted across the processors with the same amount of data for each processor
1. for each key digit i, where i varies from the least significant digit to the most significant digit:
2. use Counting Sort or Bucket Sort to sort L according to the i’th key digit
3. share information with the other processors to figure out which local data to send where, and what to receive from which processors, in order to have the distributed array sorted across processors according to the i’th key digit
4. proceed with these exchanges of data between processors
Each processor runs the algorithm with its rank and its portion of the distributed data. This is a “high-level” parallel Radix Sort algorithm. It describes what to do but not how to do it because there are so many ways of doing the steps 3. and 4. depending on many parameters like the architecture, environment and communication tool used. Let’s go through an example and then I will explain how I have chosen to implement it.
Figure 1. Unsorted versus sorted distributed array across three processors. Our goal with the parallel Radix Sort is to get the sorted one from the unsorted. Source: Jordy Ajanohoun
We start with the distributed unsorted list above to run our parallel Radix Sort. P equals 3 because there are 3 processors and N equals 15. We will run the example in base 10 for simplicity but don’t forget that we can use any number base we want and in practice, base 256 is used as explained in my previous posts. Also for simplicity, we deal with unsigned integer data and the sorting keys are the numbers themselves. To sort signed integers and other data types, please refer to my previous article.
Figure 2. First iteration of the parallel Radix Sort on a distributed array across three processors. The numbers are sorted according to their base 10 LSD (Least Significant Digit). The first step (local sorting) is done concurrently by all the processors. The second step (data redistribution) can be computed and managed in several ways depending on the architecture, environment and communication tool used. One iteration according to their base 10 MSD is remaining to complete the algorithm and get the desired final distributed sorted array. Source: Jordy Ajanohoun
The keys having fewer digits (2, 1 and 5) have been extended to the same number of digits as the maximum in the list (two here) by adding zeros as MSD. If you don’t understand why, please refer to my second article; it doesn’t change the value of the keys. The challenging part when implementing can be the redistribution step. We have to think about which information a given processor needs from the other processors to figure out where to send its local data. It is not complicated: we want the distributed array to be sorted according to one particular key digit (the i’th digit). In our example, the value of a digit is between 0 and 9. We have sorted the data locally on each processor; therefore, each processor knows how many of its keys have their i’th digit equal to a given value. By sharing this information with the other processors and getting their information too, we can determine which data each processor has to send and receive. In the example above, all the processors (from rank 0 to P-1) first have to send their local data whose key LSD equals 0 to processor p0, until it can’t receive any more data because it is full. There is no such data, so we continue with the next digit value, which is 1. Processor p0 keeps its data whose key LSD equals 1, then receives those from processor p1, and finally those from processor p2, which has none. The processor order really matters and has to be respected. Once done, we repeat the same with the value 2, then 3 and so on until 9. When p0 is full, we continue by filling p1 and so on. Careful: the redistribution step has to be stable too, like the sort used in each iteration of the Radix Sort, and for the same reasons. This is why we said the order matters and has to be respected; otherwise it doesn’t sort.
The algorithm is still correct if we swap the local sort step and the data redistribution step. But, in practice, this is not suitable because, to send data efficiently, we often need the data to be contiguous in memory. Two keys with the same i’th digit will most probably be sent to the same processor, so it pays to sort the data locally first.
We still have one iteration to do according to the algorithm. Let’s finish.
Figure 3. Last iteration of the parallel Radix Sort on a distributed array across three processors. The numbers are sorted according to their base 10 MSD (Most Significant Digit). The first step (local sorting) is done concurrently by all the processors. The second step (data redistribution) can be computed and managed in several ways depending on the architecture, environment and communication tools used. At the end, the algorithm is completed and the distributed array is sorted. Source: Jordy Ajanohoun
We have done the same thing but according to the last digit this time. The algorithm is now finished and, good news, our distributed array is sorted as we wanted.
I have presented here the parallel LSD Radix Sort. But, like the serial version, there is a variant called parallel MSD Radix Sort, which is the parallelization of the serial MSD. I have implemented both to sort the int8_t data type, and my performance results were better with my LSD version. This is why I continued with the LSD version to generalize it to other integer sizes. It is also the reason why, from the beginning, I have focused on presenting the LSD version and have not gone into much detail about the MSD.
My implementation
MPI is used for communications between processors.
I have used the Counting Sort and not the Bucket Sort because my implementation with the Bucket Sort had extra overhead due to memory management. Indeed, unless we make a first pass through the keys to count them before moving them into the buckets, we don’t know the length of the buckets in advance. Therefore, each of my buckets was a std::vector and, although they are very well implemented and optimized, I still lost performance due to memory management. The problem is absolutely not the std::vector class; it comes from the fact that, on each processor, each bucket has a different size, depending on the key characteristics, and we can’t predict them. So, instead of making a first loop to count and find out the length of each bucket before creating them with appropriate sizes, I opted for the Counting Sort, which is ultimately almost the same because we also count before moving the data. The difference is that we don’t use buckets anymore; we use a prefix sum instead.
To do step 3 of the algorithm, I save the local counts from the Counting Sort on each processor and share them with the other processors via MPI_Allgather. In that way, each processor knows how many keys with their i’th byte equal to a given value there are on each processor and, from that, it is easy to figure out where to send which data, as explained in the example of the Parallel Radix Sort section (just above this “My implementation” section).
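To make steps 2 and 3 more concrete, below is a simplified NumPy/mpi4py illustration of a single pass over the least significant byte. My actual implementation is in C++; this sketch (run with e.g. mpirun -np 4 python radix_step.py, the data sizes and seeds being arbitrary) only shows how the allgathered counts tell each rank where its digit buckets start in the globally sorted order, which is the information needed to plan the sends of step 4:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
P = comm.Get_size()

rng = np.random.default_rng(seed=rank)
local = rng.integers(0, 2**32, size=1000, dtype=np.uint32)  # this rank's share of the data

shift = 0                                                   # current pass: least significant byte
digits = ((local >> shift) & 0xFF).astype(np.int64)

# Step 2: stable local sort by the current base-256 digit.
order = np.argsort(digits, kind="stable")
local, digits = local[order], digits[order]

# Step 3: every rank learns how many keys with each digit value every other
# rank holds (the MPI_Allgather of the 256 local counts).
local_counts = np.bincount(digits, minlength=256).astype(np.int64)
all_counts = np.empty((P, 256), dtype=np.int64)
comm.Allgather(local_counts, all_counts)

# Global starting position of each digit bucket, then of *this* rank's portion
# of it: all smaller digits come first, then the same digit on lower ranks.
digit_starts = np.concatenate(([0], np.cumsum(all_counts.sum(axis=0))[:-1]))
my_starts = digit_starts + all_counts[:rank].sum(axis=0)

# From my_starts and the destination boundaries k * N / P, each rank can derive
# the counts/offsets for step 4 (MPI_Alltoallv or one-sided puts).
print(rank, my_starts[:4])
```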
Step 4 is managed using MPI one-sided communications instead of send and receive, which are two-sided communications. I have tried both, and most of the time the performance was either similar or better with one-sided communications. MPI one-sided communications are more efficient when the hardware supports RMA operations. We can use them in step 4 of the parallel Radix Sort because we don’t need synchronization for each data movement, only once they are all done. The data movements in step 4 are totally independent; they can be made in any order as long as we know where each element has to go.
In terms of memory, on each processor, I am using a static array (std::array) of 256 integers for the local counts. In addition to that, I have the input array “duplicated”. The original local array is used to receive the data from other processors in step 4. Thus, at the end, the array is sorted in this same input buffer. The copy is used to store the local sorted data and send them to the right processor. It is possible to implement it without duplicating the array, but a huge amount of communications and synchronizations will be necessary and the communications won’t be independent anymore. In my opinion, it is too much time lost for the memory gain.
As said previously, there are several ways of doing steps 3 and 4. It is also possible, for example, to build the global counts across processors (for step 3) using other MPI collective communications like MPI_Scan. To send the data in step 4, we can also use MPI_Alltoallv instead of one-sided communications, but it requires sorting the data locally again after receiving. I tried several alternatives, and what I have presented here is the option that gave me the best performance.
Performance analysis
All the benchmarks presented in this section were run on the ICHEC Kay cluster. The framework used to get the execution times is Google Benchmark. The numbers to sort are generated with std::mt19937, a Mersenne Twister pseudo-random generator of 32-bit numbers with a state size of 19937 bits. For each execution time measurement, I use ten different seeds (1, 2, 13, 38993, 83030, 90, 25, 28, 10 and 73) to generate ten different arrays (of the same length) to sort. Then, the mean of the execution times is recorded as the execution time for that length. These ten seeds were chosen randomly. I proceed like this because the execution time also depends on the data, so it gives a more accurate estimate.
Strong scaling
Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size.
Figure 4. Sorting time of 1GB int8_t as a function of the number of processors. dimer::sort is my parallel Radix Sort implementation. The execution times given using boost::spreadsort and std::sort are for one processor only. They appear here just as a reference to see the gain we have with dimer::sort. They are not parallel algorithms, therefore, strong scaling doesn’t make sense for them contrary to dimer::sort. The perfect scalability curve represents the best execution time we can reach for each number of processors. Theoretically, it is impossible to do better than this perfect scalability. It is defined as the best serial execution time for the problem, divided by the number of processors.
We can notice that my implementation seems good because, in this case, it is better than std::sort and boost::spreadsort whatever the number of processors used. Plus, it seems to be very close to the perfect scalability. My implementation here is faster than boost::spreadsort even in serial (with only one processor) because when the array is not distributed, I am using ska_sort_copy instead of boost::spreadsort. ska_sort_copy is a great and very fast implementation of the serial Radix Sort. Actually, it is the fastest I have found. This is why it is my reference to compute the perfect scalability.
The problem with the previous graph is the scale. It is difficult to distinguish the perfect scalability and dimer::sort curves because their times are very small compared to the std::sort execution time. This is why, in computer science, we often use a log-log scale to plot execution times.
Figure 5. Exactly the same as Figure 4 but in log-log scale this time.
With a log-log scale we can see better what is happening. Now we are able to tell that, up to 8 processors, the perfect scalability and dimer::sort curves are the same. It means that, up to 8 processors, theoretically we can’t do better than what we already have to sort 1GB of int8_t. But from 8 processors onwards this is no longer the case, most probably due to communication overhead as more and more processors are involved. Let’s verify it with another graph.
Distribution of execution time
Figure 6. Distribution of dimer::sort execution time as a function of the number of processors. The same 1GB int8_t is sorted but with different numbers of processors. This 1GB int8_t is the same for figures 5 and 4.
Indeed, when the number of processors increases, the communication part in the total execution time increases too. The local sort step (in blue) contains zero communication. All communications are done during the data redistribution step (in orange). These communications are one MPI_Allgather and several MPI one-sided communications.
Until now, we have seen performance results only for int8_t. It is good but limited because the int8_t value range is too small and not representative of real-world problems and data. What about int32_t which is the “normal” integer size?
int32_t, sorting time as a function of the array length
Figure 7. Sorting time as a function of the number of items to sort. The items here are 4-byte integers. The dimer::sort curve shows the execution times of my parallel Radix Sort implementation run with 2 processors.
Figure 8. Same as Figure 7 but in log-log scale.
The performance is also satisfying. Using dimer::sort, we go faster than anything else once the array is big enough, and this is what we expect when we use HPC tools. For small array sizes, we are not really expecting an improvement because they can be handled serially easily and quickly enough. Plus, we add extra steps to treat the problem in a parallel way; therefore, when the problem size is small, it is faster serially because we don’t have these extra steps. It becomes faster in parallel when these extra steps are a small workload compared to the time needed to sort serially.
Demonstration
Let’s finish by visualizing my implementation! We easily distinguish the two different steps of the algorithm in this visualization.
Visualizing dimer::sort (with 3 processors) using OpenGL
Just a quick info: This blog post was written by Allison Walker and Kara Moraw. Due to a missing plugin, the shared authorship is not displayed correctly, so we’re letting you know like this instead.
** This post is intended for future participants in the SoHPC program, and for anyone planning a visit to Amsterdam. **
SURFsara
SURFsara is a member of the SURF cooperative. It brings together many educational and research institutions in the Netherlands to drive innovation and digitization. SURFsara is an institution that provides a variety of computing services: networks, storage, visualization and, of course, supercomputers.
SURFsara’s supercomputer is called Cartesius. It ranks 455 on the TOP500 list and 158 on the GREEN500 list. It offers Dutch researchers the opportunity to process their data and compute large simulations on national resources. It is designed to be a well-balanced system for maximum usability.
For the hardware nerds among you, here are the juicy details: Cartesius runs on different Bull nodes with the bullx Linux operating system. The nodes sum up to 47,776 cores and 132 GPUs with a theoretical peak performance of 1.843 Pflop/s. Its power consumption is about 706.00 kW, and in case there is a power cut in Amsterdam, there sits a power generator ready to keep the systems up for 24 more hours. Brace yourself for an insanely high number: This power generator would use 500,000 tons of diesel… Let’s hope they’ll never have to use it!
Photo gallery: Cartesius, its cables, cooling and cases, and the back of the machine. A supercomputer is insanely loud…
We have had a really wonderful time working here. The people are lovely, the environment is open, and coffee is better than average. Our only complaint: the air conditioning seems to be consistently a few degrees too cold.
Airbnb
Possibly the most important piece of advice that we can offer is: book your accommodation early! Amsterdam is one of the most popular tourist destinations in Europe, and the Summer of HPC program takes place during its busiest months: July and August. For a variety of reasons, our accommodation was not arranged until mid June, and as a result we had to pay a pretty exorbitant price! But be prepared, even an early booking doesn’t mean you will get a great deal. Amsterdam is so popular that hosts can pretty much charge what they want. Also, you can’t really be picky about the location – in our case, we’re lucky to live quite close to the centre in Amsterdam West, but on the other hand, cycling to work takes us 40 minutes every day.
If you know someone in Amsterdam, maybe ask them to look at notice-boards at the University of Amsterdam. It is just across from SURFsara and so is their housing. Maybe you’re lucky and find some student renting out their apartment over the summer.
Bike etiquette
Amsterdam is the fifth most bicycle-friendly city in the world. There are more than 400 km of bike paths, endless parking options and many bike rental companies catering to locals, expats, and tourists alike. Given the distance from our accommodation to SURFsara, the high cost of public transport, and our general eagerness to live this summer as the Dutch do, we chose to cycle to and from work each day (10 km and 40 minutes in each direction). We rented some great little bikes through Swapfiets for less than 20 euro per month. Swapfiets is a great option for anyone visiting the city for more than a few weeks: they provide well-maintained bikes and a repair service that will come to you.
Meet our Swapfiets bikes – Poppy and Sally!
Of course, it’s important to be aware of the rules in Amsterdam. Cyclists rule the roads. They (we) have little tolerance for pedestrians or fellow cyclists who don’t know what they’re doing, so if you don’t learn the rules quick smart you risk getting knocked over:
Keep to the right. Cyclists often pass each other, and you don’t want to be the person holding up all of the people behind you on the bike path.
Abide by the road rules! Even if the locals don’t…Stop at the lights, watch out for trams and tram tracks, don’t cycle on footpaths, signal when turning, make sure you have good lights.
Always lock your bike (twice). Amsterdam is notorious for bike theft, so take the necessary precautions.
The Buienradar weather app was our saviour this summer. Amsterdam is known for its rainy and unpredictable weather, so it’s always best to check the forecast before heading out on a ride.
Public Transport
Of course, you want to be a real Amsterdam local so you’ll mostly find your way through the city cycling. It’s fast, it’s cheap, and it’s a nice exercise. Oh, and it’s a very good way of getting rid of your sleepy mood in the morning. But you’ll find that summer days in the Netherlands can be quite rainy. You’ll get used to a light rain, but you should really stay off your bike when it’s pouring! Just take your favourite book and catch a tram.
Public transport in the Netherlands is a bit different from what you might be used to. For trams, buses and metros, you can’t get a ticket based on where you’re going. You have got two options:
Get a one-hour-ticket for 3,20 €. For this, just get in at the front and ask the driver for it. You can only pay by card! 1 € of this amount is a fee because you’re using a paper ticket (shame on you!).
Get an OV-Chipkaart. You can get it at all the bigger stations (e.g. Science Park Station close to SURFsara), and it’s 7,50 €. With the OV-Chipkaart, you check in when you enter your chosen means of transport and check out when leaving. You only pay for the distance travelled, which means at least 1 € less than with a one-hour-ticket because you’re not using a paper ticket. You pay with the OV-Chipkaart itself, it works just like a Prepaid SIM card. You can top it up at a lot of supermarkets and tram stations. For checking into a tram/metro/bus, you need to have at least 4€ credit on it. You can check your credit online. At the end of your stay, you can get the rest of the money you have on it at the NS office of Amsterdam Centraal, as long as it’s less than 30 €. Or just come back within the next five years and use it again!
Trains:
If you opted for an OV-Chipkaart (which we definitely recommend), you can also use it on trains. Beware that to check into a train, you need to have a minimum credit of 20 €. If you buy a one-way or day train ticket instead, you again pay the extra fee of 1 € for using a paper ticket.
Weekend tickets:
If you have friends or family coming to visit for the weekend, these are all good options (depending on what you want to do):
Have them rent a bike. Renting a bike for a day is about 10 € and you can be sure there’s at least three different places to do that on the way from your home to the next supermarket.
The multi-day ticket. It can be bought at all metro stations and is valid in all metros, trams and buses in Amsterdam. There are different options but the 72-hours-version costs 19 €.
The Amsterdam & Region travel ticket. It takes you a bit further than the multi-day-ticket, e.g. to the beach or Haarlem. Again, there are different options but the 3-day-version is 36,50 €. It can be purchased in I Amsterdam visitor centres, at Amsterdam Centraal and other places. Just google it.
Exploring A’dam
As mentioned, Amsterdam is one of the most popular European destinations, and with good reason. The city truly is beautiful and culturally very diverse. The museums, sites, and shopping are enough to draw any adventurer to this beautiful town. Below are some of our favourite experiences as Amsterdam expats:
Photo gallery: the Amsterdam Canal Pride Parade, canals of Amsterdam, and a particularly beautiful house boat.
The A’dam tower
The flower market
Explore the canals (if you’re here during the Pride parade be sure to pay a visit!)
Anne Frank house
Rijksmuseum
Van Gogh museum
Stedelijk museum
Food
Look, ‘Dutch’ isn’t necessarily a cuisine that comes to mind when you think about interesting/delicious/exotic dishes. But saying that, we have discovered some delicious treats that are native to the country. Our must-tries are:
Bitterballen: breaded and fried meatballs, often served with mustard.
Herring: a classic Dutch street food, typically served on its own with onion and pickles, or in a sandwich. Herring can be found all over Amsterdam.
Dutch Pancakes: bigger and less fluffy than their American counterparts, these pancakes can be topped with both sweet and savoury toppings. Try out the Pannenkoekenboot (Pancake boat) for 75 minutes of all-you-can-eat pancakes while cruising around Amsterdam.
Croquettes: you can find these in all varieties; meat, cheese, fish…
Cheese (Gouda being the classic Dutch variety)
Rijsttafel: directly translated to rice table, this Indonesian food is traced back to colonial history between the Netherlands and Indonesia. This dish is made up of dozens of small and shareable dishes.
Stroopwafel: two waffles with a gooey caramel centre, mmmmm.
Day trips
Something you’ll notice right away is that Amsterdam gets very busy, especially on the weekends. It’s not the ~800,000 citizens, it’s the 18 million tourists. It’s insane how many tourists roam around the city centre, and that’s why Amsterdam stopped advertising the city as a tourist destination. If you’re looking for a quieter weekend, consider exploring other parts of Holland. We got more suggestions from our colleagues here at SURFsara than we could fit into our two months, just take a look!
Utrecht:
We went to Utrecht on a sunny Sunday to avoid the crowds in Amsterdam. We took a train from Amsterdam Centraal which cost us about 9€ and arrived there within an hour. Something to have in mind about day trips on Sundays: Most of the shops open between 11 and 12. This means that until about lunchtime, the city is sleepy and quiet. If you want to enjoy a nice walk through the centre, just arrive before that, but if you’re out for a shopping trip, don’t bother arriving before 12.
Street in Utrecht
Canal in Utrecht
Trash Whale in Utrecht
After grabbing a coffee at Anne&Max, we stumbled upon the Domtoren, a gothic bell tower of 465 steps, and the Domkerk itself. Around the corner, there is also a cozy little garden. From there on, we just wandered around, but there’s two places you should definitely check out: De Zakkendrager is a restaurant with an exquisite Bittergarnituur, a collection of Dutch food. This was our first taste of Dutch food, and we loved it! Later, we stopped at the famous Dapp Frietwinkel for some Dutch fries (eat them with mayonnaise, that’s what makes them Dutch). Allegedly, you can get the best fries of all Netherlands there!
The Hague:
The Hague (or Den Haag, as the Dutch call it) is a bit further away than Utrecht. We took a train from Amsterdam Sloterdijk for about 12€ and arrived an hour and a half later. From the station, it’s a quick walk into the centre. Our highlights were the Peace Palace (home to the international law administration) which has two towers that look like they were inspired by the Disney castle, and the Binnenhof and Buitenhof, two historical squares just where the parliament and the Prime Minister of the Netherlands meet. We also enjoyed some herring just in front of Binnenhof – we were told this is where the politicians go for lunch, and we can’t blame them: it was delicious.
Den Haag
Peace Palace Den Haag
Parliament Den Haag
Haarlem:
Haarlem is a lovely town right next to Amsterdam: It’s perfect when you have a friend visiting for the weekend or just aren’t up for a full day trip. You can easily bike to Haarlem, but if you choose to go on a day with winds gusting up to 90km/h like us, you might want to opt for the train instead. The Sprinter from Amsterdam Sloterdijk costs 3,70€ and is there in less than 10 minutes. South from the station is the Nieuwe Gracht where the historical centre begins. East from here you can find a beautiful windmill (you can also enter it, but that’s 5€), and then turn west into the centre. All the surrounding streets and alleys are lovely, quiet, cozy and adorned with greenery. Towering over all the houses is the Grote Kerk (big church), or St. Bavokerk, which was definitely worth entering. It’s a beautiful and very unique cathedral built in the 16th century.
Windmill in Haarlem
View in Haarlem
Another view in Haarlem
Sailing houses in Haarlem
Grote Kerk in Haarlem
Canal in Haarlem
Another canal in Haarlem
Zaanse Schans:
Zaanse Schans is apparently a ‘must do’ when in Amsterdam. After reading about its popularity and crowds, we chose to take a day off work and cycle the 15km to this beautiful little town. It is famous for its historic windmills and distinctive green wooden houses. While there, we visited the wooden clog carving workshop, the cheese factory, and the chocolate workshop. We also stopped into the saw-mill: one of the six still-functioning mills in the town.
While this town was indeed quaint and interesting, be aware that the crowds can be very overwhelming. Honestly, our day trip felt more like a visit to Disneyworld than a visit to a historic Dutch village. It’s worth the visit, but it’s also worth timing your visit to avoid peak crowds (i.e. choose a weekday and visit in the evening).
Windmills in Zaanse Schans
Other places worth a visit:
There are a lot more places worth exploring around Amsterdam. Here are a few that were on our list but that we didn’t have time to visit:
In my last blog post I talked about over/under sampling as a
method for addressing unbalanced datasets. Since it is a data transformation, we
have to be cautious when evaluating the performance of classifiers trained with
over/under sampling. In this blog post I will briefly discuss
the correct way to construct a data transformation pipeline so that it does not
influence the evaluation of the classifier.
It is a well-known principle that, when evaluating the classifier
(or a supervised ML model in general), the
model has to be trained only on the training set and then evaluated on previously unseen test data. Doing so
enables us to correctly estimate the generalization power of the model as opposed
to its ability to over-fit the training set.
The same principle should also apply to every data
transformation that takes into account any information from the dataset. These
include:
Normalization
One-hot encoding
Over/under sampling
Separating the train and test sets in preprocessing means, just like in
machine learning itself, that the transformation pipeline has to be fitted only on the training set and then applied to the test set. For the
transformations listed above this means (a short sketch follows the list below):
Mean, standard deviation, min and max are computed only on the training set. That means, for instance, that although we normalize the training set to values between 0 and 1, the test set can contain values outside this interval.
Only the classes present in the training set are taken into account in one-hot encoding. This means that we (usually) group all previously unseen values into the same category.
In the case of over/under sampling, the resampling is performed only on the training set. The model is then trained on it and evaluated on the (unaltered) test set.
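As a sketch of how this looks in practice, the snippet below uses scikit-learn together with the imbalanced-learn pipeline, whose samplers are applied only while fitting and never to the test data; the data, classifier and parameters are made-up placeholders, not the project’s actual setup (a one-hot encoder would slot into the same pipeline in the same way):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.pipeline import Pipeline                  # sampler-aware pipeline
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) > 0.95).astype(int)               # ~5% positive class: unbalanced

# Split first, so the test set stays completely unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                          # min/max learned on the training set only
    ("oversample", RandomOverSampler(random_state=0)),  # applied only during fitting
    ("clf", RandomForestClassifier(random_state=0)),
])

pipe.fit(X_train, y_train)                              # scaling fitted + oversampling done here
print(classification_report(y_test, pipe.predict(X_test)))  # test set left unaltered
```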
In the case of my project, I have to take all of these methods into account.
This will be the last blog post, so Martin, don’t be sad. Before leaving this amazing summer school, let me sum up what I achieved.
Overall, this experience has been amazing. Although there were ups and downs throughout my research, I managed to work through them. To keep it short, I will use one graph to summarise my findings from this summer.
Speedup curve of different pipelines
As you can see from the graph, the scalability of the Catalyst-connected OpenFOAM is very similar regardless of the pipeline. In contrast, the scalability of OpenFOAM with and without co-processing is similar up to 108 cores; beyond that point, the scalability diverges. However, considering the benefits that can be obtained from ParaView Catalyst, the decrease in scalability is not much of a concern.
In addition, I would like to take this opportunity to share my experience of this summer school. I met a lot of amazing people who are passionate about technology and HPC in particular. Also, I would like to thank my site coordinator, Simone Bnà, who was very helpful to me. I have learned a lot of skills, for example ParaView Catalyst, OpenFOAM, Python scripting, Bash scripting, remote visualisation and video editing. This may be one of the best summers I have had in my life. After working with HPC for 2 months, I find myself really interested in it and wanting to know more. Therefore, I will not stop here and I will continue to seek more opportunities to learn about HPC, and I advise anyone who is interested in HPC not to hesitate to apply for next year’s Summer of HPC program.
MY WORKDESK, YOU WILL BE REMEMBERED. BYE CINECA AND THANKS FOR HAVING ME HERE!!!
The goal of my project is to construct a classifier
(learner) that will be able to recognize (and possibly predict) abnormal behavior
of an HPC system. Naturally, abnormal behavior and faults represent a relatively
small part of the overall data collected from the HPC system. Traditional
supervised learning approaches fail on unbalanced datasets. An unbalanced dataset
means that one label (for instance, normal operation of the HPC system) is far more
frequent than all other labels (for instance, different possible failures of the
system). To address this, two general approaches (strategies) are possible:
Supervised learning on the unbalanced dataset (with strategies to address the unbalanced data).
Semi-supervised/unsupervised learning.
On an unbalanced dataset, supervised learning algorithms focus too much on the prevalent class. There are two possible approaches to remedy this:
Training set manipulation
Tuning of the classification algorithm. Some classification algorithms (such as SVMs) allow us to place more importance (via weights) on less frequent classes.
The idea of training set manipulation is to balance the number of positive and negative examples. Two approaches are relevant here (a short code sketch follows this list):
Oversampling: either taking (randomly) the same samples multiple times or producing new samples from existing ones. Oversampling is always performed on the smaller class and is useful when training datasets are smaller.
Undersampling: omitting (randomly) examples from the training set or constructing a new (smaller) set of training instances from the existing one (for example, taking centroids as new instances). Undersampling is always applied to the bigger class and is useful on larger datasets. An additional benefit of undersampling is that it reduces the size of the training set and thus speeds up computation.
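The snippet below illustrates these two strategies on made-up data with the imbalanced-learn library: random repetition and SMOTE (synthesising new samples from existing ones) for oversampling, random removal and ClusterCentroids (centroids as new instances) for undersampling. It is only an illustration, not the project’s code:

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = (rng.random(2000) > 0.9).astype(int)        # ~10% minority class
print("original:", Counter(y))

# Oversampling: repeat minority samples, or synthesise new ones (SMOTE).
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)
X_sm,  y_sm  = SMOTE(random_state=0).fit_resample(X, y)
print("oversampled:", Counter(y_ros), Counter(y_sm))

# Undersampling: drop majority samples, or replace them by cluster centroids.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)
X_cc,  y_cc  = ClusterCentroids(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_rus), Counter(y_cc))
```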
In the case of an extremely unbalanced dataset (where, for instance, one label represents 98% of the dataset), a more useful approach is semi-supervised/unsupervised learning. The idea of this approach is to learn (with, for instance, autoencoders) what is characteristic of baseline behavior and then recognize every behavior that deviates from that baseline. The drawback is, of course, that in its basic implementation this approach only constructs a binary classifier, which tells us only whether something is an anomaly/potential fault, not what kind of fault it is.
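As an illustration of the autoencoder idea, the minimal Keras sketch below trains only on (synthetic) baseline samples and flags anything it reconstructs poorly; the architecture, threshold and data are assumptions for the example, not the project’s actual model:

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
normal_train = rng.normal(0.0, 1.0, size=(5000, 20))    # "baseline" behaviour only
test = np.vstack([rng.normal(0.0, 1.0, size=(95, 20)),  # mostly normal samples...
                  rng.normal(4.0, 1.0, size=(5, 20))])  # ...plus a few anomalies

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(20,)),  # bottleneck
    tf.keras.layers.Dense(20, activation="linear"),                  # reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")

# Train to reconstruct *normal* samples only.
autoencoder.fit(normal_train, normal_train, epochs=20, batch_size=64, verbose=0)

# Flag anything the autoencoder reconstructs poorly as a potential fault.
recon = autoencoder.predict(test)
errors = np.mean((test - recon) ** 2, axis=1)
threshold = np.percentile(
    np.mean((normal_train - autoencoder.predict(normal_train)) ** 2, axis=1), 99)
print("anomalies:", np.where(errors > threshold)[0])
```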
Research is an iterative process, and when writing code we are often carving out a small area of functionality. Software version control is an essential part of this process, since we may want to explore possible solutions and revert back if the efforts did not bear fruit. It can also be key when we work collaboratively, as team members have a flexible but standardised framework for combining work on the same system.
However, we have all encountered the old adage of “it works on my machine”, and to our chagrin nowhere else. Often that machine was our own, 6 months ago. Part of the solution to this is containerisation, described in one of my earlier posts. Containerisation has its place for running full experiments; however, for the development process of your systems, we need another piece of the puzzle, specifically related to our version control system.
Enter continuous integration
A mainstay of recommendations made by €150/hr devops consultants, continuous integration does have its place in successful projects, and is not daunting if you start small. It won’t solve all of your problems: it won’t write your code for you, or help you achieve bigger lifts at the gym. So what will it do?
In a continuous integration system (CI), there is a central server, which may be third party controlled, or in a small container on your existing infrastructure. This server has access to your version control repository, and will see new commits that have been pushed. You define a small script that will determine if your proposed version of the codebase is accepted. This script is checked into your repository too, so it can evolve over time as your needs change.
The script can be as simple as checking that the code compiles properly, or it can use more rigorous techniques, such as running your unit tests. The process of checking for new commits and running the script is automated. The CI process is distinct from doing these steps on your development machine because the CI starts with a clean slate for each job, avoiding pitfalls such as an inconsistent version-control state, magic dependencies, or other problems.
Extensions include setting up your CI to email the responsible team member for their build failing (because no one likes to nag), or running the CI script in different environments.
Adding a small file to your project that describes how to compile and test your code is enough.
Here is an example. Assuming you are using git, with GitHub hosting your project, here we use the Travis CI system. In the root of your project directory, add a file called .travis.yml. Below, you can see a Travis script for a C++ project, which uses an Ubuntu environment, installs some packages, and then tries to build the code:
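The original snippet is not reproduced here, but a minimal .travis.yml along those lines might look as follows; the apt packages and build commands are placeholders for whatever your project actually needs:

```yaml
language: cpp
dist: bionic          # Ubuntu 18.04 build environment
compiler: gcc

addons:
  apt:
    packages:         # placeholder dependencies; list whatever your build needs
      - cmake
      - libboost-all-dev

script:               # the job fails if any of these commands fails
  - mkdir build && cd build
  - cmake ..
  - make
```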
The CI system, in this case Travis, has a web dashboard that tells you the status of the CI jobs it is running. We can see the logs from each job to understand how our build failed, and perform a number of other useful tasks.
You can see here that building our project failed. No worries, we can just try again.
If you are using another version control hosting system, then there are other options (e.g. GitLab CI, Jenkins, etc. Talk to your friendly local sysadmin). Annoyingly, the syntax can be different for each solution, but at a conceptual level things are the same.
We did it, future generations will thank us
This is a short post trying to evangelise the use of continuous integration in research codebases. It’s a small amount of effort to get set up, but it can provide an immense amount of value, both during development and later in a project’s life-cycle. If you’ve got a grad student or a recent new hire who looks too comfortable, get them on it!