Coding Challenge (ISC20)
This year, we will have a coding exercise challenge!
High Level Overview
For this exercise, you will write a part of the code that simulates particles moving in a predefined manner within a two-dimensional space (bounding box). The bounding box is divided into a two dimensional grid of cells. The particles will be moving around in the bounding box with <x,y> double precision floating point coordinates and belong to one of the cells based on their coordinates. In each cell, the particles are stored as a vector. The size of each cell is 1 X 1, making the entire bounding box have a size of n * n, where n is the number of cells per dimension. For your simulation, we will have a simulation size of 35 * 35, with 35 cells along each dimension. The coordinates of the particles in the simulation box will be between 0 and 35 along each dimension (0 <= x, y <= 35).
The entire simulation is written using the Charm++ parallel programming model and uses the power of the adaptive Charm++ runtime system to parallelize, automatically overlap computation and communication and balance imbalanced load. Your programming assignment is to write the serial (non-parallel) part of the program that moves each particle and then based on the movement of each particle, sends it to their adjacent cells. Your code will be called from the already written portions that parallelize the program and move the simulation forward. In addition to this, if you wish to answer the bonus question, you would also be writing some code to find out the cell with maximum and minimum number of particles and the values of the maximum and minimum particles across the bounding box using Charm++ custom reduction functions.
Expected output: Correctly send the particles to their right homes and ensure that no particle is lost in the simulation of the entire system. The bounding box is wrapped around i.e. the particles that leave a corner or edge loop back into the opposite corner or edge (explained clearly below). You can verify that no particle is being lost by making sure that the sum of all the particles, which is printed periodically, remains constant across iterations. Additionally, for answering the bonus question, determine the cell with the maximum and minimum particles and their values at the end of simulation. On submission, to ensure correctness, all of your output (particle positions, min value, min cell indices, max value, max cell indices) will be compared against our output that we have obtained from previous runs. We will use the total time taken by the simulation (also printed at the end of the simulation) to evaluate your team.
Note: There might be multiple particles having the same x and y coordinates. You do not need to handle this case separately; it is a valid case assumption.
Tasks
Task 1: Compile charm++ and run simple programs
You can select one among several machine layers (or networking layers) to serve as your networking backend.
Consider using the flag ‘--with-production’ for the best optimizations
Build with the smp mode to take advantage of shared memory optimizations ( ./build charm++ <machine layer-arch> smp --with-production)
After building successfully, try a simple program like tests/charm++/simplearrayhello to ensure that programs run successfully.
Task 2: Complete the first section of the programming assignment
Untar the programming assignment tar file and then understand the coding assignment. After that, write code for the Cell::updateParticles function in the src/exercise.cpp file. You will be required to iterate through the particles vector and move each particle. After moving them, you will determine the particles that do not belong in my cell and then send them to the appropriate neighbor cells using helper methods. Run the simplistic version of the program using ‘make test’.
Note: If you run ‘make test' before writing the code in Cell::updateParticles, your application will hang since it calls Cell::updateParticles (which is required to be filled in by you).
Task 3: Debugging, testing and profiling
For debugging, you can compile your program with -g (and optionally -O0) and debug the program using gdb . Ensure that the program runs successfully with one rank ( or process) before debugging the multiple process case. While using gdb in the multiple process case, you can use an xterm to launch multiple gdb windows.
For testing, there is a C++ method in the simulation code named Cell::verifyCorrectness that is called after the simulation. This method will read the data from the pre-computed output (which is known to be correct) and will compare it against your results. (NOTE that before version 3, there was a script called scripts/evaluateOutput.sh that was provided. However, with the verification integrated into the simulation, from version 3 onwards that script is not provided and is considered obsolete).
For profiling, you can use the tool called ‘projections’ to analyze the performance of the charm++ program. The usage profile view is especially useful to view the load on each processor.
Task 4: Bonus Question
For your bonus question, learn to use Charm++ custom reduction functions and add code to calculate the cell with the maximum number of particles and the minimum number of particles and their respective values. This will execute at the end of the simulation. On running the program with the bonus question, make sure to set BONUS_QUESTION to 1 in your makefile.
Task 5: Benchmarking and Output submission
After you’ve verified the correctness of your code using the small target (make test), you can now run the actual full-sized simulation using the 'make testbench' target. Optionally, you can use the `+balancer <load-balancer>` option in your run command to apply a load balancer to balance the imbalanced load.
Note: Ensure that LIVEVIZ_RUN is set to 0 during your benchmarking run to avoid performance loss because of any visualization component.
This year, since we are conducting the competition remotely, we will reproduce your results in order to validate your submission. After the completion of your run, when you are ready for submission, create a tar file (named cc_<team-name>.tar.gz) with the following files and folders:
A folder called cc_src with all the particle simulation source code files that were modified by you (including src/exercise.cpp).
A folder called cc_output with all the final simulation output files (should include all the sim_output_* files including sim_output_main). Also include the program output that is displayed to
stdout
in a file called sim_output_stdout.A file called cc_jobscript.sh. This is essentially your job script file that you used to run the final ‘testbench’ target of the particle simulation.
[OPTIONAL] A folder called cc_charm_src with the charm directory source code changes that were made by you. If you didn’t make any changes to the charm directory source code, you do not have to submit this directory.
A file called cc_instructions.txt with the following details:
Charm version: You can obtain this using the ‘VERSION’ file inside your charm directory.
Command used to build charm++ (Include any other commands used to build dependent libraries)
[OPTIONAL] A folder called cc_additionals with any other scripts or files that you think are relevant for reproducing your results. If you don’t have any other additional files, you do not have to submit this directory.
Task 6: Visualization and video
Install the Charm++ visualization tool called liveViz on your laptop. Build a netlrts version of Charm++ (on your laptop) and compile the particle simulation program with LIVEVIZ_RUN set to 1. Run the particle simulation with the ‘make testviz’ target and in a separate window, connect the liveViz client to the running program to visualize the particle simulation. Capture your screen and generate a video showing the movement of particles in the 2D space.
Note: You have to use the netlrts machine layer for this part of the exercise, since this functionality is not supported on other machine layers. It is highly recommended to do all of this task on a smaller machine like a laptop/desktop.
Task 7: Let others know
Add a team name-tag or a team photo in small into the video (to mark it) and post to your team’s twitter account if you have, or linkedIn account with the hashtags #ISC20 #ISC20_SCC @Charmplusplus. If you don’t use social media, you can submit the file using a flash drive.
Details
Charm++ Introduction
Charm++ is a C++-based parallel programming system, founded on the migratable-objects programming model, and supported by a novel and powerful adaptive runtime system. It supports both irregular as well as regular applications, and can be used to specify task-parallelism as well as data parallelism in a single application. It automates dynamic load balancing for task-parallel as well as data-parallel applications, via separate suites of load-balancing strategies. Via its message-driven execution model, it supports automatic latency tolerance, modularity, and parallel composition. Charm++ also supports automatic checkpoint/restart, as well as fault tolerance based on distributed checkpoints.
Charm++ is a production-quality parallel programming system used by multiple applications in science and engineering on supercomputers as well as smaller clusters around the world. Currently the parallel platforms supported by Charm++ are the IBM BlueGene/Q and OpenPOWER systems, Cray XE, XK, and XC systems, Omni-Path and Infiniband clusters, single workstations and networks of workstations (including x86 (running Linux, Windows, MacOS)), etc. The communication protocols and infrastructures supported by Charm++ are UDP, MPI, OFI, UCX, Infiniband, uGNI, and PAMI. Charm++ programs can run without changing the source on all these platforms. Notable Charm++ applications that are regularly run on leading supercomputers across the world are NAMD (molecular dynamics), ChaNGa (n-body cosmological simulations), Episimdemics (contagion in social networks) and OpenAtom (Ab initio Molecular Dynamics) among many others.
Particle Simulation Introduction
The particle simulation is written using the Charm++ parallel programming model and it simulates a set of particles moving in a 2-dimensional space within a bounding box. The bounding box is divided into a two dimensional grid of cells. The particles moving around in the bounding box have (x,y) floating point coordinates and belong to one of the cells based on their coordinates. In each cell, the particles are stored as vector. The size of each cell is 1 X 1, making it n X n for a simulation involving n cells in each dimension. In the code, the variable 'numCellsPerDim' represents the number of cells along each dimension of the bounding box.
Each cell also has many particles that move around in a specific manner and in that process could go to the 8 neighboring cells, which are Top-Left, Top, Top-Right, Left, Right, Bottom-Left, Bottom, and Bottom-Right.
The bounding box (i.e. the simulation space) is wrapped around, i.e. all the edge cells and corner cells loop
back along each edge and corner.
For example, In a 5x5 grid, cell (4,2) will have the following neighbors:
Top Left Neighbor : (3,1)
Top Neighbor : (3,2)
Top Right Neighbor : (3,3)
Left Neighbor : (4,1)
Right Neighbor : (4,3)
Bottom Left Neighbor : (0,1)
Bottom Neighbor : (0,2)
Bottom Right Neighbor : (0,3)
Similarly, cell (0,4) will have the following neighbors:
Top Left Neighbor : (4,3)
Top Neighbor : (4,4)
Top Right Neighbor : (4,0)
Left Neighbor : (0,3)
Right Neighbor : (0,0)
Bottom Left Neighbor : (1,3)
Bottom Neighbor : (1,4)
Bottom Right Neighbor : (1,0)
Input Parameters
The simulation takes the following input parameters
Grid Size (Creates a grid of cells of Grid Size X Grid Size, same as the 'n' parameter explained above)
Particles per cell seed value
Number of Iterations
Green Particles (Lower Triangular Half) distribution ratio
Blue Particles (Upper Triangular Half) distribution ratio
Red Particles (Diagonal) distribution ratio
Red Particles (Central Box) distribution ratio
Velocity Reduction Factor
Log Output
Load Balancing Frequency
The simulation consists of three types of particles: Green, Blue and Red. Green particles are added to the lower triangular half of the bounding box. Blue particles are added to the upper triangular half of the bounding box. Red particles are added to the diagonal and a central box in the middle of the bounding box. The particle color plays a role in affecting the speed of the particle ( Red > Blue > Green).
The distribution ratios are used to populate the bounding box with the particles before the beginning of simulation. The velocity reduction factor is used to slow the speed of particles by a constant factor.
The log output (accepts 'yes' or 'no') determines if the output of the program should be written to files or not. Setting it to 'yes' will cause the simulation output to be logged to files inside the output
directory. Setting it to 'no' will cause the program to not write the output to files. This is an input parameter that can be used by the users for debugging. It is important to note that the time taken for writing the output to files, is not counted in the total simulation time and hence, will not affect your score. (Also NOTE that this variable was earlier called outputprompt. However, with the change in the verification process i.e. with output verification integrated inside the program, the previous variable was no longer required and hence modified to logoutput).
The load balancing frequency is the number of iterations for which load balancing occurs in the runtime system. During load balancing, parallel objects are migrated from the overloaded processors to the underloaded processes. For this assignment, it is a tunable parameter to get better performance from the application. Note that for load balancing frequency to play a part in the application, '+balancer <load-balancer-name>' should be passed to the application.
In the particle simulation, the simulation begins after the initial population of cells with particles based on the particle distribution ratios. Particles move in a pre-defined manner (when you call the perturb function). Based on the particle's new position, it is your responsibility to determine if the particle is still in the same cell or if it has to be transferred to one of the adjacent cells. In the case of the particle not belonging to the same cell, you need to send the particle to the new cell, where it belongs. For each iteration, every particle in each cell is moved (perturbed) and then, the cell sends 8 messages to 8 of its neighboring cells (TL, T, TR, L, R, BL, B, BR, as explained above). After sending the 8 messages, each cell waits for 8 incoming messages from its neighbors, consisting of particles that belong to it. These new particles that belong to the cell are added to the existing particles of the cell. Note that, it is your task to only move and send the particles (the 8 outgoing messages) to 8 neighboring cells. The already written skeleton code takes care of waiting for the 8 incoming messages and adding them to the cell. This process of moving the particle, sending 8 outgoing messages and waiting for 8 incoming messages is repeated for every cell, for every iteration of the simulation. The simulation completes when the ongoing count of iterations reaches the total number of iterations.
Charm++ Setup
1. Download the latest Charm++ release
$ wget https://github.com/UIUC-PPL/charm/archive/v6.10.1.tar.gz
...
$ tar xzf v6.10.1.tar.gz
2. Build charm++ using a suitable machine layer (or networking layer)
On a commodity ethernet cluster, use the netlrts build. For proprietary networks, use the appropriate machine layer (Eg - gni for Cray gemini interconnect, pami for IBM pami interconnect, ucx for Infiniband, ofi for Intel Omnipath). Since charm has a MPI backend, optionally, you can also build Charm++ with an MPI backend and compare its performance against the native machine layer.
Consider using the '--with-production' flag for the best optimizations.
$ ./build charm++ <mach-layer>-linux-x86_64 --with-production -j4
Build with the smp mode to take advantage of shared memory optimizations:
$ ./build charm++ <mach-layer>-linux-x86_64 smp -j4
Refer https://charm.readthedocs.io/en/latest/charm++/manual.html#installing-charm for more information
on building Charm++
3. After building successfully, try a simple program like tests/charm++/simplearrayhello to ensure that
programs run successfully.
Coding Assignment (Particle Simulation) Setup
1. Untar the charm-exercise.tar.gz that will be provided to you
2. cd into charm-exercise (the recently tar-ed folder) and set CHARM_HOME in the Makefile to
point to your charm installation directory.
3. Run 'make test' to ensure that all files compile and the executable is launched. The program
will hang because Cell::updateParticles(int iter) is empty currently (and requires you to code
the remaining parts).
4. Run make testbench to test it on a larger size, Check the Makefile.
Coding Assignment Task
Your exercise is to add code for two functions:
The main part of the exercise is to add code in Cell::updateParticles.
The bonus part of the exercise is to add code in Cell::contributeToReduction, Main::receiveMinMaxReductionData, and global function:calculateMaxMin. For this part, you would need to understand how custom reduction functions are written in Charm++ in order to write the reduction function for calculating the min value, the min cell indices, the max value, and the max cell indices.
Main task:
The Cell::updateParticles function is called for each cell in the grid. It is called for each iteration of the
simulation to perform the following functionalities:
Move (or Perturb) the particles
Determine the cell location of the newly moved particles i.e. determine if the particle still belongs to me or does it belong to one of my 8 neighboring cells.
Make sure that particles that go outside the bounding box are wrapped back.
Send particles that belong to my neighboring cells using the Cell::sendParticles function.
Your assignment is to add code into Cell::updateParticles in the file src/exercise.cpp to add the functionalities described above. It is important to note that sendParticles is called 8 times to send outgoing messages to the 8 adjacent cells or neighbors, since they are waiting on messages from all their neighbors. Failing to send messages to the neighbors will cause a hang.
Bonus exercise
Understand custom reductions from the manual, refer to https://charm.readthedocs.io/en/latest/charm++/manual.html#defining-a-new-reduction-type.
The Cell::contributeToReduction function is called for each cell in the grid. It is called at the end of the simulation i.e. after all iterations, to contribute to the custom reduction operation (as specified in the global reduction function calculateMaxMin). calculateMaxMin is the custom reduction function that gets reduction msgs from the contributors and needs to perform the reduction operation on the received data. Finally, on completion of the reduction operation, the callback pointing to Main::receiveMinMaxReductionData will be invoked with the final reduction data.
Your coding assignment bonus exercise is to
Add code in Cell::contributeToReduction in the file src/exercise.cpp to declare the reduction data and assign the declared data with the different values being reduced (min value, min x index, min y index, max value, max x index and max y index). Following the data declaration, declare a callback to Main::receiveMinMaxReductionData and call contribute passing the size, data, reduction type i.e minMaxType and the declared callback. Note that the custom reduction type, 'minMaxType' has already been declared to use calculateMaxMin.
Add code in calculateMinMax to implement the reduction operations for the data being received.
Add code in Main::receiveMinMaxReductionData to initialize the variables with the final reduction values. These variables are maxParticles, maxCellX, maxCellY, minParticles, minCellX and minCellY. Note that, Main::receiveMinMaxReductionData is called after the reduction operation is complete.
Testing your program
On completing the main task (and optionally the bonus task), run the small target 'make test' and ensure that the value of the total particles in the simulation is not changing across iterations. The unchanged value indicates that no particles are lost in the simulation. A changing value indicates a bug in your code.
After you have verified the unchanging value of the total number of particles, the correctness verification is performed inside the Cell::verifyCorrectness method after the simulation. (Note that the previous method of running src/evaluteOutput.sh is now obsolete).
Note that the previous script src/evaluateOutput.sh was comparing coordinates using a naive diff between the simulation output and the pre-computed output. The current method is an improved method that compares the coordinate values using fabs(precomputeParticles[i].x - reorgParticles[i].x) < 1e-6
. In terms of precision, the previous method was printing doubles with 6 decimal places and comparing for an exact match. With the new method, although the printing of pre-computed output is done with 14 decimal places, only 6 decimal places are accounted for precision in the comparison. This eliminates the effect of rounding which was earlier introduced and hence uses a more robust method for floating-point number comparison.
Additionally, previously, the comparison was done on a cell by cell basis i.e. sim_output_0_0
of comparison output (or precomputed output) was compared for an exact match with sim_output_0_0
of your output. However, because of the behavior in floating point numbers, since the particle itself can move to another box, this approach of validation was not entirely correct. The current approach uses a global sort of particles (using global particle id) after the simulation and compares particles at a global level irrespective of which box they may belong to. This phase is called reorganization of particles and is not included in the timed part of your simulation. It is entirely done for output verification.
Note that with the new approach, the output values of the bonus question cannot be validated. For the bonus question, points will be awarded entirely based on your implementation.
Using benchmarking output for validation:
You can download the benchmarking output (bench_compareOutput.tar.gz) from Box using this link. After downloading, you will need to untar the file in scripts/compareOutput/
.
This should create a folder bench inside /scripts/compareOutput
. This will be used by your program during the runs to compare your benchmarking simulation output against. (testbench output).
Visualizing your run
To visualize your program, you would need to build a netlrts build, either with or without smp support, since that is the only build that can support live visualization. Additionally, you would also need to install the visualization client liveViz to connect to your run. On completing your exercise, in order to visualize the particle simulation, perform the following steps on your local machine (like your desktop or laptop). For this exercise, you wouldn’t your cluster.
Download the latest Charm++ release
2. Build charm++ using the netlrts machine layer
3. Compile the liveViz library
4. Install the liveViz visualization client
After this, you should be able to see a tool called ‘liveViz' in ccs_tools/bin
. If required, you can add the path to ‘liveViz’ to your PATH environment variable.
5. The visualization code is already added to your particle simulation code and it only needs to be recompiled after setting the CHARM_HOME variable to the recently built charm build and the Makefile variable LIVEVIZ_RUN set to 1.
6. As this is running, in a separate terminal tab, launch the liveViz Client connecting to the port 1234.
With this, you should be able to see the visualization of moving particles.
Some Caveats to Understand
For the coding challenge, you will only need to add code to
exercise.cpp
. However, if you wish, you can change the other source code files as long your result matches the pre-computed output.Important- While deciding to send particles to their correct homes (the 8 neighboring cells), ensure that a particle is sent over only if its coordinates have crossed the boundary. Do not send particles that are still on the boundary of the current cell.
Important - Make sure that particles that go outside the bounding box are wrapped back as explained previously.
Use the totalParticles displayed in the output to validate that no particle in the simulation is being lost. The value of totalParticles displayed should not change across the periodically displayed iterations.
If you’re attempting the bonus question, make sure to go into the Makefile and set the BONUS_QUESTION variable to 1. Since this is a compiler macro, you will have to run
make clean all
to recompile your files.When you’re using the visualization feature and want to visualize using liveViz, make sure to go into the Makefile and set the LIVEVIZ_RUN variable to 1. Since this is also a compiler macro, you will have to run
make clean all
to recompile your files.Among the input parameters listed in this section, for your final benchmarking task, you can only modify logOutput, load balancing frequency, and the load balancer that you pass using
+balancer <load-balancer-name>
.If you are running your program in the SMP mode, which takes advantage of shared memory communication for threads of a process, note that in this mode, for each process, you will be using one of the threads (and hence cores) of the process for a communication thread, whose exclusive job is to send and receive messages. This means that you will lose that core as your compute resource. In the SMP mode, it is required to specify
++ppn <ppn-val>
to specify the number of threads for the process. Additionally, you can specify cpu affinity to bind the threads to cores using+pemap <pe-map>
and+commap <comm-map>
. These are described here.Note from Nitin Bhat:
It was brought to my notice by a few participants on the ISC20-SCC slack (#isc20-scc) that the implementation and version of the math librarylibm
could impact the final output of the particle simulation. For that reason, in order to ensure that yourlibm
versions don’t lead to different results when compared with the pre-computed test results, I decided to provide the details about thelibm
version that I’ll be using for testing. I will be using the standardlibm
that is shipped as a part of GNU libc 2.12. This is also the default library in your Linux environment on the NSCC cluster.Note that you modify the number of processors in the testbench output. The current value is +p96. But, you’re free to play around with this value.
Parts of the code that cannot be modified:
Do not change/remove the timer calls since that is used to evaluate the total execution time.
Do not change/remove asserts in the code since they are basic sanity checks for the simulation to proceed as required.
This includes the code inside
checkParticleBelongsToMe
and the places wherecheckParticleBelongsToMe
is called.This also include the call to
verifyCorrectness
insiderecvParticlesPostSimulation
Ensure that the simulation is performed such that particles have coordinates that are spatially belonging to the cell. (The calls to
checkParticleBelongsToMe
will enforce this behavior).Do not change/remove the calls to
reduceTotalAndOutbound
since they will be performed during the simulation in order to print out the total particles and the total outgoing particles in the simulation.
Download Files
Click here to download the Coding Challenge Files
Good Luck Teams!!
References
Charm++ Manual: https://charm.readthedocs.io/en/latest/index.html
Custom Reduction: https://charm.readthedocs.io/en/latest/charm++/manual.html#defining-a-new-reduction-type
Projections Manual: https://charm.readthedocs.io/en/latest/projections/manual.html
LiveViz: https://charm.readthedocs.io/en/latest/libraries/manual.html#liveviz-library