Gromacs Profiling using APS

Intel® VTune™ Profiler Application Performance Snapshot (APS) provides a quick view into different aspects of a compute-intensive application's performance, such as MPI and OpenMP* usage, CPU utilization, memory access efficiency, vectorization, I/O, and memory footprint. Application Performance Snapshot highlights the key optimization areas and suggests specialized tools for tuning particular performance aspects, such as Intel VTune Profiler and Intel® Advisor. The tool is designed for large MPI workloads and can help analyze scalability issues.

https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-application-performance-snapshot/top.html

Loading modules for APS and Intel MPI

module load intel/2020.1.217
module load mkl/2020.1.217
source $INTEL_DIR/vtune_profiler/apsvars.sh
module load gcc/8.4.0
module load cmake/3.13.4
module load impi/2019.7.217
module load ucx/1.8.0
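As a quick sanity check (not part of the original recipe), confirm that the APS collector and the Intel MPI launcher resolve to the freshly loaded modules:

which aps
mpirun --version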

Running Gromacs with APS

export MPS_STAT_LEVEL=5

mpirun -np 256 -genv USE_UCX=1 -genv UCX_NET_DEVICES mlx5_0:1 \
       -genv I_MPI_FABRICS shm:ofi -genv FI_PROVIDER=mlx \
       aps -c=mpi -r $PWD/aps-impi \
       mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout
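If the cluster uses a batch scheduler, the same profiled run can be submitted as a job instead of launched interactively. The sketch below assumes Slurm with 8 nodes of 32 ranks each (matching the 256 ranks and 32 ranks per node reported later); the partition name is a placeholder for your site.

#!/bin/bash
#SBATCH -N 8                       # 8 nodes x 32 ranks/node = 256 MPI ranks
#SBATCH --ntasks-per-node=32
#SBATCH -p compute                 # placeholder partition name

module load intel/2020.1.217 mkl/2020.1.217 gcc/8.4.0 cmake/3.13.4 impi/2019.7.217 ucx/1.8.0
source $INTEL_DIR/vtune_profiler/apsvars.sh

export MPS_STAT_LEVEL=5
mpirun -np 256 -genv USE_UCX=1 -genv UCX_NET_DEVICES mlx5_0:1 \
       -genv I_MPI_FABRICS shm:ofi -genv FI_PROVIDER=mlx \
       aps -c=mpi -r $PWD/aps-impi \
       mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout

Submit the script with sbatch and generate the report from a login node once the job completes.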

Generating profile reports

% aps --report=./aps-impi
Loading 100.00%
| Summary information
|--------------------------------------------------------------------
  Application                   : mdrun_mpi
  Report creation date          : 2020-06-09 11:35:16
  Number of ranks               : 256
  Ranks per node                : 32
  OpenMP threads number per rank: 1
  Used statistics               : ./aps-impi-HDR100--0hcoll-sharp--20200609_113509-71475
|
| Your application is MPI bound.
| This may be caused by high busy wait time inside the library (imbalance),
| non-optimal communication schema or MPI library settings. Use MPI profiling
| tools like Intel(R) Trace Analyzer and Collector to explore performance
| bottlenecks.
|
  Elapsed time:                              114.99 sec
  MPI Time:                                   18.70 sec          16.41%
| Your application is MPI bound. This may be caused by high busy wait time
| inside the library (imbalance), non-optimal communication schema or MPI
| library settings. Explore the MPI Imbalance metric if it is available or use
| MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore
| possible performance bottlenecks.
    MPI Imbalance:                             9.13 sec           8.01%
| The application workload is not well balanced between MPI ranks.For more
| details about the MPI communication scheme use Intel(R) Trace Analyzer and
| Collector available as part of Intel(R) Parallel Studio XE Cluster Edition.
    Top 5 MPI functions (avg time):
        Sendrecv                               5.39 sec ( 4.69 %)
        Waitall                                5.27 sec ( 4.58 %)
        Bcast                                  3.40 sec ( 2.96 %)
        Init_thread                            2.16 sec ( 1.87 %)
        Alltoall                               0.99 sec ( 0.86 %)
  Disk I/O Bound:                              0.14 sec ( 0.13 %)
    Data read:                               166.1 MB
    Data written:                            175.9 KB
  Memory Footprint:
  Resident:
    Per node:
      Peak resident set size     :         11217.82 MB (node helios025.hpcadvisorycouncil.com)
      Average resident set size  :         11106.19 MB
    Per rank:
      Peak resident set size     :           435.46 MB (rank 0)
      Average resident set size  :           347.07 MB
  Virtual:
    Per node:
      Peak memory consumption    :        151387.64 MB (node helios025.hpcadvisorycouncil.com)
      Average memory consumption :        151189.63 MB
    Per rank:
      Peak memory consumption    :          4895.98 MB (rank 0)
      Average memory consumption :          4724.68 MB
 Graphical representation of this data is available in the HTML report: aps_report_20200609_114024.html
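For a deeper look at the MPI time shown above, the aps-report utility (shipped with APS alongside the aps collector) can break the statistics down by function, message size, and rank-to-rank transfer. The flags below are a sketch based on the APS MPI documentation and may differ between releases; run aps-report --help to confirm the options available in your installation.

aps-report -f ./aps-impi    # MPI time per function
aps-report -m ./aps-impi    # message size summary
aps-report -t ./aps-impi    # data transfers per rank-to-rank communication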

Sample HTML report