Gromacs Profiling using APS

Intel® VTune™ Profiler Application Performance Snapshot (APS) provides a quick view into different aspects of a compute-intensive application's performance, such as MPI and OpenMP* usage, CPU utilization, memory access efficiency, vectorization, I/O, and memory footprint. Application Performance Snapshot highlights the key optimization areas and suggests specialized tools for tuning particular performance aspects, such as Intel VTune Profiler and Intel® Advisor. The tool is designed for large MPI workloads and can help analyze scalability issues.

https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-application-performance-snapshot/top.html

Loading modules for APS and Intel MPI

module load intel/2020.1.217
module load mkl/2020.1.217
source $INTEL_DIR/vtune_profiler/apsvars.sh
module load gcc/8.4.0
module load cmake/3.13.4
module load impi/2019.7.217
module load ucx/1.8.0
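As a quick sanity check (not part of the original recipe), confirm that the APS collector and the Intel MPI launcher resolve to the freshly loaded modules:

which aps
mpirun --version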

Running Gromacs with APS

export MPS_STAT_LEVEL=5

mpirun -np 256 -genv USE_UCX=1 -genv UCX_NET_DEVICES mlx5_0:1 \
       -genv I_MPI_FABRICS shm:ofi -genv FI_PROVIDER=mlx \
       aps -c=mpi -r $PWD/aps-impi \
       mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout
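If the cluster uses a batch scheduler, the same profiled run can be submitted as a job instead of launched interactively. The sketch below assumes Slurm with 8 nodes of 32 ranks each (matching the 256 ranks and 32 ranks per node reported later); the partition name is a placeholder for your site.

#!/bin/bash
#SBATCH -N 8                       # 8 nodes x 32 ranks/node = 256 MPI ranks
#SBATCH --ntasks-per-node=32
#SBATCH -p compute                 # placeholder partition name

module load intel/2020.1.217 mkl/2020.1.217 gcc/8.4.0 cmake/3.13.4 impi/2019.7.217 ucx/1.8.0
source $INTEL_DIR/vtune_profiler/apsvars.sh

export MPS_STAT_LEVEL=5
mpirun -np 256 -genv USE_UCX=1 -genv UCX_NET_DEVICES mlx5_0:1 \
       -genv I_MPI_FABRICS shm:ofi -genv FI_PROVIDER=mlx \
       aps -c=mpi -r $PWD/aps-impi \
       mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout

Submit the script with sbatch and generate the report from a login node once the job completes.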

Generating profile reports

% aps --report=./aps-impi
Loading 100.00%
| Summary information
|--------------------------------------------------------------------
  Application                   : mdrun_mpi
  Report creation date          : 2020-06-09 11:35:16
  Number of ranks               : 256
  Ranks per node                : 32
  OpenMP threads number per rank: 1
  Used statistics               : ./aps-impi-HDR100--0hcoll-sharp--20200609_113509-71475
|
| Your application is MPI bound.
| This may be caused by high busy wait time inside the library (imbalance),
| non-optimal communication schema or MPI library settings. Use MPI profiling
| tools like Intel(R) Trace Analyzer and Collector to explore performance
| bottlenecks.
|
  Elapsed time:                              114.99 sec
  MPI Time:                                   18.70 sec          16.41%
| Your application is MPI bound. This may be caused by high busy wait time
| inside the library (imbalance), non-optimal communication schema or MPI
| library settings. Explore the MPI Imbalance metric if it is available or use
| MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore
| possible performance bottlenecks.
    MPI Imbalance:                             9.13 sec           8.01%
| The application workload is not well balanced between MPI ranks.For more
| details about the MPI communication scheme use Intel(R) Trace Analyzer and
| Collector available as part of Intel(R) Parallel Studio XE Cluster Edition.
    Top 5 MPI functions (avg time):
        Sendrecv                               5.39 sec ( 4.69 %)
        Waitall                                5.27 sec ( 4.58 %)
        Bcast                                  3.40 sec ( 2.96 %)
        Init_thread                            2.16 sec ( 1.87 %)
        Alltoall                               0.99 sec ( 0.86 %)
  Disk I/O Bound:                              0.14 sec ( 0.13 %)
    Data read:                               166.1 MB
    Data written:                            175.9 KB
  Memory Footprint:
  Resident:
    Per node:
      Peak resident set size     :         11217.82 MB (node helios025.hpcadvisorycouncil.com)
      Average resident set size  :         11106.19 MB
    Per rank:
      Peak resident set size     :           435.46 MB (rank 0)
      Average resident set size  :           347.07 MB
  Virtual:
    Per node:
      Peak memory consumption    :        151387.64 MB (node helios025.hpcadvisorycouncil.com)
      Average memory consumption :        151189.63 MB
    Per rank:
      Peak memory consumption    :          4895.98 MB (rank 0)
      Average memory consumption :          4724.68 MB
 Graphical representation of this data is available in the HTML report: aps_report_20200609_114024.html
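For a deeper look at the MPI time shown above, the aps-report utility (shipped with APS alongside the aps collector) can break the statistics down by function, message size, and rank-to-rank transfer. The flags below are a sketch based on the APS MPI documentation and may differ between releases; run aps-report --help to confirm the options available in your installation.

aps-report -f ./aps-impi    # MPI time per function
aps-report -m ./aps-impi    # message size summary
aps-report -t ./aps-impi    # data transfers per rank-to-rank communication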

Sample HTML report