Gromacs Profiling using APS
Intel® VTune™ Profiler Application Performance Snapshot (APS) provides a quick view into different aspects of a compute-intensive application's performance, such as MPI and OpenMP* usage, CPU utilization, memory access efficiency, vectorization, I/O, and memory footprint. Application Performance Snapshot highlights key optimization areas and suggests specialized tools for tuning particular performance aspects, such as Intel VTune Profiler and Intel® Advisor. The tool is designed for large MPI workloads and can help analyze different scalability issues.
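In its simplest form APS is used in two steps: launch the application under aps to collect statistics, then run aps --report on the result directory to summarize them. A minimal sketch with placeholder names (the application, rank count, and result directory below are illustrative only):
# Step 1: collect statistics while the MPI application runs
mpirun -np <nranks> aps -r ./aps_result ./my_app <args>
# Step 2: generate the text summary and the companion HTML report
aps --report=./aps_result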
Loading modules for APS and Intel MPI
module load intel/2020.1.217
module load mkl/2020.1.217
source $INTEL_DIR/vtune_profiler/apsvars.sh
module load gcc/8.4.0
module load cmake/3.13.4
module load impi/2019.7.217
module load ucx/1.8.0
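Before launching the run, it can be worth checking that aps and the Intel MPI launcher are now picked up from the modules loaded above (the exact output is system dependent):
which aps      # should resolve inside the VTune Profiler installation sourced above
mpirun -V      # should report the Intel MPI Library 2019 Update 7 runtime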
Running Gromacs with APS
export MPS_STAT_LEVEL=5
mpirun -np 256 -genv USE_UCX=1 -genv UCX_NET_DEVICES mlx5_0:1 -genv I_MPI_FABRICS shm:ofi \
-genv FI_PROVIDER=mlx aps -c=mpi -r $PWD/aps-impi \
mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout
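On a production cluster the same launch line is usually submitted through the batch scheduler. The sketch below is one possible Slurm wrapper, assuming 8 nodes with 32 ranks per node (matching the 256 ranks and 32 ranks per node reported later); the scheduler directives are hypothetical and must be adapted to your system:
#!/bin/bash
#SBATCH -N 8                     # hypothetical: 8 nodes x 32 ranks/node = 256 MPI ranks
#SBATCH --ntasks-per-node=32
#SBATCH -t 01:00:00              # hypothetical wall-time limit

# same environment as in the "Loading modules for APS and Intel MPI" step above
module load intel/2020.1.217 mkl/2020.1.217 gcc/8.4.0 cmake/3.13.4 impi/2019.7.217 ucx/1.8.0
source $INTEL_DIR/vtune_profiler/apsvars.sh

# same collection settings and launch line as above
export MPS_STAT_LEVEL=5
mpirun -np 256 -genv USE_UCX=1 -genv UCX_NET_DEVICES mlx5_0:1 -genv I_MPI_FABRICS shm:ofi \
    -genv FI_PROVIDER=mlx aps -c=mpi -r $PWD/aps-impi \
    mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout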
Generating profile reports
% aps --report=./aps-impi
Loading 100.00%
| Summary information
|--------------------------------------------------------------------
Application : mdrun_mpi
Report creation date : 2020-06-09 11:35:16
Number of ranks : 256
Ranks per node : 32
OpenMP threads number per rank: 1
Used statistics : ./aps-impi-HDR100--0hcoll-sharp--20200609_113509-71475
|
| Your application is MPI bound.
| This may be caused by high busy wait time inside the library (imbalance), non-optimal communication schema or MPI library settings. Use MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore performance bottlenecks.
|
Elapsed time: 114.99 sec
MPI Time: 18.70 sec 16.41%
| Your application is MPI bound. This may be caused by high busy wait time
| inside the library (imbalance), non-optimal communication schema or MPI
| library settings. Explore the MPI Imbalance metric if it is available or use
| MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore
| possible performance bottlenecks.
MPI Imbalance: 9.13 sec 8.01%
| The application workload is not well balanced between MPI ranks. For more
| details about the MPI communication scheme use Intel(R) Trace Analyzer and
| Collector available as part of Intel(R) Parallel Studio XE Cluster Edition.
Top 5 MPI functions (avg time):
Sendrecv 5.39 sec ( 4.69 %)
Waitall 5.27 sec ( 4.58 %)
Bcast 3.40 sec ( 2.96 %)
Init_thread 2.16 sec ( 1.87 %)
Alltoall 0.99 sec ( 0.86 %)
Disk I/O Bound: 0.14 sec ( 0.13 %)
Data read: 166.1 MB
Data written: 175.9 KB
Memory Footprint:
Resident:
Per node:
Peak resident set size : 11217.82 MB (node helios025.hpcadvisorycouncil.com)
Average resident set size : 11106.19 MB
Per rank:
Peak resident set size : 435.46 MB (rank 0)
Average resident set size : 347.07 MB
Virtual:
Per node:
Peak memory consumption : 151387.64 MB (node helios025.hpcadvisorycouncil.com)
Average memory consumption : 151189.63 MB
Per rank:
Peak memory consumption : 4895.98 MB (rank 0)
Average memory consumption : 4724.68 MB
Graphical representation of this data is available in the HTML report: aps_report_20200609_114024.html
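The HTML report is typically written in the directory where aps --report was invoked. One way to inspect it (host name and path below are placeholders) is to copy it to a workstation and open it in a browser:
# on the workstation (placeholder host and path)
scp user@cluster-login:/path/to/rundir/aps_report_20200609_114024.html .
xdg-open aps_report_20200609_114024.html    # or open with any web browser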
Sample HTML report