Gromacs Profiling using APS
Intel® VTune™ Profiler's Application Performance Snapshot (APS) provides a quick view into different aspects of compute-intensive applications' performance, such as MPI and OpenMP* usage, CPU utilization, memory access efficiency, vectorization, I/O, and memory footprint. Application Performance Snapshot highlights key optimization areas and suggests specialized tools for tuning particular performance aspects, such as Intel VTune Profiler and Intel® Advisor. The tool is designed to be used on large MPI workloads and can help analyze different scalability issues.
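Independent of the Gromacs-specific commands below, the basic APS workflow is two steps: run the application under aps to collect a result directory, then generate a report from it. A minimal sketch of that workflow (my_app and the result directory name are placeholders, not part of this benchmark):

# Step 1: collect a snapshot by launching the application under aps
# (my_app is a placeholder; for MPI applications, aps is placed inside the mpirun command line)
aps ./my_app

# Step 2: produce text and HTML summaries from the collected result directory
aps --report=./aps_result_dir   # placeholder result directory name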
Loading modules for APS and Intel MPI
module load intel/2020.1.217
module load mkl/2020.1.217
source $INTEL_DIR/vtune_profiler/apsvars.sh
module load gcc/8.4.0
module load cmake/3.13.4
module load impi/2019.7.217
module load ucx/1.8.0
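An optional sanity check after loading the modules, to confirm the profiler and the Intel MPI launcher are on the PATH (exact versions and install paths will differ between systems):

# Verify the environment before launching the profiled run
module list
which aps
which mpirun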
Running Gromacs with APS
export MPS_STAT_LEVEL=5
mpirun -np 256 -genv USE_UCX=1 -genv UCX_NET_DEVICES mlx5_0:1 -genv I_MPI_FABRICS shm:ofi \
-genv FI_PROVIDER=mlx aps -c=mpi -r $PWD/aps-impi \
mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout
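For batch execution, the same collection step can be wrapped in a scheduler script. The sketch below assumes a Slurm cluster with 8 nodes of 32 ranks each (matching the 256 ranks used above); the partition name, walltime, and module set are placeholders to adapt to your system.

#!/bin/bash
#SBATCH -N 8                      # 8 nodes x 32 ranks/node = 256 MPI ranks
#SBATCH --ntasks-per-node=32
#SBATCH -p compute                # placeholder partition name
#SBATCH -t 02:00:00               # placeholder walltime

module load intel/2020.1.217 mkl/2020.1.217 gcc/8.4.0 impi/2019.7.217 ucx/1.8.0
source $INTEL_DIR/vtune_profiler/apsvars.sh

export MPS_STAT_LEVEL=5

mpirun -np 256 -genv USE_UCX=1 -genv UCX_NET_DEVICES mlx5_0:1 \
       -genv I_MPI_FABRICS shm:ofi -genv FI_PROVIDER=mlx \
       aps -c=mpi -r $PWD/aps-impi \
       mdrun_mpi -v -s stmv.tpr -nsteps 10000 -noconfout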
Generating profile reports
% aps --report=./aps-impi
Loading 100.00%
| Summary information
|--------------------------------------------------------------------
Application : mdrun_mpi
Report creation date : 2020-06-09 11:35:16
Number of ranks : 256
Ranks per node : 32
OpenMP threads number per rank: 1
Used statistics : ./aps-impi-HDR100--0hcoll-sharp--20200609_113509-71475
|
| Your application is MPI bound.
| This may be caused by high busy wait time inside the library (imbalance), non-optimal communication schema or MPI library settings. Use MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore performance bottlenecks.
|
Elapsed time: 114.99 sec
MPI Time: 18.70 sec 16.41%
| Your application is MPI bound. This may be caused by high busy wait time
| inside the library (imbalance), non-optimal communication schema or MPI
| library settings. Explore the MPI Imbalance metric if it is available or use
| MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore
| possible performance bottlenecks.
MPI Imbalance: 9.13 sec 8.01%
| The application workload is not well balanced between MPI ranks. For more
| details about the MPI communication scheme use Intel(R) Trace Analyzer and
| Collector available as part of Intel(R) Parallel Studio XE Cluster Edition.
Top 5 MPI functions (avg time):
Sendrecv 5.39 sec ( 4.69 %)
Waitall 5.27 sec ( 4.58 %)
Bcast 3.40 sec ( 2.96 %)
Init_thread 2.16 sec ( 1.87 %)
Alltoall 0.99 sec ( 0.86 %)
Disk I/O Bound: 0.14 sec ( 0.13 %)
Data read: 166.1 MB
Data written: 175.9 KB
Memory Footprint:
Resident:
Per node:
Peak resident set size : 11217.82 MB (node helios025.hpcadvisorycouncil.com)
Average resident set size : 11106.19 MB
Per rank:
Peak resident set size : 435.46 MB (rank 0)
Average resident set size : 347.07 MB
Virtual:
Per node:
Peak memory consumption : 151387.64 MB (node helios025.hpcadvisorycouncil.com)
Average memory consumption : 151189.63 MB
Per rank:
Peak memory consumption : 4895.98 MB (rank 0)
Average memory consumption : 4724.68 MB
Graphical representation of this data is available in the HTML report: aps_report_20200609_114024.html
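The HTML report is written to the run directory on the cluster; a common way to view it is to copy it to a local machine and open it in a browser. Hostname and remote path below are placeholders for your own environment:

# On your local machine: fetch the HTML report from the cluster and open it
scp user@cluster-login:/path/to/rundir/aps_report_20200609_114024.html .
xdg-open aps_report_20200609_114024.html   # or open the file in any web browser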
Sample HTML report