Running the RAJA Performance Suite¶

This section describes how to run the Suite. Instructions for configuring and building the Suite are provided in Building the RAJA Performance Suite.

Running a test¶

After compilation, a test executable will reside in the test subdirectory of the build space. We use this test for our continuous integration testing to make sure everything works when changes are made to the code. To run the test, type the test executable name:

$ ./test/test-raja-perf-suite.exe

This will run a few iterations of each kernel and variant that was built based on the CMake options specified to configure the build.

You can also test an individual kernel by setting an environment variable to the name of the kernel you want to run. For example, if you use a csh/tcsh shell:

$ setenv RAJA_PERFSUITE_UNIT_TEST DAXPY
$ ./test/test-raja-perf-suite.exe

or, if you use a bash shell:

$ RAJA_PERFSUITE_UNIT_TEST=DAXPY ./test/test-raja-perf-suite.exe

In either case, the test will run all compiled variants of the ‘DAXPY’ kernel.

Running the Suite¶

After compilation, the RAJA Performance Suite executable will reside in the bin subdirectory of the build space. The executable will be able to run all kernels and variants that have been built depending on which CMake options were specified to configure the build.

The Suite can be run in many different ways chosen by the command-line options passed to the executable. For example, you can run or exclude subsets of kernels, variants, or groups. You can also pass options to set problem sizes, number of times each kernel is run (sampled), and many other run parameters. The goal is to build the code once and use scripts or other means to run the Suite in different ways for the analyses you want to perform.

Many options appear in a long form with a double hyphen prefix (i.e., ‘–‘). Commonly used options are also available in a one or two character short form with a single hyphen prefix (i.e., ‘-’) for convenience. To see available options along with a brief description of each, pass the –help or -h option to the executable:

$ ./bin/raja-perf.exe --help

or:

$ ./bin/raja-perf.exe -h

Note

To see all available Suite execution options, pass the –help or -h option to the executable.

Important

We do not describe most of the Suite execution options in this guide since the runtime help output is the main reference for available options, defaults, and arguments they accept.

To run the Suite in its default mode, type the executable name with no command-line arguments:

$ ./bin/raja-perf.exe

This will run all kernels and variants that have been built in their default configurations. Information describing how the Suite will run along with some information about each kernel will appear on the screen. More information about kernel and execution details will also appear in a run report files generated in the run directory after Suite execution completes.

Note

You can pass the --dryrun command-line option to the executable to see a summary of how the Suite will execute by showing run parameters without actually running it. You can pass any other command-line options when doing a “dry run” and you will see that the given options are represented in the screen output.

Lastly, the program will report specific errors if given incorrect input, such as an option that requires a value and no value is provided. It will also emit a summary of command-line arguments it was given if the input contains something that the code does not know how to parse. For example, running the command:

$ ./bin/raja-perf.exe --dry-run -k DAXPY Foo

will report the following in the screen output:

...
kernel_input =
      DAXPY
      Foo
invalid_kernel_input =
      Foo
...
Suite will not be run now due to bad input.
  See run parameters or option messages above.

The output indicates that the kernel input is invalid because the string Foo is not the name of a kernel in the Suite, while DAXPY is the name of a kernel in the Suite.

Note

The Suite executable will attempt to provide helpful information if it is given incorrect input, such as command-line arguments that it does not know how to parse. Ill-formed input will be noted in screen output, hopefully making it easy for users to correct erroneous usage, such as mis-spelled option names.

Running with MPI¶

The Suite can be configured and compiled to run in a distributed memory parallel mode using MPI. Running the Suite on multiple MPI ranks will execute the same code for each kernel on each rank with minimal synchronization points to gather execution timing data from all ranks. This capability is provided so that individual kernel performance more closely aligns with how such kernels would perform in a real application. For example, compute node memory bandwidth may be different when running on a many core system using OpenMP multithreading to exercise all cores than when each core is mapped to an MPI rank.

Running the Suite on multiple MPI ranks is just like running any other MPI application. For example, issuing the following command on a machine with slurm scheduling:

$ srun -n 2 ./bin/raja-perf.exe

will run the entire Suite (all kernels and variants) in their default configurations on each of 2 MPI ranks.

The kernel information output shows how each kernel is run on each rank. Timing is reported on rank 0 and is gathered by invoking an MPI barrier, starting a timer, running the kernel, invoking an MPI barrier, and then stopping the timer. Total problem size across all MPI ranks can be calculated, if desired, by multiplying the number of MPI ranks by the problem size reported in the kernel information.

OpenMP target offload¶

OpenMP target offload variants of the kernels in the Suite are considered a work-in-progress since the RAJA OpenMP target offload back-end is a work-in-progress. If you configure them to build, they can be run with the executable ./bin/raja-perf-omptarget.exe which is distinct from the one described above. When the OpenMP target offload variants were developed, it was not possible for them to co-exist in the same executable as CUDA variants, for example. In the future, the build system may be reworked so that the OpenMP target variants can be run from the same executable as the other variants.

Additional Caliper Use Cases¶

If you specified enabling Caliper in the build configuration (passing the option -DRAJA_PERFSUITE_USE_CALIPER=On to CMake), Caliper .cali output files will be generated.

However, there are a couple of other supported use cases.

Collecting PAPI topdown statistics on Intel Architectures¶

On Intel systems, you can collect topdown PAPI counter statistics by using command line arguments

--add-to-spot-config, -atsc <string> [Default is none]

This appends additional parameters to the built-in Caliper spot config.

To include some PAPI counters (Intel arch), add the following to the command line

-atsc topdown.all

Caliper’s topdown service generates derived metrics from raw PAPI counters; a hierarchy of metrics to identify bottlenecks in out-of-order processors. This is based on an an approach described in Ahmad Yasin’s paper A Top-Down Method for Performance Analysis and Counters Architecture. The top level of the hierarchy has a reliable set of four derived metrics or starting weights (sum to 1.0) which include:

Frontend Bound. Stalls attributed to the front end which is responsible for fetching and decoding program code.
Bad Speculation. Fraction of the workload that is affected by incorrect execution paths, i.e. branch misprediction penalties
Retiring. Increases in this category reflects overall Instructions Per Cycle (IPC) fraction which is good in general. However, a large retiring fraction for non-vectorized code could also be a hint to the user to vectorize their code (see Yasin’s paper)
Backend Bound. Memory Bound where execution stalls are related to the memory subsystem, or Core Bound where execution unit occupancy is sub-optimal lowering IPC (more compiler dependent)

Note

Backend Bound = 1 - (Frontend Bound + Bad Speculation + Retiring)

Note

Caveats:

When collecting PAPI data in this way, you’ll be limited to running only one kernel variant, since Caliper maintains only one PAPI context.
Small kernels should be run at large problem sizes to minimize anomalous readings.
Measured values are only relevant for the innermost level of the Caliper tree hierarchy, i.e. Kernel.Tuning under investigation.
Some lower level derived quantities may appear anomalous with negative values. Collecting raw counters can help identify the discrepancy: -atsc topdown-counters.all

Note

Other caveats: Raw counter values are often noisy and require a lot of accommodation to collect accurate data including:

Turning off Hyperthreading

Turning off Prefetch as is done in Intel’s Memory Latency Checker (requires root access)

Adding LFENCE instruction to serialize and bracket code under test

Disabling preemption and hard interrupts

See Andreas Abel’s dissertation Automatic Generation of Models of Microarchitectures for more info on this and for a comprehensive look at the nanobench machinery.

Some helpful references:

Yasin’s Paper

Vtune-cookbook topdown method

Automatic Generation of Models of Microarchitectures

Generating trace events (time-series) for viewing in chrome://tracing or Perfetto¶

Perfetto user documentation is available at Perfetto.

Use Caliper’s event trace service to collect timestamp info, where kernel timing can be viewed using browser trace profile views. For example,

CALI_CONFIG=event-trace,event.timestamps ./raja-perf.exe -ek PI_ATOMIC INDEXLIST -sp

This will produce a separate .cali file with date prefix which looks something like 221108-100718_724_ZKrHC68b77Yd.cali

Then, we need to convert this .cali file to JSON records. But first, we need to make sure Caliper’s python reader is available in the PYTHONPATH environment variable

export PYTHONPATH=caliper-source-dir/python/caliper-reader

then run cali2traceevent.py. For example,

python3 ~/workspace/Caliper/python/cali2traceevent.py 221108-102406_956_9WkZo6xvetnu.cali RAJAPerf.trace.json

You can then load the resulting JSON file either in Chrome by going to chrome://tracing or in Perfetto.

For CUDA, assuming you built Caliper with CUDA support, you can collect and combine trace information for memcpy, kernel launch, synchronization, and kernels. For example,

CALI_CONFIG="event-trace(event.timestamps,trace.cuda=true,cuda.activities)" ./raja-perf.exe -v RAJA_CUDA Base_CUDA -k Algorithm_REDUCE_SUM -sp

Warning

When you run cali2traceevent.py you need to add –sort option before the filenames. This is needed because the trace.cuda event records need to be sorted before processing. Failing to do so may result in a Python traceback. New versions of the Caliper Python package have this option built in by default to avoid this issue.

~/workspace/Caliper/python/cali2traceevent.py --sort file.cali file.json

For HIP, substitute rocm.activities for cuda.activities.

Note

Currently there is no analog trace.rocm.