.. ##
.. ## Copyright (c) 2017-25, Lawrence Livermore National Security, LLC
.. ## and RAJA Performance Suite project contributors.
.. ## See the RAJAPerf/LICENSE file for details.
.. ##
.. ## SPDX-License-Identifier: (BSD-3-Clause)
.. ##
.. _output-label:
*********************************************
RAJA Performance Suite Output
*********************************************
This section describes the contents of output files generated by the Suite.
When the Suite is run, several output files are generated that contain data
describing the run. By default, the files are placed in the directory where the
executable is invoked, and the file names will contain the prefix
``RAJAPerf-`` and a string indicating the contents.
.. note:: You can provide command-line options to place the output files in a
different directory and/or give them a different file name prefix.
Such options and syntax are described in the Suite help output::
$ ./bin/raja-perf.exe -h
Currently, there are five output files generated that provide information
described below. All output files are plain text files. Other than the
checksum file, all file contents are in 'csv' format for easy processing by
common tools for generating plots, etc.
* **Timing** -- execution time (sec.) of each loop kernel and variant run.
* **Checksum** -- checksum values for each loop kernel and variant run to
ensure that they are producing correct results. Typically, a checksum
difference of ~1e-10 or less indicates that results generated by a kernel
variant match a reference variant.
* **Speedup** -- run time speedup of each kernel variant with respect to a
reference variant. The reference variant can be set with a command-line
option. If not specified, the first variant of a kernel that is run will
be used as the reference. The reference variant used is noted in the file.
* **Figure of Merit (FOM)** -- basic statistics about speedup of a RAJA
variant vs. baseline for each programming model run. When the execution
time of a RAJA variant differs from the corresponding baseline variant
by more than some tolerance, this is noted in the file with ``OVER_TOL``.
The default tolerance is 10% and can be changed via a command-line option.
* **Kernel** -- basic information about each kernel that is run, which is
the same for each variant of a kernel that is run. Kernel information
is described in more detail in the next section.
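As a rough illustration of how the speedup and FOM data relate to the timing
data, the sketch below computes speedups against a reference variant and flags
RAJA variants whose run time differs from the baseline by more than a
tolerance. The timing values, the dictionary layout, and the helper names
``speedups`` and ``over_tolerance`` are hypothetical, and the use of a relative
difference for the tolerance check is an assumption. This is not the Suite's
implementation.

```python
# Hypothetical timings (sec.) for one kernel; not taken from a real run.
timings = {
    "Base_Seq": 2.0,
    "RAJA_Seq": 2.5,
    "Base_OpenMP": 0.5,
    "RAJA_OpenMP": 0.52,
}


def speedups(times, reference):
    """Return reference_time / variant_time for every variant."""
    ref_time = times[reference]
    return {variant: ref_time / t for variant, t in times.items()}


def over_tolerance(base_time, raja_time, tol=0.10):
    """True if the RAJA time differs from the baseline by more than tol.

    Assumption: the difference is measured relative to the baseline time.
    """
    return abs(raja_time - base_time) / base_time > tol


# Speedup of each variant with respect to the chosen reference variant.
sp = speedups(timings, "Base_Seq")
for variant, s in sp.items():
    print(f"{variant}: speedup {s:.2f}")

# Compare each RAJA variant with the baseline of the same programming model.
for model in ("Seq", "OpenMP"):
    flagged = over_tolerance(timings[f"Base_{model}"], timings[f"RAJA_{model}"])
    print(f"{model}: {'OVER_TOL' if flagged else 'within tolerance'}")
```

With real data, the values in ``timings`` would come from the timing file
rather than being typed in by hand.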
.. _output_kerninfo-label:
===========================
Kernel information output
===========================
Information about kernels run when the RAJA Performance Suite executes is
placed in the ``RAJAPerf-kernels.csv`` file (unless the file prefix name is
changed by the user). This information is reported for rank zero when running
with multiple MPI processes. When running with more than one MPI rank,
information can be easily aggregated across all ranks if needed. For example,
the total aggregate problem size is the number of ranks times the problem size
shown in the kernel information.
Information reported in the file for each kernel is:
* **Name** -- full kernel name, format is group name followed by the kernel
name, separated by an underscore.
* **Feature** -- RAJA features exercised in RAJA variants of the kernel.
* **Problem size** -- Size of the problem represented by a kernel. See :ref:`output_probsize-label` below.
* **Reps** -- Number of times a kernel runs in a single pass through the
Suite.
* **Iterations/rep** -- Sum of sizes of all parallel iteration spaces for all loops run in a single kernel execution.
* **Kernels/rep** -- total number of loop structures run (or GPU kernels
launched) in each kernel repetition.
* **Bytes/rep** -- Total number of bytes read from and written to memory for
each repetition of the kernel.
* **FLOPs/rep** -- Total number of floating point operations executed for
each repetition of the kernel. Currently, we count each arithmetic operation
(+, -, *, /) and each function call, such as exp, sin, etc., as one FLOP. We
do not count operations like abs and comparisons (<, >, etc.) in the
FLOP count, so these numbers are rough estimates. For actual FLOP counts,
a performance analysis tool should be used.
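The kernel information can be combined with a measured run time to estimate
rates such as memory bandwidth and FLOP rate. The following is a minimal
sketch under stated assumptions: the ``KernelInfo`` class, its field names,
and all numeric values are hypothetical, and the run time is assumed to cover
all reps of the kernel. The results inherit the rough nature of the Bytes/rep
and FLOPs/rep estimates.

```python
from dataclasses import dataclass


@dataclass
class KernelInfo:
    # Field names loosely mirror the items listed above; values are made up.
    name: str
    problem_size: int
    reps: int
    bytes_per_rep: int
    flops_per_rep: int


def bandwidth_gb_per_s(info, total_time_s):
    """Estimated memory bandwidth over all reps, in GB/s (1 GB = 1e9 bytes)."""
    return info.bytes_per_rep * info.reps / total_time_s / 1.0e9


def gflops(info, total_time_s):
    """Estimated floating point rate over all reps, in GFLOP/s."""
    return info.flops_per_rep * info.reps / total_time_s / 1.0e9


def aggregate_problem_size(info, num_ranks):
    """Total problem size over all MPI ranks (per-rank size times rank count)."""
    return info.problem_size * num_ranks


# Hypothetical entry: 3 arrays of 8-byte values and 1 FLOP per element.
info = KernelInfo("Stream_ADD", 1_000_000, 1000, 24_000_000, 1_000_000)
total_time_s = 2.0  # hypothetical run time for all reps

print(f"{info.name}: {bandwidth_gb_per_s(info, total_time_s):.1f} GB/s")
print(f"{info.name}: {gflops(info, total_time_s):.2f} GFLOP/s")
print(f"{info.name}: {aggregate_problem_size(info, 4)} total size on 4 ranks")
```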
.. _output_probsize-label:
============================
Notes about *problem size*
============================
This section describes how the Suite calculates problem sizes and the
rationale behind it.
* **The concept of problem size is subjective and can be interpreted
differently depending on the kernel structure and what one is trying to
measure.** For example, problem size could refer to the amount of data
needed to be stored in memory to run the problem, or it could refer to
the amount of parallel work that is possible, etc.
* The Suite uses three notions of problem size for each kernel: *default*,
*target*, and *actual*. Default is the problem size defined for a kernel,
which is the size that is run if no run time options are provided to change
it. Target is the desired problem size to run, based on the default
settings and any run time input that alters them. Actual is the problem
size that is run, which each kernel calculates from the defaults and the
run time input.
* We employ an admittedly loose definition of problem size for each kernel,
which depends on the kernel structure. Of all *loop structures*
(e.g., single loop, nested loops, etc.) that are run for a kernel (note
that some kernels run multiple loops, possibly with different sizes or
loop structures), problem size refers to the size of the data set required
to generate the kernel result. The interpretation of this and the
definition of problem size for each kernel in the suite is determined by
the kernel developer and team discussion.
.. note:: Problem size is always reported per process/MPI rank. To get the total
problem size over all ranks when running with MPI, multiply the
problem size by the number of MPI ranks.
Here are a few examples to give a better sense of how we determine problem
size for various kernels in the Suite.
Vector addition::
for (int i = 0; i < N; ++i) {
  c[i] = a[i] + b[i];
}
The problem size for this kernel is N, the loop length. Note that this happens
to match the size of the vectors a, b, c and the total amount of parallel work
in the kernel. This is common for simple, data parallel kernels.
Matrix-vector multiplication::
for (int r = 0; r < N_r; ++r) {
  b[r] = 0;
  for (int c = 0; c < N_c; ++c) {
    b[r] += A[r][c] * x[c];
  }
}
The problem size is N_r * N_c, the size of the matrix. Note that this matches
the total size of the problem iteration space, but the total amount of
parallel work is N_r, the number of rows in the matrix and the length of the
vector b.
Matrix-matrix multiplication::
for (int i = 0; i < N_i; ++i) {
  for (int j = 0; j < N_j; ++j) {
    A[i][j] = 0;
    for (int k = 0; k < N_k; ++k) {
      A[i][j] += B[i][k] * C[k][j];
    }
  }
}
Here, we are multiplying matrix B (N_i x N_k) and matrix C (N_k x N_j) and
storing the result in matrix A (N_i x N_j). Problem size could be chosen to be
the maximum number of entries in matrix B or C. We choose the size of matrix
A (N_i * N_j), which is more closely aligned with the number of independent
operations (i.e., the amount of parallel work) in the kernel.
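The three examples above can be summarized in a short sketch that returns, for
each kernel, the problem size as defined in this section alongside the total
iteration space and the amount of parallel work. The function names and
dictionary keys are hypothetical and chosen only for illustration; they are
not part of the Suite.

```python
def vector_add_sizes(n):
    """Vector addition: all three quantities equal the loop length N."""
    return {"problem_size": n, "iteration_space": n, "parallel_work": n}


def mat_vec_sizes(n_r, n_c):
    """Matrix-vector multiply: problem size is the matrix size N_r * N_c."""
    return {
        "problem_size": n_r * n_c,
        "iteration_space": n_r * n_c,
        "parallel_work": n_r,  # one independent result per row
    }


def mat_mat_sizes(n_i, n_j, n_k):
    """Matrix-matrix multiply: problem size is the size of result matrix A."""
    return {
        "problem_size": n_i * n_j,
        "iteration_space": n_i * n_j * n_k,
        "parallel_work": n_i * n_j,  # one independent result per entry of A
    }


print(vector_add_sizes(1000))
print(mat_vec_sizes(100, 50))
print(mat_mat_sizes(10, 20, 30))
```

Only for the simple data parallel kernel do the three quantities coincide,
which is why problem size has to be defined kernel by kernel.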
===========================
Caliper output files
===========================
If RAJAPerf is built with Caliper support turned on, then in addition to the
output files described above, the Suite saves a ``.cali`` file for each variant
and tuning that is run, such as:
``Base_OpenMP-default.cali``, ``Lambda_OpenMP-default.cali``, ``Base_CUDA-block_128.cali``, etc.
You can specify which variants and tunings to run with the ``--variants`` and
``--tunings`` command-line options.
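When post-processing many ``.cali`` files, it can help to recover the variant
and tuning from each file name. The sketch below assumes the
``<variant>-<tuning>.cali`` pattern suggested by the example names above; the
helper ``parse_cali_name`` is hypothetical and not part of RAJAPerf or Caliper.

```python
from pathlib import Path


def parse_cali_name(filename):
    """Split '<variant>-<tuning>.cali' into (variant, tuning).

    Assumes the naming pattern seen in the examples above. If the name
    has no '-' separator, the tuning is returned as None.
    """
    stem = Path(filename).stem  # drop the '.cali' extension
    variant, sep, tuning = stem.partition("-")
    return variant, (tuning if sep else None)


for name in ("Base_OpenMP-default.cali", "Base_CUDA-block_128.cali"):
    variant, tuning = parse_cali_name(name)
    print(f"{name}: variant={variant}, tuning={tuning}")
```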
There are several techniques to display the Caliper trees (timing hierarchy):
| 1: Caliper's cali-query tool.
| The first technique uses Caliper's own ``cali-query`` tool. Run it with
| **-T** (or, equivalently, **--tree**) to display the tree.
|
| cali-query -T $HOME/data/default_problem_size/gcc/RAJA_Seq.cali
2: Caliper's Python module *caliperreader*::
import os
import caliperreader as cr
DATA_DIR = os.getenv('HOME')+"/data/default_problem_size/gcc"
os.chdir(DATA_DIR)
r = cr.CaliperReader()
r.read("RAJA_Seq.cali")
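# Name of the timing metric to print for each region in the tree.
metric = 'avg#inclusive#sum#time.duration'
for rec in r.records:
    path = rec['path'] if 'path' in rec else 'UNKNOWN'
    time = rec[metric] if metric in rec else '0'
    # Skip records that carry no region path.
    if 'UNKNOWN' not in path:
        # Nested regions are stored as a list; join them into one string.
        if isinstance(path, list):
            path = "/".join(path)
        print("{0}: {1}".format(path, time))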
You can add a couple of lines to view the metadata keys captured by Caliper/Adiak::
for g in r.globals:
    print(g)
You can also add a line to display a metadata value from the dictionary **r.globals**.
For example, print the OpenMP max threads value recorded at run time::
print('OMP Max Threads: ' + r.globals['omp_max_threads'])
or the variant represented in this file::
print('Variant: ' + r.globals['variant'])
.. note:: The script above was written using caliper-reader 0.3.0,
but is fairly generic. Other version usage notes may be
found at the link below
`caliper-reader `_
3: Using the *Hatchet* Python module for single files::
import os
import hatchet as ht
DATA_DIR = os.getenv('HOME')+"/data/default_problem_size/gcc"
os.chdir(DATA_DIR)
gf1 = ht.GraphFrame.from_caliperreader("RAJA_Seq.cali")
print(gf1.tree())
`Find out more on hatchet `_
4: Using the *Thicket* Python module for multiple files::
import os
import thicket as th
DATA_DIR = os.getenv('HOME')+"/data/default_problem_size/gcc"
os.chdir(DATA_DIR)
th1 = th.Thicket.from_caliperreader([
    "RAJA_Seq-default.cali",
    "Base_Seq-default.cali",
    "Base_CUDA-block_128.cali",
    "Base_CUDA-block_256.cali",
])
print(th1.tree())
`Find out more on thicket `_