.. ##
.. ## Copyright (c) Lawrence Livermore National Security, LLC and other
.. ## RAJA Project Developers. See top-level LICENSE and COPYRIGHT
.. ## files for dates and other details. No copyright assignment is required
.. ## to contribute to RAJA Performance Suite.
.. ##
.. ## SPDX-License-Identifier: (BSD-3-Clause)
.. ##

.. _output-label:

*********************************************
RAJA Performance Suite Output
*********************************************

This section describes the contents of output files generated by the Suite.

When the Suite is run, several output files are generated that contain data
describing the run. By default, the files are placed in the directory where the
executable is invoked. Each file name contains the prefix ``RAJAPerf-``, a
string indicating the contents, and a suffix indicating the type of data in the
file, either raw text (``*.txt``) or CSV (``*.csv``).

.. note:: You can provide command-line options to place the output files in a
          different directory and/or give them a different file name prefix.
          Such options and syntax are described in the Suite help output::

            $ ./bin/raja-perf.exe -h 
          
Currently, six output files are generated that provide the information
described below. All output files are plain text files; some file contents are
in CSV format for easy processing by common tools for generating plots, etc.
The output files include:

  * **Kernel Run Data** -- a CSV file containing summarized run data about each
    kernel variant tuning that is run. These commonly used values are calculated
    from values output in other files. The calculations are described in more
    detail in :ref:`output_kernel_run_data-label`.
  * **Kernel Details** -- a CSV file containing basic information about each kernel
    that is run, which is the same for each variant of a kernel that is run.
    Kernel information is described in more detail in :ref:`output_kernel_details-label`.
  * **Timing** -- a CSV file containing execution time (sec.) of each loop
    kernel and variant run. Variants that are not run are indicated with the
    string "Not run".
  * **Checksum** -- checksum values for each loop kernel and variant and a
    checksum difference compared to the reference variant (first variant listed
    for each kernel). This file helps to ensure that all kernel variants are
    producing correct results. Typically, a checksum difference of ~1e-10 or
    less indicates that results generated by a kernel variant match the
    reference variant.
  * **Speedup** -- a CSV file containing run time speedup of each kernel
    variant with respect to its reference variant. The reference variant can
    be set with a command-line option. If not specified, the first variant of
    a kernel that is run will be used as the reference. The reference variant
    used is noted at the top of the file.
  * **Figure of Merit (FOM)** -- a CSV file containing signed speedup of a RAJA 
    variant vs. baseline for each programming model run. When the execution 
    time of a RAJA variant differs from the corresponding baseline variant 
    by more than some tolerance, this is noted in the file with ``OVER_TOL``. 
    The default tolerance is 10% and can be changed via a command-line option.
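
The checks described in the list above can be sketched in a few lines of
Python. This is a minimal illustration under stated assumptions, not the
Suite's implementation: the function names are hypothetical, speedup is assumed
to be the reference time divided by the variant time, the checksum comparison
is assumed to use an absolute difference, and the tolerance test is assumed to
use the relative time difference with respect to the baseline.

```python
def speedup(reference_time, variant_time):
    """Speedup of a variant relative to its reference (assumed: t_ref / t_var)."""
    return reference_time / variant_time


def checksums_match(checksum, reference_checksum, tolerance=1.0e-10):
    """True if the absolute checksum difference is within the tolerance."""
    return abs(checksum - reference_checksum) <= tolerance


def over_tolerance(raja_time, baseline_time, tolerance=0.10):
    """True if the RAJA time differs from the baseline by more than tolerance.

    Assumption: the difference is measured relative to the baseline time.
    """
    return abs(raja_time - baseline_time) / baseline_time > tolerance


if __name__ == "__main__":
    # Made-up timings (sec.) and checksums for illustration only.
    print("speedup:", speedup(2.0, 1.0))
    print("checksums match:", checksums_match(1.0, 1.0 + 1.0e-12))
    print("OVER_TOL:", over_tolerance(1.2, 1.0))
```

A value greater than one from ``speedup`` means the variant ran faster than
its reference under this convention.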

.. _output_kernel_run_data-label:

======================
Kernel Run Data output
======================

Summarized run data about the kernel variant tunings run when the RAJA
Performance Suite executes is placed in the ``RAJAPerf-kernel-run-data.csv`` file
(unless the file prefix name is changed by the user).

Data reported in the file for each kernel variant tuning is:

  * **Name** -- full kernel name. The format is the group name followed by the
    kernel name, separated by an underscore.
  * **Variant** -- variant name. The format is the implementation approach set
    name followed by the backend set name, separated by an underscore.
  * **Tuning** -- tuning name. These names are normally chosen to differentiate
    tunings by indicating how each tuning was implemented. For example,
    "Default" is used for tunings similar to the reference implementation. For
    GPU variant tunings, the block size is often included; for example,
    "block_256" is a tuning using a block size of 256. Some tuning names refer
    to vendor libraries or RAJA APIs used in the implementation of the tuning;
    for example, REDUCE_SUM has a "cub" tuning in Base_CUDA that uses the CUB
    library from NVIDIA to implement the reduction.
  * **Problem Size** -- Size of the problem run in a kernel. Find a discussion
    about the meaning of problem size in :ref:`output_probsize-label`.
  * **Checksum** -- Whether the checksum of the kernel passes or fails to meet
    the tolerance relative to the reference variant tuning. Find more
    information on checksums in :ref:`kernel_class_impl_gen-label`.
  * **Mean time per rep (sec.)** -- the execution time for a single repetition
    to complete, averaged over all passes. This is calculated by dividing the
    timing information in the ``RAJAPerf-timing-Average.csv`` file by the *Reps*
    for the kernel from the ``RAJAPerf-kernel-details.csv`` file.
  * **Mean Bandwidth (GiB per sec.)** -- the bandwidth, in giga-bytes per
    second, achieved by the benchmark, averaged over all passes. This is
    calculated by dividing the *BytesMoved/rep* for the kernel from the
    ``RAJAPerf-kernel-details.csv`` file by the *Mean time per rep (sec.)*.
  * **Mean flops (gigaFLOP per sec.)** -- the floating point operation rate,
    in giga-flops, achieved by the benchmark, averaged over all passes. This is
    calculated by dividing the *FLOPs/rep* for the kernel from the
    ``RAJAPerf-kernel-details.csv`` file by the *Mean time per rep (sec.)*.
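
The derived quantities above can be sketched as follows. This is a minimal
illustration rather than the Suite's implementation: the function name
``kernel_run_data`` and its argument names are hypothetical, the input time is
assumed to be the average time for all reps of a kernel in a pass, and the unit
conversions assume 1 GiB = 2^30 bytes and 1 gigaFLOP = 10^9 floating point
operations.

```python
def kernel_run_data(avg_time_sec, reps, bytes_moved_per_rep, flops_per_rep):
    """Return (mean time per rep, bandwidth in GiB/s, rate in gigaFLOP/s)."""
    # Average time for all reps divided by the number of reps.
    mean_time_per_rep = avg_time_sec / reps
    # Assumption: 1 GiB = 2**30 bytes.
    bandwidth_gib = bytes_moved_per_rep / mean_time_per_rep / 2**30
    # Assumption: 1 gigaFLOP = 1.0e9 floating point operations.
    gigaflops = flops_per_rep / mean_time_per_rep / 1.0e9
    return mean_time_per_rep, bandwidth_gib, gigaflops


if __name__ == "__main__":
    # Made-up inputs for illustration only.
    time_per_rep, bandwidth, flop_rate = kernel_run_data(
        avg_time_sec=2.0, reps=4, bytes_moved_per_rep=2**30, flops_per_rep=1.0e9
    )
    print(time_per_rep, bandwidth, flop_rate)
```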

.. _output_kernel_details-label:

=====================
Kernel Details output
=====================

Information about kernels run when the RAJA Performance Suite executes is 
placed in the ``RAJAPerf-kernel-details.csv`` file (unless the file prefix name
is changed by the user). This information is reported for rank zero when running
with multiple MPI processes. When running with more than one MPI rank, 
information can be easily aggregated across all ranks if needed. For example,
the total aggregate problem size is the number of ranks times the problem size 
shown in the kernel information. 

Information reported in the file for each kernel is:

  * **Name** -- full kernel name. The format is the group name followed by the
    kernel name, separated by an underscore.
  * **Problem size** -- Size of the problem run in a kernel. Find a discussion
    about the meaning of problem size in :ref:`output_probsize-label`.
  * **Reps** -- Number of times a kernel runs in a single pass through the 
    Suite.  
  * **Iterations/rep** -- Sum of sizes of all parallel iteration spaces for all
    loops run in a single kernel execution.
  * **Kernels/rep** -- total number of loop structures run (or GPU kernels 
    launched) in each kernel repetition.
  * **BytesMoved/rep** -- Total number of bytes read from and written to memory
    for each repetition of kernel. This is a best case scenario of the total
    traffic to and from memory assuming perfect cache reuse and ignoring partial
    usage of data in some memory transactions.
  * **FLOPs/rep** -- Total number of floating point operations executed for 
    each repetition of kernel. Currently, we count arithmetic operations 
    (+, -, *, /) and functions, such as exp, sin, etc. as one FLOP. We do not 
    currently count operations like abs and comparisons (<, >, etc.) in the 
    FLOP count. So these numbers are rough estimates. For actual FLOP counts, 
    a performance analysis tool should be used.
  * **BytesTouched/rep** -- Total number of bytes accessed in memory for each
    repetition of kernel. This is a best case scenario for the amount of cache
    needed to fit all of the data used by the kernel ignoring partial usage of
    some cache lines.
  * **BytesRead/rep** -- Total number of bytes read from memory for 
    each repetition of kernel.
  * **BytesWritten/rep** -- Total number of bytes written to memory for 
    each repetition of kernel.
  * **BytesModifyWritten/rep** -- Total number of bytes modified in memory for
    each repetition of kernel. This is the intersection of the bytes counted in
    both ``BytesRead/rep`` and ``BytesWritten/rep``.
  * **BytesAtomicModifyWritten/rep** -- Total number of bytes modified in memory
    by atomic operations in a kernel. If a kernel contains no atomic operations,
    the value of zero is reported.
  * **ChecksumConsistency** -- The consistency of the checksums of the kernel.
    Kernels that always get the same checksum are ``Consistent``, kernels that
    can get different checksums for each variant tuning are
    ``ConsistentPerVariantTuning``, and kernels with checksums that can vary from
    run to run are ``Inconsistent``.
  * **OperationalComplexity** -- The operational complexity of the kernel, where
    N is the *problem size* of the kernel.
  * **MaxPerfectLoopDimensions** -- Number of levels in the largest perfectly
    nested loop. This only counts parallelized dimensions and ignores inner or
    outer sequential loops. For example the GEMM kernel has 2 perfectly nested
    loop levels as the inner loop is implemented sequentially to perform a reduction.
  * **ProblemDimensionality** -- The dimensionality of the problem domain,
    regardless of physical data layout. For example, the LTIMES kernel has
    a problem dimensionality of 3, because phi (g, m, and z) and psi
    (g, d, and z) are indexed over 3 dimensions.

.. note:: The ``Bytes*/rep`` attributes count how many bytes are accessed in
          memory, such as DRAM or HBM, under idealized conditions. They assume
          perfect caching, so a byte that is read multiple times is counted as
          being read from memory only once.

.. note:: The ``Bytes*/rep`` and ``FLOPs/rep`` counts are estimates for kernels
          involving randomness or difficult to count algorithms. The counts are
          meant to give a reasonable approximation of achieved bandwidth and flop
          rate. Kernels that perform significantly outside of expectations are
          good candidates for more detailed performance studies.
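
As an illustration of these idealized counts, the sketch below works them out
for a vector addition ``c[i] = a[i] + b[i]`` of length N. It is not taken from
the Suite source: the function name is hypothetical, 8-byte double-precision
elements are assumed, and *BytesMoved/rep* is assumed here to be the sum of the
bytes read and the bytes written.

```python
def vector_add_counts(n, bytes_per_element=8):
    """Idealized per-rep counts for c[i] = a[i] + b[i] with i in [0, n)."""
    bytes_read = 2 * n * bytes_per_element   # a and b are each read once
    bytes_written = n * bytes_per_element    # c is written once
    return {
        "BytesRead/rep": bytes_read,
        "BytesWritten/rep": bytes_written,
        "BytesModifyWritten/rep": 0,          # no array is both read and written
        "BytesAtomicModifyWritten/rep": 0,    # no atomic operations
        # Assumption: bytes moved is bytes read plus bytes written.
        "BytesMoved/rep": bytes_read + bytes_written,
        # All three arrays must fit in cache to hold the whole data set.
        "BytesTouched/rep": 3 * n * bytes_per_element,
        "FLOPs/rep": n,                       # one addition per element
    }


if __name__ == "__main__":
    for name, value in vector_add_counts(1000).items():
        print(name, value)
```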

.. _output_probsize-label:

============================
Notes about *problem size*
============================

This section describes how the Suite calculates problem sizes and the 
rationale behind it.

  * **The concept of problem size is subjective and can be interpreted 
    differently depending on the kernel structure and what one is trying to 
    measure.** For example, problem size could refer to the amount of data 
    needed to be stored in memory to run the problem, or it could refer to 
    the amount of parallel work that is possible, etc.
  * The Suite uses three notions of problem size for each kernel: *default*, 
    *target*, and *actual*. Default is the problem size defined for a kernel 
    and the size that is run if no command-line options are provided to run a 
    different size. Target is the desired problem size to run based on default 
    settings and alterations to those if input is provided to change the 
    default. Actual is the problem size that is run based on how each kernel 
    calculates it based on defaults and run time input.
  * We employ an admittedly loose definition of problem size for each kernel, 
    which depends on the kernel structure. Of all *loop structures* 
    (e.g., single loop, nested loops, etc.) that are run for a kernel (note 
    that some kernels run multiple loops, possibly with different sizes or 
    loop structures), problem size refers to the size of the data set required 
    to generate the kernel result. The interpretation of this and the 
    definition of problem size for each kernel in the suite is determined by 
    the kernel developer and team discussion.

.. note:: Problem size is always reported per process/MPI rank. To get the
          total problem size over all ranks when running with MPI, multiply
          the problem size by the number of MPI ranks.

Here are a few examples to give a better sense of how we determine problem 
size for various kernels in the Suite.

Vector addition::

   for (int i = 0; i < N; ++i) {
     c[i] = a[i] + b[i];
   }

The problem size for this kernel is N, the loop length. Note that this happens 
to match the size of the vectors a, b, c and the total amount of parallel work 
in the kernel. This is common for simple, data parallel kernels.

Matrix-vector multiplication::

   for (int r = 0; r < N_r; ++r) {
     b[r] = 0;
     for (int c = 0; c < N_c; ++c) {
       b[r] += A[r][c] * x[c];
     }
   }

The problem size is N_r * N_c, the size of the matrix. Note that this matches 
the total size of the problem iteration space, but the total amount of 
parallel work is N_r, the number of rows in the matrix and the length of the 
vector b.

Matrix-matrix multiplication::

   for (int i = 0; i < N_i; ++i) {
     for (int j = 0; j < N_j; ++j) {
       A[i][j] = 0;
       for (int k = 0; k < N_k; ++k) {
         A[i][j] += B[i][k] * C[k][j];
       }
     }
   }

Here, we are multiplying matrix B (N_i x N_k) and matrix C (N_k x N_j) and 
storing the result in matrix A (N_i x N_j). Problem size could be chosen to be 
the maximum number of entries in matrix B or C. We choose the size of matrix 
A (N_i * N_j), which is more closely aligned with the number of independent 
operations (i.e., the amount of parallel work) in the kernel.
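
The three examples can be summarized in a short Python sketch that returns the
problem size and the amount of parallel work for each kernel, as described in
the text. The function names are hypothetical and are not part of the Suite.
The last helper applies the per-rank rule: total problem size is the per-rank
problem size times the number of MPI ranks.

```python
def vector_add_sizes(n):
    """Vector addition: problem size and parallel work are both N."""
    return {"problem_size": n, "parallel_work": n}


def matvec_sizes(n_r, n_c):
    """Matrix-vector multiply: problem size is N_r * N_c, parallel work is N_r."""
    return {"problem_size": n_r * n_c, "parallel_work": n_r}


def matmat_sizes(n_i, n_j, n_k):
    """Matrix-matrix multiply: problem size is the size of A, N_i * N_j.

    N_k is the length of the sequential inner loop and does not enter
    the problem size chosen in the text.
    """
    return {"problem_size": n_i * n_j, "parallel_work": n_i * n_j}


def total_problem_size(per_rank_problem_size, num_ranks):
    """Problem size aggregated over all MPI ranks."""
    return per_rank_problem_size * num_ranks


if __name__ == "__main__":
    print(vector_add_sizes(1000))
    print(matvec_sizes(100, 50))
    print(matmat_sizes(10, 20, 30))
    print(total_problem_size(1000, 4))
```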


===========================
Caliper output files
===========================

If you've built RAJAPerf with Caliper support turned on, then in addition to the
outputs mentioned above, the Suite also saves a ``.cali`` file for each variant
and tuning run, such as ``Base_OpenMP-default.cali``,
``Lambda_OpenMP-default.cali``, ``Base_CUDA-block_128.cali``, etc.

Also, by using the ``--variants`` and ``--tunings`` options on the command line,
you can specify which variants and tunings to run.

There are several techniques to display the Caliper trees (Timing Hierarchy):

| 1: Caliper's cali-query tool.
| The first technique uses Caliper's own tool, cali-query. Run it with
| **-T** (or, equivalently, **--tree**) to display the tree.
|
| cali-query -T $HOME/data/default_problem_size/gcc/RAJA_Seq.cali

2: Caliper's Python module *caliperreader*::

  import os
  import caliperreader as cr

  # Directory containing the .cali files
  DATA_DIR = os.getenv('HOME')+"/data/default_problem_size/gcc"
  os.chdir(DATA_DIR)

  r = cr.CaliperReader()
  r.read("RAJA_Seq.cali")

  # Timing metric to print for each region in the tree
  metric = 'avg#inclusive#sum#time.duration'
  for rec in r.records:
      path = rec['path'] if 'path' in rec else 'UNKNOWN'
      time = rec[metric] if metric in rec else '0'
      if 'UNKNOWN' not in path:
          # Nested regions are stored as a list of path components
          if isinstance(path, list):
              path = "/".join(path)
          print("{0}: {1}".format(path, time))
  
You can add a couple of lines to view the metadata keys captured by Caliper/Adiak::

  for g in r.globals:  
    print(g)  

You can also add a line to display a metadata value in the dictionary **r.globals**.

For example, print the OpenMP Max Threads value recorded at runtime::

  print('OMP Max Threads: ' + r.globals['omp_max_threads'])

or the variant represented in this file::  
  
  print('Variant: ' + r.globals['variant'])
 

.. note:: The script above was written using caliper-reader 0.3.0, 
          but is fairly generic. Other version usage notes may be 
          found at `caliper-reader <https://pypi.org/project/caliper-reader/>`_ 
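
The record loop in the caliperreader script above can also be factored into a
small helper that returns a dictionary instead of printing. The sketch below is
hypothetical: the name ``region_times`` is not part of caliper-reader, it
assumes records are dictionaries shaped like the entries of ``r.records`` used
above (a ``path`` that is a string or a list, and metric values that convert to
``float``), and the sample records are made up so the helper can be tried
without a ``.cali`` file.

```python
def region_times(records, metric='avg#inclusive#sum#time.duration'):
    """Map each region path to its metric value, skipping records with no path."""
    times = {}
    for rec in records:
        if 'path' not in rec:
            continue
        path = rec['path']
        # Nested regions are stored as a list of path components.
        if isinstance(path, list):
            path = "/".join(path)
        # As in the script above, a missing metric is reported as zero.
        times[path] = float(rec.get(metric, 0))
    return times


if __name__ == "__main__":
    # Made-up records for illustration; real ones come from r.records.
    sample_records = [
        {'path': 'RAJAPerf', 'avg#inclusive#sum#time.duration': '1.5'},
        {'path': ['RAJAPerf', 'Stream', 'Stream_ADD'],
         'avg#inclusive#sum#time.duration': '0.25'},
        {'other': 'record without a path'},
    ]
    for path, time in region_times(sample_records).items():
        print("{0}: {1}".format(path, time))
```

With a real file, the helper would be called as ``region_times(r.records)``
after ``r.read(...)``.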


3: Using the *Hatchet* Python module for single files::

  import os
  import hatchet as ht
  DATA_DIR = os.getenv('HOME')+"/data/default_problem_size/gcc"
  os.chdir(DATA_DIR)
  gf1 = ht.GraphFrame.from_caliperreader("RAJA_Seq.cali")
  print(gf1.tree())

To learn more about Hatchet, please see `Hatchet <https://github.com/LLNL/hatchet>`_.

4: Using the *Thicket* Python module for multiple files::

  import os
  import thicket as th
  DATA_DIR = os.getenv('HOME')+"/data/default_problem_size/gcc"
  os.chdir(DATA_DIR)
  # One .cali file per variant and tuning to compare
  cali_files = ["RAJA_Seq-default.cali", "Base_Seq-default.cali",
                "Base_CUDA-block_128.cali", "Base_CUDA-block_256.cali"]
  th1 = th.Thicket.from_caliperreader(cali_files)
  print(th1.tree())

To learn more about Thicket, please see `Thicket <https://github.com/LLNL/thicket>`_.