Kernel Class Implementation

Each kernel in the Suite follows a similar source file organization and implementation pattern for consistency and ease of analysis and understanding. Here, we describe important and conventions applies in each kernel class implementation that must be followed to ensure that all kernels integrate into the RAJA Performance Suite in the same way.

General class methods

Class methods that do not execute kernel variants and which are not specific to any kernel variant implementation are defined in one implementation file. For the ADD kernel that we are describing, this is the source file ADD.cpp, which in its entirety is:

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
// Copyright (c) 2017-25, Lawrence Livermore National Security, LLC
// and RAJA Performance Suite project contributors.
// See the RAJAPerf/LICENSE file for details.
//
// SPDX-License-Identifier: (BSD-3-Clause)
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//

#include "ADD.hpp"

#include "RAJA/RAJA.hpp"

#include "common/DataUtils.hpp"

namespace rajaperf
{
namespace stream
{


ADD::ADD(const RunParams& params)
  : KernelBase(rajaperf::Stream_ADD, params)
{
  setDefaultProblemSize(1000000);
  setDefaultReps(1000);

  setActualProblemSize( getTargetProblemSize() );

  setItsPerRep( getActualProblemSize() );
  setKernelsPerRep(1);
  setBytesReadPerRep( 2*sizeof(Real_type) * getActualProblemSize() );
  setBytesWrittenPerRep( 1*sizeof(Real_type ) * getActualProblemSize() );
  setBytesAtomicModifyWrittenPerRep( 0 );
  setFLOPsPerRep(1 * getActualProblemSize());

  setComplexity(Complexity::N);

  setUsesFeature(Forall);

  setVariantDefined( Base_Seq );
  setVariantDefined( Lambda_Seq );
  setVariantDefined( RAJA_Seq );

  setVariantDefined( Base_OpenMP );
  setVariantDefined( Lambda_OpenMP );
  setVariantDefined( RAJA_OpenMP );

  setVariantDefined( Base_OpenMPTarget );
  setVariantDefined( RAJA_OpenMPTarget );

  setVariantDefined( Base_CUDA );
  setVariantDefined( Lambda_CUDA );
  setVariantDefined( RAJA_CUDA );

  setVariantDefined( Base_HIP );
  setVariantDefined( Lambda_HIP );
  setVariantDefined( RAJA_HIP );

  setVariantDefined( Base_SYCL );
  setVariantDefined( RAJA_SYCL );

  setVariantDefined( Kokkos_Lambda );
}

ADD::~ADD()
{
}

void ADD::setUp(VariantID vid, size_t RAJAPERF_UNUSED_ARG(tune_idx))
{
  allocAndInitData(m_a, getActualProblemSize(), vid);
  allocAndInitData(m_b, getActualProblemSize(), vid);
  allocAndInitDataConst(m_c, getActualProblemSize(), 0.0, vid);
}

void ADD::updateChecksum(VariantID vid, size_t tune_idx)
{
  checksum[vid][tune_idx] += calcChecksum(m_c, getActualProblemSize(), vid);
}

void ADD::tearDown(VariantID vid, size_t RAJAPERF_UNUSED_ARG(tune_idx))
{
  (void) vid;
  deallocData(m_a, vid);
  deallocData(m_b, vid);
  deallocData(m_c, vid);
}

} // end namespace stream
} // end namespace rajaperf

The methods in the source file are:

  • Class constructor, which calls the KernelBase class constructor passing the KernelID and the RunParams object, which are used to initialize the base class. The constructor calls other base class methods to set information about the kernel, which is specific to the kernel. Such information includes:

    • Default problem size and number of kernel repetitions to generate execution run time.

    • The actual problem size that will be run, which is a function of the default size and command-line input.

    • The number of loop iterations that are performed and the number of loop kernels that run each time the kernel is executed. Note that the ADD kernel is based on a simple, single for-loop. However, other kernels in the Suite execute multiple loop kernels.

    • The number of bytes read and written and the number of FLOPS performed for each kernel execution.

    • Which RAJA features the kernel exercises.

    • Which Suite variants are defined, or implemented for the kernel. Each variant requires a call to the setVariantDefined method. Note that not every kernel implements every variant. So this is a mechanism to account for what is being run for analysis proposes.

  • Class destructor, which must be provided to deallocate kernel state that is allocated in the constructor and which persists throughout the execution of the Suite. Note that in the case of the ADD kernel, the destructor is empty since no state is dynamically allocated in the constructor.

  • setUp method, which allocates and initializes data required for the kernel to execute and produce results.

  • tearDown method, which deallocates and resets any data that will be re-allocated and/or initialized in subsequent kernel executions.

    Note

    The tearDown method frees and/or resets all kernel data that is allocated and/or initialized in the setUp method.

  • updateChecksum method, which computes a checksum from the results of an execution of the kernel and adds it to the checksum value, which is a member of the KernelBase class, for the variant and tuning index that was run.

    Note

    The checksum must be computed in the same way for each variant of a kernel so that checksums for different variants can be compared to help identify differences, and potential errors in implementations, compiler optimizations, programming model execution, etc.

The setUp, tearDown, and updateChecksum methods are called each time a kernel variant is run. We allocate and deallocate data arrays in the setUp and tearDown methods to prevent any performance timing bias that may be introduced by artificially reusing data in cache, for example, when doing performance experiments. Also, note that the setUp and tearDown methods take a VariantID argument and pass it to data allocation, initialization, and deallocation methods so this data management can be done in a variant-specific manner as needed.

To simplify these operations and help ensure consistency, there exist utility methods to allocate, initialize, deallocate, and copy data, and compute checksums defined in the various data utils files in the common directory.

Kernel object construction

It is important to note that there will only be one instance of each kernel class created by the program. Thus, each kernel class constructor and destructor must only perform operations that are not specific to any kernel variant.

The Executor class in the common directory creates kernel objects, one for each kernel that will be run based on command-line input options. To ensure a new kernel object will be created properly, add a call to its class constructor based on its KernelID in the getKernelObject() method in the RAJAPerfSuite.cpp file. For example:

KernelBase* getKernelObject(KernelID kid,
                            const RunParams& run_params)
{
  KernelBase* kernel = 0;

  switch ( kid ) {

    ...

    case Stream_ADD : {
      kernel = new stream::ADD(run_params);
      break;
    }

    ...

  } // end switch on kernel id

  return kernel;
}

}

Kernel execution methods

In the discussion of the ADD Kernel Class Files, we noted that the class implementation involves multiple files containing variants for each execution back-end. In particular, these files contain implementations of the run methods declared in the ADD Kernel class header file to execute the variants.

Each method takes a variant ID argument that identifies the variant to run and a tuning index that identifies the tuning of the variant to run. Note that the tuning index can be ignored when there is only one tuning. Each method is responsible for multiple tasks which involve a combination of kernel and variant specific operations and calling kernel base class methods, such as:

  • Setting up and initializing data needed by a kernel variant before it is run

  • Starting an execution timer before a kernel is run

  • Running the proper number of kernel executions

  • Stopping the time after the kernel is run

  • Putting the class member data in an appropriate state to update a checksum

For example, here is the method to run sequential CPU variants of the ADD kernel in the ADD-Seq.cpp file:

void ADD::runSeqVariant(VariantID vid, size_t RAJAPERF_UNUSED_ARG(tune_idx))
{
  const Index_type run_reps = getRunReps();
  const Index_type ibegin = 0;
  const Index_type iend = getActualProblemSize();

  ADD_DATA_SETUP;

#if defined(RUN_RAJA_SEQ)
  auto add_lam = [=](Index_type i) {
                   ADD_BODY;
                 };
#endif

  switch ( vid ) {

    case Base_Seq : {

      startTimer();
      for (RepIndex_type irep = 0; irep < run_reps; ++irep) {

        for (Index_type i = ibegin; i < iend; ++i ) {
          ADD_BODY;
        }

      }
      stopTimer();

      break;
    }

#if defined(RUN_RAJA_SEQ)
    case Lambda_Seq : {

      startTimer();
      for (RepIndex_type irep = 0; irep < run_reps; ++irep) {

        for (Index_type i = ibegin; i < iend; ++i ) {
          add_lam(i);
        }

      }
      stopTimer();

      break;
    }

    case RAJA_Seq : {

      auto res{getHostResource()}; 

      startTimer();
      for (RepIndex_type irep = 0; irep < run_reps; ++irep) {

        RAJA::forall<RAJA::seq_exec>( res,
          RAJA::RangeSegment(ibegin, iend), add_lam);

      }
      stopTimer();

      break;
    }
#endif // RUN_RAJA_SEQ

    default : {
      getCout() << "\n  ADD : Unknown variant id = " << vid << std::endl;
    }

  }

}

A few details are worth noting:

  • Thee tuning index argument is ignored because there is only one tuning for the sequential kernel variants.

  • Execution parameters, such as kernel loop length and number of execution repetitions, are set by calling base class methods which return values based on kernel defaults and input parameters. This ensures that the execution will be consistent across run variants and results will be what is expected.

  • Simple switch-case statement logic is used to execute the proper variant based on the VariantID argument.

  • We guard sequential variants apart from the Base_Seq variant with the RUN_RAJA_SEQ macro. This ensures that the base sequential variant will always run to be used as a reference variant for execution timing. By default, we turn off the other sequential variants when we build an executable with OpenMP target offload enabled.

  • Macros defined in the ADD.hpp header file are used to reduce the amount of redundant code, such as for data initialization (ADD_DATA_SETUP) and the kernel body (ADD_BODY).

All kernel source files follow a similar organization and implementation pattern for each set of back-end execution variants. However, there are some important differences to note that we describe next in the discussion of the CUDA variant execution file.

The key contents related to execution of CUDA GPU variants of the ADD kernel in the ADD-Cuda.cpp file are:

#include "ADD.hpp"

#include "RAJA/RAJA.hpp"

#if defined(RAJA_ENABLE_CUDA)

#include "common/CudaDataUtils.hpp"

#include <iostream>

namespace rajaperf
{
namespace stream
{

template < size_t block_size >
__launch_bounds__(block_size)
__global__ void add(Real_ptr c, Real_ptr a, Real_ptr b,
                    Index_type iend)
{
  Index_type i = blockIdx.x * block_size + threadIdx.x;
  if (i < iend) {
    ADD_BODY;
  }
}


template < size_t block_size >
void ADD::runCudaVariantImpl(VariantID vid)
{
  const Index_type run_reps = getRunReps();
  const Index_type ibegin = 0;
  const Index_type iend = getActualProblemSize();

  auto res{getCudaResource()};

  ADD_DATA_SETUP;

  if ( vid == Base_CUDA ) {

    startTimer();
    for (RepIndex_type irep = 0; irep < run_reps; ++irep) {

      const size_t grid_size = RAJA_DIVIDE_CEILING_INT(iend, block_size);
      constexpr size_t shmem = 0;

      RPlaunchCudaKernel( (add<block_size>),
                          grid_size, block_size,
                          shmem, res.get_stream(),
                          c, a, b, iend );

    }
    stopTimer();

  } else if ( vid == Lambda_CUDA ) {

    startTimer();
    for (RepIndex_type irep = 0; irep < run_reps; ++irep) {

      auto add_lambda = [=] __device__ (Index_type i) {
        ADD_BODY;
      };

      const size_t grid_size = RAJA_DIVIDE_CEILING_INT(iend, block_size);
      constexpr size_t shmem = 0;

      RPlaunchCudaKernel( (lambda_cuda_forall<block_size, 
                                              decltype(add_lambda)>),
                          grid_size, block_size,
                          shmem, res.get_stream(),
                          ibegin, iend, add_lambda );

    }
    stopTimer();

  } else if ( vid == RAJA_CUDA ) {

    startTimer();
    for (RepIndex_type irep = 0; irep < run_reps; ++irep) {

      RAJA::forall< RAJA::cuda_exec<block_size, true /*async*/> >( res,
        RAJA::RangeSegment(ibegin, iend), [=] __device__ (Index_type i) {
        ADD_BODY;
      });

    }
    stopTimer();

  } else {
     getCout() << "\n  ADD : Unknown Cuda variant id = " << vid << std::endl;
  }
}

RAJAPERF_GPU_BLOCK_SIZE_TUNING_DEFINE_BOILERPLATE(ADD, Cuda)

} // end namespace stream
} // end namespace rajaperf

#endif  // RAJA_ENABLE_CUDA

Notable differences with the sequential variant file are:

  • Most of the file is guarded using the RAJA_ENABLE_CUDA macro.

    Note

    The contents of all non-sequential variant implementation files are guarded using the RAJA_ENABLE_<backend> macros.

  • In addition to using the ADD_DATA_SETUP macro, which is also used in the sequential variant implementation file discussed above, we define two other macros, ADD_DATA_SETUP_CUDA and ADD_DATA_TEARDOWN_CUDA. The first macro allocates GPU device data needed to run a kernel and initialize the data by copying host CPU data to it. After a kernel executes, the second macro copies data needed to compute a checksum to the host and then deallocates the device data.

  • A CUDA GPU kernel add is implemented for the Base_CUDA variant.

  • The method to exjcute the CUDA kernel variants ADD::runCudaVariantImpl is templated on a block_size parameter, which represents the tuning parameter, and is passes to the kernel lauch methods.

  • The RAJAPERF_GPU_BLOCK_SIZE_TUNING_DEFINE_BOILERPLATE macro is used (outside the method implementation, to generate different kernel tuning implementations at compile-time to run the GPU block_size versions specified via command-line input mentioned in Build and Install.

Important

Following the established implementation patterns for kernels in the Suite help to ensure that the code is consistent, understandable, easily maintained, and needs minimal documentation.