Kernel Class Implementation¶
Each kernel in the Suite follows a similar source file organization and implementation pattern for consistency and ease of analysis and understanding. Here, we describe important and conventions applies in each kernel class implementation that must be followed to ensure that all kernels integrate into the RAJA Performance Suite in the same way.
General class methods¶
Class methods that do not execute kernel variants and which are not specific to
any kernel variant implementation are defined in one implementation file. For
the ADD kernel that we are describing, this is the source file ADD.cpp,
which in its entirety is:
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
// Copyright (c) 2017-25, Lawrence Livermore National Security, LLC
// and RAJA Performance Suite project contributors.
// See the RAJAPerf/LICENSE file for details.
//
// SPDX-License-Identifier: (BSD-3-Clause)
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
#include "ADD.hpp"
#include "RAJA/RAJA.hpp"
#include "common/DataUtils.hpp"
namespace rajaperf
{
namespace stream
{
ADD::ADD(const RunParams& params)
: KernelBase(rajaperf::Stream_ADD, params)
{
setDefaultProblemSize(1000000);
setDefaultReps(1000);
setActualProblemSize( getTargetProblemSize() );
setItsPerRep( getActualProblemSize() );
setKernelsPerRep(1);
setBytesReadPerRep( 2*sizeof(Real_type) * getActualProblemSize() );
setBytesWrittenPerRep( 1*sizeof(Real_type ) * getActualProblemSize() );
setBytesAtomicModifyWrittenPerRep( 0 );
setFLOPsPerRep(1 * getActualProblemSize());
setComplexity(Complexity::N);
setUsesFeature(Forall);
setVariantDefined( Base_Seq );
setVariantDefined( Lambda_Seq );
setVariantDefined( RAJA_Seq );
setVariantDefined( Base_OpenMP );
setVariantDefined( Lambda_OpenMP );
setVariantDefined( RAJA_OpenMP );
setVariantDefined( Base_OpenMPTarget );
setVariantDefined( RAJA_OpenMPTarget );
setVariantDefined( Base_CUDA );
setVariantDefined( Lambda_CUDA );
setVariantDefined( RAJA_CUDA );
setVariantDefined( Base_HIP );
setVariantDefined( Lambda_HIP );
setVariantDefined( RAJA_HIP );
setVariantDefined( Base_SYCL );
setVariantDefined( RAJA_SYCL );
setVariantDefined( Kokkos_Lambda );
}
ADD::~ADD()
{
}
void ADD::setUp(VariantID vid, size_t RAJAPERF_UNUSED_ARG(tune_idx))
{
allocAndInitData(m_a, getActualProblemSize(), vid);
allocAndInitData(m_b, getActualProblemSize(), vid);
allocAndInitDataConst(m_c, getActualProblemSize(), 0.0, vid);
}
void ADD::updateChecksum(VariantID vid, size_t tune_idx)
{
checksum[vid][tune_idx] += calcChecksum(m_c, getActualProblemSize(), vid);
}
void ADD::tearDown(VariantID vid, size_t RAJAPERF_UNUSED_ARG(tune_idx))
{
(void) vid;
deallocData(m_a, vid);
deallocData(m_b, vid);
deallocData(m_c, vid);
}
} // end namespace stream
} // end namespace rajaperf
The methods in the source file are:
Class constructor, which calls the
KernelBaseclass constructor passing theKernelIDand theRunParamsobject, which are used to initialize the base class. The constructor calls other base class methods to set information about the kernel, which is specific to the kernel. Such information includes:
Default problem size and number of kernel repetitions to generate execution run time.
The actual problem size that will be run, which is a function of the default size and command-line input.
The number of loop iterations that are performed and the number of loop kernels that run each time the kernel is executed. Note that the ADD kernel is based on a simple, single for-loop. However, other kernels in the Suite execute multiple loop kernels.
The number of bytes read and written and the number of FLOPS performed for each kernel execution.
Which RAJA features the kernel exercises.
Which Suite variants are defined, or implemented for the kernel. Each variant requires a call to the
setVariantDefinedmethod. Note that not every kernel implements every variant. So this is a mechanism to account for what is being run for analysis proposes.Class destructor, which must be provided to deallocate kernel state that is allocated in the constructor and which persists throughout the execution of the Suite. Note that in the case of the ADD kernel, the destructor is empty since no state is dynamically allocated in the constructor.
setUpmethod, which allocates and initializes data required for the kernel to execute and produce results.
tearDownmethod, which deallocates and resets any data that will be re-allocated and/or initialized in subsequent kernel executions.Note
The
tearDownmethod frees and/or resets all kernel data that is allocated and/or initialized in thesetUpmethod.
updateChecksummethod, which computes a checksum from the results of an execution of the kernel and adds it to the checksum value, which is a member of theKernelBaseclass, for the variant and tuning index that was run.Note
The checksum must be computed in the same way for each variant of a kernel so that checksums for different variants can be compared to help identify differences, and potential errors in implementations, compiler optimizations, programming model execution, etc.
The setUp, tearDown, and updateChecksum methods are
called each time a kernel variant is run. We allocate and deallocate
data arrays in the setUp and tearDown methods to prevent any
performance timing bias that may be introduced by artificially reusing data
in cache, for example, when doing performance experiments. Also, note that
the setUp and tearDown methods take a VariantID argument and pass
it to data allocation, initialization, and deallocation methods so
this data management can be done in a variant-specific manner as needed.
To simplify these operations and help ensure consistency, there exist utility
methods to allocate, initialize, deallocate, and copy data, and compute
checksums defined in the various data utils files in the common
directory.
Kernel object construction¶
It is important to note that there will only be one instance of each kernel class created by the program. Thus, each kernel class constructor and destructor must only perform operations that are not specific to any kernel variant.
The Executor class in the common directory creates kernel objects,
one for each kernel that will be run based on command-line input options. To
ensure a new kernel object will be created properly, add a call to its class
constructor based on its KernelID in the getKernelObject() method in
the RAJAPerfSuite.cpp file. For example:
KernelBase* getKernelObject(KernelID kid,
const RunParams& run_params)
{
KernelBase* kernel = 0;
switch ( kid ) {
...
case Stream_ADD : {
kernel = new stream::ADD(run_params);
break;
}
...
} // end switch on kernel id
return kernel;
}
}
Kernel execution methods¶
In the discussion of the ADD Kernel Class Files, we noted that the class implementation involves multiple files containing variants for each execution back-end. In particular, these files contain implementations of the run methods declared in the ADD Kernel class header file to execute the variants.
Each method takes a variant ID argument that identifies the variant to run and a tuning index that identifies the tuning of the variant to run. Note that the tuning index can be ignored when there is only one tuning. Each method is responsible for multiple tasks which involve a combination of kernel and variant specific operations and calling kernel base class methods, such as:
Setting up and initializing data needed by a kernel variant before it is run
Starting an execution timer before a kernel is run
Running the proper number of kernel executions
Stopping the time after the kernel is run
Putting the class member data in an appropriate state to update a checksum
For example, here is the method to run sequential CPU variants of the ADD
kernel in the ADD-Seq.cpp file:
void ADD::runSeqVariant(VariantID vid, size_t RAJAPERF_UNUSED_ARG(tune_idx))
{
const Index_type run_reps = getRunReps();
const Index_type ibegin = 0;
const Index_type iend = getActualProblemSize();
ADD_DATA_SETUP;
#if defined(RUN_RAJA_SEQ)
auto add_lam = [=](Index_type i) {
ADD_BODY;
};
#endif
switch ( vid ) {
case Base_Seq : {
startTimer();
for (RepIndex_type irep = 0; irep < run_reps; ++irep) {
for (Index_type i = ibegin; i < iend; ++i ) {
ADD_BODY;
}
}
stopTimer();
break;
}
#if defined(RUN_RAJA_SEQ)
case Lambda_Seq : {
startTimer();
for (RepIndex_type irep = 0; irep < run_reps; ++irep) {
for (Index_type i = ibegin; i < iend; ++i ) {
add_lam(i);
}
}
stopTimer();
break;
}
case RAJA_Seq : {
auto res{getHostResource()};
startTimer();
for (RepIndex_type irep = 0; irep < run_reps; ++irep) {
RAJA::forall<RAJA::seq_exec>( res,
RAJA::RangeSegment(ibegin, iend), add_lam);
}
stopTimer();
break;
}
#endif // RUN_RAJA_SEQ
default : {
getCout() << "\n ADD : Unknown variant id = " << vid << std::endl;
}
}
}
A few details are worth noting:
Thee tuning index argument is ignored because there is only one tuning for the sequential kernel variants.
Execution parameters, such as kernel loop length and number of execution repetitions, are set by calling base class methods which return values based on kernel defaults and input parameters. This ensures that the execution will be consistent across run variants and results will be what is expected.
Simple switch-case statement logic is used to execute the proper variant based on the
VariantIDargument.We guard sequential variants apart from the
Base_Seqvariant with theRUN_RAJA_SEQmacro. This ensures that the base sequential variant will always run to be used as a reference variant for execution timing. By default, we turn off the other sequential variants when we build an executable with OpenMP target offload enabled.Macros defined in the
ADD.hppheader file are used to reduce the amount of redundant code, such as for data initialization (ADD_DATA_SETUP) and the kernel body (ADD_BODY).
All kernel source files follow a similar organization and implementation pattern for each set of back-end execution variants. However, there are some important differences to note that we describe next in the discussion of the CUDA variant execution file.
The key contents related to execution of CUDA GPU variants of the ADD
kernel in the ADD-Cuda.cpp file are:
#include "ADD.hpp"
#include "RAJA/RAJA.hpp"
#if defined(RAJA_ENABLE_CUDA)
#include "common/CudaDataUtils.hpp"
#include <iostream>
namespace rajaperf
{
namespace stream
{
template < size_t block_size >
__launch_bounds__(block_size)
__global__ void add(Real_ptr c, Real_ptr a, Real_ptr b,
Index_type iend)
{
Index_type i = blockIdx.x * block_size + threadIdx.x;
if (i < iend) {
ADD_BODY;
}
}
template < size_t block_size >
void ADD::runCudaVariantImpl(VariantID vid)
{
const Index_type run_reps = getRunReps();
const Index_type ibegin = 0;
const Index_type iend = getActualProblemSize();
auto res{getCudaResource()};
ADD_DATA_SETUP;
if ( vid == Base_CUDA ) {
startTimer();
for (RepIndex_type irep = 0; irep < run_reps; ++irep) {
const size_t grid_size = RAJA_DIVIDE_CEILING_INT(iend, block_size);
constexpr size_t shmem = 0;
RPlaunchCudaKernel( (add<block_size>),
grid_size, block_size,
shmem, res.get_stream(),
c, a, b, iend );
}
stopTimer();
} else if ( vid == Lambda_CUDA ) {
startTimer();
for (RepIndex_type irep = 0; irep < run_reps; ++irep) {
auto add_lambda = [=] __device__ (Index_type i) {
ADD_BODY;
};
const size_t grid_size = RAJA_DIVIDE_CEILING_INT(iend, block_size);
constexpr size_t shmem = 0;
RPlaunchCudaKernel( (lambda_cuda_forall<block_size,
decltype(add_lambda)>),
grid_size, block_size,
shmem, res.get_stream(),
ibegin, iend, add_lambda );
}
stopTimer();
} else if ( vid == RAJA_CUDA ) {
startTimer();
for (RepIndex_type irep = 0; irep < run_reps; ++irep) {
RAJA::forall< RAJA::cuda_exec<block_size, true /*async*/> >( res,
RAJA::RangeSegment(ibegin, iend), [=] __device__ (Index_type i) {
ADD_BODY;
});
}
stopTimer();
} else {
getCout() << "\n ADD : Unknown Cuda variant id = " << vid << std::endl;
}
}
RAJAPERF_GPU_BLOCK_SIZE_TUNING_DEFINE_BOILERPLATE(ADD, Cuda)
} // end namespace stream
} // end namespace rajaperf
#endif // RAJA_ENABLE_CUDA
Notable differences with the sequential variant file are:
Most of the file is guarded using the
RAJA_ENABLE_CUDAmacro.Note
The contents of all non-sequential variant implementation files are guarded using the
RAJA_ENABLE_<backend>macros.In addition to using the
ADD_DATA_SETUPmacro, which is also used in the sequential variant implementation file discussed above, we define two other macros,ADD_DATA_SETUP_CUDAandADD_DATA_TEARDOWN_CUDA. The first macro allocates GPU device data needed to run a kernel and initialize the data by copying host CPU data to it. After a kernel executes, the second macro copies data needed to compute a checksum to the host and then deallocates the device data.A CUDA GPU kernel
addis implemented for theBase_CUDAvariant.The method to exjcute the CUDA kernel variants
ADD::runCudaVariantImplis templated on ablock_sizeparameter, which represents the tuning parameter, and is passes to the kernel lauch methods.The
RAJAPERF_GPU_BLOCK_SIZE_TUNING_DEFINE_BOILERPLATEmacro is used (outside the method implementation, to generate different kernel tuning implementations at compile-time to run the GPUblock_sizeversions specified via command-line input mentioned in Build and Install.
Important
Following the established implementation patterns for kernels in the Suite help to ensure that the code is consistent, understandable, easily maintained, and needs minimal documentation.