Kernel Class Implementation¶
Each kernel in the Suite follows a similar source file organization and implementation pattern for consistency and ease of analysis and understanding. Here, we describe important conventions that apply in each kernel class implementation that must be followed to ensure that all kernels look the same and integrate into the RAJA Performance Suite in the same way.
General class methods¶
Class methods that do not execute kernel variants and which are not specific to
any kernel variant implementation are defined in one implementation file, for
example the source file ADD.cpp for the ADD kernel that we are
describing. The file contents in their entirety are:
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
// Copyright (c) Lawrence Livermore National Security, LLC and other
// RAJA Project Developers. See top-level LICENSE and COPYRIGHT
// files for dates and other details. No copyright assignment is required
// to contribute to RAJA Performance Suite.
//
// SPDX-License-Identifier: (BSD-3-Clause)
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
#include "ADD.hpp"
#include "RAJA/RAJA.hpp"
#include "common/DataUtils.hpp"
namespace rajaperf
{
namespace stream
{
ADD::ADD(const RunParams& params)
: KernelBase(rajaperf::Stream_ADD, params)
{
setDefaultProblemSize(1000000);
setDefaultReps(1000);
setSize(params.getTargetSize(getDefaultProblemSize()),
params.getReps(getDefaultReps()));
setChecksumConsistency(ChecksumConsistency::ConsistentPerVariantTuning);
setChecksumTolerance(ChecksumTolerance::tight);
setComplexity(Complexity::N);
setMaxPerfectLoopDimensions(1);
setProblemDimensionality(1);
setUsesFeature(Forall);
addVariantTunings();
}
void ADD::setSize(Index_type target_size, Index_type target_reps)
{
setActualProblemSize( target_size );
setRunReps( target_reps );
setItsPerRep( getActualProblemSize() );
setKernelsPerRep(1);
setBytesAllocatedPerRep( 3*sizeof(Real_type) * getActualProblemSize() ); // a, b, c
setBytesReadPerRep( 2*sizeof(Real_type) * getActualProblemSize() ); // a, b
setBytesWrittenPerRep( 1*sizeof(Real_type) * getActualProblemSize() ); // c
setBytesModifyWrittenPerRep( 0 );
setBytesAtomicModifyWrittenPerRep( 0 );
setFLOPsPerRep(1 * getActualProblemSize());
}
ADD::~ADD()
{
}
void ADD::setUp(VariantID vid, size_t RAJAPERF_UNUSED_ARG(tune_idx))
{
allocAndInitData(m_a, getActualProblemSize(), vid);
allocAndInitData(m_b, getActualProblemSize(), vid);
allocAndInitDataConst(m_c, getActualProblemSize(), 0.0, vid);
}
void ADD::updateChecksum(VariantID vid, size_t RAJAPERF_UNUSED_ARG(tune_idx))
{
addToChecksum(m_c, getActualProblemSize(), vid);
}
void ADD::tearDown(VariantID vid, size_t RAJAPERF_UNUSED_ARG(tune_idx))
{
deallocData(m_a, vid);
deallocData(m_b, vid);
deallocData(m_c, vid);
}
} // end namespace stream
} // end namespace rajaperf
The methods in the source file are:
Class constructor, which calls the
KernelBaseclass constructor passing theKernelIDand theRunParamsobject, which are used to initialize the base class. The constructor calls other base class methods to set information about the kernel, which is specific to the kernel. Such information includes:
Default problem size and number of kernel repetitions to generate execution run time.
The actual problem size that will be run, which is a function of the default size and command-line input for target problem size.
The number of loop iterations that are executed and the number of loop kernels that run each time the kernel is executed. Note that the ADD kernel contains a single for-loop so the number of iterations is the problem size and the number of loop kernels is one. Other kernels in the Suite may execute multiple loop kernels with different sizes, so these methods are used to describe this.
The number of bytes allocated for each kernel execution.
The number of bytes read for each kernel execution.
The number of bytes written for each kernel execution.
The number of bytes read, modified, and written for each kernel execution.
The number of bytes atomically read, modified, and written for each kernel execution.
The number of floating point operations (FLOPS) performed for each kernel execution.
The consistency of the checksums of the kernel. If the kernel always produces the same checksum value for all variant tunings then the checksums are
Consistent. Most kernels get a different but consistent checksum for each variant tuning so the checksums areConsistentPerVariantTuning. On the other hand, some kernels have variant tunings that get different checksums on each run of that variant tuning, for example due to the ordering of floating-point atomic add operations, so the checksums areInconsistent.The tolerance of the checksums of the kernel. A number of predefined values are available in the
KernelBase\:\:ChecksumToleranceclass. If the kernel consistently produces the same checksums thenzerotolerance is used. Most kernels use thenormaltolerance. Some kernels are very simple, for example they have a single floating-point operation per iteration, so they use thetighttolerance.The scale factor to use with the checksums of the kernel. This is an arbitrary multiplier on the checksum values used to scale the checksums to a desired range. Mostly used for kernels with floating-point operation complexity that does not scale linearly with problem size.
The operational complexity of the kernel, where N is the problem size of the kernel.
The number of levels in the largest perfectly nested loop. This only counts parallelized dimensions and ignores inner or outer sequential loops. For example the GEMM kernel has 2 perfectly nested loop levels as the inner loop is implemented sequentially to perform a reduction.
The dimensionality of the problem domain, regardless of physical data layout. For example, the LTIMES kernel has a problem dimensionality of 3, because phi (g, m, and z) and psi (g, d, and z) are indexed over 3 dimensions.
Which RAJA features the kernel exercises.
Adding Suite variants and tunings via
addVariantTunings. This calls the variousdefine*VariantTuningsmethods that are defined in the source file where the variants and tunings are implemented. Note that not every kernel implements every variant, soKernelBaseprovides a “default” implementation that defines no variants or tunings.
- ..note:: The byte counters are intended to count traffic to and from main
memory like DRAM or HBM under idealized conditions with perfect caching. They are not intended to count the total number of bytes requested by load and store instructions. So, even if a memory address is read in multiple different iterations of a loop with a stencil access pattern it is only counted once in bytes read. However caching is not assumed between loops/kernel launches so an address is counted once for each separate loop or kernel launch.
- ..note:: To simplify counting each address accessed should only be counted
in one of the byte counter attributes. For example an address that is read and written is counted in the “read, modified, and written” counter, but not in the “read” or “written” counters. The final output however does add the “read” and “read, modified, and written” counters when showing the bytes read.
- ..note:: Available variant tunings for each kernel are specified using a
...BOILERPLATE...macro invocation in each kernel variant source file. This is discussed in Kernel execution methods.Class destructor, which must be provided to deallocate kernel state that is allocated in the constructor and which persists throughout the execution of the Suite. Note that in the case of the ADD kernel, the destructor is empty since no state is dynamically allocated in the constructor.
setUpmethod, which allocates and initializes data required for the kernel to execute and produce results. This method is called before each kernel variant is run.
tearDownmethod, which deallocates and resets any data that will be re-allocated and/or initialized in subsequent kernel executions.Note
The
tearDownmethod frees and/or resets all kernel data that is allocated and/or initialized in thesetUpmethod.
updateChecksummethod, which computes a checksum from the results of an execution of the kernel and adds it to the checksum value, which is a member of theKernelBaseclass, for the variant and tuning index that was run.Note
Kernel checksum values are used to determine whether kernels variants are producing the same results. Thus, the checksum must be computed in the same way for each variant of a kernel so that checksums for different variants can be compared to help identify differences, and potential errors in implementations, compiler optimizations, programming model execution, etc.
The setUp, tearDown, and updateChecksum methods are
called each time a kernel variant is run. We allocate and deallocate
data arrays in the setUp and tearDown methods to prevent any
performance timing bias that may be introduced by artificially reusing data
in cache, for example, when doing performance experiments. Also, note that
the setUp and tearDown methods take a VariantID argument and pass
it to data allocation, initialization, and deallocation methods so
this data management can be done in a variant-specific manner as needed.
Data Utility Methods¶
To simplify these operations and help ensure consistency, there exist utility
methods to allocate, initialize, deallocate, and copy data, and compute
checksums defined in the various data utils files in the common
directory.
When calculating checksums use the addToChecksum methods. Individual numbers
are added directly to the overall checksum. Arrays of numbers are checksummed to
an intermediate checksum value via a function in the data utils file discussed
below and then the intermediate checksum value is added into the overall
checksum. Checksums are calculated via a Kahan sum to improve accuracy.
This function transforms each number in the array before adding the number to its checksum. The function converts the number into the checksum type, takes the absolute value of the result, and multiplies the result by a value that differs for each member of the array and depends on the sign of the number. This procedure creates different checksums for permutations of the same numbers and numbers with opposite signs. See the implementation below for details.
/*
* Calculate a different multiplier for each index.
* The multiplier is in the range [0.5, 1.5]
*/
Checksum_type calcMultiplier(double index, double offset)
{
static constexpr double pi_inv = 0.3183098861837906715377675267450287240689L;
static constexpr double half = 0.5;
double val = (index + offset) * pi_inv;
double mult = val - std::floor(val) + half;
return mult;
}
/*
* Calculate and return checksum for data arrays.
*/
template < typename Data_getter >
Checksum_type calcChecksumImpl(Data_getter data, Size_type len)
{
static constexpr Checksum_type zero = 0.0;
RAJA::KahanSum<Checksum_type> chk(0.0);
for (Size_type j = 0; j < len; ++j) {
Checksum_type val = data(j);
chk += calcMultiplier(j, (val >= zero) ? 1.0 : 0.5) * std::abs(val);
}
return chk.get();
}
Kernel object construction¶
It is important to note that there will only be one instance of each kernel class created by the program. Thus, each kernel class constructor and destructor must only perform operations that are not specific to any kernel variant.
The Executor class in the common directory creates kernel objects,
one for each kernel that will be run based on command-line input options. To
ensure a new kernel object will be created properly, make sure to add a call
to its class constructor in the switch case section for its KernelID in the
getKernelObject() method in the RAJAPerfSuite.cpp file. For example:
KernelBase* getKernelObject(KernelID kid,
const RunParams& run_params)
{
KernelBase* kernel = 0;
switch ( kid ) {
...
case Stream_ADD : {
kernel = new stream::ADD(run_params);
break;
}
...
} // end switch on kernel id
return kernel;
}
}
Kernel execution methods¶
In the discussion of the ADD Kernel Class Files, we noted that the class implementation involves multiple files containing variants for each execution back-end. In particular, these files contain implementations of the run methods declared in the ADD Kernel class header file to execute the variants.
By convention each of the run methods takes a variant ID argument that identifies the variant to run. Some kernels have multiple run methods for different tunings of some variants. Each method is responsible for multiple tasks which involve a combination of kernel and variant specific operations and calling kernel base class methods, such as:
Setting up and initializing data needed by a kernel variant before it is run
Starting an execution timer before a kernel is run
Running the proper number of kernel executions
Stopping the time after the kernel is run
Putting the class member data in an appropriate state to update a checksum
For example, here is the method to run sequential CPU variants of the ADD
kernel in the ADD-Seq.cpp file:
void ADD::runSeqVariant(VariantID vid)
{
const Index_type run_reps = getRunReps();
const Index_type ibegin = 0;
const Index_type iend = getActualProblemSize();
ADD_DATA_SETUP;
#if defined(RUN_RAJA_SEQ)
auto add_lam = [=](Index_type i) {
ADD_BODY;
};
#endif
switch ( vid ) {
case Base_Seq : {
startTimer();
// Loop counter increment uses macro to quiet C++20 compiler warning
for (RepIndex_type irep = 0; irep < run_reps; RP_REPCOUNTINC(irep)) {
for (Index_type i = ibegin; i < iend; ++i ) {
ADD_BODY;
}
}
stopTimer();
break;
}
#if defined(RUN_RAJA_SEQ)
case Lambda_Seq : {
startTimer();
// Loop counter increment uses macro to quiet C++20 compiler warning
for (RepIndex_type irep = 0; irep < run_reps; RP_REPCOUNTINC(irep)) {
for (Index_type i = ibegin; i < iend; ++i ) {
add_lam(i);
}
}
stopTimer();
break;
}
case RAJA_Seq : {
auto res{getHostResource()};
startTimer();
// Loop counter increment uses macro to quiet C++20 compiler warning
for (RepIndex_type irep = 0; irep < run_reps; RP_REPCOUNTINC(irep)) {
RAJA::forall<RAJA::seq_exec>( res,
RAJA::RangeSegment(ibegin, iend), add_lam);
}
stopTimer();
break;
}
#endif // RUN_RAJA_SEQ
default : {
getCout() << "\n ADD : Unknown variant id = " << vid << std::endl;
}
}
}
RAJAPERF_DEFAULT_TUNING_DEFINE_BOILERPLATE(ADD, Seq, Base_Seq, Lambda_Seq, RAJA_Seq)
A few details are worth noting:
Execution parameters, such as kernel loop length and number of execution repetitions, are set by calling base class methods which return values based on kernel defaults and input parameters. This ensures that the execution will be consistent across run variants and results will be what is expected.
Simple switch-case statement logic is used to execute the proper variant based on the
VariantIDargument.We guard sequential variants apart from the
Base_Seqvariant with theRUN_RAJA_SEQmacro. This ensures that the base sequential variant will always run to be used as a reference variant for execution timing. By default, we turn off the other sequential variants when we build an executable with OpenMP target offload enabled.Macros defined in the
ADD.hppheader file are used to reduce the amount of redundant code, such as for data initialization (ADD_DATA_SETUP) and the kernel body (ADD_BODY).The
RAJAPERF_DEFAULT_TUNING_DEFINE_BOILERPLATEmacro appears in the file outside of the class implementation to define the variants and “default” tunings for theADDkernel for theSeqback-end. The macro defines thedefineSeqVariantTuningsmethod for this kernel and assumes the existence ofrunSeqVariant. Note that it adds a “default” tuning for each of the variants listed in the final arguments to the macro.Each kernel variant execution contains a call to
startTimer()andendTimer()methods which are used to gather execution timing information. Between these method calls is a loop over kernel repetitions. Each kernel defines a default number of run repetitions to record a sufficient amount of execution time to reduce execution timing noise. The number of repetitions can be controlled via command-line arguments.Note
The kernel repetition loop counter increment is done using the
macro
RP_REPCOUNTINCfor consistency and to reduce code verbosity. The loop counter variableirepis declaredvolatileto prevent compilers from optimizing certain loop execution operations. The syntax inside the macro definition quiets compiler warnings when the code is compiled using the C++20 standard.
All kernel source files follow a similar organization and implementation pattern for each set of back-end execution variants. However, there are some important differences to note that we describe next in the discussion of the CUDA variant execution file.
The key contents related to execution of CUDA GPU variants of the ADD
kernel in the ADD-Cuda.cpp file are:
#include "ADD.hpp"
#include "RAJA/RAJA.hpp"
#if defined(RAJA_ENABLE_CUDA)
#include "common/CudaDataUtils.hpp"
#include <iostream>
namespace rajaperf
{
namespace stream
{
template < size_t block_size >
__launch_bounds__(block_size)
__global__ void add(Real_ptr c, Real_ptr a, Real_ptr b,
Index_type iend)
{
Index_type i = blockIdx.x * block_size + threadIdx.x;
if (i < iend) {
ADD_BODY;
}
}
template < size_t block_size >
void ADD::runCudaVariantImpl(VariantID vid)
{
setBlockSize(block_size);
const Index_type run_reps = getRunReps();
const Index_type ibegin = 0;
const Index_type iend = getActualProblemSize();
auto res{getCudaResource()};
ADD_DATA_SETUP;
if ( vid == Base_CUDA ) {
startTimer();
// Loop counter increment uses macro to quiet C++20 compiler warning
for (RepIndex_type irep = 0; irep < run_reps; RP_REPCOUNTINC(irep)) {
const size_t grid_size = RAJA_DIVIDE_CEILING_INT(iend, block_size);
constexpr size_t shmem = 0;
RPlaunchCudaKernel( (add<block_size>),
grid_size, block_size,
shmem, res.get_stream(),
c, a, b, iend );
}
stopTimer();
} else if ( vid == Lambda_CUDA ) {
startTimer();
// Loop counter increment uses macro to quiet C++20 compiler warning
for (RepIndex_type irep = 0; irep < run_reps; RP_REPCOUNTINC(irep)) {
auto add_lambda = [=] __device__ (Index_type i) {
ADD_BODY;
};
const size_t grid_size = RAJA_DIVIDE_CEILING_INT(iend, block_size);
constexpr size_t shmem = 0;
RPlaunchCudaKernel( (lambda_cuda_forall<block_size,
decltype(add_lambda)>),
grid_size, block_size,
shmem, res.get_stream(),
ibegin, iend, add_lambda );
}
stopTimer();
} else if ( vid == RAJA_CUDA ) {
startTimer();
// Loop counter increment uses macro to quiet C++20 compiler warning
for (RepIndex_type irep = 0; irep < run_reps; RP_REPCOUNTINC(irep)) {
RAJA::forall< RAJA::cuda_exec<block_size, true /*async*/> >( res,
RAJA::RangeSegment(ibegin, iend), [=] __device__ (Index_type i) {
ADD_BODY;
});
}
stopTimer();
} else {
getCout() << "\n ADD : Unknown Cuda variant id = " << vid << std::endl;
}
}
RAJAPERF_GPU_BLOCK_SIZE_TUNING_DEFINE_BOILERPLATE(ADD, Cuda, Base_CUDA, Lambda_CUDA, RAJA_CUDA)
} // end namespace stream
} // end namespace rajaperf
#endif // RAJA_ENABLE_CUDA
Notable differences with the sequential variant file are:
Most of the file is guarded using the
RAJA_ENABLE_CUDAmacro.Note
The contents of all non-sequential variant implementation files are guarded using the
RAJA_ENABLE_<back-end>macros.A CUDA GPU kernel
addis implemented for theBase_CUDAvariant.The method to execute the CUDA kernel variants
ADD::runCudaVariantImplis templated on ablock_sizeparameter, which represents the tuning parameter, and is passed to the kernel launch methods. ThesetBlockSizefunction is called to provide this block_size to caliper.The
RAJAPERF_GPU_BLOCK_SIZE_TUNING_DEFINE_BOILERPLATEmacro is used outside the method implementation, to define “block_size” tunings for theADDkernel for theCudaback-end. The macro defines thedefineCudaVariantTuningsmethod for this kernel and assumes the existence ofrunCudaVariantImpl<block_size>andgpu_block_sizes_type. Note that this sets the kernel up to run with the GPUblock_sizeinformation specified via command-line input mentioned in Build and Install.Finally, the method
RPLaunchCudaKernelmethod is used to launch the non-RAJA variants so that they use the same launch mechanics for GPU kernels used inside RAJA for a fair timing comparison. A similar methodRPLaunchHipKernelis used for HIP back-end kernel variants.
Important
Following the established implementation patterns for kernels in the Suite help to ensure that the code is consistent, understandable, easily maintained, and needs minimal specific documentation for any individual kernel or variant.
Kernel tuning definition methods¶
Following on from the previous section, the ADD Kernel Class Files back-end files also contain define methods which add tuning definitions for the variants of the kernel in the same back-end file.
By convention each of the define methods takes no arguments. Each kernel must
provide such a define method for each of the back-ends that it implements.
This method defines the tunings for the variants of the back-end by calling
the addVariantTuning kernel base class method for each variant and tuning.
For example, here is the method to define CUDA GPU variants of the MEMSET
kernel in the MEMSET-Cuda.cpp file:
A few details are worth noting:
The loop over variants and variant conditionals helps define all of the relevant tunings for each of the variants.
The
seq_forfunction overgpu_block_sizes_type{}passes a compile time integer constant type intoblock_sizeso it may be used as a template argument.The
addVariantTuningmethod is called with a unique name for each tuning. Note that the same tuning may appear for multiple variants.