.. ##
.. ## Copyright (c) 2017-25, Lawrence Livermore National Security, LLC
.. ## and RAJA Performance Suite project contributors.
.. ## See the RAJAPerf/LICENSE file for details.
.. ##
.. ## SPDX-License-Identifier: (BSD-3-Clause)
.. ##

.. _kernel_class_impl-label:

=============================
Kernel Class Implementation
=============================

Each kernel in the Suite follows a similar source file organization and
implementation pattern for consistency and ease of analysis and understanding.
Here, we describe important conventions applied in each kernel class
implementation that must be followed to ensure that all kernels integrate into
the RAJA Performance Suite in the same way.

.. _kernel_class_impl_gen-label:

----------------------
General class methods
----------------------

Class methods that do not execute kernel variants and which are not specific to
any kernel variant implementation are defined in one implementation file. For
the **ADD** kernel that we are describing, this is the source file ``ADD.cpp``,
which in its entirety is:

.. literalinclude:: ../../../src/stream/ADD.cpp
   :language: C++

The methods in the source file are:

* **Class constructor**, which calls the ``KernelBase`` class constructor
  passing the ``KernelID`` and the ``RunParams`` object, which are used to
  initialize the base class. The constructor calls other base class methods to
  set kernel-specific information. Such information includes:

  * Default problem size and number of kernel repetitions to generate
    execution run time.
  * The actual problem size that will be run, which is a function of the
    default size and command-line input.
  * The number of *loop iterations* that are performed and the number of
    loop kernels that run each time the kernel is executed. Note that the
    **ADD** kernel is based on a simple, single for-loop. However, other
    kernels in the Suite execute multiple loop kernels.
  * The number of bytes read and written and the number of FLOPS performed
    for each kernel execution.
  * Which RAJA features the kernel exercises.
  * Which Suite variants are defined, or implemented, for the kernel. Each
    variant requires a call to the ``setVariantDefined`` method. Note that
    not every kernel implements every variant, so this is a mechanism to
    account for what is being run for analysis purposes.

* **Class destructor**, which must be provided to deallocate kernel state
  that is allocated in the constructor and which persists throughout the
  execution of the Suite. Note that in the case of the **ADD** kernel, the
  destructor is empty since no state is dynamically allocated in the
  constructor.

* ``setUp`` method, which allocates and initializes data required for the
  kernel to execute and produce results.

* ``tearDown`` method, which deallocates and resets any data that will be
  re-allocated and/or initialized in subsequent kernel executions.

  .. note:: The ``tearDown`` method frees and/or resets all kernel data that
            is allocated and/or initialized in the ``setUp`` method.

* ``updateChecksum`` method, which computes a checksum from the results of an
  execution of the kernel and adds it to the checksum value, which is a member
  of the ``KernelBase`` class, for the variant and tuning index that was run.

  .. note:: The checksum must be computed in the same way for each variant of
            a kernel so that checksums for different variants can be compared
            to help identify differences, and potential errors, in
            implementations, compiler optimizations, programming model
            execution, etc.

The ``setUp``, ``tearDown``, and ``updateChecksum`` methods are called
**each time a kernel variant is run**. We allocate and deallocate data arrays
in the ``setUp`` and ``tearDown`` methods to prevent any performance timing
bias that may be introduced, for example, by artificially reusing data in
cache when doing performance experiments.
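
The per-run lifecycle described above can be illustrated with a small,
self-contained sketch. Everything in it is hypothetical: ``ToyAdd``,
``ToyVariant``, and ``runVariant`` are illustration-only names, not the
Suite's ``KernelBase`` API, and the call order shown (set up, run, update
checksum, tear down) is an assumption rather than something taken from the
Suite source. The point is only that data is freshly allocated for each run
and that the checksum formula is identical for every variant.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a variant ID; NOT the Suite's VariantID enum.
enum class ToyVariant { Base = 0, Lambda = 1 };

// Hypothetical toy kernel; NOT derived from the Suite's KernelBase class.
class ToyAdd {
public:
  explicit ToyAdd(std::size_t n) : m_n(n) {}

  // Allocate and initialize fresh data before every variant run.
  void setUp() {
    m_a.assign(m_n, 1.0);
    m_b.assign(m_n, 2.0);
    m_c.assign(m_n, 0.0);
  }

  // Compute c[i] = a[i] + b[i] using the requested variant.
  void run(ToyVariant vid) {
    switch (vid) {
      case ToyVariant::Base: {
        for (std::size_t i = 0; i < m_n; ++i) { m_c[i] = m_a[i] + m_b[i]; }
        break;
      }
      case ToyVariant::Lambda: {
        auto body = [this](std::size_t i) { m_c[i] = m_a[i] + m_b[i]; };
        for (std::size_t i = 0; i < m_n; ++i) { body(i); }
        break;
      }
    }
  }

  // Accumulate the checksum with the same formula for every variant.
  void updateChecksum(ToyVariant vid) {
    assert(m_c.size() == m_n);  // results must exist, so setUp ran first
    double sum = 0.0;
    for (double v : m_c) { sum += v; }
    m_checksum[static_cast<std::size_t>(vid)] += sum;
  }

  // Release everything that setUp allocated.
  void tearDown() {
    std::vector<double>().swap(m_a);
    std::vector<double>().swap(m_b);
    std::vector<double>().swap(m_c);
  }

  double checksum(ToyVariant vid) const {
    return m_checksum[static_cast<std::size_t>(vid)];
  }

  bool dataReleased() const {
    return m_a.empty() && m_b.empty() && m_c.empty();
  }

private:
  std::size_t m_n;
  std::vector<double> m_a, m_b, m_c;
  double m_checksum[2] = {0.0, 0.0};
};

// Drive one variant through the assumed per-run lifecycle.
inline double runVariant(ToyAdd& kernel, ToyVariant vid) {
  kernel.setUp();
  kernel.run(vid);
  kernel.updateChecksum(vid);
  kernel.tearDown();
  return kernel.checksum(vid);
}
```

Calling ``runVariant`` once per variant on the same ``ToyAdd`` object should
yield equal checksums for both variants, which is the comparison the checksum
convention is meant to enable.
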
Note that the ``setUp`` and ``tearDown`` methods take a ``VariantID`` argument
and pass it to data allocation, initialization, and deallocation methods so
this data management can be done in a variant-specific manner as needed.

To simplify these operations and help ensure consistency, utility methods to
allocate, initialize, deallocate, and copy data, and to compute checksums, are
defined in the various *data utils* files in the ``common`` directory.

---------------------------
Kernel object construction
---------------------------

It is important to note that the program creates only one instance of each
kernel class. Thus, each kernel class constructor and destructor must only
perform operations that are not specific to any kernel variant.

The ``Executor`` class in the ``common`` directory creates kernel objects,
one for each kernel that will be run based on command-line input options. To
ensure a new kernel object will be created properly, add a call to its class
constructor based on its ``KernelID`` in the ``getKernelObject()`` method in
the ``RAJAPerfSuite.cpp`` file. For example::

  KernelBase* getKernelObject(KernelID kid,
                              const RunParams& run_params)
  {
    KernelBase* kernel = 0;

    switch ( kid ) {
      ...
      case Stream_ADD : {
        kernel = new stream::ADD(run_params);
        break;
      }
      ...
    } // end switch on kernel id

    return kernel;
  }

.. _kernel_class_impl_exec-label:

-------------------------
Kernel execution methods
-------------------------

In the discussion of the **ADD** :ref:`kernel_class-label`, we noted that the
class implementation involves multiple files containing variants for each
execution back-end. In particular, these files contain implementations of the
*run* methods declared in the **ADD** :ref:`kernel_class_header-label` to
execute the variants.

Each method takes a variant ID argument that identifies the variant to run
and a tuning index that identifies the tuning of the variant to run.
Note that the tuning index can be ignored when there is only one tuning.

Each method is responsible for multiple tasks, which involve a combination of
kernel- and variant-specific operations and calls to kernel base class
methods, such as:

* Setting up and initializing data needed by a kernel variant before it is run
* Starting an execution timer before a kernel is run
* Running the proper number of kernel executions
* Stopping the timer after the kernel is run
* Putting the class member data in an appropriate state to update a checksum

For example, here is the method to run sequential CPU variants of the **ADD**
kernel in the ``ADD-Seq.cpp`` file:

.. literalinclude:: ../../../src/stream/ADD-Seq.cpp
   :start-after: _add_run_seq_start
   :end-before: _add_run_seq_end
   :language: C++

A few details are worth noting:

* The tuning index argument is ignored because there is only one tuning for
  the sequential kernel variants.
* Execution parameters, such as kernel loop length and number of execution
  repetitions, are set by calling base class methods, which return values
  based on kernel defaults and input parameters. This ensures that execution
  is consistent across variants and that results are what is expected.
* Simple switch-case statement logic is used to execute the proper variant
  based on the ``VariantID`` argument.
* We guard sequential variants apart from the ``Base_Seq`` variant with the
  ``RUN_RAJA_SEQ`` macro. This ensures that the base sequential variant will
  always run so that it can be used as a reference variant for execution
  timing. By default, we turn off the other sequential variants when we build
  an executable with OpenMP target offload enabled.
* Macros defined in the ``ADD.hpp`` header file are used to reduce the amount
  of redundant code, such as for data initialization (``ADD_DATA_SETUP``) and
  the kernel body (``ADD_BODY``).

All kernel source files follow a similar organization and implementation
pattern for each set of back-end execution variants.
However, there are some important differences, which we describe next in the
discussion of the CUDA variant execution file.

The key contents related to execution of CUDA GPU variants of the **ADD**
kernel in the ``ADD-Cuda.cpp`` file are:

.. literalinclude:: ../../../src/stream/ADD-Cuda.cpp
   :start-after: _add_run_cuda_start
   :end-before: _add_run_cuda_end
   :language: C++

Notable differences from the sequential variant file are:

* Most of the file is guarded using the ``RAJA_ENABLE_CUDA`` macro.

  .. note:: The contents of all non-sequential variant implementation files
            are guarded using the ``RAJA_ENABLE_`` macros.

* In addition to using the ``ADD_DATA_SETUP`` macro, which is also used in
  the sequential variant implementation file discussed above, we define two
  other macros, ``ADD_DATA_SETUP_CUDA`` and ``ADD_DATA_TEARDOWN_CUDA``. The
  first macro allocates GPU device data needed to run a kernel and
  initializes the data by copying host CPU data to it. After a kernel
  executes, the second macro copies data needed to compute a checksum to the
  host and then deallocates the device data.
* A CUDA GPU kernel ``add`` is implemented for the ``Base_CUDA`` variant.
* The method to execute the CUDA kernel variants,
  ``ADD::runCudaVariantImpl``, is templated on a ``block_size`` parameter,
  which represents the *tuning parameter* and is passed to the kernel launch
  methods.
* The ``RAJAPERF_GPU_BLOCK_SIZE_TUNING_DEFINE_BOILERPLATE`` macro is used,
  outside the method implementation, to generate different kernel tuning
  implementations at compile-time to run the GPU ``block_size`` versions
  specified via command-line input mentioned in :ref:`build_build-label`.

.. important:: Following the established implementation patterns for kernels
               in the Suite helps to ensure that the code is consistent,
               understandable, easily maintained, and needs minimal
               documentation.
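
To make the idea of a compile-time ``block_size`` tuning more concrete, here
is a minimal host-only C++ sketch of how a run-time tuning index might be
mapped onto template instantiations. It is not the expansion of the
``RAJAPERF_GPU_BLOCK_SIZE_TUNING_DEFINE_BOILERPLATE`` macro and it does not
use CUDA. The names ``kBlockSizes``, ``runAddImpl``, and ``runAddTuning``,
as well as the block sizes 128, 256, and 512, are assumptions made purely for
illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical set of block sizes compiled into the executable.
constexpr std::array<std::size_t, 3> kBlockSizes = {128, 256, 512};

// Host-side stand-in for a kernel variant templated on block_size.
// Iterations are processed in chunks of block_size, mimicking how a GPU
// grid is sized. Returns the number of blocks that were "launched".
template <std::size_t block_size>
std::size_t runAddImpl(const std::vector<double>& a,
                       const std::vector<double>& b,
                       std::vector<double>& c) {
  const std::size_t n = c.size();
  const std::size_t grid_size = (n + block_size - 1) / block_size;
  for (std::size_t blk = 0; blk < grid_size; ++blk) {
    for (std::size_t t = 0; t < block_size; ++t) {
      const std::size_t i = blk * block_size + t;
      if (i < n) { c[i] = a[i] + b[i]; }  // guard the partial last block
    }
  }
  return grid_size;
}

// Map a run-time tuning index onto one compile-time instantiation.
// Returns 0 if the index does not name a compiled tuning.
inline std::size_t runAddTuning(std::size_t tune_idx,
                                const std::vector<double>& a,
                                const std::vector<double>& b,
                                std::vector<double>& c) {
  assert(a.size() == c.size() && b.size() == c.size());
  switch (tune_idx) {
    case 0: return runAddImpl<kBlockSizes[0]>(a, b, c);
    case 1: return runAddImpl<kBlockSizes[1]>(a, b, c);
    case 2: return runAddImpl<kBlockSizes[2]>(a, b, c);
    default: return 0;
  }
}
```

Because each block size is a template argument, every tuning is a separate
function generated at compile time, and the tuning index only selects which
one to call at run time.
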