CUTLASS
CUDA Templates for Linear Algebra Subroutines and Solvers
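cutlass::epilogue::threadblock::InterleavedOutputTileThreadMap< WarpCount_, MmaCount_, Threads, ElementsPerAccess, ElementSize > Struct Template Reference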
#include <output_tile_thread_map.h>
Classes

| Kind | Name |
|---|---|
| struct | Detail |
Public Types

| Kind | Declaration |
|---|---|
| using | WarpCount = WarpCount_ |
| using | MmaCount = MmaCount_ |
| using | Iterations = MmaCount |
| using | Delta = layout::PitchLinearShape< kWarpSize * kElementsPerAccess, 1 > |
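Static Public Member Functions

| Return type | Function | Description |
|---|---|---|
| static CUTLASS_HOST_DEVICE layout::PitchLinearCoord | initial_offset (int thread_idx) | Initial offset function: returns the pitch-linear coordinate at which thread thread_idx begins its accesses. |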
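This page documents only the signature of initial_offset, not its body. The following is a minimal host-side sketch of how such an offset can be derived from the members listed here (kWarpSize, Delta, Iterations, WarpCount, kElementsPerAccess). It is not CUTLASS's implementation: PitchLinearShape and PitchLinearCoord are simplified stand-ins, InterleavedThreadMapSketch is a hypothetical name, CUTLASS_HOST_DEVICE is omitted, and the warp ordering (contiguous-first) and per-warp footprint of Delta × Iterations are assumptions.

```cpp
// Hypothetical stand-ins for cutlass::layout::PitchLinearShape / PitchLinearCoord,
// reduced to the fields this sketch needs.
template <int Contiguous, int Strided>
struct PitchLinearShape {
  static constexpr int kContiguous = Contiguous;
  static constexpr int kStrided = Strided;
};

struct PitchLinearCoord {
  int contiguous;
  int strided;
};

// Sketch of an initial_offset computation for an interleaved thread map.
// Assumptions: warps are numbered contiguous-first over WarpCount, each warp
// owns a footprint of Delta * Iterations elements, and lane i starts at
// i * ElementsPerAccess within its warp's row.
template <typename WarpCount, typename Iterations, int ElementsPerAccess>
struct InterleavedThreadMapSketch {
  static constexpr int kWarpSize = 32;
  using Delta = PitchLinearShape<kWarpSize * ElementsPerAccess, 1>;

  static constexpr PitchLinearCoord initial_offset(int thread_idx) {
    int warp_idx = thread_idx / kWarpSize;
    int lane_idx = thread_idx % kWarpSize;

    // Extent covered by one warp over all of its iterations.
    int footprint_contiguous = Delta::kContiguous * Iterations::kContiguous;
    int footprint_strided = Delta::kStrided * Iterations::kStrided;

    // This warp's position in the WarpCount grid.
    int warp_contiguous = warp_idx % WarpCount::kContiguous;
    int warp_strided = warp_idx / WarpCount::kContiguous;

    // Warp origin plus this lane's vector-access offset within the warp row.
    return PitchLinearCoord{
        footprint_contiguous * warp_contiguous + lane_idx * ElementsPerAccess,
        footprint_strided * warp_strided};
  }
};
```

With a 2×2 warp grid, 2×1 iterations and 4 elements per access, Delta is <128, 1> and each warp footprint is 256 elements wide, so thread 33 (warp 1, lane 1) starts at (260, 0) and thread 64 (warp 2, lane 0) starts at (0, 1). Making the function constexpr lets these offsets be checked at compile time.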
Static Public Attributes

| Type | Declaration |
|---|---|
| static int const | kWarpSize = 32 |
| static int const | kThreads = Threads |
| static int const | kWarpCount = kThreads / kWarpSize |
| static int const | kElementsPerAccess = ElementsPerAccess |
| static int const | kElementSize = ElementSize |
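The attribute arithmetic above is plain integer math on the template parameters. The sketch below reproduces it in a hypothetical ThreadMapAttributesSketch struct and adds two things that are not on this page: a static_assert rejecting thread counts that are not a whole number of warps (since kThreads / kWarpSize truncates), and a kAccessBits helper that assumes ElementSize is given in bits, in which case kAccessBits / 8 is the width of one vectorized access in bytes.

```cpp
// Hypothetical helper mirroring the static attributes listed above.
// Assumption: ElementSize is a width in bits, so kAccessBits is the width of
// one vectorized access in bits.
template <int Threads, int ElementsPerAccess, int ElementSize>
struct ThreadMapAttributesSketch {
  static constexpr int kWarpSize = 32;
  static constexpr int kThreads = Threads;
  static constexpr int kWarpCount = kThreads / kWarpSize;
  static constexpr int kElementsPerAccess = ElementsPerAccess;
  static constexpr int kElementSize = ElementSize;

  // Width of one vectorized access (bits, under the assumption above).
  static constexpr int kAccessBits = kElementsPerAccess * kElementSize;

  // kThreads / kWarpSize truncates, so a partial warp would be silently dropped.
  static_assert(kThreads % kWarpSize == 0,
                "Threads must be a whole number of warps");
};
```

For example, 256 threads give kWarpCount = 8, and 16 elements of 8 bits each give a 128-bit access.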
Detailed Description

Template metaprogram for partitioning a 3D interleaved layout across warps to achieve several performance objectives.
Member Typedef Documentation

Members of cutlass::epilogue::threadblock::InterleavedOutputTileThreadMap< WarpCount_, MmaCount_, Threads, ElementsPerAccess, ElementSize >:

| Typedef | Definition |
|---|---|
| Delta | layout::PitchLinearShape<kWarpSize * kElementsPerAccess, 1> |
| Iterations | MmaCount |
| MmaCount | MmaCount_ |
| WarpCount | WarpCount_ |
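Iterations is simply MmaCount, and Delta spans one warp-wide row of kWarpSize × kElementsPerAccess contiguous elements. How a tile iterator walks these accesses is not described on this page; the sketch below assumes that a thread starts at its initial offset, performs Iterations::kCount accesses, and moves by Delta between neighbouring accesses with the contiguous index varying fastest. PitchLinearShape, PitchLinearCoord and access_offset are hypothetical stand-ins, not CUTLASS APIs.

```cpp
// Hypothetical stand-ins for the CUTLASS pitch-linear types.
template <int Contiguous, int Strided>
struct PitchLinearShape {
  static constexpr int kContiguous = Contiguous;
  static constexpr int kStrided = Strided;
  static constexpr int kCount = Contiguous * Strided;
};

struct PitchLinearCoord {
  int contiguous;
  int strided;
};

// Sketch (assumption): offset of the given access, with iteration in
// [0, Iterations::kCount). The contiguous iteration index varies fastest and
// consecutive accesses are Delta apart in each dimension.
template <typename Iterations, typename Delta>
constexpr PitchLinearCoord access_offset(PitchLinearCoord start, int iteration) {
  int c = iteration % Iterations::kContiguous;
  int s = iteration / Iterations::kContiguous;
  return PitchLinearCoord{start.contiguous + c * Delta::kContiguous,
                          start.strided + s * Delta::kStrided};
}
```

A caller would loop `for (int i = 0; i < Iterations::kCount; ++i)` and use `access_offset<Iterations, Delta>(start, i)`. With 2×2 iterations, Delta = <128, 1> and a start of (4, 0), the four accesses land at (4, 0), (132, 0), (4, 1) and (132, 1).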