This example is a self-contained use of the scan and
sort primitives, meant to measure and plot their performance. It builds on
the simpler functionality example.
Set your parameters in the pane and click "Start" to run a WebGPU
scan/reduce/sort and plot its performance data. The inputCount input specifies
how many different input lengths to run; these lengths are spaced evenly on a
logarithmic scale between the specified start and end lengths.
Otherwise, the parameters are the same as in the functionality example.
This example explains how to time a Gridwise primitive.
The entire JS source file is on GitHub.
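The logarithmic spacing of input lengths described above can be sketched as follows. This is only an illustration: logSpacedLengths is a hypothetical helper, not part of Gridwise, and the actual example may round or align its lengths differently.

```javascript
// Hypothetical helper (not part of Gridwise): returns `count` lengths spaced
// evenly on a logarithmic scale from `start` to `end`, inclusive.
function logSpacedLengths(start, end, count) {
  if (count < 1) return [];
  if (count === 1) return [start];
  const logStart = Math.log2(start);
  const logEnd = Math.log2(end);
  const lengths = [];
  for (let i = 0; i < count; i++) {
    const t = i / (count - 1); // fraction of the way from start to end
    // Interpolate in log space, then round back to a whole element count.
    lengths.push(Math.round(2 ** (logStart + t * (logEnd - logStart))));
  }
  return lengths;
}

console.log(logSpacedLengths(1024, 1048576, 3)); // [ 1024, 32768, 1048576 ]
```

Each successive length is a constant multiple of the previous one, so the points land evenly on a log-scale x axis.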
To measure CPU and/or GPU timing, include a timing directive in the call
to primitive.execute. Typically we call the primitive once
without any timing information to absorb warmup effects (e.g., compiling
the kernel), then call it many times with timing enabled and divide the
total runtime of that second set of calls by the number of trials.
/* call the primitive once to warm up */
await primitive.execute({
  inputBuffer: memsrcBuffer,
  outputBuffer: memdestBuffer,
});
/* call params.trials times */
await primitive.execute({
  inputBuffer: memsrcBuffer,
  outputBuffer: memdestBuffer,
  trials: params.trials, /* integer */
  enableGPUTiming: true,
  enableCPUTiming: true,
});
We can get timing information back from the primitive with a
getTimingResult call. The GPU time might be an array of timings if
the GPU call has multiple kernels within it. In the example below, we
simply collapse that array into a single total time by summing it.
let { gpuTotalTimeNS, cpuTotalTimeNS } = await primitive.getTimingResult();
if (gpuTotalTimeNS instanceof Array) {
  // gpuTotalTimeNS might be a list, in which case just sum it up
  gpuTotalTimeNS = gpuTotalTimeNS.reduce((x, a) => x + a, 0);
}
averageGpuTotalTimeNS = gpuTotalTimeNS / params.trials;
averageCpuTotalTimeNS = cpuTotalTimeNS / params.trials;
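The summing and averaging steps above can be wrapped in a small helper that also turns the average time into a number suitable for plotting. The sketch below is an illustration only: summarizeTiming is a hypothetical function, not part of Gridwise, and the bandwidth formula assumes each element is read once and written once, which may not match how the actual example reports throughput.

```javascript
// Hypothetical helper (not part of Gridwise): collapses a timing result that
// is either a number or an array of per-kernel timings, then averages it.
function summarizeTiming(totalTimeNS, trials, bytesMoved) {
  const totalNS = Array.isArray(totalTimeNS)
    ? totalTimeNS.reduce((sum, t) => sum + t, 0)
    : totalTimeNS;
  const averageNS = totalNS / trials;
  // Bytes per nanosecond is numerically equal to GB/s (1e9 bytes per 1e9 ns).
  const bandwidthGBps = bytesMoved / averageNS;
  return { averageNS, bandwidthGBps };
}

// Assumed workload: 2^20 32-bit elements, each read once and written once.
const inputLength = 1 << 20;
const bytesMoved = 2 * 4 * inputLength;

// Made-up timings for two kernels, accumulated over 10 trials.
console.log(summarizeTiming([300000, 500000], 10, bytesMoved));
```

Dividing by the trial count before computing bandwidth keeps the metric per call rather than per batch of calls.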
Timing the sort primitive is frustratingly complicated
because sort overwrites its input with its output. To get the most meaningful
timing results, sort's input would need to be reset before each pass so
that every pass sees the same workload. For simplicity, we are
not doing that here.
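To illustrate the reset-per-pass idea without a GPU, here is a CPU-only analogy using a typed array, whose sort() also overwrites its input. It is a sketch of the measurement pattern, not of the Gridwise sort primitive; all names in it are hypothetical.

```javascript
// CPU-only analogy: Uint32Array.prototype.sort() sorts in place, so without
// a reset every pass after the first would see already-sorted input.
const pristineInput = Uint32Array.from([5, 3, 8, 1, 9, 2, 7, 4]);
const workBuffer = new Uint32Array(pristineInput.length);

function isSorted(a) {
  for (let i = 1; i < a.length; i++) {
    if (a[i - 1] > a[i]) return false;
  }
  return true;
}

const sortTrials = 3;
let sortOnlyTimeMS = 0;
const sawUnsortedInput = [];
for (let trial = 0; trial < sortTrials; trial++) {
  // Reset the input outside the timed region so each pass does the same work.
  workBuffer.set(pristineInput);
  sawUnsortedInput.push(!isSorted(workBuffer));

  const start = performance.now();
  workBuffer.sort();
  sortOnlyTimeMS += performance.now() - start;
}

console.log(sawUnsortedInput); // [ true, true, true ]
console.log(`average sort time: ${sortOnlyTimeMS / sortTrials} ms`);
```

On the GPU, the analogous reset would presumably be a buffer-to-buffer copy (WebGPU's copyBufferToBuffer) from an untouched copy of the input before each pass. Keeping that copy out of the measured time depends on the timing mechanism and is not shown here.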