
Scan / Sort Performance

This example is a self-contained use of the scan and sort primitives that plots their performance. It builds on the simpler functionality example. Set your parameters in the pane and click "Start" to run a WebGPU scan/reduce/sort and plot its performance data. The inputCount input specifies how many different input lengths to run; these lengths are spaced evenly on a logarithmic scale between the specified start and end lengths. Otherwise, the parameters are the same as in the functionality example. This example also explains how to time a Gridwise primitive. The entire JS source file is on GitHub.
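The logarithmic spacing of input lengths can be sketched as follows. This is a minimal illustration, not the example's actual code: the helper name logSpacedLengths and its rounding to the nearest integer are assumptions, and the real example may round or align lengths differently.

```javascript
// Hypothetical helper (not part of Gridwise): returns `count` input lengths
// spaced evenly on a log scale from `start` to `end`, inclusive, each
// rounded to the nearest integer.
function logSpacedLengths(start, end, count) {
  if (count < 1 || start <= 0 || end <= 0) {
    throw new RangeError("count must be >= 1 and lengths must be positive");
  }
  if (count === 1) {
    return [Math.round(start)];
  }
  const logStart = Math.log2(start);
  const logEnd = Math.log2(end);
  const lengths = [];
  for (let i = 0; i < count; i++) {
    const t = i / (count - 1); // fraction of the way from start to end
    lengths.push(Math.round(2 ** (logStart + t * (logEnd - logStart))));
  }
  return lengths;
}

// Three lengths between 2^10 and 2^20: [1024, 32768, 1048576]
console.log(logSpacedLengths(1024, 1048576, 3));
```

Interpolating in log space gives each order of magnitude the same number of sample points, so small and large inputs are equally represented on a log-scale plot.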

To measure CPU and/or GPU timing, include timing directives in the call to primitive.execute. Typically we call the primitive once without any timing directives to absorb warmup effects (e.g., compiling the kernel), then call it many times with timing enabled and divide the total runtime of that second set of calls by the number of trials.

/* call the primitive once to warm up */
await primitive.execute({
  inputBuffer: memsrcBuffer,
  outputBuffer: memdestBuffer,
});
/* call params.trials times */
await primitive.execute({
  inputBuffer: memsrcBuffer,
  outputBuffer: memdestBuffer,
  trials: params.trials, /* integer */
  enableGPUTiming: true,
  enableCPUTiming: true,
});

We can get timing information back from the primitive with a getTimingResult call. The GPU time might be an array of timings if the GPU call has multiple kernels within it. In the example below, we simply flatten that array by summing it into a total time.

let { gpuTotalTimeNS, cpuTotalTimeNS } = await primitive.getTimingResult();
if (Array.isArray(gpuTotalTimeNS)) {
  // gpuTotalTimeNS might be an array, in which case just sum it up
  gpuTotalTimeNS = gpuTotalTimeNS.reduce((sum, t) => sum + t, 0);
}
averageGpuTotalTimeNS = gpuTotalTimeNS / params.trials;
averageCpuTotalTimeNS = cpuTotalTimeNS / params.trials;
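For plotting, it can be convenient to fold the steps above into one function and turn the average time into a throughput. The sketch below does that under stated assumptions: summarizeTiming is a hypothetical helper, not a Gridwise function, and reporting throughput as input elements per nanosecond (equivalently, billions of elements per second) is one possible choice of metric rather than the one this example necessarily uses.

```javascript
// Hypothetical helper (not part of Gridwise). `timing` has the shape returned
// by getTimingResult in the snippet above: gpuTotalTimeNS is a number or an
// array of numbers, and cpuTotalTimeNS is a number.
function summarizeTiming(timing, trials, inputLength) {
  // Sum per-kernel GPU timings if they arrive as an array.
  const gpuTotalNS = Array.isArray(timing.gpuTotalTimeNS)
    ? timing.gpuTotalTimeNS.reduce((sum, t) => sum + t, 0)
    : timing.gpuTotalTimeNS;
  const avgGpuNS = gpuTotalNS / trials;
  const avgCpuNS = timing.cpuTotalTimeNS / trials;
  return {
    avgGpuNS,
    avgCpuNS,
    // Elements per nanosecond equals billions of elements per second.
    gpuElementsPerNS: inputLength / avgGpuNS,
  };
}

// Made-up numbers, purely to show the call shape.
const summary = summarizeTiming(
  { gpuTotalTimeNS: [1000, 3000], cpuTotalTimeNS: 8000 },
  4,
  2000
);
console.log(summary);
```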

Timing the sort primitive is frustratingly complicated because sort overwrites its input with its output. After the first trial, every later trial therefore sorts data that is already sorted. The most meaningful timing results would need to reset sort's input before each pass so that every pass has the same workload. For simplicity, we are not doing that here.
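One way to give every pass the same workload is to run one timed trial at a time and restore the unsorted input between trials. The sketch below is not what this example does, and it rests on assumptions: timeSortWithReset, resetInput, and buffers are hypothetical names, and it assumes that getTimingResult reports the timing of the most recent execute call. In a real page, resetInput could, for instance, rewrite the input buffer with the WebGPU call device.queue.writeBuffer.

```javascript
// Hypothetical helper (not part of Gridwise): times a sort one trial at a
// time, restoring the unsorted input before each trial.
// - primitive:  object with execute() and getTimingResult(), as used above
// - resetInput: caller-supplied async function that restores the input data
// - buffers:    the buffer arguments the primitive expects (passed through)
async function timeSortWithReset({ primitive, resetInput, buffers, trials }) {
  let gpuSumNS = 0;
  for (let i = 0; i < trials; i++) {
    await resetInput(); // undo the in-place sort from the previous trial
    await primitive.execute({ ...buffers, trials: 1, enableGPUTiming: true });
    // Assumption: this returns the timing of the execute call just made.
    let { gpuTotalTimeNS } = await primitive.getTimingResult();
    if (Array.isArray(gpuTotalTimeNS)) {
      gpuTotalTimeNS = gpuTotalTimeNS.reduce((sum, t) => sum + t, 0);
    }
    gpuSumNS += gpuTotalTimeNS;
  }
  return gpuSumNS / trials; // average GPU time per trial, in ns
}
```

Because the reset happens outside the timed execute call, the GPU timings cover only the sort itself. The trade-off is one submission per trial instead of a single batched call, so CPU-side timing measured this way would include more per-call overhead.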