
Reduce Performance

This example is a self-contained use of the reduce primitive, meant to plot its performance. It builds on the simpler functionality example. Set your parameters in the pane and click "Start" to run a WebGPU reduce and plot its performance data. The inputCount input specifies how many different input lengths to run; these lengths are spaced evenly on a logarithmic scale between the specified start and end lengths. Otherwise, the parameters are the same as in the functionality example. This example also explains how to time a Gridwise primitive. The entire JS source file is on GitHub.
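As a rough illustration of how input lengths could be spaced evenly on a logarithmic scale, the sketch below generates a list of lengths between a start and an end value. The function name `logSpacedLengths` and its parameters are hypothetical and not taken from the example's source, which may round or align lengths differently.

```javascript
// Hypothetical helper (not part of Gridwise): returns `count` input lengths
// spaced evenly on a log scale between `start` and `end`, inclusive.
function logSpacedLengths(start, end, count) {
  if (count < 1 || start <= 0 || end <= 0) {
    throw new RangeError("count must be >= 1 and lengths must be positive");
  }
  if (count === 1) {
    return [Math.round(start)];
  }
  const logStart = Math.log2(start);
  const logEnd = Math.log2(end);
  const lengths = [];
  for (let i = 0; i < count; i++) {
    // t runs from 0 to 1; interpolate linearly in log space
    const t = i / (count - 1);
    lengths.push(Math.round(2 ** (logStart + t * (logEnd - logStart))));
  }
  return lengths;
}
```

With a start of 2^10 and an end of 2^20, three lengths would land on 2^10, 2^15, and 2^20, so each step multiplies the length by the same factor rather than adding the same amount.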

To measure CPU and/or GPU timing, include timing directives in the call to primitive.execute. Typically we call the primitive once without any timing directives to absorb warmup effects (e.g., compiling the kernel), then call it many times with timing enabled and divide the total runtime of that second set of calls by the number of trials.

/* call the primitive once to warm up */
await primitive.execute({
  inputBuffer: memsrcBuffer,
  outputBuffer: memdestBuffer,
});
/* call params.trials times */
await primitive.execute({
  inputBuffer: memsrcBuffer,
  outputBuffer: memdestBuffer,
  trials: params.trials, /* integer */
  enableGPUTiming: true,
  enableCPUTiming: true,
});

We can get timing information back from the primitive with a getTimingResult call. The GPU time might be an array of timings if the GPU call has multiple kernels within it. In the example below, we flatten that array by summing it into a total time.

let { gpuTotalTimeNS, cpuTotalTimeNS } = await primitive.getTimingResult();
if (gpuTotalTimeNS instanceof Array) {
  // gpuTotalTimeNS might be a list, in which case just sum it up
  gpuTotalTimeNS = gpuTotalTimeNS.reduce((x, a) => x + a, 0);
}
// average per-trial runtimes, in nanoseconds
const averageGpuTotalTimeNS = gpuTotalTimeNS / params.trials;
const averageCpuTotalTimeNS = cpuTotalTimeNS / params.trials;
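For plotting, it can be convenient to fold the steps above into one helper that also converts the average time into a bandwidth figure. The following is a minimal sketch under stated assumptions: `summarizeTiming` and its `inputBytes` parameter are hypothetical names, not part of Gridwise, and the bandwidth counts only the bytes read from the input array.

```javascript
// Hypothetical helper (not part of Gridwise): normalizes a timing result
// shaped like { gpuTotalTimeNS, cpuTotalTimeNS } into per-trial averages
// and an achieved input bandwidth.
function summarizeTiming({ gpuTotalTimeNS, cpuTotalTimeNS }, trials, inputBytes) {
  // gpuTotalTimeNS may be an array (one entry per kernel); sum it if so
  const gpuTotal = Array.isArray(gpuTotalTimeNS)
    ? gpuTotalTimeNS.reduce((sum, t) => sum + t, 0)
    : gpuTotalTimeNS;
  const avgGpuNS = gpuTotal / trials;
  const avgCpuNS = cpuTotalTimeNS / trials;
  return {
    avgGpuNS,
    avgCpuNS,
    // bytes per nanosecond is numerically equal to GB/s (10^9 bytes per second)
    gpuBandwidthGBps: inputBytes / avgGpuNS,
    cpuBandwidthGBps: inputBytes / avgCpuNS,
  };
}
```

For a reduce over 32-bit values, `inputBytes` would be four times the input length, assuming the output write is small enough to ignore.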

The reduce primitive computes a single output value from an input array using a binary operation (such as add, max, or min). This makes it simpler to time than sort (which overwrites its input): the input remains unchanged after each execution, so every trial can run on the same input buffer without restoring it first.
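To make the semantics concrete, here is a minimal CPU sketch of a reduce. It is not the Gridwise implementation, and the name `cpuReduce` and its parameters are hypothetical. It reads the input without modifying it, which is the property that lets repeated trials reuse one buffer.

```javascript
// Hypothetical CPU reference (not part of Gridwise): combines every element
// of `input` with `binop`, starting from `identity`. The input is only read.
function cpuReduce(input, binop, identity) {
  let acc = identity;
  for (const x of input) {
    acc = binop(acc, x);
  }
  return acc;
}

// Example binary operations with their identities for unsigned 32-bit data
const add = (a, b) => a + b;            // identity: 0
const max = (a, b) => Math.max(a, b);   // identity: 0 for unsigned values
```

A reference like this could also be used to check the GPU output. With floating-point data, small differences are expected because a parallel reduce combines elements in a different order than this sequential loop.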