PETSc fig 6 - Exascale Computing Project

Figure 6. Pipelined kernel launches (L) versus interrupted kernel launches (R). Suppose each kernel launch takes 10 µs, kernel A runs for 20 µs, and kernels B and C each run for 5 µs. (L) shows a timeline with fully pipelined kernel launches. (R) shows a timeline with a device synchronization after kernel A. MPI communication forces such synchronizations, as in (R); NVSHMEM does not, which permits a timeline like the one in (L).
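
The arithmetic behind the two timelines can be sketched with a small simulation. This is only an illustrative model, not PETSc code: the function name `simulate` and its parameters are hypothetical, and it assumes that the host issues launches one after another, that a kernel starts once its launch has completed and the previous kernel has finished, and that a device synchronization blocks the host until the device is idle.

```python
def simulate(launch_us, kernel_us, sync_after=()):
    """Return (timeline, total_us) for a sequence of kernels on one stream.

    launch_us  : host-side cost of one kernel launch, in microseconds
    kernel_us  : list of kernel run times, in microseconds
    sync_after : indices of kernels followed by a device sync (assumed model)
    """
    host = 0    # time at which the host can issue the next launch
    device = 0  # time at which the device finishes its current kernel
    timeline = []
    for i, run in enumerate(kernel_us):
        host += launch_us              # the host spends launch_us issuing the launch
        start = max(host, device)      # kernel waits for its launch and the prior kernel
        device = start + run
        timeline.append((start, device))
        if i in sync_after:
            host = max(host, device)   # a sync stalls the host until the device is idle
    return timeline, device


kernels = [20, 5, 5]  # kernels A, B, C from the caption

# (L) fully pipelined: launches of B and C overlap with the execution of A
print(simulate(10, kernels))                  # ([(10, 30), (30, 35), (35, 40)], 40)

# (R) device sync after kernel A: launch latency of B and C is exposed
print(simulate(10, kernels, sync_after={0}))  # ([(10, 30), (40, 45), (50, 55)], 55)
```

Under these assumptions, the pipelined case finishes at 40 µs, while the single synchronization after kernel A pushes completion to 55 µs, because the 10 µs launch costs of B and C can no longer be hidden behind kernel A.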