FlashAttention4 Forward

sequence=256, head_dim=64, tile=64 → 256 / 64 = 4 tiles. This chart models a single CTA processing one head. cp.async loads overlap with tensor-core MMA, TMEM stores and loads overlap with compute, and epilogues run concurrently using cp.async bulk commits. Hover over a bar for the exact op and tile.
Colors: purples for compute/memory, orange for synchronization.
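
As a rough illustration of the tile count and of why overlapping loads with MMA shortens the timeline, here is a minimal Python sketch of a toy two-lane schedule. It is not the kernel's actual scheduling logic: the names `num_tiles`, `schedule`, and `Op`, and the costs `load_cost=2` and `mma_cost=3` (arbitrary time units), are assumptions made up for this example. It also assumes unlimited buffering, and it ignores TMEM traffic, synchronization, and epilogues.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str   # "load" or "mma"
    tile: int   # tile index along the sequence
    start: int  # start time, arbitrary units
    end: int    # end time, arbitrary units


def num_tiles(sequence: int, tile: int) -> int:
    # Ceiling division, so a partial trailing tile still counts as one tile.
    return (sequence + tile - 1) // tile


def schedule(n_tiles: int, load_cost: int = 2, mma_cost: int = 3) -> list:
    """Toy model: one load lane and one compute lane running side by side.

    A load is issued as soon as the load lane is free. The MMA for a tile
    starts once that tile's load has finished and the compute lane is free.
    The costs are hypothetical, not measured values.
    """
    ops = []
    load_free = 0  # time at which the load lane becomes free
    mma_free = 0   # time at which the compute lane becomes free
    for t in range(n_tiles):
        load_start = load_free
        load_end = load_start + load_cost
        load_free = load_end
        ops.append(Op("load", t, load_start, load_end))

        mma_start = max(load_end, mma_free)
        mma_end = mma_start + mma_cost
        mma_free = mma_end
        ops.append(Op("mma", t, mma_start, mma_end))
    return ops


tiles = num_tiles(sequence=256, tile=64)
ops = schedule(tiles)
makespan = max(op.end for op in ops)  # overlapped timeline length
serial = tiles * (2 + 3)              # same work with no overlap

for op in ops:
    print(f"{op.name:4s} tile={op.tile} [{op.start}, {op.end})")
print(f"tiles={tiles} overlapped={makespan} serial={serial}")
```

In this toy model only the first load is exposed; every later load is hidden behind the MMA of an earlier tile, which is the same effect the chart shows for cp.async loads and tensor-core MMA.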