### Description Optimize the backward pass kernels for all supported linear attention variants to improve training throughput. ### Tasks - [ ] Profile backward pass performance and identify bottlenecks - [ ] Implement cuda KDA bwd subchunk intra and wy_dqkg - [ ] Benchmark against FLA Triton backward pass - [ ] Validate gradient correctness
Description
Optimize the backward pass kernels for all supported linear attention variants to improve training throughput.
Tasks