How do I use make_tiled_copy to copy from registers to global memory for this layout? m: 32, n: 64, threads: 128
(row0) t0v0...v7, t1v0...v7, t2v0...v7, t3v0...v7, t64v0...v7, t65v0...v7, t66v0...v7, t67v0...v7 (fp16)
t4v0...v7, t5v0...v7, t6v0...v7, t7v0...v7, t68v0...v7, t69v0...v7, t70v0...v7, t71v0...v7
...
(row8) t0v8...v15, t1v8...v15, t2v8...v15, t3v8...v15, t64v8...v15, t65v8...v15, t66v8...v15, t67v8...v15
t4v8...v15, t5v8...v15, t6v8...v15, t7v8...v15, t68v8...v15, t69v8...v15, t70v8...v15, t71v8...v15
...
(row16) t32v0...v7, t33v0...v7, t34v0...v7, t35v0...v7, t96v0...v7, t97v0...v7, t98v0...v7, t99v0...v7 (fp16)
...
(row31) t60v8...v15, t61v8...v15, t62v8...v15, t63v8...v15, t124v8...v15, t125v8...v15, t126v8...v15, t127v8...v15
permutation layout:
I have already transposed v2v3, so each thread now holds 128 contiguous bits (8 fp16 values) in registers. How do I use make_tiled_copy to write each 128-bit contiguous chunk to global memory for this layout?
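To make the layout above unambiguous, here is a minimal plain C++ sketch (not CuTe code) of the (thread, value) -> (row, column) mapping it appears to describe. The closed-form expressions are inferred from the rows listed above, and the rows elided with "..." are assumed to follow the same pattern. The names `row_of`, `col_of`, `global_offset` and `covers_tile_exactly_once` are hypothetical, and a row-major global tile with leading dimension 64 is an assumption.

```cpp
#include <cassert>

// Tile shape and thread count taken from the question.
constexpr int kM = 32, kN = 64, kThreads = 128, kValsPerThread = 16;

// Row of value v (0..15) held by thread t (0..127), inferred from the listing.
constexpr int row_of(int t, int v) {
    return (t % 32) / 4          // lane / 4 selects one of 8 rows
         + (v / 8) * 8           // v8..v15 sit 8 rows below v0..v7
         + ((t / 32) % 2) * 16;  // odd warps (t32.., t96..) cover rows 16..31
}

// Column of value v held by thread t, inferred from the listing.
constexpr int col_of(int t, int v) {
    return (t % 4) * 8           // lane % 4 selects a group of 8 fp16 values
         + (t / 64) * 32         // threads 64..127 cover columns 32..63
         + (v % 8);              // position inside the 8-value group
}

// Spot checks against the entries that are written out in the question.
static_assert(row_of(0, 0) == 0 && col_of(0, 0) == 0, "t0v0 at row 0");
static_assert(row_of(64, 0) == 0 && col_of(64, 0) == 32, "t64v0 at row 0");
static_assert(row_of(4, 0) == 1 && col_of(68, 0) == 32, "t4 and t68 on row 1");
static_assert(row_of(0, 8) == 8 && col_of(0, 8) == 0, "t0v8 at row 8");
static_assert(row_of(32, 0) == 16 && row_of(96, 0) == 16, "t32, t96 at row 16");
static_assert(row_of(60, 8) == 31 && col_of(60, 8) == 0, "t60v8 at row 31");
static_assert(row_of(127, 15) == 31 && col_of(127, 15) == 63, "last element");

// Element offset in a row-major tile with leading dimension ld (assumption).
inline int global_offset(int t, int v, int ld) {
    assert(t >= 0 && t < kThreads && v >= 0 && v < kValsPerThread);
    return row_of(t, v) * ld + col_of(t, v);
}

// True if the 128 x 16 (thread, value) pairs hit every tile element once.
inline bool covers_tile_exactly_once() {
    bool seen[kM * kN] = {};
    for (int t = 0; t < kThreads; ++t) {
        for (int v = 0; v < kValsPerThread; ++v) {
            const int idx = global_offset(t, v, kN);
            if (seen[idx]) return false;  // two pairs map to the same element
            seen[idx] = true;
        }
    }
    return true;  // 128 * 16 == 32 * 64, so no duplicates means full coverage
}
```

Under this mapping, v0...v7 of a thread are adjacent in one row (8 fp16 values, 16 bytes), while v8...v15 land 8 rows lower. With a row-major destination, that suggests two separate 16-byte stores per thread rather than one longer contiguous write. A CuTe thread-value layout for the copy could be checked against this table.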