Thanks. If I find the time I may give you some more to benchmark, but that will take a while.
I did test the latest hgemm benchmark, and the results are interesting.
With the default config I get:
-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
{hgemm:kernel_type::shared,m:1024,n:1024,k:1024}/manual_time 3.48 ms 3.36 ms 201 TFLOPS=0.617884 bytes_per_second=1.68587Gi/s
{hgemm:kernel_type::shared,m:2048,n:2048,k:2048}/manual_time 30.6 ms 29.4 ms 23 TFLOPS=0.561361 bytes_per_second=784.208Mi/s
{hgemm:kernel_type::shared,m:4096,n:4096,k:4096}/manual_time 244 ms 243 ms 3 TFLOPS=0.563502 bytes_per_second=393.601Mi/s
{hgemm:kernel_type::shared,m:8192,n:8192,k:8192}/manual_time 1960 ms 1943 ms 1 TFLOPS=0.560897 bytes_per_second=195.891Mi/s
{hgemm:kernel_type::wmma_naive,m:1024,n:1024,k:1024}/manual_time 2.24 ms 2.26 ms 296 TFLOPS=0.957727 bytes_per_second=2.61292Gi/s
{hgemm:kernel_type::wmma_naive,m:2048,n:2048,k:2048}/manual_time 13.8 ms 13.7 ms 50 TFLOPS=1.24787 bytes_per_second=1.70183Gi/s
{hgemm:kernel_type::wmma_naive,m:4096,n:4096,k:4096}/manual_time 175 ms 175 ms 4 TFLOPS=0.78327 bytes_per_second=547.107Mi/s
{hgemm:kernel_type::wmma_naive,m:8192,n:8192,k:8192}/manual_time 1607 ms 1602 ms 1 TFLOPS=0.68412 bytes_per_second=238.926Mi/s
{hgemm:kernel_type::wmma_shared,m:1024,n:1024,k:1024}/manual_time 0.933 ms 0.950 ms 558 TFLOPS=2.30206 bytes_per_second=6.27989Gi/s
{hgemm:kernel_type::wmma_shared,m:2048,n:2048,k:2048}/manual_time 10.5 ms 10.5 ms 64 TFLOPS=1.64068 bytes_per_second=2.2365Gi/s
{hgemm:kernel_type::wmma_shared,m:4096,n:4096,k:4096}/manual_time 77.6 ms 77.4 ms 8 TFLOPS=1.77166 bytes_per_second=1.20845Gi/s
{hgemm:kernel_type::wmma_shared,m:8192,n:8192,k:8192}/manual_time 716 ms 714 ms 1 TFLOPS=1.53489 bytes_per_second=536.056Mi/s
{hgemm:kernel_type::wmma_shared_warp,m:1024,n:1024,k:1024}/manual_time 3.92 ms 3.93 ms 178 TFLOPS=0.548059 bytes_per_second=1.49521Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:2048,n:2048,k:2048}/manual_time 30.9 ms 30.9 ms 22 TFLOPS=0.555448 bytes_per_second=775.73Mi/s
{hgemm:kernel_type::wmma_shared_warp,m:4096,n:4096,k:4096}/manual_time 250 ms 250 ms 3 TFLOPS=0.549295 bytes_per_second=383.668Mi/s
{hgemm:kernel_type::wmma_shared_warp,m:8192,n:8192,k:8192}/manual_time 2052 ms 2047 ms 1 TFLOPS=0.535741 bytes_per_second=187.105Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:1024,n:1024,k:1024}/manual_time 3.84 ms 3.84 ms 182 TFLOPS=0.559416 bytes_per_second=1.52621Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:2048,n:2048,k:2048}/manual_time 40.8 ms 40.7 ms 17 TFLOPS=0.420842 bytes_per_second=587.726Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:4096,n:4096,k:4096}/manual_time 252 ms 252 ms 3 TFLOPS=0.544357 bytes_per_second=380.219Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:8192,n:8192,k:8192}/manual_time 2038 ms 1987 ms 1 TFLOPS=0.539572 bytes_per_second=188.443Mi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:1024,n:1024,k:1024}/manual_time 4.00 ms 3.98 ms 175 TFLOPS=0.536504 bytes_per_second=1.46324Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:2048,n:2048,k:2048}/manual_time 37.1 ms 36.9 ms 19 TFLOPS=0.462848 bytes_per_second=646.526Mi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:4096,n:4096,k:4096}/manual_time 315 ms 314 ms 2 TFLOPS=0.435824 bytes_per_second=304.42Mi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:8192,n:8192,k:8192}/manual_time 2599 ms 2591 ms 1 TFLOPS=0.423027 bytes_per_second=147.74Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:1024,n:1024,k:1024}/manual_time 3.67 ms 3.68 ms 190 TFLOPS=0.604062 bytes_per_second=1.59762Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:2048,n:2048,k:2048}/manual_time 35.0 ms 34.5 ms 19 TFLOPS=0.491014 bytes_per_second=685.921Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:4096,n:4096,k:4096}/manual_time 302 ms 301 ms 2 TFLOPS=0.455492 bytes_per_second=318.156Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:8192,n:8192,k:8192}/manual_time 2446 ms 2439 ms 1 TFLOPS=0.449461 bytes_per_second=156.973Mi/s
{hgemm:kernel_type::wmma_prefetch,m:1024,n:1024,k:1024}/manual_time 2.20 ms 2.22 ms 314 TFLOPS=0.975214 bytes_per_second=2.65925Gi/s
{hgemm:kernel_type::wmma_prefetch,m:2048,n:2048,k:2048}/manual_time 21.5 ms 21.4 ms 32 TFLOPS=0.800195 bytes_per_second=1.09158Gi/s
{hgemm:kernel_type::wmma_prefetch,m:4096,n:4096,k:4096}/manual_time 182 ms 182 ms 4 TFLOPS=0.757997 bytes_per_second=527.1Mi/s
{hgemm:kernel_type::wmma_prefetch,m:8192,n:8192,k:8192}/manual_time 1623 ms 1618 ms 1 TFLOPS=0.677647 bytes_per_second=236.666Mi/s
{hgemm:kernel_type::wmma_opt_1,m:1024,n:1024,k:1024}/manual_time 4.21 ms 4.21 ms 164 TFLOPS=0.517805 bytes_per_second=1.39336Gi/s
{hgemm:kernel_type::wmma_opt_1,m:2048,n:2048,k:2048}/manual_time 41.1 ms 39.5 ms 17 TFLOPS=0.418247 bytes_per_second=584.271Mi/s
{hgemm:kernel_type::wmma_opt_1,m:4096,n:4096,k:4096}/manual_time 338 ms 335 ms 2 TFLOPS=0.406989 bytes_per_second=284.279Mi/s
{hgemm:kernel_type::wmma_opt_1,m:8192,n:8192,k:8192}/manual_time 2715 ms 2608 ms 1 TFLOPS=0.40491 bytes_per_second=141.413Mi/s
{hgemm:kernel_type::wmma_opt_2,m:1024,n:1024,k:1024}/manual_time 1.88 ms 1.89 ms 361 TFLOPS=1.14198 bytes_per_second=3.11308Gi/s
{hgemm:kernel_type::wmma_opt_2,m:2048,n:2048,k:2048}/manual_time 16.7 ms 16.2 ms 42 TFLOPS=1.02704 bytes_per_second=1.40103Gi/s
{hgemm:kernel_type::wmma_opt_2,m:4096,n:4096,k:4096}/manual_time 140 ms 140 ms 5 TFLOPS=0.979427 bytes_per_second=684.114Mi/s
{hgemm:kernel_type::wmma_opt_2,m:8192,n:8192,k:8192}/manual_time 1139 ms 1094 ms 1 TFLOPS=0.965238 bytes_per_second=337.106Mi/s
{hgemm:kernel_type::wmma_opt_3,m:1024,n:1024,k:1024}/manual_time 1.55 ms 1.56 ms 425 TFLOPS=1.39208 bytes_per_second=3.79245Gi/s
{hgemm:kernel_type::wmma_opt_3,m:2048,n:2048,k:2048}/manual_time 13.8 ms 13.8 ms 50 TFLOPS=1.2434 bytes_per_second=1.69582Gi/s
{hgemm:kernel_type::wmma_opt_3,m:4096,n:4096,k:4096}/manual_time 119 ms 118 ms 6 TFLOPS=1.15431 bytes_per_second=806.265Mi/s
{hgemm:kernel_type::wmma_opt_3,m:8192,n:8192,k:8192}/manual_time 1040 ms 1036 ms 1 TFLOPS=1.05743 bytes_per_second=369.304Mi/s
{hgemm:kernel_type::wmma_opt_4,m:1024,n:1024,k:1024}/manual_time 0.496 ms 0.516 ms 1357 TFLOPS=4.33507 bytes_per_second=11.8025Gi/s
{hgemm:kernel_type::wmma_opt_4,m:2048,n:2048,k:2048}/manual_time 3.49 ms 3.50 ms 189 TFLOPS=4.95832 bytes_per_second=6.71357Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:4096,k:4096}/manual_time 34.5 ms 34.4 ms 20 TFLOPS=3.98225 bytes_per_second=2.7157Gi/s
{hgemm:kernel_type::wmma_opt_4,m:8192,n:8192,k:8192}/manual_time 310 ms 309 ms 2 TFLOPS=3.54481 bytes_per_second=1.209Gi/s
{hgemm:kernel_type::rocblas,m:1024,n:1024,k:1024}/manual_time 0.811 ms 0.831 ms 849 TFLOPS=2.64929 bytes_per_second=7.22589Gi/s
{hgemm:kernel_type::rocblas,m:2048,n:2048,k:2048}/manual_time 6.35 ms 6.36 ms 91 TFLOPS=2.71682 bytes_per_second=3.68985Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:4096,k:4096}/manual_time 28.6 ms 28.5 ms 24 TFLOPS=4.81047 bytes_per_second=3.28086Gi/s
{hgemm:kernel_type::rocblas,m:8192,n:8192,k:8192}/manual_time 227 ms 226 ms 3 TFLOPS=4.85445 bytes_per_second=1.65531Gi/s
As you can see, it is horrible.
For this APU we need to change the config size: 4x2 / 2x4 instead of 4x4 / 4x4. With that I get:
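For anyone reading the counters: the TFLOPS and bytes_per_second columns look consistent with 2·m·n·k FLOPs and 2 bytes per element of A, B and C, divided by the manual time. That is my reading of the numbers, not something taken from the benchmark source, and the function names below are made up for illustration. A minimal sketch:

```python
def hgemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """TFLOPS for one m x n x k GEMM, counting each multiply-add as 2 FLOPs."""
    flops = 2.0 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12


def hgemm_gib_per_s(m: int, n: int, k: int, time_ms: float) -> float:
    """GiB/s assuming A, B and C are each touched once as fp16 (2 bytes)."""
    nbytes = 2 * (m * k + k * n + m * n)
    return nbytes / (time_ms * 1e-3) / 2**30


# Rows from the default-config table; the logged times are rounded,
# so the results only match the logged counters to about 2-3 digits.
print(f"shared  1024^3: {hgemm_tflops(1024, 1024, 1024, 3.48):.3f} TFLOPS")
print(f"rocblas 8192^3: {hgemm_tflops(8192, 8192, 8192, 227):.3f} TFLOPS")
print(f"shared  1024^3: {hgemm_gib_per_s(1024, 1024, 1024, 3.48):.3f} GiB/s")
```

Note that bytes_per_second drops as the matrices grow because the work grows as m·n·k while the counted traffic only grows with the matrix areas, so TFLOPS is the column to compare.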
-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
{hgemm:kernel_type::shared,m:1024,n:1024,k:1024}/manual_time 3.47 ms 3.48 ms 200 TFLOPS=0.619061 bytes_per_second=1.68908Gi/s
{hgemm:kernel_type::shared,m:2048,n:2048,k:2048}/manual_time 29.6 ms 29.5 ms 23 TFLOPS=0.58104 bytes_per_second=811.68Mi/s
{hgemm:kernel_type::shared,m:4096,n:4096,k:4096}/manual_time 245 ms 244 ms 3 TFLOPS=0.561046 bytes_per_second=391.886Mi/s
{hgemm:kernel_type::shared,m:8192,n:8192,k:8192}/manual_time 1947 ms 1931 ms 1 TFLOPS=0.564614 bytes_per_second=197.189Mi/s
{hgemm:kernel_type::wmma_naive,m:1024,n:1024,k:1024}/manual_time 2.24 ms 2.26 ms 292 TFLOPS=0.957852 bytes_per_second=2.61325Gi/s
{hgemm:kernel_type::wmma_naive,m:2048,n:2048,k:2048}/manual_time 13.4 ms 13.4 ms 52 TFLOPS=1.28136 bytes_per_second=1.7469Gi/s
{hgemm:kernel_type::wmma_naive,m:4096,n:4096,k:4096}/manual_time 169 ms 169 ms 4 TFLOPS=0.812453 bytes_per_second=567.486Mi/s
{hgemm:kernel_type::wmma_naive,m:8192,n:8192,k:8192}/manual_time 1599 ms 1594 ms 1 TFLOPS=0.687432 bytes_per_second=240.083Mi/s
{hgemm:kernel_type::wmma_shared,m:1024,n:1024,k:1024}/manual_time 1.01 ms 1.03 ms 535 TFLOPS=2.13061 bytes_per_second=5.81247Gi/s
{hgemm:kernel_type::wmma_shared,m:2048,n:2048,k:2048}/manual_time 9.44 ms 9.43 ms 68 TFLOPS=1.82273 bytes_per_second=2.48359Gi/s
{hgemm:kernel_type::wmma_shared,m:4096,n:4096,k:4096}/manual_time 84.6 ms 84.4 ms 7 TFLOPS=1.62375 bytes_per_second=1.10759Gi/s
{hgemm:kernel_type::wmma_shared,m:8192,n:8192,k:8192}/manual_time 893 ms 891 ms 1 TFLOPS=1.23072 bytes_per_second=429.824Mi/s
{hgemm:kernel_type::wmma_shared_warp,m:1024,n:1024,k:1024}/manual_time 0.862 ms 0.880 ms 594 TFLOPS=2.49131 bytes_per_second=6.79626Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:2048,n:2048,k:2048}/manual_time 10.2 ms 10.2 ms 94 TFLOPS=1.68698 bytes_per_second=2.2998Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:4096,n:4096,k:4096}/manual_time 82.2 ms 82.0 ms 9 TFLOPS=1.67288 bytes_per_second=1.14071Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:8192,n:8192,k:8192}/manual_time 747 ms 745 ms 1 TFLOPS=1.4718 bytes_per_second=514.019Mi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:1024,n:1024,k:1024}/manual_time 0.966 ms 0.983 ms 818 TFLOPS=2.22354 bytes_per_second=6.06639Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:2048,n:2048,k:2048}/manual_time 8.98 ms 8.96 ms 71 TFLOPS=1.91643 bytes_per_second=2.61072Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:4096,n:4096,k:4096}/manual_time 68.2 ms 68.0 ms 8 TFLOPS=2.01666 bytes_per_second=1.37555Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:8192,n:8192,k:8192}/manual_time 741 ms 739 ms 1 TFLOPS=1.48344 bytes_per_second=518.086Mi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:1024,n:1024,k:1024}/manual_time 0.285 ms 0.306 ms 2239 TFLOPS=7.53191 bytes_per_second=20.5462Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:2048,n:2048,k:2048}/manual_time 2.81 ms 2.82 ms 244 TFLOPS=6.16142 bytes_per_second=8.34298Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:4096,n:4096,k:4096}/manual_time 30.9 ms 30.9 ms 22 TFLOPS=4.44535 bytes_per_second=3.03155Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:8192,n:8192,k:8192}/manual_time 362 ms 361 ms 2 TFLOPS=3.03907 bytes_per_second=1.0365Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:1024,n:1024,k:1024}/manual_time 0.275 ms 0.296 ms 2341 TFLOPS=7.80504 bytes_per_second=21.2901Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:2048,n:2048,k:2048}/manual_time 2.60 ms 2.61 ms 247 TFLOPS=6.6617 bytes_per_second=9.03007Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:4096,n:4096,k:4096}/manual_time 35.5 ms 35.4 ms 20 TFLOPS=3.8728 bytes_per_second=2.64106Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:8192,n:8192,k:8192}/manual_time 387 ms 386 ms 2 TFLOPS=2.84352 bytes_per_second=993.086Mi/s
{hgemm:kernel_type::wmma_prefetch,m:1024,n:1024,k:1024}/manual_time 0.281 ms 0.301 ms 2296 TFLOPS=7.65088 bytes_per_second=20.8711Gi/s
{hgemm:kernel_type::wmma_prefetch,m:2048,n:2048,k:2048}/manual_time 2.71 ms 2.72 ms 256 TFLOPS=6.38203 bytes_per_second=8.64167Gi/s
{hgemm:kernel_type::wmma_prefetch,m:4096,n:4096,k:4096}/manual_time 32.6 ms 32.5 ms 22 TFLOPS=4.23867 bytes_per_second=2.87918Gi/s
{hgemm:kernel_type::wmma_prefetch,m:8192,n:8192,k:8192}/manual_time 384 ms 383 ms 2 TFLOPS=2.86278 bytes_per_second=999.649Mi/s
{hgemm:kernel_type::wmma_opt_1,m:1024,n:1024,k:1024}/manual_time 0.262 ms 0.284 ms 2448 TFLOPS=8.18948 bytes_per_second=22.3256Gi/s
{hgemm:kernel_type::wmma_opt_1,m:2048,n:2048,k:2048}/manual_time 2.48 ms 2.50 ms 259 TFLOPS=6.95749 bytes_per_second=9.43684Gi/s
{hgemm:kernel_type::wmma_opt_1,m:4096,n:4096,k:4096}/manual_time 32.0 ms 31.9 ms 21 TFLOPS=4.29709 bytes_per_second=2.92791Gi/s
{hgemm:kernel_type::wmma_opt_1,m:8192,n:8192,k:8192}/manual_time 373 ms 372 ms 2 TFLOPS=2.94894 bytes_per_second=1.00576Gi/s
{hgemm:kernel_type::wmma_opt_2,m:1024,n:1024,k:1024}/manual_time 0.224 ms 0.245 ms 2825 TFLOPS=9.6022 bytes_per_second=26.1952Gi/s
{hgemm:kernel_type::wmma_opt_2,m:2048,n:2048,k:2048}/manual_time 1.66 ms 1.67 ms 342 TFLOPS=10.5321 bytes_per_second=14.1535Gi/s
{hgemm:kernel_type::wmma_opt_2,m:4096,n:4096,k:4096}/manual_time 12.7 ms 12.7 ms 47 TFLOPS=10.7918 bytes_per_second=7.3593Gi/s
{hgemm:kernel_type::wmma_opt_2,m:8192,n:8192,k:8192}/manual_time 146 ms 146 ms 6 TFLOPS=7.63096 bytes_per_second=2.56267Gi/s
{hgemm:kernel_type::wmma_opt_3,m:1024,n:1024,k:1024}/manual_time 0.371 ms 0.391 ms 1846 TFLOPS=5.79339 bytes_per_second=15.8025Gi/s
{hgemm:kernel_type::wmma_opt_3,m:2048,n:2048,k:2048}/manual_time 2.83 ms 2.83 ms 241 TFLOPS=6.12953 bytes_per_second=8.28346Gi/s
{hgemm:kernel_type::wmma_opt_3,m:4096,n:4096,k:4096}/manual_time 22.1 ms 22.1 ms 31 TFLOPS=6.21059 bytes_per_second=4.23623Gi/s
{hgemm:kernel_type::wmma_opt_3,m:8192,n:8192,k:8192}/manual_time 209 ms 208 ms 3 TFLOPS=5.261 bytes_per_second=1.79432Gi/s
{hgemm:kernel_type::wmma_opt_4,m:1024,n:1024,k:1024}/manual_time 0.240 ms 0.261 ms 2650 TFLOPS=8.94375 bytes_per_second=24.3992Gi/s
{hgemm:kernel_type::wmma_opt_4,m:2048,n:2048,k:2048}/manual_time 1.84 ms 1.86 ms 318 TFLOPS=9.50211 bytes_per_second=12.7354Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:4096,k:4096}/manual_time 14.6 ms 14.6 ms 47 TFLOPS=9.43729 bytes_per_second=6.4159Gi/s
{hgemm:kernel_type::wmma_opt_4,m:8192,n:8192,k:8192}/manual_time 142 ms 142 ms 5 TFLOPS=7.87731 bytes_per_second=2.63676Gi/s
{hgemm:kernel_type::rocblas,m:1024,n:1024,k:1024}/manual_time 0.809 ms 0.830 ms 861 TFLOPS=2.65381 bytes_per_second=7.2403Gi/s
{hgemm:kernel_type::rocblas,m:2048,n:2048,k:2048}/manual_time 6.33 ms 6.33 ms 91 TFLOPS=2.72738 bytes_per_second=3.70353Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:4096,k:4096}/manual_time 28.2 ms 28.1 ms 25 TFLOPS=4.88062 bytes_per_second=3.32902Gi/s
{hgemm:kernel_type::rocblas,m:8192,n:8192,k:8192}/manual_time 224 ms 224 ms 3 TFLOPS=4.89919 bytes_per_second=1.67091Gi/s
-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
{hgemm:kernel_type::shared,m:4096,n:128,k:16384}/manual_time 37.9 ms 37.9 ms 19 TFLOPS=0.452836 bytes_per_second=3.42343Gi/s
{hgemm:kernel_type::shared,m:4096,n:256,k:16384}/manual_time 75.4 ms 75.2 ms 9 TFLOPS=0.455428 bytes_per_second=1.78626Gi/s
{hgemm:kernel_type::shared,m:4096,n:512,k:16384}/manual_time 151 ms 151 ms 5 TFLOPS=0.454019 bytes_per_second=977.804Mi/s
{hgemm:kernel_type::wmma_naive,m:4096,n:128,k:16384}/manual_time 36.9 ms 36.8 ms 18 TFLOPS=0.465934 bytes_per_second=3.52246Gi/s
{hgemm:kernel_type::wmma_naive,m:4096,n:256,k:16384}/manual_time 77.3 ms 77.1 ms 9 TFLOPS=0.444647 bytes_per_second=1.74387Gi/s
{hgemm:kernel_type::wmma_naive,m:4096,n:512,k:16384}/manual_time 150 ms 149 ms 5 TFLOPS=0.459304 bytes_per_second=989.148Mi/s
{hgemm:kernel_type::wmma_shared,m:4096,n:128,k:16384}/manual_time 13.7 ms 13.7 ms 54 TFLOPS=1.25153 bytes_per_second=9.45662Gi/s
{hgemm:kernel_type::wmma_shared,m:4096,n:256,k:16384}/manual_time 26.0 ms 25.9 ms 25 TFLOPS=1.32278 bytes_per_second=5.18818Gi/s
{hgemm:kernel_type::wmma_shared,m:4096,n:512,k:16384}/manual_time 52.5 ms 52.4 ms 12 TFLOPS=1.30833 bytes_per_second=2.75167Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:4096,n:128,k:16384}/manual_time 12.9 ms 12.9 ms 53 TFLOPS=1.32921 bytes_per_second=10.0403Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:4096,n:256,k:16384}/manual_time 20.6 ms 20.6 ms 34 TFLOPS=1.66657 bytes_per_second=6.53644Gi/s
{hgemm:kernel_type::wmma_shared_warp,m:4096,n:512,k:16384}/manual_time 44.7 ms 44.6 ms 16 TFLOPS=1.53647 bytes_per_second=3.23149Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:4096,n:128,k:16384}/manual_time 13.0 ms 13.0 ms 53 TFLOPS=1.32251 bytes_per_second=9.99036Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:4096,n:256,k:16384}/manual_time 20.2 ms 20.1 ms 34 TFLOPS=1.70177 bytes_per_second=6.67452Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf,m:4096,n:512,k:16384}/manual_time 43.4 ms 43.3 ms 16 TFLOPS=1.58234 bytes_per_second=3.32796Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:4096,n:128,k:16384}/manual_time 3.02 ms 3.02 ms 226 TFLOPS=5.78647 bytes_per_second=43.067Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:4096,n:256,k:16384}/manual_time 5.40 ms 5.40 ms 103 TFLOPS=6.41001 bytes_per_second=24.9471Gi/s
{hgemm:kernel_type::wmma_shared_warp_vec,m:4096,n:512,k:16384}/manual_time 19.1 ms 19.0 ms 37 TFLOPS=3.60631 bytes_per_second=7.5767Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:4096,n:128,k:16384}/manual_time 2.70 ms 2.71 ms 242 TFLOPS=6.49395 bytes_per_second=48.0907Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:4096,n:256,k:16384}/manual_time 5.26 ms 5.26 ms 107 TFLOPS=6.587 bytes_per_second=25.6191Gi/s
{hgemm:kernel_type::wmma_shared_warp_buf_vec,m:4096,n:512,k:16384}/manual_time 18.5 ms 18.5 ms 37 TFLOPS=3.71419 bytes_per_second=7.7919Gi/s
{hgemm:kernel_type::wmma_prefetch,m:4096,n:128,k:16384}/manual_time 2.43 ms 2.44 ms 264 TFLOPS=7.2384 bytes_per_second=53.433Gi/s
{hgemm:kernel_type::wmma_prefetch,m:4096,n:256,k:16384}/manual_time 4.81 ms 4.81 ms 113 TFLOPS=7.21005 bytes_per_second=28.0302Gi/s
{hgemm:kernel_type::wmma_prefetch,m:4096,n:512,k:16384}/manual_time 13.9 ms 13.9 ms 48 TFLOPS=5.03227 bytes_per_second=10.4089Gi/s
{hgemm:kernel_type::wmma_opt_1,m:4096,n:128,k:16384}/manual_time 2.50 ms 2.50 ms 257 TFLOPS=7.02371 bytes_per_second=52.0044Gi/s
{hgemm:kernel_type::wmma_opt_1,m:4096,n:256,k:16384}/manual_time 4.61 ms 4.61 ms 114 TFLOPS=7.52881 bytes_per_second=29.2088Gi/s
{hgemm:kernel_type::wmma_opt_1,m:4096,n:512,k:16384}/manual_time 16.8 ms 16.7 ms 41 TFLOPS=4.12055 bytes_per_second=8.62215Gi/s
{hgemm:kernel_type::wmma_opt_2,m:4096,n:128,k:16384}/manual_time 2.12 ms 2.13 ms 297 TFLOPS=8.26598 bytes_per_second=61.2295Gi/s
{hgemm:kernel_type::wmma_opt_2,m:4096,n:256,k:16384}/manual_time 3.46 ms 3.47 ms 200 TFLOPS=10.0568 bytes_per_second=38.9209Gi/s
{hgemm:kernel_type::wmma_opt_2,m:4096,n:512,k:16384}/manual_time 6.85 ms 6.84 ms 88 TFLOPS=10.3186 bytes_per_second=21.1117Gi/s
{hgemm:kernel_type::wmma_opt_3,m:4096,n:128,k:16384}/manual_time 4.59 ms 4.48 ms 114 TFLOPS=3.77939 bytes_per_second=28.2773Gi/s
{hgemm:kernel_type::wmma_opt_3,m:4096,n:256,k:16384}/manual_time 7.70 ms 7.69 ms 81 TFLOPS=4.47885 bytes_per_second=17.497Gi/s
{hgemm:kernel_type::wmma_opt_3,m:4096,n:512,k:16384}/manual_time 13.3 ms 12.8 ms 50 TFLOPS=5.175 bytes_per_second=10.8775Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:128,k:16384}/manual_time 3.27 ms 3.27 ms 206 TFLOPS=5.31326 bytes_per_second=39.7381Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:256,k:16384}/manual_time 4.15 ms 4.00 ms 166 TFLOPS=8.36172 bytes_per_second=32.447Gi/s
{hgemm:kernel_type::wmma_opt_4,m:4096,n:512,k:16384}/manual_time 8.10 ms 7.87 ms 77 TFLOPS=8.50537 bytes_per_second=17.835Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:128,k:16384}/manual_time 8.57 ms 8.56 ms 76 TFLOPS=2.00911 bytes_per_second=15.1509Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:256,k:16384}/manual_time 16.6 ms 16.5 ms 43 TFLOPS=2.07607 bytes_per_second=8.14096Gi/s
{hgemm:kernel_type::rocblas,m:4096,n:512,k:16384}/manual_time 33.2 ms 33.1 ms 20 TFLOPS=2.07233 bytes_per_second=4.35838Gi/s
The peak perf is ~18 TFLOPS…
What is interesting is that, for now, my kernel only achieves ~5 TFLOPS, so it looks like I can do better.
The other point is that I tuned it for my GPU, and that tuning is not the best for the MAX GPU.
I need to look more closely, but I can probably gain some on my kernel for the “old” Ryzen 7940HS, and a lot more for the MAX (x3 ???).
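As a rough cross-check of that peak figure, here is a small sketch that estimates the FP16 peak from the compute unit count and max clock reported by rocminfo further down (12 CUs, 2799 MHz). The 512 FP16 FLOPs per clock per CU is an assumption on my side for RDNA3 WMMA and is not stated anywhere in these logs, so treat the result as an order of magnitude:

```python
def peak_fp16_tflops(compute_units: int, clock_mhz: float,
                     flops_per_clock_per_cu: int = 512) -> float:
    """Theoretical peak; flops_per_clock_per_cu is an assumed value."""
    return compute_units * flops_per_clock_per_cu * clock_mhz * 1e6 / 1e12


# Values taken from the rocminfo output for the 780M (gfx1103).
peak = peak_fp16_tflops(compute_units=12, clock_mhz=2799)

# Best TFLOPS in the tuned square-matrix table (wmma_opt_2, 4096^3).
best_measured = 10.7918

print(f"estimated peak : {peak:.1f} TFLOPS")
print(f"best bench     : {best_measured / peak:.0%} of estimated peak")
```

With these assumptions the estimate lands a bit above 17 TFLOPS, which is in the same range as the ~18 TFLOPS quoted above.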

(Keep in mind that the bench computes A[fp16]@B[fp16]=C[fp16], whereas we need to compute trans(A[fp16/bf16])@B[fp32]=C[fp32]…)
What do you get with rocminfo? On mine I get:
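To make the difference between the two operations concrete, here is a tiny pure-Python reference, only a sketch. It uses `struct` to round values to fp16 / fp32, accumulates in Python floats (the accumulation precision of the real kernels is not specified here, so that part is a simplification), and covers the fp16 case only, not bf16. The function names are hypothetical:

```python
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def gemm_bench(a, b):
    """What the bench does: C[fp16] = A[fp16] @ B[fp16], A is m x k, B is k x n."""
    m, k, n = len(a), len(b), len(b[0])
    return [[to_fp16(sum(to_fp16(a[i][p]) * to_fp16(b[p][j]) for p in range(k)))
             for j in range(n)] for i in range(m)]


def gemm_target(a_t, b):
    """What we need: C[fp32] = trans(A[fp16]) @ B[fp32], A stored as k x m."""
    k, m, n = len(a_t), len(a_t[0]), len(b[0])
    return [[to_fp32(sum(to_fp16(a_t[p][i]) * to_fp32(b[p][j]) for p in range(k)))
             for j in range(n)] for i in range(m)]


# Same math (1024 + 1025), but the fp16 output cannot represent 2049.
print(gemm_bench([[1024.0, 1025.0]], [[1.0], [1.0]]))     # A given as 1 x 2
print(gemm_target([[1024.0], [1025.0]], [[1.0], [1.0]]))  # A stored as 2 x 1
```

Besides the output precision, the transposed layout of A changes the memory access pattern, so the tuning found with the bench may not carry over directly.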
Name: gfx1103
Uuid: GPU-XX
Marketing Name: AMD Radeon 780M
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 2048(0x800) KB
Chip ID: 5567(0x15bf)
ASIC Revision: 7(0x7)
Cacheline Size: 128(0x80)
Max Clock Freq. (MHz): 2799
BDFID: 49920
Internal Node ID: 1
Compute Unit: 12
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties: APU
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 40
SDMA engine uCode:: 21
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 28688300(0x1b5bfac) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 28688300(0x1b5bfac) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1103
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Curious to know what differences there are…