Transcode using ffmpeg With AMD dGPU

I’m struggling to figure out how to use ffmpeg to encode/decode video in hardware on the Framework 16 with the AMD dGPU (Radeon RX 7700S). I believe ffmpeg supports hardware encode/decode with VAAPI on kernel 6.1 and Mesa 23, both of which I have installed, but when I try to transcode from H.265 to H.264, it only runs at maybe 4x speed. I can’t seem to get Vulkan to work at all (which I think now supports this? as of the later versions).

I’m running Arch with 6.8.7 kernel; Framework 16 with the AMD Radeon RX 7700S.
I have mesa 24 and libva 2.21 and the amdgpu drivers also installed.

I’m attempting the following command as a test to see what kind of throughput I can get. The media is located on the internal NVMe drive (so disk read/write shouldn’t really be a concern). I have 64GB of RAM, so that should be plenty (the file is only 10GB). The file is HEVC with HDR in 4K.

ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mkv -c:v h264_vaapi -vf 'format=nv12,hwupload'  -f null -

As mentioned, this runs at about 4x (3.8-4), but when I run the equivalent transcode on a Debian 12 Linux machine with an NVIDIA 1050 Ti, it runs at around 8x (7.5-8).

I would have assumed that this newer card would be able to handle a transcode much better than a card 5+ years old.

To top it off, nvtop only seems to indicate about 25% of the dGPU being used (this lets me know that I’m at least engaging the dGPU, but it’s definitely not being fully utilized). This really makes me think that maybe I’m decoding on the CPU and then encoding on the dGPU, which would be much slower, but I can’t seem to figure out how to do the whole transcode on the dGPU.

I’ve followed the pages on the ffmpeg wiki, but I can’t quite get it all working correctly, and my knowledge of ffmpeg is quite limited.

I know this example is supposed to be “all in hardware”:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mp4 -c:v h264_vaapi -b:v 2M -maxrate 2M output.mp4

But I get a failure:

No usable encoding profile found.

I think this means I need to set up something in the -vf flag, but I don’t know what.

At the end of the day, I’m just trying to get the most speed out of transcoding; mostly from HEVC to H.264, sometimes 4K to 1080p, or just 4K to 4K (with a different encoder).

I’m very lost and I can’t seem to find much online. A lot of what’s out there is really old, and it seems that a lot of new advancements have been made to the software to enable this type of workflow, but I know so little about this that I’m just looking for guidance.

Have you taken a look at the wiki? Hardware/VAAPI – FFmpeg

Yes. I even linked that exact page, and lifted an example command and its output when tried on my system.

Ah, yeah ok. Hm, then maybe it’s related to not supporting multithreading: #10566 (AV1 decoding with -hwaccel d3d11va performance issue) – FFmpeg

While that may be a possible cause, the specific issue you’re linking to isn’t using the same hardware acceleration or codec, so it doesn’t really apply here.

My bad, for some reason I thought that was the same thing. Looking at the hardware acceleration intro page, there is a table saying that VAAPI support is only partial on Linux with AMD. Link to the table: HWAccelIntro – FFmpeg

Now, there are several things you talk about.

I can’t seem to get Vulkan to work at all (which I think now supports this? as of the later versions).

You mean in general or for ffmpeg? Because right now ffmpeg does, in theory, support using the Vulkan video extensions, but I’m not sure if they can be used for encoding yet. Especially on AMD hardware, that won’t be possible right now. Instead, AMD is pushing their cumbersome AMF, which can’t run on something like 99% of all Linux distros: https://www.phoronix.com/news/AMD-AMF-FFmpeg-Better-2024

If you can’t get Vulkan to work at all, that’s a whole different issue.

Now, having multiple GPUs on Linux is still quite a pain. Nvidia does seem to have some kind of working support through PRIME, but finding anything on AMD dGPUs is pretty much impossible; for example, how to tell apps on Wayland (including the DE) which GPU to use. The situation is similar for ffmpeg. You could check if there is another hwaccel_device in /dev/dri; maybe it’s handled that way. In theory you should also have /dev/dri/renderD129. But your error message from trying to encode on D128 makes me guess that the VA-API drivers aren’t properly set up. So what does vainfo say?
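Something like this (untested, assuming vainfo from libva-utils is installed) should enumerate the render nodes and query each driver directly:

```shell
# List the DRM render nodes; on a dual-GPU laptop the iGPU and dGPU
# usually show up as renderD128 and renderD129 (order not guaranteed).
ls -l /dev/dri/

# Query the VA-API driver behind each node explicitly, forcing the
# drm display so it skips the Wayland/X11 probing errors.
vainfo --display drm --device /dev/dri/renderD128
vainfo --display drm --device /dev/dri/renderD129
```

That way you can see which node actually belongs to the RX 7700S and which profiles each driver exposes.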

I do have two devices listed in /dev/dri, renderD128 and renderD129, and I believe that renderD128 is the dedicated GPU; at least when I was trying ffmpeg and watching nvtop, it was hitting the dedicated GPU, but only at about 25%.

Here is the output of vainfo

Trying display: wayland
error: XDG_RUNTIME_DIR is invalid or not set in the environment.
Trying display: x11
Authorization required, but no authorization protocol specified

error: can't connect to X server!
Trying display: drm
vainfo: VA-API version: 1.21 (libva 2.21.0)
vainfo: Driver version: Mesa Gallium driver 24.0.7-arch1.3 for AMD Radeon RX 7700S (radeonsi, navi33, LLVM 17.0.6, DRM 3.57, 6.8.9-arch1-2)
vainfo: Supported profile and entrypoints
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileHEVCMain10             : VAEntrypointEncSlice
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileVP9Profile0            : VAEntrypointVLD
      VAProfileVP9Profile2            : VAEntrypointVLD
      VAProfileAV1Profile0            : VAEntrypointVLD
      VAProfileAV1Profile0            : VAEntrypointEncSlice
      VAProfileNone                   : VAEntrypointVideoProc

Then I don’t really see any reason why it should fail. Have you tried encoding with another codec, like AV1?

Worst case, what does ffmpeg -encoders | grep h264 show? Maybe for some reason vaapi support hasn’t been included in ffmpeg at compile time.
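For reference, I’d check something like this (exact output will vary by build):

```shell
# List encoders whose name mentions h264; a build with VAAPI support
# should show h264_vaapi alongside libx264.
ffmpeg -hide_banner -encoders | grep h264

# And the hwaccel methods compiled into this build (vaapi should appear).
ffmpeg -hide_banner -hwaccels
```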

Also, I’m not sure if nvtop can show AMD VCN usage. But this tool can: GitHub - Umio-Yasuno/amdgpu_top: Tool to display AMDGPU usage

The full output of your failing command might help. Also with -loglevel verbose or -loglevel debug maybe.

If I had to take a guess, I’d say that maybe your input in the second test has a format that the GPU can decode but can’t encode… taking in 10-bit input in something like H.265 and trying to encode to H.264 would do it, as an example.

If that’s the problem you might fix it with -vf scale_vaapi=format=nv12
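Putting that together, a full-hardware command would look roughly like this (just a sketch, untested on your machine; bitrates and filenames are placeholders):

```shell
# Decode on the GPU, keep frames in VAAPI surfaces, convert the 10-bit
# p010 surfaces to 8-bit nv12 on the GPU, then encode with h264_vaapi.
ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
       -hwaccel_output_format vaapi -i input.mkv \
       -vf 'scale_vaapi=format=nv12' \
       -c:v h264_vaapi -b:v 8M -maxrate 12M \
       -c:a copy output.mkv
```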

@Richard

I have not. It’s on my list of things to try, but I know my video setup plays nice with HEVC/H.265 and H.264 (non-4k).

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mkv -c:v h264_vaapi -b:v 8M -maxrate 12M -v verbose output.mp4 -y 2>&1 | tee ffmpeg-log
ffmpeg version n6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 13.2.1 (GCC) 20230801
  configuration: --prefix=/usr --disable-debug --disable-static --disable-stripping --enable-amf --enable-avisynth --enable-cuda-llvm --enable-lto --enable-fontconfig --enable-frei0r --enable-gmp --enable-gnutls --enable-gpl --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libdav1d --enable-libdrm --enable-libfreetype --enable-libfribidi --enable-libgsm --enable-libharfbuzz --enable-libiec61883 --enable-libjack --enable-libjxl --enable-libmodplug --enable-libmp3lame --enable-libopencore_amrnb --enable-libopencore_amrwb --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libplacebo --enable-libpulse --enable-librav1e --enable-librsvg --enable-librubberband --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtheora --enable-libv4l2 --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpl --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxcb --enable-libxml2 --enable-libxvid --enable-libzimg --enable-nvdec --enable-nvenc --enable-opencl --enable-opengl --enable-shared --enable-vapoursynth --enable-version3 --enable-vulkan
  libavutil      58. 29.100 / 58. 29.100
  libavcodec     60. 31.102 / 60. 31.102
  libavformat    60. 16.100 / 60. 16.100
  libavdevice    60.  3.100 / 60.  3.100
  libavfilter     9. 12.100 /  9. 12.100
  libswscale      7.  5.100 /  7.  5.100
  libswresample   4. 12.100 /  4. 12.100
  libpostproc    57.  3.100 / 57.  3.100
Selecting decoder 'hevc' because of requested hwaccel method vaapi
Input #0, matroska,webm, from 'input.mkv':
  Metadata:
    encoder         : libebml v1.4.4 + libmatroska v1.7.1
  Duration: 01:14:57.57, start: 0.000000, bitrate: 19008 kb/s
  Chapters:
    <truncated>
  Stream #0:0: Video: hevc (Main 10), 1 reference frame, yuv420p10le(tv, bt2020nc/bt2020/smpte2084, left), 3840x2160 [SAR 1:1 DAR 16:9], 24 fps, 24 tbr, 1k tbn (default)
    Metadata:
      BPS             : 18427507
      DURATION        : 01:14:57.500000000
      NUMBER_OF_FRAMES: 107940
      NUMBER_OF_BYTES : 10359714157
      _STATISTICS_WRITING_APP: mkvmerge v81.0 ('Milliontown') 64-bit
      _STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
  Stream #0:1(eng): Audio: eac3 (Dolby Digital Plus + Dolby Atmos), 48000 Hz, 5.1(side), fltp, 576 kb/s (default)
    Metadata:
      BPS             : 576000
      DURATION        : 01:14:57.568000000
      NUMBER_OF_FRAMES: 140549
      NUMBER_OF_BYTES : 323824896
      _STATISTICS_WRITING_APP: mkvmerge v81.0 ('Milliontown') 64-bit
      _STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
  Stream #0:2(eng): Subtitle: subrip
    Metadata:
      BPS             : 39
      DURATION        : 01:09:25.792000000
      NUMBER_OF_FRAMES: 699
      NUMBER_OF_BYTES : 20804
      _STATISTICS_WRITING_APP: mkvmerge v81.0 ('Milliontown') 64-bit
      _STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
  Stream #0:3(eng): Subtitle: subrip (hearing impaired)
    Metadata:
      title           : SDH
      BPS             : 49
      DURATION        : 01:10:45.708000000
      NUMBER_OF_FRAMES: 946
      NUMBER_OF_BYTES : 26308
      _STATISTICS_WRITING_APP: mkvmerge v81.0 ('Milliontown') 64-bit
      _STATISTICS_TAGS: BPS DURATION NUMBER_OF_FRAMES NUMBER_OF_BYTES
  <truncated>
[out#0/mp4 @ 0x60f93a2f0800] No explicit maps, mapping streams automatically...
[vost#0:0/h264_vaapi @ 0x60f93a2f1180] Created video stream from input stream 0:0
[AVHWDeviceContext @ 0x60f93a31ba40] libva: VA-API version 1.21.0
[AVHWDeviceContext @ 0x60f93a31ba40] libva: Trying to open /usr/lib/dri/radeonsi_drv_video.so
[AVHWDeviceContext @ 0x60f93a31ba40] libva: Found init function __vaDriverInit_1_21
[AVHWDeviceContext @ 0x60f93a31ba40] libva: va_openDriver() returns 0
[AVHWDeviceContext @ 0x60f93a31ba40] Initialised VAAPI connection: version 1.21
[AVHWDeviceContext @ 0x60f93a31ba40] VAAPI driver: Mesa Gallium driver 24.0.7-arch1.3 for AMD Radeon RX 7700S (radeonsi, navi33, LLVM 17.0.6, DRM 3.57, 6.9.1-arch1-1).
[AVHWDeviceContext @ 0x60f93a31ba40] Driver not found in known nonstandard list, using standard behaviour.
[aost#0:1/aac @ 0x60f93b9923c0] Created audio stream from input stream 0:1
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (native) -> h264 (h264_vaapi))
  Stream #0:1 -> #0:1 (eac3 (native) -> aac (native))
Press [q] to stop, [?] for help
[graph_1_in_0_1 @ 0x60f93ba65700] tb:1/48000 samplefmt:fltp samplerate:48000 chlayout:5.1(side)
[aac @ 0x60f93b9927c0] Using a PCE to encode channel layout "5.1(side)"
[graph 0 input from stream 0:0 @ 0x60f93bcb0880] w:3840 h:2160 pixfmt:vaapi tb:1/1000 fr:24/1 sar:1/1
[h264_vaapi @ 0x60f93a2f1540] Using input frames context (format vaapi) with h264_vaapi encoder.
[h264_vaapi @ 0x60f93a2f1540] Input surface format is p010le.
[h264_vaapi @ 0x60f93a2f1540] Compatible profile VAProfileH264High10 (36) is not supported by driver.
[h264_vaapi @ 0x60f93a2f1540] No usable encoding profile found.
[vost#0:0/h264_vaapi @ 0x60f93a2f1180] Error while opening encoder - maybe incorrect parameters such as bit_rate, rate, width or height.
Error while filtering: Function not implemented
[vist#0:0/hevc @ 0x60f93a2f3980] Decoder thread received EOF packet
[vist#0:0/hevc @ 0x60f93a2f3980] Decoder returned EOF, finishing
[vist#0:0/hevc @ 0x60f93a2f3980] Terminating decoder thread
[aist#0:1/eac3 @ 0x60f93a2eb3c0] Decoder thread received EOF packet
[aist#0:1/eac3 @ 0x60f93a2eb3c0] Decoder returned EOF, finishing
[aist#0:1/eac3 @ 0x60f93a2eb3c0] Terminating decoder thread
[out#0/mp4 @ 0x60f93a2f0800] Nothing was written into output file, because at least one of its streams received no packets.
frame=    0 fps=0.0 q=0.0 Lsize=       0kB time=00:00:00.74 bitrate=   0.0kbits/s speed=6.93x    
[aac @ 0x60f93b9927c0] Qavg: 15165.507
[AVIOContext @ 0x60f93ba60780] Statistics: 0 bytes written, 0 seeks, 0 writeouts
[in#0/matroska,webm @ 0x60f93a276540] Terminating demuxer thread
[in#0/matroska,webm @ 0x60f93a276540] Input file #0 (input.mkv):
[in#0/matroska,webm @ 0x60f93a276540]   Input stream #0:0 (video): 18 packets read (1369656 bytes); 18 frames decoded; 0 decode errors; 
[in#0/matroska,webm @ 0x60f93a276540]   Input stream #0:1 (audio): 26 packets read (59904 bytes); 24 frames decoded; 0 decode errors (36864 samples); 
[in#0/matroska,webm @ 0x60f93a276540]   Total: 44 packets (1429560 bytes) demuxed
[AVIOContext @ 0x60f93a27ef00] Statistics: 1480432 bytes read, 2 seeks
Conversion failed!

I do see the following:

[h264_vaapi @ 0x60f93a2f1540] Compatible profile VAProfileH264High10 (36) is not supported by driver.

So, based on

If that’s the problem you might fix it with -vf scale_vaapi=format=nv12

I tried it and it worked! But how do I learn what other profiles are available? I recall having to do something similar with the NVIDIA box using yuv420p.

Sadly, I’m still showing ~3.8-4x speed :frowning: and nvtop shows only ~6% of the GPU being used, which is even lower than my earlier test. Performance is definitely not great…

Your vainfo output from before shows what profiles are supported.

The VAEntrypointVLD ones are for decoding, the VAEntrypointEncSlice ones are for encoding. Note that there’s no “10” profiles for H264 in that list, only for HEVC (though VP9 profile 2 and AV1 profile 0 also indicate 10-bit support). vainfo -a can give more details, including what pixel formats are allowed for each profile. (The nv12, or p010le for 10-bit, formats used in the filters are about the internal format, vs. the yuv420p and friends you’ll likely have as input/output).
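To answer the “how do I learn what profiles are available” part concretely, the full attribute dump looks like this (device path taken from earlier in the thread):

```shell
# Per-profile attribute dump, including which RT formats
# (YUV420, YUV420_10, ...) each entrypoint accepts.
vainfo --display drm --device /dev/dri/renderD128 -a
```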

As for encoding speed, you could try setting -async_depth higher to see if that does anything (the default is 2). But that may or may not accomplish much of anything.
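E.g. something like this (a sketch; whether a higher depth helps at all depends on the driver):

```shell
# Allow more frames in flight between decode and encode;
# -async_depth defaults to 2 for the VAAPI encoders.
ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
       -hwaccel_output_format vaapi -i input.mkv \
       -vf 'scale_vaapi=format=nv12' \
       -c:v h264_vaapi -async_depth 8 \
       -f null -
```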

If I look at the output of vainfo -a on the 7840HS, under VAProfileH264High/VAEntrypointEncSlice it does list VA_RT_FORMAT_YUV420, VA_RT_FORMAT_YUV420_10 and VA_RT_FORMAT_YUV420_10BPP under VAConfigAttribRTFormat, just like under VAProfileHEVCMain10/VAEntrypointEncSlice. Shouldn’t that mean that the H264 High profile does support 10-bit content?

@mire3212 As you can tell from the ffmpeg wiki you linked yourself, it seems that at least the filter -vf 'scale_vaapi=format=p010' is required to get 10-bit output. What you have done with nv12 probably converts it to 8-bit video. Also, while I have never transcoded videos at that resolution, my guess is that the H.264 codec just isn’t faster and nobody bothers to make it faster. Hardware codecs don’t guarantee you 10x, 20x or more, just that they are faster and more efficient than the software ones. And since the maximum for H.264 hardware encoding is 4096x4096, I’d guess 4x is already very fast. Of course you could compare it with the hardware codec in your CPU; after all, it’s very much a waste of energy to spin up the dGPU just to do something that could easily be done on the CPU. My guess is that there isn’t really a difference in encoding speed, and encoding with libx264 will most likely be way slower.

Yeah, I see the same. Regardless, ffmpeg will look for the H264High10 profile, which is the official H.264 profile for doing 4:2:0 10-bit. Whether H264High reporting the 10-bit options is a Mesa or AMD bug, or a quirk of the hardware maybe supporting 10 bits but not the actual Hi10P profile, or ffmpeg being too conservative in what it will attempt… I don’t know.

Whichever is the case, the practical answer is probably to just use H.265 or AV1 if you want to encode at 10-bit. 10-bit support’s just generally better for those anyway.
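Untested sketches of what those would look like (bitrates are arbitrary; av1_vaapi needs ffmpeg 6.0+ plus the AV1 encode entrypoint your vainfo already shows):

```shell
# Keep the 10-bit pipeline end to end: p010 surfaces into hevc_vaapi.
ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
       -hwaccel_output_format vaapi -i input.mkv \
       -c:v hevc_vaapi -profile:v main10 -b:v 8M \
       -c:a copy output-hevc.mkv

# Same idea with AV1; VAProfileAV1Profile0 covers 8- and 10-bit.
ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
       -hwaccel_output_format vaapi -i input.mkv \
       -c:v av1_vaapi -b:v 8M \
       -c:a copy output-av1.mkv
```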


I’d love to, but sadly the platform I’m trying to use it with (Plex) doesn’t like 4K over H.265 streaming to a Roku, but it does with H.264… go figure.

As I mentioned, I’m ok with 8-bit, but trying to use scale_vaapi=format=p010 to keep the 10-bit doesn’t work and throws the same error we saw before.

I’ll have to try AV1, that’s one I haven’t tried at all yet.

As for performance, I’ve done all of this with an NVIDIA 1050 Ti and an NVIDIA 4090; the 4090 can transcode this exact same content at 18x-20x, so this slowness is either (1) I’m doing it wrong with my command (which is why I’m posting here, because I really don’t know what I’m doing) or (2) the AMD version of all of this just isn’t as good as NVIDIA’s. I obviously don’t mean to compare my AMD 7700 to an NVIDIA 4090, but it does help convey that I know what’s possible.

Anyway, it seems that we’ve at least ironed out the command to (hopefully) get this to execute fully within the hardware (assuming that’s the fastest possible speed) so I suppose I’m stuck at about 4x at best…

Basically what happens is that, once the GPU decodes your 10-bit input, it’s already in p010 internally. That filter above will try to convert to p010, in other words in this case it will do nothing. The error occurs because ffmpeg can’t/won’t encode to 10-bit H.264, and it won’t change the pixel format on its own either. That’s why the nv12 filter fixes things: it’s just converting to 8-bit, and you’re getting regular 8-bit output.

In a case where you’re streaming to a box, there’s a reasonable chance the box can’t do H.264 10-bit anyway. It has much less support.

The p010 version of the filter would be useful if you wanted to always output to 10-bit regardless of what the input was.
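And for the 4K-to-1080p case mentioned earlier, the scale and the format conversion can be combined in one GPU-side filter pass (again just a sketch):

```shell
# Downscale on the GPU and convert to 8-bit in the same filter.
ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
       -hwaccel_output_format vaapi -i input.mkv \
       -vf 'scale_vaapi=w=1920:h=1080:format=nv12' \
       -c:v h264_vaapi -b:v 8M \
       -c:a copy output-1080p.mkv
```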

I’ve just done some testing with this 4Kp60 10-bit video, all on the CPU’s integrated codec (I don’t have any dGPUs for comparison), and just running on battery. Bitrate was set to 15M. Converting to 8-bit H.264 ran at 1.37x, to 10-bit HEVC with the Main10 profile (-profile:v main10, although most likely not needed) at 2x, and to 10-bit AV1 (no additional settings) at 2.3x. So I’d guess AMD’s H.264 encoder is just that slow, but they don’t bother improving it, as it would probably need more silicon that they would have to take from another feature.

Also, the Raspberry Pi 5 already took the step of removing H.264 hardware support, and I would guess this will become more widespread in the future. The industry is moving away from H.264 as it’s just too inefficient, and just as support for codecs like VC-1 or H.262 was removed to make room for more important things, H.264 will eventually be scrapped too. Also, all modern CPUs are more than capable of handling H.264, so there isn’t that much of a speed/efficiency improvement to be had here.

That makes sense. Do you think there’s anything else that might optimize this process? 4x feels pretty slow, but this is my first AMD hardware, so I’m not certain what it’s capable of in other contexts or deployment modes.