Nvidia eGPU Randomly Crashes on Ubuntu 22.04 (FW 13 AMD)

Hey everyone, I’m having pretty severe issues with my eGPU setup and could use some help diagnosing and hopefully fixing the problem.

My setup:

  • Framework Laptop 13 w/ Ryzen 5 7640U (BIOS version 03.05)
  • 96 GB DDR5 RAM
  • Ubuntu 22.04 (Wayland) with the 6.5.0-1025-oem kernel
  • NVIDIA RTX 3090 in a Razer Core X eGPU enclosure

The issue:

My eGPU randomly crashes and becomes unusable. The behavior is highly inconsistent, making it difficult to pinpoint the cause. Here’s what I’ve observed:

  1. Sometimes it crashes during high load, especially when running Stable Diffusion (using the Fooocus web UI as a frontend). This can happen even after just 2 minutes of use.

  2. Other times, it runs fine for 20+ minutes at almost 100% load without issues.

  3. Occasionally, it crashes even when there’s barely any load.

This inconsistency suggests that the crashes are not caused by high GPU load alone.
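
One way to check whether the crashes correlate with load, temperature, power draw, or a degraded PCIe link would be to log those values once per second and look at the last samples before a crash. The following is only a minimal Python sketch under a few assumptions: `nvidia-smi` is on the PATH, the 3090 is the first NVIDIA GPU it lists, and the file name `gpu_samples.csv` and the function names are placeholders chosen for illustration.

```python
import csv
import os
import subprocess
import time

# Fields requested via `nvidia-smi --query-gpu=...`.
FIELDS = [
    "timestamp",
    "temperature.gpu",
    "power.draw",
    "utilization.gpu",
    "pcie.link.gen.current",
    "pcie.link.width.current",
]


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a {field: value} dict."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}: {line!r}")
    return dict(zip(FIELDS, values))


def read_sample():
    """Query the GPU once and return the first GPU's values."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True, timeout=10,
    ).stdout
    return parse_sample(out.splitlines()[0])


def main(path="gpu_samples.csv", interval_s=1.0):
    """Append one sample per interval until interrupted (hypothetical file name)."""
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        while True:
            try:
                writer.writerow(read_sample())
            except (subprocess.SubprocessError, OSError, ValueError, IndexError) as exc:
                # Record the failure itself; it marks when the GPU stopped answering.
                fh.write(f"# query failed at {time.strftime('%H:%M:%S')}: {exc}\n")
            # Flush and fsync so recent samples are more likely to survive a hard reboot.
            fh.flush()
            os.fsync(fh.fileno())
            time.sleep(interval_s)


if __name__ == "__main__":
    main()
```

Running it in a second terminal while Fooocus is generating images would leave a CSV whose final rows show the GPU state right before the crash. A PCIe link width or generation that drops shortly beforehand would point toward the Thunderbolt/USB4 link rather than the GPU load itself.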

Additional observations:

  1. To get the eGPU detected, I have to power off the laptop, plug in the eGPU, and then power the laptop back on. This works about 8 out of 10 times.

  2. Once detected, it works fine initially. I use the 3090 solely for AI workloads, such as local LLMs, Stable Diffusion, and others.

  3. I’ve installed the NVIDIA 535.183.01 driver and blacklisted the open-source Nouveau driver.

  4. When the eGPU crashes, the whole system starts to stutter: the laptop freezes repeatedly and sometimes reboots outright.
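
For the detection problem in the first observation, it may help to check whether the card shows up on the PCI bus at all, independently of the NVIDIA driver. The sketch below reads the `vendor` files under `/sys/bus/pci/devices` and looks for NVIDIA's PCI vendor ID (`0x10de`). The function name `find_nvidia_devices` is made up for illustration, and the check says nothing about whether the driver loaded successfully.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def find_nvidia_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return the sorted PCI addresses whose vendor ID is NVIDIA's."""
    found = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return found
    for dev in root.iterdir():
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
        except OSError:
            # Entry vanished or is unreadable; skip it.
            continue
        if vendor == NVIDIA_VENDOR_ID:
            found.append(dev.name)
    return sorted(found)


if __name__ == "__main__":
    devices = find_nvidia_devices()
    if devices:
        print("NVIDIA PCI functions present:", ", ".join(devices))
    else:
        print("No NVIDIA device found on the PCI bus")
```

Running this after a failed boot and again after a crash would separate two cases: the enclosure never enumerating over Thunderbolt/USB4 (nothing listed) versus the device being present while the driver has lost contact with it.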

I’ve attached two log files:

  1. logs_before_crash.txt: Contains journalctl, dmesg and other logs collected when the eGPU was working properly.

  2. logs_after_crash.txt: Contains the same logs collected immediately after the eGPU crash.
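
To narrow down where to look in logs like these, a small script can pull out the lines that usually matter for a GPU dropping off the bus: NVIDIA `Xid` messages, "GPU has fallen off the bus", PCIe AER reports, and Thunderbolt events. This is a rough sketch, not a complete diagnostic. The pattern list is an assumption about what is worth flagging, and it has not been run against the attached files.

```python
import re
import sys

# Patterns that commonly accompany an eGPU disconnect (an assumed, non-exhaustive list).
PATTERNS = {
    "xid": re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+)"),
    "fallen_off_bus": re.compile(r"GPU has fallen off the bus", re.IGNORECASE),
    "aer": re.compile(r"\bAER\b"),
    "thunderbolt": re.compile(r"\bthunderbolt\b", re.IGNORECASE),
}


def scan(lines):
    """Return (line_number, label, text) for every pattern match, in file order."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.rstrip()))
    return hits


def xid_codes(lines):
    """Return the numeric Xid codes found, in order of appearance."""
    codes = []
    for line in lines:
        match = PATTERNS["xid"].search(line)
        if match:
            codes.append(int(match.group(1)))
    return codes


if __name__ == "__main__":
    # Usage: python scan_logs.py logs_before_crash.txt logs_after_crash.txt
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            lines = fh.readlines()
        print(f"== {path}: Xid codes {xid_codes(lines)}")
        for number, label, text in scan(lines):
            print(f"{number:>7} [{label}] {text}")
```

Comparing the output for the two files shows which of these messages appear only after the crash. Any Xid codes it reports can then be looked up in NVIDIA's Xid documentation.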

I’ve tried updating drivers and checking connections, but the issue persists. The randomness of the crashes and the system-wide impact make the situation particularly frustrating.

Any ideas on what could be causing this or what else I should check? Thanks in advance for any help!