Artificial-intelligence algorithms are undoubtedly riding the crest of the wave: in recent months there has been nothing but talk of chatbots and text generators such as ChatGPT, and image generators like DALL-E 2. These tools generally rely on gigantic data centers packed with latest-generation hardware to carry out the necessary calculations, but even the modern graphics cards in your PCs contain hardware units dedicated to this purpose. So how fast are they, actually, at AI inference?

To find out, we used Stable Diffusion, a popular AI image generator, putting it to the test with the latest graphics cards from NVIDIA, AMD, and Intel. We used three different builds of Stable Diffusion for our tests, as no single package worked flawlessly on every GPU. For NVIDIA we chose Automatic1111's webui version, for AMD GPUs we used Nod.ai's SHARK build, and for Intel's Arc cards we opted for Stable Diffusion OpenVINO.

Results on NVIDIA's 30 series were satisfying, particularly when enabling xformers, which provides an additional 20% performance boost. The RTX 40 series, however, fell a bit short of expectations, likely due to the lack of optimizations for the new Ada Lovelace architecture. Results on AMD were also mixed: the RDNA 3 GPUs performed quite well, while the RDNA 2 cards struggled. Finally, Intel GPUs delivered final performance similar to AMD's, despite significantly longer rendering times, likely due to background processes. We used Stable Diffusion version 1.4 rather than the newer 2.0 or 2.1 releases, as running SD 2.1 on non-NVIDIA hardware would have required more work.

The parameters for the tests were the same for all GPUs (a code sketch mapping them onto an equivalent pipeline follows the list):

Positive Prompts:
postapocalyptic steampunk city, exploration, cinematic, realistic, hyper detailed, photorealistic maximum detail, volumetric light, (((focus))), wide-angle, (((brightly lit))), (((vegetation))), lightning, vines, destruction, devastation, wartorn, ruins

Negative Prompts:
(((blurry))), ((foggy)), (((dark))), ((monochrome)), sun, (((depth of field)))

Steps:
100

Classifier Free Guidance:
15.0

Sampling Algorithm:
Some Euler variant (Ancestral, Discrete)
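
As a point of reference, here is a minimal sketch of how these settings map onto Hugging Face's diffusers library. This is an equivalent illustration under our own assumptions, not the Automatic1111, SHARK, or OpenVINO harnesses the tests actually ran:

import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

# SD 1.4, as in the tests; this NVIDIA/CUDA path stands in for the
# vendor-specific builds the article used on AMD and Intel hardware
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# An Euler Ancestral sampler, one of the variants the article mentions
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

positive = ("postapocalyptic steampunk city, exploration, cinematic, realistic, "
            "hyper detailed, photorealistic maximum detail, volumetric light, "
            "(((focus))), wide-angle, (((brightly lit))), (((vegetation))), "
            "lightning, vines, destruction, devastation, wartorn, ruins")
negative = "(((blurry))), ((foggy)), (((dark))), ((monochrome)), sun, (((depth of field)))"

image = pipe(
    positive,
    negative_prompt=negative,
    num_inference_steps=100,   # Steps: 100
    guidance_scale=15.0,       # Classifier Free Guidance: 15.0
).images[0]
image.save("steampunk_city.png")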

The sampling algorithm doesn’t seem to impact performance significantly, although it may affect the output. Automatic1111 offers the most options, while the Intel OpenVINO build offers no choice at all. The table below shows the results obtained: note that each NVIDIA GPU has two results, one using the default computation model (in black) and a second using Facebook’s more efficient xformers library (in green).

Photo Credit: Tom’s Hardware
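
To give a concrete idea of what the xformers toggle involves, here is a hedged continuation of the sketch above: in the diffusers library it is a single call (Automatic1111 enables the same library via its --xformers launch flag), and iterations per second can be estimated the same way the table reports them:

import time

# Requires the xformers package to be installed alongside diffusers
pipe.enable_xformers_memory_efficient_attention()

steps = 100
start = time.perf_counter()
pipe(positive, negative_prompt=negative,
     num_inference_steps=steps, guidance_scale=15.0)
elapsed = time.perf_counter() - start
print(f"{steps / elapsed:.1f} iterations/s, {elapsed:.1f} s per image")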

As expected, NVIDIA’s GPUs outperform those from AMD and Intel, although there are some surprises. The fastest GPU in our initial tests was the RTX 3090 Ti, which achieved almost 20 iterations per second, or about five seconds per 100-step image, with the parameters above. Next came the RTX 3080 Ti, which tied AMD’s new RX 7900 XTX, while the RTX 3050 Ti surpassed the RX 6950 XT.

Now for the results that left us most dumbfounded. First, we expected the RTX 4090 to crush the competition, but that was clearly not the case: it was actually slower than the 7900 XT and even the RTX 3080. Similarly, the RTX 4080 landed between the 3070 and the 3060 Ti, while the RTX 4070 Ti sat between the 3060 and the 3060 Ti. Proper optimizations could most likely double the RTX 40 series’ performance.

Intel’s Arc GPUs currently deliver very disappointing results, especially since they support XMX (matrix) operations that should provide up to four times the throughput of plain FP32 calculations. We suspect the current Stable Diffusion OpenVINO project also has plenty of room for improvement; telling in this regard is the fact that, just to run Stable Diffusion on an Arc GPU at all, you have to edit the ‘stable_diffusion_engine.py’ file and change “CPU” to “GPU”, otherwise the graphics card is not used for the calculations at all.
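
To make the nature of that edit concrete, here is a minimal OpenVINO sketch of the device selection involved; the model file name is a hypothetical stand-in, and the real script’s layout may differ:

from openvino.runtime import Core

core = Core()
print(core.available_devices)           # e.g. ['CPU', 'GPU'] once Arc drivers are visible

model = core.read_model("unet.xml")     # hypothetical model file, for illustration
compiled = core.compile_model(model, "GPU")  # the shipped script passes "CPU" here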

In summary, the NVIDIA RTX 30 series cards do well, as do AMD’s RX 7000 series cards, while the RTX 40 series offers lower performance, followed by the RX 6000 series and, finally, the Arc GPUs. Things could change dramatically with updated software and, given the popularity of artificial intelligence, we expect it to be only a matter of time before we see dramatic improvements. The chart below shows the theoretical FP16 performance (in TFLOPS) of the various GPUs, using Tensor/Matrix cores where applicable.

Stable Diffusion GPU test (Photo Credit: Tom’s Hardware)

The Tensor cores in NVIDIA graphics cards are clearly very powerful, although our Stable Diffusion tests showed below-expected performance on the RTX 40 series. For example, the RTX 4090 is 35% slower than the RTX 3090 Ti, probably because the software used (Automatic1111) does not yet take advantage of the new FP8 instructions of the Ada Lovelace GPUs, which could potentially double performance.

The Arc GPUs, on the other hand, don’t come close to their expected performance. Their Matrix cores should deliver performance similar to the RTX 3060 Ti and RX 7900 XTX, with the A380 close to the RX 6800; in practice, they come nowhere near those values. The A770 sits between the RX 6600 and the RX 6600 XT, the A750 is just behind the RX 6600, and the A380 offers about a quarter of the A750’s speed. Most likely these cards are running the workload on their shaders in full-precision FP32 mode, leaving performance on the table for lack of optimization.

Furthermore, we can see that theoretical compute on the RX 7900 XTX/XT has improved a great deal over the RX 6000 series, and that memory bandwidth is not the critical factor: the 10GB and 12GB RTX 3080 models land relatively close together. The RX 7900 XTX performs almost identically to the XT, while in theory it should be about 19% faster; we measured only 3%.
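
As a sanity check on that 19% figure, the arithmetic works out with the commonly cited peak FP16 numbers for the two cards (our figures, not taken from the article):

xtx_fp16_tflops = 122.8   # RX 7900 XTX, commonly cited peak FP16
xt_fp16_tflops = 103.0    # RX 7900 XT, commonly cited peak FP16

theoretical_gain = xtx_fp16_tflops / xt_fp16_tflops - 1
print(f"theoretical XTX advantage: {theoretical_gain:.0%}")  # ~19%
# The advantage measured in the article's tests was only about 3%.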

Ultimately, these tests gave us a snapshot of Stable Diffusion performance on AMD, Intel, and NVIDIA GPUs. With the right optimizations, the final chart should move closer to the theoretical-TFLOPS one, and the new RTX 40 series certainly shouldn’t trail the previous RTX 30 series.

Stable Diffusion GPU test (Photo Credit: Tom’s Hardware)

Finally, we ran some high-resolution tests, although not every GPU from the earlier runs was included, and Linux was used for the AMD RX 6000 series. The target resolution of 2048×1152 apparently managed to make the most of at least the RTX 4090. It will be interesting to see how the situation develops over the next year.
