PL4 GPU benchmarking?

Yeah, I know. My M1 MBA will arrive on Thursday. I’ll test PL4 on it and report here. I’m worried, though, that if it doesn’t work, I may not be able to get an i7 mini by then. They seem to be selling fast at near-new prices. As long as PL4 runs reasonably well, if somewhat slowly, on the MBA, I won’t cancel the mini order and will hope for a native binary from DxO within, say, 6 months. OTOH, if it doesn’t run at all, I’ll keep the MBA but cancel the M1 mini order (not due to arrive until mid-December) and go the i5/i7 mini + eGPU route. Even if we get a native binary, the eGPU (RX 580/590) may be faster than the M1 GPU, so I could hold onto that until an M2 mini arrives, at which point I’d sell the kit at a pretty substantial loss…


Pff, I win by a wide margin: 3748 seconds, thanks to my rocket of an HD 4850.

just for fun…

I forced GPU use in the preferences file. The GPU seems to be helping, as it reduces processing time for DeepPRIME, but the CPU (Q9450) is still running at 100%.

We use Core ML, so it is theoretically easy to use the Neural Engine, if… there are no other conditions required to benefit from it. For now we don’t know; testing, and therefore time, is needed, so like all other features it has to be prioritized. If you haven’t already done so, you may vote on the Apple Silicon support feature request to help push this forward.
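
For anyone wondering what relying on Core ML means in practice: an app opts into the Neural Engine mostly through how it configures its model. Below is a minimal sketch, not DxO’s actual code; `DenoiseModel` is a hypothetical placeholder for a compiled Core ML model class.

```swift
import CoreML

// Minimal sketch: with .all, Core ML may dispatch work to the CPU, the GPU,
// or the Neural Engine, whichever it deems best for each part of the model.
// "DenoiseModel" is a hypothetical placeholder, not DxO's actual model class.
let config = MLModelConfiguration()
config.computeUnits = .all          // allow CPU + GPU + Neural Engine
// config.computeUnits = .cpuAndGPU // restrict to CPU/GPU for comparison

do {
    let model = try DenoiseModel(configuration: config)
    // ... run predictions with `model` ...
} catch {
    print("Model failed to load: \(error)")
}
```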


I have the ASUS Strix RTX 3090 OC with an AMD 3950X. Exporting a 45MP file from my R5 with DeepPRIME took about 13 seconds.

Does this imply that DeepPRIME performance may be affected more by the 16-core Neural Engine than the 8-core GPU of the M1? I ask because the GPU’s performance is somewhat behind a mobile Radeon Pro 5300M and a fair bit behind an RX 580 eGPU. Might we expect more than this from Core ML with the M1? Also, would you kindly post your impressions of PL4’s DeepPRIME performance with the M1 (obviously) under Rosetta 2 emulation, as compared to, say, an i7 mini with an RX 580 eGPU?


At this point we don’t know. Theoretically, M1 Macs are able to mix use of the Neural Engine and the GPU, whichever is most appropriate. It is “possible” that the Neural Engine could make it faster than an eGPU, since it is dedicated, embedded hardware, but it’s all hypothetical for now. Until we support the Neural Engine, we won’t know.

Unfortunately, the tests we did were on the prototype hardware shipped to developers earlier. It is not representative of what you’ll see with an M1 Mac, and we don’t yet have an M1 Mac to check.

I have a first benchmark from my just-arrived entry-level M1 MacBook Air. Processing the Egypte image with DeepPRIME in PL4 running under emulation (obviously), the MBA is almost exactly as fast as my 8-core 3.3GHz 2013 Mac Pro with FirePro D500 GPUs when using CPU only, and TWICE as fast as the Mac Pro when using the GPU. Times were 41 seconds with the GPU, 244 seconds with the CPU. This puts the M1 GPU roughly on par with an AMD RX 560. Can’t wait to see how much faster it gets with a native binary.
FWIW, this was a very quick test, so I have no idea whether there are any issues with PL4 running under emulation.
I’ll update the spreadsheet in a few minutes…

P.S.: I’ve added my M1 Mac mini scores as well. 5-image DeepPRIME export: 2:06 with the MBA and 1:46 with the mini running PL 4.1. PRIME times of 1:57 and 1:32 respectively. No speed-up from PL 4.02 to 4.1. The Egypte image DP run is 0:41 on the MBA and 0:36 on the mini.


The Batch run would be interesting though!

GPUs don’t run x86 code anyway, so personally I don’t expect really huge gains on that side: PL4 is probably mostly GPU-bound, and as with other mostly GPU-bound workloads, the negative impact of the Rosetta emulation is somewhat negligible.
Additionally, the M1 GPU already seems to perform roughly as expected from its raw GFLOPS throughput.
My guess is that in the end it will mostly perform like a GTX 1650.

However, the CPU-only mode might gain a fair bit of performance, and more could come once the Neural Engine can be utilized.

M1: GPU 41 s / CPU 244 s
4750G: GPU 46 s / CPU 119 s

VRAM bandwidth is improved by on-die memory.

The M1 uses LPDDR4X-4266 (and it is not “on-die” but “on-package”), which is in fact also supported by Renoir’s IMC, but of course not available on desktops (or on any Renoir laptop that I’m aware of…), so the Vega is officially limited to DDR4-3200.

The 8-CU Vega in Renoir is, AFAIR, a little bit narrower than the 8-core M1 GPU anyway! So my guess is that the impact of the bandwidth is somewhat limited, and what becomes visible here is just the raw GFLOPS the GPUs are capable of.
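
For context, the raw bandwidth gap works out roughly like this (both configurations use a 128-bit-wide memory interface; this is a quick peak-bandwidth sketch, not measured figures):

```swift
// Peak memory bandwidth ≈ transfer rate (MT/s) × bus width (bytes) / 1000, in GB/s.
// Both the M1 and a dual-channel Renoir desktop have a 128-bit (16-byte) interface.
func peakBandwidthGBps(megatransfersPerSecond: Double, busWidthBytes: Double) -> Double {
    megatransfersPerSecond * busWidthBytes / 1000
}

print(peakBandwidthGBps(megatransfersPerSecond: 4266, busWidthBytes: 16)) // LPDDR4X-4266 ≈ 68 GB/s
print(peakBandwidthGBps(megatransfersPerSecond: 3200, busWidthBytes: 16)) // DDR4-3200    ≈ 51 GB/s
```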

The webpage only lets me download JPEG images. How can I download the RAW files?

I tried the EGYPT file, and my times are 11 seconds with the GPU and 66 seconds CPU-only. HUGE uplift with the GPU!

If you scroll further down the page you should see a section of RAW files for the same images as the JPEGs nearer the top of the page.

Yes, I missed that the first time myself.

I added my results to the spreadsheet: consistent with the others.
An interesting link for comparing graphics cards:


For every GPU, you directly get the score in TFLOPS (FP32 performance).
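
For reference, those FP32 scores roughly come from shader count × 2 ops per clock (FMA) × clock speed. A quick sketch using approximate boost clocks from public spec sheets, so treat the results as ballpark figures:

```swift
// Peak FP32 throughput ≈ shaders × 2 ops/clock (FMA) × clock (GHz) / 1000, in TFLOPS.
// Shader counts and clocks are approximate spec-sheet values.
func peakTFLOPS(shaders: Double, clockGHz: Double) -> Double {
    shaders * 2 * clockGHz / 1000
}

print(peakTFLOPS(shaders: 1024, clockGHz: 1.28)) // M1 8-core GPU ≈ 2.6 TFLOPS
print(peakTFLOPS(shaders: 896,  clockGHz: 1.67)) // GTX 1650      ≈ 3.0 TFLOPS
print(peakTFLOPS(shaders: 2304, clockGHz: 1.34)) // RX 580        ≈ 6.2 TFLOPS
print(peakTFLOPS(shaders: 3584, clockGHz: 1.58)) // GTX 1080 Ti   ≈ 11.3 TFLOPS
```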

I’ve just upgraded my main rig, and have been running some additional benchmarks. I’ve not done the spreadsheet benchmarks yet, as I’ve not grabbed the RAWs that are used for that, but I’ve been doing test runs with 20-megapixel Sony RAW files, as those are what I normally work with.

My old rig was an Intel i7 6700K (4-core/8-thread 4GHz Skylake) with 16GB of 2133MHz RAM and a GTX 1080Ti Founders Edition GPU. The old rig could process around 2 20-megapixel RAWs per second using traditional PRIME on the CPU, and was reaching around 8.5 images processed per second using DeepPRIME on the 1080Ti. The GPU was not being fully loaded with this set-up, and having compared it with the new machine, it’s clear that the process was CPU-bound with the old configuration.

My new machine has an AMD Ryzen 5600X (6-core/12-thread, potentially hitting 4.4-4.6GHz across all cores with PBO enabled) with 32GB of 3600MHz RAM and the same 1080Ti Founders Edition GPU. With this machine, I’ve got OpenCL enabled in the options and have configured PhotoLab to process four images at once, as that seems to be the optimal count with this particular CPU/GPU combo. Fewer than 3 images at once isn’t enough to keep the GPU fully loaded, 3-4 simultaneous images is within the margin for error, and beyond that the process appears to be GPU-bound on this machine. The new system can process around 13 20-megapixel RAWs per minute using DeepPRIME, which is a substantial improvement over what I was getting before.

Hopefully that information is of use to some of you if you’re trying to put together a balanced machine for running PhotoLab. Weaker CPUs can definitely bottleneck higher-end GPUs in DeepPRIME, and a Ryzen 5600X definitely has a bit of untapped headroom that could allow some GPUs faster than the 1080Ti to shine. That said, at around 70% CPU utilization in conjunction with the 1080Ti, I suspect the 5600X wouldn’t be able to fully keep up with something like a 3090 in DeepPRIME.
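
As a very rough back-of-envelope check on that last point (approximate peak FP32 figures, and the big assumption that CPU load scales linearly with GPU throughput):

```swift
// If ~70% of the 5600X is needed to feed a 1080Ti (~11.3 TFLOPS FP32),
// a 3090 (~35.6 TFLOPS) would want roughly 3x the feeding rate under a
// naive linear-scaling assumption — more than one 5600X can provide.
let cpuLoadFor1080Ti = 0.70
let tflops1080Ti = 11.3
let tflops3090 = 35.6

let estimatedCpuLoadFor3090 = cpuLoadFor1080Ti * (tflops3090 / tflops1080Ti)
print(estimatedCpuLoadFor3090) // ≈ 2.2, i.e. the CPU becomes the bottleneck
```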

I’ll run the spreadsheet benchmarks in a bit, and update the spreadsheet once I’ve got those done.

Have you run the spreadsheet benchmark with the changed queue depth of 4?
If so, the results aren’t really comparable to the others, since the default is 2!

And yes, you can improve the processing time of large batches with a bigger queue depth, but it depends a lot on the hardware, and it doesn’t seem to improve at all above 4!
BTW, that also applies to the normal OpenCL HQ NR.

Hi,
If you use the same GPU, shouldn’t the results for DeepPRIME (using the GPU) be roughly the same?

That’s a fair point. I’ve re-run the batch set of the 5 Nikon RAWs with the default setting of two simultaneous images being processed. The time was unchanged at 40 seconds for the five images.

There is a slight difference that becomes apparent with bigger jobs: when processing 300 20-megapixel images, going from 2 images at once to 4 consistently made a difference of around 1 minute (approximately 23 vs. 24 minutes). That’s beneficial to me personally, given that I shoot high-speed bursts with a 1"-type sensor and then batch-process 1000+ images at once in post, so I’ll take every benefit I can get!

When processing smaller batches, the difference between 2 and 4 is barely noticeable. I mention it mainly because I’m being limited by the GPU performance of the 1080Ti; as GPU speed increases, my educated guess is that playing with the number of simultaneous images may help keep the GPU from being underutilized. It’s clear from testing that the 1080Ti couldn’t perform to its full potential on the old machine, as the old system simply couldn’t feed the GPU with data fast enough.
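
A rough way to picture the queue-depth effect, with made-up timings purely to illustrate the idea:

```swift
// Hypothetical per-image timings — illustration only, not measured values.
let cpuPrepSeconds = 3.0    // CPU-side work per image (decode, prepare, encode)
let gpuDenoiseSeconds = 1.5 // GPU DeepPRIME pass per image

// Concurrent images needed so CPU throughput keeps up with GPU throughput;
// a faster GPU (smaller denoise time) pushes this number up.
let neededInFlight = Int((cpuPrepSeconds / gpuDenoiseSeconds).rounded(.up))
print(neededInFlight) // 2 in this hypothetical case
```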

@MouseAT RAWs per second?

Sorry, that’s me being an idiot. I proofread it, and that still slipped through (twice). It should be RAWs per minute. Unfortunately, I can’t go back and edit the original post :frowning: