PL4 GPU benchmarking?

Hi
If you use the same GPU, shouldn’t the DeepPRIME results (using the GPU) be roughly the same?

That’s a fair point. I’ve re-run the batch of the five Nikon RAWs with the default setting of two images being processed simultaneously. The time was unchanged at 40 seconds for the five images.

There is a slight difference that becomes apparent with bigger jobs: when processing 300 20-megapixel images, going from 2 simultaneous images to 4 consistently saved around 1 minute (approximately 23 vs 24 minutes). That matters to me personally, since I shoot high-speed bursts with a 1"-type sensor and then batch process 1000+ images at once in post, so I’ll take every benefit I can get!

When processing smaller batches, the difference between 2 and 4 is barely noticeable. I mention it mainly because I’m limited by the GPU performance of the 1080 Ti: as GPU speed increases, my educated guess is that adjusting the number of simultaneous images may help keep the GPU from being underutilized. It’s clear from testing that the 1080 Ti couldn’t perform to its full potential on the old machine, as that system simply couldn’t feed the GPU with data fast enough.
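For anyone who wants to turn these batch times into a rough projection for bigger jobs, here’s a quick back-of-the-envelope Python sketch using the numbers above (the 1000-image figure is purely illustrative):

```python
# Rough helper: convert a measured batch time into RAWs per minute and
# project it onto a larger job. Numbers are taken from the 300-image test above.

def rate_per_minute(images: int, minutes: float) -> float:
    """RAWs processed per minute for a measured batch."""
    return images / minutes

def projected_minutes(total_images: int, rate: float) -> float:
    """Estimated minutes for a job at the given rate (ignores warm-up overhead)."""
    return total_images / rate

rate_2 = rate_per_minute(300, 24)  # 2 simultaneous images -> ~12.5 RAWs/min
rate_4 = rate_per_minute(300, 23)  # 4 simultaneous images -> ~13.0 RAWs/min

print(f"1000 images, 2 at once: ~{projected_minutes(1000, rate_2):.0f} min")
print(f"1000 images, 4 at once: ~{projected_minutes(1000, rate_4):.0f} min")
```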

@MouseAT RAWs per second?

Sorry, that’s me being an idiot. I proofread it, and that still slipped through (twice). It should be RAWs per minute. Unfortunately, I can’t go back and edit the original post :frowning:

@Man Sorry, I completely missed this post in the thread - I think I must have skipped over it, as there was a delay between my post being made (in response to the post above yours) and it being approved by moderators and appearing below yours.

In response to your question, the GPU-accelerated parts of the process should take the same time per image, but not all of the work done during an image export happens on the GPU. The traditionally slower parts of the DeepPRIME noise reduction process are definitely offloaded to the GPU wherever possible, but for all I know, some parts of the process may not lend themselves well to GPU processing and may still involve CPU work.

Even if the vast majority of the DeepPRIME noise reduction is now GPU accelerated, exporting an image involves more than just DeepPRIME noise reduction. Colour and exposure adjustments, sharpening, and compressing and writing the JPEG are all part of the export process as well, and even with OpenCL GPU assistance enabled, the CPU still has work to do. The DeepPRIME export process works like a production line, with images in progress being passed back and forth between the CPU and the GPU in the background. If one is faster at its part of the process than the other, it will spend some time idle, either waiting for the other to provide it with more data or waiting for the other to be ready to accept the next piece of data.
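To make that production-line picture concrete, here’s a toy Python sketch (not DxO’s actual implementation; the stage timings are invented) of a CPU stage feeding a GPU stage through a bounded queue. Whichever stage is slower sets the overall pace, and the other one idles:

```python
# Toy producer/consumer pipeline: a "CPU" stage feeds a "GPU" stage via a bounded queue.
import queue
import threading
import time

work = queue.Queue(maxsize=2)   # queue depth loosely mirrors "simultaneous images"
SENTINEL = None

def cpu_stage(n_images: int, decode_s: float) -> None:
    for i in range(n_images):
        time.sleep(decode_s)    # RAW decode / prep work done on the CPU
        work.put(i)
    work.put(SENTINEL)          # tell the consumer there is nothing more to process

def gpu_stage(denoise_s: float) -> None:
    while True:
        item = work.get()
        if item is SENTINEL:
            break
        time.sleep(denoise_s)   # DeepPRIME-style denoise pass on the GPU

start = time.time()
producer = threading.Thread(target=cpu_stage, args=(10, 0.05))
producer.start()
gpu_stage(0.02)                 # GPU is faster here, so it waits and the CPU limits throughput
producer.join()
print(f"Elapsed: {time.time() - start:.2f}s (bounded by the slower stage)")
```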

Hopefully that makes sense. I expected something along these lines to happen in my testing, as I suspected my GPU wasn’t being used to its full potential before. Now I’ve had a chance to run some tests, confirm my hypothesis, and get some rough numbers. I hope that’s of use to someone.

According to DxO staff in a recent post here (16 Months Later: Still No Universal App for Apple Silicon - #11 by Lucas), DxO PhotoLab 5 now uses the Tensor Cores in NVIDIA GeForce RTX GPUs.

It would be great if some users here who have both DxO PhotoLab 5 and an NVIDIA RTX GPU could post new results in the Google spreadsheet (DxO DEEPPrime Processing Times - Google Sheets).

Yes, there is no change in DeepPRIME processing speed on my GPU, which has no Tensor Cores – a GeForce GTX 1660 Super. I have added my results with DxO PhotoLab 4.3.3 and PhotoLab 5.0.1 to DxO DeepPRIME Processing Times - Google Sheets, and the difference is only at the level of measurement error.
So without a GeForce RTX GPU there is no improvement in DeepPRIME processing time with PhotoLab 5 :pensive:

Hi,
I updated the spreadsheet with my results in DPL5 next to those in DPL4: great improvement!
• Egypt: 23 s to 14 s
• D850 (5 photos): 62 s to 34 s

For best results, be sure to update the graphics card driver.

Congratulations to the developers!

I updated the spreadsheet with my results in PL5, along with quite a variety of different settings added as notes. I imagine these settings would make more of a difference on larger exports.
I’m currently on 16 GB of RAM but expect to upgrade to 32 GB in the next few days, at which point I plan to re-run all tests to allow a direct comparison between the different RAM sizes.

A couple of issues: I couldn’t find where you got the GPU GFLOPS figures from, and I can’t see the source for the R5 and 1000D images. If anyone can point me in the right direction I’ll populate those too.

I couldn’t find where you got the GPU GFLOPS figures from.

@rymac
Here you can find the specs of each GPU.
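For what it’s worth, the GFLOPS number usually quoted on those spec pages is the theoretical FP32 peak; assuming that’s what the spreadsheet column holds, you can reproduce it yourself from the shader count and boost clock:

```python
# Theoretical FP32 peak: shader cores x boost clock (MHz) x 2 ops per fused multiply-add.

def peak_gflops(shader_cores: int, boost_clock_mhz: float) -> float:
    return shader_cores * boost_clock_mhz * 2 / 1000.0

# Example: GTX 1080 Ti (3584 CUDA cores, ~1582 MHz boost) -> ~11340 GFLOPS
print(f"GTX 1080 Ti: ~{peak_gflops(3584, 1582):.0f} GFLOPS")
```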

I have added my results for the 16" Apple Silicon MacBook Pro.
I found and processed Egypt, R6, 90D and D850, but not R5 or 1000D.

I’ve tidied the spreadsheet up a bit. Someone had left filters on, which was hiding most results. I’ve now removed these and added missing formulas/cell shading to some users’ PL 5.X results. When time permits I’ll set the spreadsheet up to do this automatically for new entries.

There’s no link for the R5/1000D test images. @Savay, as the only person to have run these, are you able to provide it, either here or in rows 7/8 of the spreadsheet?

I’ve now automated the cell shading on the spreadsheet for V5.X of Photolab.
Are there any other automations/features that people would like to see?

Is it possible for an admin to split this topic? Would be good to have a thread specifically for the spreadsheet with the initial post being a link to the spreadsheet, rather than a PL4 specific topic with the sheet buried in it.

Settings for image correction make a big difference. I’ve just run the D850 test with PhotoLab 5.1.1 on macOS 12.1 on an MBP 14 with an M1 Pro 10/16 and 16 GB of memory.

With just DeepPRIME turned on at 40, the five images processed in 32 s. With my default set of corrections for D850 files (Leica M9 colour, lens sharpening, auto-horizon, auto-crop), the processing time jumped to 36 s. With another, more intensive set of corrections but no local corrections, it jumped to 45 s.

These times are quite extraordinary, as my MBP is currently overloaded (I didn’t close many apps before starting PhotoLab 5, as I didn’t feel like restarting all my other work), with memory pressure at 63 and 9 GB of swap in use, and I’m tabbing back into my browser (the other application using the most memory) to keep working during the testing.

Great export times don’t prevent PhotoLab 5 from misbehaving on an M1 Pro Mac. Memory use for PhotoLab 5 is at 10.5 GB (it would be about 4 GB on an Intel Mac). Spotify playback got choppy while exporting and typing slowed down (nothing like as bad as on the base M1 Mac Mini with the 8/8 configuration, though).

The main reason I ran these tests now was a result for the M1 Pro, from forum name 4, that I couldn’t believe: 31 s, while there is a conflicting result of 77 s for the same M1 Pro 10/16 configuration.

That is half the time of the Radeon 5500 XT in the charts, and faster than or competitive with some of the most powerful GPUs in existence. My Radeon VII with AMD hardware acceleration enabled (Apple disables it by default; one has to add OpenCore to get it) did outperform the M1 Pro (5 or 6 seconds per image with a full set of adjustments including DeepPRIME), but I don’t have it hooked up right now.

In any case, the M1 Pro does get these good numbers, which I would have expected from the M1 Max but not the M1 Pro. I wonder if the M1 numbers on PhotoLab 5 (almost as good, at 35 s) hold up to scrutiny; I don’t have a plain M1 here any more to test against. When I tested on my own files, there was a huge difference between PhotoLab 5 and PhotoLab 4 on an M1 Mac Mini.


Just dug up the previous test I ran on 61 D850/D810 images in a real set on an M1 Mac Mini with 8GB memory.

  • PhotoLab 4 - 32m
  • PhotoLab 5 - 10m38s

That is about 3x faster, so the M1 35 s result looks reasonable. This suggests that for PhotoLab exports it doesn’t make much difference which M1 Mac one chooses (M1, M1 Pro, M1 Max). Still, I don’t find the editing experience under Rosetta particularly good compared to my Intel Macs, with small delays and micro-stutter as well as those memory leak issues, so I don’t recommend moving to the M1 architecture for PhotoLab at all. That pushes those of us without Catalina/Big Sur/Monterey Macs towards either a poor user experience or Intel Macs that will soon be made obsolete.

All the more reason for DxO not to have cut off Mojave before its time. Capture One 22 does run on Mojave.

Hello,
I have updated my results on the “DxO DeepPRIME Processing Times” spreadsheet:
• PhotoLab 5 (v.5.1.3) with Windows 11 and the latest GPU driver
• PhotoLab 4 (v.4.3.6) with Windows 11 and the latest GPU driver

Summary
For the same hardware platform:
• It is well worth upgrading from PhotoLab 4 to PhotoLab 5: processing times are almost halved.
• It is also worth upgrading to the latest GPU drivers, the latest minor version of DPL, and Windows 11: at least a 15% increase in performance.

Hi everyone! Apologies in advance for the spam and for my English.
I have a GTX 680 2 GB GPU with a quite old Phenom CPU. DxO PhotoLab (and PureRAW too) only partially supports this GPU and processes photos at about 0.08 MP/sec with it; the CPU is even slower. Is it worth upgrading to an AMD APU (not a plain CPU) so I can use its Vega graphics for processing, with my GTX 680 still installed at the same time? On paper the Vega performs at least two times worse (it has a lower FLOPS count than the GTX 680), but the table shown in this thread says the Vega is 2x faster than the RX 560 I used before (and the GTX 680 is much slower in DxO than the RX 560).
Is there any way to tweak the system to increase performance, or is there no way other than upgrading? I only have money to buy a modern CPU with at least SSE 4.1 instructions and roughly a 1050-level GPU. Is the 2400G (or another APU) a good purchase for me?
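For a rough sense of what those numbers mean: at a given MP/sec throughput the per-image time is just megapixels divided by that rate, so you can bound what a faster GPU might buy you. A crude sketch, assuming a 16 MP file (real gains will be smaller, because part of the pipeline stays on the CPU):

```python
# Crude estimate of per-image export time from a measured MP/sec throughput.

def export_seconds(megapixels: float, mp_per_sec: float) -> float:
    return megapixels / mp_per_sec

current = export_seconds(16, 0.08)   # assumed 16 MP RAW at the quoted 0.08 MP/s
print(f"GTX 680 + Phenom: ~{current:.0f} s per image")

# Optimistic what-if: a GPU with twice the effective throughput.
print(f"Hypothetical 2x faster GPU: ~{export_seconds(16, 0.16):.0f} s per image")
```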

I basically only ran those to check the scaling vs. PL4 for different RAW types and sizes, and to check whether there are some (CPU) bottlenecks with smaller batches and individual RAWs even on a 5950X (which seems to be the case, at least with the single GFX50 RAW, and there are some slight hints of it with the 1000D and R6 batches).

Unfortunately I don’t hold any copyright at all for the 20x R5 RAWs, which are examples downloaded from DPR.
Maybe someone else can provide 20 R5 RAWs?!
However, I’m under the impression that DeepPRIME is completely content-agnostic and the image content itself doesn’t matter at all, since it is basically crunching all the pixels of each file no matter what.
So theoretically you can use 20 copies of an identical RAW or 20 completely random ones and the results will be the same, which means in theory you can use any R5 RAW you stumble upon anyway.
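If DeepPRIME really is content-agnostic, anyone with a single R5 file could build the 20-image batch themselves by duplicating it; a minimal sketch (file names and paths are made up):

```python
# Build a 20-image test batch by duplicating one RAW file on disk.
from pathlib import Path
import shutil

def make_batch(source_raw: str, dest_dir: str, copies: int = 20) -> None:
    src = Path(source_raw)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for i in range(1, copies + 1):
        # Keep the original extension (e.g. .CR3 for R5 files) so PhotoLab indexes the copies.
        shutil.copy(src, dest / f"{src.stem}_{i:02d}{src.suffix}")

make_batch("R5_sample.CR3", "R5_batch", copies=20)
```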

But I think for now the 20x 90D RAWs (for which I do indeed hold the copyright), at 32-33 MPix each, are sufficiently close to capping out even the fastest GPUs without being limited too much by the memory subsystem or the CPU, and without the time for one complete run getting out of hand on slower systems.

The 1000D RAWs, on the other hand, seem to be a tad too small to be of much further interest, since they are only 10 MPix per image.
Even the R6 batch with its 20 MPix files seems to show slight signs of some kind of (CPU? I/O?) bottleneck on the fastest systems, at least with PL5.

The results you mentioned are not conflicting!
One run was performed on the GPU and the other on the NPU (aka the AI accelerator, aka the 16-core Apple Neural Engine or ANE, with its ~11 TOPS, which equals ~11 TFLOPS for fp16). Hence also the difference in the coloured cells in the “GFLOP” rows.

All M1 chips (and also the A14) have the same 16-core ANE, and the Max and the Pro even share the same CPU.
Only the memory controller differs slightly, so it’s no wonder they perform basically the same while being a tad faster than a regular M1.

In fact, the presence of the same NPU in every A14/M1 derivative suggests that even an iPhone 12 or an iPad mini could theoretically perform almost in the same ballpark (lower power limit, fewer CPU cores and a weaker memory subsystem aside…).

This also suggests that DeepPRIME itself could in theory run reasonably well on some ARM SoCs used in Android devices, Chromebooks or Windows on ARM, since those have pretty strong dedicated AI/ML accelerators too (especially the Snapdragons and Exynos).
Some of the newer ones are almost twice as fast as Apple’s (24-26 TOPS).

I added my results for my new PC, which has an i5-12600K (just using the built-in UHD 770 graphics), under Windows 10 and 11.

Windows 10 was not using the performance cores, only the E-cores, and as a result took a painstaking 337 s to run the ‘Egypte CPU only’ test, whereas Windows 11 used all the cores and did it in 72 s.

So on the new 12th-gen Intel chips it is worthwhile upgrading to Windows 11 if you are using only the CPU to process images.
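If anyone is stuck on Windows 10 with one of these chips, a possible workaround (untested by me, and the core numbering is an assumption) is to pin the PhotoLab process to the P-cores, for example with psutil:

```python
# Sketch: restrict a running PhotoLab process to the P-cores on a 12600K,
# assuming its 6 hyper-threaded P-cores show up as logical CPUs 0-11.
import psutil

P_CORE_CPUS = list(range(12))  # assumption about the logical CPU numbering

for proc in psutil.process_iter(["name"]):
    name = proc.info["name"] or ""
    if "PhotoLab" in name:
        proc.cpu_affinity(P_CORE_CPUS)  # schedule this process only on the P-cores
        print(f"Pinned {name} (PID {proc.pid}) to CPUs {P_CORE_CPUS}")
```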

Is this still the latest thread for this benchmark? I just added my results with DeepPRIME for the Egypte 111 MB image, using the PhotoLab 6 trial version under Windows 10.

GPU (Radeon RX 470 with 4 GB) - 41 seconds
CPU (Intel i7-4770K with 16 GB) - 208 seconds