16 Months Later: Still No Universal App for Apple Silicon

What work is a lack of M1 support preventing you from doing? It is using the new silicon in the most important areas. You should compare noise reduction in C1 and Lightroom with PL5. You call that state of the art?

The reason there is no native support yet is likely that it’s more complicated than everyone here makes it out to be.

Be patient… like Fuji X-Trans users.

3 Likes

Apple M1, M1 Pro, and M1 Max chips include a 16-core Neural Engine. The Neural Engine is a neural processing unit (NPU) / AI accelerator. DxO PhotoLab includes a feature called DxO DeepPRIME, which DxO describes as using “deep learning artificial intelligence technology.” While that wording makes for great marketing, the feature is simply based on a convolutional neural network, which is precisely what Apple Silicon’s Neural Engine is designed for. DxO is currently handing this off to Apple GPUs as Metal calls, but using the correct methodology and APIs to run this on the Neural Engine would provide speed and efficiency improvements that are orders of magnitude higher than using the GPU.
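
To be concrete about what “the correct methodology and APIs” would mean here: at the Core ML level, letting the system schedule a model on the Neural Engine is largely a matter of the compute-units configuration. A minimal sketch, with a made-up model name (I obviously don’t know what DxO’s internal code looks like):

```swift
import CoreML
import Foundation

// Minimal sketch of steering a Core ML model toward the Neural Engine.
// The model file name is hypothetical, for illustration only.
func loadDenoiser(allowNeuralEngine: Bool) throws -> MLModel {
    let config = MLModelConfiguration()
    // .all lets Core ML schedule work on CPU, GPU and the ANE;
    // .cpuAndGPU keeps it off the ANE, i.e. the GPU-only path.
    config.computeUnits = allowNeuralEngine ? .all : .cpuAndGPU

    let modelURL = URL(fileURLWithPath: "DeepPrimeDenoiser.mlmodelc") // hypothetical
    return try MLModel(contentsOf: modelURL, configuration: config)
}
```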

Amateurs may not mind waiting for images to process, but DxO PhotoLab is targeted towards professionals. Professionals understand that time costs money. If a task can be performed faster by technical means, then it should be made to perform faster. It’s quite a simple concept.

DxO claims to have state-of-the-art features, yet why are they not taking advantage of state-of-the-art hardware by doing things the correct way?

I appreciate you pointing out the obvious that everyone here is already aware of, and you’re right; it is complicated. That’s precisely why users like myself have been purchasing DxO PhotoLab upgrades every year, costing $100–$200 each time, in order to fund DxO’s software development and its adaptation to new technologies. The problem now is that DxO is not adapting.

You claim to be patient. How much longer than 16 months do you think is appropriate?

That’s not being patient. That’s being complacent.

2 Likes

You are aware that utilizing those matrix units is no easy feat, nothing that comes by itself, and that they are using Core ML, not Metal?! (What the driver does in the background is a different thing.)

The code, and also the data, has to be prepared and tailored to very specific implementations of such matrix accelerators, and each one is totally different in how it needs to be handled, be it Apple’s NPU, Google’s TPU, or Nvidia’s Tensor Cores.

Just read up on how complicated it actually is to use the Nvidia Tensor Cores via DirectML, and those have a much larger installed user base.
And just because inferencing is going on, it doesn’t mean those matrix units are necessarily the most efficient way to do it; that depends on the data that needs to be processed.
If FP32 datasets are used, which seems to be largely the case for DeepPRIME (hence probably also its dependence on FP32 throughput), most matrix units are out of consideration anyway.
(The only matrix units I’m aware of that can process FP32 datasets are AMD’s Matrix Cores from their most recent CDNA professional GPGPUs.)
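
To make the FP32 vs FP16 point a bit more concrete, here is a tiny illustration (my own example, nothing to do with DxO’s actual code) of the precision that gets thrown away when an FP32 value is squeezed into the FP16 format most matrix units expect:

```swift
// Illustration only: Float16 keeps roughly 3 significant decimal digits,
// so an FP32 value pushed through it loses precision. A mostly-FP32
// pipeline has to tolerate this before matrix units become an option.
// (Float16 requires an arm64 target, e.g. Apple Silicon.)
let original: Float = 0.123456789
let squeezed = Float16(original)
let roundTripped = Float(squeezed)

print(original)      // ≈ 0.12345679
print(roundTripped)  // ≈ 0.1234741, the FP16 round trip changed the value
```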

I still think they must have done some optimizations for at least some steps of the processing in this regard; otherwise the up-to-4x improvements on Apple Silicon can’t really be explained…

The M1 GPUs are good for GPGPU but not THAT good, especially since they already performed quite well in PL4 relative to their theoretical raw processing power… something else must be going on to produce such a large improvement!

Oh and by the way…blame Apple!

Unfortunately Apple isn’t giving third-party developers any guidance on how to optimize their models to take advantage of the ANE. It’s mostly a process of trial-and-error to figure out what works and what doesn’t.
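
In practice that trial and error often boils down to loading the same model under different compute-unit settings and timing it, since Core ML doesn’t directly report whether a layer ended up on the ANE. A rough sketch of such a probe, with a hypothetical model name:

```swift
import CoreML
import Foundation

// Rough probe: time one prediction of the same (hypothetical) model under
// a given compute-unit setting. If .all is clearly faster than .cpuAndGPU,
// a good part of the network is almost certainly running on the ANE.
func timePrediction(units: MLComputeUnits, input: MLFeatureProvider) throws -> TimeInterval {
    let config = MLModelConfiguration()
    config.computeUnits = units
    let model = try MLModel(contentsOf: URL(fileURLWithPath: "Denoiser.mlmodelc"),
                            configuration: config)
    let start = Date()
    _ = try model.prediction(from: input)
    return Date().timeIntervalSince(start)
}
```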

1 Like

Hi,
Please allow me to correct you: with PL5, DeepPRIME uses the ANE (Apple Neural Engine), providing blazing fast rendering times compared to PL4… :wink:

I’m curious:

  1. Do you own an M1 Mac?
  2. Have you tried PL5 yet?

Please, let me know.
Thanks.

5 Likes

From what I can see, PhotoLab 4 and 5 don’t have serious performance issues when run on hardware with dedicated graphics cards (don’t use DeepPrime with Intel graphics; export jumps from 15 seconds per image to 15 minutes).

The only routines for which there is urgency to rewrite the code to support M1 processors directly are those where there are substantial performance gains to be made. That would include calculating previews with PrimeNR or DeepPrime enabled and exporting images with Prime or DeepPrime enabled. As DeepPrime beats PrimeNR all the time, even limiting optimisation to the DeepPrime routines should be enough.

We have actually used NVIDIA Tensor Cores (found in RTX GPUs) since PL5; that’s how it became faster on Windows. This is done by using fp16, which also speeds up processing on the latest AMD GPUs.

We adapted our algorithm to be executable on Apple Neural Engine rather than only M1’s GPU.
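
For the curious, on the Core ML side the relevant knobs look roughly like this (a simplified sketch, not our actual production code):

```swift
import CoreML

// Simplified sketch of the two settings discussed above, not production code:
// allow scheduling on the Neural Engine, and opt into reduced-precision
// (fp16) accumulation when work runs on the GPU.
let config = MLModelConfiguration()
config.computeUnits = .all                       // CPU + GPU + Neural Engine
config.allowLowPrecisionAccumulationOnGPU = true // fp16 accumulation on the GPU path
// The compiled model would then be loaded with this configuration, e.g.
// let model = try MLModel(contentsOf: compiledModelURL, configuration: config)
```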

5 Likes

Good work covering all Mac GPUs by targeting the optimisation at the Neural Engine instead of the cards themselves.

I would be absolutely fine with living with PhotoLab 4 DeepPrime speeds on Mojave if it’s impossible to back-port the same speed improvements that come with Catalina or Big Sur. Running faster on a more recent OS is fine. Not running at all on OS -2 is not okay.

Hello Steven, and thank you for correcting me. I’m tremendously glad to see DxO PhotoLab 5 is now using Apple Silicon’s Neural Engine. For M1 specifically, the speed improvement over GPU processing appears to be approximately 2x.

To answer your questions, I own several Macs with Apple Silicon chips and have been using the trial of PhotoLab 5 alongside my purchased copy of PhotoLab 4.

Do you have an estimate for when DxO PhotoLab will be a Universal App?

Hi Lucas,
Does this mean that when new chips with more ANE processing power come out (probably next year with the Mac Pro), PhotoLab will automatically benefit from this extra processing power?
Is this the way you get the most out of Apple chips?
Thanks for your explanations.

Yes, if Apple comes up with more powerful Neural Engine cores or simply more cores in the ANE we expect that you’ll see a matching speed increase for DeepPRIME.

The latest M1 Pro and M1 Max look to have the same-spec ANE as the M1, so we don’t expect a difference there for DeepPRIME. Other corrections will still benefit from the additional CPU cores, though.

Additionally, until now we have seen the ANE be about 5 times faster than the M1’s GPU for ML tasks, presumably at much better power efficiency. Now with the M1 Max, which has up to 4 times the number of GPU cores of the M1, the difference in speed is likely much smaller. So for users who don’t mind heating up their laptop, we might want to use both the ANE and the GPU at the same time and get something close to twice as fast in the best case.
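
Conceptually, splitting a batch between the ANE and the GPU could look something like the sketch below (a rough illustration with a hypothetical model, not our actual implementation):

```swift
import CoreML
import Foundation

// Rough illustration only: two instances of the same (hypothetical) model,
// one allowed to use the Neural Engine and one pinned to CPU/GPU, each
// processing half of the inputs concurrently.
func makeModel(_ units: MLComputeUnits) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = units
    return try MLModel(contentsOf: URL(fileURLWithPath: "Denoiser.mlmodelc"),
                       configuration: config)
}

func denoiseConcurrently(_ inputs: [MLFeatureProvider]) throws {
    let aneModel = try makeModel(.all)        // may run on the Neural Engine
    let gpuModel = try makeModel(.cpuAndGPU)  // stays on CPU/GPU

    let half = inputs.count / 2
    let group = DispatchGroup()
    DispatchQueue.global().async(group: group) {
        inputs[..<half].forEach { _ = try? aneModel.prediction(from: $0) }
    }
    DispatchQueue.global().async(group: group) {
        inputs[half...].forEach { _ = try? gpuModel.prediction(from: $0) }
    }
    group.wait()
}
```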

4 Likes

Ah, thank you for the details.
It helps me understand a bit better how PhotoLab works and how Apple is increasing power from one generation to the next.

Combining the ANE and GPU to reduce export time could be great for some users.
For a more modest user, using the ANE for export and the GPU for real-time editing (or the other way round, or whatever combination works better) could be interesting.
I guess we will see some more options in the PL preferences panel in the next few months or years :smiley:

Now time to start saving for an M2 :innocent:

1 Like

Thank you @StevenL and @Lucas for the info on how DxO is optimising PL for M1 Macs.

I, and I’d imagine others, have read about how Adobe LR and Capture One Pro have been optimised for the M1 chips and their optimisations appear to have encompassed areas wider than just noise reduction.

For example, file import, general rendering, cropping, and exporting are all areas mentioned in reviews of the LR and C1P updates following their M1 optimisations.

PL is used for many editing processes other than DeepPrime. I use DeepPrime, so the performance enhancement is welcome. But I use PL a huge amount more for viewing, cropping, correcting horizons, applying control points, zooming in to 100% to check focus, and other general editing functions on thousands and thousands of images every year. It is these everyday functions, applied to every image, that I’d really appreciate seeing PL optimised for (in any possible way, including M1 chip optimisations) too.

If the time it takes PL to render a preview could be shortened through optimisations gained from the M1 chip architecture, then I would be absolutely overjoyed. I cannot emphasise this enough…

4 Likes

Do you have an estimate for when DxO PhotoLab will be a Universal App?

1 Like

Hello @CHPhoto,

It is a bit hard in these articles to know what part of the speed-up comes from the native version, what part from the software update, and what part from the different hardware.

As far as I can see there is no comparison of the same software version with and without Rosetta on the same hardware, so to me it rather looks like a mix of several factors that, when combined, look nice for marketing purposes :slight_smile:

There are opportunities for optimizations specific to Apple Silicon hardware, thanks to the Neural Engine or the powerful GPU with its large pool of memory. M1 Macs also have a faster CPU, faster memory, and faster storage. But none of these should make a difference between a native and a translated application. For now, in our tests, native code is about 20-30% faster than the translated version. We shall see how this translates at the whole-application level.
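
As a side note, a process can ask macOS whether it is currently being translated by Rosetta, which makes this kind of native-versus-translated comparison easy to verify. A small sketch using the documented sysctl flag:

```swift
import Foundation

// Returns true when the current process runs under Rosetta 2 translation,
// false when it runs natively (the flag does not exist on Intel Macs,
// in which case sysctlbyname fails and we also return false).
func isRunningUnderRosetta() -> Bool {
    var translated: Int32 = 0
    var size = MemoryLayout<Int32>.size
    let status = sysctlbyname("sysctl.proc_translated", &translated, &size, nil, 0)
    return status == 0 && translated == 1
}

print(isRunningUnderRosetta() ? "Running under Rosetta 2" : "Running natively")
```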

At the moment I’m not able to tell more about which Apple Silicon specific optimizations we will be able to take advantage of in PhotoLab.

4 Likes

Hi Lucas,

Thank you for another detailed response.

I agree, the articles I linked to do not offer clear performance comparisons of the same hardware with and without M1-optimised code; that was an excellent point, well spotted.

I love PL (I’ve not loved an image editing app this much since the days of using Aperture!) and shall be continuing to use your software for the foreseeable future.

I look forward to experiencing whatever optimisations you are able to extract from current hardware.

Charles

2 Likes

I, for one, am ECSTATIC with PL5’s 3x increase in DeepPRIME processing speed. Everything else about PL5 feels perkier, too.

Waiting patiently for some feedback about the “low memory” bug I reported several days ago… It’s still not fixed with update 5.0.1. My M1 Mac mini with 16GB RAM ran out of memory after exporting 520 JPEGs, about the same as before. Activity Monitor showed about 8GB of “compressed” memory, and the app had gobbled up 5.6GB of RAM.

1 Like

Why is DxO PhotoLab 5 still running under Rosetta?

Probably because everyone keeps asking the same question, and the developers keep being distracted, which slows down the actual work of coding for M1 :thinking:

3 Likes

What matters most for performance is export. PhotoLab 5 is optimised on Apple Silicon for export. I have a set of 61 Nikon D810 and D850 files with DeepPrime. Export times on an M1 Mac Mini:

  • PhotoLab 4: 32m
  • PhotoLab 5: 10m 38s

Three times faster is pretty optimised in my opinion. Those are real world results on a real world photo set shot in low light and fully developed. It’s not just throwing DeepPrime on some random images and calling it a test.

The sliders and controls seem to work adequately fast on the M1 Mac Mini. The 12-core Mac Pro with Radeon VII does seem a tad more responsive (immediate).

My test mule with 8GB of RAM ran out of memory after about 20 images, albeit D810 and D850 files (36MP, 45MP).

DeepPrime seems to be totally content-agnostic and even ISO-agnostic!
The only things that matter are how many megapixels per file, and in total, need to be developed and which kind of RAW is used.
So it doesn’t really matter whether you throw “random images” at it or not.

However, losslessly compressed RAWs, like Canon *.CR2 and *.CR3, seem to be a little bit faster than uncompressed RAWs in general.