As part of some general experimenting (and not merely an excuse to avoid the work that I’m meant to be doing right now, honest!), I got to thinking about how much faster modern hardware is than previous generations.
Given that I don’t really play that many computer games these days (and probably couldn’t tell the difference in graphical fidelity anyway), what better way to test this than matrix multiplication? Especially given that I already had some suitable software from the previous post (link).
Running the same, unmodified code on my new laptop (an Intel i9-13900H with 64GB of 4,800 MT/s memory), I was hoping to see some improvements…
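The actual code (including the SIMD variants) is in the previous post; as a rough sketch of the kind of naive, triple-loop multiply and timing harness being measured here (the matrix size and harness details below are illustrative, not the originals):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Naive triple-loop multiply of two n x n row-major matrices:
// an "Algo 1"-style baseline. Illustrative only; the real code
// (and its SIMD variants) lives in the previous post.
void matmul_naive(const std::vector<double>& a,
                  const std::vector<double>& b,
                  std::vector<double>& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

int main() {
    const std::size_t n = 1024;  // illustrative size, not the original
    std::vector<double> a(n * n, 1.0), b(n * n, 1.0), c(n * n, 0.0);

    auto start = std::chrono::steady_clock::now();
    matmul_naive(a, b, c, n);
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    // The usual 2 * n^3 floating-point operation count for a square multiply.
    double gflops = 2.0 * n * n * n / elapsed.count() / 1e9;
    std::printf("%.3f s, %.2f GFLOP/s\n", elapsed.count(), gflops);
}
```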
| Algorithm | Old time (s) | Old GFLOP/s | New time (s) | New GFLOP/s | Speed-up (×) |
| --- | --- | --- | --- | --- | --- |
| Algo 1 | 21.535 | 0.13 | 5.669 | 0.491 | 3.80 |
| Algo 1 (SIMD) | 1.009 | 2.76 | 0.721 | 3.858 | 1.40 |
| Algo 2 | 2.745 | 1.01 | 0.988 | 2.816 | 2.78 |
| Algo 2 (SIMD) | 0.396 | 7.01 | 0.282 | 9.850 | 1.40 |
TL;DR
Modern hardware = faster performance.
Thoughts
With the original experiment, my thinking was that algorithm 1 was most likely bound by memory access. Looking at the rough memory speed-up (4,800 / 2,133 ≈ 2.25×), that would explain part of the performance gain, with CPU architectural changes being responsible for the rest (this is just a guess, but it sounds plausible enough).
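As a back-of-the-envelope check (assuming dual-channel DDR with a 64-bit bus per channel on both machines, which is a guess on my part):

```cpp
// Peak theoretical memory bandwidth: transfers/s * 8 bytes per
// 64-bit transfer * number of channels (dual-channel assumed).
constexpr double gb_per_s(double mt_per_s, int channels) {
    return mt_per_s * 1e6 * 8 * channels / 1e9;
}
constexpr double old_bw = gb_per_s(2133, 2);  // ~34.1 GB/s
constexpr double new_bw = gb_per_s(4800, 2);  // ~76.8 GB/s
// new_bw / old_bw ~= 2.25, the same ratio as the raw MT/s figures.
```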
The second thought that sprang to mind (and this could be interpreted in multiple ways) was that as computers get faster and faster, they tend to become more forgiving of inefficiencies. Now, this does not mean that I’m about to recommend writing matrix multiplication code in a runtime-interpreted language, but with merely a few years’ gap, code can go from running unacceptably slowly to acceptably fast with nothing more than a bit of patience and an open wallet.
The other way to interpret this would be that code quality and algorithm design are still massively important. Even though significant performance improvements come simply from using modern hardware, further, much larger improvements can be achieved by applying old-fashioned techniques: using decent algorithms and trying to understand what the computer is actually doing internally.
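I can’t say exactly which tricks “Algo 2” uses, but a classic example of this kind of mechanical sympathy is reordering the loops so the inner loop walks both matrices row-wise, which is much kinder to the cache:

```cpp
#include <cstddef>
#include <vector>

// Loop-reordered (i-k-j) multiply: the inner loop now streams through
// b and c row by row instead of striding down b's columns, so cache
// lines and the hardware prefetcher are used far more effectively.
// Note: c must be zero-initialised before calling.
void matmul_ikj(const std::vector<double>& a,
                const std::vector<double>& b,
                std::vector<double>& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Same arithmetic, same operation count; only the memory access pattern changes, and that alone can be worth a multiple in throughput.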
Summary
My takeaways from this are:
- Using appropriate algorithms is still a massive help (and probably always will be, to get the best out of whatever hardware we have).
- Users still shouldn’t write their own matrix multiplication libraries; there are people who dedicate their lives to such things, and it’s unlikely that we’ll do better than they do (see the sketch after this list).
- New hardware helps – there’s a reason why companies update their data centre hardware on a regular basis. Modern hardware is both faster and more efficient (in terms of power etc.), all of which lowers the cost to the end user of running their analysis jobs.
- At the higher-abstraction end of software (e.g. orchestration code), code will continue to run faster and faster, even if it’s inefficient. Combine that with the ability of non-computer scientists to write their analysis code in high-level languages that call efficiently written low-level packages, and computers are going to be able to do more things for more people in the future.
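To illustrate those last two points together, here is what “don’t roll your own, call the experts’ library” looks like in practice: a few lines of driver code handing the heavy lifting to an optimised BLAS (OpenBLAS via the standard CBLAS interface in this sketch; any BLAS implementation would do):

```cpp
#include <cblas.h>   // link with e.g. -lopenblas
#include <vector>

int main() {
    const int n = 1024;
    std::vector<double> a(n * n, 1.0), b(n * n, 1.0), c(n * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C, row-major, no transposition.
    // The library chooses SIMD width, cache blocking and threading
    // for the machine it runs on, so we don't have to.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, a.data(), n,
                b.data(), n,
                0.0, c.data(), n);
}
```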