So, given the current performance problem Josh is having, I thought I'd share some info on how multithreading is generally done in large (astro)physical simulations.
After all, conceptually they're not super different from LT. And all of them are multithreaded, as running larger ones on a single core simply isn't viable.
Disclaimer: I'm not an expert on multiprocessing, and have little experience of the implementation, but a good grasp of the theory.
This may not be applicable to LT, but I'll let Josh be the judge of that.
Example case
To illustrate the methods, I'll use the following example case.
Say we want to simulate a solar system. We have planets, the sun, moons, comets etc. We dump these into objects (which we'll call particles) with the following properties:
-mass
-position
-velocity
Every tick, we need to calculate the acceleration on each particle. Now, the acceleration (due to gravity) depends on the ass and location of every other particle,
so this is a nice way to demonstrate how to deal with thread communication too.
(spoiler alert: you make it so you don't have too. iirc Josh spoke about each system being a thread and having to communicate data - that's not a good way of doing things.
We're on shared memory after all.)
In a classical singlethreaded approach, the pseudocode for this would be this:
Code: Select all
particles = list(all particles)
while(keepgoing):
tick = something;
for particle in particles:
calculateAcceleration(particle, particles, tick)
for particle in particles:
particle.update()
The classic approach
This one is currently the most used one, not because it's better, but because it's easy to integrate into a program that was single threaded before.
Basically, first we make sure the state the sim was in at the start of each iteration is written to memory that is NOT changed during the calculations.
The calculating functions write their findings to other variables. The state then gets updated at the very end, in update().
e.g., the particle will have fields
x0, y0, z0, dx, dy, dz, etc. Calculations written to dx, state kept in x0.
This way you have no communication problems, as all the info each function needs is written in shared memory and nothing ever needs to be copied.
The actual multithreading happens as follows: loops where it is possible (in this case, both of the for loops inside the while) will get split between cores.
e.g. if you have 100 particles and 4 cores, each gets 25 of the loop iterations.
In the correct framework (my thesis project ShadowFax used openMPI) you don't even have to worry about implementation once the main framework is there -
I could just add new functions in loops and they would get multithreaded automatically! Very user friendly.
You don't quite get a 1:1 gain when adding cores, because of parts staying singlethreaded and waiting for synchronisation, but you should be able to get a nice performance boost.
Task-based parallellism
An elegant approach, for a more civilised age.
I first learned about it when receiving a talk about SWIFT. More efficient per core, but used less because you have to adapt your codebase to fit on it.
The basic idea is this: instead of dividing the work, we define tasks. Then we let each thread pick a task, execute it, and mark it as "done", pick a new task, etc.
The advantage of this is that each thread keeps going as long as there is work, thus avoiding threads idling because they're done while others threads are still doing stuff.
As for communication problems, same approach as above.
Now, of course, some tasks need to be done before others can be done.
In my example, calculateAcceleration() must happen before particle.update()
This is handled by giving each task dependencies on other tasks. Then we make the threads only pick up tasks for which all dependencies are met, and the work is
automatically done in the right order.
This avoids a large synchronisation losses and gives great efficiency on large numbers of threads. (SWIFT speak of 60% efficiency on 512 cores, which is really good)
Not sure if it's worth it for small-scale things like LT (at most 8 threads or so), but the approach is very elegant either way.
Paper: https://arxiv.org/pdf/1606.02738v1.pdf (open access, no paywall - interesting point for programmers in part three and further mostly)
So, hope it helps Josh, or is otherwise interesting.
Dino out o7