Return to “Suggestions”

Post

The Multi-thread

#1
pun totally intended :ghost:

So, given the current performance problem Josh is having, I thought I'd share some info on how multithreading is generally done in large (astro)physical simulations.
After all, conceptually they're not super different from LT. And all of them are multithreaded, as running larger ones on a single core simply isn't viable.
Disclaimer: I'm not an expert on multiprocessing, and have little experience of the implementation, but a good grasp of the theory.
This may not be applicable to LT, but I'll let Josh be the judge of that.

Example case
To illustrate the methods, I'll use the following example case.
Say we want to simulate a solar system. We have planets, the sun, moons, comets etc. We dump these into objects (which we'll call particles) with the following properties:
-mass
-position
-velocity
Every tick, we need to calculate the acceleration on each particle. Now, the acceleration (due to gravity) depends on the ass and location of every other particle,
so this is a nice way to demonstrate how to deal with thread communication too.
(spoiler alert: you make it so you don't have too. iirc Josh spoke about each system being a thread and having to communicate data - that's not a good way of doing things.
We're on shared memory after all.)
In a classical singlethreaded approach, the pseudocode for this would be this:

Code: Select all

particles = list(all particles)
while(keepgoing):
    tick = something;
    for particle in particles:
        calculateAcceleration(particle, particles, tick)
    for particle in particles:
        particle.update()

The classic approach
This one is currently the most used one, not because it's better, but because it's easy to integrate into a program that was single threaded before.
Basically, first we make sure the state the sim was in at the start of each iteration is written to memory that is NOT changed during the calculations.
The calculating functions write their findings to other variables. The state then gets updated at the very end, in update().
e.g., the particle will have fields
x0, y0, z0, dx, dy, dz, etc. Calculations written to dx, state kept in x0.
This way you have no communication problems, as all the info each function needs is written in shared memory and nothing ever needs to be copied.
The actual multithreading happens as follows: loops where it is possible (in this case, both of the for loops inside the while) will get split between cores.
e.g. if you have 100 particles and 4 cores, each gets 25 of the loop iterations.
In the correct framework (my thesis project ShadowFax used openMPI) you don't even have to worry about implementation once the main framework is there -
I could just add new functions in loops and they would get multithreaded automatically! Very user friendly.
You don't quite get a 1:1 gain when adding cores, because of parts staying singlethreaded and waiting for synchronisation, but you should be able to get a nice performance boost.

Task-based parallellism
An elegant approach, for a more civilised age.
I first learned about it when receiving a talk about SWIFT. More efficient per core, but used less because you have to adapt your codebase to fit on it.
The basic idea is this: instead of dividing the work, we define tasks. Then we let each thread pick a task, execute it, and mark it as "done", pick a new task, etc.
The advantage of this is that each thread keeps going as long as there is work, thus avoiding threads idling because they're done while others threads are still doing stuff.
As for communication problems, same approach as above.
Now, of course, some tasks need to be done before others can be done.
In my example, calculateAcceleration() must happen before particle.update()
This is handled by giving each task dependencies on other tasks. Then we make the threads only pick up tasks for which all dependencies are met, and the work is
automatically done in the right order.
This avoids a large synchronisation losses and gives great efficiency on large numbers of threads. (SWIFT speak of 60% efficiency on 512 cores, which is really good)
Not sure if it's worth it for small-scale things like LT (at most 8 threads or so), but the approach is very elegant either way.
Paper: https://arxiv.org/pdf/1606.02738v1.pdf (open access, no paywall - interesting point for programmers in part three and further mostly)

So, hope it helps Josh, or is otherwise interesting.
Dino out o7
Last edited by Dinosawer on Tue Jan 24, 2017 1:57 am, edited 1 time in total.
Warning: do not ask about physics unless you really want to know about physics.
The LT IRC / Alternate link || The REKT Wiki || PUDDING
Image
Post

Re: The Multi-thread

#2
No, bad Dino. Giving me ideas for the stuff I wanted to code a decade ago...
I have to work on Purgatory...

Still, that's a damn good way of looking at Threading, the task based approach in particular would work really well for the masses of AI that LT will no doubt have.

Code: Select all

<+BMRX> Silver Invokes Lewdly Verbose Experiences Readily With Absurd Rectal Expeditions
Image
Image
Post

Re: The Multi-thread

#6
Cornflakes_91 wrote:
Silverware wrote:the task based approach in particular would work really well for the masses of AI that LT will no doubt have.
systems would also lend themself well as task units, process_solar_system() as individual task chunks.
Yeah, and you could schedule specific ones only once every 30 frames or something. (or whenever you switch into them) so long as you keep a record of the last time they updated.

Code: Select all

<+BMRX> Silver Invokes Lewdly Verbose Experiences Readily With Absurd Rectal Expeditions
Image
Image
Post

Re: The Multi-thread

#7
Silverware wrote:
Cornflakes_91 wrote:
Silverware wrote:the task based approach in particular would work really well for the masses of AI that LT will no doubt have.
systems would also lend themself well as task units, process_solar_system() as individual task chunks.
Yeah, and you could schedule specific ones only once every 30 frames or something. (or whenever you switch into them) so long as you keep a record of the last time they updated.
heck that could be internal to the process_solar_system() with themself keeping track how often and when to run a full update.

with the task scheduler having no idea which systems are at what LOD level.
(beyond is_loaded = TRUE)
Post

Re: The Multi-thread

#8
In task based it's actually better to keep the tasks as small as possible, for optimal time scheduling efficiency. process_system is too large.
Scheduling can be done much easier - in Shadowfax we assigned each object a timestep of 1/2^k, and if the current time was a multiple of that step it was simulated that step (the task was scheduled, if it were task-based). (tick per loop was the smallest timestep)
Allows you to run different things at different rates easily.
Warning: do not ask about physics unless you really want to know about physics.
The LT IRC / Alternate link || The REKT Wiki || PUDDING
Image
Post

Re: The Multi-thread

#10
Depends on the case, but in Swift, never. They had near perfect scaling on >32000 threads in a sim with 500 million particles. (the larger the amount of stuff the better it scales)

Mathematically speaking, in task-based approaches the synchronisation idle time (of cores waiting for one to finish) is at most as long as the largest of tasks between 2 syncs. So as long as you don't try ridiculously small sims there's a gain.
Warning: do not ask about physics unless you really want to know about physics.
The LT IRC / Alternate link || The REKT Wiki || PUDDING
Image
Post

Re: The Multi-thread

#12
Hi Josh, This might be for Graphics Josh, but if it helps to think in a "different light", then all the better, no?


I was drawn to Dino's link to ShadowFax and its gallery.

I shall draw your attention to the third movie (far right, a load of dots in a square).
When clicking on this, it says:

"Single generator moving through a 2D Voronoi mesh, illustrating the continuous way in which the Voronoi mesh adapts to a continuous movement of the generators. Faces shrink and grow, but do not suddenly (dis)appear."



When I watched this vid, it got me thinking about using voronoi meshes for the imposters issue Josh was having a while ago, except in 3d.

So a Google search for "3d Voronoi mesh imposters" brought me to a Powerpoint presentation on GPU Computational Geometry.

A synopsis is as follows:

Good fits for GPU:
- Voronoi Diagrams, Distance Fields

3 Research Papers:
Generic Mesh Refinement on GPU
- Use/ update coarse meshes on CPU, generate refined meshes on GPU to desired LOD
- Main Reason: Overcome Bandwidth Bottleneck
- Secondary Reason: Offload work load from CPU onto GPU
- Performance: Frame rates are equivalent, #Vertices on bus greatly reduced, CPU free up to work on other tasks than refinement
- Conclusions:
- Simple Vertex Shader Method for low cost tessellation of meshes on GPU
- Reduced bandwidth on graphics bus
- CPU freed up
- Dynamic LOD on GPU (Page 33)
- Goal - GPU LOD Geometry
- For max efficiency, data structures must support parallel algorithms
- Overview of approach
- Creation
- LOD Atlas Texture
- Rendering
- Select appropriate LOD level
- Render on GPU
- Performance
- GPU approach about 10x faster than CPU approach
- Isosurface Computation Made Simple: Hardware Acceleration, Adaptive Refinement and Tetrahedral Stripping"
Edit: Yikes! That took an hour to review and write up! :wtf: :ugeek: :problem:
YAY PYTHON \o/

In Josh We Trust
-=326.3827=-

Online Now

Users browsing this forum: No registered users and 1 guest

cron