
Monday, February 6, 2017

#1

Whew, roller-coaster of a week. Ultimately, a pretty crushing ending, sad to say. As stated in the last devlog, my plan for the week was to allocate 100% of effort to exploring the performance characteristics, and ultimately the feasibility, of the current LuaJIT solution.

The tentative conclusion, after getting a finer handle on where I lose microseconds with LJ, was...LJ is really close to being feasible. With the current tests, it is performant enough. But don't get too excited yet. Despite injecting some more logic, my tests aren't yet representative of the full scale of LT simulation. As such, I would only have considered passing the tests with 'good headroom' -- by which I basically mean 'spare time' -- a green light for LJ. In a perfect escalation of tensions, LJ did pass, but not with much headroom. Essentially, the current solution just scrapes by the current tests, which puts me in a very nerve-wracking situation. LT will quite obviously be more intensive than my tests, which means, presumably, LuaJIT would not perform well (FYI, I was not running these tests on my older computer, I was running them on a rather powerful laptop, so that contributes to the nerve factor). On the other hand, I haven't explored all optimization routes yet -- I could push more logic into C, I could continue to learn how to squeeze the most out of LJ, etc.

Well, some of you will be happy to know that my first thought (believe it or not) was threading. Despite how frequently I complain about the difficulty of threading game logic, I actually made a great design decision when building the PAX demo: unlike all of my previous work, I (somewhat impulsively) decided to implement a 'data-driven' design for the component-based entities. I think I did so initially to minimize function call overhead (with this design, I write update & draw functions that take lists of specific object types and perform the necessary logic on them; this is in contrast to simply having 'monolithic' loops that call update/draw functions on one entity at a time). I quickly realized this was a great decision for graphics optimization, as it allowed me to eliminate a lot of OpenGL state change calls. In fact, the simple PAX renderer (theoretically) runs substantially faster than the old C++ LT renderer. But, most importantly, this design ended up being critical to opening a path for a limited amount of game logic threading.
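To make the contrast concrete, here's a minimal sketch in C of the 'data-driven' style described above. The `Motion` component and its fields are purely illustrative, not the PAX demo's actual types:

```c
#include <assert.h>

/* Hypothetical component type -- illustrative names, not the real API. */
typedef struct { float px, py, vx, vy; } Motion;

/* 'Data-driven' style: one update function takes a homogeneous list of a
 * single component type and processes it in a tight loop, instead of a
 * monolithic loop calling update() on one heterogeneous entity at a time.
 * The loop body stays hot and the data is walked linearly through memory. */
static void Motion_UpdateAll(Motion* list, int count, float dt) {
    for (int i = 0; i < count; ++i) {
        list[i].px += list[i].vx * dt;
        list[i].py += list[i].vy * dt;
    }
}
```

Because each such function owns a whole homogeneous list, it is also a natural unit to hand to a worker thread, which is the property exploited below.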

Now, in C(++), this would have been a fairly simple ordeal. But we're in LJ. Lua has no native support for threading, but it's designed such that there's nothing stopping you from running multiple interpreter instances on multiple threads. The real problem, however, is that each such instance has its own data -- there would be no direct 'sharing.' Shared memory is absolutely necessary to getting any performance gain out of threaded logic. The alternative (and the mechanism that existing Lua threading packages use) is to send data as necessary between the different interpreters via communication channels. Frankly this is a waste of time for most high-performance code. In my case, the time it would take to send even a basic snapshot of the state of some objects to another thread in order to have it process them would vastly outweigh the time it would have taken to simply perform the logic on one thread. I don't need to do a perf test to know that non-shared-memory multithreading is a waste of my time :P OTOH, Lua isn't built for shared-memory threading. Two different interpreter states cannot get at one another's data.
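A rough sketch, in C, of what a non-shared-memory 'channel' between two interpreter threads has to do for every object (all names here are hypothetical, for illustration only): take a lock, which implies a memory barrier, and copy a full snapshot of the object's state in, then back out on the other side. With shared memory, none of this work exists.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

typedef struct { double state[16]; } ObjSnapshot;  /* illustrative payload */

typedef struct {
    pthread_mutex_t lock;
    ObjSnapshot     slot;
    int             full;
} Channel;

static void Channel_Send(Channel* c, const ObjSnapshot* s) {
    pthread_mutex_lock(&c->lock);     /* synchronization + memory barrier */
    memcpy(&c->slot, s, sizeof *s);   /* full serialization of the object */
    c->full = 1;
    pthread_mutex_unlock(&c->lock);
}

static int Channel_TryRecv(Channel* c, ObjSnapshot* out) {
    int got = 0;
    pthread_mutex_lock(&c->lock);
    if (c->full) {                    /* deserialization on the other side */
        memcpy(out, &c->slot, sizeof *out);
        c->full = 0;
        got = 1;
    }
    pthread_mutex_unlock(&c->lock);
    return got;
}
```

Every object crossing the channel pays the lock, the barrier, and two copies; that per-object cost is the crux of the argument against message passing here.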

Anddd that's how my Friday was spent, striking out the 'cannot' and replacing it with 'should not try to' :lol: I devised a decent little hack to force the interpreters to share memory. I had no illusions about how dangerous this was. Nonetheless, I basically leapt out of my chair with joy when a simple test actually showed my mechanism to be working, with threads smoothly sharing memory. It was fast. It was correct (no memory corruption). I built a threadpool utility and some helper functions to let me control arbitrary numbers of worker threads splitting parallel code paths among themselves. Around midnight, I had finally put enough gears in place to try it out on the real thing: threading the logic for ships in the PAX demo.
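The fork-join shape of that worker-splitting idea can be sketched in plain C (the real threadpool lived on the Lua side; these names are illustrative): each worker takes a contiguous slice of a homogeneous object list, which is exactly what the data-driven design provides.

```c
#include <assert.h>
#include <pthread.h>

typedef struct { double* data; int begin, end; } Slice;

static void* Worker(void* arg) {
    Slice* s = (Slice*)arg;
    for (int i = s->begin; i < s->end; ++i)
        s->data[i] *= 2.0;            /* stand-in for per-object logic */
    return NULL;
}

/* Split [0, count) into contiguous chunks, one per worker thread, then join. */
static void ParallelFor(double* data, int count, int nthreads) {
    pthread_t tid[8];
    Slice     job[8];
    if (nthreads > 8) nthreads = 8;
    int chunk = (count + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        int begin = t * chunk;
        int end   = begin + chunk < count ? begin + chunk : count;
        job[t] = (Slice){ data, begin < count ? begin : count, end };
        pthread_create(&tid[t], NULL, Worker, &job[t]);
    }
    for (int t = 0; t < nthreads; ++t)
        pthread_join(tid[t], NULL);
}
```

This only pays off if the workers can touch the object data directly, which is precisely where Lua's lack of shared memory bites.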

And that's when everything caught on fire. :cry:

It took about five minutes to come to the definitive conclusion that shared-memory threading of any real complexity (e.g. accessing tables within tables) isn't possible in Lua. Best guess is that the garbage-collected architecture screws everything up. Instead of pointers, tables likely use state-relative indices, breaking any attempts to access memory from other interpreters. Whatever the case, I'm nearly certain that the GC is to blame. I can't possibly overstate how much I hate garbage collection.

All in all, a fairly heartbreaking week. Threading would have pushed me over the edge into the green -- I would have had a comfortable amount of headroom. Alas, 'twas not meant to be.

I'm not ready to give up on LJ yet. It has still come closer than any other solution. I will not discard it lightly. This week, I will delve even deeper into learning performance characteristics of LJ -- in particular, there is a built-in profiler that is (apparently) very good. I need to get that running so I can see what's going on and where my opportunities for buying time are. Catch is, to run the profiler I have to be running my program directly from LJ, not launching LJ from my program. A little finessing of the C engine will be necessary for me to convert it into a shared library, and then a little time will be required to make a Lua script that does the (basic) functionality that the C core did with respect to starting things up. Shouldn't be long before I can grab some nice profile results. After that? Who knows. I will try whatever I can. The goal is to buy headroom. The goal is to push LJ into the green.
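The shape of that restructuring might look something like this (a hedged sketch only; the function names and build command are hypothetical, not the real engine's API): the engine's entry points get compiled into a shared library so that the `luajit` binary is the host process and a small Lua bootstrap drives startup through the FFI.

```c
#include <stdio.h>

/* engine.c -- hypothetical sketch. Built as a shared library, e.g.:
 *   cc -shared -fPIC engine.c -o libengine.so
 * so a Lua bootstrap script launched directly by the luajit binary can
 * load it via the FFI and run the profiler on the whole program. */
int  Engine_Init(void)        { puts("engine: init"); return 1; }
void Engine_Update(double dt) { (void)dt; /* step the simulation by dt */ }
void Engine_Free(void)        { puts("engine: shutdown"); }
```

The Lua side then only needs to `ffi.load` the library and call these in a loop, replicating the startup logic the C core used to own.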

I believe it can be done. Whether I can figure out how is another matter, and I suppose we'll all have to stay tuned to find out :geek:

---

PS ~ In case anyone was going to point it out, the 'thread' construct in Lua is deceptively-named, and has nothing to do with true hardware parallelism. It's not helpful in this quest, sadly. Also, LuaLanes, luasched, LuaTask, etc, etc etc etc...none of the existing packages for Lua threading implement the kind we need (shared-memory, preemptive). 'LuaThread' claims to be shared-memory and preemptive. Sadly it has mysteriously disappeared from the internet, and the only similarly-named impostors I can find do not implement true shared-memory/preemptive (likely because, as I found, it is not possible :ghost:)
“Whether you think you can, or you think you can't--you're right.” ~ Henry Ford

Re: Monday, February 6, 2017

#4
football13tb wrote: I don't know anything about coding other than the occasional C++ for simple modding. What I do know is that what you're trying to do is extremely hard, and I have nothing but respect for what you are trying to accomplish. I hope all the best for you and Limit Theory :D

~Lurker since 2014
Thanks man :)
Cornflakes_91 wrote:...for your special case.

Else i'd wonder how those gigantic clusters work :ghost:
Yes I thought that was implicit :monkey:

Re: Monday, February 6, 2017

#6
Dinosawer wrote:They have a fairly limited amount of data to transfer (say, 10-15 doubles per simulated object or so) and only need to do it a handful of times per loop.
Significantly more than that, but even still: that's per object, meaning I have to loop through every object, package its state into a buffer, push that buffer to a thread (requiring a mutex, or at best an atomic set, in between, causing a CPU memory barrier), then do it all over again in the opposite direction when my thread has processed its jobs/objects. That's two full serializations & deserializations of every object, plus memory barriers when pushing through comm channels, plus the full loop overhead of having to iterate through every object in my main thread (which is half of what I wanted to avoid in the first place, since the iteration itself has a nontrivial cost).

Now, add that all up, and keep in mind that the latency of a single object's logic step is going to be on the order of, say, ~1 microsecond. I guarantee we will spend more than 1 microsecond on average doing the above tasks (none of which are necessary with shared memory). In fact, I would place lots and lots of money on a bet that we will spend more than 7 microseconds, making the scheme not worth it even with an ideal 8-thread environment!
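The break-even arithmetic above can be written down as a tiny model (numbers are illustrative): if per-object logic costs W and per-object messaging overhead costs O, an ideal T-thread split takes O + W/T per object versus W single-threaded, so it only wins when O < W * (1 - 1/T).

```c
#include <assert.h>

/* Largest per-object overhead at which an ideal T-thread split still beats
 * running the logic single-threaded: O + W/T < W  =>  O < W * (1 - 1/T). */
static double MaxUsefulOverhead(double work_us, int threads) {
    return work_us * (1.0 - 1.0 / threads);
}
```

With W = 1 microsecond and 8 threads, overhead must stay under 0.875 microseconds per object for threading to pay off at all -- so ~7 microseconds of serialization per object makes the scheme a clear net loss, matching the estimate above.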

I promise there's no world in which non-shared memory gets our logic to run faster (in the context of LT, where updates are happening blazingly fast but over a very large number of objects with significant amount of state) :/

Re: Monday, February 6, 2017

#7
Just a thought, and I don't recall if LT already has this...
but do you currently have an adaptive LOD for when frame rates start to drop ? :ghost:


EDIT: CSE has phrased the question far better than I on the next page.
Last edited by N810 on Mon Feb 06, 2017 3:13 pm, edited 2 times in total.
"A sufficiently advanced technology is indistinguishable from magic."
- Arthur C. Clarke

Re: Monday, February 6, 2017

#8
JoshParnell wrote:push that buffer to a thread (requiring a mutex or at best atomic set in-between, causing CPU memory barrier)
This caused a thought to form in my (work distracted) lizard brain: with optimisations, if you can't change the code any more, you look at the data: any chance you could use some kind of circular buffer or stack mechanism to transfer data? Avoiding mutexes or locks by keeping a single pointer to the current position, writing to the end of the buffer/stack, etc.?
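The single-producer/single-consumer version of that idea can be sketched in C11 (illustrative only; this wasn't part of the actual engine): each side owns exactly one index, so no mutex is needed, just acquire/release ordering on the indices. Capacity must be a power of two for the wrap-around arithmetic to stay consistent.

```c
#include <assert.h>
#include <stdatomic.h>

#define CAP 8  /* must be a power of two */

typedef struct {
    double slot[CAP];
    _Atomic unsigned head;   /* written only by the producer */
    _Atomic unsigned tail;   /* written only by the consumer */
} Ring;

static int Ring_Push(Ring* r, double v) {
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h - t == CAP) return 0;                    /* full */
    r->slot[h % CAP] = v;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return 1;
}

static int Ring_Pop(Ring* r, double* out) {
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t == h) return 0;                          /* empty */
    *out = r->slot[t % CAP];
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return 1;
}
```

This does remove the mutex cost, but note it doesn't address Josh's main objection: each object still has to be copied into and out of the buffer, so the serialization cost per object remains.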
--
Mind The Gap

Re: Monday, February 6, 2017

#9
JoshParnell wrote: I promise there's no world in which non-shared memory gets our logic to run faster (in the context of LT, where updates are happening blazingly fast but over a very large number of objects with significant amount of state) :/
Well, your updates would have to be very fast and amount of state very large, cause all the rest is the same for astrosims, where distributed memory is the norm :ghost:
Warning: do not ask about physics unless you really want to know about physics.
The LT IRC / Alternate link || The REKT Wiki || PUDDING

Re: Monday, February 6, 2017

#10
Dinosawer wrote:Well, your updates would have to be very fast and amount of state very large, cause all the rest is the same for astrosims, where distributed memory is the norm :ghost:
Yes, comparatively speaking they are. Those sims are way more intensive in the inner loop. Take N-body, for instance: it's O(N) in the inner loop, which is already asymptotically more logic than most things in LT have to run. Even for sims with O(1) inner loops, we're talking about a lot of (heavy) math. One of the papers I skimmed makes a point of keeping memory overhead extremely low. That's another major difference here: the memory / inner-loop-time ratio is way, way lower in that kind of simulation.

When I said the LT logic is heavy, I meant in comparison to typical game logic, not in comparison to astrophysical simulation...

Bottom line is, our memory overhead is much too high in comparison to the latency of inner loops. Remember that the situation in LT is 'death by a thousand papercuts.' None of the logic is particularly CPU-hungry, but we have a lot of things to deal with (things with fair amounts of state). Even in C++ cache misses probably accounted for a lot (most?) of the update loop's time (pre-LTSL).

Re: Monday, February 6, 2017

#11
On the bright side, the LuaJIT profiler is indeed very helpful :thumbup:

On the downside, the 'death by papercuts' is even more apparent when you look at a profiled view....lines are showing 1%, 1%, 1%...maybe 2% here and there...I think the biggest 'bottleneck' I have seen is 5% (overall CPU time). That's not a fun situation for optimization :( Much prefer to see 20%...50%...90%...would be nice.

Re: Monday, February 6, 2017

#12
Keep in mind that the idea of threading on Linux, before threading became 'mainstream', was to use sockets between threads. Sometimes you can even use streams. That way you're not really mixing data across the memory barrier; you're just writing to it like you would a file.

Not sure if this helps, but sometimes going backwards to see how things were done in the past might inspire ideas going forward.
Early Spring - 1055: Well, I made it to Boatmurdered, and my initial impressions can be set forth in three words: What. The. F*ck.

Re: Monday, February 6, 2017

#13
JoshParnell wrote:On the bright side, the LuaJIT profiler is indeed very helpful :thumbup:

On the downside, the 'death by papercuts' is even more apparent when you look at a profiled view....lines are showing 1%, 1%, 1%...maybe 2% here and there...I think the biggest 'bottleneck' I have seen is 5% (overall CPU time). That's not a fun situation for optimization :( Much prefer to see 20%...50%...90%...would be nice.
Well... remove some functions :V
MORE MONOLITHS!

