Sunday, February 12, 2017

55 89 e5
Very productive week, and, overall, quite a bit more satisfying than last week. The week was split straight down the middle, 50/50, as I will explain.
The first few days were dedicated, as expected, to continuing my analysis of LuaJIT performance, getting profiling tools in place, and performing profile-guided optimization to get a better feel for where the milliseconds were going. I'm pleased to report that this all went quite well!
LuaJIT's built-in profiler is indeed really nice. It didn't take much effort at all to set up, and the results helped me get a much better perspective on the existing code's performance. Unfortunately, the results of several profiling runs of the PAX demo weren't exactly what we like to see in performance optimization. Frequently, something like the '80/20' rule will apply in profiling and optimization: it's common to find that a small minority of the code is consuming a majority of the CPU time (hence the oft-abused quote "premature optimization is the root of all evil"). This is actually what we want
to see, since it means concentrating effort on that small minority can yield big gains. Alas, it wasn't the case with the demo. The profiler reported a lot of very small percentages across the entirety of the code, with the best 'bottlenecks' peaking at maybe ~5%. The meaning of such a result is basically: there's not much we can do to reap big performance gains -- at least, not much we can do easily.
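For those curious what a 'flat' profile looks like in practice, here's a minimal sketch using Python's built-in cProfile (LuaJIT's profiler is invoked differently, of course, but the interpretation is the same: you're hoping for one dominant hotspot, and a flat profile is the opposite). The workload here is an illustrative stand-in, not actual LT code:

```python
import cProfile
import io
import pstats

def simulate_ship(n):
    # Stand-in workload: some arithmetic per 'ship'.
    total = 0.0
    for i in range(n):
        total += (i * 3.7) % 11.0
    return total

def run_frame():
    for _ in range(100):
        simulate_ship(1000)

pr = cProfile.Profile()
pr.enable()
run_frame()
pr.disable()

# Sort by cumulative time and look at the top entries. A healthy '80/20'
# profile shows one or two dominant rows; a flat profile shows many
# small, roughly-equal percentages -- the bad news described above.
out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```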
Still, the profiling was helpful enough for me to weed out the peaks that did exist (mostly by transferring them to the C library), resulting in a decent performance gain of somewhere between 25% and 40%. At the end of it all, my heavy-duty machine was able to run a 1000-ship battle at 60 fps with a millisecond or so to spare. Remember, of course, that not all logic is implemented, so this isn't representative of what you'll be able to do in LT, although it's encouraging. The goal for me has always been to allow ~1000-ship battles at decent framerates, and at this rate, I believe we'll be able to do that.
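To put 'a millisecond or so to spare' in context, here's the back-of-envelope frame-budget arithmetic (the pre-optimization frame time below is a hypothetical number for illustration, and the 25-40% figure is treated as a reduction in frame time):

```python
def frame_budget_ms(fps):
    # Total time available per frame at a given framerate.
    return 1000.0 / fps

budget = frame_budget_ms(60)        # ~16.67 ms available per frame at 60 fps
before = 22.0                       # hypothetical pre-optimization frame time (ms)
after_low = before * (1 - 0.25)     # 25% reduction -> 16.5 ms (just under budget)
after_high = before * (1 - 0.40)    # 40% reduction -> 13.2 ms (a few ms to spare)
print(budget, after_low, after_high)
```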
That result, among others, has convinced me that we're safe to move forward with LJ. I'm not able to hop off the fence and plant my feet as firmly as I'd like to in the "this will work" grass, but for now, what I've seen has allowed practical me to conclude that, with continued profiling and a careful eye on the milliseconds, we will be able to proceed with LJ. This is also taking into consideration that, if things do start to get too heavy, I'll likely be able to make enough little scale cuts here and there to push out a still-high-quality LT 1.0. In other words, I really just wanted to remind you all that I'm keeping practicality in mind and am open to, for example, having to slightly lower the scale of system economies or out-of-system simulations, etc., if it means the difference between 60 and 10 fps.
I'd be using more smileys and squirrels if I weren't so scared of being excited about solutions.
Now, in complete contrast to the first half, the latter half of my week was spent hedging my bets. Several people have asked the question "what happens if LuaJIT falls through?" The answer is that I do
have quite a few options remaining in my mental priority queue. I explained in my 'State of Limit Theory' post that, at that point, I was splitting my effort roughly 50/50 between LJ and a different solution involving code generation from script. Since then, the former has escalated to taking most of my time, what with PAX and the excitement thereafter of having something I can play and iterate on. Now that I've decided to move forward cautiously with LuaJIT, I do intend to resume my 'hedging' efforts. Although I'll continue allocating the majority of my time to 'LT in LJ' in hopes that it'll pull through for us, I'll still be giving ~20-30% to R&D for the next potential solution in the priority queue. This week, in an attempt to spin up that next solution and give it some momentum, I gave it a one-time boost of about half my time.
So, currently sitting at second place in the queue is what I think of as a 'nuclear' option -- a sort of fail-safe, brute-force option. Indeed, it may be a bit scary that a nuclear option is next in line behind LJ, but Practical Josh is thinking of it this way: "I'm getting really tired of solutions failing; had I gone with a nuclear option in the first place, I'd be done by now." Flawed logic, because I didn't have the know-how to pursue this two years ago (again, the knowledge gained from those failed attempts is nontrivial!) But, now that the option is accessible to me, I intend to give it a low-priority JoshOS thread as my hedge against LJ failing, at least until something really promising displaces it in the queue.
Over the past few days I've worked hard to spin up this solution, and already have some solid results to show for it. I built a working x86 assembler/linker (of course, not full-featured yet, but working in the sense of 'capable of generating in-memory programs & functions using a restricted instruction set'). Just a few hours ago I had my first successful test run, in which I used it to create some simple math functions at run-time. They worked!
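For the curious: the core trick, creating an executable function in memory at run-time, can be sketched in surprisingly few lines. My version is in C and assembles its own encodings, but here's the same idea in Python via mmap/ctypes, with a hand-assembled 'add two ints' function (this assumes an x86-64 System V platform, i.e. Linux or macOS, where the first two integer arguments arrive in edi and esi):

```python
import ctypes
import mmap

# Hand-assembled x86-64 machine code (System V ABI):
#   89 f8    mov eax, edi   ; eax = first argument
#   01 f0    add eax, esi   ; eax += second argument
#   c3       ret            ; return value in eax
code = bytes([0x89, 0xF8, 0x01, 0xF0, 0xC3])

# Allocate a readable/writable/executable page and copy the code in.
buf = mmap.mmap(-1, mmap.PAGESIZE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(code)

# Wrap the page's address in a callable with signature int(int, int).
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
add = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int, ctypes.c_int)(addr)

print(add(2, 3))  # 5
```

A real assembler/linker is this plus instruction encoding, label resolution, and relocation of calls back into the host program, but the 'write bytes, mark executable, call them' kernel is exactly this small.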
It was an intense few days of reading hundreds of pages of Intel's software developer guide to better understand CPU architecture and, more importantly, to understand enough to translate assembly to machine code (that's what an assembler does, among other things). Let me reiterate, since some of you are probably scared that I've lost my mind now: this is for plan Z, in the event that LJ fails (and fails hard enough for me to walk away from that substantial time investment) and no other solution manifests in the meantime. We should hope that it ends up being nothing more than a nice learning experience for me. Which, by the way, is very refreshing to have every now and then. My brain feels like it got to go on vacation during the second half of the week thanks to this work (yes, I love learning enough that reading Intel manuals and looking at opcodes in hex feels like a vacation). I was surprised at the relative ease of doing it -- I initially assumed we were looking at months until any results at all; it turned out to be a few days.
This so-called nuclear option allows something along the lines of 'compiled LTSL,' i.e. LTSL running at nearly the same speed as the pre-compiled code. Some people have asked about the feasibility of such a solution before, since it seems like a somewhat-straightforward way of solving the original problem of having run into one kind of limit with C++ and another kind of limit with LTSL. I honestly didn't have the knowledge to do that kind of thing before. But, having spent so much time with intermediate representations (thanks, failed Python solution), JITs & asm (thanks, LJ + failed Python solution), and codegen (thanks, C++ codegen, along with the 32 other metacompilers I've written in my life..), I now do. Best of all, I have a number of attack vectors for doing so (TCC, LLVM's IR/code generator), but the most appealing to me (for plan Z) is the solution involving the minimal number of intermediate pieces that could fail: my own direct-to-machine-code, in-memory compiler/assembler/linker. I already wrote an LTSL 'compiler' that takes it as far as an expression tree (essentially an executable AST), and now I've got the humble yet promising beginnings of the latter parts.

It might sound like a monumental task, but the reality is that I only have to implement a tiny fraction of what 'general-purpose' tools do. In other words, when I first created LTSL and the compiler for it, I didn't worry about making it a feature-complete language; I worried about making it super easy to write ship generation algorithms, UI, and gameplay code in. Similarly, for the assembler/linker, I'm not concerned with outputting executables or shared libs, nor with PLTs, GOTs, or many of the other complexities. I'm concerned with writing a relatively small subset of ops to memory and executing them from a program that's already running (the LT core) in order to quickly evaluate things (like LTSL expression trees). When you cut the problem down to the core, it's a lot less scary.
Still not easy, of course, but totally within the realm of feasibility.
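To make 'executable AST' concrete: it's an expression tree in which each node knows how to evaluate itself, and a code-generating back-end walks the very same tree emitting machine code per node instead. Here's a minimal sketch of the idea (the node names are illustrative guesses, not actual LTSL internals):

```python
from dataclasses import dataclass

# Each node evaluates itself recursively. A 'compiled LTSL' back-end
# would traverse this same structure and emit ops instead of computing.

@dataclass
class Const:
    value: float
    def eval(self, env):
        return self.value

@dataclass
class Var:
    name: str
    def eval(self, env):
        return env[self.name]

@dataclass
class Add:
    left: object
    right: object
    def eval(self, env):
        return self.left.eval(env) + self.right.eval(env)

@dataclass
class Mul:
    left: object
    right: object
    def eval(self, env):
        return self.left.eval(env) * self.right.eval(env)

# (x * 2) + 1, with x bound to 10 at evaluation time.
expr = Add(Mul(Var("x"), Const(2.0)), Const(1.0))
print(expr.eval({"x": 10.0}))  # 21.0
```

Walking the tree node-by-node like this is exactly the cost that compilation removes: the interpreted version re-dispatches on node types every evaluation, while the compiled version pays that cost once.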
(I know that I'll still catch flak from some people on this, despite stressing that it's plan Z and not by any means the focus of the coming weeks / months of development...but hey, I said I was going to be honest about the good, bad, and ugly.)
Finally, the other 10% of my week was spent creating the beginnings of a benchmarking script utility to help me be more precise about my observations and decisions when it comes to all these different solutions and their respective performances. This was, in part, motivated by a simple test that I performed pitting C, unoptimized C (i.e. no compiler optimization), D, Lua (the standard interpreter), LuaJIT, and Python against one another in a small bit of code. In doing so (and finding a few interesting oddities along the way), I realized that, for someone who spends so much time thinking about a very hard problem related to perf, I've got a startling deficit of objective, quantitative data to back me on my calls. It's been a good week for quantitative measurement, what with the profiling runs and all. I decided that I need more of that in my life.
It's a very simple little utility (written in Python!), but with a few more hours of work it'll help me record concrete information about relative performances in the face of many variables (language/solution, piece of code, machine, OS, CPU architecture, GPU, etc.). As I continue development, I plan to toss new benchmarking tests into the mix to help me stay abreast of FPLT in a precise way. Hopefully I'll be able to quote precise figures in the future instead of just "too slow, close but not quite, good enough, really fast." I'll also have a way to be less on-the-fence about things like LJ, since I'll have hard data to say what is and what isn't working.
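The shape of such a harness is simple: time a workload, then record the result alongside the environment variables that matter. A minimal sketch (the field names and workloads here are my illustrative guesses, not the actual utility):

```python
import json
import platform
import timeit

def bench(name, fn, number=10_000):
    """Time fn over `number` calls and tag the result with environment info."""
    seconds = timeit.timeit(fn, number=number)
    return {
        "test": name,
        "sec_per_call": seconds / number,
        "machine": platform.machine(),
        "os": platform.system(),
        "impl": platform.python_implementation(),
    }

# Example workloads to pit against one another; new tests just get
# appended to this list as development continues.
results = [
    bench("sum_loop", lambda: sum(range(1000))),
    bench("string_join", lambda: "-".join(map(str, range(100)))),
]
print(json.dumps(results, indent=2))
```

Dumping records like these to a file per machine/solution is what turns "feels fast" into a table you can actually compare across runs.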
This coming week, I'm excited to say that we resume 'development as usual,' at least in some sense. The majority of my time is to be devoted to implementing more LT in LJ rather than scrutinizing the existing code for performance, which should prove to be a relief from walking on eggshells with a profiler in hand! Of course, the minority is to be devoted to 'assembling' a nuclear warhead, so that's a rather fun contrast.
In all, I'd estimate a 100% chance of fun!
PS ~ I'm still working on the whole brevity thing when it comes to my logs. They look a lot smaller in full-screen vim and somehow become startlingly long when I paste them into the forum's post editor.

b8 39 05 00 00 5d c3