TROA Archives - Samuel J. Greear

January 24, 2009

In-language commenting

Lua is clever and uses –[[ … –]] for multi-line comments. The usefulness in this scheme is that the syntax used for closing a multi-line comment IS a comment, and no syntax error results if you comment out the opener by adding an extra -, which re-enables the commented out section. The part that I don’t like is how single-line comments are specified, with –. Just IMO, FWIW, etc., I don’t think the same characters used for basic operators should be re-used for something like a comment. Really, shouldn’t a = a — 2 result in a-4?

The same in C99/C++, how does using // to specify a comment really make sense? a //= 2 ? a ///= 2 ? /* */ -style comments are better, I suppose, because they are an obvious nop.

Ideally I think we would be using something that allows for the abuse that Lua’s multi-line comments gives us, without overloading any of the basic operators provided by the language. “ perhaps?

“ Comment

“
Multi-line comment
“

“ “
“ Active multi-line comment?
a //= 37;
“

Maybe, I’m not sure.

There are also languages that allow comments to effectively serve as documentation that is preserved (potentially) at runtime. For example, Python assumes unassigned strings immediately inside a class or function definition are to serve as a documentation block for that section. These documentation blocks can be accessed through runtime introspection, or used with appropriate tooling to generate documentation. This is a great approach from my experience, and perhaps -all- comments should be given the opportunity (optionally) to persist alongside the code for which they are intended.

If one is to take this approach, how does one rectify comment specification with static string initialization? Especially with regards to whitespace (esp. newline) handling? How does one specify to what block of code a comment refers when not at the beginning of a closure of some sort? Is there even any point in preserving these comments for any sort of runtime introspection or should the goal be simply the production of documentation, ala Literate Programming?

January 2, 2009

Virtual machine opcode dispatch experimentation

I was reading The case for virtual register machines recently and decided to do a bit of experimentation with different opcode dispatch methods. Apparently, up to 60% of the cpu time burned by common virtual machines is due to branch mispredicts. This is rather a silly problem to have in the context of opcode dispatch, considering the VM knows quite readily exactly where it will be branching to for each VM instruction. As a result, there is really no reason for the mispredicts apart from the fact that we can’t actually tell the cpu what we know. Since there is no useful mechanism of any sort (at least on all x86 cpu’s that I know of) to say to the cpu, branch at foo will go to bar (short of JIT’ing everything, which can indirectly solve the branch mispredicts which happen at the opcode dispatch stage), the best we can really hope to do is attempt to seed the branch predictor with past branch information that will hopefully prove useful in the future. This proves to be somewhat problematic, as different cpu’s have branch predictors implemented in different ways and with different capabilities, and varying mispredict penalties. You also tend to burn cycles and space over more direct implementations, you just have to find the algorithm that lets you come out ahead due to increased prediction accuracy.

To really figure out what is going to work best for a full blown VM, I think you need to start at the beginning. The paper above referenced a couple of different ways that opcode dispatch is typically accomplished, but I wanted to write my own test cases and figure out exactly what would work the best, and more importantly, what definitely was not going to work, so that I could avoid wasting time on it in the future. These are more important simply because the faster running algorithms will very likely be somewhat dependent on the number of opcodes a VM implements and the frequency with which it executes opcodes repeatedly or in the same order.

My preliminary test cases are on github here: http://github.com/evilsjg/hacks/tree/master/troa/test/dispatch, and the runtime results with various compilers on various CPU’s is here: http://github.com/evilsjg/hacks/tree/master/troa/test/dispatch/RESULTS.

As you can see, the “goto direct” version is the fastest in every case by a relatively healthy margin. To qualify these results I implemented the same method of dispatch as the goto direct case in the Lua VM. Much to my dismay it was consistently (~10%) slower than the switch-based dispatch that is standard in Lua. After quickly realizing it was purely a function of opcode count, 5 in my tests vs 38 in Lua, I modified my Lua patch to be more like the goto direct 2 example. Runtimes are not provided in the RESULTS file for this, but it was marginally slower than the goto direct case. After this, Lua was consistently faster (up to around 30%) on some of my test hardware, and marginally slower on others. Making minor changes to the breadth or depth of the nested if or switch statements expanded into each opcode had minor changes one way or the other on all processors tested. Typically, faster on my Xeon 3220 meant slower on my Athlon XP 2500+, and vice versa, but by differing magnitudes. The Xeon gets faster, faster than the Athlon slows.

The entry point to my post about this on the Lua list can be found here.

There is obviously performance to be had here, probably quite a bit of performance. My next bit of testing will focus on expanded (# of opcodes) versions of the faster test cases, with more realistic opcode distribution. In terms of algorithmic improvements, I am going to try grouping opcodes in various ways adding the group identifier to the opcode itself, so that the dispatch data structures can nest like switch (group) { case n: switch (op) { } * n }. I am also going to play with the concept of simple opcode or group ordering rules. The compiler frontend of any VM follows some set of rules, intended or not for generating the opcodes that the VM executes. Even with a VM implementation that does not enforce those rules, and allows opcode execution in any order, knowing the likely order will no doubt be useful for optimization.

A requirement in my mind early on for the TROA VM was the easy evaluation of expressions on vectors or streams in the language to make extensive use of SIMD possible inside the VM. This concept is being weighted right up to the top of my list after having done this opcode dispatch testing. Even in the basic unoptimized case where your opcode operates on its vector/stream serially, there is still potential for double-digit overall program performance improvement due to the reduction in opcode dispatches.

August 31, 2008

Object inheritance musings

class Controller defines IController
class CProcess implements IController
class CThread implements IController
x = Controller(CThread).new()

interface IFDEvent
class Select implements IFDEvent
class Poll implements IFDEvent
class KEvent implement IFDEvent

class SocketServer extends FDEvent(Poll) [defines ISocketServer]

class HTTPServer extends SocketServer [defines IHTTPServer]
class FTPServer extends SocketServer [defines IFTPServer]

server = HTTPServer.new()
-or-
server = HTTPServer(SocketServer(KEvent)).new()

No? Where does this break down?