In-language commenting

Lua is clever and uses --[[ ... --]] for multi-line comments. The useful part of this scheme is that the syntax used for closing a multi-line comment is itself a comment, so no syntax error results if you comment out the opener by adding an extra -, which re-enables the commented-out section. The part that I don’t like is how single-line comments are specified, with --. Just IMO, FWIW, etc., I don’t think the same characters used for basic operators should be re-used for something like a comment. Really, shouldn’t a = a -- 2 result in a-4?

The same in C99/C++, how does using // to specify a comment really make sense? a //= 2 ? a ///= 2 ? /* */ -style comments are better, I suppose, because they are an obvious nop.
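For comparison, C99/C++ can approximate the Lua toggle by mixing the two comment styles. The sketch below is only an illustration with hypothetical function names: a closing `//*/` is harmless whether or not a block comment is open, so changing the opener from `/*` to `//*` re-enables the section.

```c
/* Each function returns 1 when its guarded section is active, 0 otherwise. */

int disabled_section(void)
{
    int active = 0;
    /*
    active = 1;
    //*/
    return active;   /* the assignment above sits inside a block comment */
}

int enabled_section(void)
{
    int active = 0;
    //*
    active = 1;
    //*/
    return active;   /* one extra slash turned the opener into a line comment */
}
```

The toggle works, but it still relies on the operator characters / and *, so it does nothing for the overloading complaint.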

Ideally I think we would be using something that allows for the abuse that Lua’s multi-line comments give us, without overloading any of the basic operators provided by the language. " perhaps?

" Comment

Multi-line comment

" "
" Active multi-line comment?
a //= 37;

Maybe, I’m not sure.

There are also languages that allow comments to effectively serve as documentation that is preserved (potentially) at runtime. For example, Python assumes unassigned strings immediately inside a class or function definition are to serve as a documentation block for that section. These documentation blocks can be accessed through runtime introspection, or used with appropriate tooling to generate documentation. This is a great approach from my experience, and perhaps -all- comments should be given the opportunity (optionally) to persist alongside the code for which they are intended.

If one is to take this approach, how does one reconcile comment specification with static string initialization, especially with regard to whitespace (esp. newline) handling? How does one specify to what block of code a comment refers when not at the beginning of a closure of some sort? Is there even any point in preserving these comments for any sort of runtime introspection, or should the goal be simply the production of documentation, à la Literate Programming?

Generalized string search improvement for needles with a small or numerically similar alphabet

I have been on a bit of a pointless optimization kick lately, and decided to see what I could do with string search. Most of the fast string search algorithms work on the principle of a sliding window for the purpose of skipping characters which don’t need to be checked. The best of these use a fair (fair being relative) amount of storage and extra cycles in the loop to make sure they are skipping as many characters as possible.

I am sure this has been done before, but I haven’t seen it. In the code below I have implemented an unintrusive extra level of skips to the well known Boyer-Moore-Horspool algorithm. Basically, each character of the key is AND’ed together and the result stored. If the result is zero, which happens if the alphabet is large/sparse enough, the extra checks are conditionalized away. In the event that the result is non-zero, we very quickly check for mismatches in the haystack by AND’ing the haystack character being checked against our previous result, and checking to see if the result of that is equal to our previous result. If the two results are equal, we have just checked a potentially matching character, and we need to fall back to our regular checking routine. In some cases we will match a non-matching character and our efforts will have been wasted, but in other cases we will have determined a non-match and be able to skip the full length of the needle in just a few instructions.

Original source for Boyer-Moore-Horspool lifted from Wikipedia.
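The listing itself is not reproduced here, so what follows is a minimal sketch rebuilt from the description above, not the original code. The function name `bmh_and_search` and its signature are assumptions. It is a plain Boyer-Moore-Horspool search with one extra test: if the last haystack byte in the window does not contain every bit of the AND mask, it cannot be any needle byte, and the window moves a full needle length without touching the compare loop or the skip table.

```c
#include <limits.h>
#include <stddef.h>

/* Returns a pointer to the first occurrence of needle in haystack, or NULL. */
const unsigned char *
bmh_and_search(const unsigned char *haystack, size_t hlen,
               const unsigned char *needle, size_t nlen)
{
    size_t skip[UCHAR_MAX + 1];
    size_t i, last;
    unsigned char mask = UCHAR_MAX;

    if (nlen == 0 || hlen < nlen)
        return NULL;
    last = nlen - 1;

    /* Standard Horspool bad-character table. */
    for (i = 0; i <= UCHAR_MAX; i++)
        skip[i] = nlen;
    for (i = 0; i < last; i++)
        skip[needle[i]] = last - i;

    /* AND every needle byte together; zero means the filter is useless. */
    for (i = 0; i < nlen; i++)
        mask &= needle[i];

    while (hlen >= nlen) {
        unsigned char c = haystack[last];

        /* Every needle byte has all mask bits set, so a byte missing
           any of them cannot occur in the needle: skip the full length. */
        if (mask != 0 && (c & mask) != mask) {
            haystack += nlen;
            hlen -= nlen;
            continue;
        }

        /* Potential match: fall back to the regular right-to-left compare. */
        for (i = last; haystack[i] == needle[i]; i--)
            if (i == 0)
                return haystack;

        hlen -= skip[c];
        haystack += skip[c];
    }
    return NULL;
}
```

The `mask != 0` test is loop-invariant, so a compiler may hoist it; a hand-written version could equally select between two loops up front to keep the inner loop free of the check when the mask is zero.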


Below are some quickie results from my X3220, compiled with GCC 4.2.1.

$ gcc -O2 boyer-moore-horspool.c -o boyer-moore-horspool
$ gcc -O2 boyer-moore-horspool-sjg.c -o boyer-moore-horspool-sjg
$ time ./boyer-moore-horspool
./boyer-moore-horspool 5.53s user 0.00s system 99% cpu 5.530 total
$ time ./boyer-moore-horspool-sjg
./boyer-moore-horspool-sjg 5.21s user 0.00s system 99% cpu 5.210 total

$ gcc -O3 -mtune=nocona boyer-moore-horspool.c -o boyer-moore-horspool
$ gcc -O3 -mtune=nocona boyer-moore-horspool-sjg.c -o boyer-moore-horspool-sjg
$ time ./boyer-moore-horspool
./boyer-moore-horspool 5.28s user 0.00s system 99% cpu 5.282 total
$ time ./boyer-moore-horspool-sjg
./boyer-moore-horspool-sjg 5.02s user 0.01s system 99% cpu 5.034 total

The, dare I say “elegant”, thing about this addition is that it could relatively easily be applied to many other string search algorithms and completely conditionalized away from the inner loop if the results are going to be ineffectual.

Virtual machine opcode dispatch experimentation

I was reading “The case for virtual register machines” recently and decided to do a bit of experimentation with different opcode dispatch methods. Apparently, up to 60% of the CPU time burned by common virtual machines is due to branch mispredicts. This is rather a silly problem to have in the context of opcode dispatch, considering the VM knows exactly where it will be branching to for each VM instruction. As a result, there is really no reason for the mispredicts apart from the fact that we can’t actually tell the CPU what we know. Since there is no useful mechanism of any sort (at least on all x86 CPUs that I know of) to say to the CPU, “branch at foo will go to bar” (short of JIT’ing everything, which can indirectly solve the branch mispredicts which happen at the opcode dispatch stage), the best we can really hope to do is attempt to seed the branch predictor with past branch information that will hopefully prove useful in the future. This proves to be somewhat problematic, as different CPUs have branch predictors implemented in different ways, with different capabilities and varying mispredict penalties. You also tend to burn cycles and space compared with more direct implementations; you just have to find the algorithm that lets you come out ahead due to increased prediction accuracy.

To really figure out what is going to work best for a full blown VM, I think you need to start at the beginning. The paper above referenced a couple of different ways that opcode dispatch is typically accomplished, but I wanted to write my own test cases and figure out exactly what would work the best, and more importantly, what definitely was not going to work, so that I could avoid wasting time on it in the future. Ruling things out matters more, simply because the faster-running algorithms will very likely be somewhat dependent on the number of opcodes a VM implements and the frequency with which it executes opcodes repeatedly or in the same order.

My preliminary test cases are on GitHub, and the runtime results with various compilers on various CPUs are in the RESULTS file.

As you can see, the “goto direct” version is the fastest in every case by a relatively healthy margin. To qualify these results I implemented the same method of dispatch as the goto direct case in the Lua VM. Much to my dismay it was consistently (~10%) slower than the switch-based dispatch that is standard in Lua. After quickly realizing it was purely a function of opcode count, 5 in my tests vs 38 in Lua, I modified my Lua patch to be more like the goto direct 2 example. Runtimes are not provided in the RESULTS file for this, but it was marginally slower than the goto direct case. After this, Lua was consistently faster (up to around 30%) on some of my test hardware, and marginally slower on others. Making minor changes to the breadth or depth of the nested if or switch statements expanded into each opcode had minor changes one way or the other on all processors tested. Typically, faster on my Xeon 3220 meant slower on my Athlon XP 2500+, and vice versa, but by differing magnitudes. The Xeon gets faster, faster than the Athlon slows.
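For readers who have not seen the two styles side by side, here is a minimal sketch of switch-based dispatch next to direct-threaded dispatch. It assumes that “goto direct” refers to GCC’s labels-as-values extension (computed goto, also supported by Clang); the opcode set and function names are made up for illustration and are not the test cases from the repository.

```c
enum { OP_INC, OP_DEC, OP_DOUBLE, OP_HALT };

/* Baseline: every instruction funnels through one shared indirect branch. */
long run_switch(const unsigned char *code)
{
    long acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_INC:    acc++;    break;
        case OP_DEC:    acc--;    break;
        case OP_DOUBLE: acc *= 2; break;
        case OP_HALT:   return acc;
        default:        return acc;   /* unknown opcode: stop */
        }
    }
}

/* Direct threading: each handler ends with its own indirect jump, so the
   branch predictor gets one history entry per opcode instead of one total.
   Requires GCC or Clang; opcodes are trusted to be in range. */
long run_goto(const unsigned char *code)
{
    static void *targets[] = { &&op_inc, &&op_dec, &&op_double, &&op_halt };
    long acc = 0;

#define DISPATCH() goto *targets[*code++]
    DISPATCH();
op_inc:    acc++;    DISPATCH();
op_dec:    acc--;    DISPATCH();
op_double: acc *= 2; DISPATCH();
op_halt:   return acc;
#undef DISPATCH
}
```

Both functions interpret the same bytecode; only the shape of the branch at the end of each handler differs, which is the variable being measured.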

The entry point to my post about this on the Lua list can be found here.

There is obviously performance to be had here, probably quite a bit of performance. My next bit of testing will focus on expanded (# of opcodes) versions of the faster test cases, with more realistic opcode distribution. In terms of algorithmic improvements, I am going to try grouping opcodes in various ways, adding the group identifier to the opcode itself, so that the dispatch data structures can nest like switch (group) { case n: switch (op) { } * n }. I am also going to play with the concept of simple opcode or group ordering rules. The compiler frontend of any VM follows some set of rules, intended or not, for generating the opcodes that the VM executes. Even with a VM implementation that does not enforce those rules, and allows opcode execution in any order, knowing the likely order will no doubt be useful for optimization.
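One way the nested switch (group) { switch (op) } idea might look in C is sketched below. The encoding (group in the high nibble, opcode in the low nibble) and every name here are assumptions chosen for illustration, not a settled design.

```c
/* Hypothetical encoding: high nibble = group, low nibble = opcode in group. */
#define MKOP(group, op) ((unsigned char)(((group) << 4) | (op)))
#define OP_GROUP(code)  ((code) >> 4)
#define OP_INDEX(code)  ((code) & 0x0f)

enum { GRP_ARITH, GRP_CONTROL };
enum { ARITH_INC, ARITH_DEC, ARITH_DOUBLE };
enum { CONTROL_NOP, CONTROL_HALT };

long run_grouped(const unsigned char *code)
{
    long acc = 0;
    for (;;) {
        unsigned char op = *code++;
        switch (OP_GROUP(op)) {          /* outer dispatch: few targets */
        case GRP_ARITH:
            switch (OP_INDEX(op)) {      /* inner dispatch: one per group */
            case ARITH_INC:    acc++;    break;
            case ARITH_DEC:    acc--;    break;
            case ARITH_DOUBLE: acc *= 2; break;
            default:           return acc;
            }
            break;
        case GRP_CONTROL:
            switch (OP_INDEX(op)) {
            case CONTROL_NOP:  break;
            case CONTROL_HALT: return acc;
            default:           return acc;
            }
            break;
        default:
            return acc;                  /* unknown group: stop */
        }
    }
}
```

Each inner switch gets its own branch site, so opcodes that tend to follow one another within a group share prediction history that is separate from the other groups.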

A requirement in my mind early on for the TROA VM was the easy evaluation of expressions on vectors or streams in the language to make extensive use of SIMD possible inside the VM. This concept is being weighted right up to the top of my list after having done this opcode dispatch testing. Even in the basic unoptimized case where your opcode operates on its vector/stream serially, there is still potential for double-digit overall program performance improvement due to the reduction in opcode dispatches.
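To make the dispatch-count argument concrete, here is a small sketch with invented opcodes (nothing here is taken from the TROA VM). The interpreter counts how many times it dispatches while adding one array into another, once with a per-element opcode and once with a single opcode that walks the whole vector serially.

```c
#include <stddef.h>

enum { VOP_ADD1, VOP_ADDV, VOP_HALT };

/* Adds b into a (n elements) as directed by code; returns the number of
   opcode dispatches performed, including the final halt. */
size_t run_counting(const unsigned char *code, int *a, const int *b, size_t n)
{
    size_t dispatches = 0, pos = 0, i;

    for (;;) {
        dispatches++;
        switch (*code++) {
        case VOP_ADD1:               /* scalar: one element per dispatch */
            if (pos < n) {
                a[pos] += b[pos];
                pos++;
            }
            break;
        case VOP_ADDV:               /* vector: all remaining elements */
            for (i = pos; i < n; i++)
                a[i] += b[i];
            pos = n;
            break;
        default:                     /* VOP_HALT or unknown opcode */
            return dispatches;
        }
    }
}
```

For four elements the scalar program costs five dispatches and the vector program two, and the gap grows linearly with the vector length even before any SIMD is applied to the inner loop.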

A Better IE 5.5 and 6 PNG Fix

I should have posted this here prior to now, but as you can probably tell, I don’t post to this blog very often. During the implementation of the new site we decided to use PNG alpha transparency to some extent. During the course of implementation, existing IE5.5/6 PNG hacks were deemed to be wholly inadequate for our needs. So, I took some time to reimplement the core PNG hack as an .htc (IE CSS Behavior) and layer in some additional hacks on top to support CSS repeat and positioning.

Get the code: [HERE]

Original post follows:

So there I was in the wake of an unexpected and tragic steamroller accident involving the entirety of the production design staff… “Wait, you mean I have to cut and implement all the new designs? Me? Well, ok, this shouldn’t be too hard. I’ll just slice each element out with alpha transparency preserved in PNG’s, then layer them using CSS just like they are in Photoshop.”

“Wait, wait. What do you mean that won’t work?”

Back in reality, the lack of true support for PNG alpha transparency in Internet Explorer 5.5 and 6 has been nipping at us and many others for at least 4 or 5 years now. Given that browsers with full support now hold a 70%+ market share, we decided that it was time to come up with a proper “fix” that would allow our alpha transparent PNG’s to degrade gracefully on now effectively “legacy” versions of IE.

Many web developers out there will be familiar with the prevalent “.htc” file behavior fix targeted at this problem. There are certainly other ways to approach a solution, but we tend to like this one for a number of reasons, perhaps the biggest being: It will invalidate otherwise valid CSS. This may seem a bit crazy, but a fix (hack) is a fix (hack) and as a boundary pushing web developer, one probably shouldn’t be left to forget that.

The IE behavior/.htc fix that has been around for a number of years has some pretty staggering limitations when used on anything resembling a complex layout – so staggering, in fact, that it’s easier to just say what it gets right: images in img tags, and non-repeating (non-tiled) background images aligned to the top left of their container. Perhaps this isn’t a problem if you design the page with this in mind, but it certainly won’t suffice in making a crazy-alpha-png’d-layout degrade gracefully on IE 5.5/6.

For our from-scratch implementation, we started with a page cut and structured as we desired and validated the presentation in IE7, Firefox2+ and Safari 3+, then implemented our own behavior/htc hack to correct all the regressions that we could find in earlier versions of IE. What we ended up with not only preserves the status quo in IE PNG hacks, it also brings to the table full support for the CSS properties background-position (for labeled, pixel and percentage offsets) and background-repeat (for values of repeat, no-repeat, repeat-x and repeat-y). The only thing we don’t do is support the use of these two properties together. We’ll leave that for someone else… or maybe a future weekend hacking session.

PowerDNS / PostgreSQL & Web Interfaces 2

After a bit of eat and drink, as well as a half hour of zOMG why is this not werking!?!?! (iptables), Supermaster/Superslave is operating famously. It seems to “just work”. No complaints thus far, which is, well… highly unusual for me to put it lightly.

Feb 03 05:35:55 Received NOTIFY for from for which we are not authoritative
Feb 03 05:35:55 Created new slave zone '' from supermaster, queued axfr
Feb 03 05:35:55 gpgsql Connection succesful
Feb 03 05:35:55 No serial for '' found - zone is missing?
Feb 03 05:35:55 AXFR started for '', transaction started
Feb 03 05:35:55 AXFR done for '', zone committed

PowerDNS / PostgreSQL & Web Interfaces

I have been looking at PowerDNS for a while now, and after regular confirmation that it is in fact performing extremely admirably over at DreamHost I decided that it was time to deploy it.

While PowerDNS is the least braindead DNS server I have ever come across, there were a couple of things that I was not 100% happy with, at least in terms of coupling it to a web frontend.

  • SOA records are stored space-delimited. This would hardly be a problem except that our serial is stored here. In its defense, PowerDNS has an alternate method of handling serials that is probably better in most circumstances. Even so, we would still have to break the record apart and put it back together again to edit the minimum (default in practice) TTL, etc.
  • Record types are stored textually. Even when implemented as an enumerated value this still violates DRY, as you must re-state these values in your frontend code.
  • Everything must be represented fully qualified. This = FAIL from a normalization perspective.
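To show what “break it apart and put it back together” means for the first point, here is a small sketch that splits a space-delimited SOA content string into its seven fields and reassembles it. The struct and function names are hypothetical, and the sample record in the usage note is made up.

```c
#include <stddef.h>
#include <stdio.h>

/* The seven space-delimited fields of an SOA content string. */
struct soa {
    char mname[256];          /* primary name server */
    char rname[256];          /* hostmaster address  */
    unsigned long serial, refresh, retry, expire, minimum;
};

/* Splits content into fields; returns 0 on success, -1 if malformed. */
int soa_parse(const char *content, struct soa *out)
{
    int n = sscanf(content, "%255s %255s %lu %lu %lu %lu %lu",
                   out->mname, out->rname, &out->serial, &out->refresh,
                   &out->retry, &out->expire, &out->minimum);
    return n == 7 ? 0 : -1;
}

/* Reassembles the fields; returns 0 on success, -1 if buf is too small. */
int soa_format(const struct soa *in, char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen, "%s %s %lu %lu %lu %lu %lu",
                     in->mname, in->rname, in->serial, in->refresh,
                     in->retry, in->expire, in->minimum);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Changing only the minimum TTL of a record such as "ns1.example.com hostmaster.example.com 2008020301 10800 3600 604800 3600" still takes a full parse, one field update, and a full rewrite of the string, which is the round trip the frontend would rather not own.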

Here I have come up with a somewhat optimal schema from the point of view of my web interface, and I have tied it to PowerDNS’s preferred table structure via domain logic. This could have been handled in other ways of course, but I tend to like this one for a number of reasons.

  • First, the alternative is to add custom queries to the PowerDNS configuration file to make it understand whatever schema we might have in place; PowerDNS actually makes this very easy.
  • Another alternative would be to use dynamic (normal) views.

On to the benefits, some being quite minor.

  • Querying against serialized views will have performance benefits versus the above two options; this of course has to be weighed against the cost of maintaining the views.
  • As mentioned, PowerDNS has two methods of handling serials. It can read the serial from the SOA record, which we are keeping up to date with our domain logic, or it can scan each record for you to find the most recently updated (if you maintain change_date). The former should logically be more performant, so we have implemented that option. This could have been handled either way in the domain logic, but most importantly we aren’t relying on our web frontend to keep our serials up to date.
  • Finally, and mainly for debuggability, data on masters and slaves “looks the same”.

To get you rolling, your PowerDNS configuration file need not be any more complicated than this:


I haven’t tried slaving yet, but I suspect it will work without a hitch. Will update here when I do and when this rolls out.


This week I decided to toy with ActionScript/Flash a bit (for the first time, really). I’m using the FlashDevelop IDE, so it’s all free goodness, no shelling out $500 to Adobe. Anyway, I wrote an MP3 player that is devoid of any sort of Flash user interface and is completely controllable through JavaScript. It’s a mere 162 lines of ActionScript and weighs in at 2071 bytes as an SWF. It supports a wide range of operations: load, play, pause, stop, setvolume, getvolume, ispaused, getpauseoffset, getcurrentfile, getduration, getposition, getbytesloaded, getbytestotal and getid3, as well as a number of asynchronous JavaScript callbacks (notifications) on various events: loadcomplete, playcomplete and id3found. You can see it in action with possibly the simplest UI possible here:

The real question that I am trying to answer for myself is, does eliminating the flash user interface somehow make it [flash] more palatable?

JavaScript/File-based HTTP request logging

I just had the thought that it should be pretty feasible (if not trivial) to tie JavaScript-based request logging (like Mint and Analytics) to traditional file-based request logging using cookies and/or headers and CustomLog in Apache or similar in other httpd’s.

The question being… Is this somehow useful? I think it potentially could be, I’m just not 100% on how as yet.