Exercise for the Reader

November 1, 2009

An Aside About Uniformity

Filed under: 3D Graphics — Seth Porter @ 4:51 pm

Before getting into a correct treatment of the problem, and deriving a correct way to handle these differing end segments (as part of the B-spline series of posts), I should explain what I mean by “uniform” and why I’m so hung up on it. (That is, other than the fact that the B_uniform function is an easily-understood mixing of three control points, while the recursive definition of the basis functions makes my head hurt.) In particular, why not just accept a solution along the lines of:

function evaluateBSplineWithConditional(u)
  v = u * (n-k+2)
  segment = floor(v)
  w = fractional_part(v)
  if(segment <= 1) then
    do weird thing(segment, w)
  else
    do uniform thing(segment, w)

(I may have got the indexing slightly wrong, but I think this works (and is a good illustration of why I was being so pedantic about u, v, and w above — it’s not complex math, but it’s easy to mess up the parameterization).) Also note that I’m only looking at the weirdness at the beginning of the curve; properly the conditional would need to check whether the segment was at either end of the curve. This is a simplified version to support the discussion, not an actual implementation.
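To make that parameterization concrete, here is a minimal Python sketch of the u → (segment, w) mapping. The segment count is passed in as a parameter standing in for n-k+2; the clamp at u = 1 is my own addition to keep the last point inside the final segment.

```python
import math

def param_to_segment(u, num_segments):
    """Map the global parameter u in [0, 1] to a (segment, w) pair,
    where segment indexes the curve segment and w is the local
    parameter within it. num_segments stands in for n - k + 2."""
    v = u * num_segments
    segment = math.floor(v)
    w = v - segment  # fractional part
    # Clamp the very end of the curve (u == 1) into the last segment.
    if segment == num_segments:
        segment, w = num_segments - 1, 1.0
    return segment, w
```

With 5 segments, for instance, u = 0.5 lands in segment 2 at w = 0.5.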

Endpoints are often exceptional cases; why not just define the evaluation with that exceptional case and move on? All the more so since my attempted trick of “pre-conditioning the data” (replacing the original sequence of control points with a sequence with repeats at the beginning and end) didn’t work? (An important background assumption to the rest of this discussion is that this function is going to be run many many times, since we need a lot of points on that curve to make it look like a smooth curve when we draw it on-screen, or when we send the points to a CNC router to cut metal, or whatever we’re doing with it. If we only needed to do it once, this would probably be the right solution.)

The Problem with Branching

The answer basically has to do with computer architecture and the way it has evolved over the past couple of decades. Once upon a time, when I was programming on my father’s hand-me-down Heathkit Z-80, this would have been a perfect solution — each time we needed to evaluate a B-spline, we’d simply decide if we were in the normal or weird section of the curve, and calculate accordingly. The problem is that as we poured billions of dollars into CPU fabrication techniques, we got more and more transistors, and needed some way to take advantage of them in order to make things run faster. One major solution is “pipelining”, which essentially can be described as doing many things at once. Rather than waiting for each step to complete before we start on the next step, the CPU will get a lot of things started at once, and pull them all together when needed. (Perhaps I’ve been watching too much Iron Chef, but this seems like partitioning work out to a sous chef.) In the extreme case of the P4 architecture, the pipeline was up to 20 stages long. A much more technical discussion can be found at http://arstechnica.com/old/content/2001/05/p4andg4e.ars but the basic takeaway is that for a modern CPU to run fast, it needs to be able to do many things at once.

A conditional (the “if / then” in the pseudo-code above) wreaks holy hell on this scheme. Before the CPU can get started on “do uniform thing” or “do weird thing”, it has to know which one it needs to do. This is one cause of a “pipeline stall”, since instead of doing 20 things at once, we have to wait for the conditional to be evaluated (through all 20 stages) before doing anything. This is absolutely brutal on performance.

Modern CPUs actually try to work around this problem as well, using “speculative execution” and “branch prediction” — basically, they keep track of what happened last time we ran this code, and assume that the same thing will happen this time. If they guess right, when the conditional is finally evaluated they can just act as if they knew it all along. If they guess wrong, they have to throw away all the speculative work and start over on the correct path. This works, but it’s hugely complex, and it makes performance less predictable and more dynamic (because performance becomes directly tied to the accuracy of the branch prediction). In our case, things would run fast through the first curve segment, when the “weird” path is consistently followed. This would be followed by a period of degraded performance when it kept guessing “weird” but the real answer was “uniform”. At some point, the branch prediction unit would be retrained to guess “uniform” instead, and performance would recover until we hit another “weird” segment. In short, speculative execution and branch prediction are nice things to have, but no substitute for “straight-line”, branch-free code.

This whole preference for a single codepath is even more dramatically true when we talk about “SIMD” (Single Instruction, Multiple Data) parallelism. This approach to performance showed up on CPUs in the MMX and SSE instruction sets, and is one of the ways to really increase performance on modern silicon. Essentially, rather than evaluating points on the spline one at a time, we could stack up a list of points to be evaluated side-by-side, then grind through them four (or however many) at a time. See http://arstechnica.com/old/content/2000/03/simd.ars for a more responsible description. The challenge here is that “Single Instruction” bit — you have to be doing the same thing to your four data points, not looking at each one individually and deciding whether to give it the “uniform” or “weird” treatment. This is true of the SIMD instructions in modern Intel and AMD CPUs, and even more so of “shader programs” running on graphics cards’ GPUs. In fact, early shader programs didn’t even have conditional instructions; everything had to be straight-line code. More recent cards allow you to branch on the GPU, but at a potentially fearsome cost to performance (see http://forums.nvidia.com/index.php?showtopic=107865 for instance).
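The “same instructions for every element” constraint can be sketched in plain Python by processing a fixed-width lane of values with identical straight-line arithmetic. The cubic polynomial here is a hypothetical stand-in for the real per-segment spline formula; the point is only the shape of the computation.

```python
def eval_segment_lane(ws):
    """Apply identical straight-line arithmetic to a whole lane of
    local parameters at once -- the shape SIMD hardware wants.
    The cubic is a hypothetical stand-in for the per-segment math."""
    # Every element gets exactly the same sequence of operations;
    # nothing here inspects an element and branches on it.
    return [((2.0 * w - 3.0) * w + 1.0) * w for w in ws]

# Four points "side by side", one instruction stream for all of them:
lane = [0.0, 0.25, 0.5, 0.75]
results = eval_segment_lane(lane)
```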

Separate Batches

So we’re all agreed that branching is Bad. What can we do?

One approach is to “hoist the conditional” out of the inner loop. This somewhat compromises the nice clean interface of the evaluateBSplineWithConditional function, but can often be a performance win. To describe this approach, we need to introduce another level of the program, namely the function which was calling the evaluateBSplineWithConditional function. Assume that we’re trying to fill an array with evenly-spaced points on the B-spline, in order to render them or drive a CNC router or whatever. Using the original definition we’d have something like

function naiveFillArrayWithSpline(array, numPoints)
  dU = 1 / (numPoints - 1)
  for(i = 0; i < numPoints; i++)
    u = dU * i
    array[i] = evaluateBSplineWithConditional(u)
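As a runnable Python sketch, with an arbitrary curve function standing in for evaluateBSplineWithConditional:

```python
def naive_fill_array_with_curve(curve, num_points):
    """Fill a list with num_points evenly spaced samples of any
    parametric curve taking u in [0, 1]. `curve` is a hypothetical
    stand-in for evaluateBSplineWithConditional -- any such
    function works here."""
    du = 1.0 / (num_points - 1)
    return [curve(du * i) for i in range(num_points)]

# Works for any parametric function, e.g. a simple parabola:
points = naive_fill_array_with_curve(lambda u: u * u, 5)
```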

The naiveFillArrayWithSpline function would really work on any parametric curve function that expects a u ranging from 0 to 1. Instead, we can give it some knowledge about the end-conditions of B-splines, to produce a version without any conditionals in the main loops:

function batchedFillArrayWithSpline(array, numPoints)
  numSegments = n-k+2
  pointsPerSegment = numPoints / numSegments
  dW = 1 / pointsPerSegment
  for(i = 0; i < pointsPerSegment; i++)
    w = dW * i
    array[i] = do weird thing(1, w)
  for(j = 1; j < numSegments - 1; j++)
    for(i = 0; i < pointsPerSegment; i++)
      w = dW * i
      array[j*pointsPerSegment + i] = do uniform thing(j, w)

(Again I’m ignoring the weirdness at the end of the curve, which would require another “do weird thing” loop. I’m also playing fast and loose with the array indexing, assuming that numSegments divides evenly into numPoints, and so forth — this is another strawman, in other words.)

This is a much uglier function than naiveFillArrayWithSpline. It also has a lot of specific knowledge about B-splines embedded in it; you couldn’t reuse this for, say, making an array filled with sin(u). However, it has the huge virtue that there are no conditionals embedded in the inner loops. In fact, there are no conditionals at all — by changing the problem from “evaluate a B-spline at a given value of u” to “fill an array with evenly-spaced values of a B-spline” we’ve been able to significantly improve the performance characteristics.
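To check that the hoisted version really does produce the same array, here is a runnable Python sketch. The weird_thing and uniform_thing functions are hypothetical stand-ins for the real spline formulas; the point is only that the batched version has no conditionals in its loops yet matches the branchy one.

```python
def weird_thing(segment, w):
    # Hypothetical stand-in for the end-segment formula.
    return 10.0 * segment + w

def uniform_thing(segment, w):
    # Hypothetical stand-in for the uniform-segment formula.
    return 100.0 * segment + w

def branchy_fill(num_segments, points_per_segment):
    """Reference version with the conditional in the inner loop."""
    dw = 1.0 / points_per_segment
    out = []
    for j in range(num_segments):
        for i in range(points_per_segment):
            if j == 0:
                out.append(weird_thing(j, dw * i))
            else:
                out.append(uniform_thing(j, dw * i))
    return out

def batched_fill(num_segments, points_per_segment):
    """Same output, conditional hoisted: one loop for the 'weird'
    first segment, one branch-free loop for the rest. (Like the
    text, this ignores the weirdness at the far end.)"""
    dw = 1.0 / points_per_segment
    out = []
    for i in range(points_per_segment):      # segment 0: weird
        out.append(weird_thing(0, dw * i))
    for j in range(1, num_segments):         # middle: uniform, no branches
        for i in range(points_per_segment):
            out.append(uniform_thing(j, dw * i))
    return out
```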

This version still has flaws, however. One is a software engineering issue: we might need both this version and the evaluateBSplineWithConditional function, if we need to be able to fill arrays efficiently but also need to be able to evaluate arbitrary points (perhaps searching for an intersection). Maintaining two functions with different structure that are supposed to produce the same result is a good recipe for maintenance pain. (In a prior job, we had exactly this situation — and in fact the single-value version did disagree with the bulk version, which was an ongoing source of confusion.)

There’s an argument to be made that software engineering pain is the price of doing business, and perhaps a different language or some code generation approach would let us keep a single version of the source code while still producing these two versions for actual production use. There are two technical flaws (or two aspects of the same flaw), though, that mean this version is sub-optimal even ignoring maintenance considerations.

From the point of view of running this code on a CPU, the issue is code size. Instead of a single, small codepath we now have a rather larger function to evaluate. The on-disk size of a program is pretty much irrelevant in the modern world, but the actual in-use code size still matters because of caching. The CPU has a limited (and often surprisingly small) instruction cache on-die; it’s vastly faster to execute code from that cache than it is to load code from main memory. While it’s unlikely that the method above would actually be larger than the instruction cache, it’s definitely larger than a uniform version, which means it would occupy more of the instruction cache. If, say, the actual rendering loop has to be pushed out of the cache in order to load our spline evaluation method, then we’re paying a definite performance cost for code size (although one that’s very difficult to rigorously detect and characterize).

In the case where this code would run on the GPU, the cost is somewhat easier to see. Rendering commands are transferred from the CPU to the graphics card in “batches”, where everything in a batch is using the same shader programs, the same constants, and so forth (only the vertex data changes between data points). The GPU-based analogue of the method above would be to render the spline in three batches: one for the first “weird” segment, one for the uniform segments, and one for the final “weird” segment. The number of batches is very often a limiting factor in naive rendering approaches; magnifying the batch count by a factor of 3 is a quick way to become batch-limited. In these circumstances it can even be possible for the performance wins of GPU-native evaluation to be completely lost in the face of the communications overhead.

Arithmetic “Conditionals”

If the cost of the computations is low enough, and the penalty for branching is high enough, there can be another answer. Instead of having the program explicitly branch, or separating the different conditions into different loops / different batches, we can simply do both branches. The obvious problem is that we’ll have two output values, but we need to produce exactly one (the right one). To solve this problem, we introduce a “condition variable” which is 1 in one conditional branch and 0 in the other. This is typically a “natural” operation because in C (and C-like languages) boolean variables are internally stored as an integer which is 1 for true and 0 for false. In this idiom, the core of the function becomes something like

function doBoth(segment, w)
  isWeird = (segment <= 1)
  isNormal = 1 - isWeird

  weirdResult = do weird thing(segment, w)
  normalResult = do uniform thing(segment, w)

  answer = weirdResult * isWeird + normalResult * isNormal
  return answer

This seems incredibly counter-intuitive as a speed-up, but it can actually be a win if the “weird” and “uniform” paths are both rather cheap to evaluate, and the cost of separate batches is high enough. A nice advantage of this approach is that we’re back to having a single method which can be used as the core of an array-filling loop and also used to evaluate “one-off” points.
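As a runnable sketch of the select-by-multiply idiom, again with hypothetical stand-ins for the two formulas:

```python
def weird_thing(segment, w):
    # Hypothetical stand-in for the end-segment formula.
    return 10.0 * segment + w

def uniform_thing(segment, w):
    # Hypothetical stand-in for the uniform-segment formula.
    return 100.0 * segment + w

def do_both(segment, w):
    """Evaluate both paths, then select with a 0/1 condition
    variable: a multiply-and-add instead of a control-flow branch."""
    is_weird = int(segment <= 1)   # 1 on the weird path, 0 otherwise
    is_normal = 1 - is_weird
    weird_result = weird_thing(segment, w)
    normal_result = uniform_thing(segment, w)
    return weird_result * is_weird + normal_result * is_normal
```

Both formulas run on every call; the condition variable just zeroes out the one we don’t want.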

In practical terms, I would guess that on the CPU, this approach is likely to be slower than the batchedFill method above, while on the graphics card it would be closer. (This reflects the fact that CPU instruction cache pressure is probably less critical than the communications overhead of sending batches to the GPU, combined with just how fast the GPU is at evaluating math — in the GPU case, the bar for “rather cheap to evaluate” is lower, and the cost of separate batches is higher). I haven’t done any benchmarking at all to test whether the B-spline functions actually fit these assumptions on the hardware I have available.

Tradeoffs and Real Hardware

The last paragraph hints at why all of these approaches are kind of unfortunate. It’s very possible (in fact, likely) for the “right” answer to vary with different graphics cards, different CPUs, and different connections between CPU and GPU. For instance, within a given generation of graphics cards, the cost of evaluating functions tends to drop (as the cards get faster), but the cost of submitting a batch is typically fixed (by the interface between CPU and graphics card, which experiences much slower evolution). However, for on-board graphics (where the GPU is on the motherboard or even integrated into the CPU itself) the cost of batch submission may be almost zero, while these kinds of GPUs are typically computationally weak (so the cost of evaluating the functions would be higher).

One very common solution to this sort of dilemma is to optimize for performance on your hardware, and figure it’s reasonably representative of your customers. If there’s any validity to that assumption, this may be a reasonable approach, but it’s fraught with peril; if nothing else, the optimal solution will change over time (with new generations of hardware), so your program is unlikely to age well. It can also be embarrassing, if customers upgrade their hardware and notice that your program doesn’t take any advantage of their expensive new graphics card.

To really do this kind of thing right, across heterogeneous hardware and with an eye toward future-proofing your program, you would need to “do it all”, with dynamic strategy selection. For each performance-critical routine, provide all three variants listed above, then benchmark on the customer hardware to figure out which is fastest on this particular computer. This approach actually can work, and is really used by some very performance-sensitive applications. (For instance, the ATLAS project http://math-atlas.sourceforge.net/ does this for certain key linear algebra computations. Likewise, the “md” RAID system on Linux boxes will test several versions of the XOR routine at runtime to determine which is fastest. In both of these cases, a very small but very performance-critical part of the code is being optimized.) This approach is very expensive in programmer and testing time, and easy to get wrong if you don’t have a wide range of hardware available to validate your performance assumptions.
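A toy Python version of this “benchmark at runtime, then commit” pattern, with two hypothetical implementations of the same reduction standing in for the real variants:

```python
import timeit

def sum_loop(values):
    # Candidate 1: explicit accumulation loop.
    total = 0.0
    for v in values:
        total += v
    return total

def sum_builtin(values):
    # Candidate 2: the built-in reduction.
    return float(sum(values))

def pick_fastest(candidates, sample, repeats=200):
    """Time each candidate on representative data and commit to the
    winner -- the ATLAS / Linux-md selection pattern in miniature."""
    timings = [(timeit.timeit(lambda f=f: f(sample), number=repeats), f)
               for f in candidates]
    return min(timings, key=lambda t: t[0])[1]

best = pick_fastest([sum_loop, sum_builtin], list(range(1000)))
```

The hard (and expensive) part in real life isn’t this selection loop; it’s writing, testing, and maintaining all the candidate implementations.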

Sometimes you can instead leverage the work of other projects. For example, many problems can be coerced into problems of linear algebra, and you can then take advantage of the optimally-tuned ATLAS functions to solve them. There is typically overhead in expressing a problem this way, but it can be a win in time-to-market and coding complexity. Unfortunately, this isn’t a mechanical solution, and it isn’t applicable in all cases. In the case of B-splines, it would be possible to express the “do weird thing” and “do uniform thing” sections in terms of matrix multiplication, but it would not be easy to capture the general “do this or do that” problem as I’ve set it up here.
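For the uniform part, the matrix form is quite concrete: one cubic uniform segment is a row of powers of w, times a constant basis matrix, times four control points. A Python sketch (with scalar control points for simplicity; each coordinate of a 2D or 3D point goes through the same formula):

```python
# Constant basis matrix for one uniform cubic B-spline segment;
# rows hold the coefficients of w^3, w^2, w, 1, with an overall
# factor of 1/6 applied below.
B = [[-1,  3, -3, 1],
     [ 3, -6,  3, 0],
     [-3,  0,  3, 0],
     [ 1,  4,  1, 0]]

def uniform_cubic_point(p0, p1, p2, p3, w):
    """Evaluate one uniform cubic segment at local parameter w via
    the matrix form [w^3  w^2  w  1] * (1/6)B * [p0 p1 p2 p3]^T."""
    powers = [w ** 3, w ** 2, w, 1.0]
    coeffs = [sum(powers[r] * B[r][c] for r in range(4)) / 6.0
              for c in range(4)]
    return coeffs[0] * p0 + coeffs[1] * p1 + coeffs[2] * p2 + coeffs[3] * p3
```

At w = 0 this reduces to (p0 + 4·p1 + p2)/6, and at w = 1 to (p1 + 4·p2 + p3)/6, which is what makes adjacent segments join smoothly.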

So the long and the short of it (after this very long digression) is that conditionals are a pain no matter how you slice it. Vastly better is to be able to precondition the data — that is, feed a fixed formula something clever to make it do what you want, rather than using two different formulas. (You can tell that I got it to work, or think I did, from my phrasing. If B-splines were inextricably different in their end segments, I’d be writing a very nice summation about how conditionals are hard but sometimes unavoidable, and the next section would be about how the really clever solution is to have several B-splines to draw, and draw all the middles of all the curves in one batch, and all the end-segments in another batch. Which is actually pretty clever, now that I think of it. But it’s definitely not the way this post is going.)

