Meaning… if you run it on an Atom, be ready to watch it break in half under the load!
Vega Strike has always been GPU-hungry, there’s no denying it. Ever since the 0.5.0 release, it has also been RAM-hungry (or, more specifically, ever since we had to retire soundserver). But, as those who tried may have noticed, and as those who follow the forums may have read, this description isn’t entirely accurate on the netbook front. There, rather than hungry, I’d call it starving.
Now, I’ve only toyed a little bit on an N5xx, so this is far from a thorough report. But, since the devblog has been rather quiet lately, I started writing.
Right before building for the N5xx, I thought: I need proper optimization settings. There’s no point in benchmarking the game if the build isn’t optimized for the (rather different) hardware. Luckily, gcc “recently” added a nifty “native” optimization option (-march=native) - it just detects the CPU being used, and optimizes for that.
Building on the dual-core HT netbook was great. That little machine managed to build in about half an hour, which is what it took to build VS on a P4 1.7GHz I had, or a P3 1GHz - they both took the same time. Given that the netbook consumes a whopping 8.5W, I was surprised. That is undoubtedly the “quad-core” effect - I had heard HT on these architectures worked a lot better than on newer ones, say Sandy Bridge, and I could go on at length about why - but suffice it to say: it’s true. “make -j4” really paid off here. Beautiful.
This little thingy I borrowed from my sister has an astonishing 2G of RAM on it. It’s the maximum the chip can handle, for those that don’t know, so you don’t get a bigger Atom. VS can run with 1G and some swap, so 2G was plenty. Still, I could feel the poor thingy ask for mercy. Atoms, small as they are, are 64-bit. That makes VS use some more RAM than it would on 32-bit, and on a 2G system with no video RAM whatsoever, it was pushing it.
I think the worst thing is that APUs (and other onboard GPUs too) have to share system RAM with the CPU - they have no dedicated RAM, and that’s a big disadvantage for OpenGL. OpenGL has to keep a copy of all textures in case they have to be swapped out of VRAM, and APUs, which have no RAM of their own but do reserve a chunk of system RAM, are no exception. So all textures use up memory twice - once in system RAM, and once in video RAM. On a 2G system, you feel it. A 1G system would be swapping and stuttering constantly (I’ve tried), and if you don’t decrease texture resolution, it would even crash.
Which brings me to the N5xx APU’s limit of 256MB of texture space. This would be plenty, but since Intel doesn’t support DXT compression (they market it as DXT de-compression, which means the driver expands the texture when loading it into RAM, a big lie that retains none of the benefits of DXT), those 256MB run out quickly.
Full-size planet textures, for instance, use up 160MB on their own. Add a few stations and ships, and texture swapping is pervasive. In fact, I’ve had it crash on me when textures were at full resolution - especially when looking at Earth, which has the biggest, bestest and meanest texture set - because if an object’s textures don’t fit in that limit, the driver will kill VS.
So… disabling faction textures and lowering texture resolution were a must. I must say I didn’t notice a performance impact - it was more about crashing than about performance.
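To get a feel for why the missing DXT support hurts so much, here is a back-of-the-envelope sketch of texture sizes. The 2048 x 2048 texture and the helper names are made up for illustration, not taken from the VS data set; the only hard facts are the standard DXT block sizes (8 bytes per 4x4 pixel block for DXT1, 16 bytes for DXT5) and the double bookkeeping described above.

```python
def mip_chain(width, height):
    """Yield (w, h) for every mip level, from the base level down to 1x1."""
    while True:
        yield width, height
        if width == 1 and height == 1:
            break
        width = max(1, width // 2)
        height = max(1, height // 2)


def uncompressed_bytes(width, height, bytes_per_pixel=4, mipmaps=True):
    """Size of a plain RGBA8 texture (4 bytes per pixel by default)."""
    levels = mip_chain(width, height) if mipmaps else [(width, height)]
    return sum(w * h * bytes_per_pixel for w, h in levels)


def dxt_bytes(width, height, block_bytes, mipmaps=True):
    """Size of a DXT texture: block_bytes per 4x4 block (8 = DXT1, 16 = DXT5)."""
    levels = mip_chain(width, height) if mipmaps else [(width, height)]
    return sum(((w + 3) // 4) * ((h + 3) // 4) * block_bytes for w, h in levels)


MIB = 1024 * 1024
SIZE = 2048  # hypothetical texture edge length, for illustration only

raw = uncompressed_bytes(SIZE, SIZE)
dxt1 = dxt_bytes(SIZE, SIZE, block_bytes=8)
dxt5 = dxt_bytes(SIZE, SIZE, block_bytes=16)

for name, size in (("expanded RGBA8", raw), ("DXT1", dxt1), ("DXT5", dxt5)):
    # Each texture is counted twice: the system RAM copy plus the "video" copy.
    print(f"{name:>14}: {size / MIB:6.2f} MiB per copy, "
          f"{2 * size / MIB:6.2f} MiB with both copies")
```

A driver that expands DXT at load time pays the first row instead of the second or third, which is 8 times the DXT1 size and 4 times the DXT5 size, so a texture budget fills up that much faster.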
Shaders and whatnot
Now, surprisingly, the on-chip GPU (APU for the knowledgeable folks) on those tiny processors is quite decent at per-pixel stuff. It’s not super fast (or fast, for that matter), but it’s surprisingly capable given the low power budget (TDP for the geeks). This means it can run all the shaders that can run on other Intel onboards, albeit slower.
A lot slower. At the rather humble resolution of 1024 x 600, I could get maybe 4 fps looking at Serenity. But, check this out, the pixel pipeline wasn’t the bottleneck! These Intel APUs have poor hardware vertex shaders. So much so, that for a while I thought they were running on the CPU!
Both shader units run at 200MHz, which is half the clock rate of the CPU. But, more importantly, I think full-precision floating point arithmetic must be working in scalar mode rather than vector mode (only 1 operation per cycle, instead of 4), because the difference in speed compared to the pixel shaders is astonishing.
Update: it seems that they do actually run on the CPU, according to Tech Report: “Integrated graphics processors typically lack dedicated vertex processing hardware, instead preferring to offload those calculations onto the CPU. As a unified architecture, the GMA X3000 is capable of performing vertex processing operations in its shader units, but it doesn’t do so with Intel’s current video drivers. Intel has a driver in the works that implements hardware vertex processing (which we saw in action at GDC), but it’s not yet ready for public consumption.” This means that, coupled with the rather slow CPU, vertex shading is severely underpowered. We can only hope driver improvements will reverse this situation.
This situation was unheard of before APUs showed up in netbooks. It was always the case that the pixel pipe would be the bottleneck, and if you wanted to accelerate stuff you pushed some calculations out of the pixel shaders and into the vertex shaders. With a few thousand vertices per object vs a few million pixels, it was a clear win. Not anymore - not on netbooks - the underpowered vertex shader can’t keep up with a tenth of the workload the pixel shader can handle, and all our shader optimization just doesn’t make sense anymore.
For instance, many shaders perform multiple passes, in order to avoid computing the expensive pixel shading on occluded pixels. Every extra pass pushes all the geometry through the vertex shader again, which is such a dumb thing to do when vertex shaders are the bottleneck that I’ve been considering adding a “Netbook” shader setting that will disable that. It could possibly multiply FPS 2x or perhaps even 3x. Some other “optimizations” need revisiting too, so this will take time.
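Here is a toy model of that trade-off, just to make the reasoning concrete. Everything in it is an assumption: the throughput figures are invented round numbers (not measurements from the N5xx or any other hardware), the function name is hypothetical, and the model simply adds vertex time and pixel time while ignoring the cost of the depth-only pixel work.

```python
def frame_time(vertices, pixels, overdraw, vertex_rate, pixel_rate, depth_prepass):
    """Estimate seconds per frame under a deliberately crude cost model.

    vertex_rate and pixel_rate are made-up throughputs (items per second).
    overdraw is the average number of times each screen pixel gets shaded
    when nothing culls occluded fragments beforehand.
    """
    if depth_prepass:
        vertex_work = 2 * vertices    # geometry goes through the vertex stage twice
        pixel_work = pixels           # only visible pixels get the expensive shading
    else:
        vertex_work = vertices        # geometry goes through once
        pixel_work = pixels * overdraw  # occluded pixels get shaded too
    return vertex_work / vertex_rate + pixel_work / pixel_rate


scene = dict(vertices=100_000, pixels=1024 * 600, overdraw=3)

# Illustrative, invented throughputs: one pixel-bound GPU, one vertex-starved one.
machines = {
    "pixel-bound": dict(vertex_rate=100e6, pixel_rate=50e6),
    "vertex-starved": dict(vertex_rate=1e6, pixel_rate=20e6),
}

for name, rates in machines.items():
    with_prepass = frame_time(**scene, **rates, depth_prepass=True)
    single_pass = frame_time(**scene, **rates, depth_prepass=False)
    winner = "multi-pass" if with_prepass < single_pass else "single pass"
    print(f"{name}: prepass {with_prepass * 1000:.1f} ms, "
          f"single {single_pass * 1000:.1f} ms -> {winner} wins")
```

With these invented numbers the extra pass pays off on the pixel-bound machine and loses on the vertex-starved one. How big the real gain would be on a netbook depends on actual overdraw and throughput, which this sketch does not know.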
And the bug in Vega Strike
As if this weren’t enough, bugs in VS contributed to the slowness. Especially a rogue high-quality shader that slipped into low-detail modes. Other Intel GMAs handled it pretty well, in fact, but not this APU. After fixing that, things improved a great deal. I still get unplayable FPS when running shaders, but I can see a small light of hope - if I can rebalance the shaders with the underpowered vertex pipeline in mind, it might just work.
Python is another resource hog for the Atom. This lowly processor isn’t up to the task of running VM bytecode, which is what Python runs on. Java and Dalvik, two other technologies that run VM bytecode, have a JIT - a module that spits out optimized machine code to replace the bytecode - which makes them fare a lot better on the Atom, as evidenced by the abundance of Atom hardware running Android.
Python has long wished for one, but never gotten it. So it still emulates the running program by reading the bytecode and performing the operations one by one, in a very Atom-unfriendly way.
The Atom’s simplistic architecture isn’t well suited to running this kind of generic, utterly suboptimal code. It shows. When Python scripts start running in VS, the stuttering is immediate. Spawning ships, a very Python-heavy part of VS, is the worst.
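To see what the interpreter actually has to chew through, the standard `dis` module can list the bytecode behind a function. The function below is a made-up stand-in, not something from the VS scripts, and the exact instruction names and counts vary between Python versions.

```python
import dis


def distance_squared(ax, ay, bx, by):
    """Hypothetical helper: three short lines of ordinary Python."""
    dx = ax - bx
    dy = ay - by
    return dx * dx + dy * dy


# Each entry is one bytecode instruction the interpreter loop must fetch,
# decode and dispatch at run time, every single time the function is called.
instructions = list(dis.get_instructions(distance_squared))

for ins in instructions:
    print(f"{ins.offset:4d} {ins.opname:<20} {ins.argrepr}")

print(len(instructions), "bytecode instructions for three lines of Python")
```

Every one of those instructions goes through the interpreter's dispatch loop on each call, which is the kind of branchy, indirect work a simple in-order core handles poorly.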
I don’t see a way to fix this, other than moving most of the Python universe simulation stuff to a thread. But VS is years from becoming thread-safe in that way, so… bad mojo. For now, all I can do is try to optimize the Python scripts. I can’t make Python fast on the Atom, but I can probably use better algorithms to do less work in Python. I’ve been slowly trawling through the Python code trying to find obvious places to optimize, but I haven’t been able to measure anything yet.
The conclusion is that I have to do more testing. There are the new N2800s, which run a lot faster and feature a tile-based GPU. I have absolutely no clue how this GPU will fare, but I can imagine it will pose its own challenges. Even on the N5xx I should be able to streamline shaders a bit and perhaps optimize Python scripts. Hope is not lost yet.
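For the measuring part, the standard library's `cProfile` and `pstats` modules are one way to find out where a script spends its time. This is only a sketch: `spawn_ships` is a hypothetical stand-in for a Python-heavy script, not the real VS code, and `profile_call` is a helper name I made up.

```python
import cProfile
import io
import pstats


def spawn_ships(count):
    """Hypothetical stand-in for a Python-heavy script; not actual VS code."""
    ships = []
    for i in range(count):
        ships.append({"id": i, "pos": (i * 0.5, i * 0.25, 0.0)})
    return ships


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call trees come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, out.getvalue()


ships, report = profile_call(spawn_ships, 10_000)
print(report)
```

Wrapping a single entry point like this keeps the profiler's overhead confined to the code under suspicion. The absolute timings would be inflated by the profiler itself, so they are mostly useful for comparing one version of a script against another.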