Hashing File Paths

July 12, 2013

At this point it’s pretty common knowledge that when you refer to a runtime resource in your game, you don’t do it by string. At least not directly. The string gets transformed into a hash key using some fancy hashing function that the systems guy read about on the internet, and the key is used to look up the resource in a resource database.

A lot has been written about what to do with the strings in the runtime. Do we use a macro and a tool to post-process our source files and convert to the hash value? Do we use a clever function that we’ve proven, by reading the disassembly, gets unrolled into a constant value? Well, plenty of other people have written stuff and have opinions on that. I’m not here to talk about this aspect.

What I don’t see discussed in game land is how to choose your hash function specifically for hashing asset names. It seems like a foregone conclusion. Lots of hash functions have been written by smart people who all have statistics to prove that their latest function-du-jour has better bit distribution, avalanche characteristics, bit independence, and fewer calories than the competition. Just pick the best one and get on with it, right?

Not so fast.

The string hashing functions that most of us consider are usually labeled as “general purpose” hashes. Minimization of “bad” characteristics and maximization of “good” characteristics is measured broadly over many sets of data. But the one set of data that none of these hashes has ever been tested against is the data for my game.

So that’s what we’ll do.

Many of the string hashing functions in use were put into practice in the web and database spheres, where text is put into a hash table for caching purposes and is essentially arbitrary. This is different from what we want a hashing function for. We will be hashing relative paths to files on disk. That means most of our source strings will be highly similar to one another. The strings themselves will also be relatively short: with MAX_PATH being only 260 characters on Windows, you can bet that most strings will be quite a bit shorter than that.

So how will we test? With this Python script, of course!

import os
import sys

def main(argv):
	collisionCount = 0

	if len(argv) == 2:
		fileCount = 0
		nameSet = {}

		if os.path.isdir(argv[1]):
			for top, dirs, files in os.walk(argv[1]):
				for file in files:
					# Strip the base directory prefix to get the path relative to the root.
					(head, key) = os.path.join(top, file).split(argv[1] + os.sep, 1)
					fileCount += 1

					hash = HASH_FUNC(key)

					if hash in nameSet:
						print('Collision: (0x%x) - %s -> %s' % (hash, key, nameSet[hash]))
						collisionCount += 1
					else:
						nameSet[hash] = key

			print('\nCounted %d file(s) with %d collision(s).' % (fileCount, collisionCount))

			return collisionCount
		else:
			print('%s: `%s` is not a valid directory.' % (os.path.basename(argv[0]), argv[1]))

	else:
		print('Usage: %s <base_dir>' % os.path.basename(argv[0]))

	return 1

if __name__ == '__main__':
	sys.exit(main(sys.argv))

The input is the root path; the script recursively walks it and hashes each file path relative to that root. Try it with C:\Windows (if you’re on a Windows machine, that is).
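Assuming you saved the script as hashtest.py (pick whatever name you like), the invocation is just:

python hashtest.py C:\Windows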

The gotcha here is that I haven’t provided an implementation of HASH_FUNC. That one’s up to you (though the K&R function at the end of this post will plug straight in).

Now, another thing that would be really helpful would be to test this on real game assets. Incidentally, I happened to install the GUTS tool for Torchlight II the other day. The editor unpacks all the Torchlight II data files into a directory so that everything and its grandmother can be modded. Hey! Real game data! Like 96,436 files’ worth! It’s hard to beat that.

So, statistics…

Hash Function                     Bit Width  Collisions
--------------------------------  ---------  ----------
Bernstein Hash                    32-bit     1
Bernstein Hash XOR Variant        32-bit     4
SDBM Hash (x65599)                32-bit     1
Fowler-Noll-Vo                    32-bit     2
Fowler-Noll-Vo Variant            32-bit     2
Fowler-Noll-Vo Variant Post-Mix   32-bit     2
Fowler-Noll-Vo Variant            64-bit     0
Fowler-Noll-Vo Variant 64-Folded  32-bit     1
Sedgewick Hash                    32-bit     1
Kernighan & Ritchie Hash          32-bit     0
Knuth Hash                        32-bit     220
ELF Hash                          32-bit     495

So, without listing the implementation of every single algorithm here (ask me if you want to know about one), it seems that hashing functions relying purely on multiply-add style operations have the fewest collisions: in this case, K&R, Sedgewick, and Bernstein-OLD. Hashes relying purely on shifts, i.e. Knuth and ELF, were pure garbage on this data set. I suspect the strings are just not long enough for the shift-based hashing functions. Fowler-Noll-Vo, in any form, seems really good. Of course the 64-bit variant causes no collisions with only 96.5k files; 64-bit anything would probably produce zero collisions on this data.

Note that a couple of other popular hashes (such as Murmur and Jenkins) were not included in my test, only because I never bothered to implement them for my test suite. I actually expected to use FNV-1a initially, as it is often suggested to be one of the better hashes in terms of overall quality, and the implementation is super simple. Turns out there might be no need.
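For reference, here is roughly what a 32-bit FNV-1a looks like on the runtime (C++) side. This is a minimal sketch built from the published offset basis and prime, not necessarily the exact variant I tested above:

#include <stdint.h>

// FNV-1a, 32-bit: XOR in each byte, then multiply by the FNV prime.
uint32_t HashFnv1a(const char *str)
{
	uint32_t hash = 2166136261u;	// FNV offset basis.

	for (; *str != '\0'; ++str)
	{
		hash ^= static_cast<unsigned char>(*str);
		hash *= 16777619u;	// FNV prime.
	}

	return hash;
}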

So, as a reward to the winner, I’ll post the code for the K&R hash that I used. Note that this is basically the same as Sedgewick and Bernstein-OLD, except the magic constants are different. Also, Bernstein uses an initial hash value other than 0, but incidentally, I changed mine to 0 (just for fun) and it still produced just 1 collision (albeit on a different file than it did with its original constant).

def HashKnR(string):
	"""Kernighan & Ritchie "The C Programming Language" hash."""
	hash = 0

	for c in string:
		hash = (hash * 131 + ord(c)) & 0xFFFFFFFF	# Mask to stay within 32 bits.

	return hash

So, the moral of the story is: test some hash functions. The case we care most about for assets is hash collisions, not bit distribution. But if you’re gonna use this post as a cheat-sheet, K&R performs surprisingly (unexpectedly!) well. And for the record, it produces 0 collisions on C:\Windows too!

Endian Swapping

July 9, 2013

Wow, so it’s been almost a year since I posted anything. I better remember how to use this blog thing. So for a simple ease-back-in sort of post, I want to talk about endian swapping.

Thanks to the hardware guys’ inability to agree on things, we software guys run the risk of having our data represented in memory in two different ways. This is not a big deal, unless you read binary data from disk (or the network) that was created on a platform that stores said data in the opposite byte order from the one you are reading it on.

Rather than going into detail on this, I’ll just link you to my favorite place on the internet: Wikipedia.

So, getting to the point. I’ve seen lots (and written my share) of code to swap bytes from one order to the other. It usually starts by figuring out what endianness the current platform is, and then swapping the bytes in each word if the data does not match.
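As for that first step, here is the usual run-time check, sketched out (in production you would more likely rely on compile-time platform defines, but it illustrates the idea):

#include <stdint.h>

// If the low byte of a 16-bit 1 sits at the lowest address, the host
// is little-endian.
bool IsLittleEndian(void)
{
	const uint16_t test = 1;
	return *reinterpret_cast<const uint8_t *>(&test) == 1;
}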

Often there are many functions, like so:

#include <stdint.h>

void Swap16(int16_t *pData)
{
	char *pBytes = reinterpret_cast<char *>(pData);

	char temp = pBytes[0];
	pBytes[0] = pBytes[1];
	pBytes[1] = temp;
}

void Swap32(int32_t *pData)
{
	char *pBytes = reinterpret_cast<char *>(pData);

	char temp = pBytes[0];
	pBytes[0] = pBytes[3];
	pBytes[3] = temp;

	temp = pBytes[1];
	pBytes[1] = pBytes[2];
	pBytes[2] = temp;
}

void Swap64(int64_t *pData)
{
	char *pBytes = reinterpret_cast<char *>(pData);

	char temp = pBytes[0];
	pBytes[0] = pBytes[7];
	pBytes[7] = temp;

	temp = pBytes[1];
	pBytes[1] = pBytes[6];
	pBytes[6] = temp;

	temp = pBytes[2];
	pBytes[2] = pBytes[5];
	pBytes[5] = temp;

	temp = pBytes[3];
	pBytes[3] = pBytes[4];
	pBytes[4] = temp;
}

Generally, you’ll have a function for each size of type that you can swap. Maybe you stick these in some sort of ByteSwapper class so that you can check the endianness in one place and skip the swap when it isn’t necessary, even though you still call these functions.
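Something like this, say (the class name and interface here are hypothetical, reusing the IsLittleEndian() check from above):

// Decides once whether swapping is needed; call sites don't have to care.
class ByteSwapper
{
public:
	explicit ByteSwapper(bool dataIsLittleEndian)
		: m_swapNeeded(dataIsLittleEndian != IsLittleEndian())
	{
	}

	void Swap(int32_t *pData) const
	{
		if (m_swapNeeded)
		{
			Swap32(pData);
		}
	}

private:
	bool m_swapNeeded;
};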

But the swapping pattern should be obvious from the code above — we simply swap the first and last bytes, then walk the start index up and the end index down until they meet in the middle. Instead of writing all these swap functions manually (and then having the compiler complain about signed/unsigned mismatches every now and then), it would be cool if we had some way to make the compiler do it for us.

Ah, I know! I know! A template!

template <typename Type_t>
void EndianSwap(Type_t *pData)
{
	char *pBytes = reinterpret_cast<char *>(pData);

	for (size_t head = 0, tail = sizeof(Type_t) - 1; head < sizeof(Type_t) / 2; ++head, --tail)
	{
		char temp = pBytes[head];
		pBytes[head] = pBytes[tail];
		pBytes[tail] = temp;
	}
}

And there you have it. Nothing revolutionary, I admit. But strangely, I’ve never actually seen this in production except when I had to write it myself. All indices and loop counts are constants based on the size of the type, so this should be an easy one for the compiler to unroll.
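As a quick usage sketch, suppose we read a header blob off disk that was written big-endian (the struct here is hypothetical):

#include <stdint.h>

// Hypothetical file header, stored big-endian on disk.
struct FileHeader
{
	uint32_t magic;
	uint32_t version;
	int64_t  timestamp;
};

void FixupHeader(FileHeader *pHeader)
{
	if (IsLittleEndian())	// File data is big-endian; swap on little-endian hosts.
	{
		EndianSwap(&pHeader->magic);
		EndianSwap(&pHeader->version);
		EndianSwap(&pHeader->timestamp);
	}
}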

Enjoy.

Weighing in on Ouya

August 28, 2012

There have obviously been a lot of opinions flying around the Internet regarding the latest home-game-console-to-be, the Ouya. Everyone from Penny Arcade to Forbes has taken notice of this little Kickstarter darling and attempted to point out why it will be the new Nintendo or end up the next Phantom. Since I’ve never been one to shy away from having an opinion, I’ll take a moment to share mine here. It is the opinion of a person who develops actual physical-disc-based games for current-generation consoles.

I want to start by saying that if you are a console developer, or want to be, you need to buy one of these things. Basically you’ll get to write code for a very relevant CPU architecture — the ARM. Not only this, but the Tegra 3 is a quad-core with SIMD extensions. This means that anything you learn about parallel programming on this little device is immediately relevant to other consoles, as well as PCs. You get all of this for $99. The same cost as a year of XNA Creator’s Club or PlayStation Mobile Developer, but you get it forever.

Now I want to address some of the comments I’ve heard regarding the Ouya. I was originally planning to write this as a sort-of Q&A where I would answer a criticism with my thoughts on the validity of said criticism. But, there’s really too much to say, so I’ll just post my own opinions interspersed with stuff I’ve heard said, and why I do or don’t agree.

I’ll start by addressing the idea that the Ouya will fail on the grounds of hardware. This includes the notion of how difficult it is to make a console, and that the Kickstarter videos demonstrated very little.

I think this argument is the result of uninformed paranoia. Really, everything about the Ouya is off-the-shelf except the case. I hear things like, “But, but Microsoft took longer to come out with the Xbox 360…”, and “Phones cost so much more than $99…”, but these arguments are specious.

Firstly, a console is not a phone. It doesn’t have a magical touch-screen that is supposed to survive being dropped in a rainy puddle. It doesn’t have a cellular transmitter/receiver, and it doesn’t have a battery made from rare earth metals. In fact, the Ouya is very little other than a Tegra 3 SOC.

Secondly, Microsoft designed their console from scratch, and as all the big companies do, they took input from big name developers on what the thing should do. Trust me when I say that anything that seems off-the-shelf about an Xbox has actually been redesigned at least slightly from the equivalent standard parts. DirectX 9? Not exactly. PowerPC + Altivec? Not quite. Dual layer DVD storage? If only. About the most customization the Ouya will get is when the Nvidia guys say, “Hey if you slap a heat-sink on this thing you can run it at a higher core frequency and ignore the energy management features.” And even that won’t be novel — Nvidia already has blades of these things for high-performance computing.

On top of this, you have the massive amount of design that went into security on the home console platforms. In addition to using the online service to combat piracy, the manufacturers spent massive amounts of money making these things just really damn hard to take apart. This is all time and cost that won’t be incurred for the Ouya, which, I’m guessing, will at best be almost as secure as a Wii.

Next is the idea that the Ouya doesn’t have a market. Or: do we really need another home console? This usually starts with comparisons of how much this thing will suck sitting next to your PS3.

Uh, yeah. It’s no PS3. It won’t compete with one. It certainly won’t compete with the next generation of consoles that will appear in a couple of years’ time. But that’s irrelevant. This device isn’t likely to attract teams who make games that require the highest-end hardware. Those games are expensive to make and therefore will always cost consumers more, and the target for this thing is cheap entertainment. But one thing that is clearly overlooked by the mainstream press, and perhaps even by Julie Uhrman herself, is the potential this thing has in emerging markets.

Games these days are always localized. You can’t earn back the costs, even at $60 a game, without going worldwide. But nobody ever asks for localizations into Russian, or Portuguese, or Hindi. Why? Because even though the BRIC countries are growing like mad, there is still really no market for $300 consoles and $60 games there. The Ouya has great potential to open that up further. Now, the astute amongst you will remember the Zeebo and that Eedoo thing. The Zeebo has some pretty strong support from real game publishers. The interesting comparison here, for me, is that the Ouya has a lower price point and is not tied to the cellular network for games distribution. This, in my opinion, makes the Ouya a direct competitor in the same markets.

As for China and its Eedoo thing: I think they just opened the market up there with the release of the CT510. Now, for those of you not in the know, home game consoles are banned in China. Eedoo gets around this by shipping not a controller, but a remote control and a motion sensor. Instead of selling a game console, they sell a living room device. Well, the reality is, the Ouya actually is a living room device. It clearly wouldn’t be all that hard to ship it with a motion-sensing remote control à la the Wiimote instead of a console-style controller. Sure, that changes the dynamic for developers, but it opens up a massive market. And at 1/6th the cost of the Chinese console, I think it would do quite well.

Now, what do I see as the dangers and downsides?

For one, the software platform. There has been no demonstration of this thing downloading and installing a game, and there is no off-the-shelf software platform for this. I’m not talking about the back-end service, which could easily be run off of Amazon EC2. I mean the app store mechanism, the support mechanism, and the anti-piracy mechanism. Android is not the most consistent or well-put-together in this regard. If this thing just turns into a little piracy box, then developers will drop it. If there were a top-notch security guy on board making claims aimed at allaying developers’ fears about this, I would certainly feel less concerned.

Additionally, I’ve heard some complaints that this device is not just a port box for Android phone games, and that redesigning for a controller and a TV will take some effort. My concern is that it isn’t exclusive enough. Now, I don’t mean to reek of snobbery, but part of the allure of the console experience is that there is a higher bar than in other gaming experiences. This is almost directly the result of the console manufacturers’ iron-fisted approach to vetting the games that end up on their systems. Look at all the complaints directed at the Wii as being a party-game shovelware experience. And that’s with all the barriers to entry in place. Now, I’ll be the first to admit that the quality bar in the console scene has really dropped over the years, but unless the Ouya people come out with some sort of category system or preferred tier where the games are properly tested, you can expect this experience to be overall worse than what’s available today for console gamers.

To be clear here, with respect to the Ouya, I’m not suggesting that individual developers have some sort of superficial standard applied to them to keep them from coming out with games on the console. I’m talking about some sort of exclusivity that makes designing for the system a unique experience. Much like the Wiimote made the Wii a unique experience before everybody had motion controllers.

TL;DR

Will it be a success? I think the Ouya will do just fine for itself. There must be enough to show behind closed doors to attract developers like Square, EA, and potentially Namco-Bandai. Will it remain a game console, and will big companies continue to gravitate towards it? That I don’t know. The Vevo deal certainly shows that the Ouya not only can, but will, do other things. If nothing else, it will make a very slick-looking little Google TV box, and a great little hobby device for programmers out there.

Pedantic Asserting

July 10, 2012

Asserts are a brilliant paradigm for catching errors at run-time in your code. Whereas a unit test is usually written to catch errors within the constraints of a function, asserts are put in place to catch usage errors. Let’s go over a quick example of what a unit test might look like; I’m using UnitTest++ syntax, because that’s the framework I’m familiar with.

TEST(CheckSqrtForNaN)
{
	float nan = sqrtf(-1.0f);
	CHECK(nan != nan);
}

This is a simple test that makes sure that the square-root of a negative number comes back as a NaN. This is a good test — there were actually platforms where sqrt(-1) would return 0. But say we decide that nobody should ever pass a negative value to sqrtf() to begin with. We can’t unit test that, so let’s assert it.

float MySqrt(float input)
{
	ASSERT(input >= 0.0f);
	return sqrtf(input);
}

Now we need to get rid of the previous unit test — taking the square root of a negative value is no longer within the constraints of the function. On the plus side, though, we don’t have to decide what to do about that pesky platform that never returned NaNs.

It turns out that there can be a lot of code you can write like this — simple functions where you would like to enforce run-time behavior. Let’s look at another one that has useful unit tests, but also should contain asserts.

float Lerp(float a, float b, float t)
{
	ASSERT(t >= 0.0f && t <= 1.0f);
	return a + (b - a) * t;
}

UNITTEST(CheckLerp)
{
	CHECK_EQUAL(Lerp(0.0f, 1.0f, 0.0f), 0.0f);
	CHECK_EQUAL(Lerp(0.0f, 1.0f, 0.5f), 0.5f);
	CHECK_EQUAL(Lerp(0.0f, 1.0f, 1.0f), 1.0f);
}

So if you have a Lerp() function, you should definitely assert the defined input range. You would, of course, also want to unit test it, just in case somebody changes it into Hermite spline interpolation when you’re not looking.

By now you should be getting the impression that if you are as anal-retentive as me about putting asserts everywhere, you might be slowing down your code unnecessarily. After all, even with optimizations, the assert inside the Lerp() function is still doing floating-point compares and branching, and the only way to get those asserts to compile out is to disable asserts for the entire program.

Local Asserts to the Rescue

Why don’t we just make another assert macro specifically for our math functions so that we can enable/disable them at the file level? If we start getting weird math-related bugs we can re-enable the asserts and see if any of them get hit as the bugs manifest.

#define ENABLE_MATH_ASSERTS		1

#if (ENABLE_MATH_ASSERTS)
	#define MATH_ASSERT(x)		ASSERT(x)
#else
	#define MATH_ASSERT(x)		NO_ASSERT(x)
#endif

float MySqrt(float input)
{
	MATH_ASSERT(input >= 0.0f);
	return sqrtf(input);
}

float Lerp(float a, float b, float t)
{
	MATH_ASSERT(t >= 0.0f && t <= 1.0f);
	return a + (b - a) * t;
}

#undef MATH_ASSERT

We can put these local asserts all over the place and turn them on and off selectively if we want to test for problems at run-time, and in final builds, where all asserts are disabled, we don’t need to do anything different.
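One assumption in the code above is a NO_ASSERT() macro that compiles the expression away. If your codebase doesn’t already have one, a minimal sketch would be (the sizeof trick keeps the expression type-checked without ever evaluating it, so disabled asserts can’t silently rot):

#define NO_ASSERT(x)	do { (void)sizeof(x); } while (0)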

Now something might be niggling at the back of your mind. If asserts are put in place to catch run-time errors, and we may or may not have them enabled, how exactly can we be sure that they even work?

Let’s unit test our asserts…

The first step in this whole process is to read Charles Nicholson’s blog post on asserts. You know nothing about asserts until you’ve completed that article.

Now the key takeaway here is setting up a customizable assert handler. Charles mentions in passing that this can be used for unit testing asserts, but doesn’t go into detail. If you’re using the UnitTest++ framework (which Charles also co-authored), there’s a handy macro for testing asserts. It relies on your assert handler throwing an exception. This is key — if you’re using another unit testing framework you can test your asserts in the exact same way.

int _CCALL ThrowDebugAssert(const char *expression, const char *path, int line, const char *reason, ...)
{
	throw UnitTest::AssertException(expression, path, line);

	return 0;	// Unreachable; just satisfies the handler's signature.
}

int MathTests::RunAllTests(void)
{
	Debug::SetAssertHandler(ThrowDebugAssert);

	int failures = UnitTest::RunAllTests();

	return failures;
}

And now you can test your asserts as follows…

TEST(CheckSqrtForNaN)
{
	CHECK_ASSERT(MySqrt(-1.0f));
}

UNITTEST(CheckLerp)
{
	CHECK_EQUAL(Lerp(0.0f, 1.0f, 0.0f), 0.0f);
	CHECK_EQUAL(Lerp(0.0f, 1.0f, 0.5f), 0.5f);
	CHECK_EQUAL(Lerp(0.0f, 1.0f, 1.0f), 1.0f);

	CHECK_ASSERT(Lerp(0.0f, 2.0f, -0.1f));
	CHECK_ASSERT(Lerp(0.0f, 2.0f, 1.1f));
}

Again, this is what you need for the UnitTest++ framework, but it should give you a clear idea of how to do it in any other framework that doesn’t have identical support for testing asserts.

But wait… we’re going in circles. If we disable the asserts locally, then they are disabled for the unit testing too, and the unit tests will fail wanting an assert. We need some way to let CHECK_ASSERT() pass if we’ve disabled asserts locally for a module. One option is to write a nifty new UnitTest++ macro and add a global variable to our tested code, like so.

// Somewhere in our library...
bool g_MATH_AssertEnabled = ENABLE_MATH_ASSERTS;

// In our test framework...
#define CHECK_TOGGLE_ASSERT(name, condition) \
	extern bool g_##name##_AssertEnabled; \
	if (g_##name##_AssertEnabled) CHECK_ASSERT(condition)

// In our unit tests...
UNITTEST(CheckLerp)
{
	// ...

	CHECK_TOGGLE_ASSERT(MATH, Lerp(0.0f, 2.0f, -0.1f));
	CHECK_TOGGLE_ASSERT(MATH, Lerp(0.0f, 2.0f, 1.1f));
}

Now we can call CHECK_TOGGLE_ASSERT() in our unit tests in place of CHECK_ASSERT() and it will do nothing if the asserts for a module have been compiled out locally.

Finally, we could beautify all of this stuff with some more macros to make setup of this procedure a little bit less obtuse.

	#define _BUILD_VAR_NAME(name)				g_##name##_AssertEnabled
	#define IMPLEMENT_UNITTEST_ASSERT(module)	bool _BUILD_VAR_NAME(module) = ENABLE_##module##_ASSERTS;

	#define CHECK_TOGGLE_ASSERT(name, condition)	\
		extern bool _BUILD_VAR_NAME(name);	\
		if (_BUILD_VAR_NAME(name)) CHECK_ASSERT(condition)

And, if you wanted, inside your unit tests you could wrap the toggle check like so:

	#define CHECK_MATH_ASSERT(condition)		CHECK_TOGGLE_ASSERT(MATH, condition)

Of course, the downside to this is that there is a lot more boilerplate that can’t be generated by preprocessor macros (like the macros themselves). There’s also the possibility that the global variables are not stripped from the executable in release builds (particularly likely if your code is built into a DLL). Most importantly, though, we’re not testing our asserts when they’re disabled — they can easily be forgotten about over the course of a project.

A better and simpler option is to just have your build machine make a build with full tests enabled. This is a clear win unless there is some dire reason you can’t do it. You get full test coverage, even of unit tests that have been disabled forever and forgotten about.

#if defined (UNITTEST_BUILD)
	#define ENABLE_MATH_ASSERTS		1	// Always 1...
#else
	#define ENABLE_MATH_ASSERTS		0	// Defined by user...
#endif

#if (ENABLE_MATH_ASSERTS)
	#define MATH_ASSERT(x)		ASSERT(x)
#else
	#define MATH_ASSERT(x)		NO_ASSERT(x)
#endif

Now there is even no real need for the CHECK_TOGGLE_ASSERT() macro, either.

Clearly you’ll want unit tests disabled in final code. If your build setup runs unit tests on final builds (which seems too anal even for me), you can of course macro around final, but you’ll need to keep the customized CHECK_ASSERT() macro.

#if !defined (_FINAL)
	#define CUSTOM_CHECK_ASSERT(condition)		CHECK_ASSERT(condition)
#else
	#define CUSTOM_CHECK_ASSERT(condition)		NO_ASSERT(condition)
#endif

To be clear, all this will do is cause your unit tests to NOT test the asserts, hence all passing. Now, before you start thinking that we’re back to where we were with the global variable option, consider for a minute that it’s really moot to test asserts in final builds anyhow. They are compiled out and will never be hit in the actual product. Asserts are needed while developing…

So anyway, this post is long and scatter-brained, but I think the idea is evident. Assert everywhere with local asserts, disable the ones you feel like disabling, unit test them, and have the build machine test them all. So go forth and assert pedantically.

Refill Coffee in Japan

June 19, 2012

I’m a coffee person. Coffee in Japan is delicious, but it’s expensive and the cups are tiny. When I need caffeine, I want to go somewhere with all-you-can-drink coffee.

Where in Japan can you get all-you-can-drink coffee? So far these are the only places I know of:

  1. McDonald’s
  2. Italian Tomato
  3. Denny’s

Are there any others? Please let me know.


