At work one of the devs was running into a weird problem. He could run a group of our unit tests on his dev box without any problem, but when the same tests ran on our build machine the test host process crashed with an unhandled exception. Thankfully we run all of our processes with an unhandled exception trap which generates a minidump before terminating, so we were able to determine that the failure was in the C runtime function _chkstk, called upon entry into a particular function that allocates a lot of stuff on the stack.
At first, I was thinking stack overflow, but there wasn’t anywhere near 1MB of shit on the stack, which is the default maximum stack size. I thought about stack corruption, but the C-runtime stack checking routines provide an explicit message when stack corruption is detected. We commented out the large stack variables in the function that was being called, and that made the crashes go away, so clearly it was something stack related, but what?
I started Googling about, and first ran into the Microsoft documentation for the _chkstk routine. There’s not much there:
Called by the compiler when you have more than one page of local variables in your function.
Remarks
_chkstk Routine is a helper routine for the C compiler. For x86 compilers, _chkstk Routine is called when the local variables exceed 4096 bytes; for x64 compilers it is 4K and 8K respectively.
Hmm, well, that explains why commenting out the large local variables made the problem go away; _chkstk isn’t even called without at least 4K on the stack. Still, that doesn’t explain why the crash is happening.
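To make that threshold concrete, here is a small sketch in C of the arithmetic behind the probe. This is not the real _chkstk (that is a hand-written CRT routine); the function names and the fixed 4096-byte page size are assumptions made up for illustration. The idea is that the stack grows one guard page at a time, so a frame bigger than a page has to touch each page in order on the way down.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for illustration; real code would query the system. */
#define PAGE_SIZE_BYTES 4096u

/* Hypothetical helper: does a frame of this size need a stack probe?
   Follows the rule quoted above for x86: more than one page of locals. */
int frame_needs_probe(size_t frame_bytes)
{
    return frame_bytes > PAGE_SIZE_BYTES;
}

/* Hypothetical helper: how many whole or partial pages a frame of this
   size covers, which is roughly how many pages a probe has to touch. */
size_t pages_spanned(size_t frame_bytes)
{
    assert(PAGE_SIZE_BYTES > 0);
    return (frame_bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}
```

So a function with a 64K local buffer would cover 16 pages, and any one of those touches can be the one that needs a fresh page committed.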
Then I ran across an article about debugging a stack overflow with WinDbg, the shitty not-at-all-intuitive debugger that ships with the Debugging Tools for Windows. That wasn’t interesting; what was interesting was the hypothetical stack overflow scenario they presented. It involved a mysterious crash in _chkstk!
In the article’s example, the problem wasn’t too much stuff on the stack, it was a system committed page count very close to the max (that is, the system commit limit, which is roughly physical memory plus page file space). You see, _chkstk grows the stack when needed by committing some of the pages previously reserved for the stack. If the system can’t commit any more pages, _chkstk fails. Interestingly enough, the committed page count on our build machine was very near the max, while the committed page count on the developer’s box was low, which explains why he couldn’t repro it on his box.
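If you want to see how close a machine is to that ceiling from code, something like the following sketch might do. It assumes the Win32 GlobalMemoryStatusEx call and its ullTotalPageFile / ullAvailPageFile fields are a good enough proxy for the commit limit and the remaining commit headroom; the helper names are hypothetical, and the Windows-only part is guarded so the arithmetic can be tried anywhere.

```c
#include <assert.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: percentage of the commit limit already in use,
   given the limit and the amount still available (both in bytes). */
unsigned commit_percent_used(uint64_t limit, uint64_t available)
{
    assert(limit > 0);
    if (available > limit)
        available = limit;  /* clamp odd inputs instead of underflowing */
    return (unsigned)((limit - available) * 100 / limit);
}

#ifdef _WIN32
/* Hypothetical wrapper: returns the percentage for this machine,
   or -1 if the query fails. */
int system_commit_percent_used(void)
{
    MEMORYSTATUSEX status;
    status.dwLength = sizeof(status);
    if (!GlobalMemoryStatusEx(&status))
        return -1;
    return (int)commit_percent_used(status.ullTotalPageFile,
                                    status.ullAvailPageFile);
}
#endif
```

A build machine reporting a number in the high 90s here would be a plausible candidate for the kind of crash described above, while a dev box in the 30s would not.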
The article offers little more than a shrug and a “shit happens” by way of explanation. It does suggest one workaround: increase the stack commit size (the portion of the reserved memory committed when a thread starts). That causes the necessary memory to be committed at the time a thread starts, so that under low memory conditions the thread won’t start at all, rather than crashing in some unpredictable spot when the stack grows too much.
As a result, we explicitly set the stack reserve and commit sizes to 1MB in the Linker | System property page for all of our executable projects. That will increase the quantity of committed memory at application and thread startup, but only by 1MB. In return, it will make low-memory conditions cause more obvious thread start errors, which to my mind is worth the up-front memory hit.
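For a single thread, roughly the same effect can be had without touching project settings. The sketch below is an illustration and not what we actually did: it assumes the Win32 CreateThread call, whose stack size argument is treated as the initial commit size by default, and the function and macro names are made up. The point is the failure mode: if the memory can’t be committed, CreateThread fails up front where you can check for it.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#include <stdio.h>
#endif

/* 1MB, matching the commit size described above. */
#define STACK_COMMIT_BYTES (1024u * 1024u)

/* Hypothetical helper: round a requested size up to whole pages,
   since stack commit sizes are handled in page-sized units. */
size_t round_up_to_page(size_t bytes, size_t page_bytes)
{
    assert(page_bytes > 0);
    return (bytes + page_bytes - 1) / page_bytes * page_bytes;
}

#ifdef _WIN32
static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;  /* placeholder body; real work would go here */
    return 0;
}

/* Hypothetical example: start a thread with 1MB of stack committed
   up front. Returns 1 on success, 0 if the thread could not start. */
int start_worker_with_committed_stack(void)
{
    HANDLE thread = CreateThread(NULL, STACK_COMMIT_BYTES, worker, NULL, 0, NULL);
    if (thread == NULL) {
        fprintf(stderr, "CreateThread failed: %lu\n", GetLastError());
        return 0;
    }
    WaitForSingleObject(thread, INFINITE);
    CloseHandle(thread);
    return 1;
}
#endif
```

The linker setting is still the better fit for our case, since it covers the main thread and every thread created with the default stack size in one place.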
Just when you think you have a pretty good handle on Windows development, something like this comes along to remind you how much you don’t know.