What sort of things can cause a whole system to appear to hang for 100s-1000s of milliseconds?

Question

I am working on a Windows game and while rendering, some computers will experience intermittent pauses ("hitches" for lack of a better term). When profiled they appear in seemingly random places in the code. Eventually I noticed that it wasn't just my process that was affected, but (seemingly) every process on the system. All of the threads in my application hitch at once. The CPU utilization drops during these hitches and it appears as if most processes make no progress.

This leads me to believe this may be an operating system or driver issue, but it only occurs while playing the game (and only on some systems). What sort of operations might the operating system be doing that would require the kernel to pause all user threads and block? Some kind of I/O? At first I thought of paging, but my impression is that paging would only affect a single process, no?

Some systems in use: Windows, DirectX (3D), NVIDIA cards (unknown if it replicates on ATI), using overlapped I/O for streaming.

One thing to look at may be DPCs (Deferred Procedure Calls: basically a network card or some other device deferring an interrupt and having the processing done later). See http://superuser.com/questions/202254/steps-to-troubleshooting-a-problem-with-high-dpc ; more specifically, it links a tool at http://www.thesycon.de/deu/latency_check.shtml — , Jan 14 '11 at 18:28
Even better, since you're a game developer there's an even better place for this: Game Dev — Ivo Flipse, Jan 14 '11 at 20:56
I'm not really sure this is better for this site instead of SO - most game developers are not strong WinAPI programmers, and there's nothing game-specific about the question. — , Jan 15 '11 at 00:54

score 3 · Answer 1 · edited May 26 '21 at 02:33

In my experience, these types of issues typically boil down to some type of resource exhaustion.

It's easy to speculate to the n-th degree about what it "could" be, but without data, these remain speculations.

Counters

To gather data that can solve the puzzle, on Windows you need to collect perfmon data. Some counters you should grab for all processes (if applicable) are:

**Processor**/All Counters/All instances
**Logical Disk**/All Counters/All instances
**Memory**/All Counters/All instances
**Network Interface**/All Counters/All instances
**Paging File**/All Counters/All instances
**Process**/All Counters/All instances
**Server**/All Counters/All instances
**Server Work Queues**/All Counters/All instances
**System**/All Counters/All instances

In my opinion, this is an exhaustive list of the counters in which you might find relevant data. There is a penalty for capturing all of it: it is a lot of data to log, so you may want to try a subset of the counters that you feel are most relevant to your situation.

Logging

When you run perfmon you want to select to create a new manually defined Data Collector Set for Performance Counters. There will be a screen that asks for the sample interval. You need to make sure that the sample interval is small enough to capture the problem but not so small that you overwhelm the system with data logging.

I would recommend setting the capture to start and stop manually, so that you can start the capture, reproduce the problem, then stop and analyze the logs.
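The same kind of manually started capture can also be set up from the command line with `logman`, the Windows tool for data collector sets. Below is a minimal sketch that only builds the command lines; the collector name, output path, and counter selection are placeholders for illustration, not something from the steps above. It assumes `logman`'s `-si` option takes the interval as `hh:mm:ss`.

```python
def build_logman_commands(name, counters, interval_seconds, output_path):
    """Return (create, start, stop) command lines for a counter collector.

    Nothing is executed here; the lists can be passed to subprocess.run
    on a Windows machine from an elevated prompt.
    """
    hours, rem = divmod(interval_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    interval = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
    create = [
        "logman", "create", "counter", name,
        "-si", interval,      # sample interval
        "-f", "csv",          # log format
        "-o", output_path,    # output file (logman adds the extension)
        "-c", *counters,      # counter paths to collect
    ]
    return create, ["logman", "start", name], ["logman", "stop", name]


if __name__ == "__main__":
    # Hypothetical subset of the counter objects listed earlier.
    counters = [r"\Processor(*)\*", r"\Memory\*", r"\Paging File(*)\*"]
    for cmd in build_logman_commands("hitch_capture", counters, 1,
                                     r"C:\PerfLogs\hitch_capture"):
        print(" ".join(cmd))
```

Run the start command, reproduce the hitch, then run the stop command and open the resulting log.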

Analyzing Data

The perfmon utility allows you to look at every counter individually. If you know what you're looking for, this works. If you're not familiar with this process or with which counters to look at, you might benefit from an automated analysis tool such as PAL (Performance Analysis of Logs). PAL is free and awesome. Essentially, it has a set of thresholds defined for each counter; it parses your log collection and produces an HTML report that shows you:
Warnings - Any counter that is close to a threshold
Critical - Any counter that has exceeded a threshold

This can be a simple way to start your analysis and narrow in on any items marked Critical.
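To make the threshold idea concrete, here is a rough sketch of that kind of scan over a perfmon CSV log. It is not how PAL itself is implemented, and the threshold values and the sample log are invented for illustration.

```python
import csv
import io

# Hypothetical thresholds for illustration only; PAL ships its own definitions.
THRESHOLDS = {
    r"\Memory\Available MBytes":
        {"warning": 500.0, "critical": 100.0, "low_is_bad": True},
    r"\Processor(_Total)\% Processor Time":
        {"warning": 80.0, "critical": 95.0, "low_is_bad": False},
}


def counter_name(column):
    """Strip the \\\\MACHINE prefix from a perfmon CSV column header."""
    if column.startswith("\\\\"):
        _, _, rest = column[2:].partition("\\")
        return "\\" + rest
    return column


def classify(value, rule):
    """Return 'Critical', 'Warning', or None for one sample."""
    if rule["low_is_bad"]:
        if value <= rule["critical"]:
            return "Critical"
        if value <= rule["warning"]:
            return "Warning"
    else:
        if value >= rule["critical"]:
            return "Critical"
        if value >= rule["warning"]:
            return "Warning"
    return None


def scan_log(csv_text, thresholds=THRESHOLDS):
    """Return (timestamp, counter, level, value) for every flagged sample."""
    reader = csv.reader(io.StringIO(csv_text))
    names = [counter_name(c) for c in next(reader)[1:]]
    findings = []
    for row in reader:
        for name, raw in zip(names, row[1:]):
            rule = thresholds.get(name)
            if rule is None or not raw.strip():
                continue  # untracked counter or missing sample
            level = classify(float(raw), rule)
            if level:
                findings.append((row[0], name, level, float(raw)))
    return findings


# Made-up three-sample log in the perfmon CSV layout.
SAMPLE = r'''"(PDH-CSV 4.0) (GMT Standard Time)(0)","\\PC\Memory\Available MBytes","\\PC\Processor(_Total)\% Processor Time"
"01/14/2011 18:00:01.000","2048","35.5"
"01/14/2011 18:00:02.000","450","82.0"
"01/14/2011 18:00:03.000","90","12.0"
'''

if __name__ == "__main__":
    for finding in scan_log(SAMPLE):
        print(finding)
```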

Best Guess/Speculation on Problem Statement

To add to the speculation about what it might be: it sounds like you may be under memory pressure. This means that free physical memory has been exhausted and the OS has to write memory contents to disk or read them back from disk.

The perfmon data that would validate the above scenario would show a steady rise in memory utilization followed by a sharp fall. Simultaneously, there would be a sharp rise in pagefile usage as well as local disk I/O. Again, I am just speculating without any hard data (which you need).
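As a rough sketch of what that signature looks like in the numbers, the function below walks three aligned series of samples and reports the indices where memory in use falls sharply while pagefile usage and disk activity both jump. The series, the 10% drop, and the 25% rise are arbitrary assumptions to tune against a real capture.

```python
def find_pressure_events(mem_used, pagefile_pct, disk_ops,
                         mem_drop=0.10, rise=0.25):
    """Return sample indices matching the memory-pressure signature.

    A sample matches when, compared with the previous sample, memory in use
    fell by more than `mem_drop` while pagefile usage and disk operations
    both grew by more than `rise` (both given as fractions).
    """
    events = []
    for i in range(1, len(mem_used)):
        dropped = mem_used[i] < mem_used[i - 1] * (1 - mem_drop)
        paged = pagefile_pct[i] > pagefile_pct[i - 1] * (1 + rise)
        busy = disk_ops[i] > disk_ops[i - 1] * (1 + rise)
        if dropped and paged and busy:
            events.append(i)
    return events


if __name__ == "__main__":
    # Invented samples: memory climbs, then drops as paging and disk I/O spike.
    mem = [60, 70, 80, 90, 70, 72]
    pagefile = [5, 5, 5, 6, 12, 12]
    disk = [20, 22, 21, 25, 300, 40]
    print(find_pressure_events(mem, pagefile, disk))
```

If the flagged samples line up with the times the game hitched, the memory-pressure theory gains weight; if they do not, look elsewhere.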

score 1 · Answer 2 · answered Jan 14 '11 at 22:09

If you managed to hang the entire computer, not just your own process or any processes you were actively messing with, then this means a bug in Windows or a driver, as it should be impossible for user-mode code to cause this kind of system-wide hang. That doesn't mean the bug isn't triggered by a bug in your own code, but your code might also be perfectly bug-free.

What you need to do is narrow down the issue: record the state of all threads at the time of a hitch. You should also verify that no external code is messing with your process. For example, there is a specific Logitech webcam whose drivers inject a DLL into every process, and it broke several games that I know of. This kind of thing can damage your system from the outside in unknowable ways.
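Before thread state can be recorded "at the time of a hitch", the hitch has to be timestamped. One simple approach, sketched here under the assumption that a lightweight heartbeat loop is acceptable, is to record a timestamp every millisecond or so and look for gaps far longer than the period. The function names and thresholds are illustrative only.

```python
import time


def find_gaps(timestamps, threshold_s):
    """Return (start_time, gap_length) for each gap longer than threshold_s."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > threshold_s:
            gaps.append((prev, cur - prev))
    return gaps


def heartbeat(duration_s, period_s=0.001):
    """Collect timestamps for duration_s seconds, sleeping period_s between them."""
    stamps = []
    end = time.perf_counter() + duration_s
    while time.perf_counter() < end:
        stamps.append(time.perf_counter())
        time.sleep(period_s)
    return stamps


if __name__ == "__main__":
    stamps = heartbeat(duration_s=2.0)
    # The sleep can overshoot by several milliseconds, so keep the
    # threshold well above the heartbeat period.
    for start, length in find_gaps(stamps, threshold_s=0.1):
        print(f"gap of {length * 1000:.0f} ms starting at t={start:.3f}")
```

Gaps that appear in a heartbeat running in a separate process at the same moments as in the game suggest the stall is system-wide rather than confined to the game.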

DeadMG

5,518
3
27
40

-1 for 1st Paragraph - There are many other explanations for this besides an actual bug in any component. The simple insertion of multiple filter drivers (anti-virus) can cause all sorts of symptoms like this. +1 for second paragraph. – Error 454 Jan 15 '11 at 21:17
@Error: How is that not the definition of a bug in the filter drivers? – DeadMG Jan 17 '11 at 01:40
Resource contention causes performance bottlenecks. Just because a filter driver is slow doesn't mean that it has a bug. It can be a victim as well. – Error 454 Jan 17 '11 at 02:15

score 1 · Answer 3 · answered Jan 19 '11 at 10:35

This sounds like you are crashing or stalling the graphics driver.

The best tools to narrow this down are PIX and the DirectX debug runtime.

You can switch the DirectX runtime into debug mode in the "DirectX Control Panel". This should give you a lot of log output about any issues in your driver calls or in the data you are providing. You should try to fix every error and warning the runtime logs, because how these problems are handled is entirely driver-dependent and varies between graphics cards.

If this does not show anything suspicious, you can try to get more information about what is happening with PIX. But I am pretty sure that the first step should give you the information you need.
