Thursday, June 10, 2010

Process "hangs" for 6-8 seconds

I'm running in to some weird issues with a custom caching system I've built. I have 5 computers running a test client, hitting my server (across WCF) with about 40 threads each. For the full test they make 1000 calls per thread. The call they are making isn't overly complex -- it grabs small amounts of data from about 8-10 large dictionaries then combines the data into result rows and returns a max of 30 rows. The complete round trip time of the call (when called singly) is about .3 seconds.

With 5 clients (200 threads total) hitting the server at once, the 8 CPUs sit at about 50% usage each. The response times have grown anywhere from 2-4 seconds for each client. Not really great, but I'll deal for now.

After about 30 seconds or so, all 8 CPUs drop to 0% and stay there for about 8-10 seconds. This obviously causes the response times to shoot way up (around 15 seconds or so; depends on how long it's hung). The CPUs then pick back up and everything resumes as usual. This doesn't just happen once near the beginning. It happens multiple times during the test with no certain frequency.

Something else that is puzzling is that after one of these big "hangs", the base line processor usage jumps to about 65%. When that happens, response times start jumping all over the place. I'll see some that are 2 seconds, others that are closer to 20 seconds!

After some time, the baseline usage will come back down closer to 40-50%, where response times level back out around 2-4 seconds. Then it all happens again.

I'm putting together some screen shots of processor usage and such. I will be posting those shortly. I'll be posting some graphs from perfmon, so if you have any advice on which counters would be most helpful to see here, please let me know.

Host machine: Windows Server 2008 SP1 x64; Xeon W5590 @ 3.33GHz; 24GB RAM; 2 SSDs in RAID 0;

Client machines: Windows Vista Business SP1; Core2 Duo E8500 @ 3.16GHz; 8GB RAM;

Update:
I took a process dump during one of the bigger hangs. Here's what the stacks look like.




Processor graphs:
1) The first graph is at the very beginning of testing, after the clients had been running for a few seconds. Around 50% average load per processor.
2) The second graph was a "minor" hiccup. When it happened all the client response times spiked just briefly to around 15 seconds.
3) The third graph was really bad. All the clients "hung", then reported response times of near 20 seconds. This happened for around 8-10 seconds.
4) After the terrible "hang" earlier, everything started responding again. You can definitely tell that the CPU usage ended up well above where it started. Response times around this time are all over the place, from 2 to 20 seconds.

No comments: