Thursday, June 10, 2010

Process "hangs" for 6-8 seconds

I'm running into some weird issues with a custom caching system I've built. I have 5 computers running a test client, each hitting my server (over WCF) with about 40 threads. For the full test, each thread makes 1000 calls. The call itself isn't overly complex -- it grabs small amounts of data from about 8-10 large dictionaries, combines the data into result rows, and returns a maximum of 30 rows. The complete round-trip time of the call (when made on its own) is about 0.3 seconds.
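
Roughly, each test client is doing something along these lines (a simplified sketch: CallServer is a hypothetical stand-in for the real WCF call, the constants match the test description above, and the timing/reporting plumbing is left out):

using System;
using System.Diagnostics;
using System.Threading;

static class LoadTest
{
    const int ThreadCount = 40;      // threads per client machine
    const int CallsPerThread = 1000; // calls per thread for the full test

    static void Main()
    {
        Thread[] threads = new Thread[ThreadCount];
        for (int i = 0; i < ThreadCount; i++)
        {
            threads[i] = new Thread(delegate()
            {
                for (int call = 0; call < CallsPerThread; call++)
                {
                    Stopwatch sw = Stopwatch.StartNew();
                    CallServer(); // stand-in for the actual WCF operation
                    sw.Stop();
                    // The real harness records sw.Elapsed for the response-time numbers.
                }
            });
            threads[i].Start();
        }

        foreach (Thread t in threads)
            t.Join();
    }

    static void CallServer()
    {
        // Hypothetical placeholder for the WCF call described above.
    }
}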

With 5 clients (200 threads total) hitting the server at once, the 8 CPUs sit at about 50% usage each. Response times grow to anywhere from 2-4 seconds per call. Not great, but I'll deal with that for now.

After about 30 seconds, all 8 CPUs drop to 0% and stay there for about 8-10 seconds. This obviously causes response times to shoot way up (to around 15 seconds, depending on how long the hang lasts). The CPUs then pick back up and everything resumes as usual. It doesn't just happen once near the beginning, either; it happens multiple times during the test with no obvious pattern.

Something else that's puzzling: after one of these big "hangs", the baseline processor usage jumps to about 65%. When that happens, response times start jumping all over the place. I'll see some at 2 seconds and others closer to 20 seconds!

After some time, the baseline usage will come back down closer to 40-50%, where response times level back out around 2-4 seconds. Then it all happens again.

I'm putting together some screenshots of processor usage and will post those shortly. I'll also be posting some graphs from perfmon, so if you have any advice on which counters would be most helpful to see here, please let me know.
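
For anyone who wants to sample the same counters programmatically rather than in perfmon, here's a minimal sketch (the categories are the standard Processor and .NET CLR Memory ones; "MyServerProcess" is a placeholder for the actual instance name):

using System;
using System.Diagnostics;
using System.Threading;

class CounterSampler
{
    static void Main()
    {
        // Total CPU across all processors, plus time spent in garbage collection
        // for the server process ("MyServerProcess" is a placeholder instance name).
        PerformanceCounter cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
        PerformanceCounter timeInGc = new PerformanceCounter(".NET CLR Memory", "% Time in GC", "MyServerProcess");

        while (true)
        {
            Console.WriteLine("CPU: {0,5:F1}%   % Time in GC: {1,5:F1}%", cpu.NextValue(), timeInGc.NextValue());
            Thread.Sleep(1000);
        }
    }
}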

Host machine: Windows Server 2008 SP1 x64; Xeon W5590 @ 3.33GHz; 24GB RAM; 2 SSDs in RAID 0;

Client machines: Windows Vista Business SP1; Core2 Duo E8500 @ 3.16GHz; 8GB RAM;

Update:
I took a process dump during one of the bigger hangs. Here's what the stacks look like:

[Screenshots of the thread stacks from the dump]

Processor graphs:
1) The first graph is at the very beginning of testing, after the clients had been running for a few seconds. Around 50% average load per processor.
2) The second graph shows a "minor" hiccup. When it happened, all of the client response times spiked briefly to around 15 seconds.
3) The third graph was the really bad one. All the clients "hung" and then reported response times of nearly 20 seconds. The hang lasted around 8-10 seconds.
4) After that terrible "hang", everything started responding again, but CPU usage ended up well above where it started. Response times around this point were all over the place, anywhere from 2 to 20 seconds.

"Processor affinity" when making WCF calls from multiple threads

I'm having a strange problem with my WCF client/server. From a "fresh" client, I fire off 100 calls from a single thread and my server processes all of the calls on a single processor. Great. From a fresh client again, I fire off 100 calls from 10 threads (10 each), and it still processes them all on one processor. However, if I start a fresh client, fire off 100 calls on one thread, and immediately follow that with 100 calls over 10 threads, I see all 8 processors handling calls.

Fresh client:
100 calls from 1 thread on client puts all calls on only 1 logical processor on server

Fresh client:
100 calls from 10 threads (10 each) on client puts all calls on only 1 logical processor on server

Fresh client:
100 calls from 1 thread on client immediately followed by
100 calls from 10 threads on client puts calls successfully on ALL 8 logical processors on server

So why do a fresh client's calls always get serviced by a single logical processor? And why does the second set of calls get serviced by all the logical processors?
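
To make the scenarios concrete, here's roughly what the test driver does (a sketch; the arguments passed to GetSomeData are placeholders, and each scenario is run from a freshly started client process):

using System.Threading;

static class AffinityTest
{
    // Each scenario below is run from a freshly started client process.
    static void Scenario1()
    {
        RunOnCurrentThread(100);              // server handles everything on 1 logical processor
    }

    static void Scenario2()
    {
        RunOnThreads(10, 10);                 // 100 calls over 10 threads: still only 1 logical processor
    }

    static void Scenario3()
    {
        RunOnCurrentThread(100);
        RunOnThreads(10, 10);                 // now all 8 logical processors handle calls
    }

    static void RunOnCurrentThread(int calls)
    {
        for (int i = 0; i < calls; i++)
        {
            // Placeholder arguments; the real test passes actual lookup values.
            MyAgentClient.Instance.GetSomeData("p1", "p2", "p3", default(SomeEnum), 0, 30);
        }
    }

    static void RunOnThreads(int threadCount, int callsPerThread)
    {
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            workers[i] = new Thread(delegate() { RunOnCurrentThread(callsPerThread); });
            workers[i].Start();
        }
        foreach (Thread t in workers)
            t.Join();
    }
}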

Sample code from client:
using System;
using System.ServiceModel;

public partial class MyAgentClient
{
    static System.ServiceModel.Channels.Binding _binding;
    static EndpointAddress _endpoint;
    static ChannelFactory<IMyAgent> _channelFactory;

    // A single channel, created once and shared by every thread in the client process.
    static IMyAgent _channel;

    static MyAgentClient _instance;
    public static MyAgentClient Instance
    {
        get
        {
            return _instance;
        }
    }

    static MyAgentClient()
    {
        _binding = WCFSettings.CurrentBinding;
        _endpoint = new EndpointAddress(WCFSettings.CurrentURI + "/MyAgent");
        _instance = new MyAgentClient();
        _channelFactory = new ChannelFactory<IMyAgent>(_binding, _endpoint);
        _channel = _channelFactory.CreateChannel();
    }

    public MyReturnData GetSomeData(string p1, string p2, string p3, SomeEnum p4, int p5, int p6)
    {
        try
        {
            return _channel.GetSomeData(p1, p2, p3, p4, p5, p6);
        }
        catch (Exception ex)
        {
            // Log the exception, then rethrow so every code path returns or throws.
            throw;
        }
    }
}
Sample code from server:
using System;
using System.ServiceModel;

[ServiceContract]
public interface IMyAgent
{
    [OperationContract]
    MyReturnData GetSomeData(string p1, string p2, string p3, SomeEnum p4, int p5, int p6);
}

[ServiceBehavior(ConcurrencyMode = ConcurrencyMode.Multiple, InstanceContextMode = InstanceContextMode.PerCall, UseSynchronizationContext = false)]
public class MyAgent : IMyAgent
{
    public MyReturnData GetSomeData(string p1, string p2, string p3, SomeEnum p4, int p5, int p6)
    {
        try
        {
            // Pass the operation parameters straight through to the cache lookup.
            return MyCode.StaticMethods.GetSomeData(p1, p2, p3, p4, p5, p6);
        }
        catch (Exception ex)
        {
            // Log the exception, then rethrow so every code path returns or throws.
            throw;
        }
    }
}
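
SomeEnum and MyReturnData aren't shown here; for the samples to hang together they just need to be serializable stand-ins along these lines (the members below are hypothetical):

using System.Collections.Generic;
using System.Runtime.Serialization;

// Hypothetical stand-ins for the real types used by GetSomeData.
[DataContract]
public enum SomeEnum
{
    [EnumMember]
    None = 0
}

[DataContract]
public class MyReturnData
{
    // A call returns at most 30 result rows.
    [DataMember]
    public List<string[]> Rows { get; set; }
}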
WCF binding:
public static CustomBinding NewTcpBinding
{
    get
    {
        TimeSpan oneMinute = new TimeSpan(0, 1, 0);
        TimeSpan tenMinutes = new TimeSpan(0, 10, 0);

        ReliableSessionBindingElement reliableSession = new ReliableSessionBindingElement();
        reliableSession.MaxPendingChannels = 512;
        reliableSession.Ordered = false;
        reliableSession.FlowControlEnabled = false;
        reliableSession.MaxTransferWindowSize = 4096;
        reliableSession.InactivityTimeout = tenMinutes;

        BinaryMessageEncodingBindingElement binaryEncoding = new BinaryMessageEncodingBindingElement();
        binaryEncoding.ReaderQuotas.MaxStringContentLength = 2147483646;
        binaryEncoding.ReaderQuotas.MaxArrayLength = 50000000;

        TcpTransportBindingElement tcpTransport = new TcpTransportBindingElement();
        tcpTransport.MaxReceivedMessageSize = 2147483646;
        tcpTransport.MaxBufferSize = 2147483646;
        tcpTransport.MaxBufferPoolSize = 524288;
        tcpTransport.TransferMode = TransferMode.Buffered;

        // Element order matters: reliable session on top, then message encoding, then the transport last.
        BindingElementCollection elements = new BindingElementCollection();
        elements.Add(reliableSession);
        elements.Add(binaryEncoding);
        elements.Add(tcpTransport);

        CustomBinding binding = new CustomBinding(elements)
        {
            Name = "TcpBinding",
            CloseTimeout = oneMinute,
            OpenTimeout = oneMinute,
            ReceiveTimeout = tenMinutes,
            SendTimeout = oneMinute
        };
        return binding;
    }
}
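
For completeness, hosting the service with that binding looks roughly like this (a sketch: the base address is a placeholder, and it assumes NewTcpBinding lives on the same WCFSettings class the client references):

using System;
using System.ServiceModel;

class ServerHost
{
    static void Main()
    {
        // Placeholder address; the real one comes from configuration (WCFSettings.CurrentURI).
        Uri baseAddress = new Uri("net.tcp://localhost:9000");

        ServiceHost host = new ServiceHost(typeof(MyAgent), baseAddress);
        host.AddServiceEndpoint(typeof(IMyAgent), WCFSettings.NewTcpBinding, "MyAgent");
        host.Open();

        Console.WriteLine("MyAgent service listening. Press Enter to stop.");
        Console.ReadLine();
        host.Close();
    }
}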