c# - WCF Reliable Sessions Fault when Server under heavy CPU load or Thread Pool Busy
There appears to be a design flaw in WCF Reliable Sessions that prevents the issue or acceptance of infrastructure keep-alive messages when the server is under high CPU load (80-100% range) or when there isn't an IO thread pool thread immediately available to handle the message. The symptoms manifest as apparently random channel aborts due to reliable session inactivity timeouts. It appears the abort logic runs at a higher priority or via a different mechanism, because the abort timer seems to fire even though the keep-alive timer can't run.
Digging through the reference source, it appears that ChannelReliableSession uses the InterruptableTimer class to handle the inactivityTimer. In response, it fires the PollingCallback, set by ReliableOutputSessionChannel, which creates an AckRequestedMessage and sends it to the remote endpoint. The inactivityTimer uses the WCF-internal IOThreadTimer/IOThreadScheduler to schedule itself. That depends on an available (non-busy) IO thread pool thread to service the timer. If CPU load is high, it appears the thread pool won't spawn a new thread. The result is that if several threads are executing (it appears to be 8 threads on a 4-core machine; with a 15-second inactivityTimeout, 7 abort and fail), no thread is available to send the keep-alive. If I modify the reliable session inactivity timeout on the client to be longer than the server's, under these conditions the server will still unilaterally abort the channel because it expected a message in the shorter time. So it appears the abort logic is either running at a higher priority or throws an exception in one of the executing threads (I'm not sure which); I expected the abort on the server to be delayed due to the high CPU load and the client's longer timeout to be hit, but that is not the case. If CPU load is lower, the exact same scenario works fine, even with concurrent calls that take 30-90 seconds to return.
It is irrelevant what the InstanceMode is, what the max concurrent connections, sessions, or instances are, or what any of the other timeout values are (other than that ReceiveTimeout must be greater than InactivityTimeout). This is entirely a design flaw of the WCF implementation; it should be using an isolated high-priority or realtime thread to service keep-alive messages so that spurious aborts are not generated.
The short version is: I can issue 1000 concurrent requests that take 60 seconds to complete with a 15-second reliable session inactivity timeout with no problems, as long as CPU load stays low. As CPU load gets heavy, calls will randomly begin aborting, including calls that aren't taking any CPU time or duplex sessions that are idling, waiting to be used. If incoming calls add to the CPU load, the service will enter a death spiral, as execution time is wasted on requests that are guaranteed to abort while other requests sit in the inbound queue. The service cannot return to a healthy state until all requests are stopped, all in-flight threads finish, and CPU load drops. This behavior appears to paradoxically make Reliable Sessions one of the least reliable communication mechanisms.
This same behavior applies to clients; in that case the WCF client may be at the mercy of other processes on the box that, under high CPU load, will cause it to randomly abort reliable sessions unless all operations take less than inactivityTimeout to complete. Even then, if you don't issue a new call, WCF may still fail to send the keep-alive and the channel may still fault.
Documenting the answer:
You can mitigate the issue if you use ThreadPool.SetMinThreads(X, Y), where Y is a number greater than the number of threads executing concurrent WCF requests. Then there may be a thread available++ to service the keep-alive and reliable sessions may not time out, but under sustained 100% CPU load this has limits as well. In my tests, I bumped the minimum IO threads from 2 to 20, then issued a large number of concurrent but do-nothing requests (they just sleep for 10 seconds). After that I re-ran my client with the CPU-wasting call and was able to execute 8 simultaneously. Restarting the service and executing the same client test failed due to the lazy initialization of the thread pool. Pushing further, I started hitting timeouts again at 14 simultaneous calls (10 calls aborted), so it may be that the scheduler is not getting enough CPU slices to execute properly. I suspect that if you grab the IO threads and increase their priority, you might be able to solve the problem.
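As a rough illustration of the SetMinThreads mitigation, the sketch below raises only the IO completion-port minimum (the Y argument) and leaves the worker-thread minimum (X) as it was. The class and method names are made up for this example, and the value 20 is just the figure from my test, not a recommendation.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of WCF or the .NET Framework.
static class ThreadPoolTuning
{
    // Raises the minimum number of IO completion-port threads to at least
    // minIoThreads, keeping the current worker-thread minimum unchanged.
    // Returns false if the runtime rejected the new minimum.
    public static bool RaiseMinIoThreads(int minIoThreads)
    {
        int workerThreads;
        int ioThreads;
        ThreadPool.GetMinThreads(out workerThreads, out ioThreads);

        if (ioThreads >= minIoThreads)
        {
            return true; // already at or above the requested minimum
        }

        return ThreadPool.SetMinThreads(workerThreads, minIoThreads);
    }
}
```

This would typically be called once at service startup, before the ServiceHost is opened, for example ThreadPoolTuning.RaiseMinIoThreads(20).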
++Because the pool uses lazy initialization, you must issue enough concurrent calls from the client(s) that take some time to complete but don't use CPU (e.g. Thread.Sleep(5000)) to force the pool to create the minimum number of threads without triggering the high-CPU-blocks-new-threads logic; otherwise the minimum number of threads won't be created and the problem still happens.
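A minimal sketch of that warm-up step on the client side might look like the following. It only handles the "issue N calls at the same time" part; the slowCall delegate stands in for whatever do-nothing service operation you expose (an operation that sleeps on the server is an assumption here, as are the class and method names).

```csharp
using System;
using System.Threading;

// Hypothetical helper for warming up the service's thread pool.
static class PoolWarmup
{
    // Runs slowCall on `concurrency` dedicated threads at once and blocks
    // until every call has returned. Dedicated threads are used so the
    // client's own thread pool does not throttle the concurrency.
    public static void IssueConcurrently(Action slowCall, int concurrency)
    {
        var threads = new Thread[concurrency];
        for (int i = 0; i < concurrency; i++)
        {
            threads[i] = new Thread(() => slowCall());
            threads[i].Start();
        }

        foreach (Thread t in threads)
        {
            t.Join();
        }
    }
}
```

Usage would be something like PoolWarmup.IssueConcurrently(() => CallSleepOperation(5000), 20), where CallSleepOperation is a placeholder for code that opens a channel and invokes a service operation that just sleeps.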
Another potential fix is to make InactivityTimeout a large value. That will alleviate the problem, but it introduces a new denial-of-service vulnerability, as well as unintended failures of clients to close the connection.
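For reference, setting the timeout in code could look roughly like this. NetTcpBinding is an assumption (the same ReliableSession settings exist on other bindings that support reliable sessions), the helper name is made up, and the one-minute margin on ReceiveTimeout is arbitrary; it is only there to keep ReceiveTimeout greater than InactivityTimeout as noted above.

```csharp
using System;
using System.ServiceModel;

// Hypothetical helper; the binding type and the margin are assumptions.
static class ReliableBindingFactory
{
    public static NetTcpBinding Create(TimeSpan inactivityTimeout)
    {
        var binding = new NetTcpBinding();

        // Turn on the reliable session and stretch its inactivity timeout.
        binding.ReliableSession.Enabled = true;
        binding.ReliableSession.InactivityTimeout = inactivityTimeout;

        // ReceiveTimeout must stay greater than InactivityTimeout.
        binding.ReceiveTimeout = inactivityTimeout + TimeSpan.FromMinutes(1);

        return binding;
    }
}
```

The same binding settings need to be applied on both the client and the service endpoint, and the trade-offs described above (denial of service, connections left open) still apply.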
Otherwise there does not appear to be a fix for this issue at this time; I would advise against using Reliable Sessions due to this flaw, which makes aborts random both in which connections are aborted and in the circumstances under which the aborts start to occur.