Hi David,
I'd recommend running a zmdiaglog during one of these slowdown or degradation events. As root:
# /opt/zimbra/libexec/zmdiaglog
Then open a support case and upload the resulting output via FTP. We should step back and look at the degradation event more broadly. The zmdiaglog will provide thread dumps, zmstats, log data, and other pertinent data. It is critical to run zmdiaglog during the event itself and before doing a "zmmailboxdctl restart" (if you are restarting mailboxd).
Are you still running ZCS 5.0.x?
There are some cache problems we've identified (with some significant help from a customer on this list - thank you) that are scheduled for the upcoming 5.0.24, and that we believe may be the cause of certain high CPU behavior that can cause degraded service:
Bug 47522 - Message cache size inconsistency
http://bugzilla.zimbra.com/show_bug.cgi?id=47522
FIXED: 5.0.24
Bug 46808 - Message cache pruning is too aggressive
http://bugzilla.zimbra.com/show_bug.cgi?id=46808
FIXED: 5.0.24
Sincerely, have a great weekend -
-thom
Hi Thom-
Thanks for such a thoughtful reply. I appreciate the care that is taken when considering changes like those we are talking about. The one thing I do need to point out is that, for us, the following is not true:
> As Xueshan and Zimbra Engineering have noted, we do not believe at this time that the BAD parse problem increases load significantly or has other significant performance or functional impact on the server other than increasing the mailbox.log size.
We see very significant performance degradation when a single client on gig-E gets into this tight loop with our server and spits tens of requests per second at it. Degradation to the point where other users complain about UI slowness in the web client among other things.
I don't know whether our setup is lower latency/less distributed than others or just poorly sized/tuned from our original planning with the sales engineer lo those many moons ago when we first put this into place. We do wind up with this manifesting as a more serious problem than log space usage. I wouldn't be advocating for more rapid response than just waiting for Apple to fix their client if we didn't find this bug to be a real hazard. But I'm more than willing to believe the problem for us is exacerbated by how our server is provisioned.
We'll be upgrading our original setup in the next few months using considerably beefier hardware. The quote is actually going to go in tomorrow or Monday. Perhaps that will provide enough horsepower that this turns into just noise when it happens; I was just hoping not to have to wait and see.
-- dNb
P.S. One other fleeting thought that came to mind, and that may not cause the Engineering team to gnash their teeth, is the idea of more judicious logging in cases like this (e.g. aggregating certain repeated messages, or limiting them to a certain rate). It is a bit late here so I can't really tell how good an idea that might be.
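
To make the aggregation idea a little more concrete, here is a rough sketch in Python. It is purely illustrative and is not how mailboxd logs anything today: the class name AggregatingLogger, the 60-second window, and the "imap-bad-parse" key are all made up for the example. The first message for a given key is written immediately; repeats inside the window are only counted, and the count is written as a single summary line once the window has passed and another message arrives.

```python
import time
from collections import defaultdict


class AggregatingLogger:
    """Hypothetical helper: write the first message for a key, then
    count repeats and report them as one summary line per window."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic, sink=print):
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the window can be tested
        self.sink = sink            # where log lines go (print by default)
        self._window_start = {}     # key -> time the current window opened
        self._suppressed = defaultdict(int)  # key -> repeats hidden so far

    def log(self, key, message):
        now = self.clock()
        start = self._window_start.get(key)
        if start is None or now - start >= self.window_seconds:
            # New window: report what was hidden, then log normally.
            self._flush(key)
            self._window_start[key] = now
            self.sink(message)
        else:
            # Same window: count the repeat instead of writing it.
            self._suppressed[key] += 1

    def _flush(self, key):
        count = self._suppressed.pop(key, 0)
        if count:
            self.sink(f"[{key}] suppressed {count} similar message(s)")


if __name__ == "__main__":
    logger = AggregatingLogger(window_seconds=60.0)
    for _ in range(1000):
        logger.log("imap-bad-parse", "BAD parse error from client")
```

With something along these lines, a client stuck in a tight loop would produce one log line plus a periodic count rather than tens of lines per second. One limitation of this sketch is that the summary only appears when a later message for the same key arrives, so a real implementation would probably also flush on a timer.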