Today I had a very strange issue that really pushed my heart to the limit. It happened, as it always does, at exactly the wrong time: a big weekend for the client, and the site went unresponsive out of nowhere. The issue is what the title says. After any reset, whether an iisreset, a deployment, or a config change, the site would hang for about 20 minutes, completely unresponsive for all .NET calls. HTML and txt files worked fine, but anything Sitecore or .NET was dead. Once the site came back, it was perfectly fine! Head. On. Desk.
Nothing in the logs helped at all. And while the site was locked up, the log files kept growing with the normal Sitecore stuff, like publishing and index updates. After pulling my hair out for 4 hours, I decided to break out the Windows debugging tools and take it into the deep, dark world of crash dumps.
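If you want to confirm the same symptom on your own site (static files answer, .NET requests do not), a rough Python sketch like the one below can do it. This is only an illustration, not something I ran during the outage: the function names are made up, and you would supply your own URLs, one for a static file and one for a page that goes through .NET.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Request url once. Returns (status, seconds); status is None if nothing answered in time."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, just not with a success code
    except OSError:
        status = None  # timeout, connection refused, DNS failure, ...
    return status, time.monotonic() - start


def classify(static_status, dynamic_status):
    """Turn the two probe results into a rough diagnosis."""
    if static_status == 200 and dynamic_status == 200:
        return "healthy"
    if static_status == 200:
        return "app-hung"  # IIS serves static files, the .NET side does not answer
    return "down"  # even static files fail, so look at IIS or the network first


def check_site(static_url, dynamic_url, timeout=10.0, opener=urllib.request.urlopen):
    """Probe a static file and a .NET page; return (diagnosis, seconds spent on the .NET page)."""
    static_status, _ = probe(static_url, timeout, opener)
    dynamic_status, seconds = probe(dynamic_url, timeout, opener)
    return classify(static_status, dynamic_status), seconds
```

Calling check_site in a loop right after a reset, with a txt file as the static URL and the home page as the dynamic one, would show how long the "app-hung" state lasts before the site turns "healthy".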
You have two choices when it comes to Win Debug:
The Swiss Army knife
The full toolkit for Windows debugging is WinDbg. You can download it here. It is the full set of deep tools to analyze your crash dumps. You can add your own symbols and run console commands to analyze your dump 10 ways to Sunday. This is REALLY good when you are dealing with unmanaged-code crashes. Unmanaged-code errors are issues like memory errors, disk access errors, and anything else in code that was not written in .NET, aka native C/C++ errors.
If you haven't seen WinDbg in action, you can see a video of it here. It is super powerful, but you really have to learn how to use it, and the middle of a critical outage is a tough time to learn. Face, meet fire hose. I always try the simpler option first to see if I can solve the issue without having to pull my notes out. So what is that simpler option?
One simple tool
The simpler tool to use is Debug Diagnostic Tool v2 Update 2. This tool is great because you can just consume a dump file and it will do all the work. Heyyyy, easy! I like easy. All you do is configure it to watch your app pool. Then when it crashes, it automatically saves the dumps to a folder.
Then you open the dump with the diag tool and let it analyze your dump. The hope here is that something in the trace will give you a hint as to what code is crashing. That could be reflection not finding a required DLL, a memory exception, a stack overflow, or just some code that is crashing hard.
For me, I found things I recognized in the stack trace. The timeline runs from the bottom up, just like the stack trace on an exception: the oldest call is at the bottom and the most recent one is at the top. You can follow the frames down until you find the root cause, or something you can work with.
From the trace I could see that Sitecore was getting item links from SQL and processing the XML for the links. At some point the XML serialization blew up and the app crashed over and over as Sitecore kept trying to initialize.
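To make the bottom-up reading concrete, here is a tiny Python sketch of how I think about it. The frame names are completely made up for illustration (they are not the frames from my dump), and the helper names are my own. The idea is just this: the list is in the order the report shows it, most recent call first, so reversing it gives the timeline, and scanning it from the top down finds the first frame from code you recognize.

```python
# Hypothetical frames in the order a dump report lists them: most recent call first.
FRAMES = [
    "Framework.Xml.Reader.Read",
    "Framework.Xml.Serializer.Deserialize",
    "MyCms.Links.LoadItemLinks",
    "MyCms.Startup.Initialize",
    "Host.Application.Start",
]


def call_order(frames_top_first):
    """Return the frames as a timeline: the oldest call first, the most recent last."""
    return list(reversed(frames_top_first))


def first_recognized(frames_top_first, prefixes):
    """Scan from the top of the trace down and return the first frame
    whose name starts with one of the given namespace prefixes, or None."""
    for frame in frames_top_first:
        if frame.startswith(tuple(prefixes)):
            return frame
    return None


if __name__ == "__main__":
    print(" -> ".join(call_order(FRAMES)))
    print("Start looking at:", first_recognized(FRAMES, ["MyCms."]))
```

With these made-up frames, the framework XML calls at the top are where things blow up, but the first frame I can actually do something about is the one loading item links.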
To fix this, my first thought was to run the "Database Cleanup" task and rebuild the links database for all three databases. And guess what? TADA!!!! The site was back. I would never have had a clue what to do if I hadn't gotten my hands on the stack trace of the crash.