The Dude’s Greatest Hits

As requested by several customers, I’m putting up a list of tools I find useful in troubleshooting, etc…

First up…

Data Gathering

I use the following tools to gather data from systems. Each has its own place; just as a mechanic has many tools, so does the average engineer.

PFE MPS Reports. I use these a lot to get a full snapshot of what a system looks like. They bundle all the data into a compressed cabinet file for easy transport and review.

The Windows System State Analyzer comes in handy for comparing cluster nodes or other systems that are supposed to be identical. (An updated version for Windows 7 is here: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=857)

Process Monitor is good for a verbose boot log of a system (amongst other things).

Process Explorer is a bit more verbose than Task Manager.

UserEnv Logging is handy for troubleshooting Group Policy application problems.
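
For the UserEnv logging above, here’s a minimal Python sketch, assuming a Windows 2000/XP/Server 2003-era machine where Group Policy still runs through Winlogon and UserEnv (on Windows Vista and later, Group Policy moved into its own service with its own logging). It composes the UserEnvDebugLevel value 0x10002 (verbose output written to %SystemRoot%\Debug\UserMode\Userenv.log) and writes it with Python’s standard winreg module. The constant names are my own labels, and the write only works on Windows with admin rights.

```python
import sys

# Flag bits for UserEnvDebugLevel (constant names are our own labels).
USERENV_VERBOSE = 0x00000002   # verbose detail
USERENV_LOGFILE = 0x00010000   # write to %SystemRoot%\Debug\UserMode\Userenv.log

WINLOGON_KEY = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon"
VALUE_NAME = "UserEnvDebugLevel"


def verbose_logfile_level() -> int:
    """Combine the verbose and log-file flags into one DWORD (0x10002)."""
    return USERENV_VERBOSE | USERENV_LOGFILE


def enable_userenv_logging() -> None:
    """Write UserEnvDebugLevel on the local machine (Windows only, needs admin)."""
    if sys.platform != "win32":
        raise OSError("UserEnv logging is a Windows registry setting")
    import winreg  # only available on Windows

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, WINLOGON_KEY, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD,
                          verbose_logfile_level())


if __name__ == "__main__":
    # Only prints the value; call enable_userenv_logging() deliberately.
    print(hex(verbose_logfile_level()))
```

Set the value back to 0 (or delete it) when you’re done, since verbose logging keeps growing the log file.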

Verbose (rather than standard) status messages for the startup, shutdown, logon and logoff transitions are something I recommend for all enterprise environments.

Microsoft Hyper-V VM State to Memory Dump Converter is quite handy at times.

Performance Issues

Performance Analysis of Logs (PAL), written by my buddy in Washington State, our very own Clint Huffman! PAL was recently named one of the top 15 open-source tools for Windows troubleshooting, and the honor is deserved. PAL will turn your perfmon files into works of art (if your idea of art is an HTML file with graphs, anyway). I can’t say enough about this tool; it’s the cornerstone of sane troubleshooting.

I use xperf, from the Windows Performance Toolkit in the Windows 7 SDK, a fair bit.

Debugging

DebugDiag 1.2 of course.

WinDBG from the Debugging Tools for Windows.

Network Issues

Network Monitor 3.4. I use this quite a bit, partly because it can read the ETL files produced by a netsh boot trace of the network stack.

I also use some Netmon Experts located on CodePlex here.

This one can be used to decrypt SSL traffic.

And I sometimes use this one as well.

Memory Issues

RAMMap can be used to see what is consuming physical RAM.

I use Poolmon to analyze paged and non-paged pool memory in kernel-mode address space.
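
When I’m hunting a pool leak with Poolmon, the useful signal is usually which tags grow between two snapshots taken some time apart. Here’s a Python sketch of that comparison. It assumes you’ve already turned each snapshot into a dictionary keyed by (tag, pool type) with the bytes in use; parsing Poolmon’s own output is left out, the "Nonp"/"Paged" labels are meant to mirror Poolmon’s Type column, and the sample numbers are made up.

```python
from typing import Dict, List, Tuple

# (tag, pool type) -> bytes currently allocated under that tag.
Snapshot = Dict[Tuple[str, str], int]


def top_growing_tags(before: Snapshot, after: Snapshot,
                     pool_type: str = "Nonp",
                     count: int = 5) -> List[Tuple[str, int]]:
    """Return up to `count` tags of one pool type whose byte usage grew most."""
    growth = []
    for (tag, ptype), bytes_after in after.items():
        if ptype != pool_type:
            continue
        delta = bytes_after - before.get((tag, ptype), 0)
        if delta > 0:  # shrinking or unchanged tags are not leak suspects
            growth.append((tag, delta))
    # Largest growth first; ties broken by tag name for stable output.
    growth.sort(key=lambda item: (-item[1], item[0]))
    return growth[:count]


if __name__ == "__main__":
    # Made-up snapshots; "Leak" is a hypothetical tag.
    before = {("Thre", "Nonp"): 1_200_000, ("MmSt", "Paged"): 4_000_000,
              ("Leak", "Nonp"): 10_000}
    after = {("Thre", "Nonp"): 1_250_000, ("MmSt", "Paged"): 4_500_000,
             ("Leak", "Nonp"): 900_000}
    print(top_growing_tags(before, after))
```

Once a tag stands out, the next step is mapping that tag back to the driver that owns it.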

VMMap can be used to see what is consuming a process’s virtual address space.

Security Tools

Microsoft Standalone System Sweeper Beta

Malicious Software Removal Tool

Microsoft Safety Scanner

Miscellaneous

I use FCINFO to analyze HBA issues.

I use ERR all the time to translate those pesky hex error codes.
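
ERR looks codes up by name, which this doesn’t replace, but it helps to know how an HRESULT is laid out so you can spot a plain Win32 error wrapped inside one. Here’s a Python sketch using the bit layout from winerror.h and Microsoft’s [MS-ERREF] spec: bit 31 is severity, bits 16-26 are the facility, bits 0-15 are the code. Only a handful of facility names are included, and the function names are my own.

```python
# A few facility numbers from winerror.h (not an exhaustive list).
FACILITY_NAMES = {0: "NULL", 1: "RPC", 2: "DISPATCH", 3: "STORAGE",
                  4: "ITF", 7: "WIN32", 8: "WINDOWS"}
FACILITY_WIN32 = 7


def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into severity, facility and code fields."""
    hr = value & 0xFFFFFFFF          # accept signed forms such as -2147024891
    facility = (hr >> 16) & 0x7FF    # bits 16-26
    return {
        "failure": bool(hr & 0x80000000),   # bit 31 is the severity bit
        "facility": facility,
        "facility_name": FACILITY_NAMES.get(facility, "unknown"),
        "code": hr & 0xFFFF,                # bits 0-15
    }


def hresult_from_win32(error: int) -> int:
    """Python version of the HRESULT_FROM_WIN32 macro, returned unsigned."""
    if error <= 0:
        return error & 0xFFFFFFFF    # the macro passes zero/negative values through
    return (error & 0xFFFF) | (FACILITY_WIN32 << 16) | 0x80000000


if __name__ == "__main__":
    print(decode_hresult(0x80070005))
    print(hex(hresult_from_win32(5)))
```

For example, 0x80070005 decodes to a failure in FACILITY_WIN32 with code 5, which is ERROR_ACCESS_DENIED.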

I use Disk2VHD to convert physical machines to Hyper-V VMs. It might also be a good tool for imaging a machine for legal discovery.

Mouse without Borders, a software KVM, is a cool Microsoft Garage project that bubbled up to the real world.

RDCMan (Remote Desktop Connection Manager) is another cool tool for managing lots of machines.

Microsoft Shared View is pretty cool too.

Finally, never configure GPOs without consulting:

http://gps.cloudapp.net/ or Microsoft Security Compliance Manager.

PTE depletion, handle leaks and You

Applies to: Windows 2000 Server/Advanced Server, Windows Server 2003 (32-bit), Exchange 2000/2003


PTEs 


OK, so one of the most overlooked resources when we run into performance and availability problems is free Page Table Entries (PTEs). What is a PTE? It’s an entry in a page table that maps one page of virtual memory to a page of physical memory; the system PTEs are the ones the kernel uses to map things such as driver images, kernel stacks and I/O buffers. Wikipedia has an awesome article with 8×10 color glossy photos, with circles and arrows and a paragraph on the back explaining what each one is, so I’ll point you there. Clint Huffman also has an excellent post on PTEs here that specifically talks about Windows.
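
To make "maps a page of virtual memory" a bit more concrete, here’s a Python sketch of how classic 32-bit x86 paging splits a virtual address: the top 10 bits pick a page directory entry, the next 10 bits pick a PTE in that page table, and the low 12 bits are the offset inside a 4 KB page. It assumes non-PAE paging; with PAE the split is 2/9/9/12 bits and each entry is 8 bytes instead of 4. The function name is hypothetical.

```python
def split_x86_address(va: int) -> tuple:
    """Split a 32-bit virtual address under classic (non-PAE) x86 paging."""
    if not 0 <= va <= 0xFFFFFFFF:
        raise ValueError("not a 32-bit address")
    pde_index = va >> 22              # top 10 bits: page directory entry
    pte_index = (va >> 12) & 0x3FF    # next 10 bits: PTE within that page table
    offset = va & 0xFFF               # low 12 bits: byte offset in the 4 KB page
    return pde_index, pte_index, offset


if __name__ == "__main__":
    pde, pte, off = split_x86_address(0x80123456)
    print(pde, hex(pte), hex(off))
```

Since each page table holds 1,024 PTEs of 4 KB each, one page table covers 4 MB of address space.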


So anyway, running out of free PTEs is bad, because it causes system hangs, sporadic lockups, general unresponsiveness, etc. In Exchange, these symptoms show up as generally slow performance or service unavailability.


You manage your available PTEs in Windows with switches in boot.ini and with the SystemPages registry value. Generally speaking, on a properly configured Exchange server you’ll see free PTE values somewhere between 8,000 and 16,000. A large number of free PTEs (50,000 or so) may be a hint that you’re not using the /3GB switch on your server. A value below that range generally means there is a problem.
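
Here’s a small Python sketch that applies those rules of thumb to a reading of the perfmon counter \Memory\Free System Page Table Entries, and converts the count into megabytes of mappable system address space. The thresholds are the ones from this post, not official limits, the 4 KB per PTE figure assumes x86, and the function names are hypothetical.

```python
PAGE_SIZE_KB = 4  # each system PTE maps one 4 KB page on x86


def free_pte_megabytes(free_ptes: int) -> float:
    """Convert a free system PTE count into MB of mappable address space."""
    return free_ptes * PAGE_SIZE_KB / 1024


def classify_free_ptes(free_ptes: int) -> str:
    """Rough reading of a free PTE count, using this post's rules of thumb."""
    if free_ptes < 8000:
        return "low"            # configuration issue or a leak
    if free_ptes <= 16000:
        return "typical"        # normal for a properly configured Exchange server
    if free_ptes >= 50000:
        return "very high"      # may mean /3GB isn't set
    return "above typical"


if __name__ == "__main__":
    for reading in (5000, 12000, 30000, 52000):
        print(reading, classify_free_ptes(reading),
              f"{free_pte_megabytes(reading):.1f} MB")
```

For example, 10,000 free PTEs is only about 39 MB of room to map drivers, stacks and I/O, which is why the number matters.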


This problem is either a configuration issue or, if the free PTE value keeps falling over time, a memory leak.
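
One way to tell "static low" from "falling" is to sample the free PTE counter at a fixed interval and look at the trend. Here’s a Python sketch that fits a least-squares slope to the readings, so one noisy sample doesn’t decide the answer the way a first-versus-last comparison would. The sample data, the function names and the default threshold of a 1-PTE drop per sample are all assumptions for illustration.

```python
from statistics import mean


def slope_per_sample(samples):
    """Least-squares slope of readings against sample index (PTEs per sample)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    x_bar = (n - 1) / 2
    y_bar = mean(samples)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(samples))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


def looks_like_pte_leak(samples, min_drop_per_sample=1.0):
    """True if free PTEs trend downward faster than the given threshold."""
    return slope_per_sample(samples) < -min_drop_per_sample


if __name__ == "__main__":
    falling = [12000, 11800, 11650, 11400, 11200]   # made-up readings
    steady = [9000, 9010, 8995, 9005]               # made-up readings
    print(slope_per_sample(falling), looks_like_pte_leak(falling))
    print(slope_per_sample(steady), looks_like_pte_leak(steady))
```

A steady low value points you at configuration; a steady downward slope points you at a leak.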


If you are dealing with a static low value, and you’ve examined all the configuration settings and they all seem fine but the value is still low (for example, the Exchange Best Practices Analyzer keeps flagging it), then add /basevideo to your boot.ini. The newer AGP/PCI-E video drivers consume a lot of PTEs, and who needs super-duper video card drivers on an Exchange box anyway?
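
If you’re scripting that change across several servers, here’s a Python sketch of the string handling: it appends a switch to one operating-system entry line from boot.ini, but only if that switch isn’t already present after the quoted description (boot.ini switches aren’t case-sensitive). The function name and sample entry are hypothetical, it doesn’t read or write the file, and you should back up boot.ini first since it’s a hidden, system, read-only file. On Windows Server 2003, bootcfg is the supported command-line route for the same change.

```python
def add_boot_switch(os_entry: str, switch: str) -> str:
    """Append a boot.ini switch to one OS entry line if it isn't there yet."""
    line = os_entry.rstrip()
    # Switches follow the quoted description; only check that part for duplicates.
    tail = line[line.rfind('"') + 1:]
    if switch.lower() in (token.lower() for token in tail.split()):
        return line
    return f"{line} {switch}"


if __name__ == "__main__":
    # Hypothetical entry from the [operating systems] section.
    entry = (r'multi(0)disk(0)rdisk(0)partition(1)\WINDOWS='
             r'"Windows Server 2003, Enterprise" /fastdetect /3GB')
    print(add_boot_switch(entry, "/basevideo"))
```

Running it twice on the same line leaves it unchanged, so it’s safe to re-run.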


If you are dealing with a leak, update the drivers for everything: NIC, HBA, video, SCSI controller, you name it. If you’ve done all that and still haven’t gotten the leak addressed, contact PSS to get one of us involved with your case.


Handles


Another resource people don’t usually pay much attention to is handle count. Excessive handle consumption can cause all kinds of non-paged pool problems, because many of the kernel objects those handles keep open are allocated from non-paged pool.


If you have the symptoms of a memory leak but don’t see what is causing it, check the handle counts in Task Manager: on the Processes tab, choose View > Select Columns and select Handles. Handle usage varies by application and by what it’s doing at the time, but if an application has 100,000 handles open and your machine’s performance isn’t the greatest, you may be dealing with a handle leak. If you are, your non-paged pool usage may also be high without poolmon showing anything eating it up. That’s because poolmon doesn’t always account for handles the way you’d expect: the memory consumed through a process’s handles doesn’t end up under a tag that points back to that process.
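
If you’re collecting the perfmon counter \Process(*)\Handle Count over time rather than eyeballing Task Manager, here’s a Python sketch that flags suspects: any process whose latest count is at or above the 100,000 mark mentioned above, or whose count rose on every sample across at least three samples. The "always rising" rule, the function name and the process names are my own assumptions for illustration, not a documented test.

```python
from typing import Dict, List


def suspect_handle_leaks(history: Dict[str, List[int]],
                         ceiling: int = 100_000) -> List[str]:
    """Flag processes over the handle ceiling or whose count never stops rising."""
    suspects = []
    for name, counts in history.items():
        if not counts:
            continue
        over_ceiling = counts[-1] >= ceiling
        always_rising = len(counts) >= 3 and all(
            later > earlier for earlier, later in zip(counts, counts[1:]))
        if over_ceiling or always_rising:
            suspects.append(name)
    return sorted(suspects)


if __name__ == "__main__":
    # Made-up samples for hypothetical processes.
    history = {
        "app_a.exe": [8000, 8100, 8050],           # fluctuates, stays modest
        "app_b.exe": [2000, 5000, 9000, 15000],    # climbs every sample
        "app_c.exe": [120000, 120000, 119500],     # over the ceiling
    }
    print(suspect_handle_leaks(history))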


If you have a process with a high handle count, contact the vendor.


Documents on PTEs:


The effects of 4GT tuning on system Page Table Entries


How to Configure the Paged Address Pool and System Page Table Entry Memory Areas


Documents on Handles:


Well, here you can see the impact of high handle count:


Microsoft KB