Version Store 624 events

Applies to Exchange 2000, Exchange 2003, Exchange 2007.


With Version Store 623 errors, the version store gets ‘clogged’, if you will, and fails to process transactions.


624 errors, on the other hand, are caused by a lack of available virtual memory on the server.  Sometimes this has no impact and the server corrects itself, but in a memory leak condition it can be a sign that your Exchange server is no longer accepting client connections and needs some assistance.


In the particular instance where I have seen this occur, the 624 event comes after a series of errors:


 


First we throw an MSExchangeDSAccess 2104 event.


Event ID     : 2104
Raw Event ID : 2104
Record Nr.   : 4802384
Category     : None
Source       : MSExchangeDSAccess
Type         : Error
Generated    : 9/7/2008 12:27:27 PM
Written      : 9/7/2008 12:27:27 PM
Machine      : JAHUMBALABAH
Message      : Process STORE.EXE (PID=636). All the DS Servers in domain are not responding.


Shortly thereafter you’ll see an MSExchangeDSAccess 2102.


Event ID     : 2102
Raw Event ID : 2102
Record Nr.   : 4802387
Category     : None
Source       : MSExchangeDSAccess
Type         : Error
Generated    : 9/7/2008 12:28:15 PM
Written      : 9/7/2008 12:28:15 PM
Machine      : JAHUMBALABAH
Message      : Process MAD.EXE (PID=2588). All Domain Controller Servers in use are not responding:


JAHUMBALABAH-DC


Then we will see an MSExchangeSA 9152.


Event ID     : 9152
Raw Event ID : 9152
Record Nr.   : 4802391
Category     : None
Source       : MSExchangeSA
Type         : Error
Generated    : 9/7/2008 12:31:15 PM
Written      : 9/7/2008 12:31:15 PM
Machine      : JAHUMBALABAH
Message      : Microsoft Exchange System Attendant reported an error ‘0x8007000e’ in its DS Monitoring thread.


This particular error code, 0x8007000e, is an out-of-memory error (E_OUTOFMEMORY).  Uh oh.


Then DSAccess has another problem: a 9154.


Event ID     : 9154
Raw Event ID : 9154
Record Nr.   : 4802392
Category     : None
Source       : MSExchangeSA
Type         : Error
Generated    : 9/7/2008 12:31:20 PM
Written      : 9/7/2008 12:31:20 PM
Machine      : JAHUMBALABAH
Message      : DSACCESS returned an error ‘0x80004005’ on DS notification. Microsoft Exchange System Attendant will re-set DS notification later.


0x80004005 is a generic failure code (E_FAIL); here it means a call failed due to lack of memory.


Then the error you’ve all been waiting for: a 624 gets thrown by ESE.
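The error values in the 9152 and 9154 events are standard 32-bit HRESULT codes, and pulling the fields apart shows why the first one reads as out of memory. The sketch below is only an illustration; the function names `decode_hresult` and `is_out_of_memory` are mine, not part of any Exchange or Windows tool.

```python
def decode_hresult(value):
    """Split a 32-bit HRESULT into its failure bit, facility, and code."""
    value &= 0xFFFFFFFF
    return {
        "failed": bool(value >> 31),        # top bit set means failure
        "facility": (value >> 16) & 0x7FF,  # 11-bit facility field
        "code": value & 0xFFFF,             # low 16 bits
    }

FACILITY_WIN32 = 7       # the code field carries a Win32 error number
ERROR_OUTOFMEMORY = 14   # Win32 out-of-memory error number

def is_out_of_memory(value):
    """True when the HRESULT wraps the Win32 out-of-memory error."""
    fields = decode_hresult(value)
    return (fields["failed"]
            and fields["facility"] == FACILITY_WIN32
            and fields["code"] == ERROR_OUTOFMEMORY)

print(is_out_of_memory(0x8007000E))  # value from the 9152 event: True
print(is_out_of_memory(0x80004005))  # value from the 9154 event: False
```

The second value, 0x80004005, is the generic E_FAIL code, so it says nothing about memory on its own; it is the 9152 right before it that points at the cause.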


Event ID     : 624
Raw Event ID : 624
Record Nr.   : 4802473
Category     : None
Source       : ESE
Type         : Error
Generated    : 9/7/2008 12:32:58 PM
Written      : 9/7/2008 12:32:58 PM
Machine      : JAHUMBALABAH
Message      : Information Store (636) Storage Group 1 (First Storage Group): The version store for this instance (1) cannot grow because it is receiving Out-Of-Memory errors from the OS. It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back.


Current version store size for this instance: 1Mb


Maximum version store size for this instance: 249Mb


Global memory pre-reserved for all version stores: 1Mb


Possible long-running transaction:


   SessionId: 0xBD345AC0


   Session-context: 0x00000000


   Session-context ThreadId: 0x000015AC


   Cleanup: 1


 


So what can cause this?  Check Task Manager.  Do you see any handle leaks or processes with out of control handles?  In the instance I saw for this, it was a mixture of stale messages stuck in the SMTP temp tables and a third-party AV scanner that had an apparent memory leak.  Both Inetinfo and Store were over 2 GB and had 32k handles each.  Once we resolved the issue, Store was around 6k handles and Inetinfo around 3k.


What is happening is that a memory leak is consuming all the virtual memory space in Store and Inetinfo, at least in our case here.  Yours may differ in what is causing the leak, but I’d bet that more than likely it’s going to be something that ties into Store, such as Anti-Virus, something gumming up IIS and then Epoxy, or something along those lines.


Because you run out of memory, DSAccess starts to fail, then you see the string of errors above.
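If you export your application log as text in the same "Key : Value" layout as the dumps above, a few lines of script can tell you whether the same string of errors showed up in order. This is a rough sketch, not a supported tool: `parse_events`, `cascade_duration` and the `SAMPLE` text are my own illustration, and the sample keeps only the Event ID and Generated lines from the events shown in this post.

```python
from datetime import datetime

CASCADE = ["2104", "2102", "9152", "9154", "624"]
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def parse_events(text):
    """Parse 'Key : Value' event dumps into a list of dicts, one per event."""
    events, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or not key:
            continue
        if key == "Event ID":          # an 'Event ID' line starts a new record
            current = {}
            events.append(current)
        if current is not None:
            current[key] = value.strip()
    return events

def cascade_duration(events):
    """Seconds from first to last cascade event if all appear in order, else None."""
    index, start = 0, None
    for event in events:
        if event.get("Event ID") != CASCADE[index]:
            continue
        stamp = datetime.strptime(event["Generated"], TIME_FORMAT)
        if start is None:
            start = stamp
        index += 1
        if index == len(CASCADE):
            return (stamp - start).total_seconds()
    return None

SAMPLE = """\
Event ID     : 2104
Generated    : 9/7/2008 12:27:27 PM
Event ID     : 2102
Generated    : 9/7/2008 12:28:15 PM
Event ID     : 9152
Generated    : 9/7/2008 12:31:15 PM
Event ID     : 9154
Generated    : 9/7/2008 12:31:20 PM
Event ID     : 624
Generated    : 9/7/2008 12:32:58 PM
"""

print(cascade_duration(parse_events(SAMPLE)))  # 331.0
```

In the case above the whole slide from the first DSAccess error to the 624 took about five and a half minutes.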


If you see this, what should you do first and foremost?  Give PSS a call so we can help you debug it.


More information on this can be found here:


http://technet.microsoft.com/en-us/library/bb218083(EXCHG.80).aspx


 

Avoiding Version Store problems in the enterprise environment

Applies to Exchange 2003 


  So one of the things that can go wrong with Exchange is that it can run out of something called Version Store.  Version store is an in-memory list of changes made to the database.  Nagesh Mahadev has an awesome post about Version Store on our msexchangeteam.com blog, posted here.  To borrow his summary:  In simple terms, the Version Store is where transactions are held in memory until they can be written to disk.


  Version store running out of memory can be caused by one of two things.  The first is a long-running transaction.  This is pretty self-explanatory.  Say your anti-virus product wants to scan something in VSAPI and locks it and then goes to lunch.  Your version store will consume more and more memory until it runs out because it’s trying to work around this long-running transaction, keeping track of all the rollbacks and whatnot.


  The other problem is with I/O.  Since we’re holding transactions in memory until they can be written to disk, if something prevents us from writing to disk, we can hit version store problems.  Sometimes this type of problem is preceded by 9791 event log entries in the application event log.  If this happens, get ready to do some adplus store dumps when the version buckets allocated counter hits 70%.
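The 70% trigger is easy to check against a series of counter samples. A minimal sketch, assuming you already have the samples and know the maximum for your instance; the function name `first_dump_sample` and the numbers in the usage line are made up for illustration.

```python
def first_dump_sample(samples, maximum, threshold_percent=70):
    """Return the index of the first sample at or above the threshold, else None."""
    for index, allocated in enumerate(samples):
        # Integer math avoids floating-point surprises right at the boundary.
        if allocated * 100 >= threshold_percent * maximum:
            return index
    return None

# Hypothetical samples of version buckets allocated, with a hypothetical maximum of 1000.
print(first_dump_sample([100, 500, 700, 900], maximum=1000))  # 2
```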


What to do to prevent or mitigate this risk?



  1. Consider increasing transaction log buffers, especially if you are seeing transaction log stalls in your environment.  The logic here is that if store can’t commit transactions to the log files fast enough, version store can back up.  By default the number of buffers is 500; you can increase this to 9000.  This will prevent a single database from needing to write a bunch of transaction logs at once and backing up version store.  I highly recommend using the EXBPA for governance on this; details on the rule for setting this can be found here.

  2. Watch your PTE resources and treat accordingly.  I’ve seen customers run low on free PTEs and run into version store problems because they don’t have the capacity to perform IO operations as fast as the database would like.

  3. Make sure your online maintenance is completing frequently, at least once a week on each database.  Part of online maintenance is defragmenting your database.  On a highly fragmented database, version store has to keep track of unoptimized links and tables and deal with records that are not on the fewest number of pages possible, in essence bloating version store size with each transaction.  For in-depth information on Exchange Store Maintenance, go here.

  4. Keep your message size limits down.  Going hand in hand with this is preventing older Outlook clients from accessing your server.  Old clients (older than Outlook 2003 SP2 in cached mode, any version of Outlook 2003 and higher for online mode) ignore your message size limits when submitting messages, so an older client could attach a 100 MB file and submit it, and store would have to deal with it even though it’s over the size limit.  This should give you the gist of what I’m talking about here.

Hope this helps with your environment.

PTE depletion, handle leaks and You

Applies to:  Windows 2000 Server/Advanced Server, Windows 2003 32bit Server, Exchange 2000/2003


PTEs 


Ok, so one of the most overlooked resources we run into with performance and availability problems is the availability (or lack thereof) of Free Page Table Entries.  What is a PTE?  It’s an entry in the page table that maps a page of virtual memory to physical memory; the kernel draws on its pool of free system PTEs to map things like I/O buffers, so think of it as an I/O mapping table, if you will.  Wikipedia has an awesome link with 8×10 color glossy photos, with circles and arrows and a paragraph on the back explaining what each one is, so I’ll point you there.  Cliff Huffman also has an excellent post on PTEs here that specifically talks about Windows.


So anyway, running out of Free Page Table Entries is bad, because it causes system hangs, sporadic lockups, general unresponsiveness, etc.  These symptoms present themselves in Exchange as general slow performance or service unavailability.


You manage your available PTEs in Windows with the boot.ini and also the SystemPages registry value.  Generally speaking, for an Exchange Server that is properly configured, you’ll see your PTE values somewhere between 8000 and 16000.  A large number of PTEs (50k or so) may be a hint that you’re not using the /3GB switch on your server.  A lower value generally means there is a problem.


This problem can be either a configuration issue or, if the PTE value is falling, a memory leak.
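The rules of thumb above boil down to a couple of checks you can run against Free System Page Table Entries readings. This is only a sketch of that reasoning: the function names and labels are mine, the cutoffs are the rough figures from this post rather than hard limits, and "falling" is assumed to mean readings that never rise and end lower than they started.

```python
def classify_free_ptes(free_ptes):
    """Label a single free-PTE reading using the rough ranges from this post."""
    if free_ptes < 8000:
        return "low"            # generally means there is a problem
    if free_ptes <= 16000:
        return "typical"        # properly configured Exchange server
    if free_ptes >= 50000:
        return "high"           # hint that /3GB may not be in use
    return "above typical"      # between the ranges; the post gives no guidance here

def looks_like_leak(samples):
    """True when the readings never rise and end lower than they started."""
    if len(samples) < 2:
        return False
    never_rises = all(later <= earlier
                      for earlier, later in zip(samples, samples[1:]))
    return never_rises and samples[-1] < samples[0]

# Hypothetical readings taken at intervals.
print(classify_free_ptes(12000))               # typical
print(looks_like_leak([12000, 11000, 9000]))   # True
```

A static "low" points you at configuration; a "low" that is still dropping points you at a leak.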


If you are dealing with a static low value and you’ve examined all the configuration settings and they all seem fine, but the value is still low (flagging in the EXBPA for example), then add /basevideo to your boot.ini.  The new AGP/PCI-E video drivers consume a lot of PTEs, and who needs the super-duper video card drivers on an Exchange box anyway?


If you are dealing with a leak, update your drivers for everything: NIC, HBA, Video, SCSI controller, you name it, update it.  If you’ve done all that and still haven’t gotten the leak addressed, contact PSS to get one of us involved with your case.


Handles


Another resource people don’t usually pay much attention to is handle count.  Excessive handle consumption can cause all kinds of non-paged kernel pool problems because handles reside within that memory space.


If you have the symptoms of a memory leak but don’t see what is causing it, check out the handle count in Task Manager.  You can do this by going to the Processes tab and selecting View/Select Columns and selecting Handles.  Handle usage varies by application and what it’s doing at the time, but if you have an application with 100k handles open and your machine performance isn’t the greatest, you may be dealing with a handle leak.  If you are, your non-paged pool kernel memory may also be high but not showing anything eating it up in poolmon.  This is because the handles don’t appear to be taken into account on the poolmon monitor in some cases, so high consumption of handles by a resource doesn’t end up under the process tag.
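If you jot down the Handles column for your processes, flagging the suspects is a one-liner. A minimal sketch, assuming the counts are collected by hand or by whatever tooling you already use; `flag_handle_hogs` is a made-up name, the 100k default comes from the rule of thumb above, and the usage numbers are the rounded Store and Inetinfo figures from the 624 case earlier in this post.

```python
def flag_handle_hogs(handle_counts, threshold=100_000):
    """Return (name, count) pairs at or above threshold, largest first."""
    hogs = [(name, count)
            for name, count in handle_counts.items()
            if count >= threshold]
    return sorted(hogs, key=lambda item: item[1], reverse=True)

# Rounded counts from the 624 case, before and after the fix.
before = {"store.exe": 32000, "inetinfo.exe": 32000}
after = {"store.exe": 6000, "inetinfo.exe": 3000}

# A lower threshold than the default, since 32k was already trouble in that case.
print(flag_handle_hogs(before, threshold=30_000))
print(flag_handle_hogs(after, threshold=30_000))   # []
```

What counts as too many depends on the process, so pick the threshold from what the same process looks like on a healthy day.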


If you have a process with a high handle count, contact the vendor.


Documents on PTEs:


The effects of 4GT tuning on system Page Table Entries


How to Configure the Paged Address Pool and System Page Table Entry Memory Areas


Documents on Handles:


Well, here you can see the impact of high handle count:


Microsoft KB