100,000 Disk Flush events and Windows boot


I had a slow boot scenario come my way recently where at least part of the boot was held up by disk flushes from a management application that appeared to be in debug mode.

In the graphic above, the top graph shows the overall throughput of C: and the bottom shows the disk flush events from the application writing to its log file. The IO pattern was:

write to file
write to MFT

It did this 500 times a second for 200 seconds, which works out to the 100,000 flushes in the title.
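To make the pattern concrete, here is a minimal Python sketch of a logger that forces every entry to disk, alongside one that batches its flushes. It is a hypothetical reconstruction, not the vendor's code: the function name write_log, the flush_every parameter and the log file name are made up for illustration, and it only models the application's own file writes (NTFS updates the MFT itself). On Windows, Python's os.fsync() calls the C runtime's _commit(), which forces the file's buffered data to disk.

```python
import os
import tempfile


def write_log(path, lines, flush_every=1):
    """Append lines to a log file, forcing them to disk every `flush_every` lines.

    flush_every=1 mimics a logger that flushes after every single write
    (a hypothetical stand-in for the debug-mode behaviour described above).
    Returns the number of os.fsync() calls, i.e. flush requests sent to the disk.
    """
    fsyncs = 0
    with open(path, "a", encoding="utf-8") as f:
        for i, line in enumerate(lines, start=1):
            f.write(line + "\n")
            if i % flush_every == 0:
                f.flush()              # push Python's buffer to the OS
                os.fsync(f.fileno())   # ask the OS to commit it to the device
                fsyncs += 1
        # Flush any lines left over from a final partial batch.
        if len(lines) % flush_every:
            f.flush()
            os.fsync(f.fileno())
            fsyncs += 1
    return fsyncs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = os.path.join(d, "debug.log")  # hypothetical log file name
        entries = [f"debug entry {n}" for n in range(500)]  # one second's worth at 500/s
        print(write_log(log, entries, flush_every=1))    # 500
        print(write_log(log, entries, flush_every=100))  # 5
```

With flush_every=1, every log line costs a flush, so 500 lines a second means 500 flushes a second; batching 100 lines per flush cuts that to 5, at the cost of possibly losing the last few entries in a crash.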

What is a disk flush?

ChatGPT says:

A disk flush is an operation that ensures all pending file system writes are transferred from the operating system’s buffer cache to the actual physical disk. This is important for ensuring data integrity, especially in the event of a system crash or power failure. By flushing the disk, you can be sure that any recent changes to files are fully saved and will be preserved even if the system stops unexpectedly. It’s a way to synchronize the state of the file system on disk with the state in memory.

The problem is that a flush pushes the whole disk cache to disk, not just the application's own data. While the flush is in progress, other writes to the disk queue up and are only completed once the flush has been confirmed. So one could say this app DDoS’d access to C: for every other application while it did its thing.
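A rough way to see why this starves everyone else: if flushes are serialized and other writes wait behind them, the share of each second the write path is tied up is roughly flush rate × flush latency. The sketch below assumes that simplified model (it is not how the Windows storage stack is specified), and the per-flush latencies are made-up values; real flush times depend on the device, driver and cache configuration.

```python
def total_flushes(flushes_per_sec: int, duration_s: int) -> int:
    """Total flush requests issued over a time window."""
    return flushes_per_sec * duration_s


def blocked_fraction(flushes_per_sec: float, flush_latency_s: float) -> float:
    """Approximate share of each second that writes spend stuck behind flushes.

    Simplified model (assumption): flushes run one at a time, each takes
    flush_latency_s, and other writes wait until the in-flight flush is done.
    Capped at 1.0 because a second cannot be more than fully busy.
    """
    return min(1.0, flushes_per_sec * flush_latency_s)


if __name__ == "__main__":
    print(total_flushes(500, 200))  # 100000
    # Hypothetical per-flush latencies; real values depend on the hardware.
    for latency_s in (0.001, 0.002, 0.005):
        share = blocked_fraction(500, latency_s)
        print(f"{latency_s * 1000:.0f} ms/flush -> {share:.0%} of each second blocked")
```

Under this model, a hypothetical 1 ms per flush at 500 flushes a second ties up half of every second, and at 2 ms or more the write path is never free, which is consistent with a disk that looks saturated while moving very little data.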

Doing this during boot extends boot times, and it’s not easy to triage exactly why without a WPR (Windows Performance Recorder) trace, anyway.

What’s the fix? The customer had to contact their storage team (it was some sort of storage tool from an OEM that I think monitored Fibre Channel storage). Whatever they changed, the next boot was 200 seconds shorter because the tool was no longer writing to that log file.

Disk flushes can also show up during normal operation, and they aren’t inherently ‘bad’; it’s when they are done excessively that they become a problem. How do you know if you are having disk flush events without doing a trace?

Ever seen storage in Task Manager 100% busy, but throughput is like, less than 1 MB/s? That’s probably disk flushes (or bad storage, or a driver issue, or a myriad of other things, but flushes seem to be the common offender there).
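The symptom can be written down as a simple screening heuristic over periodic samples, e.g. active time and throughput noted from Task Manager or Performance Monitor once a second. This is a hypothetical sketch: flagged_samples and its thresholds (95% active, under roughly 1 MB/s) are illustrative choices, not established limits, and a hit only says "look closer", since bad storage or a driver problem can produce the same shape. A WPR trace is still what confirms flushes.

```python
from typing import List, Tuple


def flagged_samples(samples: List[Tuple[float, float]],
                    busy_pct: float = 95.0,
                    low_bytes_per_s: float = 1_000_000) -> List[int]:
    """Return indices of samples where the disk looks busy but moves little data.

    Each sample is (active_time_percent, bytes_per_second).
    The default thresholds are illustrative assumptions only.
    """
    return [i for i, (active, throughput) in enumerate(samples)
            if active >= busy_pct and throughput < low_bytes_per_s]


if __name__ == "__main__":
    # Made-up samples: (active time %, bytes per second)
    samples = [(100.0, 400_000), (100.0, 50_000_000), (20.0, 100_000), (97.5, 900_000)]
    print(flagged_samples(samples))  # [0, 3]
```

A busy disk with high throughput (sample 1) is just a busy disk; it is the combination of near-100% active time and almost no data moved that is worth following up with a trace.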

Happy debugging,
