In my work at Tanium I do a bit of debugging and performance analytics. Over the last 2-3 years, a LOT of this has centered around how Windows systems get slower and slower over time. This has been a common complaint/statement of ridicule/FUD since I started my career in IT 26 years ago in fact.
Windows is not the issue here. But this happens a lot in Windows and it isn’t exposed well and that needs to be fixed. And people writing tools for Windows systems need to learn to fucking code (and maybe when you write a kernel driver, do more than download the sample on MSDN, pop in your code, then leave the default “Made by WINDDK” in the file properties of your driver. Oh yeah, including a version number too.
I like to eat my own dogfood, so I run Tanium on my own systems. If our stuff is off the chain, I want to know it. Preferably before a customer and Microsoft are pointing fingers at us in a support call. Call me crazy, I like to get ahead of the 8 ball, you know? But, if it is us, and MSFT support happens to be right (which is a declining % value over time the last 4-5 years ago it seems), I’ll take that on the nose as well. At the end I either educated a customer, Microsoft support tier 1 SE, or myself and my dev team. It’s a win all around, no matter who is ‘at fault’.
Tonight I’m going to show you how fucked up Windows can get when code sucks. And it’s not Tanium. And it’s not MSFT. And (fucking surprise!) it’s not even antivirus!!!!! Never thought I’d give AV a clean bill of sale on this, cause usually it is AV….
Ok. system specs: AMD 5900x, 64GB of 3200Mhz RAM, Nvidia 3090 FE, boot drive is a 2TB PCI4 NVME drive. I have a total of 9 SSDs and 3 NVME drives. I also have spinners for cold storage, USB3.1 attached larger spinners. This is not a poorly performant system, or RATHER, it has no real excuse to be a poorly performant system. So when I opened ProcessExplorer tonight to figure out why I had process ID counts above 100k (a zombie process symptom) and it HUNG, over and over trying to render screen updates, I knew something was bad.
So here’s how I solved this.
- Taskman, sort by PID, over 100k? Yes? Bad.
- Open ProcessExplorer as Administrator and sort by handle count (after adding it to the view).
- Add bottom analysis pane to ProcExp do “By handle”
So what does this look like in pictures?
Ok, so ArmouryCrate is causing all the zombied processes? hahahaah, no
Why? Because the plot thickens.
Bruce Dawson wrote a blog post back a couple years ago and updated some fellows code and posted it in GitHub to find details on process zombies, that are handle based. So I ran that as well.
So, check this out, all these zombie processes here, are cause Razer opens a handle to its child gui process and never closes it either.
So how to fix all this mess? Like some gamer guy on the internet is gonna get Razer and Asus to fix their shit right? lolz Uninstall that shit and don’t look back I guess. Here’s how I look once I kill these offenders.
Now my system is responsive. Eventually as running processes close, pid values will fall as well. Hope this helps you understand how to look for zombies.
Thanks for reading my rant.