Windows, Zombie Processes, and bullshit code

Hi,

In my work at Tanium I do a bit of debugging and performance analytics. Over the last 2-3 years, a LOT of this has centered around how Windows systems get slower and slower over time. This has been a common complaint/statement of ridicule/FUD since I started my career in IT 26 years ago in fact.

Windows is not the issue here. But this happens a lot in Windows, it isn’t exposed well, and that needs to be fixed. And people writing tools for Windows systems need to learn to fucking code (and maybe when you write a kernel driver, do more than download the sample on MSDN, pop in your code, then leave the default “Made by WINDDK” in the file properties of your driver. Oh yeah, include a version number too).

I like to eat my own dogfood, so I run Tanium on my own systems. If our stuff is off the chain, I want to know it. Preferably before a customer and Microsoft are pointing fingers at us in a support call. Call me crazy, I like to get ahead of the 8 ball, you know? But, if it is us, and MSFT support happens to be right (which has been a declining % over the last 4-5 years, it seems), I’ll take that on the nose as well. At the end I’ve either educated a customer, a Microsoft support tier 1 SE, or myself and my dev team. It’s a win all around, no matter who is ‘at fault’.

Tonight I’m going to show you how fucked up Windows can get when code sucks. And it’s not Tanium. And it’s not MSFT. And (fucking surprise!) it’s not even antivirus!!!!! Never thought I’d give AV a clean bill of health on this, cause usually it is AV….

Ok. System specs: AMD 5900x, 64GB of 3200MHz RAM, Nvidia 3090 FE, boot drive is a 2TB PCIe 4.0 NVMe drive. I have a total of 9 SSDs and 3 NVMe drives. I also have spinners for cold storage, plus larger USB 3.1 attached spinners. This is not a poorly performant system, or RATHER, it has no real excuse to be a poorly performant system. So when I opened ProcessExplorer tonight to figure out why I had process ID counts above 100k (a zombie process symptom) and it HUNG, over and over, trying to render screen updates, I knew something was bad.

So here’s how I solved this.

  1. Taskman, sort by PID, over 100k? Yes? Bad.
  2. Open ProcessExplorer as Administrator and sort by handle count (after adding it to the view).
  3. Add the lower pane in ProcExp and set it to show handles.
  4. SMH.
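Step 1 can also be scripted instead of eyeballed in Taskman. This is a minimal sketch, assuming Python is on the box and using the built-in `tasklist` command; the 100k threshold is just the rule of thumb from this post, and the function names are made up for illustration.

```python
import csv
import io
import subprocess

# Rule of thumb from this post, not an official Windows limit.
PID_WARN_THRESHOLD = 100_000


def parse_tasklist_csv(text):
    """Return the PIDs found in `tasklist /fo csv /nh` output."""
    pids = []
    for row in csv.reader(io.StringIO(text)):
        # Columns: Image Name, PID, Session Name, Session#, Mem Usage
        if len(row) >= 2 and row[1].isdigit():
            pids.append(int(row[1]))
    return pids


def highest_pid_is_suspicious(pids, threshold=PID_WARN_THRESHOLD):
    """True when the largest live PID is above the threshold."""
    return bool(pids) and max(pids) > threshold


def current_pids():
    """Windows only: ask tasklist for every running process."""
    result = subprocess.run(
        ["tasklist", "/fo", "csv", "/nh"],
        capture_output=True, text=True, check=True,
    )
    return parse_tasklist_csv(result.stdout)
```

On a Windows box, `highest_pid_is_suspicious(current_pids())` coming back `True` is the cue to go dig in ProcExp. It only tells you something is off, not who is holding the handles.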

So what does this look like in pictures?

PIDs are over 100k. Fuck That noise, Windows recycles pids. If your prod server has over a MILLION value, your environment is hosed. Seek help.
As you can see, ASUS ArmouryCrate is a well coded app that knows to release handles after enumerating all processes on my system. (<—snark) Why is it enumerating all processes on my system? Because it is searching for games… Here we see it has a handle to a thread that is terminated and that thread’s process is dead.

Ok, so ArmouryCrate is causing all the zombied processes? hahahaah, no

Why? Because the plot thickens.

A couple of years ago Bruce Dawson wrote a blog post on this, updated another fellow’s code, and posted it on GitHub as a tool for finding details on handle-based zombie processes. So I ran that as well.

Razer is holding open a lot of zombies, yet, that app is not running on my system…or is it?

So, check this out: all these zombie processes are here because Razer opens a handle to its child GUI process and never closes it either.

Razer holding open razer, why

So how to fix all this mess? Like some gamer guy on the internet is gonna get Razer and Asus to fix their shit right? lolz Uninstall that shit and don’t look back I guess. Here’s how I look once I kill these offenders.

From 9300 zombies, to 68

Now my system is responsive. Eventually as running processes close, pid values will fall as well. Hope this helps you understand how to look for zombies.

https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-openprocess for more reading.
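For the coders in the audience, the whole problem boils down to pairing every `OpenProcess` with a `CloseHandle`. Here is a minimal sketch in Python with ctypes, assuming you only need the limited query access right. The `process_handle` and `load_kernel32` names are hypothetical, and the `kernel32` parameter exists only so the pattern can be exercised without touching a real process.

```python
import ctypes
from contextlib import contextmanager

PROCESS_QUERY_LIMITED_INFORMATION = 0x1000  # documented access right


def load_kernel32():
    """Load kernel32 with explicit prototypes (Windows only)."""
    k32 = ctypes.WinDLL("kernel32", use_last_error=True)
    k32.OpenProcess.restype = ctypes.c_void_p
    k32.OpenProcess.argtypes = [ctypes.c_uint32, ctypes.c_int, ctypes.c_uint32]
    k32.CloseHandle.restype = ctypes.c_int
    k32.CloseHandle.argtypes = [ctypes.c_void_p]
    return k32


@contextmanager
def process_handle(pid, kernel32=None):
    """Open a handle to `pid` and always close it, even on errors."""
    k32 = kernel32 if kernel32 is not None else load_kernel32()
    handle = k32.OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, False, pid)
    if not handle:
        raise OSError(f"OpenProcess failed for PID {pid}")
    try:
        yield handle
    finally:
        # Skipping this is how a process that exited long ago stays
        # around as a zombie: the kernel keeps the process object alive
        # until the last handle to it is closed.
        k32.CloseHandle(handle)
```

Used as `with process_handle(pid) as h: ...`, the handle is released when the block exits, whether or not anything inside it throws.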

Thanks for reading my rant.

CVE-2021-26807 – GOG GALAXY v2.0.35 DLL Load Order Hijacking

Authors: Brian Papile and Jeff Stokes

Executive summary

GOG Galaxy version 2.0.35 was vulnerable to DLL Load Order Hijacking. The vendor patched the vulnerability in version 2.0.37, released March 30, 2021.

Discovery

This vulnerability came to light when we tried to uninstall the Folding at Home client, but its folder and some DLLs remained. When we tried to delete the files manually, Windows warned that they were in use by the GOG Galaxy client application.

To us this was very odd. GOG of course has no relationship with the FAH software. It was even installed on a completely different drive. We suspected there was a vulnerability here, but were not sure whether it was exploitable.

 

Research

After reaching out to colleagues, they believed this could be a DLL Load Order Hijacking vulnerability. After doing some research, we came to the conclusion that Process Monitor, with some filters applied, would likely be the best avenue for investigation.


This returned around 113 events on startup of the GOG client. But most paths were places only administrators had access to. After adding more filters, we narrowed the list down.

Nice, we have a DLL path that a standard user can write to: C:\Users\(username)\AppData\Local\Microsoft\WindowsApps\ZLIB1.dll
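The same question, "which of these directories can a standard user write to?", can also be asked from a script. This is a rough sketch, assuming the candidate directories come from `PATH` (Process Monitor remains the authoritative view of where a given process actually looks). The function names are hypothetical, and the check simply tries to create a throwaway file in each directory.

```python
import os
import tempfile


def user_writable_dirs(candidates):
    """Return the directories in `candidates` the current user can write to."""
    writable = []
    for directory in candidates:
        if not os.path.isdir(directory):
            continue
        try:
            # Actually creating a file is more honest than os.access(),
            # which does not evaluate ACLs on Windows.
            with tempfile.TemporaryFile(dir=directory):
                pass
        except OSError:
            continue
        writable.append(directory)
    return writable


def path_dirs():
    """Directories listed on PATH, in order, skipping empty entries."""
    return [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
```

Run as a standard user, `user_writable_dirs(path_dirs())` lists the spots where a planted DLL could end up being loaded by a process that searches those directories.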


Development of POC

From here, we knew we needed a DLL to place in this directory. After doing some searching (mainly on GitHub), we couldn’t find one that fit our needs. We wanted to write out to a file, the process and the account that was running it. Additionally, we wanted a popup, with the same information, for visual confirmation. 

Our requirements for this DLL were to make it flexible (for other vulnerability research), but mainly to see if the update mechanism of the GOG Galaxy client would run the code in our DLL. The update service runs with SYSTEM-level permissions. This was necessary, because there was no way to manually kick off a check for updates of their software, and we couldn’t find a cadence of when or how it happened. As our coding knowledge only included scripting languages, we reached out to some colleagues (again), and they assisted us with the creation of the DLL. 

Exploitation

We dropped the DLL into the above path, and boom! The file C:\temp\whoami.txt was created, with username, process path, and PID. Additionally, the popup displayed the same information:

We dug around and tried to get the GOG Galaxy client to update and trigger our DLL, but we had no luck. At this time, we could not find a way to elevate above the current user. But that doesn’t mean that this vulnerability should be ignored, as other processes could exploit this to hide their malicious code and run as the logged-on user (we were thinking of ransomware as the primary risk).

Reporting
Now we needed to report the vulnerability to GOG and submit for a CVE, which we had never done before. This was a learning opportunity for us, but wasn’t as bad as we initially thought. See the below timeline.

Timeline

Jan 24, 2021 – Initial finding

Jan 25, 2021 – Notified vendor

Jan 25, 2021 – Vendor forwarded on to their security team

Jan 26, 2021 – Vendor closed initial support ticket, opened a JIRA ticket, and recommended we “follow Google Responsible Disclosure rules”

Jan 26, 2021 – Google waiting period begins: 90 days from this date (April 26, 2021)

Feb 2, 2021 – Notified MITRE/CVE portal

Feb 9, 2021 – CD Projekt Red is hacked and held for ransom (only including this, because GOG might fall under them, or vice versa, so resources could be distracted from this)

Feb 25, 2021 – MITRE issues CVE-2021-26807

Mar 30, 2021 – Vendor releases 2.0.37, fixing the vulnerability

Mar 30, 2021 – MITRE notified of fixed version

Apr 6, 2021 – MITRE informed us we could publish 

Apr 29, 2021 – Blog published

Thanks to everyone involved (post will be updated with names once we know they are ok being named).

Thanks to Alyssa Miller for guiding us on how the whole CVE process works!

 

References
https://pentestlab.blog/2017/03/27/dll-hijacking/

https://github.com/povlteksttv/Exploiting-DLL-Search-Order-Hijacking

Github for our DLL (TBD)

 

Exploring the hidden opportunities of sudden change in enterprise IT management.

Tanium’s blog post featuring Lumentum CIO Ralph Loura really resonated with me on a couple of levels. The one thing in life that seems predictable is change. I know it is a bit cliche, but this has been true in my life. While sometimes it is difficult to see the positive aspect in situations, it seems to me that there is usually a lesson or nugget of knowledge one can take away from an event. Or perhaps a situation presents itself that you can seize to really make meaningful change in what you are doing. This quote I think is spot on. 

“There are too many people who are too comfortable with the pillow and the snooze alarm and are just waiting for this to be over so they can go back to the way it used to be,” he says. “I think those people will have a lot of challenges coming up. I don’t think the world as we used to know it is coming back anytime soon.” 

Where is the opportunity?

Loura does not see this as necessarily a bad thing but rather as an opportunity to let go of what was done before and rethink how to do things better moving forward. 

Personally, my outlook is that change brings opportunity in life. Even a debilitating car wreck in my younger days resulted in investigating PC technology and eventually led to a relatively successful career in IT. So, when I read this blog about how the new Covid-19 scenario impacted IT organizations, and how he saw this as an opportunity for change, it really resonated with me. 

On a professional level, if you reflect on larger companies, especially enterprise level ones, change usually is a slow, methodical process. Suddenly having your staff working remotely, in some places even on BYOD machines, is a very good opportunity to pivot from one way of using technology and move to another. This is what I think of when I talk about Digital Transformation. 

Three things come to mind, Zero Trust, Windows Performance, and the risk in IT organizations of siloed teams.

Zero Trust 

One striking change is the transformation of zero trust from magazine buzzword to business reality over the last year. Have you reimagined your network perimeter, with BYOD and remote workers considered? What do you do when you cannot manage or even trust the user’s endpoint? Or even worse, have no visibility to the endpoint at all? 

Tanium can help address these concerns. It can be used to help secure endpoints in a BYOD/WFH scenario. If you have Tanium already, you probably have the infrastructure in place to do this. For more information, check the library of content at https://www.tanium.com/distributed-workforce/. 

Performance

Windows Performance is a personal interest of mine and something I’ve made into a bit of a career. I feel for users who are having a bad experience on underperforming machines. It’s generally “easy” to remediate most performance issues, because they are based in data. If you can see that data, that is. I can’t count the number of hours I’ve spent trying to get the right set of data to resolve a performance issue in an enterprise.

Tanium has the Performance module, which can give you that visibility into the health of not just an endpoint, but your fleet of endpoints. And a natural query language parser is ready to help you take that data and go through it in a myriad of ways.

Siloing and Risk 

Another thing I saw a lot of while a field engineer (PFE) at Microsoft was siloing. I routinely encountered organizations siloed such that the security and operations teams reported up to different C-levels. At times I’ve seen open hostility between security and operations staff. Almost like a turf war. 

There is a myriad of problems with this situation, but the one I think is most critical is the security posture of an enterprise. If your operations and security teams are at odds, how are you securing your environment properly? Orion Hindawi (Tanium’s Co-Founder & CEO), Sarah Franklin (Salesforce’s Chief Marketing Officer) and Sunil Potti (Vice President and General Manager of Google Cloud Security) discuss this in a webinar, which I highly recommend. 

Final thoughts 

You cannot predict the future with certainty, but you can be certain that change is coming. Has your organization started its own digital transformation yet? If not, what’s holding it back? I think the old ‘normal’ is gone, but I don’t think the current ‘normal’ is the future either, so are you ready to pivot? 

As Ferris Bueller said, “Life moves pretty fast. If you don’t stop and look around once in a while, you could miss it.” 

EDRefCard.info is down! Long live EdRefCard! How to set up your own instance of EdRefCard so you can create a card for your HOTAS config in Elite Dangerous.

<no longer needed, EDRefCard.info is back up!!!>

How to set up your own instance of EdRefCard so you can create a card for your HOTAS config in Elite Dangerous. Share with friends, import friends config files and get cards made for those.

What? – This used to be served at https://edrefcard.info but the site has been down recently. So I went to github and found the project, which has some instructions on building a docker container. I took it a step further and placed the edrefcard info needed into Docker Hub.

So now, to set this up locally all you need to do is:

  1. Be on Windows 10.
  2. Install Docker Desktop.
  3. This should install WSL2 and Docker bits and some PowerShell Management cmdlets and prompt for a reboot.
  4. Reboot
  5. Skip the Docker Desktop tutorial. No need to use the UI here.
  6. Open an Administrator PowerShell Window (Win+X and select PowerShell (Admin))
  7. Run the command
    docker run -d --rm --name edrefcard -p 8080:80 jeffstokes72/edrefcard
  8. Allow the Windows Firewall to open the local port 8080
  9. Open your favorite browser and put http://localhost:8080 in the address bar and hit enter.
  10. Enjoy

To recap:

Install Docker Desktop and reboot, open an administrator PowerShell window with Win+X.

Run the command 
docker run -d --rm --name edrefcard -p 8080:80 jeffstokes72/edrefcard

Allow Windows Firewall to open the local port

And then in your browser, go to http://localhost:8080

 

Enjoy!

Windows 10 Task Manager ‘% CPU’ skew – A Tale of Two Metrics

EDIT: My co-worker, Aaron Margosis, wrote his take on this issue, you can read about it here: Task Managers CPU Numbers Are All But Meaningless!

Windows 10 Task Manager is often used by end users to gauge the performance of their machine, especially when they think something is amiss. There are several reasons why this isn’t really a good performance gauge.

  1. It’s a point-in-time measurement that lacks context of the overall scale of resource usage.
  2. It doesn’t see inside processes to understand the impact of Anti-Virus and other security software on the processes.
  3. Task Manager CPU stats are deceptive and inconsistent, at time of writing (rest of this blog explains).

Say what?! A primer on this can be found at CPU usage exceeds 100% in Task Manager and Performance Monitor if Intel Turbo Boost is active .

However, Intel Turbo Boost is not the only scenario. Any scenario where the CPU cores change from their default 100% output will skew results in Task Manager.

This includes things like thermal throttling, Intel SpeedStep, Intel Turbo Boost, AMD Precision Boost 2, AMD Precision Boost Overdrive, and C-State management for power savings when the Balanced or Power Saver power plans are enabled (Balanced is the default, btw). All of these technologies modify the speed of one or more cores inside a CPU. And the point of this article is not that these technologies are bad, it’s that Task Manager currently does not take their modifications to the output of the core into account in calculations, per se. Or maybe rather, it does but doesn’t tell you.

One would normally expect a CPU to just have 0-100% and an app uses 5%. But if the CPU is 8 cores and the core the thread for the app is on is in boost mode, for example, it’s not 100% CPU we’re measuring, it’s like, 112%. So now the Processes tab is showing you 100% of 112%, etc.

Example, my co-worker Aaron Margosis wrote a utility that can run a thread at 100% CPU on a core. So in the screenshot below, I’m running a single thread at 100% CPU on one core, on an AMD Ryzen 5900x CPU.

Which is accurate? It ‘depends’ on what you want to know. If it’s utilization measured against all the cores running at their base 100%, it’s 4% CPU (one fully busy logical processor out of 24). But the core the thread is on is likely boosted by AMD’s chip technology, so the other view counts the extra capacity of the boosted core and reports 6%.

Does this seem like a large inconsistency? 2% is no big deal, right?

Let’s expand the experiment to 8 busy threads (the chip has 24 logical processors, so we’ll be ok to run this test).

This is a slightly larger variance. So if I were a user complaining my machine was slow, I’d obviously think my CPU was being eaten by this test program, at 42.4%, when in reality it’s 33.333%. So 9% variance. Not huge in the world but still, it’s confusing. Especially since the tooltip on CPU in both tabs of Task Manager says the same thing: “Total processor utilization across all cores”.

Below I’m loading 12 of the 24 logical processors on my AMD 5900x.

So now we’re seeing a 13.5% variance. So over 1% per core. My system’s BIOS is not set to aggressively OC the CPU, I could probably get bigger variances by doing so. Maybe that’s post #2 for this topic.
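To make the arithmetic behind those three tests explicit, here is a small sketch. It assumes the Details-style number is simply busy threads divided by logical processors, and it reads the gap to the Processes-style number as an effective-versus-base clock ratio. That second part is my interpretation, not something Task Manager reports. The function names are made up, and the 63.5% figure is inferred from the stated 13.5% variance rather than read off a screenshot.

```python
def details_tab_percent(busy_threads, logical_processors):
    """% CPU if every logical processor counts as exactly 100%."""
    return 100.0 * busy_threads / logical_processors


def implied_boost_ratio(processes_tab_percent, details_percent):
    """How much larger the boost-aware number is than the flat one."""
    return processes_tab_percent / details_percent


LOGICAL_PROCESSORS = 24  # AMD 5900x: 12 cores, 24 threads

one_thread = details_tab_percent(1, LOGICAL_PROCESSORS)       # about 4.17
eight_threads = details_tab_percent(8, LOGICAL_PROCESSORS)    # about 33.33
twelve_threads = details_tab_percent(12, LOGICAL_PROCESSORS)  # 50.0

# 42.4% was observed for 8 threads; 63.5% is 50% plus the 13.5% variance.
ratio_8 = implied_boost_ratio(42.4, eight_threads)    # about 1.27
ratio_12 = implied_boost_ratio(63.5, twelve_threads)  # about 1.27
```

Both ratios land around 1.27, which fits the idea that the busy cores are running roughly a quarter above their base clock in these tests.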

These tie back to Performance Monitor as well. The more accurate data points for CPU measurements in Windows 8 and Server 2012 and above are

  • Processor Information\% Processor Utility
  • Processor Information\% Privileged Utility

Which is where the “Processes” view is getting its values.

So one can think of Processes tab on Task Manager as the “% CPU used of all available % of CPU available” vs the Details tab which is more “% CPU used of 100%/core”.

Microsoft is looking at a better way to display this all the time, cognizant that end users are used to Task Manager for gauging performance, not say, Perfmon or an ETW trace with the Windows ADK.

It is worth noting this variance does not appear to impact virtual machines, so far as I’ve been able to observe at this point.

Happy performance profiling

Jeff

 

How to collect a boot trace on Windows 10 using xbootmgr

Sometimes in support you’ll be asked to collect a boot trace to help troubleshoot slow boot or slow logon scenarios. The symptom is that a long time passes from startup to the CTRL+ALT+DEL prompt, or from CTRL+ALT+DEL to a usable desktop experience. This blog will walk you through the steps needed to do this.

While you can do boot tracing in Windows 10 using the built-in native WPR.exe, it’s a bit kludgy and doesn’t add all the providers its ancestor xbootmgr added in boot scenarios. Therefore if you do it that way, you are missing parts of the trace expected by the analyst.

The only alternative is to download the ADK for Windows 10, install the Windows Performance Toolkit (aka WPT), and do the trace using either WPRUI (with the boot scenario selected) or use xbootmgr if you prefer command line.

The Windows ADK for Windows 10 is sometimes updated when a new build is out. Usually, for Windows 10, you want to use the most recent ADK’s install of the WPT. At writing that is the ADK for Windows 10 version 2004. You can always get the link to the most current ADK at the page Download and install the Windows ADK. Installing the WPT requires you to run the ADK installer which pulls what you select in the checkboxes from the web (as shown below).

Or if you prefer, you can download and install the redistributable located in my OneDrive. Your call. I put the Build 2004 redist’s for x86 and x64 there.

Once the WPT is installed, the command line to grab a boot trace is:

xbootmgr -trace boot -traceflags dispatcher+latency -stackwalk readythread+threadcreate+profile+cswitch

This of course must be run as administrator. By default an Administrator command prompt puts you in System32, so it’s best to make a directory off C:\ and name it Trace or whatnot and change directory to there to run the command. The output of the trace will be written to the directory where the trace command is run by default.

Run the command, this will reboot the host and then boot up the kernel in tracing mode.

So to recap:

  1. Install WPT
  2. Open CMD Prompt as Administrator
  3. CD\
  4. mkdir Trace
  5. CD Trace
  6. xbootmgr -trace boot -traceflags dispatcher+latency -stackwalk readythread+threadcreate+profile+cswitch
  7. Wait for CTRL+ALT+DEL after the machine reboots and login
  8. The trace will count down for 2 minutes and then write to C:\trace.
  9. The interim trace files will be labeled KM and UM in the file name. Those are pre-merge files from the kernel-mode and user-mode trace sessions respectively. Once those are both written to disk from RAM, xbootmgr will merge the two into a single file and delete the KM and UM working files.

Jeff “dude” Stokes

Windows 10 20H2 boot trace – dropped events

TLDR: At time of writing, Windows 10 20H2 has a bug where the default buffer allocations in boot tracing are inadequate to capture the data of a boot trace. The fix is pretty simple, use good old xbootmgr instead. This is a binary from the older ADK and gets installed when you install the current ADK.

What am I talking about? How did I find this?

I hit a scenario where I needed a boot trace. So I set it up like so, this is a pretty typical set of options for a boot trace. Collect 1st level triage, CPU, DiskIO and File IO events. Log to file (the only option in a boot trace) and change your iterations from 3 to 1.

But when the trace rebooted the VM and came back up, it had dropped events. Dropped events mean at some point in the recording, data was lost. Windows knows it lost data but not what type. So this makes interpreting the trace extremely unreliable.

Typically this is due to poor storage performance. So I tested the storage with CrystalDiskMark. And since the VM is hosted on an NVME drive, it did pretty well.

These numbers are more than adequate for our needs. So what gives? There is a mechanic in collecting traces known as ETW buffers that capture the data from ETW providers.

Think of this as radio waves. Each ETW provider in Windows is a radio station. Each one is broadcasting all the time. When you collect an ETW trace, what you are telling Windows you want to do is listen to a station or set of stations, and collect that data into memory, or in the case of a boot trace, a pair of files. Windows can do this for you usually with no issues, by allocating Non-Paged Kernel memory to trace buffers. In xbootmgr and its cousin, xperf.exe, you can tweak the buffers allocated to the trace, both the count of buffers, and the memory size of each buffer. Typically the default values work just fine, but if you are dealing with a very busy system or terrible storage performance, sometimes you can drop events.

To go back to the radio analogy, this would be like the broadcast missing segments of time, or static perhaps is a way to think of it.
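If the radio picture doesn’t do it for you, a toy model might. This is a deliberately simplified sketch, not how ETW is actually implemented: events arrive in bursts, land in a fixed pool of buffers, and a writer drains a fixed number of buffers per tick. Anything that arrives while the pool is full is dropped. All parameter names are hypothetical.

```python
def simulate_trace(events_per_tick, buffer_count, events_per_buffer,
                   buffers_flushed_per_tick):
    """Return how many events are dropped in this toy buffer model."""
    capacity = buffer_count * events_per_buffer
    queued = 0
    dropped = 0
    for arriving in events_per_tick:
        # Accept what fits in the free buffer space, drop the rest.
        free = capacity - queued
        accepted = min(arriving, free)
        dropped += arriving - accepted
        queued += accepted
        # The writer drains up to N buffers' worth to disk each tick.
        queued -= min(queued, buffers_flushed_per_tick * events_per_buffer)
    return dropped


bursts = [100, 100, 100]
small_pool = simulate_trace(bursts, buffer_count=2, events_per_buffer=40,
                            buffers_flushed_per_tick=1)
large_pool = simulate_trace(bursts, buffer_count=8, events_per_buffer=40,
                            buffers_flushed_per_tick=1)
```

With the same bursts and the same writer speed, the two-buffer pool drops 140 events and the eight-buffer pool drops none. That is the general reason buffer count and buffer size matter when storage is not the bottleneck.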

If you wanted to learn more, About Event Tracing is a great starting point, so is ETW Central.

So back to the scenario, I had dropped events, and I confirmed storage was great. So what next?

I thinned the trace, iteratively, down to just 1st level triage checked and “Light” instead of “Verbose” and still dropped events.

I also tried the “GeneralProfileForLargeServers.wprp” file that is located in the “C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit” directory. I tried this because this file has statically set values for buffers. But still, no dice, dropped events.

What I ended up doing to fix this was to call xbootmgr, and then I had no dropped events. Curious. I can only surmise Windows 10 20H2 has a different configuration than previous Windows versions for the ETW collections.

The command I used is xbootmgr -trace boot -traceflags dispatcher+latency. This rebooted my machine as expected and collected a trace. When I opened it, it had no errors. Success!

xbootmgr -trace boot -traceflags dispatcher+latency

Double-clicking the resulting .etl file then opened it without errors.

 

I’ll be opening a Feedback using the Feedback app and placing a link here shortly. If this impacts you and you’d like to see it fixed please upvote here. I hope this has helped you understand what is going on and how to work around the current issue. Happy Tracing!

Jeff

What’s using your video RAM? Xbox Game Services naturally…

Applies to: Windows 10, Gamers

 

One of my routines when installing Windows 10 fresh (or updating builds when it wipes my preferences) is to change Task Manager’s view to report on additional columns of value.  Let me show you what I’m doing:

TaskManager with columns that make sense

My machine has an uptime of 1 day, 15 hours. I game quite a bit. Running an Nvidia 2080ti.

Somehow while not gaming, I’m using 3GB of dedicated video ram…

While not gaming.

So where is it going?

Xbox game services (I don’t use this, forgot to turn off gamebar doh)

 

Does this get released when I launch a game? Probably not. Good reminder to shut off what you don’t need for better gaming experiences.

Win+G to go in and shut it off btw.

Game on!

Microsoft Edge – “This site is trying to open” dialog box hell – Fix

Applies to: Edge

I was downloading mods to start up a game of Witcher 3 and every time I tried to download a mod to Vortex, this dialog box appeared, with no “ffs stop asking this” option.

This gets tedious quite quickly.

So, this article walks you through getting around the warning: https://docs.microsoft.com/en-us/deployedge/microsoft-edge-policies#externalprotocoldialogshowalwaysopencheckbox

But, this .reg key can be imported to do the same thing:

.reg file to fix annoyance (requires Edge restart)

Simply download and right-click/merge the file, then restart Edge.

Here’s the contents:

And here’s your new dialog box:

Simply check the box and select Open and you’ll not be prompted again.
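If you would rather build the .reg file yourself than download one, here is a hedged sketch that generates an equivalent. It assumes the policy named in the linked article (`ExternalProtocolDialogShowAlwaysOpenCheckbox`) set as a DWORD of 1 under the machine-wide Edge policy key. The original file may differ in hive or layout, and the function names are made up.

```python
REG_HEADER = "Windows Registry Editor Version 5.00"
# Assumption: machine-wide policy key; the original .reg may differ.
EDGE_POLICY_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Edge"
POLICY_NAME = "ExternalProtocolDialogShowAlwaysOpenCheckbox"


def build_reg_text(enabled=True):
    """Return .reg file text that turns the checkbox policy on or off."""
    value = 1 if enabled else 0
    lines = [
        REG_HEADER,
        "",
        f"[{EDGE_POLICY_KEY}]",
        f'"{POLICY_NAME}"=dword:{value:08x}',
        "",
    ]
    return "\r\n".join(lines)


def write_reg_file(path, enabled=True):
    """Write the text as UTF-16 with a BOM, the usual .reg encoding."""
    with open(path, "w", encoding="utf-16", newline="") as handle:
        handle.write(build_reg_text(enabled))
```

Calling `write_reg_file("edge-always-open.reg")` produces a file you can right-click and merge the same way, followed by an Edge restart. Merging into HKEY_LOCAL_MACHINE needs admin rights.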

yay!

 

Why does HyperX NGenuity need 1.16 GB for my headset?!

I was looking at space used on my C drive in Windows 10 (just upgraded to 2004 build, yay!) and found something that seemed off to me.

Now, it’s not unusual to have driver suites like bluetooth, or sound controllers, gpu’s, headsets, etc, take up some amount of space. That’s fine. Over a GB just to show me a battery status? W-T-F.

So I drilled down in there, it appears the HyperX NGenuity suite downloads the art/text/drivers for all their products, not just the one you use. And it keeps them there. Even if you don’t have the products and don’t intend on ever purchasing them.

Worse, I closed the software, deleted the extra directories, and relaunched only to find that now the NGenuity suite hangs (can’t minimize, move the window, close it, etc) at launch.

So lose a GB of space, or guess what your battery is at in your headset. Buyer beware.

 
