Unsupervised Machine Learning for Detection Engineering

Hola everyone, I'm Diego from diegowritesa.blog

Before I start: my last post got quite a lot of traction, and some people tried claiming it as their own work. You can find me on LinkedIn & Twitter. As you can imagine, these posts take quite a bit of time to make, and they exist simply to share some knowledge, so any comments, requests, and connections are much appreciated, everyone! Love you all! ❤️

As last time, do not worry, I will leave a link to my GitHub at the very end under "References & More Useful Information" so you can copy everything if you want.

If you are asking yourself, "But Diego, why are you not posting about AI, GenAI, Agentic AI, etc.?", relaaaax. First things first: today we will discuss some basics of Machine Learning, and we will continue with more advanced AI topics in future entries.

Today’s post will serve as an introduction to key tools and their applications in cybersecurity use cases. If you’re already familiar with ML/AI concepts, jump straight to the ‘Cyber Security & Machine Learning: Real-World Applications’ section below!

Disclaimer: I am not re-inventing the wheel here. There are hundreds of thousands of tutorials, courses, and anything else you need on the internet if you want a far more rigorous explanation of these concepts. I like teaching with examples and easy concepts, and there are references at the end if you wish to expand your knowledge.

-----------------------------------------------------------------------------------------------------------------------------

Executive Summary

"Math, Myth & Machine Learning: Cybersecurity Basics with a Real-World Application

Detecting Anomalies in Windows Security Event Logs Using ML Techniques"

-----------------------------------------------------------------------------------------------------------------------------

Why do we need math in Cybersecurity? Does it have AI?

Because math is fun!.. (sort of).

If you’re living in the same world as I am—where everything is AI nowadays (written circa April-May 2025)—it’s fair to ask: What exactly is AI?

IBM defines it as ‘technology that enables computers and machines to simulate human learning, comprehension, problem-solving, decision-making, creativity, and autonomy.’

Why This Matters for Cybersecurity:
While definitions vary, the core idea remains: AI mimics human-like reasoning to automate tasks. But how does this translate to detecting threats or analyzing logs? Let’s demystify the hype.

1. We don't have to understand "all math"; "a little math" is usually enough.

In very simple terms: something that can act like a human, something that can "learn", "solve problems", "decide", has creativity and "autonomy".

Breaking Down the Semantics 🔍:

‘Learn’ & ‘Solve Problems’: These imply a feedback loop where the system improves over time, predicting outcomes based on historical data (e.g., a linear regression model refining its predictions as it processes more information).

2. Best Example I could find of what a Linear regression is.
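To make the "predicting outcomes based on historical data" idea a bit more tangible, here is a minimal sketch using scikit-learn. The data is synthetic: I am assuming a made-up relationship (y = 3x + 2 plus noise) purely for illustration, and `fit_slope` is a hypothetical helper name. The same model is fitted on a little history and then on a lot of it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic "historical data": y = 3x + 2 plus random noise (made up for illustration).
x = rng.uniform(0, 10, size=500).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0 + rng.normal(0, 2.0, size=500)


def fit_slope(n_samples):
    """Fit a line on the first n_samples points and return the learned slope."""
    model = LinearRegression().fit(x[:n_samples], y[:n_samples])
    return model.coef_[0]


slope_small = fit_slope(5)    # very little history
slope_large = fit_slope(500)  # much more history

print(f"slope from 5 points:   {slope_small:.2f}")
print(f"slope from 500 points: {slope_large:.2f}  (true slope is 3.00)")
```

With more data the estimated slope typically settles closer to the true value, which is all that "learning" means for this kind of model.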

‘Decide’ & ‘Autonomy’: At its simplest, this could resemble rule-based logic—like an if() statement that triggers an action when a condition is met (e.g., “if a login attempt fails 5 times, block the account”). By this definition, even basic code exhibits ‘autonomy’ since it executes decisions without human intervention (though you designed the rules).
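As a rough sketch of that rule in Python (the threshold comes from the example above; the function name and the list of failed logins are hypothetical):

```python
from collections import Counter

MAX_FAILED_ATTEMPTS = 5  # threshold taken from the example in the text


def accounts_to_block(failed_logins):
    """Return the accounts whose failed-login count reaches the threshold."""
    counts = Counter(failed_logins)
    return sorted(user for user, n in counts.items() if n >= MAX_FAILED_ATTEMPTS)


# Hypothetical stream of failed login attempts (one entry per failure).
failed_logins = ["alice", "bob", "bob", "bob", "alice", "bob", "bob", "carol"]

blocked = accounts_to_block(failed_logins)
print(blocked)
```

Nothing here is learned from data: the "decision" is just the condition a human wrote down.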

Very cool, Diego, but now that I know what AI is... what is Machine Learning? And how do we link all this to Cybersecurity?

What is Machine Learning? Does it have AI?

Machine Learning is "the use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyse and draw inferences from patterns in data."

3. Self-explanatory

In simple terms, we feed a lot of data to an algorithm, shake it very well, and see what happens.

It also tends to be written in Python. Libraries like NumPy, pandas, Matplotlib, SciPy, scikit-learn, TensorFlow, PyTorch, Keras, etc. are some of the most common ones for these tasks.

How does Machine Learning "Learn"?

There are several types of Machine Learning; the 2 main ones we will cover here are:

Supervised
Unsupervised

4. The Machine is Learning

Let me walk you through the slide above. I am using pictures of cats & dogs simply because I like cats; you can extrapolate this example to any concept you want. Here we are building a model that will tell us whether something is a cat or not.

Supervised Learning

We feed images of cats to our model along with their tags. A tag (or label) just means we tell the model: here is the image, and here is the text associated with it, which says it is a cat.

5. Zoomed-in Supervised Learning

You can feed as many images as you want; in my case I have only used those 2 brown kittens. So let's assume the model is trained.

Now, let's ask some questions by sending some images to our model for evaluation.

Is the first picture a cat? - Answer: Yes

Why is it a cat? Well, it looks like one, we have trained it with brown furballs, so it works, they are similar enough.

Is the second picture a cat? - Answer: No

Wait, what? I can clearly see it's a cat! Well, that is because you are a human and have a good understanding of what a cat is. This Machine Learning model, however, has learnt that these tiny cats have their ears down. It has never seen a cat facing the camera, so the outcome is that it is NOT a cat.

Is the third picture a cat? - Answer: No

This one is even worse: it's orange! Can cats even be orange? Well, not for our Machine Learning model, which classifies it as NOT a cat.
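The three answers above can be reproduced with a toy sketch. Everything in it is an assumption for illustration: instead of real images I use two hand-made features per picture (how brown the fur is, and whether the ears are down), I add two "white puppy" examples so the model has something that is not a cat, and I use a 1-nearest-neighbour classifier from scikit-learn as a stand-in for whatever model you would really train.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features per image: [fur brownness (0-1), ears down (1) or up (0)]
X_train = np.array([
    [1.0, 1.0],  # brown kitten, ears down -> cat
    [0.9, 1.0],  # brown kitten, ears down -> cat
    [0.0, 0.0],  # white puppy, ears up    -> not a cat
    [0.1, 0.0],  # white puppy, ears up    -> not a cat
])
y_train = np.array([1, 1, 0, 0])  # the "tags": 1 = cat, 0 = not a cat

# The model labels a new image with the tag of its single closest training example.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

X_new = np.array([
    [0.95, 1.0],  # picture 1: brown cat, ears down
    [0.90, 0.0],  # picture 2: brown cat facing the camera, ears up
    [0.30, 0.0],  # picture 3: orange cat, ears up
])
predictions = model.predict(X_new)

names = ["brown, ears down", "brown, ears up", "orange, ears up"]
for name, pred in zip(names, predictions):
    print(f"{name}: {'cat' if pred == 1 else 'NOT a cat'}")
```

The model is not "wrong" about the math; it simply never saw a cat with its ears up, so the closest thing it knows is a puppy.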

Unsupervised Learning

Unsupervised learning is my favourite, and I would say the most useful for Cyber security purposes. In short, you feed a bunch of stuff to your model, and ask it to find patterns.

6. Zoomed-in Unsupervised Learning

In a typical unsupervised learning scenario, imagine feeding the model a dataset of cat and dog images. Without explicit labels, the algorithm identifies patterns—like fur texture, ear shape, or color—to group similar images.

The model might cluster dark-brown kittens into one group and white puppies into another, purely based on visual similarities it detects. These clusters emerge from the algorithm’s analysis, not predefined rules.
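A minimal sketch of that grouping, again with made-up numeric features standing in for real images (the feature names and values are assumptions), using K-Means from scikit-learn. Note that no labels are passed in at any point.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features per image: [fur darkness (0-1), ear pointiness (0-1)]
images = np.array([
    [0.90, 0.80], [0.85, 0.90], [0.95, 0.85],  # these happen to be dark-brown kittens
    [0.10, 0.20], [0.15, 0.10], [0.05, 0.15],  # these happen to be white puppies
])

# Ask for 2 groups; the algorithm decides which image goes where.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(images)
labels = kmeans.labels_

print(labels)  # cluster IDs are arbitrary; only the grouping matters
```

The output only says "these three belong together, and so do those three". Deciding that one group means "kittens" is still up to you.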

Cyber Security & Machine Learning: Real-World Applications

Now that we’ve covered the basics of AI and ML, let’s dive into practical tools—specifically clustering algorithms (unsupervised learning) for pattern discovery in data:

Clustering Algorithms:

K-Means: Groups data into k distinct clusters based on distance.
Hierarchical Clustering: Builds nested clusters using tree-like structures.
DBSCAN: Identifies dense regions, ideal for noisy data (e.g., irregular logins).
Mean Shift: Discovers clusters by shifting points toward high-density areas.
Gaussian Mixture Models (GMM): Uses probability distributions for flexible clustering.
Spectral Clustering: Leverages graph theory for complex pattern recognition.
OPTICS: Similar to DBSCAN but handles varying densities (useful for network traffic).
Agglomerative Clustering: The bottom-up flavour of hierarchical clustering; starts with every point as its own cluster and repeatedly merges the closest ones.
BIRCH: Optimized for large datasets via clustering feature trees.

They all have very fancy names but serve the SAME purpose - What patterns exist in this data?

It is important to keep in mind that not every Machine Learning technique suits every problem. These are some common applications for Supervised Learning:

Malware Detection
Phishing Detection
Intrusion Detection Systems
User Behaviour Analytics
Access Controls

And some for Unsupervised Learning:

Zero-Day Malware
Network Anomaly
Data Exfiltration

Does that mean that these techniques are limited to these use cases? Big NO. These are just common applications; you can do whatever you want, and I really mean it. You can mix and match techniques to detect anything you want.

In my experience, I lean more towards Threat Hunting, and I use clustering far more for Detection Engineering. Overall, it's about finding something that either fits a pattern or doesn't.

Clustering Windows Security Event Logs - Finding Evil?

Why would we want to cluster Windows Security Event Logs? When analyzing massive volumes of logs, even seasoned experts can miss subtle patterns. Clustering automates this process—letting ML algorithms surface hidden trends, outliers, or attack signatures you might overlook.

Clustering algorithms like OPTICS often require parameter tuning (e.g., distance thresholds, minimum samples).

How do you choose?

Experiment: Test different values against your dataset.
Iterate: Adjust based on cluster quality metrics (e.g., silhouette score).
Domain Knowledge: Prioritize features tied to security events (e.g., logon IDs, source IPs).
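The experiment-and-iterate loop above could look roughly like this. It is a sketch under assumptions: the data is synthetic (three blobs from `make_blobs` standing in for numeric log features), and the `min_samples` values tried are arbitrary. Noise points (label -1) are left out before scoring, because the silhouette score only makes sense for points that belong to a cluster.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for numeric log features: three blobs in 2D.
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [8, 8], [0, 8]], cluster_std=0.6, random_state=0
)

results = {}
for min_samples in (5, 10, 20):
    labels = OPTICS(min_samples=min_samples).fit_predict(X)
    clustered = labels != -1  # OPTICS marks noise points with -1
    n_clusters = len(set(labels[clustered].tolist()))
    # Silhouette needs at least 2 clusters; score only the non-noise points.
    if n_clusters >= 2:
        score = silhouette_score(X[clustered], labels[clustered])
    else:
        score = float("nan")
    results[min_samples] = (n_clusters, score)
    print(f"min_samples={min_samples}: clusters={n_clusters}, silhouette={score:.3f}")
```

A higher silhouette score means tighter, better separated clusters, but it says nothing about whether the clusters are meaningful for security, which is where the domain knowledge comes in.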

There is no single right answer: experiment with your problem, your logs, different algorithms, etc. I cannot cover an in-depth explanation of all algorithms and all techniques in this post.

My Feature Selection for Windows Logs

For this demo, I’ve clustered logs using OPTICS with these features:

signature (event type)
source.ip & source.port (origin of activity)
windows.logonId (user session)
windows.taskCategory (action category)

Why These?
They represent core attributes of a Windows security event. However, feature selection is flexible—add/remove fields like user.name or event.code based on your goals.
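One possible way to turn those fields into numbers an algorithm can work with is sketched below. This is not necessarily how the plots in this post were produced: the events are invented, one-hot encoding the text fields and scaling the port are my assumptions, and `min_samples=3` is chosen only because the sample is tiny.

```python
import pandas as pd
from sklearn.cluster import OPTICS
from sklearn.preprocessing import StandardScaler

# Hypothetical events; field names follow the list above, values are made up.
events = pd.DataFrame({
    "signature": ["4624"] * 5 + ["5140"] * 5 + ["4648"] * 2,
    "source.ip": ["10.0.0.5"] * 5 + ["10.0.0.9"] * 5 + ["10.0.0.77", "10.0.0.78"],
    "source.port": [50001, 50002, 50003, 50004, 50005,
                    51010, 51011, 51012, 51013, 51014, 4444, 31337],
    "windows.logonId": ["0x3e7"] * 5 + ["0x1a2b"] * 5 + ["0x9f01", "0x9f02"],
    "windows.taskCategory": ["Logon"] * 5 + ["File Share"] * 5 + ["Logon"] * 2,
})

# Text fields become one 0/1 column per distinct value (one-hot encoding).
categorical = ["signature", "source.ip", "windows.logonId", "windows.taskCategory"]
encoded = pd.get_dummies(events[categorical], dtype=float)

# The port is numeric, so scale it to a range comparable with the 0/1 columns.
encoded["source.port"] = StandardScaler().fit_transform(events[["source.port"]]).ravel()

labels = OPTICS(min_samples=3).fit_predict(encoded.to_numpy())
events["cluster"] = labels  # -1 means OPTICS considered the event noise

print(events[["signature", "source.ip", "source.port", "cluster"]])
```

How you encode each field changes what "similar" means to the algorithm, so it is worth trying a few encodings before trusting any cluster.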

Now I will show you how this clustering looks.

7. OPTICS clustering applied to Windows Security Event Logs

What can we learn from these 2 pictures? ... ... nothing !

And this is probably the most important lesson here: clustering without domain-specific knowledge means nothing! Clustering algorithms identify mathematical similarities, not malicious intent.

The algorithm is simply finding "clusters" of patterns that are similar enough, or weird enough, based on math. Does that mean that everything that is "weird" or "not similar" is bad in cyber security? Not really, and the same goes the other way: behaviours that are very common are not automatically benign.

What I mean is: think of ML as a super powerful tool to detect the unknown. We do not know what we are looking for, so we apply some ML to get pointers and ideas.

Key Lessons:

Not All Anomalies Are Threats: A rare event could be a false positive (e.g., an admin working late) or a true threat (e.g., a compromised account).
Common ≠ Safe: Frequent patterns (e.g., daily logins) might hide credential-stuffing attacks.
ML as a Magnifying Glass: It highlights “unusual” or “repetitive” patterns for you to investigate—not to replace human judgment.

Of course, do not miss the point here, we can play with our parameters and filter out the noise much more to find anomalies (which is a very good use case for unsupervised learning).
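A small sketch of that idea, using DBSCAN (from the list earlier) on synthetic data. The "normal" activity and the two odd points are invented, and the `eps` and `min_samples` values are arbitrary choices for this toy data. Points that do not fit any dense region get the label -1, and those become the candidates to investigate.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 "routine" events packed together plus 2 odd ones far away.
normal = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(200, 2))
odd = np.array([[5.0, 5.0], [-4.0, 6.0]])
X = np.vstack([normal, odd])  # the odd events sit at row indices 200 and 201

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN labels points that belong to no dense region as -1 (noise).
candidates = np.where(labels == -1)[0]
print("rows to investigate:", candidates.tolist())
```

Depending on the parameters, a few routine events at the edge of the dense region may show up as candidates too, which is the "not all anomalies are threats" lesson in practice.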

Anomalies != Malware, but close enough.

Does that mean we can use clustering to detect malware in our environment? Absolutely yes, but with caveats.

While clustering can identify suspicious patterns indicative of malware, it’s not a standalone solution. Think of it as a force multiplier, not a replacement for EDR or signature-based tools. Success depends on:

Data Quality: Are your logs capturing relevant IOCs?
Tuning: Adjusting parameters to reduce noise.
Context: Correlating clusters with threat intelligence.

And now, because graphics are very cool, here is a representation of different clustering techniques and how they look on the same sample data source (Windows Security Event Logs), with default parameters on all of them. The point to prove here is that each algorithm will create a different number of clusters, and it is up to us to investigate further and understand what each cluster means.
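If you want to try this kind of comparison yourself, here is a rough sketch. It is not the code behind the plots in this post: the data is synthetic (`make_blobs`), only a subset of the algorithms is included, and apart from `random_state` and `n_init` (set for reproducibility) everything runs with scikit-learn's defaults.

```python
import numpy as np
from sklearn.cluster import (
    DBSCAN, OPTICS, AgglomerativeClustering, Birch, KMeans, MeanShift,
)
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for encoded log features.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=7)

algorithms = {
    "K-Means": KMeans(n_init=10, random_state=0),
    "DBSCAN": DBSCAN(),
    "OPTICS": OPTICS(),
    "Mean Shift": MeanShift(),
    "Agglomerative": AgglomerativeClustering(),
    "BIRCH": Birch(),
    "GMM": GaussianMixture(random_state=0),
}

cluster_counts = {}
for name, algo in algorithms.items():
    labels = algo.fit_predict(X)
    # -1 is the "noise" label used by DBSCAN/OPTICS, not a real cluster.
    cluster_counts[name] = len(set(labels.tolist()) - {-1})
    print(f"{name:15s} -> {cluster_counts[name]} clusters")
```

Keep in mind that for several of these algorithms the number of clusters is itself a default parameter (for example, Agglomerative Clustering is told to find 2), so part of the difference comes from the defaults rather than from the data.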

To put the malware point in concrete terms: if you are executing a reverse shell and it shows up in your proxy, firewall, or app-based logs, but it is obfuscated enough to bypass your EDRs and static controls, this method has a high probability of flagging it as something different, since standard requests simply don't "look obfuscated".

Now, super cool comparison between Clustering Algorithms, represented in 2D and 3D

2D representations: the algorithms produce different groupings, ranging from 2 to 11 clusters.

8. 2D Clustering representation of all techniques mentioned above

3D representation: sometimes clusters are easier to spot in a 3D plot. Remember that we can only plot so many dimensions, and this example has 5 main features, so we use PCA to project them down for plotting.

9. 3D Clustering representation of all techniques mentioned above
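For reference, projecting 5 features down to 2 or 3 plottable dimensions with PCA can be sketched like this. The input is random data standing in for 5 encoded log features (an assumption for illustration), and scaling before PCA is a common choice rather than something required.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))  # stand-in for 100 events with 5 encoded features

# Put all features on the same scale so no single one dominates the projection.
X_scaled = StandardScaler().fit_transform(X)

coords_2d = PCA(n_components=2).fit_transform(X_scaled)  # for a 2D scatter plot
pca_3d = PCA(n_components=3).fit(X_scaled)
coords_3d = pca_3d.transform(X_scaled)                    # for a 3D scatter plot

kept = pca_3d.explained_variance_ratio_.sum()
print(coords_2d.shape, coords_3d.shape)
print(f"share of variance kept in 3D: {kept:.2f}")
```

The explained variance ratio tells you how much information survives the projection; if it is low, clusters that look merged in the plot may still be well separated in the original 5 dimensions.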

You can see that, for most techniques, the number of clusters stays "more or less" stable at somewhere around 5-7.

This is where the real Cyber Security knowledge comes into play. Why am I getting 5-7 clusters on average? Does that mean that there are only 5-7 types of activity happening in my environment?

Before we wrap up, full disclosure: the logs I used for these clusters spanned just one week from a relatively quiet environment. Most events were routine, like codes 4624 (successful logon), 4648 (logon attempted using explicit credentials), and 5140 (network share access).

The (Obvious) Catch
Unsurprisingly, the model learned to cluster events primarily by their event codes—something a human could do with a simple filter. 😅

Clearly I did something wrong. Think about what I have just done: we fed all the event codes into an algorithm, and the algorithm told us, "here you go, 5 different event codes", which is not surprising at all!

Your Homework: Leveling Up This Technique

How would you improve this approach to detect meaningful anomalies instead of just event types?

Hint: Think about feature engineering, data enrichment, or blending supervised/unsupervised methods.

Share your ideas via:

📝 Comments
📧 Email
📠 Fax
📨 Certified letter
🐦 Pigeon Carrier (Preferred)

Final Remarks

This has been a super quick overview of what AI is, what Machine Learning is, which types of Machine Learning algorithms we have, how we use them, recommended use cases, and a live example comparing them.

I know this is a lot to process, and I have not gotten into any depth of the math whatsoever, but you can probably imagine that the better you understand the algorithm and the problem you are trying to solve, the better outcomes you get, meaning math is somewhat important.

If you build cool stuff with this, or this sparked any ideas, please do let me know. I am very interested in knowing what's out there and how to improve!

Leaving you with some ideas to build on your own: finding patterns in Proxy, Firewall, IDS, WMI, etc. logs, or anything you can get known malicious samples for. Either via clustering or supervised learning, there is a high chance of finding evil here.

Code used to generate all graphs in here.

Consider taking SEC595: Applied Data Science and AI/Machine Learning for Cybersecurity Professionals if this was interesting enough to learn more about it!

References & More Useful Information

My GitHub, full code explained in this post can be found - HERE
AI For everyone, recommended course - HERE
Supervised Learning - Extra Info - HERE
Unsupervised Learning - Extra Info - HERE
Pandas Tutorial - HERE
Pandas Tutorial - Video format - HERE
Code to Use all Clustering Techniques - HERE
