Unsupervised Machine Learning for Detection Engineering

Hola everyone, I'm Diego from diegowritesa.blog

Before I start: my last post got quite a lot of traction, and some people tried claiming it as their own work. You can find me on LinkedIn & Twitter. As you can imagine, these posts take quite a bit of time to make, and they exist simply to share some knowledge, so any comments, requests, and connections are much appreciated, everyone! Love you all! ❤️

As last time, do not worry, I will leave a link to my GitHub at the very end under "References & More Useful Information" so you can copy everything if you want.

If you are asking yourself, "But Diego, why are you not posting about AI, GenAI, Agentic AI, etc.?", relaaaax. First things first: today we will discuss some basics of Machine Learning, and we will continue with more advanced AI topics in future entries.

Today’s post will serve as an introduction to key tools and their applications in cybersecurity use cases. If you’re already familiar with ML/AI concepts, jump straight to the ‘Cybersecurity & Machine Learning: Real-World Applications’ section below!

Disclaimer: I am not reinventing the wheel here. There are hundreds of thousands of tutorials, courses, and other resources on the internet that give a far more rigorous explanation of all these advanced concepts. I like teaching with examples and easy concepts; there are references at the end if you wish to expand your knowledge.

-----------------------------------------------------------------------------------------------------------------------------

Executive Summary

"Math, Myth & Machine Learning: Cybersecurity Basics with a Real-World Application
Detecting Anomalies in Windows Security Event Logs Using ML Techniques"
-----------------------------------------------------------------------------------------------------------------------------

Why do we need math in Cybersecurity? Does it have AI?

Because math is fun!.. (sort of).


If you’re living in the same world as I am—where everything is AI nowadays (written circa April-May 2025)—it’s fair to ask: What exactly is AI?

IBM defines it as ‘technology that enables computers and machines to simulate human learning, comprehension, problem-solving, decision-making, creativity, and autonomy.’

Why This Matters for Cybersecurity:
While definitions vary, the core idea remains: AI mimics human-like reasoning to automate tasks. But how does this translate to detecting threats or analyzing logs? Let’s demystify the hype.


1. We don't have to understand "all math"; "a little math" is usually enough.

In very simple terms: something that can act like a human, something that can "learn", "solve problems", "decide", has creativity and "autonomy".

Breaking Down the Semantics 🔍:

  • ‘Learn’ & ‘Solve Problems’: These imply a feedback loop where the system improves over time, predicting outcomes based on historical data (e.g., a linear regression model refining its predictions as it processes more information).

2. Best Example I could find of what a Linear regression is.
  • ‘Decide’ & ‘Autonomy’: At its simplest, this could resemble rule-based logic, like an if() statement that triggers an action when a condition is met (e.g., “if a login attempt fails 5 times, block the account”). By this definition, even basic code exhibits ‘autonomy’ since it executes decisions without human intervention (though you designed the rules).
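
To make the two bullets above concrete, here is a minimal Python sketch (not taken from the repository linked in this post). It contrasts a hand-written rule, the 5-failed-logins example, with a model that "learns" a straight line from data. The threshold constant, the function names, and the hourly login numbers are all made up for illustration.

```python
import numpy as np

# Rule-based "decision": a hard-coded threshold designed by a human.
MAX_FAILED_LOGINS = 5  # hypothetical threshold, taken from the example above


def should_block(failed_attempts: int) -> bool:
    """Return True when the account should be blocked."""
    return failed_attempts >= MAX_FAILED_LOGINS


# "Learning": fit a straight line y = slope * x + intercept to observations.
# Made-up data that follows y = 10 * x + 2 exactly.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
logins = np.array([12.0, 22.0, 32.0, 42.0, 52.0])
slope, intercept = np.polyfit(hours, logins, deg=1)


def predict_logins(hour: float) -> float:
    """Predict the number of logins for an hour the model has not seen."""
    return slope * hour + intercept


print("Block after 3 failures?", should_block(3))
print("Block after 5 failures?", should_block(5))
print("Predicted logins at hour 6:", round(predict_logins(6.0), 2))
```

The rule never changes unless a human edits the constant, while the fitted line changes whenever it is refit on new data. That difference is the "learning" part.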

Very cool, Diego, but now that I know what AI is, what is Machine Learning? And how do we link all this to Cybersecurity?

    What is Machine Learning? Does it have AI?

    Machine Learning is "the use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyse and draw inferences from patterns in data."


    3. Self-explanatory

      In simple terms, we feed a lot of data to an algorithm, shake it very well, and see what happens.

      It also tends to be written in Python. Libraries like NumPy, pandas, Matplotlib, SciPy, scikit-learn, TensorFlow, PyTorch, Keras, etc. are some of the most common ones for these tasks.

      How does Machine Learning "Learn"?

      We have 2 main types of Machine Learning:
      • Supervised
      • Unsupervised


      4. The Machine is Learning

        Let me walk you through the slide above. I am using pictures of cats & dogs simply because I like cats; you can extrapolate this example to any concept you want. Here we are building a model that will tell us whether something is a cat or not.

        Supervised Learning

        We feed images of cats to our model along with their tags. A tag just means that we tell the model: here is the image, and here is the text associated with it, which marks it as a cat.



        5. Zoomed-in Supervised Learning

        You can feed as many images as you want; in my case I have only used those 2 brown kittens. So let's assume the model is trained.

        Now, let's ask some questions: we send some images to our model for evaluation.
        • Is the first picture a cat? - Answer: Yes
        Why is it a cat? Well, it looks like one. We trained the model with brown furballs, so it works: they are similar enough.
        • Is the second picture a cat? - Answer: No
        Wait, what? I can clearly see it's a cat! Well, that is because you are a human and have a good understanding of what a cat is. This Machine Learning model has learnt that these tiny cats have their ears down, and it has never seen a cat facing the camera, thus the outcome is that it is NOT a cat.
        • Is the third picture a cat? - Answer: No
        This one is even worse, it's orange! Can cats even be orange? Well, not for our Machine Learning model, which classifies it as NOT a cat.
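
        The three answers above can be reproduced with a toy classifier. This is only a sketch under loud assumptions: instead of real images, each picture is reduced to three hand-made yes/no features, and two "not cat" examples (white puppies) are added so the model has something to compare against. The feature names and values are invented for illustration; a real image model would learn its features from pixels.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hand-made features per picture: [is_brown, ears_down, looking_away]
# (1 = yes, 0 = no). These are hypothetical stand-ins for real image data.
training_features = [
    [1, 1, 1],  # brown kitten, ears down, looking away
    [1, 1, 1],  # second brown kitten, same pose
    [0, 0, 0],  # white puppy facing the camera (assumed "not cat" example)
    [0, 0, 1],  # white puppy looking away (assumed "not cat" example)
]
training_tags = ["cat", "cat", "not cat", "not cat"]

# A 1-nearest-neighbour model answers with the tag of the most similar
# training example, which makes the effect of limited training data obvious.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(training_features, training_tags)

pictures = {
    "brown kitten, ears down": [1, 1, 1],
    "brown cat facing the camera": [1, 0, 0],
    "orange cat facing the camera": [0, 0, 0],
}
predictions = model.predict(list(pictures.values()))
for name, tag in zip(pictures, predictions):
    print(f"{name}: {tag}")
```

        The model is not "wrong" by its own logic: the second and third pictures simply look more like the training puppies than the training kittens. More varied training data is the usual fix.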

        Unsupervised Learning

        Unsupervised learning is my favourite, and I would say the most useful for Cyber security purposes. In short, you feed a bunch of stuff to your model, and ask it to find patterns.



        6. Zoomed-in Unsupervised Learning


        In a typical unsupervised learning scenario, imagine feeding the model a dataset of cat and dog images. Without explicit labels, the algorithm identifies patterns—like fur texture, ear shape, or color—to group similar images.

        The model might cluster dark-brown kittens into one group and white puppies into another, purely based on visual similarities it detects. These clusters emerge from the algorithm’s analysis, not predefined rules.
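
        As a minimal sketch of that idea, the snippet below clusters six made-up animals with K-Means from scikit-learn. No labels are passed in. The two numeric features (fur darkness and body size, both on a 0 to 1 scale) are assumptions chosen for illustration, not something extracted from real images.

```python
import numpy as np
from sklearn.cluster import KMeans

# Columns: [fur_darkness, body_size], both hypothetical values in 0..1.
animals = np.array([
    [0.90, 0.20], [0.85, 0.25], [0.95, 0.15],  # dark and small
    [0.10, 0.60], [0.15, 0.65], [0.05, 0.55],  # light and larger
])

# We only tell the algorithm how many groups to look for, not what they mean.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(animals)

for features, label in zip(animals, labels):
    print(f"features={features}, cluster={label}")
```

        The cluster numbers themselves carry no meaning. It is up to us to look at each group and decide that one is "dark-brown kittens" and the other is "white puppies".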


        Cyber Security & Machine Learning: Real-World Applications

        Now that we’ve covered the basics of AI and ML, let’s dive into practical tools—specifically clustering algorithms (unsupervised learning) for pattern discovery in data:

        Clustering Algorithms:

        • K-Means: Groups data into k distinct clusters based on distance.
        • Hierarchical Clustering: Builds nested clusters using tree-like structures.
        • DBSCAN: Identifies dense regions, ideal for noisy data (e.g., irregular logins).
        • Mean Shift: Discovers clusters by shifting points toward high-density areas.
        • Gaussian Mixture Models (GMM): Uses probability distributions for flexible clustering.
        • Spectral Clustering: Leverages graph theory for complex pattern recognition.
        • OPTICS: Similar to DBSCAN but handles varying densities (useful for network traffic).
        • Agglomerative Clustering: Merges smaller clusters hierarchically.
        • BIRCH: Optimized for large datasets via clustering feature trees.
        They all have very fancy names but serve the SAME purpose -  What patterns exist in this data?
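
        In scikit-learn this shared purpose shows up as a shared interface: most clustering classes expose fit_predict(), so swapping algorithms is mostly a one-line change. The sketch below runs four of the algorithms listed above on the same synthetic data and counts the clusters each one finds. The data and every parameter value are assumptions chosen for illustration.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, Birch, KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three well-separated blobs in 2D.
points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

algorithms = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "BIRCH": Birch(n_clusters=3),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
}

cluster_counts = {}
for name, algorithm in algorithms.items():
    labels = algorithm.fit_predict(points)
    # DBSCAN marks noise points with the label -1; do not count it as a cluster.
    cluster_counts[name] = len(set(labels) - {-1})
    print(f"{name}: {cluster_counts[name]} clusters")
```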

        It is very important to consider that not every Machine Learning technique is useful for every problem. These are some common applications for Supervised Learning:
        • Malware Detection
        • Phishing Detection
        • Intrusion Detection Systems
        • User Behaviour Analytics
        • Access Controls
        And some for Unsupervised Learning:
        • Zero-Day Malware
        • Network Anomaly 
        • Data Exfiltration 

        Does that mean that these techniques are limited to these use cases? Big NO. These are just common applications. You can do whatever you want, and I really mean it: you can mix and match techniques to detect anything you want.

        In my experience, I tend to lean towards Threat Hunting, and towards Clustering far more when it comes to Detection Engineering. Overall, it's about finding something that either fits a pattern or doesn't.


        Clustering Windows Security Event Logs - Finding Evil?

        Why would we want to cluster Windows Security Event Logs? When analyzing massive volumes of logs, even seasoned experts can miss subtle patterns. Clustering automates this process—letting ML algorithms surface hidden trends, outliers, or attack signatures you might overlook.

        Clustering algorithms like OPTICS often require parameter tuning (e.g., distance thresholds, minimum samples). 

        How do you choose?
        • Experiment: Test different values against your dataset.
        • Iterate: Adjust based on cluster quality metrics (e.g., silhouette score).
        • Domain Knowledge: Prioritize features tied to security events (e.g., logon IDs, source IPs).

        There is no single right answer: experiment with your problem, your logs, different algorithms, etc. I cannot cover an in-depth explanation of all algorithms and techniques in this post.
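
        The experiment-and-iterate loop above can be sketched in a few lines. Note the swap: for simplicity this example tunes the number of clusters k in K-Means rather than the OPTICS parameters, but the pattern (try a value, score the result, keep the best) is the same. The synthetic data and the range of k values are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated blobs in 2D.
points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Try several values of k and score each clustering.
# The silhouette score ranges from -1 to 1; higher means tighter,
# better-separated clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k by silhouette score:", best_k)
```

        A good score only says the clusters are mathematically tidy. Whether they are useful for security is still a question for domain knowledge.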

        My Feature Selection for Windows Logs

        For this demo, I’ve clustered logs using OPTICS with these features:

        • signature (event type)
        • source.ip & source.port (origin of activity)
        • windows.logonId (user session)
        • windows.taskCategory (action category)

        Why These?
        They represent core attributes of a Windows security event. However, feature selection is flexible—add/remove fields like user.name or event.code based on your goals.
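
        As a rough sketch of how such features could be turned into numbers and handed to OPTICS: the snippet below builds a small fake event table with the five fields listed above, one-hot encodes the text fields, scales everything, and clusters. This is an assumption about one reasonable preprocessing approach, not necessarily what the code in my GitHub does. The IPs, logon IDs, and event counts are invented.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import OPTICS
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)


def make_events(n, signature, ip, category, logon_id):
    """Build n fake events that differ only in their (random) source port."""
    return pd.DataFrame({
        "signature": signature,
        "source.ip": ip,
        "source.port": rng.integers(49152, 65535, size=n),
        "windows.logonId": logon_id,
        "windows.taskCategory": category,
    })


# Invented sample data: two kinds of routine activity.
events = pd.concat([
    make_events(40, "An account was successfully logged on",
                "10.0.0.5", "Logon", "0x3e7"),
    make_events(40, "A network share object was accessed",
                "10.0.0.9", "File Share", "0x1a2b"),
], ignore_index=True)

# Clustering needs numbers: one-hot encode the text fields, then scale all
# columns so that no single feature dominates the distance calculation.
categorical = ["signature", "source.ip", "windows.logonId", "windows.taskCategory"]
encoded = pd.get_dummies(events, columns=categorical, dtype=float)
features = StandardScaler().fit_transform(encoded)

labels = OPTICS(min_samples=5).fit_predict(features)
print("Feature matrix shape:", features.shape)
print("Cluster labels found:", sorted(set(labels)))
```

        One-hot encoding a high-cardinality field such as windows.logonId creates one column per distinct value, so on real data you may prefer a different encoding for it.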

        Now I will show you how this clustering looks.




        7. OPTICS clustering applied to Windows Security Event Logs


        What can we learn from these 2 pictures? ... ... Nothing!

        And this is probably the most important lesson here: clustering without domain-specific knowledge means nothing! Clustering algorithms identify mathematical similarities, not malicious intent.

        The algorithm is simply finding "clusters" of patterns that are similar enough, or weird enough, based on math. Does that mean that everything that is "weird" or "not similar" is bad in cyber security? Not really, and the same applies in reverse: behaviours that are very common are not automatically good.

        What I mean is: think of ML as a super powerful tool to detect the unknown. We do not know what we are looking for, so we apply some ML to get pointers and ideas.

        Key Lessons:
        • Not All Anomalies Are Threats: A rare event could be a false positive (e.g., an admin working late) or a true threat (e.g., a compromised account).
        • Common ≠ Safe: Frequent patterns (e.g., daily logins) might hide credential-stuffing attacks.
        • ML as a Magnifying Glass: It highlights “unusual” or “repetitive” patterns for you to investigate—not to replace human judgment.

        Of course, do not miss the point here: we can play with our parameters and filter out much more of the noise to find anomalies (which is a very good use case for unsupervised learning).
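
        Density-based algorithms make this especially easy, because they label points that do not belong to any cluster as noise (label -1 in scikit-learn). The sketch below uses DBSCAN rather than OPTICS to keep the example small and predictable; OPTICS uses the same -1 convention. The data is synthetic: 50 "routine" points packed together and one point far away.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic data: dense "routine" activity plus one point far from everything.
normal_events = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
odd_event = np.array([[5.0, 5.0]])
points = np.vstack([normal_events, odd_event])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)

# Points labelled -1 did not fit into any dense region.
anomaly_indices = np.flatnonzero(labels == -1)
print("Rows worth a closer look:", anomaly_indices.tolist())
```

        Raising eps or lowering min_samples makes the algorithm more tolerant and flags fewer rows; tightening them flags more. Either way, the flagged rows are leads for an analyst, not verdicts.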

        Anomalies != Malware, but close enough.

        Does that mean we can use clustering to detect malware in our environment? Absolutely yes, but with caveats. 

        While clustering can identify suspicious patterns indicative of malware, it’s not a standalone solution. Think of it as a force multiplier, not a replacement for EDR or signature-based tools. Success depends on:
        • Data Quality: Are your logs capturing relevant IOCs?
        • Tuning: Adjusting parameters to reduce noise.
        • Context: Correlating clusters with threat intelligence.

        And now, because graphics are very cool, here is a representation of different clustering techniques and how they look with the same sample data source (Windows Security Event Logs), using default parameters on all of them. The point to prove here is that each algorithm will create a different number of clusters, and it is up to us to investigate further and understand what each cluster means.

        To make the malware point concrete: if you are executing a reverse shell and it shows up in your proxy, firewall, or application logs, but it is obfuscated enough to bypass your EDRs and static controls, this method would have a SUPER HIGH probability of flagging it as something different, as standard requests simply don't "look obfuscated".

        Now, super cool comparison between Clustering Algorithms, represented in 2D and 3D

        2D representations: the algorithms create different numbers of clusters, ranging from 2 to 11.


        8. 2D Clustering representation of all techniques mentioned above



        3D representation: sometimes it is easier to spot clusters in a 3D plot. Remember that we can only plot so many dimensions; since this example has 5 main features, we use PCA to reduce them before plotting.



        9. 3D Clustering representation of all techniques mentioned above
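
        For reference, the PCA step behind plots like these can be sketched as follows. The 5-column matrix here is random stand-in data, not the real encoded logs, so treat the numbers it prints as meaningless; the point is the shape of the workflow.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in for 5 encoded log features (random values, for illustration only).
features = rng.normal(size=(200, 5))

# PCA is sensitive to scale, so standardise the columns first.
scaled = StandardScaler().fit_transform(features)

# Keep the 3 directions that preserve the most variance.
pca = PCA(n_components=3)
projected = pca.fit_transform(scaled)

print("Projected shape:", projected.shape)
print("Share of variance kept:", pca.explained_variance_ratio_.sum())
```

        The three columns of projected can be passed as x, y, and z to a 3D scatter plot, coloured by cluster label. Keep an eye on the share of variance kept: if it is low, the plot hides most of what the algorithm actually saw.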


        You can see that regardless of the technique, the number of clusters stays "more or less" stable at somewhere around 5-7.

        This is where the real Cyber Security knowledge comes into play. Why am I getting 5-7 clusters on average? Does that mean that there are only 5-7 types of activity happening in my environment?

        Before we wrap up, full disclosure: the logs I used for these clusters spanned just one week from a relatively quiet environment. Most events were routine, like codes 4624 (successful logon), 4648 (logon attempted with explicit credentials), and 5140 (network share access).

        The (Obvious) Catch
        Unsurprisingly, the model learned to cluster events primarily by their event codes—something a human could do with a simple filter. 😅 

        Clearly I did something wrong. Think about what I have just done: we fed all the event codes into an algorithm, and the algorithm told us, "here you have 5 different event codes", which is not surprising at all!

        Your Homework: Leveling Up This Technique

        How would you improve this approach to detect meaningful anomalies instead of just event types?

        Hint: Think about feature engineering, data enrichment, or blending supervised/unsupervised methods.
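
        To get you started on the feature engineering part of the hint, here is one possible direction (a sketch, not the "official" answer): instead of clustering individual events, summarise behaviour per source and cluster those summaries. The tiny event table below is invented; the column names follow the fields mentioned earlier in this post.

```python
import pandas as pd

# Invented sample events: which source produced which event code.
events = pd.DataFrame({
    "source.ip": ["10.0.0.5", "10.0.0.5", "10.0.0.5",
                  "10.0.0.9", "10.0.0.9", "10.0.0.7"],
    "event.code": [4624, 4624, 5140, 4624, 4648, 4648],
})

# One row per source IP, one column per event code, values = event counts.
profile = pd.crosstab(events["source.ip"], events["event.code"])
print(profile)
```

        Each row is now a behavioural fingerprint of a source, so a clustering algorithm fed with this table groups sources that behave alike instead of rediscovering the event codes.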

        Share your ideas via:
        • 📝 Comments
        • 📧 Email
        • 📠 Fax
        • 📨 Certified letter
        • 🐦 Pigeon Carrier (Preferred)

        Final Remarks

        This has been a super quick overview of what AI is, what Machine Learning is, which types of Machine Learning algorithms we have, how we use them, recommended use cases, and a live example comparing them.

        I know this is a lot to process, and I have not gotten into any depth of the math whatsoever, but you can probably imagine that the better you understand the algorithm and the problem you are trying to solve, the better outcomes you get, meaning math is somewhat important.

        If you build cool stuff with this, or if this sparked any ideas, please do let me know. I am very interested in knowing what's out there and how to improve!

        I'll leave you with some ideas to build on your own: finding patterns in Proxy, Firewall, IDS, WMI logs, etc. For anything you can get known malicious samples for, either via clustering or supervised learning, there is a high chance of finding evil.

        Code used to generate all graphs in here.

        Consider taking SEC595: Applied Data Science and AI/Machine Learning for Cybersecurity Professionals if this was interesting enough to learn more about it!

        References & More Useful Information

        • My GitHub, full code explained in this post can be found - HERE
        • AI For everyone, recommended course - HERE
        • Supervised Learning - Extra Info - HERE
        • Unsupervised Learning - Extra Info - HERE
        • Pandas Tutorial - HERE
        • Pandas Tutorial - Video format - HERE
        • Code to Use all Clustering Techniques - HERE

        Tags

        #technical #python #Clustering #MachineLearning #Python #Automation #AI #ArtificialIngelligence #memes




