AI Security & RAG Architectures - How do we secure AI Systems?

Hola everyone, I´m Diego from diegowritesa.blog

After taking all SANS AI courses, SEC495, SEC545 & SEC595 - I felt like sharing some of my thoughts on how security fits into the AI world, and why it’s so important to consider from the very beginning.

You can find me on L inkedin & T witter - These blogs take quite a bit of time to put together, and they’re all just meant to share knowledge. Any comments, requests, or connections are always much appreciated — love you all! ❤️

As last time, do not worry, I will leave a link to my GitHub at the very end under "References & More Useful Information" so you can copy everything if you’d like.

Today, we’re talking about securing RAG AI architectures — whether on-prem or in the cloud.

-----------------------------------------------------------------------------------------------------------------------------

Executive Summary

"How do we secure RAG AI Systems? Either local or cloud/managed architectures, review & opinions"

-----------------------------------------------------------------------------------------------------------------------------

AI Architectures - Localhost

Before we dive in, it’s important to highlight a key limitation: LLMs are flawed by design when it comes to execution control. Unlike a traditional operating system, they don’t enforce a separation between User and Kernel modes. In other words, even if we label prompts as SYSTEM or USER, once the input reaches the LLM, it can all be treated the same — meaning anything could potentially be executed.

That said, we can still design security controls around the model to minimize risks and reduce the chances of things going wrong.

Let’s get into AI architectures. To start, I’ll share a few diagrams and walk you through them step by step. I’ll be focusing on RAG-based architectures, since I’ve seen their popularity growing rapidly.

Here’s the very first architecture I put together when I started experimenting with AI:

1. Local RAG AI Architecture

Note that this setup runs entirely on localhost, which means you can easily replicate it on your own laptop without much trouble. You’ll want around 16GB of RAM and a reasonably modern graphics card so things don’t run painfully slow.

First step, get yourself some Jupyter Labs, I am a fan of python, and personally I just wanted to understand how all these pieces connect, so found Jupyter to be the perfect playground for it.

In Jupyter I installed Ollama to host my models, and then created the RAG pipeline with a couple classes, documents/knowledge is stored somewhere in PDF, and you can index it in the Vector DB of your choice. In here I strongly recommend playing with custom embeddings/ranking to see what works best for your use case.

Now, let’s expand on the idea of adding Security & Controls to our pipeline. The big question is: how can we better manage our inputs and outputs?

One approach is to introduce additional agents — as many as you need — both before and after the core LLM that handles retrieval. These agents act as filters, controlling the flow of data in and out.

Here’s how that looks with a simple modification to the architecture we saw above:

2. Add as many Agents as you want, each agent has a $ cost $

Think of the Auditor and Refiner as a couple of “agents” running alongside the pipeline. In practice, I’m just reusing the same LLM for all queries — but you can abstract this as much as you like. Different agents, different LLMs… it doesn’t really matter; the pattern stays the same.

So, what are these agents doing?

Refiner → Summarizes and rewrites incoming data into a positive, filtered query. The goal is to reduce the risk of prompt injections.
Auditor → Sits on the outgoing flow and checks whether the response actually answers the original question. If it doesn’t, it simply requests another response.

How can this be improved?

Use different models for different agents.
Add a form of “chain of thought” tracking to monitor how the LLM is reasoning.
Log input/output data for better visibility and auditing.
Explore other custom layers depending on your risk profile and use case.

To wrap up this localhost chapter, I want to share an example of a lightweight, efficient architecture you can replicate yourself. It’s designed as a self-hosted, privacy-preserving MVP — balancing quick prototyping, strong privacy, and room for future scalability. Of course, feel free to adapt it to your own needs and use cases.

3. Simplified-efficient RAG AI Architecture proposal - localhost

Architecture Overview

Model Runtime:

Ollama as local runtime for small/medium LLMs.
High-end setups can run LLaMA 70B / Qwen2 72B for stronger reasoning.
Prioritizes on-device execution to protect confidential article archives.

Embeddings & Retrieval:

Two Domain-tuned embeddings, better data quality, depending on your article-type.
ChromaDB (SQLite) as lightweight vector DB for ~2,000 articles. o Migration path to pgvector / Pinecone for larger datasets.

Orchestration:

LlamaIndex manages ingestion, retrieval, and query execution.
Clean abstractions, built-in RAG pipeline, easy migration to managed services.

User Interface:

Gradio for rapid prototyping and stakeholder demos.
Allows quick sharing and testing across distributed teams.
Future upgrade path: Streamlit (dashboards) or Next.js + TipTap (collaborative editing).

Deployment:

Local Docker Compose for modular, reproducible, and cost-free deployment.
Scales later to cloud VMs / Kubernetes for production workloads.

AI Architectures - Cloud/API-based

Let's shift our focus to cloud/API-based AI architectures, building on the localhost setups we discussed earlier. These leverage managed cloud services for scalability, easier collaboration, and access to powerful APIs, but they introduce considerations like costs, data egress, and dependency on third-party providers.

This approach is ideal if you need to handle larger datasets (e.g., 10,000+ documents), support multiple users, or integrate with enterprise tools, but it requires careful privacy management to avoid exposing sensitive data.

Basic RAG pipeline in AWS

4. AWS RAG

Basic RAG pipeline in Azure

5. Azure RAG

Start with a cloud VM to host your orchestration logic. Use API-based embeddings to vectorize documents, store them in a managed vector DB for fast retrieval, and query an LLM API for generation. Tools like LangChain make it easy to chain these components.

Pros: Scalable retrieval for massive knowledge bases; no local hardware limits. Cons: API costs (~$0.02/1K tokens for embeddings) and potential latency from network calls.

Expanding with Security & Controls

To add security and controls in a cloud environment, we can introduce API gateways and monitoring layers. This modifies the basic pipeline to include input/output guards, similar to the local agents but scaled for cloud.

AWS provides basic US/EU Guardrails to comply with AI regulators and provide some security around your prompts, this is how they work:

6. AWS Guardrails + Bedrock

Essentially the same setup I showed above in localhost, but in a managed service.

These "guards" are serverless functions (e.g., AWS Lambda or Azure Functions) running lightweight LLMs or rule-based filters:

Input Guard: Uses an API like OpenAI's moderation endpoint to detect and rewrite malicious prompts (e.g., prompt injections), ensuring only safe queries reach the core LLM.
Output Guard: Checks responses for compliance (e.g., no PII leaks, factual alignment) using a secondary API call to a smaller model like GPT-3.5-turbo.

Remember, a "guard" or a "chain of thought" is yet another LLM call, which per se is not bad, but in a cloud or managed environment can incur into relevant costs over time. The usual trade-off between money and security :D

On top of these guards, you also have moderation agents, to kind of filter "PII" or "categories", think of it as an LLM that acts as your proxy. You want to prevent your LLM from telling your users how to build a nuclear warhead? sure thing, add a moderation LLM step in the middle and ban the "nuclear weapons" category.

Final Remarks

Choosing your RAG architecture is not an easy task, specially now with the AI field evolving every single week, at the time of writing this, META has just released a new form of RAG - REFRAG that makes everything x30 faster.

What I wanted to make very clear with this post is that it does not matter the architecture you choose, as long as you have in mind security before building your AI system. This is specially important for non-deterministic systems, where permissions & access can easily lead to a compromise. All it takes is, one of your developers downloading a compromised model from Hugging Face to execute a backdoor in your organization, as such, we must be careful.

Remember the trade-off between local vs managed/cloud. Amazon SageMaker, OpenAI integrations & MLOps best practices. Also important to remember the new AI regulations (specially in EU) about being an AI provider, something to take into account before deploying.

Important Logging capabilities:

Get Infrastructure logs, wherever your AI system might be running, Kubernetes, self-hosted, AWS CloudTrail etc..
GenAI app logs, chain of thought
Prompts & requests logs, Q&A

I know I have not touched MCPs, Agentic, Supply Chain, prompt injection, deserialization, etc.., that will come in a later post.

This has been a quick overview on AI Security on building and securing RAG AI Systems, as always, if you have any questions or anything else you would like to discuss, please do reach out. I read all comments!

Consider taking SEC595: Applied Data Science and AI/Machine Learning for Cybersecurity Professionals if this was interesting enough to learn more about it!

References & More Useful Information

My GitHub, full code explained in this post can be found - HERE
AI For everyone, recommended course - HERE
Code to Use all Clustering Techniques - HERE

Search This Blog

Diego Writes a Blog