AI Security & RAG Architectures - How do we secure AI Systems?
Hola everyone, I´m Diego from diegowritesa.blog
After taking all SANS AI courses, SEC495, SEC545 & SEC595 - I felt like sharing some of my thoughts on how security fits into the AI world, and why it’s so important to consider from the very beginning.
You can find me on Linkedin & Twitter - These blogs take quite a bit of time to put together, and they’re all just meant to share knowledge. Any comments, requests, or connections are always much appreciated — love you all! ❤️
As last time, do not worry, I will leave a link to my GitHub at the very end under "References & More Useful Information" so you can copy everything if you’d like.
Today, we’re talking about securing RAG AI architectures — whether on-prem or in the cloud.Executive Summary
"How do we secure RAG AI Systems? Either local or cloud/managed architectures, review & opinions"AI Architectures - Localhost
That said, we can still design security controls around the model to minimize risks and reduce the chances of things going wrong.
Here’s the very first architecture I put together when I started experimenting with AI:
Now, let’s expand on the idea of adding Security & Controls to our pipeline. The big question is: how can we better manage our inputs and outputs?
One approach is to introduce additional agents — as many as you need — both before and after the core LLM that handles retrieval. These agents act as filters, controlling the flow of data in and out.
Here’s how that looks with a simple modification to the architecture we saw above:
So, what are these agents doing?
- Refiner → Summarizes and rewrites incoming data into a positive, filtered query. The goal is to reduce the risk of prompt injections.
- Auditor → Sits on the outgoing flow and checks whether the response actually answers the original question. If it doesn’t, it simply requests another response.
How can this be improved?
- Use different models for different agents.
- Add a form of “chain of thought” tracking to monitor how the LLM is reasoning.
- Log input/output data for better visibility and auditing.
- Explore other custom layers depending on your risk profile and use case.
To wrap up this localhost chapter, I want to share an example of a lightweight, efficient architecture you can replicate yourself. It’s designed as a self-hosted, privacy-preserving MVP — balancing quick prototyping, strong privacy, and room for future scalability. Of course, feel free to adapt it to your own needs and use cases.
Architecture Overview
- Ollama as local runtime for small/medium LLMs.
- High-end setups can run LLaMA 70B / Qwen2 72B for stronger reasoning.
- Prioritizes on-device execution to protect confidential article archives.
- Two Domain-tuned embeddings, better data quality, depending on your article-type.
- ChromaDB (SQLite) as lightweight vector DB for ~2,000 articles. o Migration path to pgvector / Pinecone for larger datasets.
- LlamaIndex manages ingestion, retrieval, and query execution.
- Clean abstractions, built-in RAG pipeline, easy migration to managed services.
- Gradio for rapid prototyping and stakeholder demos.
- Allows quick sharing and testing across distributed teams.
- Future upgrade path: Streamlit (dashboards) or Next.js + TipTap (collaborative editing).
- Local Docker Compose for modular, reproducible, and cost-free deployment.
- Scales later to cloud VMs / Kubernetes for production workloads.
AI Architectures - Cloud/API-based
This approach is ideal if you need to handle larger datasets (e.g., 10,000+ documents), support multiple users, or integrate with enterprise tools, but it requires careful privacy management to avoid exposing sensitive data.
Basic RAG pipeline in AWS
Basic RAG pipeline in Azure
Start with a cloud VM to host your orchestration logic. Use API-based embeddings to vectorize documents, store them in a managed vector DB for fast retrieval, and query an LLM API for generation. Tools like LangChain make it easy to chain these components.Pros: Scalable retrieval for massive knowledge bases; no local hardware limits. Cons: API costs (~$0.02/1K tokens for embeddings) and potential latency from network calls.
Expanding with Security & Controls
To add security and controls in a cloud environment, we can introduce API gateways and monitoring layers. This modifies the basic pipeline to include input/output guards, similar to the local agents but scaled for cloud.Essentially the same setup I showed above in localhost, but in a managed service.
- Input Guard: Uses an API like OpenAI's moderation endpoint to detect and rewrite malicious prompts (e.g., prompt injections), ensuring only safe queries reach the core LLM.
- Output Guard: Checks responses for compliance (e.g., no PII leaks, factual alignment) using a secondary API call to a smaller model like GPT-3.5-turbo.
Remember, a "guard" or a "chain of thought" is yet another LLM call, which per se is not bad, but in a cloud or managed environment can incur into relevant costs over time. The usual trade-off between money and security :D
On top of these guards, you also have moderation agents, to kind of filter "PII" or "categories", think of it as an LLM that acts as your proxy. You want to prevent your LLM from telling your users how to build a nuclear warhead? sure thing, add a moderation LLM step in the middle and ban the "nuclear weapons" category.
Final Remarks
Choosing your RAG architecture is not an easy task, specially now with the AI field evolving every single week, at the time of writing this, META has just released a new form of RAG - REFRAG that makes everything x30 faster.
What I wanted to make very clear with this post is that it does not matter the architecture you choose, as long as you have in mind security before building your AI system. This is specially important for non-deterministic systems, where permissions & access can easily lead to a compromise. All it takes is, one of your developers downloading a compromised model from Hugging Face to execute a backdoor in your organization, as such, we must be careful.
Remember the trade-off between local vs managed/cloud. Amazon SageMaker, OpenAI integrations & MLOps best practices. Also important to remember the new AI regulations (specially in EU) about being an AI provider, something to take into account before deploying.
Important Logging capabilities:
- Get Infrastructure logs, wherever your AI system might be running, Kubernetes, self-hosted, AWS CloudTrail etc..
- GenAI app logs, chain of thought
- Prompts & requests logs, Q&A
I know I have not touched MCPs, Agentic, Supply Chain, prompt injection, deserialization, etc.., that will come in a later post.
This has been a quick overview on AI Security on building and securing RAG AI Systems, as always, if you have any questions or anything else you would like to discuss, please do reach out. I read all comments!
Comments
Post a Comment
Any comments / questions, please write it down in here!