Building a Scalable, Privacy-Preserving Intelligent System for Legal Contracts
The code related to this article can be found in the following repository. Feel free to reach out with any questions or suggestions.
I. The Cost of Silence in Enterprise Data
I.A. The Public AI Shortcut and Its Risks
I.B. Why Off-the-Shelf AI Often Falls Short
I.C. A Private-First Approach
II. The Architecture
II.A. The Interface
II.B. The Worker
II.C. The Inference Engine
II.D. The Vector Database
II.E. Connecting Things Together
III. The RAG Pipeline
III.A. Step 1: OCR and Chunking
III.B. Step 2: Embeddings
III.C. Step 3: Vector Search
III.D. Step 4: The LLM
III.E. The RAG Pipeline at a Glance
IV. The Secure Perimeter
IV.A. Security First: Data Sovereignty as a Default
IV.B. Scalability: The Power of KEDA
IV.C. Business Impact: Predictable Performance and Cost
V. Key Takeaways from the Implementation
V.A. Navigating the "Dirty Data" Problem
V.B. The "Private-First" Standard
V.C. The Future: From Search to Strategy
VI. The Value of Grounded Intelligence
The Context
Every company today is sitting on mountains of documents: contracts, reports, compliance files, policies, internal notes. The information is there; finding the right answer at the right time, however, is still slow and manual. In most industries, that's inefficient, but in law it's also risky: a single contract can run hundreds of pages, and a due diligence process can involve reviewing thousands. Missing a clause, a liability cap, or a termination condition can have real financial consequences; you are not just reading text, you are managing risk. That's where the real problem begins.
Generative AI promises instant answers, but for many companies sending confidential documents to public APIs is not an option: client agreements, regulatory constraints, and professional ethics make data privacy non-negotiable. Many "AI for legal" tools look impressive in demos but fail when evaluated under enterprise security requirements and privacy regulations. This is the gap this project, AlisLeg, was built to close.
AlisLeg is a private, enterprise-grade Retrieval-Augmented Generation (RAG) system designed to run entirely inside a controlled, air-gapped private cloud perimeter. It does not rely on public LLM APIs, nor does it expose documents to external services; it is designed for environments where data sovereignty is mandatory. Instead of treating AI as a chatbot feature, we treat it as infrastructure and run it inside a private virtual network on Microsoft Azure alongside the other components of the system. All document processing, embedding, indexing, and retrieval happen within the same isolated environment, which means no public endpoints, no external API calls, and no data leaving the perimeter.
At a functional level, AlisLeg allows a user to upload documents and ask natural language questions that they would otherwise answer by searching through the documents manually. The system retrieves the relevant sections, generates a grounded answer, and displays the original source text side-by-side for verification. It acts as an assistant, not a replacement: the user remains in control, but the time spent searching is dramatically reduced. Although we use the legal industry as the reference case, the architecture itself is industry-agnostic. Finance, healthcare, energy, manufacturing: any sector dealing with sensitive internal documents faces the same challenge of unlocking insight without compromising control.
This post explains how we built AlisLeg, from its modular ingestion pipeline to its private network isolation, and why we believe private, self-hosted AI systems are becoming the standard for sensitive enterprise use. The goal is simple: combine the speed of modern language models with the privacy standards that high-stakes industries require. To validate this architecture, we used the Contract Understanding Atticus Dataset (CUAD), a real-world dataset of over 500 complex commercial contracts and agreements. CUAD is a gold-standard benchmark for legal AI and an ideal choice for an MVP project: it reflects the linguistic density and regulatory complexity found in enterprise environments, ensuring the system's retrieval and reasoning capabilities are tested against professional-grade requirements rather than simplified text.
I. The Cost of Silence in Enterprise Data
In high-stakes business environments, "I don't know" is not neutral, it is expensive. In legal operations, delays in finding information slow down deals, increase risk exposure, and create stress across teams and departments. During M&A due diligence, regulatory audits, or internal compliance reviews, the ability to surface facts quickly is not a luxury: it directly affects deal timelines, negotiation leverage, and financial outcomes.
Most organizations already have the answers; the problem is that those answers are buried in piles of documents. Contracts are long, clauses are scattered, and language varies across versions. Even well-organized repositories become difficult to search once the volume grows, and traditional keyword search does not understand meaning, it only matches exact terms. The real issue is not a lack of data; it is a lack of accessible intelligence.
The standard contract review process is manual: lawyers or paralegals scan hundreds of pages looking for specific clauses, which is slow and error-prone. AlisLeg focuses on one metric: time to insight. Instead of hours of manual scanning, relevant information can be surfaced in seconds. The user still validates the answer, but the search phase is dramatically shortened, increasing operational efficiency.
I.A. The Public AI Shortcut and Its Risks
When tools like ChatGPT became widely available, many professionals tried the obvious shortcut: upload the document or paste a clause into a public interface and ask for clarification.
From a security perspective, this creates several problems, notably:
Data Exposure: Even anonymized snippets can contain sensitive information, and once the data leaves your controlled environment, you lose full visibility over how it is processed or stored (or even to whom it is sold).
Compliance Conflicts: Client agreements, professional privilege, GDPR, HIPAA, and industry regulations often prohibit transferring sensitive documents to third-party systems without strict controls.
Hallucination Risk: Public models are designed to be helpful and fluent, not to guarantee factual accuracy. In legal work, a confident but incorrect answer is worse than no answer.
There is also a technical limitation. Large language models struggle with very long inputs, so important details buried in the middle of long documents may be ignored. Simply increasing the context window does not reliably solve this, and, for enterprise use, ājust prompt it betterā is not a strategy. It is not a skill issue.
I.B. Why Off-the-Shelf AI Often Falls Short
Many AI vendors operate as multi-tenant SaaS platforms. While convenient, this also means:
- Your documents are stored outside your direct infrastructure.
- Data may coexist with other customersā data in shared environments.
- You rely on the vendorās internal controls for isolation, which are not always transparent.
For organizations with strict governance requirements, this shared model is rarely acceptable. Enterprise teams are not only buying functionality, they are buying control, auditability and predictable risk boundaries.
I.C. A Private-First Approach
AlisLeg was designed with a different starting point: keep everything inside your own cloud private perimeter. The storage, vector database, language model, and application layer all run within a private environment on Microsoft Azure, and no public AI APIs are involved, which means that no document is sent outside the virtual network. This changes the conversation: instead of choosing between AI capability and data protection, the system is built to deliver both. The goal here is not to replace human judgment but to remove the friction between questions and answers, without creating new security risks in the process.
II. The Architecture
If the goal is private, secure, and scalable AI, the architecture is what matters most. From the start, we needed a structure that could:
- Handle unpredictable ingestion workloads.
- Scale without running 24/7 infrastructure.
- Stay maintainable.
- Avoid the operational overhead and cost of full microservices.
We chose a Micro-Monolith approach that is modular but not fragmented: containers are separated by responsibility, while ensuring that the system is not split into dozens of loosely managed services. This keeps development fast while preserving scalability.
From a business perspective, we needed agility without operational chaos, since business workloads are not steady. One week may involve uploading thousands of contracts for due diligence; the next may involve only a handful of queries. Traditional infrastructure forces you to size for peak usage and pay for idle capacity, a model that does not work when cost control matters.
The Micro-Monolith approach allows:
- Independent scaling of heavy workloads (OCR, embeddings)
- Low idle cost when the system is not processing documents
- Faster feature releases without complex inter-service coordination
At the same time, we avoid the risk of a single, tightly coupled application where one failure impacts everything. It is a practical middle ground that offers the best of both worlds and a good balance between cost and performance. From a technical perspective, the system is built around four primary containers, each with a clear role.
II.A. The Interface
The user-facing layer is built with Streamlit, providing a clean and responsive entry point for users. This container handles secure query submission and renders the AI-generated responses. One of its standout features is the integrated split-screen interface, which lets users view the AI's answer alongside the original PDF evidence retrieved from the archives, providing instant verification and grounding. By design, this container remains lightweight and does not perform heavy tasks like document processing or embedding generation. This isolation ensures that the user experience remains fast and uninterrupted, even when heavy background workloads are running elsewhere in the system.
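A minimal sketch of that split-screen pattern in Streamlit, with a stubbed backend (the `run_rag_query` helper and the source format are hypothetical stand-ins, not AlisLeg's actual code):

```python
import streamlit as st

def run_rag_query(question: str):
    """Hypothetical stand-in for the real RAG backend call."""
    answer = "The agreement may be terminated upon thirty (30) days written notice."
    sources = [{"document": "MSA_AcmeCorp.pdf", "page": 12,
                "text": "Either party may terminate this Agreement upon "
                        "thirty (30) days written notice."}]
    return answer, sources

st.title("AlisLeg: Contract Q&A")
question = st.text_input("Ask a question about your contracts")
if question:
    answer, sources = run_rag_query(question)
    left, right = st.columns(2)  # split-screen: answer vs. source evidence
    with left:
        st.subheader("Answer")
        st.markdown(answer)
    with right:
        st.subheader("Source evidence")
        for src in sources:
            st.caption(f"{src['document']}, page {src['page']}")
            st.markdown(f"> {src['text']}")
```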
II.B. The Worker
The Worker is the heavy-lifting engine of the platform, a Python-based service designed to handle the entire lifecycle of document ingestion. It orchestrates a multi-stage pipeline that begins with high-fidelity OCR processing for scanned documents and layout-aware text extraction to ensure that headers, tables, and clauses are correctly identified. Once the raw text is extracted, the Worker performs semantic chunking, breaking long passages into manageable, context-rich segments, before finally generating the high-dimensional embeddings that allow for precise vector search. By centralizing these compute-heavy tasks in a dedicated container, we protect the user experience from being impacted by the background processing of large document volumes. Another approach would be to have a dedicated service for each task, but that was not the best fit for us: we wanted to move fast and keep the system "small" (a trade-off had to be made).
The Worker is managed through an event-driven architecture: it remains completely idle, consuming zero compute resources, until a message appears in the Azure Storage Queue signaling that a new document is ready for processing. We use Kubernetes-based Event-Driven Autoscaling (KEDA) to monitor this queue and automatically scale the number of Worker containers to meet demand. This "scale to zero" capability is a game-changer for operational costs: it eliminates permanent compute burn during quiet periods while ensuring the system can instantly scale to handle sudden spikes in ingestion volume. The result is a clean, high-performance separation where ingestion only consumes resources when there is actual work to be done, leaving the rest of the infrastructure lean and cost-effective.
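The consuming side of this pattern can be sketched as a simple queue-polling loop. This is illustrative only: the queue name, message shape, and `process_document` are assumptions, and under KEDA replicas of this loop are started and stopped based on queue depth:

```python
import json
import os
import time

from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], queue_name="ingest"
)

def process_document(blob_name: str) -> None:
    """Hypothetical stand-in for the real pipeline: OCR -> chunk -> embed -> index."""
    print(f"processing {blob_name}")

while True:
    # visibility_timeout keeps a message hidden from other Workers while we process it
    for msg in queue.receive_messages(max_messages=8, visibility_timeout=300):
        payload = json.loads(msg.content)   # e.g. {"blob": "contract_001.pdf"}
        process_document(payload["blob"])
        queue.delete_message(msg)           # acknowledge only after success
    time.sleep(5)                           # idle; KEDA can scale this replica away
```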
II.C. The Inference Engine
The core intelligence of the platform resides in a dedicated container powered by Ollama, which hosts the Qwen 2.5 model locally. Isolating the inference engine from the rest of the application logic gives us significant architectural control: we can apply highly specific resource allocation policies, such as pinning the container to high-memory nodes or dedicated GPU clusters, without forcing the same expensive requirements on the UI or Worker. It also ensures system-wide resilience: because the AI engine is decoupled from the data ingestion and storage layers, a model restart or upgrade does not affect the integrity of the underlying document archives or vector embeddings. This design guarantees total data sovereignty by eliminating the need for external API calls, and it provides a future-proof path for swapping or upgrading models as new SOTA models emerge.
II.D. The Vector Database
The semantic search layer is powered by Qdrant, a high-performance vector database self-hosted within our private environment. Unlike traditional keyword-based search, which often fails to capture the nuances of human language, Qdrant stores and indexes the high-dimensional embeddings generated by the Worker to enable retrieval based on semantic meaning. This allows the system to understand that "termination" and "cancellation" are conceptually linked, even if the exact words don't match. By returning only the most relevant document segments for a given query, this layer directly addresses the "lost in the middle" problem, ensuring the LLM is fed only the specific context it needs to generate a grounded, accurate answer.
II.E. Connecting Things Together
The containers are only part of the system; the surrounding cloud services handle storage, signaling, and persistence. All infrastructure runs inside Microsoft Azure using private networking. Other key components include:
- Azure Blob Storage: Stores raw uploaded contracts. Public access is disabled and access is restricted via private endpoints.
- Azure Storage Queues: Decouples file upload from processing. Uploading a document creates a queue message, which triggers the Worker (see the sketch after this list).
- Azure Files: Mounted to the Qdrant container to ensure embeddings persist even if containers restart or scale down.
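The producer side of this flow is small. A hedged sketch, assuming a `contracts` blob container, an `ingest` queue, and a JSON message shape (none of which are the project's confirmed conventions):

```python
import json
import os

from azure.storage.blob import BlobServiceClient
from azure.storage.queue import QueueClient

conn = os.environ["AZURE_STORAGE_CONNECTION_STRING"]

def upload_and_enqueue(path: str) -> None:
    blob_name = os.path.basename(path)

    # Raw PDF goes into private Blob Storage (public access disabled)
    blob_service = BlobServiceClient.from_connection_string(conn)
    blob = blob_service.get_blob_client(container="contracts", blob=blob_name)
    with open(path, "rb") as f:
        blob.upload_blob(f, overwrite=True)

    # A queue message signals the Worker; KEDA scales up on queue depth
    queue = QueueClient.from_connection_string(conn, queue_name="ingest")
    queue.send_message(json.dumps({"blob": blob_name}))

upload_and_enqueue("MSA_AcmeCorp.pdf")
```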
Containers are ephemeral by design, but storage is not; this separation ensures durability without sacrificing scalability.
A High-Level Overview
This modularity ensures that an upgrade to the LLM or a tweak in the OCR pipeline has a localized impact, significantly reducing the "blast radius" of any technical change. Perhaps most importantly, it creates a future-proof design that is ready to evolve: whether we move to a GPU based inference engine or add specialized "Vision Workers" for complex table extraction, the system is architected to adapt without requiring a fundamental redesign.
Below is the architectural layout of the AlisLeg platform, showcasing the high level relationship between our containerized components and the private Azure data layer:
III. The RAG Pipeline
At the core of AlisLeg is a Retrieval-Augmented Generation (RAG) pipeline, the engine that transforms static documents into actionable intelligence. The concept is straightforward: instead of relying on the language model's broad, probabilistic "memory", which is prone to fabrication, the system answers exclusively from our specific documents. Every response is grounded in document sections retrieved at query time, meaning the model only processes information explicitly provided from our internal, private data. This strict constraint is what keeps answers factually accurate and sharply reduces the risk of generic AI hallucination.
Crucially, to maintain our private-first mandate, all four stages of this pipeline (i.e., extraction, representation, retrieval, and generation) are executed locally within the private environment, ensuring that the entire logic flow remains completely isolated and under our direct control.
III.A. Step 1: OCR and Chunking
The first and perhaps most critical stage of our pipeline is high-fidelity extraction, as any downstream intelligence is only as good as the raw data being fed into it. Legal PDFs present a unique "stress test" for extraction: they are notoriously complex, featuring multi-column layouts, intricate tables with financial obligations, and dense footnotes that can fundamentally change the meaning of a clause.
While basic libraries often flatten these elements into a scrambled mess of raw text, we use the unstructured library within our Worker container to preserve the "visual semantics" of each document. This keeps the system layout-aware, correctly identifying titles, paragraphs, and tables even in scanned documents via OCR. Crucially, this enables semantic chunking (i.e., splitting text at actual clause boundaries rather than arbitrary character counts), which is essential for the high precision required in legal work. We considered alternatives such as PyMuPDF for raw speed and Azure AI Document Intelligence for power, but we prioritized structural awareness over the former's speed and privacy over the latter's capabilities; by running unstructured locally, we maintain total control over the data and a predictable cost structure, without information ever leaving our secure perimeter.
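A minimal sketch of this step with unstructured, assuming the package is installed with its PDF and OCR extras; the `hi_res` strategy and chunk size shown here are illustrative choices, not the project's exact settings:

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# Layout-aware parsing: titles, narrative text, and tables come back as
# typed elements rather than a flat text blob; OCR handles scanned pages.
elements = partition_pdf(filename="MSA_AcmeCorp.pdf", strategy="hi_res")

# Semantic chunking: split on title/clause boundaries instead of a fixed
# character count, so each chunk remains a coherent legal unit.
chunks = chunk_by_title(elements, max_characters=1500)

for chunk in chunks[:3]:
    print(type(chunk).__name__, "->", chunk.text[:80])
```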
III.B. Step 2: Embeddings
After the text is extracted and segmented into semantic chunks, it must be transformed into a format the machine can truly "understand": this is the role of embeddings. Embeddings are high-dimensional mathematical representations of meaning; they allow the system to recognize that two sentences expressing similar legal obligations are related, even if they use entirely different terminology.
In our case, we chose to generate these embeddings using the nomic-embed-text model via the FastEmbed library. This was a strategic decision driven by our private-first mandate: by running the embedding engine inside our own Virtual Network instead of relying on external API calls, we ensure that sensitive contract content never traverses the public internet for vectorization. Furthermore, Nomic's model is specifically optimized for longer context windows and exhibits strong semantic performance on the structured, dense text typically found in commercial contracts.
While industry standards like OpenAI's text-embedding-ada-002 offer exceptional accuracy, their reliance on external API calls introduced an unacceptable security risk and unpredictable per-token costs. We also evaluated standard HuggingFace models like BERT, but found that Nomic's model provided superior retrieval quality for the long-form legal language that defines our use case. Ultimately, this choice keeps our platform secure, self-contained, and economically predictable.
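In code, the embedding step is compact. A sketch using FastEmbed with the Nomic model (the exact model tag and the sample texts are illustrative assumptions):

```python
from fastembed import TextEmbedding

# Runs locally via ONNX; no text leaves the private network
embedder = TextEmbedding(model_name="nomic-ai/nomic-embed-text-v1.5")

chunk_texts = [
    "Either party may terminate this Agreement upon thirty (30) days written notice.",
    "The Supplier's aggregate liability shall not exceed the fees paid hereunder.",
]

# embed() yields one numpy vector per input text
vectors = list(embedder.embed(chunk_texts))
print(len(vectors), "vectors of dimension", len(vectors[0]))
```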
III.C. Step 3: Vector Search
With the documents converted into a high-dimensional vector space, the system is ready for precise retrieval when a user submits a query. The query is vectorized using the same embedding method as in Step 2, and the system performs a similarity search to identify the most relevant contract segments. For this critical layer, we self-host Qdrant within our private Azure environment. Qdrant was selected for its exceptional performance in similarity search and, more importantly, its robust "Payload Filtering" capabilities. In a professional workflow, retrieval precision is crucial: a user needs the ability to filter results by specific contract names, dates, or clause types (e.g., "only search in Indemnification clauses from 2023"). Qdrant handles these complex filters with minimal latency, ensuring the model is fed only the most pertinent information for the question. We chose to self-host Qdrant over managed services like Pinecone to maintain strict data isolation and avoid dependency on external SaaS platforms.
While ChromaDB is an excellent tool for local prototyping, Qdrantās enterprise ready scaling and container friendly architecture made it the definitive choice for our MVP deployment where control over the index and storage layer is non-negotiable.
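A hedged sketch of that filtered search with qdrant-client; the collection name, payload keys, and values are assumptions for illustration (recent client versions also expose `query_points` as the newer API):

```python
from fastembed import TextEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

client = QdrantClient(url="http://qdrant:6333")  # private, in-VNET endpoint

# Vectorize the question with the same local model used at ingestion time
embedder = TextEmbedding(model_name="nomic-ai/nomic-embed-text-v1.5")
query_vector = next(embedder.embed(["What are the indemnification obligations?"]))

hits = client.search(
    collection_name="contracts",
    query_vector=query_vector.tolist(),
    query_filter=Filter(must=[
        FieldCondition(key="clause_type", match=MatchValue(value="Indemnification")),
        FieldCondition(key="year", range=Range(gte=2023)),
    ]),
    limit=5,
)

for hit in hits:
    print(round(hit.score, 3), hit.payload["document"], "p.", hit.payload["page"])
```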
III.D. Step 4: The LLM
The final stage of the pipeline is answer generation, where the "brain" of the system synthesizes a response from the user's query and the retrieved chunks of text. Rather than allowing the Large Language Model (LLM) to answer from its own training data, AlisLeg provides it with only the user's query and the specific contract chunks retrieved in the previous step, so the model operates as a reasoning engine over a provided set of facts rather than a knowledge engine prone to creative fabrication. For this task, we chose the Qwen 2.5 model, hosted locally via Ollama in an isolated container. During our benchmarking, Qwen 2.5 demonstrated superior reasoning capabilities and a deeper understanding of dense legal terminology compared to other open-source variants like Llama 3. Deploying via Ollama gives us both simple container management and strict internal-only network exposure.
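The grounding pattern itself is simple to express. A sketch against a local Ollama endpoint, where the model tag, prompt wording, and chunk format are assumptions rather than the project's exact implementation:

```python
import ollama

def generate_answer(question: str, chunks: list[dict]) -> str:
    # Concatenate the retrieved chunks with their provenance
    context = "\n\n".join(
        f"[{c['document']}, p. {c['page']}]\n{c['text']}" for c in chunks
    )
    prompt = (
        "Answer the question using ONLY the contract excerpts below. "
        "Cite the source document and page for every claim. If the excerpts "
        "do not contain the answer, say so explicitly.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )
    # Ollama runs inside the private network; no external API is involved
    response = ollama.chat(
        model="qwen2.5",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]
```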
While no model can guarantee zero hallucinations, our setup mitigates the risk by displaying the page containing the original source paragraphs directly next to the AI's answer, in addition to citing the specific source document and page numbers. This keeps the user firmly in the loop to verify every claim as quickly as possible. The approach transforms the AI from a black-box oracle into a transparent assistant, providing the speed of modern LLMs without compromising the factual integrity required in professional environments.
III.E. The RAG Pipeline at a Glance
Below is a summarized view of how a raw contract is transformed into a verified answer within our private perimeter:
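In code terms, the query-time flow condenses to a few lines. The helper names here are hypothetical stand-ins for the components sketched in the previous steps; ingestion-time OCR and chunking happen before any of this runs:

```python
def answer_question(question: str) -> dict:
    # Step 2: vectorize the query with the same local embedding model
    query_vector = embed_query(question)
    # Step 3: semantic search in Qdrant, optionally payload-filtered
    chunks = retrieve_chunks(query_vector, top_k=5)
    # Step 4: grounded generation with the local Qwen 2.5 model
    answer = generate_answer(question, chunks)
    # Sources travel with the answer so the UI can render them side-by-side
    return {"answer": answer, "sources": chunks}
```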
IV. The Secure Perimeter
Building a performant RAG system is one thing; deploying it in a way that meets the strict security requirements of a privacy-first organization is another thing entirely. For AlisLeg, we couldn't just "put it in the cloud": we had to build a digital fortress using Azure's advanced networking and scaling capabilities.
The scenario we are building for is a law firm that needs to process and analyze large volumes of legal documents while taking advantage of the cloud's scalability and security features. Although processing sensitive data in the cloud is a concern for many organizations, we strongly believe that, with the right security measures in place, it can be done safely and efficiently. Cloud providers are also increasingly helping organizations meet security and compliance requirements by offering guarantees on data sovereignty, data residency, and processing. Our environment was built to leverage exactly these capabilities.
IV.A. Security First: Data Sovereignty as a Default
The cornerstone of our security strategy is the use of Azure Virtual Networks (VNETs) and Private Links. In a typical cloud setup, services often communicate over public endpoints, even if they are protected by firewall rules. For AlisLeg, we have completely eliminated public internet exposure: every component, from the UI to Blob Storage, communicates over private internal IP addresses on the Microsoft backbone (the UI was briefly exposed to the internet for demo purposes only). This defense-in-depth approach means the cloud is not just a hosting layer but a controlled network perimeter.
By using private endpoints and disabling public access, all traffic remains inside the Azure backbone network as access is restricted at the network, identity, and service levels. While the cloud provider manages the infrastructure, document access is governed by strict network isolation, encryption, and role-based controls, significantly reducing exposure compared to public SaaS platforms.
IV.B. Scalability: The Power of KEDA
Legal workloads are inherently irregular: a given day might involve zero processing, while a "data room" opening day might require ingesting thousands of documents. Traditional autoscaling (which reacts to CPU or memory usage) is often too slow and expensive for this pattern, as it requires the containers to already be "working hard" before it triggers a scale-up. Instead, we use KEDA (Kubernetes-based Event-Driven Autoscaling), which lets our infrastructure be proactive rather than reactive. It monitors the Azure Storage Queue directly; when the queue is empty, the system scales the Worker tier to zero, eliminating permanent compute burn. As soon as a document is uploaded and a message lands in the queue, KEDA instantly spins up the necessary Workers to handle the volume of work.
IV.C. Business Impact: Predictable Performance and Cost
This shift from "always-on" to "event-driven" has a massive impact on the bottom line: the system handles 10 documents or thousands of documents with the same operational overhead. We only pay for the compute we actually use, ensuring that the project remains cost-effective regardless of the volume of document review. For a legal firm or compliance department, this translates to an AI assistant that is always ready but never wasteful, providing enterprise-scale power on a predictable, lean budget.
The diagram below illustrates the security boundaries: how resources are isolated inside the Azure Virtual Network, how data is locked down using Private Endpoints, and how the user accesses and interacts with the system.
V. Key Takeaways from the Implementation
Building AlisLeg wasn't a straight path from idea to deployment, which is not unusual in projects of this type. It was an iterative process of testing, failing, and refining, moving beyond a basic application of RAG and into the high-stakes world of legal technology. It opened our eyes to what it takes to implement AI that adds value to the enterprise while playing by the rules under which the enterprise operates.
V.A. Navigating the "Dirty Data" Problem
The biggest technical hurdle wasn't the AI model itself; it revolved around the data, which we expected to be difficult but not to this extent. Real-world documents in general are messy, but legal documents are notoriously complex. In particular, we encountered documents with complex layouts and multi-column structures that standard extraction tools turned into pure gibberish (in a funny way, it was like a puzzle where all the pieces were there but in the wrong order). Tuning the RAG pipeline to handle this required moving away from classical text extractors toward the layout-aware processing of unstructured. We quickly learned that using AI with messy data is a recipe for disaster: garbage in = hallucinations out. Precise retrieval requires precise extraction, and getting that right was 70% of the effort.
We also faced the "context window" challenge: long contracts can easily exceed the memory of smaller local models. Optimizing our chunking strategy, by ensuring that headers and metadata were prepended to every snippet of text, was crucial to ensuring the AI didn't "lose the plot" halfway through a long clause.
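An illustrative version of that prepending trick; the format and field names here are our own for this sketch, not the project's exact scheme:

```python
def contextualize(chunk_text: str, document: str, section: str) -> str:
    """Prepend provenance so each chunk stays interpretable in isolation."""
    return f"Document: {document}\nSection: {section}\n\n{chunk_text}"

snippet = contextualize(
    "Either party may terminate upon thirty (30) days written notice.",
    document="MSA_AcmeCorp.pdf",
    section="12. Term and Termination",
)
# The contextualized text, not the bare chunk, is what gets embedded and indexed
print(snippet)
```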
V.B. The "Private-First" Standard
The most significant lesson was that private-first AI is no longer a niche requirement but a prerequisite for enterprise adoption. By proving that we could run a high-performance LLM (Qwen 2.5) and a vector database (Qdrant) entirely inside a VNET, we removed the single biggest blocker for LLM adoption: the fear of data leaks. We believe this will become the standard for enterprise AI, especially given the growing number of data privacy regulations and the rise of open-source models that are increasingly powerful while remaining lightweight and efficient at the same time.
Architecturally, the reliability of Azure Container Apps was a game-changer: it allowed us to abstract away the complexity of managing a full Kubernetes cluster while still giving us the deep network controls (Private Endpoints, VNET integration) and event-driven scaling (KEDA) we needed to keep the system production-ready and cost-effective.
V.C. The Future: From Search to Strategy
AlisLeg is an MVP, but it has the potential to be much more. The roadmap ahead could be very ambitious, and we believe that the future of legal AI lies in moving beyond simple search and into the realm of strategic intelligence. Possible next steps could involve:
- Vision Aware Workers: Moving beyond text extraction to full visual understanding of tables and charts using Vision-Language Models (VLMs).
- Multi Agent Reasoning: Implementing specialized agents that can "argue" both sides of a clause to identify hidden risks (something like Karpathy's "LLM Council").
- Automated Remediation: Not just identifying missing clauses, but suggesting legally sound language to bridge the gaps.
- Better UX: The current UI is functional but basic, so a more polished interface with better navigation of the search results and the source documents would significantly improve the user experience.
- Better Performance: While the current system is reasonably fast, further optimization of the RAG pipeline and model inference could reduce response times and improve scalability. Inference is the bottleneck in the current system, and GPU acceleration (or simply more CPU-optimized containers) for model inference could significantly improve performance and allow us to use larger, more powerful models.
The journey from a "searchable archive" to a "strategic advisor" could be a good next step, and the foundation we've built ensures we can scale that intelligence without ever compromising on the privacy that makes the system trustworthy.
VI. The Value of Grounded Intelligence
Technology is only as good as the problem it solves, and for AlisLeg the objective wasn't to replace the lawyer, but to augment them. In an industry where "time is money" (literally), the ability to shorten the distance between a question and a verified answer is the ultimate competitive advantage.
The true ROI of AlisLeg lies in its ability to reduce cumulative risk: by ensuring that every contract in an archive is indexed and searchable, the hidden liabilities and the outlier clauses that get missed during a sleepy Friday afternoon review are brought to light. Furthermore, by automating the hard work of document search and layout flattening, it lets legal teams focus on what they are paid for: high-level strategy, negotiation, and risk mitigation. It provides peace of mind, knowing that every AI-generated answer is backed by a clickable link to the original source text. In short, AlisLeg delivers a proactive security posture where compliance isn't a post-mortem activity but a real-time capability.
Finally, with AlisLeg, we've demonstrated that we don't have to choose between the cutting-edge power of LLMs and the rigid security requirements of the enterprise. By leveraging a micro-monolith architecture, event-driven scaling with KEDA, and a 100% private RAG pipeline, we have created a blueprint for what can be described as regulated intelligence.