
Real-World AI Inference: Reducing Hallucinations and Boosting LLM Accuracy by 40x

Nov 13, 2024


STEVE WILLIAMS (ITERNAL TECHNOLOGIES) AND VAISHALI GHIYA (DENVR DATAWORKS)


As AI continues to redefine what’s possible, businesses are at a crucial point where they must harness massive amounts of data to stay competitive. Whether you’re an enterprise managing customer interactions or a startup dealing with niche data, AI inference – the ability to generate real-time insights from large datasets – is becoming essential. In this blog, we’ll explore how Iternal Technologies and Denvr Dataworks partnered to transform Frank Herbert’s complex, context-heavy 425-page novel Dune into an interactive, real-time searchable database.



The goal wasn’t just to demonstrate technical prowess but to show how advanced AI inference can be applied to business problems in ways that create real-time value. We’ll show you what we did, how we did it, and more importantly, how you can apply the same methodologies and technologies to revolutionize your business.



Why AI Inference Matters and Why You Should Care



AI inference is more than just running a trained model; it goes beyond simple data queries. It allows you to extract real-time, accurate, and contextually relevant insights from unstructured data at scale. Imagine a healthcare provider retrieving critical patient history in milliseconds or a legal researcher instantly accessing key case law from thousands of documents.



It's also a complex process requiring vast amounts of operational optimization, which developers shouldn’t need to worry about. At Iternal Technologies and Denvr Dataworks, we wanted to demonstrate the power of AI inference when combined with the right platform, workflows, and infrastructure. And we didn’t want to choose just any dataset—we picked Dune, a novel known for its complexity, intricate world-building, and layered themes. Our aim was to create a system that delivers real-time, precise answers to detailed questions about the text, without burdening developers with the complexities of underlying infrastructure.


A Summary of the AI Inference Solution


Using Denvr Cloud, we utilized Intel Gaudi 2 AI accelerators to perform inference on the 425-page novel Dune and achieved the following:


  • Processing Time: The entire novel was processed in just 202 seconds.

  • Accuracy: We increased retrieval-augmented generation (RAG)-based LLM accuracy by 40 times and improved vector search precision by 51%.

  • Inference Speed: Achieved 0.68 inferences per second, with a throughput of 5,404 bytes per second.



These results weren’t achieved by just running a model; they came from an efficient pipeline combining Iternal’s Blockify technology with the easy-to-use infrastructure of Denvr Cloud. Modular content was indexed for dynamic contextual responses and hyper-personal media builds, optimized by Iternal’s dual-database approach, delivering real-time expertise and engagement. This solution drastically reduced complexity and streamlined the entire process, showing how scalable AI inference can solve real-world problems, particularly in industries where speed, accuracy, and scalability are vital.



What We Did: Breaking Down the Framework



The challenge was to transform Dune – a dense, multi-layered narrative – into something AI could analyze and retrieve meaningful insights from. Here is the methodology, along with its key components:



Denvr Cloud: The Infrastructure Backbone



For AI inference to perform at scale, robust infrastructure is essential. Cloud platforms like Denvr Cloud streamline the process of deployment and management of large language models (LLMs), reducing the complexity of maintaining in-house systems while allowing you to “make it your own.”



Denvr Cloud is a purpose-built AI platform, offering on-demand or reserved resources designed to streamline AI operations. Whether you’re developing models or deploying them, Denvr Cloud handles the operational burden, freeing developers to focus on innovation. Critical features include:



Ease of Use: Denvr Cloud’s intuitive infrastructure allows developers to get up and running quickly, focusing on model refinement rather than dealing with complex system setups.


Scalability: Pre-configured environments and real-time monitoring enable businesses to efficiently scale models and dynamically manage increasing workloads.


Flexibility: With access to the latest GPUs and AI accelerators, like Intel Gaudi 2, Denvr Cloud ensures developers can select the most appropriate configuration for their needs.



Iternal Technologies: Innovation in AI Inference



AI solution providers are advancing AI inference by developing custom pipelines and managing the complete transformation process for businesses. Innovations like Iternal Technologies’ Turnkey AI platform and patented Blockify technology help diverse industries move beyond generic AI solutions, unlocking new efficiencies and enabling real-time data processing and faster decision-making.



Blockify is Iternal Technologies’ patented data ingestion and distillation technology that includes a dual-database approach for storing modular content blocks, enabling both hyper-personal media build and highly contextual retrieval. In the Dune project, Blockify enabled real-time data processing and more accurate decision-making, increasing RAG-based LLM accuracy by 40 times and improving vector search by 51%, significantly reducing AI hallucinations.



In just five commands, Iternal was able to deploy and refine AI inference models on Denvr Cloud. This approach not only sped up project completion but also demonstrated how easily this rapid deployment method can be replicated for other projects.



The Workflow




Figure: Iternal Technologies AI Inference Workflow.



The Iternal Technologies AI inference workflow is shown in the figure above. The setup included a pre-installed environment with the Docker runtime, the Hugging Face Optimum library, PyTorch libraries and tools, and all necessary drivers for a great out-of-the-box experience and seamless deployment. The choice of Intel Gaudi 2 processors was deliberate: Gaudi 2 is designed for deep learning workloads, and its efficiency in both training and inference made it the ideal choice for this inference project. Its architecture allows for high throughput, making it well suited to tasks that require parallel processing of complex, layered data – like Dune.



Dune is known for its length, complexity and intricate world-building. To enable real-time retrieval and answer detailed questions, the novel was pre-processed and indexed using Iternal’s Blockify technology to modularize content into manageable blocks. These blocks are stored in two dedicated databases, one for modular media output and another for contextually relevant retrieval.



The Llama 3 LLM was prepared and fine-tuned using Low-Rank Adaptation (LoRA) and run on a single Intel Gaudi 2 core. Various configurations were tested across both the Dune pre-processing and inference steps; the performance-optimized settings included chunking the novel into 8,000-character segments and generating 1,000 tokens per query output across 100 parallel jobs.
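Blockify’s internals aren’t public, so the segmentation step can only be illustrated in spirit. The following is a naive, hypothetical character-based chunker – a stand-in, not Iternal’s implementation – that splits text into segments of at most 8,000 characters, preferring paragraph boundaries:

```python
def chunk_text(text: str, segment_size: int = 8000) -> list[str]:
    """Split text into segments of at most `segment_size` characters,
    breaking at the nearest paragraph boundary inside the window
    where one exists."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + segment_size, len(text))
        if end < len(text):
            # Prefer to break at a blank line (paragraph boundary).
            boundary = text.rfind("\n\n", start, end)
            if boundary > start:
                end = boundary
        chunks.append(text[start:end].strip())
        start = end
    return chunks
```

A real pipeline would likely segment on semantic units (chapters, scenes, themes, as described below) rather than raw character counts, but the fixed-size window matches the 8,000-character setting reported above.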



The Blockify workflow steps included:


  1. Chunking the Text: The novel was divided into smaller content blocks based on chapters, scenes, or themes, creating modular sections that can be reused or reassembled based on user needs.

  2. Embeddings: These content blocks were converted into embeddings (vector representations) to capture their unique context and structure, enabling content-aware retrieval.

  3. Dual Indexing: Content blocks were indexed in two dedicated databases to support both modular media builds and context-aware retrieval:


    1. Content Block Database: Contains raw, modular content blocks to support hyper-personal media builds.

    2. Context-Aware Retrieval Database: Houses embeddings for efficient, context-sensitive retrieval based on specific user queries.


  4. Retrieval and Response Generation: Based on user queries, the system retrieves relevant content from the Context-Aware Retrieval Database for accurate, contextually relevant responses. At the same time, modular content from the Content Block Database is assembled to create hyper-personal media builds, ensuring flexibility and adaptability in response to unique user needs.
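The dual-database design above can be sketched in miniature. Everything here is a simplified, hypothetical stand-in: the toy bag-of-words “embedding” replaces a learned embedding model, and in-memory Python dicts replace the two production databases:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a production system would use a
    # learned embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class DualIndex:
    """Minimal stand-in for the dual-database approach: one store for
    raw modular blocks, one for their embeddings."""
    def __init__(self):
        self.content_blocks = {}   # block_id -> raw text (media builds)
        self.retrieval_index = {}  # block_id -> embedding (retrieval)

    def add(self, block_id: str, text: str):
        self.content_blocks[block_id] = text
        self.retrieval_index[block_id] = embed(text)

    def retrieve(self, query: str, k: int = 3):
        # Rank blocks by similarity to the query, then return the raw
        # content so it can feed both answers and media builds.
        q = embed(query)
        ranked = sorted(self.retrieval_index,
                        key=lambda bid: cosine(q, self.retrieval_index[bid]),
                        reverse=True)
        return [(bid, self.content_blocks[bid]) for bid in ranked[:k]]
```

The key design point survives even in this sketch: because raw blocks and embeddings live in separate stores, retrieval quality and media assembly can be tuned independently.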



This enhanced workflow enabled the system to instantly recall specific plotlines, characters, and themes, with the capability to dynamically assemble content for hyper-personal media builds. The result? Real-time expertise, engagement, and personalized content in just a few minutes – an achievement that would take an estimated 48 human hours using traditional methods.



Benchmark Results:



The entire novel—over 200,000 words—was processed in 202 seconds. The ability to process 5 million pages of text per month on a single Gaudi 2 core demonstrates the scalability and efficiency of the approach.



Total Time: 202,818 ms


  • Total Responses: 1,206

  • Inferences: 138

  • Inference Speed: 0.68 inferences/second

  • Average Inference Latency: 134,647.50 ms/inference

  • Total Bytes: 1,096,110 Bytes

  • Throughput: 5,404.61 Bytes/second
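The headline rates follow directly from the raw totals above; a quick sanity check reproduces them (small gaps from the published figures presumably come from rounding in the reported totals):

```python
# Reported totals from the benchmark run above.
total_ms = 202_818
inferences = 138
total_bytes = 1_096_110

seconds = total_ms / 1000
inference_rate = inferences / seconds  # ~0.68 inferences/second
throughput = total_bytes / seconds     # ~5,404 bytes/second

print(round(inference_rate, 2), round(throughput))
```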


This process wasn’t just about speed; it was about improving accuracy while minimizing issues like AI hallucinations. Blockify increased the precision of vector searches and RAG models, ensuring that the system retrieved the most relevant, contextually accurate information from the dataset.


Business Use Cases: Real-Time AI Inference in Action



The Dune project is more than just a demonstration of technical capability, and its implications extend far beyond literature. AI inference solutions can be applied across industries:



  • Healthcare: AI inference can automate the retrieval of medical records or offer real-time insights from patient data, improving diagnostics and patient care. Tools like AI scribe automate transcriptions to reduce workloads.

  • Legal Services: AI-powered platforms can streamline case research, offering instant retrieval of relevant case law and legal precedents.

  • Finance: Real-time, AI-driven analytics can help financial institutions make faster decisions, improve fraud detection, and optimize risk management.

  • Retail and Media: AI inference can drive personalized recommendations, improving user engagement and reducing order completion times by up to 70%.



The real power here is in how scalable and adaptable these AI inference models can be. For companies across industries, the ability to retrieve accurate, contextually rich data in real time isn’t just an operational advantage—it’s a strategic one. By eliminating the operational burden of managing AI systems, businesses can focus on growth, efficiency, and competitive differentiation.



What This Means for You: Scalable AI Inference for Future Growth



Whether you're a developer, CTO, startup founder, or business leader, scalable AI inference offers the chance to unlock vast data-driven insights in real time. As the AI inference market continues to mature and grow, solutions like Denvr Cloud and Iternal Technologies’ Turnkey AI and Blockify are key to maintaining a competitive edge. Scaling AI inference with minimal infrastructure investment means you can:



  • Automate Real-Time Insights: Deploy AI-driven solutions to analyze massive datasets, providing instant answers to complex queries.

  • Drive Personalized Experiences: In industries like media or retail, AI inference can drive personalization, tailoring user experiences in real time.

  • Enable Faster Decision-Making: Whether in finance or healthcare, faster data retrieval means more informed, timely decisions.


This isn’t just technology for big enterprises. Startups and mid-sized companies can deploy these systems quickly, scale them efficiently, and use them to carve out competitive advantages with minimal infrastructure investment. Examples like Khan Academy’s personalized learning, GitHub Copilot’s coding assistance, and Replika’s adaptive virtual conversations show how AI can transform niche expertise into practical, scalable platforms.



Final Thoughts: Now is the Time to Get Started



AI inference has moved from theory to practice, offering real-world solutions across industries. Whether you’re analyzing legal documents, processing healthcare data, or delivering personalized customer experiences, the technology is ready and available.



The infrastructure and tools are in place, and with low barriers to entry, there’s no better time to dive in. Try Denvr Cloud’s free trial and see firsthand how AI inference can integrate into your workflow. Contact Iternal Technologies today to get started.
