SANTA CLARA, Calif., March 12, 2025 (GLOBE NEWSWIRE) -- Pliops, a leader in storage and accelerator solutions, today announced a strategic collaboration with the vLLM Production Stack developed by LMCache Lab at the University of Chicago. Aimed at revolutionizing large language model (LLM) inference performance, this partnership comes at a pivotal moment as the AI community gathers next week for the GTC 2025 conference.
Together, Pliops and the vLLM Production Stack, an open-source reference implementation of a cluster-wide, full-stack vLLM serving system, are delivering unparalleled performance and efficiency for LLM inference. Pliops contributes its expertise in shared storage and efficient vLLM cache offloading, while LMCache Lab brings a robust framework for scaling across multiple vLLM instances. The combined solution can also recover from failed instances by leveraging Pliops' advanced KV storage backend, setting a new benchmark for performance and scalability in AI applications.
"We are excited to partner with Pliops to bring unprecedented efficiency and performance to LLM inference,” said Junchen Jiang, Head of LMCache Lab at the University of Chicago. “This collaboration demonstrates our commitment to innovation and pushing the boundaries of what’s possible in AI. Together, we are setting the stage for the future of AI deployment, driving advancements that will benefit a wide array of applications."
Key Highlights of the Combined Solution:
- Seamless Integration: By enabling vLLM to process each context only once, Pliops and the vLLM Production Stack set a new standard for scalable and sustainable AI innovation.
- Enhanced Performance: The collaboration introduces a new petabyte tier of memory below HBM for GPU compute applications. Computed KV caches are retained in cost-effective, disaggregated smart storage and retrieved efficiently, significantly speeding up vLLM inference (see the sketch following this list).
- AI Autonomous Task Agents: This solution is optimal for AI autonomous task agents, addressing a diverse array of complex tasks through strategic planning, sophisticated reasoning, and dynamic interaction with external environments.
- Cost-Efficient Serving: Pliops’ KV-Store technology with NVMe SSDs enhances the vLLM Production Stack, ensuring high performance serving while reducing cost, power and computational requirements.
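A minimal sketch of the retain-and-retrieve pattern described above: each context's KV cache is computed at most once, stored under a content-addressed key, and reloaded on subsequent requests. The `KVStore` class and its `put`/`get` methods are illustrative placeholders, not Pliops' actual API:

```python
import hashlib

# Hypothetical client for a disaggregated KV store; illustrative only,
# not a real Pliops interface.
class KVStore:
    def __init__(self):
        self._backend = {}  # stands in for shared, NVMe-backed storage

    def put(self, key: bytes, value: bytes) -> None:
        self._backend[key] = value

    def get(self, key: bytes) -> bytes | None:
        return self._backend.get(key)

def cache_key(token_ids: list[int]) -> bytes:
    """Content-address a token prefix so identical contexts map to one entry."""
    return hashlib.sha256(str(token_ids).encode("utf-8")).digest()

def kv_for_context(store: KVStore, token_ids: list[int], compute_kv) -> bytes:
    """Serve the KV cache for a context, computing it at most once."""
    key = cache_key(token_ids)
    cached = store.get(key)
    if cached is not None:
        return cached            # hit: reload instead of re-running prefill
    kv = compute_kv(token_ids)   # miss: run prefill on the GPU once...
    store.put(key, kv)           # ...then retain the result in shared storage
    return kv
```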
Looking to the future, the collaboration between Pliops and the vLLM Production Stack will continue to evolve through the following stages:
- Basic Integration: The current focus is on integrating the Pliops KV-IO stack into the production stack. This stage enables feature development on an efficient KV I/O stack backed by the Pliops LightningAI KV store, including shared storage for prefill-decode disaggregation and KV-cache movement, alongside joint work to define requirements and APIs. Pliops is also developing a generic GPU KV-store I/O framework.
- Advanced Integration: The next stage will integrate Pliops vLLM acceleration into the production stack. This includes prompt caching across multi-turn conversations (a capability offered by platforms such as OpenAI and DeepSeek, and sketched below), KV-cache offload to scalable, shared key-value storage, and eliminating the need for sticky/cache-aware routing.
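To illustrate what multi-turn prompt caching buys, the sketch below uses vLLM's built-in automatic prefix caching on a single instance: the second turn shares the conversation history as its prompt prefix, so its KV blocks can be reused rather than recomputed. The integration described above aims to extend this reuse across instances via shared, NVMe-backed storage; the model name here is only an example:

```python
from vllm import LLM, SamplingParams

# Automatic prefix caching reuses KV blocks for a repeated prompt prefix
# within this single vLLM instance.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=64)

history = "System: You are a helpful assistant.\nUser: Summarize our Q3 plan.\n"
first = llm.generate(history, params)[0].outputs[0].text

# The second turn shares the conversation history as its prefix, so its
# prefill can reuse the cached KV blocks instead of recomputing them.
history += f"Assistant: {first}\nUser: What are the main risks?\n"
second = llm.generate(history, params)[0].outputs[0].text
print(second)
```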
"This collaboration opens up exciting possibilities for enhancing LLM inference,” commented Pliops CEO Ido Bukspan. “It allows us to leverage complementary strengths to tackle some of AI’s toughest challenges, driving greater efficiency and performance across a wide range of applications."
Connect with Pliops
Read Blog
Visit Resource Center – XDP LightningAI Solution Brief
Connect on LinkedIn
Follow on X
About Pliops XDP LightningAI
With the increasing demand for generative AI applications, optimizing LLM inference efficiency and reducing costs have become essential. Pliops is empowering developers to tackle these challenges head-on with its XDP LightningAI solution, an accelerated, distributed KV smart node that introduces a new petabyte tier of memory below high-bandwidth memory (HBM) for GPU compute applications. It uses cost-effective, disaggregated smart storage to retain computed KV caches so they can be retrieved if discarded from HBM. When serving a previously processed context, the saved KV caches are loaded efficiently from storage, allowing vLLM to generate new content significantly faster.
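To see why a petabyte tier below HBM is useful, consider rough KV-cache sizing for a Llama-3-70B-class model (80 layers, 8 grouped-query KV heads, head dimension 128, FP16 cache). The numbers below are illustrative arithmetic, not Pliops benchmarks:

```python
# Back-of-envelope KV-cache sizing for a Llama-3-70B-class model (FP16).
# Illustrative arithmetic only; not a Pliops benchmark.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2

# K and V are each stored per layer, per KV head, per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(f"KV cache per token: {bytes_per_token / 2**10:.0f} KiB")   # ~320 KiB

context_tokens = 128_000
per_context = bytes_per_token * context_tokens
print(f"One 128K-token context: {per_context / 2**30:.1f} GiB")   # ~39 GiB

# A few tens of thousands of retained contexts already reach petabyte
# scale, which is why caches spill from HBM to a storage tier.
contexts = 30_000
print(f"{contexts} contexts: {per_context * contexts / 2**50:.1f} PiB")
```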
About Pliops
Winner of the FMS 2024 Most Innovative AI Solution award, Pliops is a technology innovator focused on making data centers run faster and more efficiently. The company’s Extreme Data Processor (XDP) radically simplifies the way data is processed and managed. Pliops overcomes I/O inefficiencies to massively accelerate performance and dramatically reduce overall infrastructure costs for data-hungry AI applications. Founded in 2017, Pliops has been recognized multiple times as one of the 10 hottest semiconductor startups. The company has raised over $200 million to date from leading investors including Koch Disruptive Technologies, State of Mind Ventures Momentum, Intel Capital, Viola Ventures, SoftBank Ventures Asia, Expon Capital, NVIDIA, AMD, Western Digital, SK hynix and Alicorn. For more information, visit www.pliops.com.
Media Contact:
Stephanie Olsen
Lages & Associates
(949) 453-8080
stephanie@lages.com
A photo accompanying this announcement is available at https://www.globenewswire.com/NewsRoom/AttachmentNg/94f982b7-1d6c-4522-9258-6a7048c07e73