vLLM

Wikipedia - Recent changes [en] - Thursday, April 23, 2026

{{lowercase title}}
{{Short description|Open-source software for large language model inference}}
{{Use mdy dates|date=April 2026}}

{{Infobox software
| name = vLLM
| logo = vLLM.svg
| author = Sky Computing Lab<br>[[University of California, Berkeley|UC Berkeley]]
| developer = vLLM contributors
| released = 2023
| programming language = [[Python (programming language)|Python]], [[CUDA]], [[C++]]
| genre = [[Large language model]] [[inference engine]]
| license = [[Apache License 2.0]]
| website = {{URL|https://vllm.ai}}
| repo = {{URL|https://github.com/vllm-project/vllm}}
}}

'''vLLM''' is an open-source software framework for inference and serving of [[large language model]]s and related [[multimodal model]]s. Originally developed at the [[University of California, Berkeley]]'s Sky Computing Lab, the project is centered on ''PagedAttention'', a [[memory management|memory-management]] method for [[Transformer (deep learning)|transformer]] [[Transformer (deep learning)#KV caching|key–value cache]]s, and supports features such as continuous batching, [[distributed computing|distributed]] inference, [[Large language model#Quantization|quantization]], and [[OpenAI]]-compatible [[application programming interface|APIs]].<ref name="github">{{cite web |title=GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs |url=https://github.com/vllm-project/vllm |website=GitHub |publisher=GitHub, Inc. |access-date=April 22, 2026}}</ref><ref name="paper">{{cite conference |last1=Kwon |first1=Woosuk |last2=Li |first2=Zhuohan |last3=Zhuang |first3=Siyuan |last4=Sheng |first4=Ying |last5=Zheng |first5=Lianmin |last6=Yu |first6=Cody Hao |last7=Gonzalez |first7=Joseph E. |last8=Zhang |first8=Hao |last9=Stoica |first9=Ion |title=Efficient Memory Management for Large Language Model Serving with PagedAttention |conference=Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles |year=2023 |url=https://arxiv.org/abs/2309.06180 |access-date=April 22, 2026}}</ref><ref name="pytorch-project">{{cite web |title=vLLM |url=https://pytorch.org/projects/vllm/ |website=PyTorch |publisher=PyTorch Foundation |access-date=April 22, 2026}}</ref> According to a project [[software maintainer|maintainer]], the "v" in vLLM originally referred to "virtual", inspired by [[virtual memory]].<ref>{{cite web |title=vLLM full name |url=https://github.com/vllm-project/vllm/issues/835 |website=GitHub |publisher=GitHub, Inc. |date=August 23, 2023 |access-date=April 22, 2026}}</ref>

== History ==
vLLM was introduced in 2023 by researchers affiliated with the Sky Computing Lab at UC Berkeley.<ref name="paper" /><ref name="github" /> Its core ideas were described in the 2023 paper ''Efficient Memory Management for Large Language Model Serving with PagedAttention'',<ref>{{cite arXiv |last1=Kwon |first1=Woosuk |last2=Li |first2=Zhuohan |last3=Zhuang |first3=Siyuan |last4=Sheng |first4=Ying |last5=Zheng |first5=Lianmin |last6=Yu |first6=Cody Hao |last7=Gonzalez |first7=Joseph E. |last8=Zhang |first8=Hao |last9=Stoica |first9=Ion |eprint=2309.06180 |title=Efficient Memory Management for Large Language Model Serving with PagedAttention |class=cs.LG |date=2023-09-12}}</ref> which presented the system as a [[High-throughput computing|high-throughput]] and [[computer memory|memory]]-efficient serving engine for [[large language model]]s.<ref name="paper" />

In 2025, the [[PyTorch]] Foundation announced that vLLM had become a Foundation-hosted project. PyTorch's project page states that the [[University of California, Berkeley]] contributed vLLM to the [[Linux Foundation]] in July 2024.<ref name="pytorch-hosted">{{cite web |title=PyTorch Foundation Welcomes vLLM as a Hosted Project |url=https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/ |website=PyTorch |publisher=PyTorch Foundation |date=May 7, 2025 |access-date=April 22, 2026}}</ref><ref name="pytorch-project" />

In January 2026, ''[[TechCrunch]]'' reported that the creators of vLLM had launched the startup Inferact to commercialize the project, raising $150 million in seed funding.<ref name="techcrunch">{{cite web |last=Temkin |first=Marina |title=Inference startup Inferact lands $150M to commercialize vLLM |url=https://techcrunch.com/2026/01/22/inference-startup-inferact-lands-150m-to-commercialize-vllm/ |website=TechCrunch |date=January 22, 2026 |access-date=April 22, 2026}}</ref>

== Architecture ==
According to its 2023 paper, vLLM was designed to improve the efficiency of [[large language model]] serving by reducing memory waste in the [[Transformer (deep learning)#KV caching|key–value cache]] used during [[Transformer (deep learning)|transformer]] inference.<ref name="paper" /> The paper introduced ''PagedAttention'', an algorithm inspired by [[virtual memory]] and [[paging]] techniques in [[operating system]]s, and described vLLM as using block-level memory management and request scheduling to increase [[throughput]] while maintaining similar [[Latency (engineering)|latency]].<ref name="paper" />
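The paging analogy can be illustrated with a simplified sketch (not vLLM's actual implementation): a request's key–value cache is divided into fixed-size logical blocks, and a per-request block table maps each logical block to a physical block allocated on demand, so physical memory need not be contiguous and internal fragmentation is bounded by one block per request. All names below (<code>BlockManager</code>, <code>append_token</code>, <code>release</code>) are hypothetical and chosen for illustration only.

```python
# Toy sketch of the block-table idea behind PagedAttention.
# Hypothetical code for illustration; vLLM's real block manager is more complex.

class BlockManager:
    """Maps each request's logical KV-cache blocks to physical blocks."""

    def __init__(self, num_physical_blocks, block_size=4):
        self.block_size = block_size          # tokens stored per block
        self.free = list(range(num_physical_blocks))
        self.tables = {}                      # request id -> physical block ids
        self.lengths = {}                     # request id -> tokens stored

    def append_token(self, req_id):
        """Reserve KV-cache space for one more token of a request."""
        n = self.lengths.get(req_id, 0)
        table = self.tables.setdefault(req_id, [])
        if n % self.block_size == 0:          # current block full (or first token)
            if not self.free:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free.pop())     # allocate a new physical block on demand
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return a finished request's physical blocks to the free pool."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


mgr = BlockManager(num_physical_blocks=8, block_size=4)
for _ in range(6):                            # a 6-token request needs only 2 blocks
    mgr.append_token("A")
```

Because blocks are allocated as tokens arrive and freed when a request finishes, many requests with unpredictable output lengths can share one memory pool, which is the mechanism the paper credits for the throughput gains.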

The project documentation and repository describe support for continuous batching, chunked prefill, [[speculative decoding]], prefix caching, [[Large language model#Quantization|quantization]], and multiple forms of [[distributed computing|distributed]] inference and serving.<ref name="github" /><ref name="pytorch-project" /> [[PyTorch]] has described vLLM as a high-throughput, memory-efficient inference and serving engine that supports a range of hardware back ends, including [[Nvidia|NVIDIA]] and [[Advanced Micro Devices|AMD]] [[graphics processing unit|GPUs]], [[Tensor Processing Unit|Google TPUs]], [[AWS]] Trainium, and [[Intel]] processors.<ref name="pytorch-hosted" /><ref name="pytorch-project" />

== See also ==
* [[SGLang]]
* [[llama.cpp]]
* [[OpenVINO]]
* [[Open Neural Network Exchange]]
* [[Comparison of deep learning software]]
* [[Comparison of machine learning software]]
* [[Lists of open-source artificial intelligence software]]

== External links ==
* [https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=26.03.post1-py3 vLLM on NVIDIA NGC]
* [https://pytorch.org/projects/vllm/ vLLM project page at PyTorch]

== References ==
{{reflist}}