
vllm-mlx

MCP Server

OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA)...

Why now: Moving now

Fresh repo activity plus visible builder pull. This is the kind of tool people test before it becomes obvious.

Decision: High-conviction move

Copy the install, test the workflow, then decide if it earns a permanent slot.

Trial cost: Fast eval

You can test this quickly and remove it cleanly if it misses.

Risk: 35/100

GitHub health 42/100. No security policy. 37 open issues make this testable, but not something to trust blindly.

What You Are Adopting

AI Agent: Claude Code
Model: Multiple
Build Time: Instant

Test This In Your Stack

One command in · Clean rollback · Low commitment
Registry: Adds a named entry to Claude config. One command to remove.

Fastest way to find out if vllm-mlx belongs in your setup.

Copy the install command, run a real test, and back it out cleanly if it slows you down.

Try now
claude mcp add vllm-mlx -- npx vllm-mlx

Run this first. You will know quickly if the workflow earns a permanent slot.

Back out
claude mcp remove vllm-mlx

No messy cleanup loop. If it misses, remove it and keep moving.

Install Location

~/
└─ .claude.json
   └─ mcp_servers/
      └─ vllm-mlx ← registers here
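
A quick way to confirm the entry landed, as a stdlib-only sketch. The config key name is an assumption (shown as mcp_servers in the tree above; some Claude Code versions use mcpServers), so check your own file if neither matches:

import json
from pathlib import Path

# List the MCP servers registered in the Claude config.
# ASSUMPTION: the key is "mcpServers" or "mcp_servers" -- inspect
# your own ~/.claude.json if the layout differs.
config = json.loads((Path.home() / ".claude.json").read_text())
servers = config.get("mcpServers") or config.get("mcp_servers") or {}
print(list(servers))  # expect "vllm-mlx" after the add command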

About

OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code. An open-source MCP server for the AI coding ecosystem.

README

vLLM-MLX

vLLM-like inference for Apple Silicon - GPU-accelerated Text, Image, Video & Audio on Mac


Overview

vllm-mlx brings native Apple Silicon GPU acceleration to vLLM by integrating:

  • MLX: Apple's ML framework with unified memory and Metal kernels
  • mlx-lm: Optimized LLM inference with KV cache and quantization
  • mlx-vlm: Vision-language models for multimodal inference
  • mlx-audio: Speech-to-Text and Text-to-Speech with native voices
  • mlx-embeddings: Text embeddings for semantic search and RAG

Features

  • Multimodal - Text, Image, Video & Audio in one platform
  • Native GPU acceleration on Apple Silicon (M1, M2, M3, M4)
  • Native TTS voices - Spanish, French, Chinese, Japanese + 5 more languages
  • OpenAI API compatible - drop-in replacement for OpenAI client
  • Anthropic Messages API - native /v1/messages endpoint for Claude Code and OpenCode
  • Embeddings - OpenAI-compatible /v1/embeddings endpoint with mlx-embeddings
  • Reasoning Models - extract thinking process from Qwen3, DeepSeek-R1
  • MCP Tool Calling - integrate external tools via Model Context Protocol (a request-shape sketch follows this list)
  • Paged KV Cache - memory-efficient caching with prefix sharing
  • Continuous Batching - high throughput for multiple concurrent users
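
On the tool-calling point: MCP wires external tools into the server, and from the client side an OpenAI-compatible server would presumably surface them through the standard tools/tool_calls fields. A minimal request-shape sketch, assuming OpenAI-style tool definitions; the weather tool here is made up for illustration, and the MCP & Tool Calling guide documents the real integration:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Hypothetical tool definition, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Weather in Lima?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)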

Quick Start

Installation

Using uv (recommended):

# Install as CLI tool (system-wide)
uv tool install git+https://github.com/waybarrios/vllm-mlx.git

# Or install in a project/virtual environment
uv pip install git+https://github.com/waybarrios/vllm-mlx.git

Using pip:

# Install from GitHub
pip install git+https://github.com/waybarrios/vllm-mlx.git

# Or clone and install in development mode
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .

Start Server

# Simple mode (single user, max throughput)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-secret-key
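
Once the server is up, a quick liveness check against the /health endpoint (the same endpoint the Gemma 3 section below queries) needs only the standard library:

import json
import urllib.request

# Poll the health endpoint; per the Gemma 3 section below it
# returns fields such as "model_type".
with urllib.request.urlopen("http://localhost:8000/health") as resp:
    print(json.load(resp))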

Use with OpenAI SDK

from openai import OpenAI

# Without API key (local development)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# With API key (production)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
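
Streaming should work through the same client; a minimal sketch, assuming the server supports the standard stream=True flag (typical for OpenAI-compatible servers, but confirm against the server docs):

# Reuses the client from the example above.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    # Guard against keep-alive chunks with no delta content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)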

Use with Anthropic SDK

vllm-mlx exposes an Anthropic-compatible /v1/messages endpoint, so tools like Claude Code and OpenCode can connect directly.

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

response = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)

To use with Claude Code:

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude

See Anthropic Messages API docs for streaming, tool calling, system messages, and token counting.
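
For streaming specifically, the Anthropic SDK's standard streaming helper should apply, assuming the /v1/messages endpoint implements Anthropic's SSE streaming protocol (an assumption; the docs above are authoritative):

# Reuses the Anthropic client from the example above.
with client.messages.stream(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Tell me a short story."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)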

Multimodal (Images & Video)

vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)
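
For local files, the usual OpenAI-style approach of inlining the image as a base64 data URL should also work (an assumption; the multimodal guide lists the supported formats). The file name here is a placeholder:

import base64

# Inline a local image as a base64 data URL (standard OpenAI-style
# message format); "photo.jpg" is a placeholder.
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)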

Audio (TTS/STT)

# Install audio dependencies
pip install vllm-mlx[audio]
python -m spacy download en_core_web_sm
brew install espeak-ng  # macOS, for non-English languages

# Text-to-Speech (English)
python examples/tts_example.py "Hello, how are you?" --play

# Text-to-Speech (Spanish)
python examples/tts_multilingual.py "Hola mundo" --lang es --play

# List available models and languages
python examples/tts_multilingual.py --list-models
python examples/tts_multilingual.py --list-languages

Supported TTS Models:

Model       Languages                        Description
Kokoro      EN, ES, FR, JA, ZH, IT, PT, HI   Fast, 82M params, 11 voices
Chatterbox  15+ languages                    Expressive, voice cloning
VibeVoice   EN                               Realtime, low latency
VoxCPM      ZH, EN                           High quality Chinese/English
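
On the STT side (whisper models, benchmarked under Performance below), the examples here are script-based. If vllm-mlx also mirrors OpenAI's /v1/audio/transcriptions endpoint, which is an assumption not confirmed by this README (the Audio guide is authoritative), transcription through the OpenAI SDK would look like:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# ASSUMPTION: the server exposes OpenAI's transcription endpoint.
# "meeting.wav" is a placeholder file name.
with open("meeting.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="default", file=audio)
print(transcript.text)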

Reasoning Models

Extract the thinking process from reasoning models like Qwen3 and DeepSeek-R1:

# Start server with reasoning parser
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 × 23?"}]
)

# Access reasoning separately from the answer
print("Thinking:", response.choices[0].message.reasoning)
print("Answer:", response.choices[0].message.content)

Supported Parsers:

Parser       Models         Description
qwen3        Qwen3 series   Requires both <think> and </think> tags
deepseek_r1  DeepSeek-R1    Handles implicit <think> tag

Embeddings

Generate text embeddings for semantic search, RAG, and similarity:

# Start server with an embedding model pre-loaded
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --embedding-model mlx-community/all-MiniLM-L6-v2-4bit

# Generate embeddings using the OpenAI SDK
embeddings = client.embeddings.create(
    model="mlx-community/all-MiniLM-L6-v2-4bit",
    input=["Hello world", "How are you?"]
)
print(f"Dimensions: {len(embeddings.data[0].embedding)}")

See Embeddings Guide for details on supported models and lazy loading.

Documentation

For full documentation, see the docs directory:

  • Getting Started

    • Installation
    • Quick Start
  • User Guides

    • OpenAI-Compatible Server
    • Anthropic Messages API
    • Python API
    • Multimodal (Images & Video)
    • Audio (STT/TTS)
    • Embeddings
    • Reasoning Models
    • MCP & Tool Calling
    • Continuous Batching
  • Reference

    • CLI Commands
    • Supported Models
    • Configuration
  • Benchmarks

    • LLM Benchmarks
    • Image Benchmarks
    • Video Benchmarks
    • Audio Benchmarks

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           vLLM API Layer                                │
│                    (OpenAI-compatible interface)                         │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            MLXPlatform                                  │
│               (vLLM platform plugin for Apple Silicon)                  │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
        ┌─────────────┬────────────┴────────────┬─────────────┐
        ▼             ▼                         ▼             ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│    mlx-lm     │ │   mlx-vlm     │ │   mlx-audio   │ │mlx-embeddings │
│(LLM inference)│ │ (Vision+LLM)  │ │  (TTS + STT)  │ │ (Embeddings)  │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
        │             │                         │             │
        └─────────────┴─────────────────────────┴─────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                              MLX                                        │
│                (Apple ML Framework - Metal kernels)                      │
└─────────────────────────────────────────────────────────────────────────┘

Performance

LLM Performance (M4 Max, 128GB):

Model              Speed      Memory
Qwen3-0.6B-8bit    402 tok/s  0.7 GB
Llama-3.2-1B-4bit  464 tok/s  0.7 GB
Llama-3.2-3B-4bit  200 tok/s  1.8 GB

Continuous Batching (5 concurrent requests):

Model              Single     Batched     Speedup
Qwen3-0.6B-8bit    328 tok/s  1112 tok/s  3.4x
Llama-3.2-1B-4bit  299 tok/s  613 tok/s   2.0x

Audio - Speech-to-Text (M4 Max, 128GB):

Model                   RTF*  Use Case
whisper-tiny            197x  Real-time, low latency
whisper-large-v3-turbo  55x   Best quality/speed balance
whisper-large-v3        24x   Highest accuracy

*RTF = Real-Time Factor. RTF of 100x means 1 minute transcribes in ~0.6 seconds.
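
Equivalently, transcription time is audio duration divided by RTF; a quick check against the table above:

# Transcription time = audio duration / RTF.
duration_s = 60.0  # one minute of audio
for model, rtf in [("whisper-tiny", 197),
                   ("whisper-large-v3-turbo", 55),
                   ("whisper-large-v3", 24)]:
    print(f"{model}: ~{duration_s / rtf:.1f}s per minute of audio")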

See benchmarks for detailed results.

Gemma 3 Support

vllm-mlx includes native support for Gemma 3 vision models. Gemma 3 is automatically detected as an MLLM (multimodal LLM).

Usage

# Start server with Gemma 3
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Verify it loaded as MLLM (not LLM)
curl http://localhost:8000/health
# Should show: "model_type": "mllm"

Long Context Patch (mlx-vlm)

Gemma 3's default sliding_window=1024 limits context to ~10K tokens on Apple Silicon (Metal GPU timeout at higher context). To enable longer context (up to ~50K tokens), patch mlx-vlm:

Location: ~/.../site-packages/mlx_vlm/models/gemma3/language.py

Find the make_cache method and replace it with:

def make_cache(self):
    import os
    # Set GEMMA3_SLIDING_WINDOW=8192 for ~40K context
    # Set GEMMA3_SLIDING_WINDOW=0 for ~50K context (full KVCache)
    sliding_window = int(os.environ.get('GEMMA3_SLIDING_WINDOW', self.config.sliding_window))

    caches = []
    for i in range(self.config.num_hidden_layers):
        if (
            i % self.config.sliding_window_pattern
            == self.config.sliding_window_pattern - 1
        ):
            caches.append(KVCache())
        elif sliding_window == 0:
            caches.append(KVCache())  # Full context for all layers
        else:
            caches.append(RotatingKVCache(max_size=sliding_window, keep=0))
    return caches

Usage:

# Default (~10K max context)
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Extended context (~40K max)
GEMMA3_SLIDING_WINDOW=8192 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Maximum context (~50K max)
GEMMA3_SLIDING_WINDOW=0 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

Benchmark Results (M4 Max 128GB):

Setting                     Max Context  Memory
Default (1024)              ~10K tokens  ~16GB
GEMMA3_SLIDING_WINDOW=8192  ~40K tokens  ~25GB
GEMMA3_SLIDING_WINDOW=0     ~50K tokens  ~35GB

Contributing

We welcome contributions! See Contributing Guide for details.

  • Bug fixes and improvements
  • Performance optimizations
  • Documentation improvements
  • Benchmarks on different Apple Silicon chips

Submit PRs to: https://github.com/waybarrios/vllm-mlx

License

Apache 2.0 - see LICENSE for details.

Citation

If you use vLLM-MLX in your research or project, please cite:

@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title = {vLLM-MLX: Apple Silicon MLX Backend for vLLM},
  year = {2025},
  url = {https://github.com/waybarrios/vllm-mlx},
  note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}

Acknowledgments

  • MLX - Apple's ML framework
  • mlx-lm - LLM inference library
  • mlx-vlm - Vision-language models
  • mlx-audio - Text-to-Speech and Speech-to-Text
  • mlx-embeddings - Text embeddings
  • vLLM - High-throughput LLM serving

Tech Stack

Python · Go · LLM · OpenAI · Claude · Anthropic



Active · Last commit today · 37 open issues · Submitted December 6, 2025
