
vllm-mlx

MCP Server

OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA)...

Why now: Moving now

Fresh repo activity plus visible builder pull. This is the kind of tool people test before it becomes obvious.

Decision: High-conviction move

Copy the install, test the workflow, then decide if it earns a permanent slot.

Trial cost: Fast eval

You can test this quickly and remove it cleanly if it misses.

Risk: 35/100

GitHub health 42/100. No security policy. 37 open issues make this testable, but not something to trust blindly.

What You Are Adopting

AI Agent: Claude Code
Model: Multiple
Build Time: Instant

Test This In Your Stack

One command in · Clean rollback · Low commitment
Registry: Adds a named entry to Claude config. One command to remove.

Fastest way to find out if vllm-mlx belongs in your setup.

Copy the install command, run a real test, and back it out cleanly if it slows you down.

Try now
claude mcp add vllm-mlx -- npx vllm-mlx

Run this first. You will know quickly if the workflow earns a permanent slot.

Back out
claude mcp remove vllm-mlx

No messy cleanup loop. If it misses, remove it and keep moving.

Install Location

~/
└─ .claude.json
   └─ mcp_servers/
      └─ vllm-mlx ← registers here
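
A quick way to confirm the entry landed, as a stdlib-only sketch. The config key name is an assumption (shown as mcp_servers in the tree above; some Claude Code versions use mcpServers), so check your own file if neither matches:

import json
from pathlib import Path

# List the MCP servers registered in the Claude config.
# ASSUMPTION: the key is "mcpServers" or "mcp_servers" -- inspect
# your own ~/.claude.json if the layout differs.
config = json.loads((Path.home() / ".claude.json").read_text())
servers = config.get("mcpServers") or config.get("mcp_servers") or {}
print(list(servers))  # expect "vllm-mlx" after the add command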

About

OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code. An open-source MCP server for the AI coding ecosystem.

README

vLLM-MLX

vLLM-like inference for Apple Silicon - GPU-accelerated Text, Image, Video & Audio on Mac


Overview

vllm-mlx brings native Apple Silicon GPU acceleration to vLLM by integrating:

  • MLX: Apple's ML framework with unified memory and Metal kernels
  • mlx-lm: Optimized LLM inference with KV cache and quantization
  • mlx-vlm: Vision-language models for multimodal inference
  • mlx-audio: Speech-to-Text and Text-to-Speech with native voices
  • mlx-embeddings: Text embeddings for semantic search and RAG

Features

  • Multimodal - Text, Image, Video & Audio in one platform
  • Native GPU acceleration on Apple Silicon (M1, M2, M3, M4)
  • Native TTS voices - Spanish, French, Chinese, Japanese + 5 more languages
  • OpenAI API compatible - drop-in replacement for OpenAI client
  • Anthropic Messages API - native /v1/messages endpoint for Claude Code and OpenCode
  • Embeddings - OpenAI-compatible /v1/embeddings endpoint with mlx-embeddings
  • Reasoning Models - extract thinking process from Qwen3, DeepSeek-R1
  • MCP Tool Calling - integrate external tools via Model Context Protocol (a request-shape sketch follows this list)
  • Paged KV Cache - memory-efficient caching with prefix sharing
  • Continuous Batching - high throughput for multiple concurrent users
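
On the tool-calling point: MCP wires external tools into the server, and from the client side an OpenAI-compatible server would presumably surface them through the standard tools/tool_calls fields. A minimal request-shape sketch, assuming OpenAI-style tool definitions; the weather tool here is made up for illustration, and the MCP & Tool Calling guide documents the real integration:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Hypothetical tool definition, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Weather in Lima?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)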

Quick Start

Installation

Using uv (recommended):

# Install as CLI tool (system-wide)
uv tool install git+https://github.com/waybarrios/vllm-mlx.git

# Or install in a project/virtual environment
uv pip install git+https://github.com/waybarrios/vllm-mlx.git

Using pip:

# Install from GitHub
pip install git+https://github.com/waybarrios/vllm-mlx.git

# Or clone and install in development mode
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .

Start Server

# Simple mode (single user, max throughput)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-secret-key
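
Once the server is up, a quick liveness check against the /health endpoint (the same endpoint the Gemma 3 section below queries) needs only the standard library:

import json
import urllib.request

# Poll the health endpoint; per the Gemma 3 section below it
# returns fields such as "model_type".
with urllib.request.urlopen("http://localhost:8000/health") as resp:
    print(json.load(resp))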

Use with OpenAI SDK

from openai import OpenAI

# Without API key (local development)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# With API key (production)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
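
Streaming should work through the same client; a minimal sketch, assuming the server supports the standard stream=True flag (typical for OpenAI-compatible servers, but confirm against the server docs):

# Reuses the client from the example above.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    # Guard against keep-alive chunks with no delta content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)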

Use with Anthropic SDK

vllm-mlx exposes an Anthropic-compatible /v1/messages endpoint, so tools like Claude Code and OpenCode can connect directly.

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

response = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)

To use with Claude Code:

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude

See Anthropic Messages API docs for streaming, tool calling, system messages, and token counting.
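
For streaming specifically, the Anthropic SDK's standard streaming helper should apply, assuming the /v1/messages endpoint implements Anthropic's SSE streaming protocol (an assumption; the docs above are authoritative):

# Reuses the Anthropic client from the example above.
with client.messages.stream(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Tell me a short story."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)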

Multimodal (Images & Video)

vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)
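
For local files, the usual OpenAI-style approach of inlining the image as a base64 data URL should also work (an assumption; the multimodal guide lists the supported formats). The file name here is a placeholder:

import base64

# Inline a local image as a base64 data URL (standard OpenAI-style
# message format); "photo.jpg" is a placeholder.
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)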

Audio (TTS/STT)

# Install audio dependencies
pip install vllm-mlx[audio]
python -m spacy download en_core_web_sm
brew install espeak-ng  # macOS, for non-English languages

# Text-to-Speech (English)
python examples/tts_example.py "Hello, how are you?" --play

# Text-to-Speech (Spanish)
python examples/tts_multilingual.py "Hola mundo" --lang es --play

# List available models and languages
python examples/tts_multilingual.py --list-models
python examples/tts_multilingual.py --list-languages

Supported TTS Models:

Model       Languages                        Description
Kokoro      EN, ES, FR, JA, ZH, IT, PT, HI   Fast, 82M params, 11 voices
Chatterbox  15+ languages                    Expressive, voice cloning
VibeVoice   EN                               Realtime, low latency
VoxCPM      ZH, EN                           High quality Chinese/English
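
On the STT side (whisper models, benchmarked under Performance below), the examples here are script-based. If vllm-mlx also mirrors OpenAI's /v1/audio/transcriptions endpoint, which is an assumption not confirmed by this README (the Audio guide is authoritative), transcription through the OpenAI SDK would look like:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# ASSUMPTION: the server exposes OpenAI's transcription endpoint.
# "meeting.wav" is a placeholder file name.
with open("meeting.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="default", file=audio)
print(transcript.text)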

Reasoning Models

Extract the thinking process from reasoning models like Qwen3 and DeepSeek-R1:

# Start server with reasoning parser
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 × 23?"}]
)

# Access reasoning separately from the answer
print("Thinking:", response.choices[0].message.reasoning)
print("Answer:", response.choices[0].message.content)

Supported Parsers:

Parser       Models         Description
qwen3        Qwen3 series   Requires both <think> and </think> tags
deepseek_r1  DeepSeek-R1    Handles implicit <think> tag

Embeddings

Generate text embeddings for semantic search, RAG, and similarity:

# Start server with an embedding model pre-loaded
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --embedding-model mlx-community/all-MiniLM-L6-v2-4bit

# Generate embeddings using the OpenAI SDK
embeddings = client.embeddings.create(
    model="mlx-community/all-MiniLM-L6-v2-4bit",
    input=["Hello world", "How are you?"]
)
print(f"Dimensions: {len(embeddings.data[0].embedding)}")

See Embeddings Guide for details on supported models and lazy loading.

Documentation

For full documentation, see the docs directory:

  • Getting Started

    • Installation
    • Quick Start
  • User Guides

    • OpenAI-Compatible Server
    • Anthropic Messages API
    • Python API
    • Multimodal (Images & Video)
    • Audio (STT/TTS)
    • Embeddings
    • Reasoning Models
    • MCP & Tool Calling
    • Continuous Batching
  • Reference

    • CLI Commands
    • Supported Models
    • Configuration
  • Benchmarks

    • LLM Benchmarks
    • Image Benchmarks
    • Video Benchmarks
    • Audio Benchmarks

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           vLLM API Layer                                │
│                    (OpenAI-compatible interface)                         │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            MLXPlatform                                  │
│               (vLLM platform plugin for Apple Silicon)                  │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
        ┌─────────────┬────────────┴────────────┬─────────────┐
        ▼             ▼                         ▼             ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│    mlx-lm     │ │   mlx-vlm     │ │   mlx-audio   │ │mlx-embeddings │
│(LLM inference)│ │ (Vision+LLM)  │ │  (TTS + STT)  │ │ (Embeddings)  │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
        │             │                         │             │
        └─────────────┴─────────────────────────┴─────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                              MLX                                        │
│                (Apple ML Framework - Metal kernels)                      │
└─────────────────────────────────────────────────────────────────────────┘

Performance

LLM Performance (M4 Max, 128GB):

Model              Speed      Memory
Qwen3-0.6B-8bit    402 tok/s  0.7 GB
Llama-3.2-1B-4bit  464 tok/s  0.7 GB
Llama-3.2-3B-4bit  200 tok/s  1.8 GB

Continuous Batching (5 concurrent requests):

Model              Single     Batched     Speedup
Qwen3-0.6B-8bit    328 tok/s  1112 tok/s  3.4x
Llama-3.2-1B-4bit  299 tok/s  613 tok/s   2.0x

Audio - Speech-to-Text (M4 Max, 128GB):

Model                   RTF*  Use Case
whisper-tiny            197x  Real-time, low latency
whisper-large-v3-turbo  55x   Best quality/speed balance
whisper-large-v3        24x   Highest accuracy

*RTF = Real-Time Factor. RTF of 100x means 1 minute transcribes in ~0.6 seconds.
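
Equivalently, transcription time is audio duration divided by RTF; a quick check against the table above:

# Transcription time = audio duration / RTF.
duration_s = 60.0  # one minute of audio
for model, rtf in [("whisper-tiny", 197),
                   ("whisper-large-v3-turbo", 55),
                   ("whisper-large-v3", 24)]:
    print(f"{model}: ~{duration_s / rtf:.1f}s per minute of audio")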

See benchmarks for detailed results.

Gemma 3 Support

vllm-mlx includes native support for Gemma 3 vision models. Gemma 3 is automatically detected as an MLLM (multimodal LLM).

Usage

# Start server with Gemma 3
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Verify it loaded as MLLM (not LLM)
curl http://localhost:8000/health
# Should show: "model_type": "mllm"

Long Context Patch (mlx-vlm)

Gemma 3's default sliding_window=1024 limits context to ~10K tokens on Apple Silicon (Metal GPU timeout at higher context). To enable longer context (up to ~50K tokens), patch mlx-vlm:

Location: ~/.../site-packages/mlx_vlm/models/gemma3/language.py

Find the make_cache method and replace it with:

def make_cache(self):
    import os
    # Set GEMMA3_SLIDING_WINDOW=8192 for ~40K context
    # Set GEMMA3_SLIDING_WINDOW=0 for ~50K context (full KVCache)
    sliding_window = int(os.environ.get('GEMMA3_SLIDING_WINDOW', self.config.sliding_window))

    caches = []
    for i in range(self.config.num_hidden_layers):
        if (
            i % self.config.sliding_window_pattern
            == self.config.sliding_window_pattern - 1
        ):
            caches.append(KVCache())
        elif sliding_window == 0:
            caches.append(KVCache())  # Full context for all layers
        else:
            caches.append(RotatingKVCache(max_size=sliding_window, keep=0))
    return caches

Usage:

# Default (~10K max context)
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Extended context (~40K max)
GEMMA3_SLIDING_WINDOW=8192 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Maximum context (~50K max)
GEMMA3_SLIDING_WINDOW=0 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

Benchmark Results (M4 Max 128GB):

Setting                     Max Context  Memory
Default (1024)              ~10K tokens  ~16GB
GEMMA3_SLIDING_WINDOW=8192  ~40K tokens  ~25GB
GEMMA3_SLIDING_WINDOW=0     ~50K tokens  ~35GB

Contributing

We welcome contributions! See Contributing Guide for details.

  • Bug fixes and improvements
  • Performance optimizations
  • Documentation improvements
  • Benchmarks on different Apple Silicon chips

Submit PRs to: https://github.com/waybarrios/vllm-mlx

License

Apache 2.0 - see LICENSE for details.

Citation

If you use vLLM-MLX in your research or project, please cite:

@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title = {vLLM-MLX: Apple Silicon MLX Backend for vLLM},
  year = {2025},
  url = {https://github.com/waybarrios/vllm-mlx},
  note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}

Acknowledgments

  • MLX - Apple's ML framework
  • mlx-lm - LLM inference library
  • mlx-vlm - Vision-language models
  • mlx-audio - Text-to-Speech and Speech-to-Text
  • mlx-embeddings - Text embeddings
  • vLLM - High-throughput LLM serving

Tech Stack

Python · Go · LLM · OpenAI · Claude · Anthropic



Active · Last commit today · 37 open issues · Submitted December 6, 2025
