
kreuzberg

MCP Server

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured informati...

8,140
Why now: Moving now

Fresh repo activity plus visible builder pull. This is the kind of tool people test before it turns obvious.

Decision: High-conviction move

Copy the install, test the workflow, then decide if it earns a permanent slot.

Trial cost: Medium lift

Testable in one sitting, but you will likely touch real infra or local setup before you know if it sticks.

Risk: 25/100

GitHub health 62/100. No security policy. Fresh enough repo health and manageable issue load keep the risk controlled.

What You Are Adopting

AI Agent: Universal

Model: Multiple

Build Time: Hours

Test This In Your Stack

One command in · Clean rollback · Low commitment
Registry: Adds a named entry to Claude config. One command to remove.

Fastest way to find out if kreuzberg belongs in your setup.

Copy the install command, run a real test, and back it out cleanly if it slows you down.

Try now
claude mcp add kreuzberg -- npx kreuzberg

Run this first. You will know quickly if the workflow earns a permanent slot.

Back out
claude mcp remove kreuzberg

No messy cleanup loop. If it misses, remove it and keep moving.

Install Location

~/
  └─ .claude.json
    └─ mcp_servers/
      └─ kreuzberg ← registers here
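
To confirm the entry landed where the tree above suggests, a small check against the config file can help. This is a minimal sketch, assuming the file is JSON and that servers sit under a top-level `mcp_servers` key as drawn above; the real key name and layout may differ on your machine. `load_config` and `is_registered` are hypothetical helpers, not part of Claude or Kreuzberg.

```python
import json
from pathlib import Path


def load_config(path: Path) -> dict:
    """Read a JSON config file; return {} if it is missing or not a JSON object."""
    if not path.exists():
        return {}
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return {}
    return data if isinstance(data, dict) else {}


def is_registered(config: dict, name: str, key: str = "mcp_servers") -> bool:
    """Return True if `name` appears under the assumed server-registry key."""
    servers = config.get(key, {})
    return isinstance(servers, dict) and name in servers


if __name__ == "__main__":
    # Assumption: the config lives at ~/.claude.json, as in the tree above.
    config = load_config(Path.home() / ".claude.json")
    print("kreuzberg registered:", is_registered(config, "kreuzberg"))
```

If your config uses a different key, pass it explicitly, for example `is_registered(config, "kreuzberg", key="mcpServers")`.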

About

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

README

Kreuzberg


Extract text and metadata from a wide range of file formats (75+), generate embeddings and post-process at native speeds without needing a GPU.

Key Features

  • Extensible architecture – Plugin system for custom OCR backends, validators, post-processors, and document extractors
  • Polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, Elixir, R, and C
  • 75+ file formats – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
  • OCR support – Tesseract (all bindings, including Tesseract-WASM for browsers), PaddleOCR (all native bindings), EasyOCR (Python), extensible via plugin API
  • High performance – Rust core with native PDFium, SIMD optimizations and full parallelism
  • Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
  • Memory efficient – Streaming parsers for multi-GB files

Complete Documentation | Installation Guides

Installation

Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:

Scripting Languages:

  • Python – PyPI package, async/sync APIs, OCR backends (Tesseract, PaddleOCR, EasyOCR)
  • Ruby – RubyGems package, idiomatic Ruby API, native bindings
  • PHP – Composer package, modern PHP 8.4+ support, type-safe API, async extraction
  • Elixir – Hex package, OTP integration, concurrent processing
  • R – r-universe package, idiomatic R API, extendr bindings

JavaScript/TypeScript:

  • @kreuzberg/node – Native NAPI-RS bindings for Node.js/Bun, fastest performance
  • @kreuzberg/wasm – WebAssembly for browsers/Deno/Cloudflare Workers, full feature parity (PDF, Excel, OCR, archives)

Compiled Languages:

  • Go – Go module with FFI bindings, context-aware async
  • Java – Maven Central, Foreign Function & Memory API
  • C# – NuGet package, .NET 6.0+, full async/await support

Native:

  • Rust – Core library, flexible feature flags, zero-copy APIs
  • C (FFI) – C header + shared library, pkg-config/CMake support, cross-platform

Containers:

  • Docker – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)

Command-Line:

  • CLI – Cross-platform binary, batch processing, MCP server mode

All language bindings include precompiled binaries for x86_64 and aarch64 on Linux and for Apple Silicon (ARM64) on macOS.

Platform Support

Complete architecture coverage across all language bindings:

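| Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64 |
| --- | --- | --- | --- | --- |
| Python | ✅ | ✅ | ✅ | ✅ |
| Node.js | ✅ | ✅ | ✅ | ✅ |
| WASM | ✅ | ✅ | ✅ | ✅ |
| Ruby | ✅ | ✅ | ✅ | - |
| R | ✅ | ✅ | ✅ | ✅ |
| Elixir | ✅ | ✅ | ✅ | ✅ |
| Go | ✅ | ✅ | ✅ | ✅ |
| Java | ✅ | ✅ | ✅ | ✅ |
| C# | ✅ | ✅ | ✅ | ✅ |
| PHP | ✅ | ✅ | ✅ | ✅ |
| Rust | ✅ | ✅ | ✅ | ✅ |
| C (FFI) | ✅ | ✅ | ✅ | ✅ |
| CLI | ✅ | ✅ | ✅ | ✅ |
| Docker | ✅ | ✅ | ✅ | - |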

Note: ✅ = Precompiled binaries available with instant installation. WASM runs in any environment with WebAssembly support (browsers, Deno, Bun, Cloudflare Workers). All platforms are tested in CI. macOS support is Apple Silicon only.
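
The matrix above can be encoded as a small lookup, which is handy in setup scripts that need to decide whether to expect a prebuilt binary. This is a sketch that only restates the table; `has_precompiled` and the constant names are illustrative and not part of Kreuzberg.

```python
# Platform columns, in the order used by the table above.
PLATFORMS = ("Linux x86_64", "Linux aarch64", "macOS ARM64", "Windows x64")

# Every row of the table.
BINDINGS = (
    "Python", "Node.js", "WASM", "Ruby", "R", "Elixir", "Go",
    "Java", "C#", "PHP", "Rust", "C (FFI)", "CLI", "Docker",
)

# Only rows that deviate from "all four platforms" need an entry.
MISSING = {
    "Ruby": {"Windows x64"},
    "Docker": {"Windows x64"},
}


def has_precompiled(binding: str, platform: str) -> bool:
    """Return True if the table lists a precompiled binary for this pair."""
    if binding not in BINDINGS:
        raise KeyError(f"unknown binding: {binding}")
    if platform not in PLATFORMS:
        # Platforms absent from the table (e.g. Intel macOS) are not covered.
        return False
    return platform not in MISSING.get(binding, set())


if __name__ == "__main__":
    print(has_precompiled("Ruby", "Windows x64"))
    print(has_precompiled("Python", "Linux aarch64"))
```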

Embeddings Support (Optional)

To use embeddings functionality:

  1. Install ONNX Runtime 1.24+:

    • Linux: Download from ONNX Runtime releases (Debian packages may have older versions)
    • macOS: brew install onnxruntime
    • Windows: Download from ONNX Runtime releases
  2. Use embeddings in your code - see Embeddings Guide

Note: Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.
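
A quick way to check an installed version string against the 1.24 minimum stated above is sketched here. How you obtain the version string depends on your platform and is left to you; `parse_version` and `meets_minimum` are hypothetical helpers, and the comparison treats pre-release suffixes such as `rc1` as if they were the final release.

```python
from __future__ import annotations

# Minimum ONNX Runtime version stated in the note above.
MIN_ONNXRUNTIME = (1, 24)


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.24.1' into (1, 24, 1), dropping any non-numeric suffix."""
    parts = []
    for piece in version.strip().split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple[int, ...] = MIN_ONNXRUNTIME) -> bool:
    """Return True if `version` is at least `minimum` (tuple comparison)."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    for candidate in ("1.23.2", "1.24.0", "2.0.1"):
        print(candidate, meets_minimum(candidate))
```

Comparing integer tuples avoids the string-comparison trap where "1.9" sorts after "1.24".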

Supported Formats

75+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

| Category | Formats | Capabilities |
| --- | --- | --- |
| Word Processing | .docx, .odt | Full text, tables, lists, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .pptm, .ppsx | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |

Images (OCR-Enabled)

| Category | Formats | Features |
| --- | --- | --- |
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | Pure Rust decoders (JPEG 2000, JBIG2), OCR, table detection |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |

Web & Data

| Category | Formats | Features |
| --- | --- | --- |
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .djot, .mdx, .rst, .org, .rtf | CommonMark, GFM, Djot, MDX, reStructuredText, Org Mode, Rich Text |

Email & Archives

| Category | Formats | Features |
| --- | --- | --- |
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, UTF-16 support |
| Archives | .zip, .tar, .tgz, .gz, .7z | Recursive extraction, nested archives, metadata |

Academic & Scientific

| Category | Formats | Features |
| --- | --- | --- |
| Citations | .bib, .ris, .nbib, .enw, .csl | BibTeX/BibLaTeX, RIS, PubMed/MEDLINE, EndNote XML, CSL JSON |
| Scientific | .tex, .latex, .typ, .typst, .jats, .ipynb | LaTeX, Typst, JATS journal articles, Jupyter notebooks |
| Publishing | .fb2, .docbook, .dbk, .opml | FictionBook, DocBook XML, OPML outlines |
| Documentation | .pod, .mdoc, .troff | Perl POD, man pages, troff |

Complete Format Reference →
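
For pre-filtering a folder before handing files to an extractor, the tables above can be turned into a simple extension lookup. This is a partial sketch: it covers only a sample of the listed extensions, relies on the file suffix alone, and says nothing about how Kreuzberg itself detects formats. `category_for` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional

# A partial extension-to-category map, copied from the tables above.
CATEGORY_BY_EXTENSION = {
    ".docx": "Word Processing", ".odt": "Word Processing",
    ".xlsx": "Spreadsheets", ".ods": "Spreadsheets",
    ".pptx": "Presentations",
    ".pdf": "PDF",
    ".png": "Raster", ".jpg": "Raster", ".jpeg": "Raster", ".tiff": "Raster",
    ".html": "Markup", ".xml": "Markup",
    ".json": "Structured Data", ".csv": "Structured Data",
    ".txt": "Text & Markdown", ".md": "Text & Markdown",
    ".eml": "Email", ".msg": "Email",
    ".zip": "Archives", ".gz": "Archives", ".7z": "Archives",
    ".bib": "Citations", ".ipynb": "Scientific",
}


def category_for(filename: str) -> Optional[str]:
    """Return the table category for a filename, or None if it is not mapped."""
    return CATEGORY_BY_EXTENSION.get(Path(filename).suffix.lower())


if __name__ == "__main__":
    for name in ("report.PDF", "notes.md", "backup.tar.gz", "movie.mp4"):
        print(name, "->", category_for(name))
```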

Key Features

OCR with Table Extraction

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.

OCR Backend Documentation →

Batch Processing

Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.

Batch Processing Guide →
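
The general shape of concurrent batch processing with a configurable worker count can be sketched with Python's standard library. This does not use Kreuzberg's own batch API: `extract_stub` is a hypothetical stand-in for a real extraction call, and `process_batch` is an illustrative helper.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def extract_stub(path: str) -> str:
    """Hypothetical stand-in for a real extraction call; returns fake text."""
    return f"text from {path}"


def process_batch(
    paths: Iterable[str],
    extract: Callable[[str], str] = extract_stub,
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run `extract` over all paths with at most `max_workers` threads.

    Results are keyed by path. An exception raised by `extract` propagates
    to the caller when the results are collected.
    """
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, regardless of finish order.
        results = list(pool.map(extract, path_list))
    return dict(zip(path_list, results))


if __name__ == "__main__":
    batch = process_batch(["a.pdf", "b.docx", "c.png"], max_workers=2)
    for path, text in batch.items():
        print(path, "->", text)
```

Swap `extract_stub` for whatever extraction function your binding provides, and tune `max_workers` to the machine.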

Password-Protected PDFs

Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.

PDF Configuration →

Language Detection

Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.

Language Detection Guide →

Metadata Extraction

Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.

Metadata Guide →

AI Coding Assistants

Kreuzberg ships with an Agent Skill that teaches AI coding assistants how to use the library correctly. It works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard.

Install the skill into any project using the Vercel Skills CLI:

npx skills add kreuzberg-dev/kreuzberg

The skill is located at skills/kreuzberg/SKILL.md and is automatically discovered by supported AI coding tools once installed.

Documentation

  • Installation Guide – Setup and dependencies
  • User Guide – Comprehensive usage guide
  • API Reference – Complete API documentation
  • Format Support – Supported file formats
  • OCR Backends – OCR engine setup
  • CLI Guide – Command-line usage
  • Migration Guide – Upgrading from v3

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details. You can use Kreuzberg freely in both commercial and closed-source products with no obligations, no viral effects, and no licensing restrictions.

Tech Stack

Rust, Elixir, Python, Java, Go, PHP, Ruby, Docker, TypeScript, JavaScript, Claude, Vercel, Cloudflare, Bun, Deno


Active: Last commit today
34 open issues
Submitted April 29, 2026
