A native macOS MCP server that gives Claude Code full desktop control — 22 tools for app automation, plus zero-image screen understanding via Vision OCR.
No screenshots. No vision models. Pure structured data.
- Old way: Screenshot → vision model → process pixels → guess coordinates → click → miss → repeat
- ai-os-mcp: OCR text + coordinates (~250 ms) → Claude reads JSON → `click_at(x, y)` → done
ai-os-mcp gives Claude Code the ability to see and control any macOS application through:
- Vision OCR — Extracts all on-screen text with pixel coordinates using macOS Vision framework. Works for every app (native, Chromium, Electron). Zero images — Claude processes JSON, not pixels.
- Accessibility Tree — Semantic UI structure for native apps (buttons, menus, text fields with actions).
- Direct Control — Mouse clicks, keyboard input, app launching, window management, AppleScript execution.
- Browser Companion — Separate Node.js MCP server for Playwright-based browser control via CDP.
| Tool | Description |
|---|---|
| `get_screen` | OCR the screen — returns all text with pixel coordinates as JSON. ~250 ms. |
| `act_and_see` | Perform an action AND return the OCR result in one call. |
| `run_macro` | Execute multiple actions in ONE call, OCR once at the end. Eliminates round-trips. |
| Tool | Description |
|---|---|
| `get_running_apps` | List all GUI apps with name, PID, and bundle ID |
| `get_frontmost_app` | Get the focused app |
| `get_ax_tree` | Read the accessibility tree of any app |
| `click_element` | Click an element by semantic search (title/ID/description) |
| `type_text` | Type text into the focused or a found element |
| `press_key` | Send keyboard shortcuts (Cmd+C, Return, etc.) |
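To give a sense of what the accessibility tree carries, a single `get_ax_tree` element could look roughly like the sketch below. The field names are an assumption for illustration (the `AXButton` role and `AXPress` action are standard macOS Accessibility constants, but the server's actual JSON shape may differ):

```json
{
  "role": "AXButton",
  "title": "Play",
  "position": { "x": 512, "y": 740 },
  "size": { "w": 32, "h": 32 },
  "actions": ["AXPress"]
}
```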
| Tool | Description |
|---|---|
| `mouse_click_at` | Click at screen coordinates (from OCR or AX positions) |
| `mouse_drag` | Drag between two points with smooth interpolation |
| `scroll` | Scroll within an app or element |
| Tool | Description |
|---|---|
| `open_application` | Launch an app by name or bundle ID |
| `open_url` | Open a URL in the default browser |
| `navigate_url` | Open a URL in a specific browser (one-call activate + navigate) |
| `manage_window` | Resize, move, minimize, maximize, fullscreen, restore |
| Tool | Description |
|---|---|
| `run_applescript` | Execute AppleScript or JXA (with safety checks) |
| `get_menu_bar` | Read all menu items for an app |
| `click_menu_item` | Click a menu item by path (e.g. "File > Export > PDF") |
| `read_pasteboard` | Read the clipboard (text, HTML, RTF, file URLs) |
| `write_pasteboard` | Write to the clipboard |
| `take_screenshot` | Capture the screen to a file (fallback when OCR isn't enough) |
11 tools via Playwright CDP: `browser_connect`, `browser_navigate`, `browser_get_dom`, `browser_get_text`, `browser_click`, `browser_type`, `browser_select`, `browser_fill_form`, `browser_execute_js`, `browser_get_tabs`, `browser_switch_tab`.
- macOS 13.0+ (Ventura)
- Xcode 16.0+ (Swift 6.0)
- Node.js 20+ (for browser companion, optional)
```sh
git clone https://github.com/charantejmandali18/ai-os-mcp.git
cd ai-os-mcp
./scripts/install.sh
```

This builds both servers, installs them, and configures Claude Desktop.
- Accessibility — System Settings > Privacy & Security > Accessibility > add `~/.local/bin/ai-os-mcp`
- Screen Recording — System Settings > Privacy & Security > Screen Recording > add `ai-os-mcp` (required for `get_screen` / Vision OCR)
```sh
claude mcp add ai-os-mcp -- ~/.local/bin/ai-os-mcp
```

Or add to `.mcp.json` in your project:

```json
{
  "mcpServers": {
    "ai-os-mcp": {
      "type": "stdio",
      "command": "/Users/YOUR_USERNAME/.local/bin/ai-os-mcp"
    }
  }
}
```

Open a website and click a link:
```
get_screen                                      → see all text + coordinates
act_and_see(app="Dia", action="navigate",
            url="example.com")                  → navigate + return OCR
act_and_see(app="Dia", action="click_at",
            x=1091, y=118)                      → click at coords from OCR
```
Play music on Spotify:
```
open_application(app_name="Spotify")
press_key(key="k", modifiers=["command"], app_name="Spotify")
type_text(text="gym rush", app_name="Spotify")
press_key(key="return", app_name="Spotify")
run_applescript(script='tell application "Spotify" to play')
```
Create a Google Doc:
```
navigate_url(app_name="Dia", url="docs.new")
type_text(text="My Document Title\n\nBody content here...", app_name="Dia")
```
Instead of screenshots, ai-os-mcp uses a persistent ScreenCaptureKit stream (2 FPS) and macOS Vision framework to OCR the screen:
```
WindowServer → SCStream (2 FPS, in memory) → VNRecognizeTextRequest (~250ms)
                    ↓
JSON: [{text: "CONTACT", x: 1056, y: 109, w: 65, h: 14}, ...]
                    ↓
Claude reads text, calls click_at(1089, 116)
```
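One detail worth noting: the Vision framework reports each observation's bounding box in normalized coordinates with a bottom-left origin, so somewhere in this pipeline those boxes must be converted to the top-left pixel coordinates that appear in the JSON. A minimal sketch of that transform, with a hypothetical `toPixels` helper (not the actual ai-os-mcp source):

```typescript
// Vision-style normalized box: origin at bottom-left, all values in [0, 1].
interface NormalizedBox { x: number; y: number; w: number; h: number }
// Screen-pixel box: origin at top-left, matching the OCR JSON output.
interface PixelBox { x: number; y: number; w: number; h: number }

// Convert a normalized Vision bounding box to top-left pixel coordinates.
function toPixels(box: NormalizedBox, screenW: number, screenH: number): PixelBox {
  return {
    x: Math.round(box.x * screenW),
    // Flip the y axis: Vision's y measures up from the bottom edge.
    y: Math.round((1 - box.y - box.h) * screenH),
    w: Math.round(box.w * screenW),
    h: Math.round(box.h * screenH),
  };
}

// A box hugging the top-left corner in Vision coordinates:
const px = toPixels({ x: 0, y: 0.5, w: 0.5, h: 0.5 }, 1000, 1000);
console.log(px); // → { x: 0, y: 0, w: 500, h: 500 }
```

The y-flip is the step that is easy to get wrong; a box near the top of the screen has a large Vision `y` but a small pixel `y`.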
- Frame is always in memory — zero capture latency
- FNV-1a hash detects changes — skip OCR if screen unchanged
- Coordinates scaled to real screen pixels — pass directly to `click_at`
- Claude processes JSON text, not pixels — orders of magnitude faster than vision models
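The change-detection step above can be sketched as a plain 32-bit FNV-1a hash over the frame's raw bytes: if the hash matches the previous frame's, the OCR pass is skipped. The real server does this in Swift; the TypeScript below is an illustrative sketch with assumed names (`frameChanged` is not an actual API):

```typescript
// 32-bit FNV-1a over raw frame bytes.
function fnv1a(bytes: Uint8Array): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, mod 2^32
  }
  return hash >>> 0;
}

// Skip OCR when the frame is byte-identical to the previous one.
let lastHash = -1;
function frameChanged(frame: Uint8Array): boolean {
  const h = fnv1a(frame);
  if (h === lastHash) return false; // unchanged: reuse the cached OCR result
  lastHash = h;
  return true;
}
```

FNV-1a is a good fit here because it is a few instructions per byte and needs no cryptographic strength; a collision merely means one stale OCR result.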
- Vision OCR (`get_screen`) — works for ALL apps
- AppleScript/JXA — scriptable apps (browsers, Finder, Mail)
- Menu bar — always accessible even when content isn't
- Keyboard — tab, arrows, shortcuts
- Pasteboard — Cmd+A, Cmd+C, read clipboard
- Coordinate click — `click_at` with OCR coordinates
```
Claude Code (AI Brain)
├── ai-os-mcp (Swift, stdio MCP) ── 22 native macOS tools
│   ├── ScreenCaptureKit + Vision OCR (zero-image screen reading)
│   ├── Accessibility APIs (semantic element interaction)
│   ├── CGEvent (mouse, keyboard, scroll)
│   ├── NSWorkspace (app launch, URL open)
│   └── AppleScript/JXA (scriptable app automation)
│
└── ai-os-browser (Node.js, stdio MCP) ── 11 browser tools
    └── Playwright CDP (DOM access, clicks, typing, JS execution)
```
```sh
swift build              # Debug build
swift build -c release   # Release build
swift test               # Run tests (caution: CGEvent tests send real input)
```

After building, sign and install:
```sh
codesign --force --sign - .build/release/ai-os-mcp
cp .build/release/ai-os-mcp ~/.local/bin/ai-os-mcp
codesign --force --sign - ~/.local/bin/ai-os-mcp
```

- Accessibility tree tools (v0.1.0)
- Mouse, scroll, screenshot, window management (v0.2.0)
- AppleScript, menu bar, pasteboard (v0.2.0)
- Browser companion via Playwright (v0.2.0)
- Zero-image Vision OCR (v0.3.0)
- act_and_see compound tool (v0.3.0)
- run_macro batch execution (v0.3.0)
- Persistent ScreenCaptureKit stream (v0.3.0)
- Fix run_macro JSON array parsing
- AX tree caching (30s TTL)
- WebSocket transport for remote access
MIT — see LICENSE