wraith/docs/prompts/2026-03-17-ai-copilot-integration.md


# Mission Brief: Wraith AI Copilot Integration
Full boot sequence first — CLAUDE.md, AGENTS.md, Memory MCP. Read the spec at `docs/superpowers/specs/2026-03-17-wraith-desktop-design.md` and both phase plans before you start.
---
## The Mission
Design and build a first-class AI copilot integration into Wraith. Not a chatbot sidebar. Not a prompt window. A co-pilot seat where any XO (Claude instance) can:
1. **See what the Commander sees** — in any RDP session, receive the screen as a live visual feed (FreeRDP3 bitmap frames → vision input). No Playwright needed. The RDP session IS the browser.
2. **Type what the Commander types** — in any SSH/terminal session, read stdout in real-time and write to stdin. Full bidirectional terminal I/O. The XO can run commands, read output, navigate filesystems, edit files, run builds — everything a human can do in a terminal.
3. **Click what the Commander clicks** — in any RDP session, emulate mouse movements, clicks, scrolls, and keyboard input via FreeRDP3's input channel. The XO can navigate a Windows desktop, open applications, click buttons, fill forms, interact with any GUI application.
4. **Do development work** — an XO can open an SSH session to a dev machine, cd to a repo, run a build, open an RDP session to the same machine, navigate to `localhost:3000` in a browser, and visually verify the output — all without Playwright, all through Wraith's native protocol channels.
5. **Collaborate in real-time** — the Commander and the XO see the same sessions. The Commander can watch the XO work, take over at any time, or let the XO drive. Shared context, shared view, shared control.
---
## Design Requirements
### SSH/Terminal Integration
The XO needs these capabilities on any active SSH session:
- **Read terminal output** — subscribe to the `ssh:data:{sessionId}` event stream. Receive raw terminal output as it happens.
- **Write terminal input** — call `SSHService.Write(sessionId, data)` to type commands.
- **Read CWD** — use the OSC 7 CWD tracker (already built in Phase 2) to know the current directory.
- **Resize terminal** — call `SSHService.Resize(sessionId, cols, rows)` if needed.
- **SFTP operations** — use `SFTPService` methods to read/write files, upload/download, navigate the remote filesystem.
This means the XO can: ssh into a Linux box, `cd /var/log`, `tail -f syslog`, read the output, identify an issue, `vim /etc/nginx/nginx.conf`, make an edit via stdin keystrokes, save, `systemctl restart nginx`, verify the fix — all autonomously.
### RDP Vision Integration
The XO needs to see the remote desktop:
- **Frame capture** — FreeRDP3 already decodes RDP bitmap updates. Capture the current screen state as an image (JPEG/PNG) at a configurable interval or on-demand.
- **Frame → AI vision** — send the captured frame to the Claude API as an image input. The XO receives it as visual context — it can read text on screen, identify UI elements, understand application state.
- **Configurable capture rate** — the Commander controls how often frames are sent (e.g., on-demand, every 5 seconds, or continuous for active work). Token cost matters — don't stream 30fps to the API.
- **Region-of-interest** — optionally crop to a specific region of the screen for focused analysis (e.g., "watch this log window").
### RDP Input Emulation
The XO needs to interact with the remote desktop:
- **Mouse** — move to coordinates, left/right click, double-click, scroll, drag. FreeRDP3 has input channels for all of these.
- **Keyboard** — send keystrokes, key combinations (Ctrl+C, Alt+Tab, Win+R), and text strings. Support both individual key events and bulk text entry.
- **Coordinate mapping** — the XO specifies actions in terms of what it sees in the frame ("click the OK button at approximately x=450, y=320"). The integration layer maps pixel coordinates to RDP input coordinates.
This means the XO can: connect to a Windows server via RDP, see the desktop, open a browser (Win+R → "chrome" → Enter), navigate to a URL (click address bar → type URL → Enter), read the page content via vision, interact with web applications — all without Playwright or any browser automation tool.
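The coordinate-mapping step is a simple scale when the frame sent to the vision model was downscaled from the native desktop resolution. A sketch (function name and parameters are illustrative):

```go
package main

import "fmt"

// mapToRDP converts a coordinate the XO picked out of a captured frame into
// the remote desktop's native coordinate space. Needed whenever the capture
// was downscaled before being sent to the vision API.
func mapToRDP(x, y, frameW, frameH, deskW, deskH int) (int, int) {
	return x * deskW / frameW, y * deskH / frameH
}

func main() {
	// XO saw the OK button at (450, 320) in a 1280x720 capture of a
	// 1920x1080 desktop.
	rx, ry := mapToRDP(450, 320, 1280, 720, 1920, 1080)
	fmt.Println(rx, ry) // 675 480
}
```

If captures are sent at native resolution this mapping is the identity, but keeping it explicit means ROI crops (which also shift the origin) can be folded in later.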
### The AI Service Layer
Build a Go service (`internal/ai/`) that:
```
AIService
├── Connect to Claude API (Anthropic SDK or raw HTTP)
├── Manage conversation context (system prompt + message history)
├── Tool definitions for SSH, SFTP, RDP input, RDP vision
├── Process tool calls → dispatch to Wraith services
├── Stream responses to the frontend (chat panel)
└── Handle multi-session awareness (which sessions exist, which is active)
```
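The "process tool calls → dispatch" box in the tree is the heart of the service: the model returns a `tool_use` block with a name and JSON input, and the service routes it to the matching Wraith call. A minimal sketch of that router, with the service hookups stubbed out:

```go
package main

import "fmt"

// dispatch routes one tool call from the model to the corresponding Wraith
// service. The real version would call SSHService, SFTPService, and the RDP
// input layer; here each branch just reports what it would have done.
func dispatch(name string, input map[string]any) (string, error) {
	switch name {
	case "terminal_write":
		return fmt.Sprintf("typed %q in %v", input["text"], input["sessionId"]), nil
	case "rdp_click":
		return fmt.Sprintf("clicked (%v, %v) in %v", input["x"], input["y"], input["sessionId"]), nil
	default:
		return "", fmt.Errorf("unknown tool %q", name)
	}
}

func main() {
	out, _ := dispatch("terminal_write", map[string]any{
		"sessionId": "sess-1",
		"text":      "ls -la\n",
	})
	fmt.Println(out)
}
```

The string each branch returns doubles as the `tool_result` content sent back to the model and as the tool-call visualization line shown in the chat panel.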
**Tool definitions the AI should have access to:**
```
Terminal Tools:
- terminal_write(sessionId, text) — type into a terminal
- terminal_read(sessionId) — get recent terminal output
- terminal_cwd(sessionId) — get current working directory
File Tools:
- sftp_list(sessionId, path) — list directory
- sftp_read(sessionId, path) — read file content
- sftp_write(sessionId, path, content) — write file
- sftp_upload(sessionId, local, remote) — upload a local file
- sftp_download(sessionId, remote, local) — download a remote file
RDP Tools:
- rdp_screenshot(sessionId) — capture current screen
- rdp_click(sessionId, x, y, button) — mouse click
- rdp_doubleclick(sessionId, x, y)
- rdp_type(sessionId, text) — type text string
- rdp_keypress(sessionId, key) — single key or combo (ctrl+c, alt+tab)
- rdp_scroll(sessionId, x, y, delta) — scroll wheel
- rdp_move(sessionId, x, y) — move mouse
Session Tools:
- list_sessions() — what's currently open
- connect_ssh(connectionId) — open a new SSH session
- connect_rdp(connectionId) — open a new RDP session
- disconnect(sessionId) — close a session
```
### Frontend: The Copilot Panel
A collapsible panel (right side or bottom) that shows the AI interaction:
- **Chat messages** — the conversation between Commander and XO
- **Tool call visualization** — when the XO executes a tool, show what it did (e.g., "Typed `ls -la` in Terminal 1", "Clicked at (450, 320) in RDP 2", "Read /etc/nginx/nginx.conf")
- **Screen capture preview** — when the XO takes an RDP screenshot, show a thumbnail in the chat
- **Session awareness indicator** — show which session the XO is currently focused on
- **Take control / Release control** — the Commander can let the XO drive a session or take it back
- **Quick commands** — "Watch this session", "Fix this error", "Deploy this", "What's on screen?"
### Interaction Model
The Commander and XO interact through natural language in the chat panel. The XO has access to all tools and uses them autonomously based on the conversation:
```
Commander: "SSH into asgard and check if the nginx service is running"
XO: [calls connect_ssh(asgardConnectionId)]
[calls terminal_write(sessionId, "systemctl status nginx")]
[calls terminal_read(sessionId)]
"Nginx is active (running) since March 15. PID 1234, 3 worker processes.
Memory usage is 45MB. No errors in the last 50 journal lines."
Commander: "Open RDP to dc01 and check the Event Viewer for any critical errors"
XO: [calls connect_rdp(dc01ConnectionId)]
[calls rdp_screenshot(sessionId)]
"I can see the Windows Server desktop. Opening Event Viewer..."
[calls rdp_keypress(sessionId, "win+r")]
[calls rdp_type(sessionId, "eventvwr.msc")]
[calls rdp_keypress(sessionId, "enter")]
[waits 2 seconds]
[calls rdp_screenshot(sessionId)]
"Event Viewer is open. I can see 3 critical errors in the System log from today.
Let me click into the first one..."
[calls rdp_click(sessionId, 320, 280, "left")]
[calls rdp_screenshot(sessionId)]
"Critical error: The Kerberos client received a KRB_AP_ERR_MODIFIED error
from the server dc02$. This usually indicates a DNS or SPN misconfiguration..."
```
---
## Architecture Constraints
- **Claude API key** stored in the encrypted vault (same Argon2id + AES-256-GCM as credentials)
- **Token budget awareness** — track token usage per conversation, warn at thresholds
- **Conversation persistence** — save conversations to SQLite, resume across sessions
- **No external dependencies** — the AI service is a Go package using the Claude API directly (HTTP + SSE streaming), not a Python sidecar
- **Model selection** — configurable in settings (claude-sonnet-4-5-20250514, claude-opus-4-5-20250414, etc.)
- **Streaming responses** — SSE from Claude API → Wails events → Vue frontend, token by token
---
## What to Build
Design this system fully (spec it out), then implement it. Phase it if needed — terminal integration first (lower complexity, immediate value), then RDP vision, then RDP input. But design the whole thing upfront so the architecture supports all three from day one.
The end state: a single Wraith window where a human and an AI work side by side on remote systems, sharing vision, sharing control, sharing context. The AI sees what you see. The AI types what you'd type. And you can take the wheel whenever you want.
Build it.