Helps users capture screenshots in the background on Windows and automate window, mouse, and keyboard operations.
Enables text-only agents to process images by accepting image files, base64 data, or URLs, sending them to multimodal models, and returning structured text results via MCP.
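
As a rough illustration of the "image files, base64 data, or URLs" input handling described above, the following minimal sketch normalizes the three input forms into one tagged structure before anything is sent to a model. The function name `classify_image_input`, the returned dictionary keys, and the detection order are all assumptions made for illustration; the actual tool may handle inputs differently.

```python
import base64
import binascii
import mimetypes
import os
from urllib.parse import urlparse


def classify_image_input(value: str) -> dict:
    """Normalize an image input into a tagged dict (hypothetical helper).

    Accepts an http(s) URL, a data URI, a local file path, or raw base64 text.
    """
    if not value:
        raise ValueError("empty image input")

    # 1. Remote image: pass the URL through unchanged.
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return {"kind": "url", "url": value}

    # 2. Data URI such as "data:image/png;base64,....".
    if value.startswith("data:") and ";base64," in value:
        header, payload = value.split(";base64,", 1)
        try:
            base64.b64decode(payload, validate=True)
        except binascii.Error as exc:
            raise ValueError("invalid base64 payload in data URI") from exc
        return {"kind": "base64", "media_type": header[len("data:"):], "data": payload}

    # 3. Local file: read the bytes and encode them as base64.
    if os.path.isfile(value):
        media_type = mimetypes.guess_type(value)[0] or "application/octet-stream"
        with open(value, "rb") as handle:
            encoded = base64.b64encode(handle.read()).decode("ascii")
        return {"kind": "base64", "media_type": media_type, "data": encoded}

    # 4. Raw base64 text with no media type information.
    try:
        base64.b64decode(value, validate=True)
    except binascii.Error as exc:
        raise ValueError("input is not a URL, data URI, file, or base64 string") from exc
    return {"kind": "base64", "media_type": "application/octet-stream", "data": value}
```

Checking for a URL and a data URI before touching the filesystem keeps the common remote cases cheap; raw base64 is tried last because short strings can be ambiguous with file names.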