Build local AI agents on Windows to securely run on-device assistants and double inference performance
Want to build local AI agents on Windows securely? Use Microsoft eXecution Containers with NVIDIA OpenShell to sandbox actions, then run fast models on RTX GPUs with llama.cpp or vLLM. Add multi-GPU scaling, Windows AI APIs, and tools like NemoClaw and Hermes to deliver safe, quick, always-on assistants on your PC.
AI agents are moving from cloud demos to daily desktop tools. New work from Microsoft and NVIDIA makes it easier to set up, secure, and speed up agents on Windows PCs. You can isolate agent actions, boost inference speed, and scale across two GPUs—all without leaving your main workflow.
Secure the agent first: isolation, identity, and policy
Microsoft eXecution Containers (MXC)
MXC gives each agent a safety box. It sets identity and policy. It limits file access and code execution so an agent cannot touch your whole system. This helps stop prompt injection and data leaks when agents read email, edit files, or use apps.
NVIDIA OpenShell on Windows
OpenShell uses MXC and adds policy tools, inference routing, and PII obfuscation. It packages the hard parts so you can deploy always-on agents with clear guardrails. OpenClaw and Hermes Agent plan to use this stack to harden their Windows security.
Hardware built for personal AI
RTX Spark systems for developers
NVIDIA RTX Spark desktops and laptops bring up to 1 petaflop of AI power and as much as 128 GB of memory in small designs. Microsoft’s Surface RTX Spark Dev Box ships with a developer-tuned Windows image and the tools you need to start fast.
How to build local AI agents on Windows
Step-by-step setup
Turn on sandboxing: Use MXC plus OpenShell to define what an agent can read, write, and run.
Install an agent stack: NVIDIA NemoClaw now supports GeForce RTX, NVIDIA RTX PRO, DGX Spark, and DGX Station for Windows via Linux or WSL. It helps you set up a sandboxed agent and pick models that fit your GPU. NemoClaw can also run Hermes Agent.
Choose agent apps: Hermes Agent now runs natively on Windows with a CLI and a desktop app. It can work with Windows files and apps more smoothly.
Pick models with “Computer Use”: H Company’s Holo 3.1 models can see the screen and click. They include quantized checkpoints that cut memory by about 35% vs FP8. A Computer Use harness with local model support is coming. NVIDIA helped tune these for over 2x speed on NVIDIA GPUs.
Optimize inference locally:
– llama.cpp adds Multi-Token Prediction (MTP) and Programmatic Dependent Launch (PDL). Expect about 2x speed on Qwen 3.5/3.6 27B and up to 1.6x on Qwen 3.5/3.6 35B MoE.
– vLLM includes MTP and new optimizations, lifting throughput up to 2.6x with better BF16 kernels for MoE and improved CUDA Graphs.
Use two GPUs when you can:
– llama.cpp now supports tensor parallelism to use both GPUs for more memory and up to ~1.8x faster generation. In LM Studio, open Settings → Runtime to enable TP.
– In ComfyUI, Classifier-Free Guidance lets you use both GPUs for near 2x compute. Split model chains across devices to run in high VRAM mode and avoid swapping.
Integrate with Windows AI: Windows AI Foundry and Windows AI APIs now route supported models to RTX GPUs via TensorRT for RTX. The first built-in model is Phi-Silica (3.3B) for on-device tasks like summarizing and coding.
Use WSL-C for Linux containers: Windows Subsystem for Linux Containers lets your Windows app create and run Linux AI containers without users managing WSL. A C/C++ library ties it into your app, keeping dev and prod closer.
Following these steps makes it practical to build local AI agents on Windows that are safe, fast, and easy to maintain.
Speed that matches 24/7 workloads
llama.cpp and vLLM: smarter decoding and kernels
– MTP lets a small draft model propose several next tokens that the main model checks in one pass. This raises throughput without hurting output quality or needing extra training.
– PDL can run dependent kernels in the same CUDA stream at once, cutting decode time.
– vLLM adds better BF16 kernels for MoE and trims runtime overhead with stronger CUDA Graphs usage.
These upgrades help you build local AI agents on Windows that respond faster and consume less power over long sessions.
Two GPUs, one agent: scaling on RTX PCs
– Tensor parallelism in llama.cpp uses both GPUs for bigger models and faster tokens.
– ComfyUI can split nodes across devices and enable high VRAM mode to skip costly swapping.
Together, these features make local agents smoother, especially with long contexts, tools, or multi-step plans.
Expand what your agent can do
NemoClaw and Hermes Agent
NemoClaw streamlines install, sandboxing, and model choice across RTX hardware on Windows and WSL. Hermes Agent adds a native Windows app and CLI, improving access to local files and apps with a clean UX.
H Company’s Computer Use models
Holo 3.1 models let agents act by seeing and clicking, which extends reach across many apps. With 35% lower memory vs FP8 and NVIDIA-optimized performance, they fit better on PCs while staying quick.
AI for Media SDK (AI4M)
– LipSync (GA) adds French, German, and Spanish with clearer articulation for dubbing and localization.
– Active Speaker Detection (GA) now handles multicam and multimic setups and links speakers across clips, cutting manual edits.
Ship GPU-accelerated Windows apps
TensorRT for RTX and Windows ML
Windows routes supported AI calls to RTX GPUs, bringing higher local performance. Early results show:
– Voicemod: 42% faster real-time voice conversion
– Topaz: 20% faster 1080p-to-4K upscaling with 3–4x smaller engine storage
– DxO PhotoLab 9.7: faster AI photo processing
– Camo Streamlight: real-time light autotune with AI
Linux containers without the hassle
WSL-C lets you run Linux AI containers from a Windows app with a simple API. Users avoid WSL setup, while developers keep workflows aligned with production.
Building safe desktop agents is now within reach. Use MXC and OpenShell to set clear rules. Pick fast local runtimes like llama.cpp or vLLM. Scale with two GPUs. Tie in Windows AI and WSL-C for smooth delivery. With these tools, you can confidently build local AI agents on Windows that stay private, fast, and helpful all day.
(pSource:
https://developer.nvidia.com/blog/build-personal-ai-agents-on-windows-pcs-with-new-tools-from-microsoft-and-nvidia/)
For more news: Click Here
FAQ
Q: What are Microsoft eXecution Containers (MXC) and how do they secure agents?
A: Microsoft eXecution Containers (MXC) provide a policy layer that defines isolation, identity, and containment for each agent, limiting file access and code execution so an agent cannot touch the full system. This reduces prompt-injection risks when agents interact with personal files, apps, or data.
Q: What does NVIDIA OpenShell add to MXC for Windows agent sandboxing?
A: NVIDIA OpenShell on Windows is built on MXC and packages policy creation and management, inference routing, and PII obfuscation to help developers deploy always-on agents safely. It provides an easy-to-integrate runtime so agent apps can adopt sandboxing and routing features without building them from scratch.
Q: How do RTX Spark systems and the Surface RTX Spark Dev Box support personal AI agents?
A: RTX Spark desktops and laptops deliver up to 1 petaflop of AI power and up to 128 GB of memory in small form factors, with CUDA-accelerated AI frameworks for running large models alongside everyday work. Microsoft’s Surface RTX Spark Dev Box is a special developer edition preloaded with a modified Windows image and developer tools to help developers get started.
Q: How do I build local AI agents on Windows securely in practice?
A: To build local AI agents on Windows, start by enabling MXC with OpenShell to define sandboxing, identity, and file and execution policies, then install an agent stack such as NVIDIA NemoClaw or Hermes Agent to set up and manage sandboxed agents. Next, optimize local inference with llama.cpp or vLLM, enable tensor parallelism or multi-GPU support in LM Studio when available, and integrate Windows AI APIs and WSL‑C for containerized workflows.
Q: What inference speedups do llama.cpp and vLLM provide for local agent workloads?
A: llama.cpp adds Multi-Token Prediction (MTP) and Programmatic Dependent Launch (PDL) to deliver about 2x performance on Qwen 3.5/3.6 27B dense models and about 1.6x on Qwen 3.5/3.6 35B MoE models. vLLM has adopted MTP and added BF16 kernel selection and CUDA Graphs optimizations to raise throughput by about 2.6x.
Q: How can I scale agent performance across two GPUs on an RTX PC?
A: llama.cpp supports tensor parallelism to utilize both GPUs for roughly double memory capacity and up to ~1.8x compute performance, and LM Studio exposes a Runtime setting to enable TP. ComfyUI supports Classifier-Free Guidance and splitting model chains across devices to get up to ~2x compute and avoid memory-swapping overhead.
Q: What are H Company’s Holo 3.1 models and how do they extend agent capabilities?
A: Holo 3.1 models are tuned for Computer Use so agents can observe the screen and take actions like clicking, extending agentic capabilities across more apps. They include quantized checkpoints that reduce memory by about 35% compared to FP8, and NVIDIA optimizations for these models and the forthcoming harness delivered over 2x performance on NVIDIA GPUs.
Q: How do Windows AI APIs, TensorRT for RTX, and WSL‑C fit into a Windows agent workflow?
A: Windows AI Foundry and Windows AI APIs route supported model calls to RTX GPUs via TensorRT for RTX, with Phi‑Silica (3.3B) cited as the first supported on-device model for tasks like summarization and code generation. WSL‑C lets Windows apps create and run Linux AI containers without users managing WSL, and exposes a C/C++ library so developers can integrate container workflows into their apps.