AgentRoam

A multimodal game agent that simulates movement control on a physical controller and keyboard to navigate to destinations or to free-roam. It can also take selfies.

How?

At a high level, the program passes one or two images (depending on whether this is the game start or an in-progress session) along with a structured prompt indicating whether the agent is free-roaming or following a mini-map. The program can run with a range of multimodal models, including Claude Sonnet 4.5, Llama 4 Maverick/Scout, and all multimodal variants of GPT.
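As a rough illustration, the per-step request might be assembled like this. The function name, field names, and prompt wording are assumptions for the sketch, not the project's actual implementation:

```python
def build_request(mode, screenshot, minimap=None):
    """Assemble a structured prompt plus 1-2 images for the multimodal model.

    mode: "freeroam" or "minimap" (following a mini-map route) -- assumed labels.
    screenshot: the current in-game frame (a base64 string here).
    minimap: optional second image, present for in-progress mini-map sessions.
    """
    instructions = (
        "You control a game character. Reply with one action: "
        "a directional movement, a camera change, or (if allowed) "
        "'take photo', plus a magnitude and a short reasoning."
    )
    # One image at game start, two once a session is in progress.
    images = [screenshot] if minimap is None else [screenshot, minimap]
    return {
        "mode": mode,
        "instructions": instructions,
        "images": images,
    }

req = build_request("minimap", "<frame-b64>", "<map-b64>")
```

The returned dict would then be serialised into whichever message format the chosen model provider expects.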

The multimodal agent then returns either a directional movement, a camera change, or, for certain games, a request to take a photo. For movement and camera actions, a magnitude is also returned (e.g. number of steps or camera turns). Finally, reasoning is requested from the agent, especially for photos, where we want to know why it chose to take the photo. Once sanitised, these outputs are passed to either a physical or simulated controller, which triggers the movement in-game.
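The sanitisation step above could look something like the following minimal sketch. The action schema, the allowed action names, and the magnitude limit are all illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional

VALID_MOVES = {"forward", "back", "left", "right"}
VALID_CAMERA = {"camera_left", "camera_right", "camera_up", "camera_down"}
MAX_MAGNITUDE = 10  # clamp steps / camera turns to an assumed safe range

@dataclass
class Action:
    kind: str                 # "move", "camera", or "photo"
    direction: Optional[str]  # None for photos
    magnitude: int            # steps or camera turns; 0 for photos
    reasoning: str            # the agent's stated reason for the action

def sanitise(raw: dict, photos_allowed: bool = True) -> Action:
    """Validate the model's raw reply and clamp the magnitude
    before anything reaches the (physical or simulated) controller."""
    name = str(raw.get("action", "")).strip().lower()
    magnitude = max(1, min(int(raw.get("magnitude", 1)), MAX_MAGNITUDE))
    reasoning = str(raw.get("reasoning", ""))
    if name in VALID_MOVES:
        return Action("move", name, magnitude, reasoning)
    if name in VALID_CAMERA:
        return Action("camera", name, magnitude, reasoning)
    if name == "take_photo" and photos_allowed:
        return Action("photo", None, 0, reasoning)
    raise ValueError(f"unrecognised action: {name!r}")
```

A hardened version would also handle malformed JSON and rate-limit photo requests, but the core idea is the same: only a small whitelist of actions, with bounded magnitudes, ever reaches the controller.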

Why?

AgentRoam is a fun, ongoing project. It in no way aims to replace our many logged hours in open-world games; instead, it seeks to understand how a multimodal model might explore a game world, and what it finds beautiful (🤩 via taking photos).

Currently Playing 🎮

We have AgentRoam working in:

Request that AgentRoam play a game by leaving a comment on a YouTube video, or by emailing us using the details below.

Get in touch

If you have questions, speak to a hackathon dev by emailing me at agentroam.dev. You can view AgentRoam on YouTube.