Fara-7B: Microsoft's Compact Agentic AI for Computer Automation

Overview
Fara-7B is Microsoft's pioneering agentic small language model (SLM) specifically engineered for computer interaction. With a mere 7 billion parameters, Fara-7B functions as an ultra-compact Computer Use Agent (CUA), delivering state-of-the-art performance within its size category and proving competitive with larger, more resource-intensive agentic systems.
Local Deployment
Experience Fara-7B locally by following these steps (refer to the Installation section for detailed Windows instructions) or through Magentic-UI:
- Clone the repository:

git clone https://github.com/microsoft/fara.git
cd fara

- Set up the environment:

python3 -m venv .venv
source .venv/bin/activate
pip install -e .
playwright install
After setting up, host the model in one process:
vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto
Then, you can interactively query it:
fara-cli --task "what is the weather in new york now"
To integrate Fara-7B with Magentic-UI, follow the instructions provided on the Magentic-UI + Fara-7B page. You will need to serve the model as described above, but utilize Magentic-UI for its user-friendly interface and video demonstrations.
Important Notes:
- For Windows users, we strongly recommend using WSL2 (Windows Subsystem for Linux). Detailed Windows instructions are available in the Installation section.
- You might need to add --tensor-parallel-size 2 to the vllm command if you encounter memory issues.
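Once the server is up, the hosted model can also be queried directly over vLLM's OpenAI-compatible HTTP API, without going through fara-cli. The sketch below assumes the default setup from the command above (port 5000, model name microsoft/Fara-7B); build_payload and ask_fara are illustrative helpers, and this sends a plain chat request rather than driving the full browser agent.

```python
# Minimal sketch: query the locally hosted Fara-7B vLLM server directly.
# vLLM exposes an OpenAI-compatible API under /v1; the payload shape below
# follows that convention. Helper names here are illustrative, not part of fara.
import json
import urllib.request

BASE_URL = "http://localhost:5000/v1"  # matches `vllm serve ... --port 5000`

def build_payload(task: str) -> dict:
    """Build an OpenAI-style chat-completion request for a task string."""
    return {
        "model": "microsoft/Fara-7B",
        "messages": [{"role": "user", "content": task}],
        "temperature": 0,  # deterministic decoding is preferable for agents
    }

def ask_fara(task: str) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(task)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask_fara("what is the weather in new york now"))
```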
Fara-7B demonstrates its capabilities through various use cases, including multi-turn shopping, GitHub issue management, and complex navigation tasks.
What Makes Fara-7B Unique
Unlike conventional chat models that primarily generate text, Fara-7B leverages computer interfaces—specifically mouse and keyboard—to execute multi-step tasks on behalf of users. Key differentiating features include:
- Visual Operation: The model perceives webpages and performs actions like scrolling, typing, and clicking on directly predicted coordinates, eliminating the need for accessibility trees or separate parsing models.
- On-Device Deployment: Its compact 7B-parameter size enables deployment directly on local devices, leading to reduced latency and enhanced privacy by keeping user data local.
- Efficient Task Completion: Fara-7B completes tasks efficiently, averaging approximately 16 steps per task, significantly fewer than the roughly 41 steps typically required by comparable models.
Fara-7B is trained using an innovative synthetic data generation pipeline built upon the Magentic-One multi-agent framework. This pipeline produced 145K trajectories covering a wide array of websites, task types, and difficulty levels. The model is based on Qwen2.5-VL-7B and was trained using supervised fine-tuning.
Key Capabilities
Fara-7B can automate a broad range of everyday web tasks, such as:
- Searching for information and summarizing results.
- Filling out forms and managing online accounts.
- Booking travel, movie tickets, and restaurant reservations.
- Shopping and comparing prices across different retailers.
- Finding job postings and real estate listings.
Performance Highlights
Fara-7B achieves state-of-the-art results within its size class across several prominent web agent benchmarks, outperforming similarly-sized models and remaining competitive with larger systems.
| Model | Params | WebVoyager | Online-M2W | DeepShop | WebTailBench |
|---|---|---|---|---|---|
| SoM Agents | |||||
| SoM Agent (GPT-4o-0513) | - | 90.6 | 57.7 | 49.1 | 60.4 |
| SoM Agent (o3-mini) | - | 79.3 | 55.4 | 49.7 | 52.7 |
| SoM Agent (GPT-4o) | - | 65.1 | 34.6 | 16.0 | 30.8 |
| GLM-4.1V-9B-Thinking | 9B | 66.8 | 33.9 | 32.0 | 22.4 |
| Computer Use Models | |||||
| OpenAI computer-use-preview | - | 70.9 | 42.9 | 24.7 | 25.7 |
| UI-TARS-1.5-7B | 7B | 66.4 | 31.3 | 11.6 | 19.5 |
| Fara-7B | 7B | 73.5 | 34.1 | 26.2 | 38.4 |
Table: Online agent evaluation results showing success rates (%) across four web benchmarks. Results are averaged over 3 runs.
WebTailBench: A New Benchmark for Real-World Web Tasks
We are introducing WebTailBench, a novel evaluation benchmark specifically designed for 11 real-world task types that are either underrepresented or entirely absent in existing benchmarks. This benchmark comprises 609 tasks across diverse categories. The initial 8 segments assess single skills or objectives, typically on a single website, while the remaining 3 evaluate more complex multi-step or cross-site tasks.
WebTailBench Detailed Results
| Task Segment | Tasks | SoM GPT-4o-0513 | SoM o3-mini | SoM GPT-4o | GLM-4.1V-9B | OAI Comp-Use | UI-TARS-1.5 | Fara-7B |
|---|---|---|---|---|---|---|---|---|
| Single-Site Tasks | ||||||||
| Shopping | 56 | 62.5 | 71.4 | 38.1 | 31.0 | 42.3 | 41.1 | 52.4 |
| Flights | 51 | 60.1 | 39.2 | 11.1 | 10.5 | 17.6 | 10.5 | 37.9 |
| Hotels | 52 | 68.6 | 56.4 | 31.4 | 19.9 | 26.9 | 35.3 | 53.8 |
| Restaurants | 52 | 67.9 | 59.6 | 47.4 | 32.1 | 35.9 | 22.4 | 47.4 |
| Activities | 80 | 70.4 | 62.9 | 41.7 | 26.3 | 30.4 | 9.6 | 36.3 |
| Ticketing | 57 | 58.5 | 56.7 | 37.4 | 35.7 | 49.7 | 30.4 | 38.6 |
| Real Estate | 48 | 34.0 | 17.4 | 20.1 | 16.0 | 9.0 | 9.7 | 23.6 |
| Jobs/Careers | 50 | 49.3 | 44.0 | 32.7 | 22.7 | 20.7 | 20.7 | 28.0 |
| Multi-Step Tasks | ||||||||
| Shopping List (2 items) | 51 | 66.0 | 62.7 | 17.0 | 7.8 | 34.0 | 20.9 | 49.0 |
| Comparison Shopping | 57 | 67.3 | 59.1 | 27.5 | 22.8 | 1.2 | 8.8 | 32.7 |
| Compositional Tasks | 55 | 51.5 | 39.4 | 26.7 | 17.0 | 10.3 | 9.1 | 23.0 |
| Overall | ||||||||
| Macro Average | 609 | 59.7 | 51.7 | 30.1 | 22.0 | 25.3 | 19.9 | 38.4 |
| Micro Average | 609 | 60.4 | 52.7 | 30.8 | 22.4 | 25.7 | 19.5 | 38.4 |
Table: Breakdown of WebTailBench results across all 11 segments. Success rates (%) are averaged over 3 independent runs. Fara-7B achieves the highest overall performance among the computer-use models.
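The two overall rows in the table differ only in weighting: the macro average weights each of the 11 segments equally, while the micro average weights each segment by its task count. A quick sketch, using the Fara-7B column from the table (rates are rounded to one decimal, so the recomputed averages match to within ~0.1):

```python
# Macro vs. micro averaging over the 11 WebTailBench segments.
# Values are (num_tasks, success_rate_percent) from the Fara-7B column above.
fara_results = {
    "Shopping": (56, 52.4), "Flights": (51, 37.9), "Hotels": (52, 53.8),
    "Restaurants": (52, 47.4), "Activities": (80, 36.3), "Ticketing": (57, 38.6),
    "Real Estate": (48, 23.6), "Jobs/Careers": (50, 28.0),
    "Shopping List (2 items)": (51, 49.0), "Comparison Shopping": (57, 32.7),
    "Compositional Tasks": (55, 23.0),
}

def macro_average(results):
    """Unweighted mean of per-segment success rates."""
    rates = [rate for _, rate in results.values()]
    return sum(rates) / len(rates)

def micro_average(results):
    """Task-count-weighted mean of per-segment success rates."""
    total = sum(n for n, _ in results.values())
    return sum(n * rate for n, rate in results.values()) / total

print(f"macro: {macro_average(fara_results):.1f}")  # 38.4
print(f"micro: {micro_average(fara_results):.1f}")  # 38.4
```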
Evaluation Infrastructure
Our robust evaluation setup incorporates:
- Playwright: A cross-browser automation framework that accurately replicates various browser environments.
- Abstract Web Agent Interface: This interface facilitates the seamless integration of any model, regardless of its origin, into our evaluation environment.
- Fara-Agent Class: A reference implementation specifically designed for running the Fara model within the evaluation framework.
Note: Fara-7B is an experimental release aimed at fostering community exploration and feedback. We advise running it within a sandboxed environment, closely monitoring its execution, and refraining from using it with sensitive data or on high-risk domains.
Installation
Linux
For Linux systems, follow these instructions. Refer to the Windows section below for native Windows installation.
Install the package using pip and configure the environment with Playwright:
# 1. Clone repository
git clone https://github.com/microsoft/fara.git
cd fara
# 2. Setup environment
python3 -m venv .venv
source .venv/bin/activate
pip install -e .[vllm]
playwright install
Note: If your plan is to host only with Azure Foundry, you can omit [vllm] and simply run pip install -e ..
Windows
For Windows, we highly recommend using WSL2 (Windows Subsystem for Linux) to create a Linux-like environment. If native Windows execution is preferred, follow these steps:
# 1. Clone repository
git clone https://github.com/microsoft/fara.git
cd fara
# 2. Setup environment
python3 -m venv .venv
.venv\Scripts\activate
pip install -e .
python3 -m playwright install
Hosting the Model
Recommendation: The simplest approach is Azure Foundry hosting, which eliminates the need for GPU hardware or model downloads. Alternatively, self-hosting with vLLM is an option if you have access to GPU resources.
Azure Foundry Hosting (Recommended)
Deploy Fara-7B on Azure Foundry to avoid managing GPU infrastructure or downloading model weights.
Setup:
- Deploy the Fara-7B model on Azure Foundry and obtain your unique endpoint URL and API key.
- Create an endpoint configuration JSON file (e.g., azure_foundry_config.json):

{
  "model": "Fara-7B",
  "base_url": "https://your-endpoint.inference.ml.azure.com/",
  "api_key": "YOUR_API_KEY_HERE"
}
Then, you can run Fara-7B using this endpoint configuration:
fara-cli --task "how many pages does wikipedia have" --endpoint_config azure_foundry_config.json [--headful]
Note: You can also specify the endpoint configuration directly via command-line arguments: --base_url [your_base_url] --api_key [your_api_key] --model [your_model_name], instead of using a JSON file.
Troubleshooting fara-cli not found: If fara-cli command isn't recognized, try:
python -m fara.run_fara --task "what is the weather in new york now"
This method requires no GPU or model downloads.
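If you prefer to generate the endpoint configuration programmatically (for example, inside a setup script), a minimal sketch, assuming only the field names shown in the JSON above; the URL and key remain placeholders for your own deployment's values:

```python
# Write the endpoint configuration file used by `fara-cli --endpoint_config`.
# Field names mirror the example JSON above; the values are placeholders.
import json
from pathlib import Path

config = {
    "model": "Fara-7B",
    "base_url": "https://your-endpoint.inference.ml.azure.com/",
    "api_key": "YOUR_API_KEY_HERE",
}

Path("azure_foundry_config.json").write_text(json.dumps(config, indent=2))
# Then: fara-cli --task "..." --endpoint_config azure_foundry_config.json
```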
Self-hosting with vLLM or LM Studio / Ollama
If you have GPU resources available, you can self-host Fara-7B using vLLM. This typically requires a GPU machine with at least 24GB of VRAM.
For Linux only: Run the following command to start the vLLM server:
vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto
For quantized models or GPUs with lower VRAM, consult the Fara-7B GGUF on HuggingFace page.
For Windows/Mac: vLLM is not natively supported. You can use WSL2 on Windows to execute the above command or leverage LM Studio / Ollama as described below.
Alternatively, utilize LM Studio or Ollama for local model hosting. We recommend the GGUF versions of our models available on Fara-7B GGUF on HuggingFace. Select the largest model compatible with your GPU. Ensure the context length is set to at least 15000 tokens and temperature to 0 for optimal results.
Then, run Fara-7B, pointing to your local server:
fara-cli --task "what is the weather in new york now"
If you did not use vLLM for hosting, remember to specify the correct --base_url [your_base_url] --api_key [your_api_key] --model [your_model_name].
Troubleshooting fara-cli not found: If fara-cli command isn't recognized, try:
python -m fara.run_fara --task "what is the weather in new york now"
Reproducibility
We provide a framework within the webeval/ directory to reproduce our evaluation results on WebVoyager and OnlineMind2Web. Agentic evaluations on live websites present inherent challenges due to dynamic changes. We have implemented several measures to ensure reliable and comparable evaluations:
- BrowserBase Integration: We utilize BrowserBase to manage browser session hosting, ensuring stable and consistent browser instance management.
- Time-sensitive Task Updates: To address tasks that become stale or impossible over time, we have:
- Removed approximately 48 impossible tasks from the original WebVoyager benchmark.
- Updated about 50 tasks with future dates to maintain their achievability (e.g., "Search for a hotel in Bali from Jan 1 to Jan 4, 2024" was changed to "Search for a hotel in Bali from Jan 1 to Jan 4, 2026").
Our updated WebVoyager benchmark dataset is located at webeval/data/webvoyager/WebVoyager_data_08312025.jsonl.
- Environment Error Handling: Browser errors, such as connection drops or page timeouts, are robustly managed:
- Trajectories are retried up to 5 times when environment errors occur.
- Complete but incorrect trajectories are never retried.
- Each retry initiates a fresh browser session, ensuring no state is retained from previous attempts.
- Step Budget: Each trajectory across all online benchmarks is capped at a maximum of 100 actions. Trajectories exceeding this budget without successfully stopping are marked as incorrect.
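The retry and budget rules above can be sketched as a small control loop. This is an illustrative reconstruction, not the repository's implementation: BrowserEnvError and run_trajectory are hypothetical stand-ins for the framework's environment errors and trajectory runner.

```python
# Sketch of the evaluation retry policy: environment errors trigger up to 5
# retries with a fresh session; completed trajectories (even incorrect ones)
# are never retried; exceeding the 100-action budget is scored as incorrect.
MAX_RETRIES = 5
STEP_BUDGET = 100

class BrowserEnvError(Exception):
    """Stand-in for connection drops / page timeouts."""

def evaluate_with_retries(run_trajectory):
    """run_trajectory() returns (completed, n_steps, correct); a fresh
    browser session is assumed on every call, so no state carries over."""
    for _ in range(MAX_RETRIES):
        try:
            completed, n_steps, correct = run_trajectory()
        except BrowserEnvError:
            continue  # environment error: retry with a new browser session
        if n_steps > STEP_BUDGET:
            return False  # over the step budget: marked incorrect
        return correct    # completed trajectories are never retried
    return False          # all retries exhausted

# Example: first attempt hits an environment error, second completes correctly.
attempts = iter([BrowserEnvError(), (True, 16, True)])
def fake_run():
    result = next(attempts)
    if isinstance(result, Exception):
        raise result
    return result

print(evaluate_with_retries(fake_run))  # True
```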
WebEval Package Installation
conda create --name fara_webeval python=3.12
conda activate fara_webeval
# Install fara package
pip install -e .
# Install autogen submodule
git submodule update --init --recursive
cd autogen/python/packages
pip install -e autogen-core
pip install -e autogen-ext
# Install webeval
cd webeval
pip install -e .
# Install playwright
playwright install
Running Evaluations
Navigate to the scripts directory: cd webeval/scripts
Ensure you configure a valid OpenAI GPT-4o endpoint in endpoint_configs_gpt4o/dev to run the WebVoyager LLM-as-a-judge evaluation!
Option 1: Self-hosted VLLM
python webvoyager.py --model_url /path/where/you/want/to/download/model/ --model_port 5000 --eval_oai_config ../endpoint_configs_gpt4o/dev/ --out_url /path/to/save/eval/files --device_id 0,1 --processes 1 --run_id 1 --max_rounds 100
python om2w.py --model_url /path/where/you/want/to/download/model/ --model_port 5000 --eval_oai_config ../endpoint_configs_o4/dev/ --eval_model o4-mini --out_url /path/to/save/eval/files --device_id 0,1 --processes 1 --run_id 1 --max_rounds 100
Option 2: Azure Foundry Deployment
Deploy Fara-7B on Foundry endpoint(s), then place endpoint URLs and keys in JSONs under endpoint_configs/:
python webvoyager.py --model_endpoint ../../endpoint_configs/ --eval_oai_config ../endpoint_configs_gpt4o/dev/ --out_url /path/to/save/eval/files --processes 1 --run_id 1_endpoint --max_rounds 100
python om2w.py --model_endpoint ../../endpoint_configs/ --eval_oai_config ../endpoint_configs_o4/dev/ --eval_model o4-mini --out_url /path/to/save/eval/files --processes 1 --run_id 1_endpoint --max_rounds 100
Notes:
- We use the identical LLM-as-a-judge prompts and model (GPT-4o) as WebVoyager, which is why the --eval_oai_config argument is necessary.
- Set --browserbase for browser session management (requires exported API key and project ID environment variables).
- Avoid over-stressing a single vLLM deployment with more than approximately 10 concurrent processes due to known issues.
- Debugging output can be found in fara/webeval/scripts/stdout.txt.
Analyzing Evaluation Results
Evaluation Output Structure
Evaluation results are stored under the --out_url path, organized into folders by model name, dataset, username, and run ID.
Example path: /runs/WebSurfer-fara-100-max_n_images-3/fara-7b/<username>/WebVoyager_WebVoyager_data_08312025.jsonl/<run_id>
Each evaluation folder contains:
- gpt_eval/: Results from the LLM-as-a-judge evaluation.
- traj/: Subdirectories for each task's trajectory, which include:
  - *-final_answer.json (e.g., Amazon--1_final_answer.json): the final answer, or <no_answer> if the task was aborted or the step budget was exceeded.
  - scores/*_eval.json: LLM judge scores (gpt_eval.json for WebVoyager, WebJudge_Online_Mind2Web_eval-3.json for Online-Mind2Web).
  - web_surfer.log: A comprehensive history of actions and any errors encountered.
  - screenshot_X.png: Screenshots captured before each action X.
  - times.json: Records the start and end times of the task.
  - core.log: High-level logs detailing whether the trajectory needed to be started, was cached/done, the evaluation score, duration, and errors.
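Given that layout, aggregating judge scores for a run is a short directory walk. The sketch below is illustrative, assuming each task directory holds a scores/*_eval.json file with a numeric "score" field (as in the WebVoyager example later in this section); the helper names are not part of the webeval package.

```python
# Aggregate LLM-judge scores from a run directory laid out as:
#   <run_dir>/traj/<task>/scores/*_eval.json  with a numeric "score" field.
import json
from pathlib import Path
from statistics import mean

def collect_scores(run_dir: Path) -> list:
    scores = []
    for eval_file in run_dir.glob("traj/*/scores/*_eval.json"):
        record = json.loads(eval_file.read_text())
        if "score" in record:  # aborted trajectories may lack a score
            scores.append(record["score"])
    return scores

def success_rate(run_dir: Path) -> float:
    """Mean judge score as a percentage over scored trajectories."""
    scores = collect_scores(run_dir)
    return 100 * mean(scores) if scores else 0.0
```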
Running Analysis
Use the analysis notebook to compute metrics:
cd webeval/scripts/analyze_eval_results/
jupyter notebook analyze.ipynb
This script performs the following:
- Identifies trajectories that were aborted mid-execution and provides diagnostic reasons.
- Computes average scores across trajectories that were not aborted.
- Distinguishes between aborted trajectories (due to errors during sampling) and completed trajectories (terminated either by a terminate() call or by exceeding the step budget).
To re-run failed tasks, simply execute the evaluation script again with the same run_id and username. It will automatically skip tasks that were not aborted.
Example WebVoyager GPT Eval Result
{
"score": 1.0,
"gpt_response_text": "To evaluate the task, we need to verify if the criteria have been met:
1. **Recipe Requirement**: A vegetarian lasagna recipe with zucchini and at least a four-star rating.
2. **Search and Results**:
- The screenshots show that the search term used was \"vegetarian lasagna zucchini.\"
- Among the search results, \"Debbie's Vegetable Lasagna\" is prominently featured.
3. **Evaluation of the Recipe**:
- Rating: \"Debbie's Vegetable Lasagna\" has a rating of 4.7, which satisfies the requirement of being at least four stars.
- The presence of zucchini in the recipe is implied through the search conducted, though the screenshots do not explicitly show the ingredients list. However, the result response confirms the match to the criteria.
Given the information provided, the task seems to have fulfilled the requirement of finding a vegetarian lasagna recipe with zucchini and a four-star rating or higher.
**Verdict: SUCCESS**
"
}
Example of web_surfer.log (Action History)
{
"message": "
Thought #1: To begin locating the closest GameStop store, I will click on the 'Trade-In' button, which typically leads to a store locator feature on retail websites.
Action #1: executing tool 'left_click' with arguments {\"action\": \"left_click\", \"coordinate\": [1074, 37]}",
"timestamp": "2025-11-25T22:23:42.788320",
"type": "OtherEvent"
}
{
"message": "Observation#1: I clicked at coordinates (1083.0252100840337, 37.16517857142858).",
"timestamp": "2025-11-25T22:23:47.616613",
"type": "OtherEvent"
}
{
"message": "
Thought #2: Since access to Gamestop's website has been blocked, I will search for alternative sources of store location and hours for GameStop locations near zip code 90028. This could include cached or third-party website information.
Action #2: executing tool 'web_search' with arguments {\"action\": \"web_search\", \"query\": \"GameStop store locator zip code 90028\"}",
"timestamp": "2025-11-25T22:23:50.579702",
"type": "OtherEvent"
}
{
"message": "Observation#2: I typed 'GameStop store locator zip code 90028' into the browser search bar.",
"timestamp": "2025-11-25T22:23:55.119495",
"type": "OtherEvent"
}
{
"message": "
Thought #3: The closest GameStop store to zip code 90028 is located at 5533 W Sunset Blvd, Los Angeles, CA 90028. It is currently closed and will open tomorrow at 11 AM. The contact number is (323) 466-3714.
Action #3: executing tool 'terminate' with arguments {\"action\": \"terminate\", \"status\": \"success\"}",
"timestamp": "2025-11-25T22:23:58.729038",
"type": "OtherEvent"
}
{
"message": "Observation#3: The closest GameStop store to zip code 90028 is located at 5533 W Sunset Blvd, Los Angeles, CA 90028. It is currently closed and will open tomorrow at 11 AM. The contact number is (323) 466-3714.",
"timestamp": "2025-11-25T22:24:02.379069",
"type": "OtherEvent"
}
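Each log event is a JSON object whose "message" field interleaves thoughts and tool calls, so (thought, action) pairs can be pulled out with a couple of regular expressions. The sketch below matches the "Thought #N: ... Action #N: executing tool '...'" convention in the samples above; treating that as the general format is an assumption.

```python
# Extract (step, thought, tool) triples from web_surfer.log "message" strings.
# Regexes assume the Thought/Action convention shown in the log samples above.
import re

THOUGHT_RE = re.compile(r"Thought #(\d+):\s*(.+?)\s*Action #\1:", re.S)
ACTION_RE = re.compile(r"executing tool '(\w+)'")

def parse_event(message: str):
    """Return (step_number, thought_text, tool_name), or None for
    observation events, which carry no thought/action pair."""
    thought = THOUGHT_RE.search(message)
    action = ACTION_RE.search(message)
    if thought and action:
        return int(thought.group(1)), thought.group(2), action.group(1)
    return None

sample = (
    "\nThought #1: To begin locating the closest GameStop store, I will click "
    "on the 'Trade-In' button.\n"
    "Action #1: executing tool 'left_click' with arguments "
    '{"action": "left_click", "coordinate": [1074, 37]}'
)
print(parse_event(sample))  # step 1, the thought text, and 'left_click'
```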
Citation
If you use Fara in your research, please cite our work.