v1.0 — open source & free

Compare AI models side-by-side
in your terminal

One prompt, multiple models, real-time streaming, performance stats, and an AI judge — all in a single command.

npx yardstiq "your prompt" -m claude-sonnet -m gpt-4o

Everything you need to compare models

Stop copying prompts between tabs. One command gives you streaming comparisons, hard numbers, and AI-powered evaluation.

Side-by-Side Streaming

Watch model outputs appear in parallel, in real time. No more tab-switching between chat windows.

🤖

40+ Models

Claude, GPT, Gemini, Llama, DeepSeek, Mistral, Grok — every major model in one tool.

📊

Performance Stats

Time to first token, throughput, token counts, and cost per model. Data, not vibes.
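
For reference, timing figures like these are commonly derived from timestamps on the response stream. The sketch below is a minimal Python illustration, not yardstiq's implementation: `stream_stats` and `StreamStats` are hypothetical names, and throughput is assumed to mean output tokens per second measured after the first token arrives.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    ttft_s: float        # seconds from request start to the first token
    tokens_per_s: float  # tokens per second after the first token arrived
    token_count: int     # total tokens received


def stream_stats(request_start: float, token_times: list[float]) -> StreamStats:
    """Derive timing stats from token arrival timestamps (in seconds).

    Hypothetical helper for illustration; assumes one timestamp per token.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    generation_time = token_times[-1] - token_times[0]
    # With a single token there is no generation interval to divide by.
    rate = (len(token_times) - 1) / generation_time if generation_time > 0 else 0.0
    return StreamStats(ttft_s=ttft, tokens_per_s=rate, token_count=len(token_times))


# Five tokens arriving every 0.5 s, the first one 0.5 s after the request.
stats = stream_stats(0.0, [0.5, 1.0, 1.5, 2.0, 2.5])
print(stats)
```

Other tools define throughput over the whole request, including the wait for the first token, so compare numbers only when the definition matches.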

⚖️

AI Judge

Let an AI judge pick the winning response, with scored verdicts and the reasoning behind them.

📁

Export Anywhere

JSON for pipelines, Markdown for docs, self-contained HTML for sharing.

🧪

Benchmark Suites

Define prompt suites in YAML and run them across models with aggregate scoring.

🏠

Local Models

Compare Ollama models with zero API cost. Your hardware, your data, your rules.

🔑

Flexible Auth

One Vercel AI Gateway key for everything, or individual provider keys. Mix and match.

Up and running in 60 seconds

No config files. No web UI. Just your terminal.

1. Install (or just use npx)

npm install -g yardstiq
# or skip install entirely
npx yardstiq "your prompt" -m claude-sonnet -m gpt-4o

2. Set your API key

# One key for 40+ models via Vercel AI Gateway
export AI_GATEWAY_API_KEY=your_key

# Or individual provider keys
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
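
Before a first run, it can help to confirm which of these variables the current shell actually exports. The following is a minimal Python sketch under stated assumptions: `configured_keys` is a hypothetical helper, it only knows the three variable names shown above, and it says nothing about which key yardstiq prefers when several are set.

```python
import os

# Variable names taken from the setup step above.
GATEWAY_KEY = "AI_GATEWAY_API_KEY"
PROVIDER_KEYS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]


def configured_keys(env: dict[str, str]) -> list[str]:
    """List which of the known key variables are set to a non-empty value."""
    return [name for name in [GATEWAY_KEY, *PROVIDER_KEYS] if env.get(name)]


# Inspect the real environment; an empty list means no key is exported.
print(configured_keys(dict(os.environ)))
```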

3. Compare models

# Basic comparison
yardstiq "Explain monads" -m claude-sonnet -m gpt-4o

# With AI judge
yardstiq "Write a sort algorithm" -m claude-sonnet -m gpt-4o --judge

# Three models + export
yardstiq "Explain DNS" -m claude-sonnet -m gpt-4o -m gemini-flash --json > results.json

4. Go local (optional)

# No API key needed — just run Ollama
yardstiq "hello" -m local:llama3.2 -m local:mistral

Real benchmarks, not marketing

Run your own benchmark suites with YAML configs. Here's a sample across coding, creative writing, and reasoning tasks.

# benchmark.yaml
name: model-showdown
prompts:
  - "Write a Python fibonacci with memoization"
  - "Explain quantum entanglement to a 10-year-old"
  - "Debug this async race condition: ..."
models:
  - claude-sonnet
  - gpt-4o
  - gemini-flash
judge: true

yardstiq benchmark run benchmark.yaml --json
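
To get a feel for how much work a suite implies, the sketch below expands the sample config into individual runs. It is an illustration in Python, not yardstiq's code: it assumes every prompt is sent to every model, mirrors benchmark.yaml as a plain dict to avoid a YAML dependency, and `expand_runs` is a hypothetical helper.

```python
from itertools import product

# Mirrors benchmark.yaml above as a plain dict so the sketch needs no YAML parser.
suite = {
    "name": "model-showdown",
    "prompts": [
        "Write a Python fibonacci with memoization",
        "Explain quantum entanglement to a 10-year-old",
        "Debug this async race condition: ...",
    ],
    "models": ["claude-sonnet", "gpt-4o", "gemini-flash"],
    "judge": True,
}


def expand_runs(suite: dict) -> list[tuple[str, str]]:
    """Return every (prompt, model) pair in the suite (hypothetical helper)."""
    return list(product(suite["prompts"], suite["models"]))


runs = expand_runs(suite)
print(len(runs))  # 3 prompts x 3 models = 9
```

Run count grows multiplicatively, so adding a fourth model to this suite means three more requests, plus any judge calls.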

Model          Coding  Creative  Reasoning  Speed    Cost/req
Claude Sonnet  92      88        94         69 t/s   $0.0013
GPT-4o         89      85        90         48 t/s   $0.0010
Gemini Flash   84      82        86         112 t/s  $0.0004
Llama 3.1 70B  81      79        83         35 t/s   $0.0000
Sample results — run your own benchmarks to get real numbers for your use case
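
One simple way to collapse the per-category scores into a single ranking is an unweighted mean. The Python sketch below applies that to the sample figures above; it is an assumption for illustration, since the page does not state how yardstiq computes its aggregate score, and `mean_score` is a hypothetical helper.

```python
# Sample category scores copied from the table above.
scores = {
    "Claude Sonnet": {"coding": 92, "creative": 88, "reasoning": 94},
    "GPT-4o": {"coding": 89, "creative": 85, "reasoning": 90},
    "Gemini Flash": {"coding": 84, "creative": 82, "reasoning": 86},
    "Llama 3.1 70B": {"coding": 81, "creative": 79, "reasoning": 83},
}


def mean_score(categories: dict[str, int]) -> float:
    """Unweighted mean across categories (an assumed aggregate, for illustration)."""
    return sum(categories.values()) / len(categories)


# Rank models from highest to lowest mean score.
ranking = sorted(scores, key=lambda model: mean_score(scores[model]), reverse=True)
for model in ranking:
    print(f"{model}: {mean_score(scores[model]):.1f}")
```

If one category matters more for your workload, swap the mean for a weighted sum before ranking.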

Stop guessing. Start measuring.

Join developers who use yardstiq to make data-driven model decisions.