Cursor — Model Selection & Token Efficiency

The right framing

Don't optimize individual prompts — optimize task completion. Token efficiency is the outcome of solving tasks quickly and correctly, not of counting tokens per prompt.

The model determines roughly 75% of output quality, and once you have picked a model that share is fixed. The remaining 25% is yours to control, and it all comes down to context.

The real cost isn't the plan — it's the correction loop. Skipping planning leads to blind edits, bugs to fix, and multiple back-and-forths. That wave of fix cycles costs far more tokens than a good upfront plan.

Recommended workflow

Step 1: Pull full ticket context via MCP (Jira, Linear…)

Step 2: Plan with a thinking model (Sonnet / GPT / Opus)

Step 3: Review & edit the plan (free, zero tokens)

Step 4: Execute the plan (Composer 1.5 or 1)

Why this works: the expensive model is used once to reason through the whole task. Execution is handled by a cheaper, faster model following a clear structure, which makes it far less likely to make mistakes than a blind run. Real example from the sessions: planning cost $0.42 + execution cost $0.76 = $1.18 total, with fewer errors than a planless approach.
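The arithmetic behind that example can be sketched in a few lines. The $0.42 and $0.76 figures come from the session quoted above; the planless figures (first attempt and per-cycle fix cost) are purely hypothetical placeholders, so plug in your own numbers.

```python
# Figures from the session quoted above.
PLANNING_COST = 0.42   # one pass with the expensive thinking model
EXECUTION_COST = 0.76  # cheaper model following the reviewed plan


def planned_total(planning: float, execution: float) -> float:
    """Total cost of the plan-then-execute workflow, in dollars."""
    return round(planning + execution, 2)


def planless_total(first_attempt: float, fix_cycle: float, cycles: int) -> float:
    """Total cost of a blind run plus `cycles` correction rounds.

    All three inputs are hypothetical; the source gives no planless figures.
    """
    return round(first_attempt + fix_cycle * cycles, 2)


if __name__ == "__main__":
    print("planned:", planned_total(PLANNING_COST, EXECUTION_COST))
    # Hypothetical blind run: $0.50 first attempt, then three $0.30 fix cycles.
    print("planless (hypothetical):", planless_total(0.50, 0.30, 3))
```

The point of the comparison is that the plan only has to prevent a couple of fix cycles to pay for itself.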

Pro tip — add parallel guidance to your plan prompt: "Structure the plan with a dependency tree (DAG) and prefix tasks so parallel branches are clear." This makes the plan directly actionable for sub-agents or team distribution.
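To see what a dependency-tree plan buys you, here is a minimal sketch that groups tasks into waves that can run in parallel. The task names and prefixes are hypothetical; the grouping uses Python's standard `graphlib.TopologicalSorter`, which is one way to do it and not something the Cursor workflow itself prescribes.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
# The A/B/C prefixes mark independent branches, as the plan prompt requests.
plan = {
    "A1-schema": set(),
    "A2-migration": {"A1-schema"},
    "B1-api-client": set(),
    "C1-ui-form": {"A2-migration", "B1-api-client"},
}


def parallel_waves(dag: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; every task in a wave has all dependencies met."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the plan is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks runnable right now
        waves.append(ready)
        sorter.done(*ready)  # unblock their dependents
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(parallel_waves(plan), start=1):
        print(f"wave {number}: {wave}")
```

Each wave can be handed to separate sub-agents or teammates; the next wave starts once the previous one is finished.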

Model selection guide

Use case	Model to reach for
Planning, architecture, complex reasoning	Frontier / thinking model — Sonnet, GPT, Opus
Executing a clear, detailed plan	Composer 1.5 — fast and cost-effective middle ground
Docs, research, codebase exploration, small features	Composer 1 — cheapest, still very capable for routine tasks
Routing by task complexity automatically	Auto mode — defaults to Composer 2, routes up when needed
Sub-agents running isolated parallel work	Composer 1, or inherit from parent for simple output tasks
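The table above can be read as a simple lookup. The sketch below encodes it that way; the task-kind keys are hypothetical labels invented for illustration, and the fallback to Auto mode is an assumption about what you would want for anything unclassified.

```python
# Model choices copied from the table above; the dictionary keys are hypothetical.
MODEL_FOR_TASK = {
    "planning": "Frontier / thinking model (Sonnet, GPT, Opus)",
    "execute_plan": "Composer 1.5",
    "docs_or_exploration": "Composer 1",
    "sub_agent": "Composer 1",
}


def pick_model(task_kind: str) -> str:
    """Return the model to reach for, falling back to Auto mode if unknown."""
    return MODEL_FOR_TASK.get(task_kind, "Auto mode")


if __name__ == "__main__":
    for kind in ("planning", "execute_plan", "refactor"):
        print(kind, "->", pick_model(kind))
```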

Prune your enabled model list in settings to only the ones you actually use. Smaller list = faster decisions. Run the same task on multiple models occasionally to calibrate — but not routinely, since you pay for every run.

Context management

Precision

Tag exact resources with @

Don't let the agent search the entire repo. Tag the specific file, doc, or past chat. It short-circuits semantic search — fewer tokens, faster, more accurate.

Limits

Keep context below 60–65%

Exceeding the context limit triggers compaction: prior context gets compressed and quality degrades, but you still pay for all of it. Monitor the context usage indicator on every chat.
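As a rough mental model of the 60–65% guideline, the check below classifies a chat by how full its context window is. The thresholds come from the guideline above; the function name, the status strings, and the 200,000-token window in the usage example are assumptions for illustration only.

```python
def context_status(
    used_tokens: int,
    window_tokens: int,
    soft_limit: float = 0.60,
    hard_limit: float = 0.65,
) -> str:
    """Classify context usage against the 60-65% guideline."""
    ratio = used_tokens / window_tokens
    if ratio < soft_limit:
        return "ok"
    if ratio < hard_limit:
        return "summarize soon"
    return "summarize or start a new chat"


if __name__ == "__main__":
    window = 200_000  # hypothetical context window size
    for used in (100_000, 124_000, 140_000):
        print(used, "->", context_status(used, window))
```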

Hygiene

Start a new chat per task

Every prior message in a thread inflates all future requests. When switching tasks or features, open a fresh agent — don't continue an unrelated thread.
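A simplified model shows why long threads get expensive: if every request re-sends the whole thread so far, total input tokens grow roughly with the square of the number of messages. The sketch assumes equal-sized messages and ignores caching discounts, so treat the numbers as illustrative only.

```python
def tokens_sent(message_sizes: list[int]) -> int:
    """Total input tokens if each request re-sends the entire thread so far."""
    total = 0
    thread = 0
    for size in message_sizes:
        thread += size   # the thread grows by the new message
        total += thread  # and the whole thread is sent with this request
    return total


if __name__ == "__main__":
    one_long_thread = tokens_sent([1_000] * 6)
    two_fresh_chats = 2 * tokens_sent([1_000] * 3)
    print("one 6-message thread:", one_long_thread)
    print("two 3-message chats: ", two_fresh_chats)
```

Under these assumptions, splitting six messages across two fresh chats sends noticeably fewer tokens than keeping them in one thread.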

Summarization

Summarize manually, not automatically

Auto-summaries give equal weight to everything. Instead: "Summarize focusing on X — save as markdown." Paste the result into a new chat. Context drops dramatically while preserving what matters.

You can also fork a chat from any earlier message (hover → fork) to start from a clean mid-point, or reference a past chat with @ to get a summary instead of reloading the full thread.

Rules, skills & sub-agents

Always-on

Rules

Static context added to every request. Keep them short and essential — every rule adds tokens to every single call. Don't replicate a linter or formatter here.

Scope tip: nest rules inside /frontend or /backend dirs so they only apply to those files.
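The scoping idea can be sketched as a path check: a rule only joins the request when the file being edited lives under the directory that holds the rule. The directory layout and rule names below are hypothetical, and this is a model of the behavior described above rather than Cursor's actual matching logic.

```python
from pathlib import PurePosixPath

# Hypothetical layout: directory that holds the rules -> rule names.
RULES = {
    ".": ["project-conventions"],   # root rules apply everywhere
    "frontend": ["react-components"],
    "backend": ["api-errors"],
}


def rules_for(file_path: str) -> list[str]:
    """Return the rules whose directory contains the given file."""
    path = PurePosixPath(file_path)
    active = []
    for scope, names in RULES.items():
        if scope == "." or path.is_relative_to(scope):
            active.extend(names)
    return active


if __name__ == "__main__":
    print(rules_for("frontend/src/App.tsx"))
    print(rules_for("backend/main.py"))
```

A backend edit never pays the token cost of the frontend rules, and vice versa.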

On-demand

Skills

Only the name + description enter the context window upfront. The full skill body is loaded on demand when relevant. You can have many skills without bloating every prompt — this is called progressive disclosure.

The agent can update a failing skill itself. MCPs now work identically — loaded as skills, not dumped into context at startup.
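Progressive disclosure is easy to picture as lazy loading. In the sketch below, only each skill's name and description go into the upfront context, and the body is fetched when the skill is actually invoked. The `Skill` class, the function names, and the example skills are all hypothetical; this is not Cursor's implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    description: str
    load_body: Callable[[], str]  # deferred: only called when the skill is used


def upfront_context(skills: list[Skill]) -> str:
    """What enters the context window before any skill is invoked."""
    return "\n".join(f"{s.name}: {s.description}" for s in skills)


def invoke(skills: list[Skill], name: str) -> str:
    """Load the full body of one skill on demand."""
    for skill in skills:
        if skill.name == name:
            return skill.load_body()
    raise KeyError(name)


# Hypothetical skills for illustration.
SKILLS = [
    Skill("release-notes", "Draft release notes from merged PRs",
          lambda: "FULL BODY: step-by-step release notes instructions"),
    Skill("db-migration", "Write and verify a schema migration",
          lambda: "FULL BODY: step-by-step migration instructions"),
]

if __name__ == "__main__":
    print(upfront_context(SKILLS))
    print(invoke(SKILLS, "db-migration"))
```

Adding a tenth or fiftieth skill grows the upfront context by one short line each, not by the size of each body.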

Parallel

Sub-agents

Best when the how doesn't matter — only the output does. Each sub-agent has its own isolated context window; the parent only receives what you define in the output format.

Don't use them when the task is complex and you'll likely need to give feedback mid-run — you can't follow up with a sub-agent once it's started.
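The isolation described above can be mimicked with plain worker functions: each one keeps its own scratch context and hands back only the fields of an agreed output format. The function names, output fields, and tasks are hypothetical, and threads merely stand in for real sub-agents.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sub_agent(task: str) -> dict:
    """Do the work in a private context and return only the agreed output."""
    # Private working context: never returned to the parent.
    scratch = [f"explore {task}", f"draft {task}", f"verify {task}"]
    return {"task": task, "status": "done", "steps": len(scratch)}


def run_parallel(tasks: list[str]) -> list[dict]:
    """Run independent tasks side by side; results keep the input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_sub_agent, tasks))


if __name__ == "__main__":
    for result in run_parallel(["update docs", "add unit tests"]):
        print(result)
```

Note that `run_sub_agent` takes no feedback once started, which mirrors the limitation above: if you expect to steer the work mid-run, keep it in the parent chat.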

Commands vs Skills: a command is a skill with disableModelInvocation: true, meaning you invoke it manually rather than having the agent invoke it automatically. You can migrate existing commands to skills with /migrate-to-skills. To browse production examples, go to Settings → Plugins → Marketplace → "Cursor Team Kit".

Caching — a hidden cost driver

Switching models mid-conversation breaks the cache. The full thread must be re-sent to the new model. Pick your model at the start and stick with it for the duration of a task.

Don't edit rules or skills mid-conversation. Once read, they're cached in the thread. Changing them forces a re-read and breaks the cache for everything that follows.

Auto mode can silently break the cache by switching models between turns without telling you. For cache-sensitive tasks, pick a model explicitly.

Forking a chat breaks the cache — it's a brand new session. Sometimes it's worth it (cleaner context = better output), but be aware of the tradeoff.

Adding a new message does not break the cache. Prior context is unchanged. A window refresh also doesn't break it — cache is managed server-side by the LLM provider, not in your browser.
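These rules follow from how prefix caching generally behaves: a cached prefix is only reusable when the model and everything before the newest message are byte-for-byte identical. The toy cache below illustrates that idea; the key scheme and class are assumptions made for this sketch, not a description of how any provider or Cursor implements caching, and it does not model forking.

```python
import hashlib


def cache_key(model: str, rules: str, messages: list[str]) -> str:
    """Hypothetical cache key: hash of the model plus the exact prompt prefix."""
    payload = "\x00".join([model, rules, *messages])
    return hashlib.sha256(payload.encode()).hexdigest()


class PrefixCache:
    """Toy server-side cache keyed on model + rules + prior messages."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def request(self, model: str, rules: str, messages: list[str]) -> bool:
        """Return True if everything before the newest message was cached."""
        hit = cache_key(model, rules, messages[:-1]) in self._seen
        self._seen.add(cache_key(model, rules, messages))
        return hit


if __name__ == "__main__":
    cache = PrefixCache()
    print(cache.request("model-a", "rules v1", ["m1"]))              # first call
    print(cache.request("model-a", "rules v1", ["m1", "m2"]))        # new message
    print(cache.request("model-b", "rules v1", ["m1", "m2", "m3"]))  # model switch
    print(cache.request("model-a", "rules v2", ["m1", "m2", "m3"]))  # rules edited
```

Only the second call reuses the cache: appending a message leaves the prefix untouched, while switching the model or editing the rules changes the key for everything that follows.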

Quick reference — do vs don't

Do

Give the agent a specific, scoped task

Use plan mode for anything non-trivial

Edit the plan before pressing Build

Tag the exact file with @ instead of letting the agent search

Start a new chat when switching tasks

Summarize manually when context gets high

Don't

Say "make the app better" on an expensive model

Switch models in the middle of a task

Edit rules or skills mid-conversation

Let context go above 65% without acting

Use a sub-agent for tasks you'll need to iterate on

Replicate linter/formatter logic inside rules