Making Python More LLM and Agent-Friendly

This post was written in collaboration with Anthropic's Opus 4.8 High Reasoning model.

There's a tension at the heart of writing code for AI agents. The features that make a language pleasant for humans to write quickly—dynamism, terseness, implicitness, metaprogramming—are largely the same ones that make it hard for a large language model to generate reliably. An LLM generates code left to right, committing to each token before it sees what comes next, and it does this without the ability to compile or run anything mid-thought. So the properties that help it are the ones that make the next correct token predictable from nearby context: a parallel channel of machine-checkable intent, locality of meaning, explicitness over cleverness, and a fast feedback loop it can iterate against.

The languages that have these properties built in—statically typed, explicit, with excellent compiler feedback—tend to be the ones humans dismiss as bureaucratic. But that verbosity isn't overhead; it's the redundancy that keeps generation on track. Rust's borrow checker and Elm's error messages turn the compiler into a collaborator rather than a gatekeeper.

The good news is that you don't have to switch languages to get most of these benefits. An LLM's competence in a language depends heavily on how much of that language it saw during training, and Python sits near the top of that distribution. So the winning move isn't to abandon Python for something theoretically purer; it's to bolt the helpful properties onto the language the model already knows best. With strict type checking, runtime validation, and a one-command feedback loop, you can recreate much of the verify-and-iterate cycle that compiled languages get for free, while keeping Python's enormous data advantage.

Here's how to do it, roughly in order of effort-to-payoff.

1. Strict static type checking (the single biggest win)

This is the parallel verification channel. Use pyright (or mypy) in strict mode so the agent gets immediate, local feedback on whether its generated code is internally consistent.

# pyproject.toml
[tool.pyright]
typeCheckingMode = "strict"
reportMissingTypeStubs = true

Then actually annotate everything—function signatures, return types, class attributes. Type hints don't just catch errors; they constrain what the next correct token can be, which is exactly what helps a generator. An agent that can run pyright after editing has a fact-checker in the loop instead of waiting for a runtime crash.
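As a small illustration of what "annotate everything" can look like, here is a minimal sketch. The names `RetryPolicy` and `backoff_delays` are invented for this example and don't come from any library; the point is only that every field, parameter, and return value carries a type the checker can verify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical settings object, invented for this sketch."""

    max_retries: int
    base_delay_seconds: float


def backoff_delays(policy: RetryPolicy) -> list[float]:
    """Return the exponential backoff delay for each retry attempt."""
    return [
        policy.base_delay_seconds * (2**attempt)
        for attempt in range(policy.max_retries)
    ]


delays = backoff_delays(RetryPolicy(max_retries=3, base_delay_seconds=0.5))
print(delays)  # [0.5, 1.0, 2.0]
```

With annotations like these in place, a call such as `backoff_delays("3")` is something a strict type checker should flag at edit time, before anything runs.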

2. A fast linter and formatter

Ruff covers both and is fast enough to run on every edit. Formatting matters more than it seems: consistent style means the model isn't burning probability mass deciding between equivalent layouts, and a deterministic formatter keeps diffs small and reviewable.

[tool.ruff]
line-length = 88

[tool.ruff.lint]
select = ["E", "F", "I", "UP", "B", "SIM", "RUF", "ANN"]
# ANN flags missing annotations; UP pushes modern syntax; B catches bug-prone patterns
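To make the rule families concrete, here is a rough before-and-after sketch. The function `add_tag` is hypothetical, and the rule codes in the comments (B006 for mutable defaults, the UP family for modern annotation syntax, ANN for missing annotations) are what I'd expect the selection above to report; exact codes and messages may vary by Ruff version.

```python
from __future__ import annotations

# Before: patterns the selected rule sets would typically flag.
#
#   from typing import List, Optional   # UP: prefer list[...] and X | None
#
#   def add_tag(tag, tags=[]):          # ANN: no annotations; B006: mutable default
#       tags.append(tag)
#       return tags


# After: annotated, modern syntax, and no state shared between calls.
def add_tag(tag: str, tags: list[str] | None = None) -> list[str]:
    """Return a new list with `tag` appended, leaving the input untouched."""
    result = list(tags) if tags is not None else []
    result.append(tag)
    return result


print(add_tag("a"))
print(add_tag("b"))
```

The mutable-default version would return `['a', 'b']` on the second call because the default list persists between calls; the rewritten version returns a fresh list each time.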

3. Runtime validation at the boundaries

Python does not enforce type hints at runtime, so anything crossing a trust boundary (API payloads, config files, LLM outputs themselves) should be validated. Pydantic turns a schema into both a static type and a runtime guarantee, and its validation errors are descriptive enough for an agent to self-correct from.

from pydantic import BaseModel


class AgentConfig(BaseModel):
    model: str
    max_retries: int = 3
    timeout_seconds: float = 30.0

This is also the cleanest way to get structured output from an LLM: define the schema, validate the response, and feed the error back on failure.
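The define-validate-feed-back loop might be sketched like this. It assumes Pydantic v2 (for `model_validate_json`), and `ask_model` is a hypothetical stand-in for whatever function sends a prompt to your LLM and returns its raw text reply; here a canned list of replies plays that role so the example runs on its own.

```python
from collections.abc import Callable

from pydantic import BaseModel, ValidationError


class AgentConfig(BaseModel):
    model: str
    max_retries: int = 3
    timeout_seconds: float = 30.0


def parse_with_feedback(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, raw JSON text out
    prompt: str,
    attempts: int = 3,
) -> AgentConfig:
    """Ask for JSON, validate it, and feed validation errors back on failure."""
    current_prompt = prompt
    for _ in range(attempts):
        raw = ask_model(current_prompt)
        try:
            return AgentConfig.model_validate_json(raw)
        except ValidationError as exc:
            # The error text names the failing field and the reason it failed.
            current_prompt = f"{prompt}\n\nYour last reply was invalid:\n{exc}"
    raise ValueError(f"no valid response after {attempts} attempts")


# Stand-in for a real model: the first reply is invalid, the second is valid.
replies = iter(
    [
        '{"model": "demo", "max_retries": "many"}',
        '{"model": "demo", "max_retries": 5}',
    ]
)
config = parse_with_feedback(lambda _prompt: next(replies), "Return the config as JSON.")
print(config.max_retries)
```

The first canned reply fails validation because `"many"` is not an integer, so the loop retries with the error appended to the prompt and accepts the second reply.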

4. Tests that double as a feedback signal

For an agent, tests are less about catching regressions and more about giving it something to iterate against. pytest with descriptive assertions works well because failures print expected-vs-actual inline. Keep tests close to the code they cover and prefer many small, named tests over a few large ones—each failure then localizes the problem.

[tool.pytest.ini_options]
addopts = "-v --tb=short"
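For a sense of what "many small, named tests" means in practice, here is a minimal sketch. `slugify` is a hypothetical function under test, defined inline so the example is self-contained; in a real project it would live in its own module with the tests in a neighboring file.

```python
def slugify(title: str) -> str:
    """Hypothetical function under test: lowercase and join words with hyphens."""
    return "-".join(title.lower().split())


def test_slugify_lowercases_words() -> None:
    assert slugify("Hello World") == "hello-world"


def test_slugify_collapses_repeated_whitespace() -> None:
    assert slugify("a   b") == "a-b"


def test_slugify_of_empty_string_is_empty() -> None:
    assert slugify("") == ""
```

Each test name states the behavior it checks, so a failure report alone tells the agent which property broke without it having to read the test body.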

Property-based testing with Hypothesis is worth adding for pure functions, since it surfaces edge cases the model wouldn't think to write.
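A property-based test could look roughly like the following. `dedupe_preserving_order` is a hypothetical pure function invented for the sketch; the Hypothesis pieces used (`given` and `strategies.lists` / `strategies.integers`) are part of its public API.

```python
from hypothesis import given
from hypothesis import strategies as st


def dedupe_preserving_order(items: list[int]) -> list[int]:
    """Hypothetical pure function: drop repeats, keep first occurrences."""
    seen: set[int] = set()
    result: list[int] = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


@given(st.lists(st.integers()))
def test_dedupe_properties(items: list[int]) -> None:
    once = dedupe_preserving_order(items)
    assert dedupe_preserving_order(once) == once  # running it twice changes nothing
    assert set(once) == set(items)  # nothing lost or invented
    assert len(once) == len(set(once))  # no repeats remain
```

Instead of hand-picking inputs, the test states properties that should hold for any list of integers and lets Hypothesis generate the cases, including empty lists and heavy duplication.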

5. Project structure that maximizes locality

This is the "understand a chunk without distant context" property. Concretely: keep modules small and single-purpose so a relevant unit fits in a reasonable context window. Prefer explicit imports over star imports, and avoid deep dynamic behavior such as monkey-patching, metaclass magic, and runtime attribute injection. Those are the "spooky action at a distance" features that force the model to reason about code it can't see. Docstrings on public functions help too: they put intent in natural language right where the model is editing.
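One way to picture the difference is to contrast runtime attribute injection with an explicit declaration. The `Session` class and `can_delete` function below are hypothetical, made up for this sketch.

```python
from dataclasses import dataclass

# Harder to follow: attributes appear at runtime, so a reader (or a model)
# has to find every call site to learn which fields exist.
#
#   session = type("Session", (), {})()
#   setattr(session, "user_id", 42)


@dataclass
class Session:
    """Explicit alternative: every field is declared where the class is defined."""

    user_id: int
    is_admin: bool = False


def can_delete(session: Session, owner_id: int) -> bool:
    """Return True if the session's user may delete a resource owned by `owner_id`."""
    return session.is_admin or session.user_id == owner_id


print(can_delete(Session(user_id=1), owner_id=1))
```

Everything needed to understand `can_delete` is visible in this one snippet: the fields, their types, and the intent stated in the docstring.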

6. Make the toolchain runnable in one command

An agent is only as good as its feedback loop, so collapse the checks into a single entry point it can call after every change. A Makefile works well:

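.PHONY: check
# Recipe lines must be indented with a tab, not spaces.
# Lint fixes run before formatting so the formatter tidies whatever the fixes changed.
check:
	ruff check --fix .
	ruff format .
	pyright
	pytest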

Now the agent's loop is: edit → make check → read errors → fix. That closes the gap between Python and the compiler-driven languages—you've manually assembled the feedback signal that Rust gets for free.
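If the agent harness is itself written in Python, the "run checks, read errors" half of that loop could be sketched as below. `run_checks` is a hypothetical helper, not part of any tool; the demo uses stand-in commands so it runs anywhere Python does, where a real project might pass `[["make", "check"]]` instead.

```python
import subprocess
import sys


def run_checks(commands: list[list[str]]) -> list[str]:
    """Run each command and return the combined output of those that fail."""
    failures: list[str] = []
    for command in commands:
        result = subprocess.run(command, capture_output=True, text=True, check=False)
        if result.returncode != 0:
            shown = " ".join(command)
            failures.append(f"$ {shown}\n{result.stdout}{result.stderr}")
    return failures


if __name__ == "__main__":
    # Stand-in commands: one passes, one exits non-zero with a message on stderr.
    demo = [
        [sys.executable, "-c", "print('ok')"],
        [sys.executable, "-c", "import sys; sys.exit('type error in foo.py')"],
    ]
    for report in run_checks(demo):
        print(report)
```

Returning only the failing output keeps the text handed back to the agent short, which matters when every line of tool output competes for context.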

7. Pin the environment

Use uv (or Poetry) with a lockfile so the agent isn't fighting nondeterministic dependency resolution. A reproducible environment means an error the agent sees is a real error, not an environment artifact—which keeps the feedback signal trustworthy.

uv init
uv add pydantic
uv add --dev ruff pyright pytest hypothesis

Where to start

If you rank these by effort-to-payoff: strict pyright and Ruff come first—an afternoon of work for a huge return—then runtime validation at boundaries, then the one-command check loop. Those four alone move a Python project most of the way toward the agent-friendly sweet spot. The type checker plus the single runnable command do the heavy lifting; together they give the agent the verify-and-iterate cycle that makes the difference.

The deeper point is that you're not fighting Python's nature so much as adding discipline around it. The static channel constrains generation, the validation catches what slips through at runtime, and the one-command loop turns the whole toolchain into something an agent can actually use. None of it requires leaving the language the model already understands best.