Large language models can pass the bar, write production code, and reason about abstract mathematics — yet they fail at tasks a junior civil engineer performs intuitively. Ask GPT-4 to route a fiber cable around a building, place transformers to balance load across a neighborhood, or schedule construction work given physical access constraints, and the answers degrade quickly. We release a benchmark suite designed to measure exactly this gap.
The third modality
Most of the public LLM benchmark landscape evaluates two modalities: language (MMLU, HellaSwag, GSM8K) and, increasingly, vision (MMMU, ChartQA). A third modality — spatial reasoning over the physical world — remains under-served. Infrastructure planning, logistics, robotics, and urban design all depend on it.
Spatial reasoning is not geometry alone. It is the integration of geometry with domain constraints: a cable has a minimum bend radius, a water main has a maximum allowable pressure drop, and a delivery van has a home depot and a shift length. Solving these tasks requires a model that can hold geographic context, topological structure, and physical constraints in mind simultaneously.
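To make "geometry plus a domain constraint" concrete, here is a minimal sketch of a bend-radius check on a cable route. It assumes the route is a 2D polyline in metres and uses the circumradius of each three consecutive points as a discrete stand-in for the local bend radius. That proxy, the function names, and the sample route are illustrative assumptions, not part of the benchmark.

```python
import math

def circumradius(a, b, c):
    """Radius of the circle through three 2D points (inf if they are collinear)."""
    ab = math.dist(a, b)
    bc = math.dist(b, c)
    ca = math.dist(c, a)
    # Triangle area from the 2D cross product of (b - a) and (c - a).
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    area = abs(cross) / 2
    if area == 0:
        return math.inf  # straight segment: no bend at all
    return ab * bc * ca / (4 * area)

def bend_radius_violations(path, min_radius_m):
    """Indices of interior vertices whose approximate bend radius is below the limit."""
    return [
        i
        for i in range(1, len(path) - 1)
        if circumradius(path[i - 1], path[i], path[i + 1]) < min_radius_m
    ]

# Hypothetical route: straight run, then a 90-degree turn sampled every 10 m.
route = [(0, 0), (10, 0), (20, 0), (20, 10)]
print(bend_radius_violations(route, min_radius_m=8))
```

The geometry here is a few lines of arithmetic. The domain knowledge is the `min_radius_m` limit and the fact that it must hold at every vertex of the route, which is the kind of combined reasoning the benchmark targets. The estimate depends on how densely the route is sampled, so a real check would need a resolution suited to the cable type.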
What we measure
Our benchmark consists of four task families, each grounded in real infrastructure problems with verifiable ground truth:
- Topological reasoning — given a network of nodes and edges with spatial coordinates, answer questions about connectivity, shortest paths, and cut-set vulnerabilities.
- Constraint-aware placement — place equipment (transformers, splitters, sensors) to satisfy coverage, distance, and capacity constraints.
- Resource-bounded routing — route cables, pipes, or vehicles through obstacles given physical limits (bend radius, pressure drop, range).
- Cascading impact — predict the downstream effect of a failure or change on the rest of the network.
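The first and last task families correspond to classical graph computations, which is what makes their ground truth verifiable. The sketch below shows the kind of reference computation involved: shortest path length over Euclidean edge weights, and the set of nodes cut off from a source when one edge fails. The five-node network, its coordinates, and all function names are hypothetical and are not taken from the benchmark dataset or harness.

```python
import heapq
import math
from collections import deque

# Hypothetical toy network: node id -> (x, y) in metres.
coords = {"A": (0, 0), "B": (3, 0), "C": (3, 4), "D": (0, 4), "E": (6, 4)}
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("C", "E")]

def build_adjacency(coords, edges, failed=frozenset()):
    """Undirected adjacency list weighted by Euclidean length, skipping failed edges."""
    adj = {node: [] for node in coords}
    for u, v in edges:
        if (u, v) in failed or (v, u) in failed:
            continue
        w = math.dist(coords[u], coords[v])
        adj[u].append((v, w))
        adj[v].append((u, w))
    return adj

def shortest_path_length(adj, src, dst):
    """Dijkstra's algorithm; returns inf if dst is unreachable."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf

def reachable(adj, src):
    """Breadth-first search: all nodes connected to src."""
    seen = {src}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v, _ in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def nodes_cut_off(coords, edges, source, failed_edge):
    """Cascading impact: nodes that lose their connection to source if one edge fails."""
    adj = build_adjacency(coords, edges, failed={failed_edge})
    return set(coords) - reachable(adj, source)

adj = build_adjacency(coords, edges)
print(shortest_path_length(adj, "A", "E"))
print(nodes_cut_off(coords, edges, "A", ("C", "E")))
print(nodes_cut_off(coords, edges, "A", ("A", "B")))
```

In this toy network the ring A-B-C-D survives the loss of any single ring edge, while the spur C-E is a single point of failure for node E. A model is asked to reach the same conclusion from the coordinates and edge list alone.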
Headline results
Frontier models score highest when a spatial scene is described in natural language, and drop sharply when the same scene is presented as structured geographic data: the best score falls by 21 to 29 percentage points in every task family. Performance falls further when the task requires composing multiple constraints rather than evaluating them in isolation.
| Task family | Best model, natural-language input | Best model, structured input | Human expert |
|---|---|---|---|
| Topological reasoning | 71% | 43% | 94% |
| Constraint-aware placement | 58% | 29% | 91% |
| Resource-bounded routing | 47% | 22% | 88% |
| Cascading impact | 52% | 31% | 86% |
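
The two gaps in the table can be read off with a few lines of arithmetic. The snippet below copies the scores from the table and computes, for each task family, the drop from natural-language to structured input and the remaining distance to the human expert. The variable names are illustrative.

```python
# Scores from the table, in percent:
# (best model with natural-language input, best model with structured input, human expert)
results = {
    "Topological reasoning": (71, 43, 94),
    "Constraint-aware placement": (58, 29, 91),
    "Resource-bounded routing": (47, 22, 88),
    "Cascading impact": (52, 31, 86),
}

# Percentage points lost when the scene is given as structured data.
drops = {family: verbal - structured
         for family, (verbal, structured, _) in results.items()}

# Percentage points between the structured-input score and the human expert.
human_gaps = {family: human - structured
              for family, (_, structured, human) in results.items()}

for family in results:
    print(f"{family}: drop {drops[family]} pts, gap to expert {human_gaps[family]} pts")
```

Constraint-aware placement shows the largest drop between input formats, and resource-bounded routing sits furthest from expert performance.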
Why this matters
Infrastructure design is one of the largest under-automated domains in the global economy. Power grids, telecom networks, water systems, logistics, and urban planning all depend on engineers performing repetitive spatial reasoning tasks. Closing the gap between language-grounded LLMs and the physical world would unlock tools that can plan, validate, and operate this infrastructure with the precision the work demands.
What we are releasing
- The full benchmark dataset across the four task families, with synthetic and real-world splits.
- An evaluation harness compatible with the OpenAI, Anthropic, and Google APIs.
- A scoring rubric that distinguishes constraint satisfaction from optimality, so a model that produces a feasible plan is rewarded even if it is not optimal.
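One way to picture a rubric that separates constraint satisfaction from optimality is sketched below. This is not the released rubric: the `PlanResult` fields, the 0.6 feasibility weight, and the cost-ratio formula are all assumptions chosen only to illustrate the idea that a feasible plan earns credit before optimality is considered.

```python
from dataclasses import dataclass

@dataclass
class PlanResult:
    violated_constraints: int  # number of hard constraints the plan breaks
    cost: float                # cost of the proposed plan (must be positive)
    best_known_cost: float     # cost of the reference solution

def score(plan: PlanResult, feasibility_weight: float = 0.6) -> float:
    """Hypothetical two-part score in [0, 1].

    An infeasible plan scores 0. A feasible plan earns the feasibility
    weight outright, plus an optimality bonus that shrinks as its cost
    rises above the best known cost.
    """
    if plan.cost <= 0:
        raise ValueError("cost must be positive")
    if plan.violated_constraints > 0:
        return 0.0
    optimality = min(1.0, plan.best_known_cost / plan.cost)
    return feasibility_weight + (1 - feasibility_weight) * optimality

print(score(PlanResult(violated_constraints=0, cost=200.0, best_known_cost=100.0)))
print(score(PlanResult(violated_constraints=1, cost=90.0, best_known_cost=100.0)))
```

Under these assumed weights, a feasible plan that costs twice the reference still outscores a cheaper plan that breaks a constraint, which matches the stated intent of rewarding feasibility even without optimality.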
Spatial reasoning is the bottleneck between language models and the physical world. The faster the field measures it, the faster the gap closes.
Open collaboration
The benchmark and harness are open. We are particularly interested in contributions that extend the real-world splits to new domains: water, electric power, drone corridors, forestry, and logistics. If you work on spatial AI, autonomous design, or geospatial optimization, we would like to hear from you.