Treating Excel as the Spec

I wrote an Excel-compatible calculation engine in two weeks. =VLOOKUP, =SUMPRODUCT, the rest — they had to return the same results as Microsoft Excel. Build an AST, implement the functions, write the tests. A small project called formulon.

What is an AST?

Abstract Syntax Tree. A tree-structured representation of a program or expression. =A1+B1*2 expands into something like "+ with A1 on the left and a * node on the right (the * having B1 on the left and 2 on the right)." The calculation engine evaluates this tree, not the original string.

The thing I noticed first: there is no spec.

To be precise, the file format spec exists (ECMA-376 / OOXML). XML schemas, cell coordinate notation, style storage — all written down. But "what does =SUMPRODUCT(A1:A10, B1:B10) return when one argument contains an error value?" Or "how does =PERCENTILE.EXC(data, 0.95) round at the boundary?" Or "does =VLOOKUP match the dakuten on half-width katakana?" — these calculation semantics are documented nowhere.

What is ECMA-376 / OOXML?

The Office Open XML specification. It defines how Microsoft Office file formats (.xlsx, .docx, .pptx) are represented as a ZIP archive of XML files. Standardized by ECMA International, later adopted as ISO/IEC 29500. It defines file structure but not Excel's calculation behavior. It can say "this cell contains this formula" but never "evaluating this formula yields this value."

How do you implement what isn't written down? I started by reading MSDN. The MSDN pages and Excel's actual behavior diverged almost immediately. Worse, Excel 365 and Excel 2019 return different results for the same function. Negative dimensions in OFFSET, the argument-type error in SHEETS, dakuten composition for half-width katakana — all version-dependent.

Two weeks doesn't leave time to read the spec and clean-room the engine. Forget trusting the spec. Treat the artifact as the answer. Cheaper, by orders of magnitude.

This is a note on the methodology.

The limits of golden files

I started with goldens.

What are golden files / snapshot tests?

You save the "expected output" to a file (the golden file), then compare it against the actual output during test runs. React component testing with jest, CLI output comparison, and so on. The premise: same input always produces same output.

Feed Excel a few hundred formulas, dump the results as JSON, compare against that during tests. Snapshot testing — familiar to any web engineer. The first few hundred cases went smoothly.

Then Excel pushed an update and the goldens rotted.

One day, the output of =ARRAYTOTEXT("") changed. Was it from "" to []? Back to empty string? I couldn't tell. When a golden mismatch came in, I had no way to separate my regression from Excel's behavior change.

What is ARRAYTOTEXT?

A function added in Excel 365. Converts an array into a string. =ARRAYTOTEXT({1,2,3}) returns "1, 2, 3" or similar. Excel 2019 and earlier don't even recognize the name and return #NAME?. Later in this article, it serves as the sentinel formula for detecting whether we're talking to Excel 365 or 2019.

What is a regression?

When code that used to work stops working, usually as a side effect of a new change. Tests that were passing start failing. Also called a "regression bug."

Snapshot testing has a fundamental assumption — "the target stays still" — that doesn't hold for Excel. Excel is a live product. New builds ship monthly. 365 differs from 2019 differs from 2016. ja-JP differs from en-US differs from de-DE. macOS Excel differs from Windows Excel.

ja-JP / en-US / de-DE

Locale identifiers following BCP 47. ja-JP means Japanese as spoken in Japan, en-US is American English, de-DE is German as spoken in Germany. Excel changes its error messages (#VALUE! becomes #WERT! in German, #VALEUR! in French), number formats (. versus , for the decimal point), and argument separators (, versus ;) based on this locale.

The cost of maintaining static goldens started to exceed the value the tests were protecting.

The live oracle

So I switched approach. Run Excel itself during the test run, and pull the results out.

The test harness runs on Linux (specifically WSL2, both in CI and on my dev machine). Excel only runs on Windows. COM automation (via xlwings) lets Windows-side Python drive Excel; the results come back to the Linux-side harness. Each test case compares its own result against Excel's, on the spot, every time.

What is WSL2?

Windows Subsystem for Linux 2. A way to run a real Linux kernel inside a lightweight VM on Windows. You can reach Windows files at /mnt/c/... and invoke Windows .exe binaries from the WSL2 shell. The use case: "I want my dev environment to be Linux, but I also need to integrate with Windows-native apps like Excel."

What is COM automation / xlwings?

COM (Component Object Model) is Windows' inter-process communication mechanism. Office apps like Excel and Word expose APIs through COM so external programs can drive them. xlwings is a Python library that wraps this COM layer; you write code like app.books.add() or sheet.range('A1').value = 1. Anything you can do in VBA, you can usually do from xlwings.

Now I can tell whether "Excel changed" or "I broke it." If Excel changed, Excel's output changes too. If both move the same amount, no divergence. If only one moves, you know which side caused it.

What is divergence?

In formulon, "divergence" means a case where my implementation and Excel's behavior disagree. A ledger file at tests/divergence.yaml records which cases diverge, and why. The ledger's use is covered later in this article.

This approach has two big pitfalls.

Don't cold-start 92 suites

My first naive implementation called xlwings.App(visible=False) at the start of every run_suite() invocation. Excel boots, a temporary workbook gets created, then we tear it all down. Cold start: 5–15 seconds.

There were 92 suites.

92 × 10 seconds ≈ 15 minutes. Added to every test run. While Excel boots, the Linux-side CPU sits idle. The Windows-side CPU paints Excel's splash screen. Nobody benefits.

The fix is a persistent stdio server. Launch the Windows-side Python in --serve mode; it reads JSON requests from stdin and writes results to stdout. Excel boots once and is shared across every suite.

One cold start covers 92 suites. Wall-clock: 15 minutes down to 30 seconds.

I hit a trap mid-implementation. Windows Python uses locale-dependent encoding for stdin/stdout by default. A ja-JP Windows means CP932. When Excel's COM raises an error, you get back a Japanese message like "例外が発生しました。", which mojibakes if the WSL2 side reads it as UTF-8.

CP932 / WSLENV / PYTHONUTF8

CP932: The character encoding for the Japanese Windows locale (ja-JP). Microsoft's extension of Shift_JIS.

WSLENV: An environment variable that lists which env vars WSL2 should forward to Windows-side processes. The format looks like WSLENV=PATH/l:FOO/u. By default, Linux-side env vars do not cross over.

PYTHONUTF8: An environment variable that enables Python's "UTF-8 mode." Available since Python 3.7. With PYTHONUTF8=1, Python ignores locale and uses UTF-8 for stdio and file I/O.

Setting PYTHONUTF8=1 via env vars should work — but it doesn't. WSL2 only forwards env vars listed in WSLENV to Windows processes. PYTHONUTF8 isn't on the default list.

What works instead is python.exe -X utf8=1. A command-line argument bypasses the WSLENV gate entirely and reliably forces Python into UTF-8 mode. formulon's wsl_bridge.py carries this discovery and the verification trail in comments.

I hit another trap. To read a cell's displayed text via COM, I call cell.api.Text. Error cells sometimes have .value as None, so the displayed text is what tells you about #DIV/0! and friends.

I wrote a defensive fallback: "if Text doesn't work, try text (lowercase)." All mine.

At the IDispatch level, Excel doesn't distinguish case — Text and text resolve to the same DISPID. The problem sits one layer up, in win32com. COMRetryObjectWrapper has a code path that retries forever on unknown lowercase attributes. For cases where the result is legitimately empty (=ARRAYTOTEXT("") returning ""), Text returned "", then text got tried, then the retry loop spun. The test run hung for 30+ minutes; Excel's CPU sat at 60%.

IDispatch / DISPID / COMRetryObjectWrapper

IDispatch: The COM late-binding mechanism for resolving function names at runtime. The attribute name goes through GetIDsOfNames to become an integer DISPID, then through Invoke. Excel's automation API is exposed through IDispatch.

COMRetryObjectWrapper: A wrapper in Python's win32com library that retries COM calls automatically. Useful when Excel returns a transient "call was rejected by callee." Convenient — except at certain edge cases with unknown attribute names, it can enter an infinite loop with no exit condition.

Took a full day to find the cause. The fix was small: try Text only, no fallback. The lesson: you can't read what counts as undefined behavior through COM. Code that looks defensive can be a landmine.

Application state leakage

The second pitfall is nastier.

A COM object carries process-local global state. The Application instance exposed via xlwings.App holds the window zoom, the view mode, the calculation mode, the print preview state — all internally. Closing the workbook doesn't clear them. The next workbook inherits them.

The specific snag was PageBreakPreview.

What is PageBreakPreview?

One of Excel's view modes — the "Page Break Preview" you find in the View menu. The normal editing view shows a rough page-break line based on the physical screen size; it doesn't match the actual print pagination. PageBreakPreview makes Excel run the real pagination pass and draws actual page boundaries as solid blue lines. To read print layout from code, you switch into this mode first, then read HPageBreaks / VPageBreaks / Pages.Count.

To read page breaks, you have to switch Excel into PageBreakPreview mode (PBP for short). In the normal editing view, Pages.Count is influenced by the display zoom and returns inaccurate values. Switching to PBP triggers real pagination, and HPageBreaks (horizontal) and VPageBreaks (vertical) match the physical print output.

I wrote the obvious thing: "for each case, switch to PBP, read, switch back."

It ran stably for a while. Then one day, the same case started returning different results depending on where it was in the suite. Running zoom_50_dense_fit_off alone gave pages: 2; running it after another case gave pages: 3. 100% reproducible.

Turned out the PBP window zoom is global state on the Application. If the previous case had left PBP at a different zoom, the next case's read drifted. I thought I was "resetting state per case," but the reset was too shallow.

The fix: pin the PBP zoom explicitly to 60 (the default value) every time. Not in the spec. Implicit assumption of Excel COM.

To put it in web-server terms: this is like creating a fresh request object for every request, but somehow the previous request's env partially survives. You're tripping over process-local globals without realizing it. Not a class of bug you see in the multi-tenant web. Unavoidable in desktop automation.

Where	Web intuition	Excel COM reality
Process	Discarded per request	Persistent from launch to exit
Global state	Effectively zero across requests	Application object holds plenty
Reset boundary	Explicit (new request)	Implicit (survives unless cleared)

Confirm who you're talking to, every time

A hidden virtue of the live oracle approach: you can confirm who's on the other side at startup.

Microsoft 365 vs Excel 2019

Microsoft 365 (M365): The subscription edition of Excel. Monthly updates bring new functions (ARRAYTOTEXT, FILTER, LET, LAMBDA, etc.) and features (dynamic arrays, SPILL, expanded 3D references) on a rolling basis.

Excel 2019 / 2016: The perpetual-license edition. The function set is frozen at release. Many of the 365 functions don't exist and return #NAME? when typed.

formulon assumes M365. Running oracle generation against Excel 2019 produces piles of meaningless errors because the new functions aren't there.

Excel 365 and Excel 2019 differ in what functions exist and how the shared functions behave. ARRAYTOTEXT is 365-only. SHEETS(text-arg) returns #VALUE! on 2019 but behaves differently on 365. Negative dimensions in OFFSET, dakuten matching in VLOOKUP, 3D-reference support — all differ between 365 and 2019.

If you lump them together as "Excel," the cause of a divergence — "my implementation is wrong" versus "the target is a different version" — gets entangled.

formulon throws a sentinel formula at the entry point of oracle generation.

python

sht.range("A1").formula2 = "=ARRAYTOTEXT(1)"
app.calculate()
v = sht.range("A1").value
if isinstance(v, str) and v == "#NAME?":
    raise RuntimeError(
        "Excel does not recognise ARRAYTOTEXT — this Excel install is "
        "pre-M365 (Office 2019 or earlier). Formulon's oracle requires "
        "Microsoft 365."
    )

One formula identifies whether the target is M365 or Office 2019 or earlier. #NAME? returned: abort. Anything else: proceed.

The result is reflected in the golden directory layout. tests/oracle/variants/win-365-ja_JP/ and tests/oracle/variants/win-2019-ja_JP/ — goldens are split on two axes: version × locale. The same formula carries a different expected value if the target is different.

Without this, the divergence ledger collapses. Reports of "differs from Excel" mix behavior differences from different versions, and you can't prioritize.

Empirical rule discovery — vary one axis at a time

Here's the heart of it.

For the regions of Excel's behavior that the spec and MSDN both leave undocumented — print pagination is the classic example — how do you find the rules?

The naive approach is to prepare 100 different setting combinations, capture goldens, and diff against your own implementation. It doesn't work. Each of the 100 cases differs in 100 ways. You can't tell cause from effect.

formulon's print_matrix suite took a different approach. Vary one axis per case.

Axis	Variable	Pinned	What it isolates
Axis A	Zoom ∈	FitToPages OFF, dense data	The effect of Zoom alone on break count
Axis B	FitToPages ∈	Zoom 100, dense data	Precedence between Fit and Zoom
Axis C	Data density ∈	Fixed geometry	Whether empty columns suppress page breaks
Axis D	PrintTitleRows depth ∈	40-row dense grid	How many points header rows take from body capacity
Axis E	Scale × vertical extent combinations	Everything else fixed	How Scale affects vertical breaks

Within each axis, diff the results as the variable changes. Four cases for Axis A, three for Axis D. The effect of that one axis becomes visible, without the noise of the others.

This is the design-of-experiments idea. To isolate each variable's effect in a multi-variable system, you change one variable at a time. Textbook, but rarely done deliberately in software testing.

What is design of experiments?

A methodology for efficiently uncovering how multi-variable systems behave. Techniques include "vary one factor at a time (OFAT)" and "use orthogonal arrays to prune the combinations." It originated in manufacturing quality control and agricultural yield research. Pairwise testing and combinatorial testing are software-testing applications.

What did Axis D actually reveal? I ran PrintTitleRows = 1, 3, 5 against an A1:D40 grid. "More header rows means less body capacity" is intuitive. The question is how much less.

The naive hypothesis: subtract the physical height of the title rows (in points) from the body. 15pt per row, so 5 rows means 75pt subtracted.

What I measured: all three cases produced h_breaks=[39]. Same break position for 1, 3, and 5 title rows. The "subtract physical height" model can't explain that.

What's happening: Excel reserves a minimum of 5 default rows' worth of height for title rows. Even when you specify just 1, it subtracts 5 rows' worth internally. Same for 3. So all three cases produce identical results.

The code that captures this is a constant: kMinTitleReserveRows = 5.0. The title height subtracted from body height is max(actual_title_height, 5 × default_row_height). A formula derived from measurement, dropped directly into the implementation.

This rule appears in neither MSDN nor ECMA-376. Reverse-engineered from Excel COM behavior. Only possible because each axis was varied alone.

Rule vs Quirk

Empirical discovery produces two kinds of results.

The first kind: rules you can put in code. Like Axis D's min-title-reserve, a deterministic formula ("N title rows → subtract M points from body") that goes into src/print/pagination.cpp.

The second kind: quirks that resist rules. In the low-scale regions of Axis A (Zoom = 50, 25), Excel's VPageBreaks places column auto-breaks at positions that no geometric model explains. Same print area, same scale, same data — and the break count varies depending on the PageBreakPreview state. formulon's geometric model can't reproduce it, no matter what we do.

Print/pagination is an area long discussed in the compatibility debates of Excel-compatible tooling (LibreOffice, OnlyOffice, Aspose, and others). No one has reached full parity. The values you can read through COM themselves shift depending on Excel's internal cache and window state. This isn't an "implementation problem" — it's closer to "the target isn't deterministic."

What to do, then. formulon declares it, with a reason, in tests/divergence.yaml.

yaml

- id: zoom_50_dense_fit_off
  mode: skip-oracle
  applies_to: [win-365-ja_JP]
  reason: "Excel PBP emits scale<=50 column auto-breaks that don't follow geometric pagination (community-documented COM quirk)."
  prefer: formulon
  first_noted: 2026-05-17
  last_verified_excel_version: "16.0"

Three things to notice.

mode: skip-oracle: This case is excluded from both oracle-gen and oracle-verify. We're saying "the target isn't deterministic, so we won't compare."
reason: Natural-language explanation of why we skip. Cite a commit, issue, or document.
prefer: formulon: A declaration that "my implementation is the one that's right." prefer: win-excel-365 means the target is.

This is not a "TODO comment." It is a machine-readable compatibility ledger. Each entry is keyed to the oracle harness by id and scoped to a version × locale by applies_to. oracle_gen.py reads the ledger and changes its behavior.

When you're writing under time pressure, this distinction matters. Being able to immediately classify "is my implementation wrong" versus "is the target non-deterministic" changes debugging efficiency by an order of magnitude. With TODO comments, the person who wrote them can't remember what they deferred by the next day. With the ledger, you read reason and prefer and you know.

The hidden virtue: the boundary is always current, machine-readable, not natural-language. Compatibility knowledge becomes formalized in parallel with the implementation.

Draw the line and stop

To be honest, formulon hasn't fully solved print/pagination. tests/divergence.yaml still holds entries like zoom_50_dense_fit_off, zoom_25_dense_fit_off, scale_50_short_row. Low-scale regions of Axis A. Some scale × vertical-extent combinations from Axis E.

Two weeks wasn't enough to push these into "solved." Reproducing Excel's behavior on the implementation side requires the existence of a reproducible rule. For these areas, the values coming through COM aren't reliably reproducible to begin with.

But for the regions where rules emerged — Axis D's min-title-reserve, Axis B's FitToPages precedence, Axis C's sparse-data break suppression — I dropped them into code.

The output of two weeks isn't "solved everything." It's "drew a machine-readable boundary between what can be ruled and what can't." Codify what can be codified. Declare what can't, in the ledger. Being able to declare what we can't do was the condition for focusing on what we can.

Print isn't the only hard place

Print/pagination was the easy-to-explain example. But the live oracle approach helped elsewhere too.

Recalc and dependencies

When a cell's value changes, Excel recalculates every cell that depends on it. Change A1 in B1 = A1 * 2, and B1 updates automatically. Add C1 = B1 + 1, and C1 cascades.

This "walk the dependency graph and recalc only what's needed" is recalc. formulon had to implement it.

What is a dependency graph?

A directed graph representing references between cells. B1 = A1 * 2 creates an edge "A1 → B1." When a cell changes, you walk downstream from it to find the affected range. Same idea as build systems (make, bazel) or React's reactive system.

The hard part was races between parallel recalc and mutations (writes). During tests, a formula evaluation in flight could read the dependency graph while another path mutated the workbook. In web terms: like a database getting updated by another request mid-request.

The fix is a facade pattern called LockedMutator. All write APIs on the Workbook go through this facade; writes are serialized while a parallel recalc is in progress. The spec says nothing about "how Excel locks internally." This too had to be invented on the implementation side.

The other thorny one is iterative calculation. Excel has a setting that allows circular references. A1 = B1 + 1, B1 = A1 * 0.5 — by default rejected with #REF!-class errors, but with "enable iterative calculation" turned on, the engine switches to fixed-point iteration.

What is iterative calculation?

Excel's circular-reference feature. You specify max_iterations (the maximum number of passes) and max_change (the convergence threshold), and the engine re-evaluates every cell until the change drops below the threshold. Used in financial modeling for goal-seek scenarios (e.g., backing out a loan payment from principal and interest), engineering simulations, and probability models.

formulon's implementation drives a fixed-point loop in evaluate(). The previous pass's value seeds the next; iteration continues until |delta| < max_change or max_iterations is hit. The hard question was what counts as "converged": per cell, per workbook? Excel's docs don't say. Reverse-engineered through the live oracle.

SPILL and dynamic arrays

Dynamic arrays, introduced in Excel 365, changed the calculation engine's design fundamentally. A formula like =FILTER(A:A, A:A>10) returns a result that spans multiple cells from a single source cell. The result "spills" downward and to the right.

What is SPILL?

A feature introduced in Excel 365. When the cell you enter the formula into (the anchor cell) evaluates to an array, the array's elements automatically populate the surrounding cells. Enter =SEQUENCE(5) in A1 and 1–5 spill across A1:A5. Only the anchor holds the formula; the spilled cells are "ghost cells." If something else already occupies a destination cell, you get a #SPILL! error.

What's hard about SPILL? The timing and scope of the #SPILL! check.

After evaluating the anchor to determine the result's shape (N rows × M columns), you scan that range virtually to check for existing values. But the meaning of this check depends on evaluation mode.

Mode	Collision check
Normal mode (writable)	Verify you can actually overwrite the surrounding cells
Read-only mode (oracle evaluation)	Detect collisions virtually, given that you can't write

In oracle evaluation, I implemented a function called spill_would_collide(). It scans the anchor's "prospective spill region" without writing. Without this, =FILTER(...) is always judged "no collision," and the bare anchor value comes back instead of #SPILL!.

The SPILL implementation rippled through formulon's cell evaluation model. The simple assumption that "one cell returns one value" breaks. The distinction between anchor evaluation and ghost-cell reference had to be built into the heart of the evaluation engine.

Pivot tables

Pivot tables are one of the most spec-thin areas of Excel.

formulon implemented the v1.0 evaluator feature set: StdDev/Var aggregation, date grouping (year/quarter/month/day), axis filters, show-values-as (percent of total, running total, percent of parent row, and 10+ other post-aggregation transforms).

What is show-values-as (showDataAs)?

A transform applied to pivot-table aggregated values, after aggregation. % of grand total, running total, % of parent row, difference from, and 10+ more. The goal is to present the numbers in ways the user can read intuitively. ECMA-376 defines the attribute names (e.g., showDataAs="percentOfTotal"), but not the actual calculation rules.

ECMA-376 says almost nothing about how these behave. showDataAs="percentOfTotal" is defined as an attribute name and that's it; how to compute it is implementation-defined. What does "percent of parent row" mean? How does it behave with nested groups? All reverse-engineered.

Worse: OOXML extension elements. The pivot table XML carries Microsoft's proprietary extensions inside an <extLst> bag. Even unknown elements inside <extLst> must be preserved during read/write, or you break the file. formulon's policy is "preserve unknown extensions as opaque XML and emit them back unchanged on write." Without that pass-through, Excel reopens the file and says "the file is corrupt."

Pivots have locale dependencies too. Strings like "総計" (Grand Total) and "(空白)" ("(blank)") in the Japanese version come from a locale layer that emits the right strings per locale. Without this, ja-JP goldens don't match an en-US formulon implementation.

recalc, SPILL, pivots — they share three things: ECMA-376 coverage is thin, single-cell evaluation isn't enough, and the state isn't static. What the live oracle does is answer the cross-cutting "right answer" for situations that don't reduce to "verify one function." In thin-spec territory, you decide "the truth lives over there." You save the time you'd otherwise burn arguing with yourself about "how it should be."

Notes on short-cycle development

In a world where the spec and the implementation diverge, treating the artifact as the spec is sometimes cheaper. Not just Excel — old browser CSS quirks, OS syscalls, implicit behavior of legacy APIs. "The docs are stale, the code is the truth" appears everywhere.

The shorter the timeline, the more this matters. Trust the spec and clean-room your way in, and your first =SUMPRODUCT eats days. Hit the artifact and compare against goldens, and the upfront cost is heavy but the per-case work is fast.

The live oracle approach is one answer to that class of problem. It only runs if you kill cold start (persistent stdio), kill global-state inheritance (state leakage), and verify the target's version every time (sentinel). Skip any of these three and the slowness or flakiness of the test run becomes your schedule's bottleneck.

For empirical rule discovery: vary one axis at a time. Vary several at once and you'll never know what caused what. Short-cycle development can't afford the time spent confused.

Quirks that resist rules go in the divergence ledger. Being able to declare scope makes the actual work tractable. Saying "we don't do this" is the prerequisite for focusing on "we do this."

Perfectly reproducing Excel's behavior is probably impossible. Two weeks can only cover so much, and more time wouldn't move the ceiling far. But you can write down, explicitly, how much you reproduced and where you gave up. In a world where the artifact is the spec, that ends up being the fastest path.

Treating Excel as the Spec ​

The limits of golden files ​

The live oracle ​

Don't cold-start 92 suites ​

Application state leakage ​

Confirm who you're talking to, every time ​

Empirical rule discovery — vary one axis at a time ​

Rule vs Quirk ​

Draw the line and stop ​

Print isn't the only hard place ​

Recalc and dependencies ​

SPILL and dynamic arrays ​

Pivot tables ​

Notes on short-cycle development ​