How we ingest a planning meeting

Every two weeks a planning department in some small town in America posts a PDF to a Document Center page. The PDF is the agenda. Buried in its body are a dozen links to other PDFs: staff reports, site plans, traffic studies, public comments. The links are the only machine-readable hint that those other documents exist. The names are inconsistent. The numbering is inconsistent. Three months from now, when a resident wants to know what was decided at the May 12th hearing, the answer is technically public but practically gone.

OpenPlanning is the index that ought to exist. This post is a tour of how we turn an agenda PDF into a structured meeting record without losing the fidelity of the source material.

The pipeline at a glance

Five stages, executed in order, each producing a typed intermediate that the next stage consumes:

  1. Parse: extract structured agenda items and document references from the agenda PDF.
  2. Annotate: deterministic NER over every document's text (addresses, people, parcels, code references).
  3. Enrich: LLM-assisted summarization of staff reports.
  4. Resolve: match extracted entities to real database rows (parcels by address, applications by file number, jurisdictions by abbreviation).
  5. Load: write the result as a Plan JSON that gets applied to prod.

Each stage is a pure function from its input to its output, which makes the whole pipeline composable, testable, and re-runnable. If the LLM gives a weird answer on Tuesday, we can re-run just the Enrich stage on Wednesday without re-parsing the PDF.
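
In code, the contract is five functions with typed inputs and outputs. A minimal sketch of the composition (the intermediate type names below are simplified stand-ins, not the real schema):

    from dataclasses import dataclass

    # Simplified stand-ins for the real intermediate types.
    @dataclass
    class ParsedMeeting: ...
    @dataclass
    class AnnotatedMeeting: ...
    @dataclass
    class EnrichedMeeting: ...
    @dataclass
    class ResolvedMeeting: ...
    @dataclass
    class Plan: ...

    def parse(agenda_pdf: bytes) -> ParsedMeeting: ...
    def annotate(meeting: ParsedMeeting) -> AnnotatedMeeting: ...
    def enrich(meeting: AnnotatedMeeting) -> EnrichedMeeting: ...
    def resolve(meeting: EnrichedMeeting) -> ResolvedMeeting: ...
    def load(meeting: ResolvedMeeting) -> Plan: ...

    def ingest(agenda_pdf: bytes) -> Plan:
        # Because each stage is pure, re-running one stage only needs
        # the persisted output of the stage before it.
        return load(resolve(enrich(annotate(parse(agenda_pdf)))))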

Parse: a PDF is not a document

The first thing you learn building this is that PDF is a paginated bag of text positions, not a document with semantic structure. A staff report header on page 3 is, to the PDF, indistinguishable from a footnote on page 17. The Parse stage is responsible for recovering that lost structure.

For municipality-specific layouts we ship a MunicipalityConfig dataclass that carries per-town regexes and heuristics: how this town numbers agenda items, what its staff-report cover pages look like, which page headers to strip. Generic patterns live in defaults; town quirks override.
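
A sketch of what such a config might look like (the field names and the Springfield example are illustrative, not the real schema):

    import re
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class MunicipalityConfig:
        # Hypothetical fields; the real config carries more heuristics.
        slug: str
        item_number_re: re.Pattern = re.compile(r"^(\d+)\.([A-Z])?\s")
        staff_report_cover_re: re.Pattern = re.compile(r"STAFF\s+REPORT", re.I)
        strip_page_headers: list[str] = field(default_factory=list)

    DEFAULTS = MunicipalityConfig(slug="default")

    # A town whose agendas number items "Item 4-B:" overrides just that regex.
    SPRINGFIELD = MunicipalityConfig(
        slug="springfield",
        item_number_re=re.compile(r"^Item\s+(\d+)-([A-Z]):"),
        strip_page_headers=["Springfield Planning Commission"],
    )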

The output is a ParsedMeeting dataclass with agenda_items, each of which has its own documents list. URLs in the agenda PDF are followed once (we fetch the linked PDFs and run them through the parser too), but only one level deep — otherwise a single agenda can fan out into hundreds of references.
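
Roughly, the output shape (simplified; the real dataclasses carry more provenance fields):

    from dataclasses import dataclass, field

    @dataclass
    class DocumentRef:
        url: str
        title: str
        text: str            # extracted text of the fetched PDF
        depth: int           # 0 = the agenda itself, 1 = a linked document

    @dataclass
    class AgendaItem:
        number: str          # e.g. "4.B", as numbered by the town
        title: str
        documents: list[DocumentRef] = field(default_factory=list)

    @dataclass
    class ParsedMeeting:
        jurisdiction: str
        meeting_date: str    # ISO date string
        agenda_items: list[AgendaItem] = field(default_factory=list)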

Annotate: deterministic NER before LLMs

Before we send anything to an LLM, we run a deterministic Named Entity Recognition pass: addresses via usaddress, personal names via probablepeople, parcel IDs by regex, and code section references against a FlashText gazetteer built from our zoning code index.

Every match becomes a span tied to a character offset in the source document. The spans are first-class data: they survive the LLM enrich step, and at load time they bind to real database entities (a parcel row, an application row). Annotation is the cheap, reliable substrate that keeps the expensive LLM step honest.
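
A sketch of the span representation and two of the passes, using flashtext's KeywordProcessor for code references and a plain regex for parcel IDs (the parcel pattern here is invented; real patterns vary by county):

    import re
    from dataclasses import dataclass
    from flashtext import KeywordProcessor

    @dataclass(frozen=True)
    class Span:
        kind: str      # "address" | "person" | "parcel" | "code_ref"
        start: int     # character offset into the source document text
        end: int
        text: str

    # Hypothetical parcel format; real patterns are per-county config.
    PARCEL_RE = re.compile(r"\b\d{3}-\d{2}-\d{3}\b")

    code_refs = KeywordProcessor(case_sensitive=False)
    code_refs.add_keyword("Section 4.2.1")  # built from the zoning code index

    def annotate_text(text: str) -> list[Span]:
        spans = [
            Span("parcel", m.start(), m.end(), m.group())
            for m in PARCEL_RE.finditer(text)
        ]
        # span_info=True makes flashtext return (keyword, start, end) tuples.
        for kw, start, end in code_refs.extract_keywords(text, span_info=True):
            spans.append(Span("code_ref", start, end, kw))
        return sorted(spans, key=lambda s: s.start)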

Enrich: the LLM has one job

Only after Annotate do we hand a staff report to an LLM, and we ask for exactly four things:

  • applicant_name
  • application_type
  • staff_recommendation
  • summary (2-3 sentences)

That's it. The LLM is not asked to extract entities. It is not asked to follow code references. It is not asked to opine on the merits. We've seen the failure modes that come from giving LLMs open-ended extraction prompts on legal text: they hallucinate confidently. Constraining the model to a small set of high-level summary fields, with the deterministic spans as context, gets us answers we can trust.
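
Concretely, the response schema is tiny. A sketch using Pydantic v2 to validate whatever the model returns (an assumption on our part, not necessarily our actual validation layer; the prompt and model call are elided, and the field set matches the list above):

    from pydantic import BaseModel, Field

    class StaffReportSummary(BaseModel):
        # The only four fields the LLM is ever asked for. Everything but
        # the summary is optional, because staff reports vary widely.
        applicant_name: str | None = None
        application_type: str | None = None
        staff_recommendation: str | None = None
        summary: str = Field(description="2-3 sentence summary")

    # The model's raw output is validated before anything downstream
    # sees it; a malformed response fails here, not in the loader.
    raw = '{"summary": "Staff recommends approval of the variance.", "application_type": "variance"}'
    result = StaffReportSummary.model_validate_json(raw)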

Resolve: the database has the truth

Now we have a ParsedMeeting decorated with extracted entities. Resolve matches each entity to a row in our database:

  • Parcels by street address (not parcel ID — those change across reseeds; addresses don't).
  • Applications by file number when the source carries one, then by fuzzy title match when it doesn't.
  • Jurisdictions by abbreviation (PC, BZA, HDC) then by name.

When a match is ambiguous, we don't guess. We attach a confidence score and let the loader decide whether to merge or create-new. Most resolution runs at >0.95 confidence; the long tail of close matches gets manual review.
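
As a sketch of how the fuzzy-title fallback and the don't-guess rule can fit together, here is a version using the standard library's difflib (names and threshold are illustrative):

    from dataclasses import dataclass
    from difflib import SequenceMatcher

    @dataclass
    class Resolution:
        row_id: int | None   # None means "create new or send to review"
        confidence: float

    def resolve_application(title: str, candidates: dict[int, str]) -> Resolution:
        """Match an extracted application title against known rows."""
        best_id, best_score = None, 0.0
        for row_id, known_title in candidates.items():
            score = SequenceMatcher(None, title.lower(), known_title.lower()).ratio()
            if score > best_score:
                best_id, best_score = row_id, score
        if best_score >= 0.95:
            return Resolution(best_id, best_score)
        # Ambiguous: attach the score and let the loader decide.
        return Resolution(None, best_score)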

Load: idempotent by design

The final stage emits a Plan JSON: a fully typed, idempotent description of every database write the meeting requires. Plans are content-addressed, and applying the same plan twice produces the same database state. This means we can re-run loads safely (we do, after every parser change), and we can ship plans across environments (parse locally, apply to prod).
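
Content addressing falls out of canonical serialization. A quick sketch (the plan fields shown are a stand-in, not the real format):

    import hashlib
    import json

    plan = {
        "meeting": {"jurisdiction": "springfield-pc", "date": "2025-05-12"},
        "writes": [
            {"op": "upsert", "table": "agenda_items",
             "key": {"meeting_date": "2025-05-12", "number": "4.B"},
             "values": {"title": "Variance request"}},
        ],
    }

    # Canonical serialization (sorted keys, no stray whitespace) means the
    # same logical plan always hashes to the same address.
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    plan_id = hashlib.sha256(canonical.encode()).hexdigest()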

The plan format is the thin contract between the parser pipeline and the production database. Everything upstream can change; as long as the plan stays valid, prod doesn't notice.

Why this shape

You could write a single 800-line script that does all five stages in one pass. We tried. The problem is that planning meetings are weird at every layer, and when something goes wrong you need to know which layer failed: did the PDF have a new format? Did the LLM hallucinate? Did the address match the wrong parcel? Splitting the pipeline along typed boundaries makes those questions answerable.

The second reason is that each stage produces a checkpoint we can persist. We don't re-parse a 200MB packet because someone wants to re-run the LLM with a different prompt. We don't re-LLM because we changed how addresses are normalized. Cheap stages are cheap, expensive stages run when their inputs change.
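
One way to get that behavior is to key each stage's persisted output by a hash of its input, so a stage re-runs only when its input actually changed. A minimal sketch of that idea (the on-disk layout is illustrative):

    import hashlib
    from pathlib import Path
    from typing import Callable

    CACHE = Path("checkpoints")

    def run_stage(name: str, stage: Callable[[str], str], payload: str) -> str:
        # Key the checkpoint by a hash of the stage's input: unchanged
        # input means a cache hit, changed input means a recompute.
        key = hashlib.sha256(payload.encode()).hexdigest()
        out = CACHE / name / f"{key}.json"
        if out.exists():
            return out.read_text()
        result = stage(payload)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(result)
        return result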

We'll talk about the load step itself, and what "applying a plan to prod" actually means, in a future post.