---
title: "feat: DOM-to-JSON content extraction for static snapshot"
type: feat
status: completed
date: 2026-03-04
---

# feat: DOM-to-JSON content extraction for static snapshot

## Overview

Create a first-stage extraction workflow that converts the existing HTML snapshot into a nested JSON content file. This plan is intentionally limited to extraction and content mapping; it does not include building the WYSIWYG editor yet.

## Problem Statement / Motivation

Content updates are currently tied to manual HTML edits. A JSON representation is needed so text and selected image properties can be adapted more easily and later edited through an interface.

## Proposed Solution

Build a deterministic DOM-to-JSON extraction flow for `ikfreunde.com.html` that captures visible text, selected metadata, and image fields (`src`, `alt`). The JSON structure should be DOM-first with section-based top-level subtopics, matching the brainstorm decisions and keeping context for editors.

Duplicate text handling should follow the agreed hybrid policy: keep section-local duplicates; dedupe only clearly global/common items.

## Scope

In scope:

- Extract visible text content from page sections
- Extract metadata: `title`, `description`, Open Graph, Twitter
- Extract image fields: `img src`, `img alt`
- Produce nested JSON output aligned with DOM sections
- Define a stable content identity strategy (reuse existing selectors/IDs; add `data-*` attributes only when needed)

Out of scope:

- WYSIWYG editing UI
- Styling/layout editing
- Full responsive image source editing (`srcset`, `picture`)
- Full bidirectional sync mechanics

## Technical Considerations

- The repository is a static snapshot with bundled/minified assets; there is no existing i18n framework.
- Extraction rules must avoid pulling non-content technical strings from scripts/styles.
- Section mapping should remain stable even if content text changes.
- Output should be deterministic so repeated runs produce predictable key ordering/paths.
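The script/style exclusion rule above can be sketched with Python's stdlib `html.parser`. This is a minimal illustration, not the agreed implementation: the `VisibleTextExtractor` name and the skip-tag set are assumptions, and a real run might use a fuller DOM library instead.

```python
from html.parser import HTMLParser

# Tags whose contents are never user-facing text (assumed set).
SKIP_TAGS = {"script", "style", "noscript", "template"}

class VisibleTextExtractor(HTMLParser):
    """Collects visible text nodes, ignoring script/style content
    and whitespace-only nodes."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside a skipped subtree
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.texts.append(text)

extractor = VisibleTextExtractor()
extractor.feed("<body><h1>Hi</h1><script>var x=1;</script><p> </p><p>Welcome</p></body>")
# extractor.texts now holds only the visible strings: "Hi", "Welcome"
```

The same depth-counter pattern extends naturally to hidden cookie/modal containers once the section boundaries are agreed.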
## SpecFlow Analysis

Primary flow:

1. Input snapshot HTML is parsed.
2. Eligible text nodes and target attributes are identified.
3. Content is grouped by top-level page sections.
4. Metadata and image fields are merged into the same JSON tree.
5. Output JSON is written.

Edge cases to cover:

- Empty or whitespace-only nodes
- Repeated text across sections
- Links/buttons with nested elements
- Missing `alt` attributes
- Cookie/modal/footer content that may be conditionally visible

## Acceptance Criteria

- [x] A single extraction run generates one nested JSON file from `ikfreunde.com.html`.
- [x] JSON includes visible page text grouped by section subtopics.
- [x] JSON includes `title`, `description`, Open Graph, and Twitter metadata values.
- [x] JSON includes `img src` and `img alt` values where present.
- [x] Duplicate policy is applied: section-local duplicates kept; global/common duplicates deduped.
- [x] Extraction excludes JS/CSS artifacts and non-content noise.
- [x] Re-running extraction on unchanged input produces a stable output structure.

## Success Metrics

- Editors can locate and update target strings in JSON without editing HTML directly.
- JSON organization is understandable by section/context without reverse-engineering selectors.
- No unintended layout/content regressions in the source HTML (read-only extraction phase).
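Steps 3–5 of the primary flow above (group by section, merge metadata and images, write deterministically) can be sketched as follows. The tree shape (`meta`/`sections` keys) and helper names are illustrative assumptions; the final schema is still open.

```python
import json

def build_content_tree(sections, metadata, images):
    """Merge section texts, page metadata, and per-section image
    fields into one nested tree (hypothetical shape)."""
    return {
        "meta": metadata,
        "sections": {
            name: {"texts": texts, "images": images.get(name, [])}
            for name, texts in sections.items()
        },
    }

def dump_deterministic(tree):
    """Serialize with sorted keys so re-runs on unchanged input
    yield byte-identical output (the determinism requirement)."""
    return json.dumps(tree, ensure_ascii=False, indent=2, sort_keys=True)

tree = build_content_tree(
    sections={"hero": ["Willkommen"], "footer": ["Impressum"]},
    metadata={"title": "IK Freunde", "og:title": "IK Freunde"},
    images={"hero": [{"src": "logo.png", "alt": "Logo"}]},
)
out = dump_deterministic(tree)
```

`sort_keys=True` handles key ordering; path stability still depends on the content identity strategy from the Scope section.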
## Dependencies & Risks

Dependencies:

- Final agreement on section boundaries for grouping
- Final output file location/name convention

Risks:

- Over-extraction of non-user-facing strings
- Unstable keys if the selector strategy is inconsistent
- Ambiguity around "global/common" duplicate classification

Mitigations:

- Explicit extraction allowlist for elements/attributes
- Deterministic key-generation policy
- Documented duplicate decision rules with examples

## References & Research

- Brainstorm: `docs/brainstorms/2026-03-03-dom-json-wysiwyg-sync-brainstorm.md`
- Source snapshot: `ikfreunde.com.html`
- Existing site bundle references: `ikfreunde.com_files/*`
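The documented duplicate decision rules could start from a sketch like this one for the hybrid policy. The cross-section threshold is an illustrative assumption, not the agreed classification rule, and the function name is hypothetical.

```python
from collections import Counter

def classify_duplicates(section_texts, global_threshold=3):
    """Hybrid duplicate policy sketch: a string appearing in at least
    `global_threshold` distinct sections is treated as global/common
    and moved to a shared bucket; everything else, including repeats
    within one section, stays section-local."""
    counts = Counter()
    for texts in section_texts.values():
        for t in set(texts):  # count distinct sections, not occurrences
            counts[t] += 1
    common = {t for t, n in counts.items() if n >= global_threshold}
    local = {
        name: [t for t in texts if t not in common]
        for name, texts in section_texts.items()
    }
    return sorted(common), local

# "Mehr erfahren" appears in three sections -> global; "A1" stays local.
common, local = classify_duplicates({
    "a": ["Mehr erfahren", "A1"],
    "b": ["Mehr erfahren"],
    "c": ["Mehr erfahren", "A1"],
})
```

Keeping the threshold and any tag-based exceptions (e.g. button labels) in the documented rules avoids the "global/common" ambiguity listed under Risks.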