| title | type | status | date |
|---|---|---|---|
| feat: DOM-to-JSON content extraction for static snapshot | feat | completed | 2026-03-04 |
# feat: DOM-to-JSON content extraction for static snapshot

## Overview
Create a first-stage extraction workflow that converts the existing HTML snapshot into a nested JSON content file. This plan is intentionally limited to extraction and content mapping. It does not include building the WYSIWYG editor yet.
## Problem Statement / Motivation
Content updates are currently tied to manual HTML edits. A JSON representation is needed so text and selected image properties can be adapted more easily and later edited through an interface.
## Proposed Solution
Build a deterministic DOM-to-JSON extraction flow for ikfreunde.com.html that captures visible text, selected metadata, and image fields (src, alt).
The JSON structure should be DOM-first with section-based top-level subtopics, matching the brainstorm decisions and keeping context for editors. Duplicate text handling should follow the agreed hybrid policy: keep section-local duplicates; dedupe only clearly global/common items.
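The hybrid duplicate policy can be sketched in Python. This is a minimal illustration, not the plan's implementation; the `GLOBAL_COMMON` allowlist and its entries are hypothetical examples of "clearly global/common" strings:

```python
# Hybrid duplicate policy sketch: keep section-local duplicates,
# dedupe only strings on an explicit "global/common" allowlist.
GLOBAL_COMMON = {"Kontakt", "Mehr erfahren"}  # hypothetical entries

def apply_duplicate_policy(sections):
    """sections: {section_id: [text, ...]} -> tree with a shared _common bucket."""
    result = {"_common": [], "sections": {}}
    seen_common = set()
    for section_id, texts in sections.items():
        kept = []
        for text in texts:
            if text in GLOBAL_COMMON:
                if text not in seen_common:  # global strings stored once
                    seen_common.add(text)
                    result["_common"].append(text)
            else:
                kept.append(text)  # section-local duplicates are preserved
        result["sections"][section_id] = kept
    return result
```

Strings repeated within or across sections stay in their sections for editor context; only allowlisted global strings collapse into the shared bucket.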
## Scope
In scope:
- Extract visible text content from page sections
- Extract metadata: `title`, `description`, Open Graph, Twitter
- Extract image fields: `img src`, `img alt`
- Produce nested JSON output aligned with DOM sections
- Define stable content identity strategy (reuse existing selectors/IDs; add `data-*` only when needed)
Out of scope:
- WYSIWYG editing UI
- Styling/layout editing
- Full responsive image source editing (`srcset`, `picture`)
- Full bidirectional sync mechanics
## Technical Considerations
- The repository is a static snapshot with bundled/minified assets; there is no existing i18n framework.
- Extraction rules must avoid pulling non-content technical strings from scripts/styles.
- Section mapping should remain stable even if content text changes.
- Output should be deterministic so repeated runs produce predictable key ordering/paths.
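Deterministic output can be enforced at serialization time with sorted keys and fixed formatting. A minimal sketch (the `content.json` file name is an assumption; the final location/name convention is still open per Dependencies below):

```python
import json

def write_content_json(tree, path="content.json"):
    """Serialize the extracted content tree deterministically:
    sorted keys, fixed indentation, UTF-8 without ASCII escaping."""
    text = json.dumps(tree, sort_keys=True, indent=2, ensure_ascii=False)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text + "\n")
    return text
```

With `sort_keys=True`, two runs that produce the same content tree emit byte-identical files regardless of dict insertion order.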
## SpecFlow Analysis
Primary flow:
- Input snapshot HTML is parsed.
- Eligible text nodes and target attributes are identified.
- Content is grouped by top-level page sections.
- Metadata and image fields are merged into the same JSON tree.
- Output JSON is written.
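The primary flow above can be sketched with only the Python standard library. Section detection via top-level `<section id=...>` elements is an assumption (nested sections are not handled here, and the real snapshot's section boundaries are still to be agreed); it also shows the script/style exclusion and whitespace-node handling:

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript", "template"}

class SectionTextExtractor(HTMLParser):
    """Groups visible text and image fields by <section id=...> blocks."""
    def __init__(self):
        super().__init__()
        self.tree = {}
        self._section = "_page"   # fallback bucket for unsectioned content
        self._skip_depth = 0      # >0 while inside script/style/etc.

    def _bucket(self):
        return self.tree.setdefault(self._section, {"text": [], "images": []})

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "section" and attrs.get("id"):
            self._section = attrs["id"]
        elif tag == "img":
            # alt may legitimately be missing (edge case below)
            self._bucket()["images"].append(
                {"src": attrs.get("src"), "alt": attrs.get("alt")})

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "section":
            self._section = "_page"

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:  # drop whitespace-only nodes
            self._bucket()["text"].append(text)

def extract(html):
    parser = SectionTextExtractor()
    parser.feed(html)
    return parser.tree
```

A real implementation would likely use a full DOM library for robustness, but the grouping logic stays the same.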
Edge cases to cover:
- Empty or whitespace-only nodes
- Repeated text across sections
- Links/buttons with nested elements
- Missing `alt` attributes
- Cookie/modal/footer content that may be conditionally visible
## Acceptance Criteria
- A single extraction run generates one nested JSON file from `ikfreunde.com.html`.
- JSON includes visible page text grouped by section subtopics.
- JSON includes `title`, `description`, Open Graph, and Twitter metadata values.
- JSON includes `img src` and `img alt` values where present.
- Duplicate policy is applied: section-local duplicates kept; global/common duplicates deduped.
- Extraction excludes JS/CSS artifacts and non-content noise.
- Re-running extraction on unchanged input produces stable output structure.
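The metadata criteria above (`title`, `description`, Open Graph, Twitter) can also be covered with the standard-library parser; the flat key naming (`og:*`, `twitter:*`) is an assumption about the output shape:

```python
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    """Collects <title> and description/Open Graph/Twitter <meta> values."""
    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph uses property=, Twitter and description use name=
            key = attrs.get("name") or attrs.get("property") or ""
            if key == "description" or key.startswith(("og:", "twitter:")):
                self.meta[key] = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = data.strip()

def extract_metadata(html):
    parser = MetadataExtractor()
    parser.feed(html)
    return parser.meta
```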
## Success Metrics
- Editors can locate and update target strings in JSON without editing HTML directly.
- JSON organization is understandable by section/context without reverse-engineering selectors.
- No unintended layout/content regressions in source HTML (read-only extraction phase).
## Dependencies & Risks
Dependencies:
- Final agreement on section boundaries for grouping
- Final output file location/name convention
Risks:
- Over-extraction of non-user-facing strings
- Unstable keys if selector strategy is inconsistent
- Ambiguity around “global/common” duplicate classification
Mitigations:
- Explicit extraction allowlist for elements/attributes
- Deterministic key-generation policy
- Documented duplicate decision rules with examples
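The allowlist mitigation could take the shape of an explicit element/attribute table; the entries below are illustrative, not the agreed set:

```python
# Illustrative extraction allowlist: only these elements/attributes are
# eligible for extraction; everything else is treated as non-content noise.
EXTRACT_ALLOWLIST = {
    "text_tags": {"h1", "h2", "h3", "p", "li", "a", "button", "figcaption"},
    "attr_fields": {
        "img": ("src", "alt"),
        "meta": ("name", "property", "content"),
    },
}

def is_extractable_text(tag):
    """True if visible text inside this element should be extracted."""
    return tag in EXTRACT_ALLOWLIST["text_tags"]
```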
## References & Research
- Brainstorm: `docs/brainstorms/2026-03-03-dom-json-wysiwyg-sync-brainstorm.md`
- Source snapshot: `ikfreunde.com.html`
- Existing site bundle references: `ikfreunde.com_files/*`