---
title: "feat: DOM-to-JSON content extraction for static snapshot"
type: feat
status: completed
date: 2026-03-04
---

# feat: DOM-to-JSON content extraction for static snapshot

## Overview
Create a first-stage extraction workflow that converts the existing HTML snapshot into a nested JSON content file. This plan is intentionally limited to extraction and content mapping. It does not include building the WYSIWYG editor yet.

## Problem Statement / Motivation
Content updates are currently tied to manual HTML edits. A JSON representation is needed so text and selected image properties can be adapted more easily and later edited through an interface.

## Proposed Solution
Build a deterministic DOM-to-JSON extraction flow for `ikfreunde.com.html` that captures visible text, selected metadata, and image fields (`src`, `alt`).

The JSON structure should be DOM-first with section-based top-level subtopics, matching the brainstorm decisions and keeping context for editors. Duplicate text handling should follow the agreed hybrid policy: keep section-local duplicates; dedupe only clearly global/common items.
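
One possible reading of the hybrid duplicate policy, as a minimal sketch. The function name `apply_duplicate_policy`, the `common` bucket, and the `min_sections` threshold are hypothetical and are not taken from the brainstorm; the agreed rule for what counts as "clearly global/common" may differ.

```python
from collections import Counter


def apply_duplicate_policy(sections, min_sections=3):
    """Move strings that appear in many sections into a shared bucket.

    `sections` maps a section name to its ordered list of strings.
    `min_sections` is an assumed threshold for "clearly global/common".
    """
    # Count in how many distinct sections each string appears.
    spread = Counter()
    for texts in sections.values():
        spread.update(set(texts))

    common = {text for text, count in spread.items() if count >= min_sections}

    result = {"common": []}
    for name, texts in sections.items():
        kept = []
        for text in texts:
            if text in common:
                # Record a global string once, in first-seen order.
                if text not in result["common"]:
                    result["common"].append(text)
            else:
                # Section-local duplicates are kept untouched.
                kept.append(text)
        result[name] = kept
    return result
```

With this reading, a string repeated inside one section, or shared by only two sections, stays where it is; only strings that reach the threshold are lifted into `common`.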

## Scope
In scope:
- Extract visible text content from page sections
- Extract metadata: `title`, `description`, Open Graph, Twitter
- Extract image fields: `img src`, `img alt`
- Produce nested JSON output aligned with DOM sections
- Define stable content identity strategy (reuse existing selectors/IDs; add `data-*` only when needed)
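
The metadata item above could be covered with the standard-library `html.parser` module. This is a sketch under assumptions: `MetadataParser`, `extract_metadata`, and the output shape (`og` and `twitter` sub-objects) are illustrative names, not the plan's agreed structure.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect <title>, description, Open Graph, and Twitter metadata."""

    def __init__(self):
        super().__init__()
        self.metadata = {"title": None, "description": None, "og": {}, "twitter": {}}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            # Only the first <title> is read; later ones (e.g. inside SVG) are ignored.
            self._in_title = self.metadata["title"] is None
            return
        if tag != "meta":
            return
        attr = dict(attrs)
        # Open Graph tags conventionally use `property`; others use `name`.
        key = attr.get("property") or attr.get("name") or ""
        content = attr.get("content")
        if content is None:
            return
        if key == "description":
            self.metadata["description"] = content
        elif key.startswith("og:"):
            self.metadata["og"][key[len("og:"):]] = content
        elif key.startswith("twitter:"):
            self.metadata["twitter"][key[len("twitter:"):]] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = (self.metadata["title"] or "") + data


def extract_metadata(html):
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata
```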

Out of scope:
- WYSIWYG editing UI
- Styling/layout editing
- Full responsive image source editing (`srcset`, `picture`)
- Full bidirectional sync mechanics

## Technical Considerations
- The repository is a static snapshot with bundled/minified assets; there is no existing i18n framework.
- Extraction rules must avoid pulling non-content technical strings from scripts/styles.
- Section mapping should remain stable even if content text changes.
- Output should be deterministic so repeated runs produce predictable key ordering/paths.
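
Two of the considerations above, skipping script/style text and producing deterministic output, can be illustrated together. This is a sketch, not the implemented extractor: the `SKIPPED_TAGS` set and the function names are assumptions, and relying on insertion (DOM) order rather than sorted keys is one possible way to keep ordering predictable.

```python
import json
from html.parser import HTMLParser

# Assumed denylist of elements whose text is never editor-facing content.
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class VisibleTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse whitespace runs and drop whitespace-only nodes.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.texts.append(text)


def extract_visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def to_stable_json(content):
    # Dicts keep insertion order, so keys follow DOM order on every run.
    return json.dumps(content, ensure_ascii=False, indent=2) + "\n"
```

Fixed indentation and a trailing newline keep diffs between runs limited to real content changes.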

## SpecFlow Analysis
Primary flow:
1. Input snapshot HTML is parsed.
2. Eligible text nodes and target attributes are identified.
3. Content is grouped by top-level page sections.
4. Metadata and image fields are merged into the same JSON tree.
5. Output JSON is written.
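
The primary flow can be sketched end to end as follows. Section boundaries are still listed as an open dependency in this plan, so `SECTION_TAGS`, the `page` fallback bucket, and the `tag-index` fallback names are assumptions; metadata merging is left out to keep the example short.

```python
import json
from html.parser import HTMLParser

SECTION_TAGS = {"header", "nav", "section", "footer"}  # assumed boundaries
SKIPPED_TAGS = {"script", "style", "noscript"}


class SectionParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tree = {}        # section name -> {"text": [...], "images": [...]}
        self._stack = []      # open section names; index 0 is the top level
        self._counts = {}     # per-tag counter for fallback names
        self._skip_depth = 0

    def _current(self):
        # Content outside any section lands in a generic "page" bucket.
        name = self._stack[0] if self._stack else "page"
        return self.tree.setdefault(name, {"text": [], "images": []})

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in SECTION_TAGS:
            index = self._counts.get(tag, 0)
            self._counts[tag] = index + 1
            # Prefer an existing id; fall back to tag plus running index.
            self._stack.append(attr.get("id") or f"{tag}-{index}")
        elif tag == "img":
            self._current()["images"].append(
                {"src": attr.get("src"), "alt": attr.get("alt")}
            )

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in SECTION_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self._current()["text"].append(text)


def build_content_tree(html):
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return parser.tree


def write_content_json(html, path):
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_content_tree(html), handle, ensure_ascii=False, indent=2)
```

Nested sections are folded into their top-level ancestor here, which keeps the first JSON level aligned with page sections.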

Edge cases to cover:
- Empty or whitespace-only nodes
- Repeated text across sections
- Links/buttons with nested elements
- Missing `alt` attributes
- Cookie/modal/footer content that may be conditionally visible
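
Three of the edge cases above (whitespace-only nodes, nested link/button markup, missing `alt`) might be handled as in this sketch. The names `EdgeCaseParser`, `parse_edge_cases`, and the `alt_missing` flag are hypothetical; conditionally visible content is not addressed here.

```python
from html.parser import HTMLParser

INTERACTIVE_TAGS = {"a", "button"}


class EdgeCaseParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.labels = []     # one string per link or button
        self.images = []
        self._open = None    # tag of the link/button being collected
        self._parts = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in INTERACTIVE_TAGS and self._open is None:
            self._open, self._parts = tag, []
        elif tag == "img":
            self.images.append({
                "src": attr.get("src"),
                # alt="" is kept as an empty string (commonly decorative);
                # a missing attribute is flagged explicitly instead.
                "alt": attr.get("alt"),
                "alt_missing": "alt" not in attr,
            })

    def handle_endtag(self, tag):
        if tag == self._open:
            # Join text from nested children, then normalize whitespace.
            label = " ".join("".join(self._parts).split())
            if label:  # skip empty or whitespace-only labels
                self.labels.append(label)
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._parts.append(data)


def parse_edge_cases(html):
    parser = EdgeCaseParser()
    parser.feed(html)
    parser.close()
    return parser.labels, parser.images
```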

## Acceptance Criteria
- [x] A single extraction run generates one nested JSON file from `ikfreunde.com.html`.
- [x] JSON includes visible page text grouped by section subtopics.
- [x] JSON includes `title`, `description`, Open Graph, and Twitter metadata values.
- [x] JSON includes `img src` and `img alt` values where present.
- [x] Duplicate policy is applied: section-local duplicates kept; global/common duplicates deduped.
- [x] Extraction excludes JS/CSS artifacts and non-content noise.
- [x] Re-running extraction on unchanged input produces stable output structure.
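
The last criterion could be checked mechanically by hashing the serialized output of repeated runs. A minimal sketch: `extract` stands for whatever extraction callable exists in the project, and both helper names are hypothetical.

```python
import hashlib
import json


def output_digest(content):
    """Hash the serialized JSON so two runs can be compared byte for byte."""
    payload = json.dumps(content, ensure_ascii=False, indent=2).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_stable(extract, html, runs=3):
    """Return True when repeated runs on the same input give identical output."""
    digests = {output_digest(extract(html)) for _ in range(runs)}
    return len(digests) == 1
```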

## Success Metrics
- Editors can locate and update target strings in JSON without editing HTML directly.
- JSON organization is understandable by section/context without reverse-engineering selectors.
- No unintended layout/content regressions in source HTML (read-only extraction phase).

## Dependencies & Risks
Dependencies:
- Final agreement on section boundaries for grouping
- Final output file location/name convention

Risks:
- Over-extraction of non-user-facing strings
- Unstable keys if selector strategy is inconsistent
- Ambiguity around “global/common” duplicate classification

Mitigations:
- Explicit extraction allowlist for elements/attributes
- Deterministic key-generation policy
- Documented duplicate decision rules with examples
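
The first two mitigations could look like the sketch below. Everything here is an assumption for illustration: the allowlist contents, the `data-content-id` attribute name, and the path format are not defined by this plan.

```python
# Assumed allowlist: only these attributes are ever extracted.
ATTRIBUTE_ALLOWLIST = {"img": ("src", "alt"), "meta": ("content",)}


def allowed_attributes(tag, attrs):
    """Keep only allowlisted attributes for the given element."""
    names = ATTRIBUTE_ALLOWLIST.get(tag, ())
    return {name: attrs[name] for name in names if name in attrs}


def make_key(tag, attrs, sibling_index):
    """Build a key from stable markers first, position last."""
    if attrs.get("id"):
        return f"#{attrs['id']}"
    if attrs.get("data-content-id"):  # hypothetical attribute name
        return f"[data-content-id={attrs['data-content-id']}]"
    return f"{tag}[{sibling_index}]"


def make_path(ancestors):
    """Join per-element keys into one path, restarting at the nearest id.

    `ancestors` is a list of (tag, attrs, sibling_index) tuples, outermost first.
    """
    keys = [make_key(tag, attrs, index) for tag, attrs, index in ancestors]
    # An id is unique in a valid document, so anything above it is redundant.
    for position in range(len(keys) - 1, -1, -1):
        if keys[position].startswith("#"):
            keys = keys[position:]
            break
    return "/".join(keys)
```

Because no part of the key is derived from the text itself, editing a string does not change its path.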

## References & Research
- Brainstorm: `docs/brainstorms/2026-03-03-dom-json-wysiwyg-sync-brainstorm.md`
- Source snapshot: `ikfreunde.com.html`
- Existing site bundle references: `ikfreunde.com_files/*`