---
title: "feat: DOM-to-JSON content extraction for static snapshot"
type: feat
status: completed
date: 2026-03-04
---
# feat: DOM-to-JSON content extraction for static snapshot
## Overview
Create a first-stage extraction workflow that converts the existing HTML snapshot into a nested JSON content file. This plan is intentionally limited to extraction and content mapping. It does not include building the WYSIWYG editor yet.
## Problem Statement / Motivation
Content updates currently require manual edits to the HTML. A JSON representation would let editors change text and selected image properties without touching markup, and could later serve as the data source for an editing interface.
## Proposed Solution
Build a deterministic DOM-to-JSON extraction flow for `web4beginners.com.html` that captures visible text, selected metadata, and image fields (`src`, `alt`).
The JSON structure should be DOM-first, with one top-level subtopic per page section, matching the brainstorm decisions and preserving context for editors. Duplicate text follows the agreed hybrid policy: keep section-local duplicates and dedupe only items that are clearly global/common.
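
The hybrid duplicate policy could be sketched roughly as below. This is an illustration under assumptions, not the agreed rule set: the function name `apply_duplicate_policy`, the `global`/`sections` output keys, and the idea of a hand-maintained `global_strings` set are all hypothetical.

```python
from typing import Dict, List, Set


def apply_duplicate_policy(
    sections: Dict[str, List[str]],
    global_strings: Set[str],
) -> dict:
    """Keep section-local duplicates; lift known global strings out once.

    `global_strings` is an assumed, hand-maintained set of texts that are
    considered clearly global/common (for example a cookie banner label).
    """
    shared: List[str] = []
    result: Dict[str, List[str]] = {}
    for name, texts in sections.items():
        kept: List[str] = []
        for text in texts:
            if text in global_strings:
                # Record each global string once, in first-seen order.
                if text not in shared:
                    shared.append(text)
            else:
                # Everything else stays in its section, repeats included.
                kept.append(text)
        result[name] = kept
    return {"global": shared, "sections": result}
```

Using an explicit set keeps the classification reviewable: a string is only deduplicated when someone has listed it as global.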
## Scope
In scope:
- Extract visible text content from page sections
- Extract metadata: `title`, `description`, Open Graph, Twitter
- Extract image fields: `img src`, `img alt`
- Produce nested JSON output aligned with DOM sections
- Define stable content identity strategy (reuse existing selectors/IDs; add `data-*` only when needed)
Out of scope:
- WYSIWYG editing UI
- Styling/layout editing
- Full responsive image source editing (`srcset`, `picture`)
- Full bidirectional sync mechanics
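
As an illustration of the in-scope metadata and image fields, here is a minimal sketch using only Python's standard-library `html.parser`. The class name `MetaAndImageParser` and the output shape are assumptions; the plan does not prescribe a parser or a language.

```python
from html.parser import HTMLParser


class MetaAndImageParser(HTMLParser):
    """Collect <title>, selected <meta> values, and <img> src/alt pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.images = []
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph tags use `property`; description/Twitter use `name`.
            key = attr.get("property") or attr.get("name")
            if key and (key == "description" or key.startswith(("og:", "twitter:"))):
                self.meta[key] = attr.get("content") or ""
        elif tag == "img":
            # A missing alt stays None so it can be told apart from alt="".
            self.images.append({"src": attr.get("src"), "alt": attr.get("alt")})

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            self.meta["title"] = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)
```

Other `<meta>` tags (for example `viewport`) are ignored, and `srcset`/`picture` sources are deliberately not read, in line with the out-of-scope list.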
## Technical Considerations
- The repository is a static snapshot with bundled/minified assets; there is no existing i18n framework.
- Extraction rules must avoid pulling non-content technical strings from scripts/styles.
- Section mapping should remain stable even if content text changes.
- Output should be deterministic so repeated runs produce predictable key ordering/paths.
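
One way to keep technical strings out of the output is to ignore all text while the parser is inside a non-content element. The sketch below assumes a skip list of `script`, `style`, `noscript`, and `template`; that list and the class name `VisibleTextParser` are illustrative choices, not part of the plan.

```python
from html.parser import HTMLParser

# Assumed skip list: elements whose text is never editor-facing content.
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class VisibleTextParser(HTMLParser):
    """Collect whitespace-normalized text outside of skipped elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop whitespace-only nodes.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.texts.append(text)
```

Text is collected in document order, so the same input always yields the same list.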
## SpecFlow Analysis
Primary flow:
1. Input snapshot HTML is parsed.
2. Eligible text nodes and target attributes are identified.
3. Content is grouped by top-level page sections.
4. Metadata and image fields are merged into the same JSON tree.
5. Output JSON is written.
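
The primary flow could look roughly like this. It is a sketch under assumptions: section boundaries are still listed as an open dependency, so treating outermost `<header>`, `<section>`, and `<footer>` elements as sections is a placeholder, and the names `SectionGrouper` and `extract` are hypothetical.

```python
import json
from html.parser import HTMLParser
from typing import Optional

SECTION_TAGS = {"header", "section", "footer"}  # assumed section boundaries
SKIPPED_TAGS = {"script", "style"}


class SectionGrouper(HTMLParser):
    """Group text and images under the outermost enclosing section."""

    def __init__(self):
        super().__init__()
        self.sections = {}
        self._open = []        # stack of currently open section tags
        self._current = None   # key of the outermost open section
        self._skip = 0
        self._count = 0

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in SKIPPED_TAGS:
            self._skip += 1
        elif tag in SECTION_TAGS:
            if not self._open:
                # Prefer an existing id; fall back to a positional key.
                self._count += 1
                self._current = attr.get("id") or f"{tag}-{self._count}"
                self.sections[self._current] = {"text": [], "images": []}
            self._open.append(tag)
        elif tag == "img" and self._current is not None:
            self.sections[self._current]["images"].append(
                {"src": attr.get("src"), "alt": attr.get("alt")}
            )

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip:
            self._skip -= 1
        elif tag in SECTION_TAGS and self._open:
            self._open.pop()
            if not self._open:
                self._current = None

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip and self._current is not None:
            self.sections[self._current]["text"].append(text)


def extract(html: str, meta: Optional[dict] = None) -> str:
    """Parse, group by section, merge metadata, and serialize to JSON."""
    parser = SectionGrouper()
    parser.feed(html)
    parser.close()
    tree = {"meta": meta or {}, "sections": parser.sections}
    return json.dumps(tree, indent=2, ensure_ascii=False)
```

In this sketch, text outside any assumed section is dropped and duplicate `id` values would overwrite each other; a real run would need a decision on both. The returned string can be written to the output file once its location is agreed.
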
Edge cases to cover:
- Empty or whitespace-only nodes
- Repeated text across sections
- Links/buttons with nested elements
- Missing `alt` attributes
- Cookie/modal/footer content that may be conditionally visible
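
Two of these edge cases, whitespace-only nodes and links/buttons with nested elements, can be handled with a small normalization step. The following is a sketch only; `clean_text` and `LabelParser` are hypothetical names, and flattening each link or button into a single label is an assumed policy.

```python
from html.parser import HTMLParser
from typing import Optional


def clean_text(raw: str) -> Optional[str]:
    """Collapse whitespace; return None for empty or whitespace-only text."""
    text = " ".join(raw.split())
    return text or None


class LabelParser(HTMLParser):
    """Collect one flattened label per <a>/<button>, nested markup included."""

    LABEL_TAGS = {"a", "button"}

    def __init__(self):
        super().__init__()
        self.labels = []
        self._depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.LABEL_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.LABEL_TAGS and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the raw pieces first so inner spacing is preserved,
                # then normalize the whole label once.
                label = clean_text("".join(self._parts))
                if label is not None:
                    self.labels.append(label)
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)
```

Conditionally visible content (cookie banners, modals) is not addressed here, because visibility cannot be determined from static markup alone.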
## Acceptance Criteria
- [x] A single extraction run generates one nested JSON file from `web4beginners.com.html`.
- [x] JSON includes visible page text grouped by section subtopics.
- [x] JSON includes `title`, `description`, Open Graph, and Twitter metadata values.
- [x] JSON includes `img src` and `img alt` values where present.
- [x] Duplicate policy is applied: section-local duplicates kept; global/common duplicates deduped.
- [x] Extraction excludes JS/CSS artifacts and non-content noise.
- [x] Re-running extraction on unchanged input produces stable output structure.
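
The last criterion can be checked mechanically by running the extraction several times and comparing digests of the serialized output. A minimal sketch follows, assuming the extraction is available as a function from HTML text to a JSON string; `digest` and `is_stable` are hypothetical helper names.

```python
import hashlib
from typing import Callable


def digest(text: str) -> str:
    """Return a SHA-256 hex digest of the serialized output."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_stable(extract: Callable[[str], str], html: str, runs: int = 3) -> bool:
    """True when every run over the same input yields identical bytes."""
    digests = {digest(extract(html)) for _ in range(runs)}
    return len(digests) == 1
```

This only compares runs within one process. Comparing the output of separate process invocations is a stronger check, since some sources of nondeterminism (such as iteration order over a set of strings) can differ between processes.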
## Success Metrics
- Editors can locate and update target strings in JSON without editing HTML directly.
- JSON organization is understandable by section/context without reverse-engineering selectors.
- The source HTML is left unmodified (the extraction phase is read-only), so no layout or content regressions are introduced.
## Dependencies & Risks
Dependencies:
- Final agreement on section boundaries for grouping
- Final output file location/name convention
Risks:
- Over-extraction of non-user-facing strings
- Unstable keys if selector strategy is inconsistent
- Ambiguity around “global/common” duplicate classification
Mitigations:
- Explicit extraction allowlist for elements/attributes
- Deterministic key-generation policy
- Documented duplicate decision rules with examples
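
A deterministic key-generation policy might look like the sketch below. The key format (`section.element`), the helper names `slugify` and `make_key`, and the positional fallback are assumptions for illustration. Keys are derived from structure only, never from the text itself, so they survive content edits.

```python
import re
from typing import Optional


def slugify(value: str) -> str:
    """Lowercase and reduce to [a-z0-9-]; never returns an empty string."""
    slug = re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")
    return slug or "unnamed"


def make_key(
    section: str,
    tag: str,
    index: int,
    element_id: Optional[str] = None,
) -> str:
    """Build a content key, preferring an existing id over position."""
    if element_id:
        # Reuse existing ids where present: stable even if order changes.
        return f"{slugify(section)}.{slugify(element_id)}"
    # Positional fallback: stable only while element order is unchanged.
    return f"{slugify(section)}.{tag.lower()}-{index}"
```

The positional fallback is the fragile part: inserting an element shifts later indexes, which is one reason to add a `data-*` attribute where no usable id exists.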
## References & Research
- Brainstorm: `docs/brainstorms/2026-03-03-dom-json-wysiwyg-sync-brainstorm.md`
- Source snapshot: `web4beginners.com.html`
- Existing site bundle references: `web4beginners.com_files/*`