interkollektives-micro-website/docs/plans/2026-03-04-feat-dom-to-json-content-extraction-plan.md

title: feat: DOM-to-JSON content extraction for static snapshot
type: feat
status: completed
date: 2026-03-04

feat: DOM-to-JSON content extraction for static snapshot

Overview

Create a first-stage extraction workflow that converts the existing HTML snapshot into a nested JSON content file. This plan is intentionally limited to extraction and content mapping; building the WYSIWYG editor is deferred to a later stage.

Problem Statement / Motivation

Content updates are currently tied to manual HTML edits. A JSON representation is needed so text and selected image properties can be adapted more easily and later edited through an interface.

Proposed Solution

Build a deterministic DOM-to-JSON extraction flow for web4beginners.com.html that captures visible text, selected metadata, and image fields (src, alt).

The JSON structure should be DOM-first: top-level keys correspond to page sections (the section-based subtopics), which matches the brainstorm decisions and keeps context for editors. Duplicate text handling should follow the agreed hybrid policy: duplicates within a section are kept, and only clearly global/common items are deduplicated.
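The hybrid duplicate policy could be sketched roughly as follows. This is only an illustration: the function name, the `min_sections` threshold, and the sample strings are assumptions, since the plan does not yet define what counts as "global/common".

```python
from collections import defaultdict


def apply_duplicate_policy(sections, min_sections=3):
    """Lift strings that occur in many sections into a shared 'common' list.

    `sections` maps a section name to its list of extracted strings.
    `min_sections` is an assumed threshold, not a rule taken from the plan.
    """
    seen_in = defaultdict(set)
    for name, texts in sections.items():
        for text in texts:
            seen_in[text].add(name)

    # Sorted so the 'common' list has a predictable order.
    common = sorted(t for t, names in seen_in.items() if len(names) >= min_sections)
    common_set = set(common)

    result = {"common": common, "sections": {}}
    for name, texts in sections.items():
        # Section-local duplicates stay untouched; only global strings are removed.
        result["sections"][name] = [t for t in texts if t not in common_set]
    return result


# Illustrative input only; these strings are not taken from the snapshot.
example = {
    "hero": ["Learn the web", "Get started"],
    "courses": ["Read more", "HTML basics", "Read more", "Get started"],
    "footer": ["Imprint", "Get started"],
}
deduped = apply_duplicate_policy(example)
```

Here "Get started" appears in three sections and moves to `common`, while the repeated "Read more" stays in `courses` because it is local to that section.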

Scope

In scope:

  • Extract visible text content from page sections
  • Extract metadata: title, description, Open Graph, Twitter
  • Extract image fields: img src, img alt
  • Produce nested JSON output aligned with DOM sections
  • Define stable content identity strategy (reuse existing selectors/IDs; add data-* only when needed)
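A minimal sketch of the metadata and image items above, using only Python's standard-library `html.parser`. The class name and the sample markup are hypothetical, and the plan does not prescribe a parser; this merely shows one way the in-scope fields could be collected.

```python
from html.parser import HTMLParser


class MetaAndImageParser(HTMLParser):
    """Collects <title>, selected <meta> tags, and <img> src/alt values."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph tags use `property`; description and Twitter tags use `name`.
            key = attrs.get("property") or attrs.get("name") or ""
            if key == "description" or key.startswith(("og:", "twitter:")):
                self.meta[key] = attrs.get("content", "")
        elif tag == "img":
            # A missing alt is stored as None so it differs from an empty alt="".
            self.images.append({"src": attrs.get("src"), "alt": attrs.get("alt")})

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = self.meta.get("title", "") + data


# Hypothetical markup for illustration, not an excerpt of the real snapshot.
sample = """<html><head><title>Web for Beginners</title>
<meta name="description" content="Intro course">
<meta property="og:title" content="Web for Beginners">
<meta name="twitter:card" content="summary">
<meta name="viewport" content="width=device-width">
</head><body><img src="a.png" alt="Logo"><img src="b.png"></body></html>"""

parser = MetaAndImageParser()
parser.feed(sample)
parser.close()
```

The `viewport` tag is ignored because it is not one of the metadata groups listed in scope.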

Out of scope:

  • WYSIWYG editing UI
  • Styling/layout editing
  • Full responsive image source editing (srcset, picture)
  • Full bidirectional sync mechanics

Technical Considerations

  • The repository is a static snapshot with bundled/minified assets; there is no existing i18n framework.
  • Extraction rules must avoid pulling non-content technical strings from scripts/styles.
  • Section mapping should remain stable even if content text changes.
  • Output should be deterministic so repeated runs produce predictable key ordering/paths.
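Two of these considerations, skipping script/style content and producing deterministic output, can be illustrated with a short sketch. The skipped-tag set and the helper names are assumptions. Sorting keys is only one way to get stable ordering; keeping DOM insertion order would be equally deterministic.

```python
import json
from html.parser import HTMLParser

# Assumed set of elements whose text is never editorial content.
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class VisibleTextParser(HTMLParser):
    """Collects whitespace-normalized text outside of skipped elements."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.texts.append(text)


def to_stable_json(content):
    """Serialize with sorted keys so equal content yields identical text."""
    return json.dumps(content, ensure_ascii=False, indent=2, sort_keys=True)


text_parser = VisibleTextParser()
text_parser.feed(
    "<h1>Hello  world</h1><script>var a = '<b>x</b>';</script>"
    "<style>h1{color:red}</style><p>Text</p>"
)
text_parser.close()
```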

SpecFlow Analysis

Primary flow:

  1. Input snapshot HTML is parsed.
  2. Eligible text nodes and target attributes are identified.
  3. Content is grouped by top-level page sections.
  4. Metadata and image fields are merged into the same JSON tree.
  5. Output JSON is written.
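The five steps above might look like this in a compact form. It is a sketch under stated assumptions: top-level sections are taken to be `<section>` elements with an `id` (the real section boundaries are still an open dependency), nested sections are not handled, and `build_tree`, `extract`, and the output layout are hypothetical names.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class SectionExtractor(HTMLParser):
    """Steps 2 to 4: collect text and images per <section id="...">, plus <title>."""

    def __init__(self):
        super().__init__()
        self.tree = {"meta": {}, "sections": {}}
        self._section = None   # id of the section currently open, if any
        self._ignored = 0      # depth inside <script>/<style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._ignored += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "section" and attrs.get("id"):
            self._section = attrs["id"]
            self.tree["sections"][self._section] = {"text": [], "images": []}
        elif tag == "img" and self._section:
            image = {"src": attrs.get("src"), "alt": attrs.get("alt")}
            self.tree["sections"][self._section]["images"].append(image)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._ignored:
            self._ignored -= 1
        elif tag == "title":
            self._in_title = False
        elif tag == "section":
            self._section = None

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self._ignored:
            return
        if self._in_title:
            self.tree["meta"]["title"] = text
        elif self._section:
            self.tree["sections"][self._section]["text"].append(text)


def build_tree(html):
    """Parse an HTML string and return the nested content tree."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    return parser.tree


def extract(html_path, json_path):
    """Step 1 (read input) and step 5 (write output) around build_tree."""
    tree = build_tree(Path(html_path).read_text(encoding="utf-8"))
    Path(json_path).write_text(
        json.dumps(tree, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return tree


# Hypothetical markup for illustration only.
demo_tree = build_tree(
    '<html><head><title>Demo</title><style>p{}</style></head><body>'
    '<section id="hero"><h1>Hi</h1><img src="a.png" alt="A"></section>'
    '<section id="about"><p>About  us</p><script>x()</script></section>'
    '</body></html>'
)
```

A call such as `extract("web4beginners.com.html", "content.json")` would then cover the full flow; the output file name here is a placeholder, since the naming convention is not yet decided.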

Edge cases to cover:

  • Empty or whitespace-only nodes
  • Repeated text across sections
  • Links/buttons with nested elements
  • Missing alt attributes
  • Cookie/modal/footer content that may be conditionally visible
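Several of these edge cases can be made concrete with a small sketch. It assumes that text inside a link or button should be emitted as one string, and that a missing `alt` should be flagged rather than silently dropped; both are illustrative choices, not decisions recorded in the plan. Conditionally visible content (cookie banners, modals) is not addressed here.

```python
from html.parser import HTMLParser

# Elements whose nested text is joined into a single string (an assumption).
INLINE_CONTAINERS = {"a", "button"}


class EdgeCaseParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # raw text fragments while inside <a>/<button>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in INLINE_CONTAINERS and self._buffer is None:
            self._buffer = []
        elif tag == "img":
            self.items.append({
                "type": "image",
                "src": attrs.get("src"),
                "alt": attrs.get("alt"),
                # Distinguishes an absent attribute from an intentionally empty one.
                "alt_missing": "alt" not in attrs,
            })

    def handle_endtag(self, tag):
        if tag in INLINE_CONTAINERS and self._buffer is not None:
            # Join raw fragments first, then normalize whitespace once.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.items.append({"type": "text", "value": text})
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)
            return
        text = " ".join(data.split())
        if text:  # empty or whitespace-only nodes are skipped
            self.items.append({"type": "text", "value": text})


edge_parser = EdgeCaseParser()
edge_parser.feed(
    '<p>   </p><a href="/x">Read <strong>more</strong></a>'
    '<img src="a.png"><img src="b.png" alt="">'
)
edge_parser.close()
```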

Acceptance Criteria

  • A single extraction run generates one nested JSON file from web4beginners.com.html.
  • JSON includes visible page text grouped by section subtopics.
  • JSON includes title, description, Open Graph, and Twitter metadata values.
  • JSON includes img src and img alt values where present.
  • Duplicate policy is applied: section-local duplicates kept; global/common duplicates deduped.
  • Extraction excludes JS/CSS artifacts and non-content noise.
  • Re-running extraction on unchanged input produces stable output structure.
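The last criterion lends itself to an automated check. The sketch below assumes the extractor is available as a function that takes an HTML string and returns a JSON-serializable tree; `is_stable` and `output_digest` are hypothetical helpers, and the two extractors are stand-ins used only to exercise them. Keys are deliberately not sorted before hashing, so a change in key order also counts as a difference.

```python
import hashlib
import itertools
import json


def output_digest(content):
    """Hash the serialized output, preserving the key order as produced."""
    payload = json.dumps(content, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_stable(extract_fn, html, runs=3):
    """Return True if repeated runs on the same input give identical output."""
    digests = {output_digest(extract_fn(html)) for _ in range(runs)}
    return len(digests) == 1


# Stand-in extractors for demonstration, not the real extraction logic.
def stable_extractor(html):
    return {"length": len(html), "text": html.strip()}


_run_counter = itertools.count()


def unstable_extractor(html):
    # Leaks run-specific state into the output, which the check should catch.
    return {"run": next(_run_counter), "text": html.strip()}


stable_result = is_stable(stable_extractor, "<p>x</p>")
unstable_result = is_stable(unstable_extractor, "<p>x</p>")
```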

Success Metrics

  • Editors can locate and update target strings in JSON without editing HTML directly.
  • JSON organization is understandable by section/context without reverse-engineering selectors.
  • The source HTML is left unchanged (the extraction phase is read-only), so no layout or content regressions are introduced.

Dependencies & Risks

Dependencies:

  • Final agreement on section boundaries for grouping
  • Final output file location/name convention

Risks:

  • Over-extraction of non-user-facing strings
  • Unstable keys if selector strategy is inconsistent
  • Ambiguity around “global/common” duplicate classification

Mitigations:

  • Explicit extraction allowlist for elements/attributes
  • Deterministic key-generation policy
  • Documented duplicate decision rules with examples
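As one possible shape for a deterministic key-generation policy, keys could be derived from structure rather than from the text itself, reusing an existing element ID when there is one. The key format, `make_key`, and `slugify` are assumptions for illustration; the plan leaves the actual policy open.

```python
import re


def slugify(value):
    """Lowercase and reduce a value to letters, digits, and single hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")
    return slug or "unnamed"


def make_key(section_id, tag, index, element_id=None):
    """Build a content key that does not depend on the extracted text.

    Prefers an existing element ID; otherwise falls back to the tag name
    plus the element's position among same-tag elements in the section.
    """
    local = element_id if element_id else f"{tag}-{index}"
    return f"{slugify(section_id)}.{slugify(local)}"


key_by_position = make_key("Hero Section", "h1", 0)
key_by_id = make_key("hero", "p", 2, element_id="introText")
```

Position-based keys stay stable when copy changes but shift when elements are added or removed, which is where a `data-*` attribute may be the safer identity.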

References & Research

  • Brainstorm: docs/brainstorms/2026-03-03-dom-json-wysiwyg-sync-brainstorm.md
  • Source snapshot: web4beginners.com.html
  • Existing site bundle references: web4beginners.com_files/*