interkollektives/interkollektives-micro-website

Robert Rapp fd9ea482bf Initial import: web4beginners editor and deployment setup

2026-03-06 13:49:43 +01:00

title	type	status	date
feat: DOM-to-JSON content extraction for static snapshot	feat	completed	2026-03-04

feat: DOM-to-JSON content extraction for static snapshot

Overview

Create a first-stage extraction workflow that converts the existing HTML snapshot into a nested JSON content file. This plan is intentionally limited to extraction and content mapping. It does not include building the WYSIWYG editor yet.

Problem Statement / Motivation

Content updates are currently tied to manual HTML edits. A JSON representation is needed so text and selected image properties can be adapted more easily and later edited through an interface.

Proposed Solution

Build a deterministic DOM-to-JSON extraction flow for web4beginners.com.html that captures visible text, selected metadata, and image fields (src, alt).

The JSON structure should be DOM-first, with one top-level subtopic per page section, so that it matches the brainstorm decisions and keeps each string in context for editors. Duplicate text handling should follow the agreed hybrid policy: keep duplicates that are local to a section; dedupe only items that are clearly global/common.
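
The hybrid duplicate policy could look roughly like the Python sketch below. It is an illustration, not the agreed rule: the function name `apply_duplicate_policy`, the `global_strings` set, the output keys `common` and `sections`, and the sample strings in the usage note are all assumptions. Which strings count as global/common is treated here as an explicit, hand-maintained set.

```python
def apply_duplicate_policy(sections, global_strings):
    """Keep section-local duplicates; move known global strings to one place.

    sections: dict mapping a section name to its list of extracted strings.
    global_strings: set of strings that were explicitly classified as global
    (hypothetical input; the classification rule itself is still open).
    """
    common = []
    kept_by_section = {}
    for name, texts in sections.items():
        kept = []
        for text in texts:
            if text in global_strings:
                # Global/common item: store it once, in first-seen order.
                if text not in common:
                    common.append(text)
            else:
                # Section-local text is kept as-is, even if it repeats.
                kept.append(text)
        kept_by_section[name] = kept
    return {"common": common, "sections": kept_by_section}
```

For example, with `{"hero": ["Sign up", "Sign up"], "footer": ["Imprint"]}` and `global_strings={"Imprint"}`, the repeated "Sign up" stays inside `hero`, while "Imprint" moves to `common`.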

Scope

In scope:

Extract visible text content from page sections
Extract metadata: title, description, Open Graph, Twitter
Extract image fields: img src, img alt
Produce nested JSON output aligned with DOM sections
Define a stable content identity strategy (reuse existing selectors/IDs; add data-* attributes only when needed)

Out of scope:

WYSIWYG editing UI
Styling/layout editing
Full responsive image source editing (srcset, picture)
Full bidirectional sync mechanics

Technical Considerations

The repository is a static snapshot with bundled/minified assets; there is no existing i18n framework.
Extraction rules must avoid pulling non-content technical strings from scripts/styles.
Section mapping should remain stable even if content text changes.
Output should be deterministic so repeated runs produce predictable key ordering/paths.
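
One possible way to get deterministic output is to serialize with sorted keys and compare digests between runs. The sketch below uses only the Python standard library; the helper names `to_stable_json` and `digest` are hypothetical. Sorting keys alphabetically is an assumption: if DOM order of sections should be preserved instead, drop `sort_keys` and rely on the parser visiting elements in document order.

```python
import hashlib
import json


def to_stable_json(tree):
    """Serialize so that equal content always yields identical bytes."""
    # sort_keys removes any dependence on dict insertion order.
    # Lists are left untouched because their order mirrors the DOM.
    return json.dumps(tree, sort_keys=True, ensure_ascii=False, indent=2) + "\n"


def digest(tree):
    """SHA-256 of the stable serialization, handy for comparing two runs."""
    return hashlib.sha256(to_stable_json(tree).encode("utf-8")).hexdigest()
```

Two runs over the same snapshot can then be compared with `digest(first) == digest(second)`.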

SpecFlow Analysis

Primary flow:

Input snapshot HTML is parsed.
Eligible text nodes and target attributes are identified.
Content is grouped by top-level page sections.
Metadata and image fields are merged into the same JSON tree.
Output JSON is written.
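
The primary flow above can be sketched with Python's standard `html.parser`. This is a minimal sketch under stated assumptions, not the planned implementation: section boundaries are assumed to be `header`, `section`, and `footer` elements (the real boundaries are still to be agreed), the tree layout with `meta` and `sections` keys is hypothetical, and text outside any section is ignored for brevity.

```python
import json
from html.parser import HTMLParser

SKIPPED = {"script", "style"}               # non-content technical strings
SECTIONS = {"header", "section", "footer"}  # assumed section boundaries


class SnapshotExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tree = {"meta": {}, "sections": {}}
        self._skip = 0          # depth inside script/style
        self._sections = []     # stack of open section names
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in SKIPPED:
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = a.get("name") or a.get("property") or ""
            wanted = key == "description" or key.startswith(("og:", "twitter:"))
            if wanted and a.get("content"):
                self.tree["meta"][key] = a["content"]
        elif tag in SECTIONS:
            # Prefer an existing id; the positional fallback is less stable.
            name = a.get("id") or f"{tag}-{len(self.tree['sections']) + 1}"
            self._sections.append(name)
            self.tree["sections"].setdefault(name, {"text": [], "images": []})
        elif tag == "img" and self._sections:
            self.tree["sections"][self._sections[-1]]["images"].append(
                {"src": a.get("src"), "alt": a.get("alt")}
            )

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False
        elif tag in SECTIONS and self._sections:
            self._sections.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or self._skip:
            return
        if self._in_title:
            self.tree["meta"]["title"] = text
        elif self._sections:
            self.tree["sections"][self._sections[-1]]["text"].append(text)


def extract(html):
    """Parse snapshot HTML and return the merged content tree."""
    parser = SnapshotExtractor()
    parser.feed(html)
    parser.close()
    return parser.tree


def run(html_path, json_path):
    """Read the snapshot, extract, and write the JSON file."""
    with open(html_path, encoding="utf-8") as f:
        tree = extract(f.read())
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(tree, f, ensure_ascii=False, indent=2)
```

`run` takes both paths as parameters because the output file location and name convention are still open.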

Edge cases to cover:

Empty or whitespace-only nodes
Repeated text across sections
Links/buttons with nested elements
Missing alt attributes
Cookie/modal/footer content that may be conditionally visible
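
Several of these edge cases can be illustrated in a short Python sketch. It drops whitespace-only nodes, joins text nested inside a link or button into one entry, and records whether an `alt` attribute is absent as opposed to empty. The class name `EdgeCaseParser` and the entry fields (`type`, `text`, `alt_missing`) are hypothetical; repeated text and conditional visibility are not handled here.

```python
from html.parser import HTMLParser


def normalize(text):
    """Collapse whitespace; return None for empty or whitespace-only text."""
    collapsed = " ".join(text.split())
    return collapsed or None


class EdgeCaseParser(HTMLParser):
    CONTROLS = ("a", "button")

    def __init__(self):
        super().__init__()
        self.entries = []
        self._depth = 0   # nesting depth inside links/buttons
        self._parts = []  # text fragments of the current control

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTROLS:
            self._depth += 1
        elif tag == "img":
            a = dict(attrs)
            self.entries.append({
                "type": "image",
                "src": a.get("src"),
                "alt": a.get("alt"),
                # Absent alt differs from alt="" (decorative image).
                "alt_missing": "alt" not in a,
            })

    def handle_endtag(self, tag):
        if tag in self.CONTROLS and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join nested fragments so the label reads as one string.
                text = normalize("".join(self._parts))
                self._parts = []
                if text:
                    self.entries.append({"type": "control", "text": text})

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)
        else:
            text = normalize(data)
            if text:
                self.entries.append({"type": "text", "text": text})
```

Keeping `alt_missing` separate from the `alt` value lets a later editor distinguish "no alt was ever set" from "alt is intentionally empty".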

Acceptance Criteria

A single extraction run generates one nested JSON file from web4beginners.com.html.
JSON includes visible page text grouped by section subtopics.
JSON includes title, description, Open Graph, and Twitter metadata values.
JSON includes img src and img alt values where present.
Duplicate policy is applied: section-local duplicates kept; global/common duplicates deduped.
Extraction excludes JS/CSS artifacts and non-content noise.
Re-running extraction on unchanged input produces stable output structure.
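
Some of these criteria could be checked mechanically. The Python sketch below assumes a hypothetical output layout, `{"meta": {...}, "sections": {name: {"text": [...], "images": [...]}}}`, which this plan does not fix; the function name `check_acceptance` and the messages are likewise illustrative. Because Open Graph and Twitter values are only required where present in the source, the corresponding findings are better read as warnings than as failures.

```python
def check_acceptance(tree):
    """Return a list of human-readable findings; an empty list means no findings."""
    problems = []
    meta = tree.get("meta", {})

    for key in ("title", "description"):
        if not meta.get(key):
            problems.append(f"missing metadata: {key}")

    for prefix in ("og:", "twitter:"):
        if not any(name.startswith(prefix) for name in meta):
            problems.append(f"no {prefix}* metadata found")

    sections = tree.get("sections", {})
    if not sections:
        problems.append("no sections extracted")

    for name, section in sections.items():
        for text in section.get("text", []):
            if not text.strip():
                problems.append(f"{name}: whitespace-only text entry")
        for image in section.get("images", []):
            if "src" not in image or "alt" not in image:
                problems.append(f"{name}: image entry lacks src or alt field")

    return problems
```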

Success Metrics

Editors can locate and update target strings in JSON without editing HTML directly.
JSON organization is understandable by section/context without reverse-engineering selectors.
The source HTML shows no unintended layout or content regressions, because the extraction phase only reads it.

Dependencies & Risks

Dependencies:

Final agreement on section boundaries for grouping
Final output file location/name convention

Risks:

Over-extraction of non-user-facing strings
Unstable keys if selector strategy is inconsistent
Ambiguity around “global/common” duplicate classification

Mitigations:

Explicit extraction allowlist for elements/attributes
Deterministic key-generation policy
Documented duplicate decision rules with examples
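
The first two mitigations might be sketched in Python as follows. Everything named here is an assumption for illustration: the allowlist contents, the `data-content-key` attribute name, and the key format. The point of the sketch is that keys are derived from identifiers and structure, never from the text itself, so they stay stable when content changes.

```python
import re

# Hypothetical allowlist: only these attributes are ever extracted per tag.
ATTRIBUTE_ALLOWLIST = {"img": ("src", "alt"), "meta": ("content",)}


def pick_attributes(tag, attrs):
    """Return only the allowlisted attributes that are present on the element."""
    allowed = ATTRIBUTE_ALLOWLIST.get(tag, ())
    return {name: attrs[name] for name in allowed if name in attrs}


def slug(value):
    """Lowercase and replace runs of other characters with single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def content_key(tag, attrs, parent_key, sibling_index):
    """Derive a key: existing id first, then data-* marker, then position."""
    if attrs.get("id"):
        return slug(attrs["id"])
    if attrs.get("data-content-key"):  # hypothetical attribute name
        return slug(attrs["data-content-key"])
    # Positional fallback: stable only while the surrounding structure is.
    return f"{parent_key}.{tag}[{sibling_index}]"
```

The positional fallback is the weakest of the three sources, which is one reason to add a data-* marker wherever an element has no usable ID.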

References & Research

Brainstorm: docs/brainstorms/2026-03-03-dom-json-wysiwyg-sync-brainstorm.md
Source snapshot: web4beginners.com.html
Existing site bundle references: web4beginners.com_files/*
