Publications Access Graphs
Revision as of 10:06, 30 January 2026
CM corpus access graphs
- 2026-01-30
Corpus Projection Invariants (Normative)
There are two main projections:
- accumulated human_get time series
- page_category scatter plot
There is one set of invariants for title normalisation and corpus membership.
Authority and Governance
- The projections are curator-governed and MUST be reproducible from declared inputs alone.
- The assisting system MUST NOT infer, rename, paraphrase, merge, split, or reorder titles beyond the explicit rules stated here.
- The assisting system MUST NOT optimise for visual clarity at the expense of semantic correctness.
- Any deviation from these invariants MUST be explicitly declared by the human curator with a dated update entry.
Authoritative Inputs
- Input A: Hourly rollup TSVs produced by logrollup tooling.
- Input B: Corpus bundle manifest (corpus/manifest.tsv).
- Input C: Host scope fixed to publications.arising.com.au.
- Input D: Full temporal range present in the rollup set (no truncation).
Eligible Resource Set (Corpus Titles)
- The eligible title set MUST be derived exclusively from corpus/manifest.tsv.
- Column 1 of manifest.tsv is the authoritative MediaWiki page title.
- Only titles present in the manifest (after normalisation) are eligible for projection.
- Titles present in the manifest MUST be included in the projection domain even if they receive zero hits in the period.
- Titles not present in the manifest MUST be excluded even if traffic exists.
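The membership rules above can be sketched as follows. This is a minimal illustration, not the curator's tooling: it assumes the manifest is tab-separated with no header row (the source does not say), and the helper names `load_eligible_titles` and `count_hits` are hypothetical. Titles passed to `count_hits` are assumed to be already normalised.

```python
def load_eligible_titles(manifest_text):
    """Return column 1 of each non-blank manifest row, in file order."""
    titles = []
    for line in manifest_text.splitlines():
        first_column = line.split("\t", 1)[0].strip()
        if first_column:
            titles.append(first_column)
    return titles


def count_hits(titles, observed):
    """Sum hits per manifest title.

    observed: iterable of (normalised_title, hits) pairs.
    Every manifest title starts at zero, so pages without traffic stay in
    the projection domain; titles outside the manifest are dropped.
    """
    table = {title: 0 for title in titles}
    for title, hits in observed:
        if title in table:
            table[title] += hits
    return table
```

Seeding the table from the manifest, rather than from observed traffic, is what keeps zero-hit titles in and non-manifest titles out.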
Path → Title Extraction
- A rollup record contributes to a page only if a title can be extracted by these rules:
- If path matches /pub/<title>, then <title> is the candidate.
- If path matches /pub-dir/index.php?<query>, the title MUST be taken from title=<title>.
- If title= is absent, page=<title> MAY be used.
- Otherwise, the record MUST NOT be treated as a page hit.
- URL fragments (#…) MUST be removed prior to extraction.
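One way to read the extraction rules above is sketched below. The function name `extract_candidate_title` is hypothetical, and two details are assumptions not stated in the source: "/pub/<title>" is treated as a prefix match, and when a query parameter repeats, the first occurrence wins. The returned candidate is still URL-encoded, so decoding is left to the normalisation step.

```python
def extract_candidate_title(path):
    """Return the raw (still URL-encoded) candidate title, or None."""
    path = path.split("#", 1)[0]              # fragments are removed first
    base, _, query = path.partition("?")

    if base.startswith("/pub/"):
        return base[len("/pub/"):] or None

    if base == "/pub-dir/index.php":
        params = {}
        for pair in query.split("&"):
            key, sep, value = pair.partition("=")
            if sep and key not in params:     # assumption: first occurrence wins
                params[key] = value
        if "title" in params:
            return params["title"] or None
        return params.get("page") or None     # page= only when title= is absent

    return None                               # not a page hit
```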
Title Normalisation
- URL decoding MUST occur before all other steps.
- Underscores (_) MUST be converted to spaces.
- Unicode en and em dashes (–, —) MUST be converted to ASCII hyphen (-).
- Whitespace runs MUST be collapsed to a single space and trimmed.
- After normalisation, the title MUST exactly match a manifest title to remain eligible.
- Main Page MUST be excluded from this projection.
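The normalisation steps above, applied in the listed order, might look like the sketch below. The function names are hypothetical, and `manifest_titles` is assumed to be a set of titles taken from column 1 of the manifest. Note that `urllib.parse.unquote` does not turn `+` into a space; whether it should is not stated in the source.

```python
import re
from urllib.parse import unquote


def normalise_title(raw):
    """Apply the normalisation steps in the order the invariants list them."""
    title = unquote(raw)                                        # URL decoding first
    title = title.replace("_", " ")                             # underscores to spaces
    title = title.replace("\u2013", "-").replace("\u2014", "-") # en/em dash to hyphen
    return re.sub(r"\s+", " ", title).strip()                   # collapse and trim


def eligible_title(raw, manifest_titles):
    """Return the normalised title if it is eligible, else None."""
    title = normalise_title(raw)
    if title == "Main Page":
        return None                       # Main Page is excluded from this projection
    return title if title in manifest_titles else None          # exact match only
```

For example, `normalise_title("Plain%20Language__Summary_%E2%80%93_v1")` yields `"Plain Language Summary - v1"`.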
Accumulated human_get_ok projection
Noise and Infrastructure Exclusions
- The following MUST be excluded prior to aggregation:
- Special:, Category:, Category talk:, Talk:, User:, User talk:, File:, Template:, Help:, MediaWiki:
- /resources/, /pub-dir/load.php, /pub-dir/api.php, /pub-dir/rest.php
- /robots.txt, /favicon.ico
- sitemap (any case)
- Static resources by extension (.png, .jpg, .jpeg, .gif, .svg, .ico, .webp)
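The exclusion list above can be expressed as a single predicate, sketched here under stated assumptions: namespace prefixes are matched case-sensitively against the already normalised title, the extension check is case-insensitive, and the function name `is_excluded` is hypothetical. None of these matching details are specified in the source.

```python
EXCLUDED_NAMESPACES = (
    "Special:", "Category:", "Category talk:", "Talk:", "User:",
    "User talk:", "File:", "Template:", "Help:", "MediaWiki:",
)
EXCLUDED_PATH_PREFIXES = (
    "/resources/", "/pub-dir/load.php", "/pub-dir/api.php", "/pub-dir/rest.php",
)
EXCLUDED_EXACT_PATHS = ("/robots.txt", "/favicon.ico")
STATIC_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".webp")


def is_excluded(path, title=None):
    """Return True if the record must be dropped before aggregation."""
    base = path.split("#", 1)[0].split("?", 1)[0]
    if base in EXCLUDED_EXACT_PATHS:
        return True
    if base.startswith(EXCLUDED_PATH_PREFIXES):
        return True
    if "sitemap" in path.lower():                  # any case
        return True
    if base.lower().endswith(STATIC_EXTENSIONS):   # assumption: case-insensitive
        return True
    if title is not None and title.startswith(EXCLUDED_NAMESPACES):
        return True
    return False
```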
Metric Definition
- The only signal used is human_get_ok.
- Redirects and non-human classifications MUST NOT be included.
- No inference from other status codes or agents is permitted.
Temporal Aggregation
- Hourly buckets MUST be aggregated into daily totals per title.
- Accumulated value per title is defined as:
- cum_hits(title, day_n) = Σ_{i=0..n} daily_hits(title, day_i)
- Accumulation MUST be monotonic and non-decreasing.
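A minimal sketch of the aggregation above, assuming each rollup record has already been reduced to an (ISO timestamp, title, human_get_ok count) triple. That record shape and both function names are hypothetical, since the rollup TSV columns are not described here.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
from itertools import accumulate


def daily_totals(hourly_records):
    """Fold hourly (iso_timestamp, title, hits) records into per-day totals."""
    totals = defaultdict(int)
    for stamp, title, hits in hourly_records:
        day = datetime.fromisoformat(stamp).date()
        totals[(title, day)] += hits
    return totals


def cumulative_series(totals, titles, first_day: date, last_day: date):
    """Return (days, {title: running totals}) over the full date range.

    Days without traffic contribute zero, so each series is non-decreasing
    as long as the hit counts themselves are non-negative.
    """
    span = (last_day - first_day).days + 1
    days = [first_day + timedelta(days=i) for i in range(span)]
    series = {
        title: list(accumulate(totals.get((title, day), 0) for day in days))
        for title in titles
    }
    return days, series
```

Iterating over every calendar day, rather than only days with traffic, keeps all series aligned on the same X axis.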
Axis and Scale Invariants
- X axis: calendar date from earliest to latest available day.
- Major ticks every 7 days.
- Minor ticks every day.
- Date labels MUST be rotated (oblique) for readability.
- Y axis MUST be logarithmic.
- Zero or negative values MUST NOT be plotted on the log axis.
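Two of the axis rules above lend themselves to small plotting-library-independent helpers, sketched here. The names are hypothetical. Anchoring the weekly ticks at the earliest day is an assumption, as the source does not say where the 7-day grid starts. Replacing non-positive values with NaN is one common way to keep them off a logarithmic axis, since plotting libraries such as Matplotlib leave gaps at NaN points.

```python
import math
from datetime import date, timedelta


def mask_non_positive(values):
    """Replace zero or negative values with NaN so they are not drawn."""
    return [float(v) if v > 0 else math.nan for v in values]


def major_tick_days(first_day: date, last_day: date, step_days=7):
    """Return tick dates every step_days, starting at first_day (assumed anchor)."""
    ticks = []
    day = first_day
    while day <= last_day:
        ticks.append(day)
        day += timedelta(days=step_days)
    return ticks
```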
Legend Ordering
- Legend entries MUST be ordered by descending final accumulated human_get_ok.
- Ordering MUST be deterministic and reproducible.
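Descending order alone is not deterministic when two titles finish with the same total. The sketch below breaks ties by title in ascending order. That tie-break is an assumption, as the source does not name one, and `legend_order` is a hypothetical helper.

```python
def legend_order(series):
    """Order titles by descending final accumulated value.

    series: {title: [cumulative values]}.
    Ties are broken by title (ascending) so repeated runs agree.
    """
    def sort_key(title):
        values = series[title]
        final = values[-1] if values else 0
        return (-final, title)

    return sorted(series, key=sort_key)
```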
Visual Disambiguation Invariants
- Each title MUST be visually distinguishable.
- The same colour MAY be reused.
- The same line style MAY be reused.
- The same (colour + line style) pair MUST NOT be reused.
- Markers MAY be omitted or reused but MUST NOT be relied upon as the sole distinguishing feature.
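The pair-uniqueness rule above is easy to check mechanically before rendering. The sketch assumes the encoding is held as a mapping from title to a (colour, line style) pair. That shape and the function name are hypothetical.

```python
from collections import Counter


def duplicate_style_pairs(encoding):
    """Return the (colour, line_style) pairs assigned to more than one title.

    encoding: {title: (colour, line_style)}.
    An empty result means every title is distinguishable by its pair, even
    though individual colours or line styles may repeat.
    """
    counts = Counter(encoding.values())
    return sorted(pair for pair, n in counts.items() if n > 1)
```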
Rendering Constraints
- Legend MUST be placed outside the plot area on the right.
- Sufficient vertical and horizontal space MUST be reserved to avoid label overlap.
- Line width SHOULD be consistent across series to avoid implied importance.
Interpretive Constraint
- This projection indicates reader entry and navigation behaviour only.
- High lead-in ranking MUST NOT be interpreted as quality, authority, or endorsement.
- Ordering reflects accumulated human access, not epistemic priority.
Periodic Regeneration
- This projection is intended to be regenerated periodically.
- Cross-run comparisons MUST preserve all invariants to allow valid temporal comparison.
- Changes in lead-in dominance (e.g. Plain-Language Summary vs. CM-1 foundation paper) are observational signals only and do not alter corpus structure.
Accumulated human_get time series projection
Corpus Lead-In Projection: Deterministic Colour Map
This table provides the visual encoding for the core corpus pages. For titles not included in the colour map, use colours at your discretion until a Colour Map entry exists.
Colours are drawn from the Matplotlib tab20 palette.
Line styles are assigned to ensure that no (colour + line-style) pair is reused. Legend ordering is governed separately by accumulated human_get_ok.
Corpus Lead-In Projection: Colour-Map Hardening Invariants
This section hardens the visual determinism of the Corpus Lead-In Projection while allowing controlled corpus growth.
Authority
- This Colour Map is **authoritative** for all listed corpus pages.
- The assisting system MUST NOT invent, alter, or substitute colours or line styles for listed pages.
- Visual encoding is a governed property, not a presentation choice.
Binding Rule
- For any page listed in the Deterministic Colour Map table:
- The assigned (colour index, colour hex, line style) pair MUST be used exactly.
- Deviation constitutes a projection violation.
Legend Ordering Separation
- Colour assignment and legend ordering are orthogonal.
- Legend ordering MUST continue to follow the accumulated human_get_ok invariant.
- Colour assignment MUST NOT be influenced by hit counts, rank, or ordering.
New Page Admission Rule
- Eligible pages not present in the current Colour Map MUST still appear in the projection.
- New pages MUST be assigned styles in strict sequence order:
- Iterate line style first, then colour index, exactly as defined in the base palette.
- Previously assigned pairs MUST NOT be reused.
- The assisting system MUST NOT reshuffle existing assignments to “make space”.
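The admission rule above can be sketched as a walk over the palette that skips pairs already in use. Several points are assumptions. The line-style list is taken to be Matplotlib's four named styles, because the source does not declare one. "Line style first" is read as the line style varying fastest within each colour index. New titles are processed in the order supplied by the caller. All names are hypothetical.

```python
LINE_STYLES = ("-", "--", "-.", ":")   # assumption: not declared in the source
COLOUR_INDICES = range(20)             # tab20 provides 20 colours


def admit_new_pages(colour_map, new_titles):
    """Return provisional {title: (colour_index, line_style)} for unlisted titles.

    colour_map is read but never modified, so existing assignments cannot
    be reshuffled to make space.
    """
    used = set(colour_map.values())
    free = (
        (colour, style)
        for colour in COLOUR_INDICES
        for style in LINE_STYLES         # line style varies fastest
        if (colour, style) not in used
    )
    provisional = {}
    for title in new_titles:
        if title in colour_map:
            continue                     # listed pages keep their binding pair
        try:
            provisional[title] = next(free)
        except StopIteration:
            raise ValueError("palette exhausted: no unused pair left") from None
    return provisional
```

Returning a separate mapping keeps the new pairs apart from the authoritative table until the curator records them.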
Provisional Encoding Rule
- Visual assignments for newly admitted pages are **provisional** until recorded.
- A projection that introduces provisional encodings MUST:
- Emit a warning note in the run metadata, and
- Produce an updated Colour Map table for curator review.
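A run could satisfy both obligations above by producing a small report alongside the plot. The sketch below is illustrative only: the run-metadata format is not specified in the source, so the dictionary keys, the status labels, and the function name are all hypothetical.

```python
def provisional_report(colour_map, provisional):
    """Build run metadata for a projection that may include provisional styles.

    colour_map and provisional: {title: (colour_index, line_style)}.
    """
    warnings = []
    if provisional:
        warnings.append(
            "Provisional encodings introduced for: "
            + ", ".join(sorted(provisional))
        )
    # Updated table for curator review: ratified rows first, then provisional.
    rows = [(t, c, s, "ratified") for t, (c, s) in colour_map.items()]
    rows += [(t, c, s, "provisional") for t, (c, s) in provisional.items()]
    return {"warnings": warnings, "colour_map_rows": rows}
```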
Curator Ratification
- Only the human curator may ratify new colour assignments.
- Ratification occurs by appending new rows to the Colour Map table with a date stamp.
- Once ratified, assignments become binding for all future projections.
Backward Compatibility
- Previously generated projections remain valid historical artefacts.
- Introduction of new pages MUST NOT retroactively alter the appearance of older projections.
Failure Mode Detection
- If a projection requires more unique (colour, line-style) pairs than the declared palette provides:
- The assisting system MUST fail explicitly.
- Silent reuse, substitution, or visual approximation is prohibited.
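An explicit failure can be as simple as a capacity check before any styles are assigned. The sketch assumes the palette is 20 colours (tab20) by 4 line styles. The line-style count is an assumption, and the exception and function names are hypothetical.

```python
class PaletteExhaustedError(RuntimeError):
    """Raised when a projection needs more pairs than the palette declares."""


def check_palette_capacity(n_titles, n_colours=20, n_line_styles=4):
    """Return the number of spare pairs, or raise if the palette is too small."""
    capacity = n_colours * n_line_styles
    if n_titles > capacity:
        raise PaletteExhaustedError(
            f"{n_titles} titles need unique pairs but the palette has {capacity}"
        )
    return capacity - n_titles
```

Raising an exception, rather than wrapping around the palette, makes the run stop instead of silently reusing a pair.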
Rationale (Non-Normative)
- This hardening ensures:
- Cross-run visual comparability
- Human recognition of lead-in stability
- Detectable drift when corpus structure changes
- Visual determinism is treated as part of epistemic governance, not aesthetics.
