Publications Access Graphs: Difference between revisions
Revision as of 10:52, 30 January 2026
Publications access graphs
- 2026-01-30 accumulated human get
- 2026-01-30 page access scatter plot
Corpus Projection Invariants (Normative)
There are two main projections:
- accumulated human_get time series
- page_category scatter plot
There is one set of invariants for title normalisation and corpus membership.
Authority and Governance
- The projections are curator-governed and MUST be reproducible from declared inputs alone.
- The assisting system MUST NOT infer, rename, paraphrase, merge, split, or reorder titles beyond the explicit rules stated here.
- The assisting system MUST NOT optimise for visual clarity at the expense of semantic correctness.
- Any deviation from these invariants MUST be explicitly declared by the human curator with a dated update entry.
Authoritative Inputs
- Input A: Hourly rollup TSVs produced by logrollup tooling.
- Input B: Corpus bundle manifest (corpus/manifest.tsv).
- Input C: Host scope fixed to publications.arising.com.au.
- Input D: Full temporal range present in the rollup set (no truncation).
Eligible Resource Set (Corpus Titles)
- The eligible title set MUST be derived exclusively from corpus/manifest.tsv.
- Column 1 of manifest.tsv is the authoritative MediaWiki page title.
- Only titles present in the manifest (after normalisation) are eligible for projection.
- Titles present in the manifest MUST be included in the projection domain even if they receive zero hits in the period.
- Titles not present in the manifest MUST be excluded even if traffic exists.
Path → Title Extraction
- A rollup record contributes to a page only if a title can be extracted by these rules:
- If path matches /pub/<title>, then <title> is the candidate.
- If path matches /pub-dir/index.php?<query>, the title MUST be taken from title=<title>.
- If title= is absent, page=<title> MAY be used.
- Otherwise, the record MUST NOT be treated as a page hit.
- URL fragments (#…) MUST be removed prior to extraction.
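The extraction rules above can be sketched as a small function. This is a minimal illustration, not the logrollup tooling itself: the function name `extract_title_candidate` is hypothetical, and it assumes the rollup path is a string that may still carry a query string and fragment. The candidate is returned still URL-encoded so that decoding can happen in the normalisation step.

```python
from urllib.parse import urlsplit

def extract_title_candidate(raw_path):
    """Return the raw (still URL-encoded) title candidate, or None if not a page hit."""
    parts = urlsplit(raw_path)          # splits off the #fragment, which is discarded
    path, query = parts.path, parts.query
    if path.startswith("/pub/") and len(path) > len("/pub/"):
        return path[len("/pub/"):]
    if path == "/pub-dir/index.php":
        # title= takes precedence; page= is only a fallback
        for key in ("title", "page"):
            for pair in query.split("&"):
                name, _, value = pair.partition("=")
                if name == key and value:
                    return value
    return None                          # MUST NOT be treated as a page hit
```

For example, `/pub/Foo_Bar#History` yields the candidate `Foo_Bar`, while `/robots.txt` yields `None`.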
Title Normalisation
- URL decoding MUST occur before all other steps.
- Underscores (_) MUST be converted to spaces.
- UTF-8 dashes (–, —) MUST be converted to ASCII hyphen (-).
- Whitespace runs MUST be collapsed to a single space and trimmed.
- After normalisation, the title MUST exactly match a manifest title to remain eligible.
- Main Page MUST be excluded from this projection.
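A minimal sketch of the normalisation steps in the order given above. The names `normalise_title` and `eligible_title` are hypothetical, and `manifest_titles` is assumed to be a set of manifest titles that have already been normalised the same way.

```python
import re
from urllib.parse import unquote

def normalise_title(candidate):
    """Apply the normalisation rules in their stated order."""
    title = unquote(candidate)                                   # URL decoding first
    title = title.replace("_", " ")                              # underscores to spaces
    title = title.replace("\u2013", "-").replace("\u2014", "-")  # en/em dash to ASCII hyphen
    return re.sub(r"\s+", " ", title).strip()                    # collapse and trim whitespace

def eligible_title(candidate, manifest_titles):
    """Return the normalised title if it is eligible for projection, else None."""
    title = normalise_title(candidate)
    if title == "Main Page":
        return None                                              # excluded from this projection
    return title if title in manifest_titles else None           # exact match required
```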
Accumulated human_get_ok projection
Noise and Infrastructure Exclusions
- The following MUST be excluded prior to aggregation:
- Special:, Category:, Category talk:, Talk:, User:, User talk:, File:, Template:, Help:, MediaWiki:
- /resources/, /pub-dir/load.php, /pub-dir/api.php, /pub-dir/rest.php
- /robots.txt, /favicon.ico
- sitemap (any case)
- Static resources by extension (.png, .jpg, .jpeg, .gif, .svg, .ico, .webp)
Metric Definition
- The only signal used is human_get_ok.
- Redirects and non-human classifications MUST NOT be included.
- No inference from other status codes or agents is permitted.
Temporal Aggregation
- Hourly buckets MUST be aggregated into daily totals per title.
- Accumulated value per title is defined as:
- cum_hits(title, day_n) = Σ_{i=0..n} daily_hits(title, day_i)
- Accumulation MUST be monotonic and non-decreasing.
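The aggregation above amounts to a per-day sum followed by a running total. The sketch below assumes a hypothetical record shape of `(title, "YYYY-MM-DDTHH", human_get_ok)`; the actual rollup TSV columns may differ. Days with no hits contribute zero, which keeps the series non-decreasing.

```python
from collections import defaultdict
from itertools import accumulate

def daily_totals(hourly_records):
    """Sum hourly human_get_ok counts into (title, day) totals."""
    totals = defaultdict(int)
    for title, hour_stamp, hits in hourly_records:
        totals[(title, hour_stamp[:10])] += hits   # first 10 chars = YYYY-MM-DD
    return totals

def cumulative_series(totals, title, days):
    """Running total for one title over an ordered list of days."""
    daily = [totals.get((title, day), 0) for day in days]
    return list(accumulate(daily))
```

Because the log Y axis cannot show zero, a caller would still need to skip leading days whose accumulated value is 0 when plotting.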
Axis and Scale Invariants
- X axis: calendar date from earliest to latest available day.
- Major ticks every 7 days.
- Minor ticks every day.
- Date labels MUST be rotated (oblique) for readability.
- Y axis MUST be logarithmic.
- Zero or negative values MUST NOT be plotted on the log axis.
Legend Ordering
- Legend entries MUST be ordered by descending final accumulated human_get_ok.
- Ordering MUST be deterministic and reproducible.
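One way to make the ordering deterministic is to add an explicit tie-break. The sketch below breaks ties by title in ascending order; that tie-break is an assumption, since the invariants above do not specify one.

```python
def legend_order(final_totals):
    """Titles by descending final accumulated human_get_ok, ties broken by title."""
    return sorted(final_totals, key=lambda title: (-final_totals[title], title))
```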
Visual Disambiguation Invariants
- Each title MUST be visually distinguishable.
- The same colour MAY be reused.
- The same line style MAY be reused.
- The same (colour + line style) pair MUST NOT be reused.
- Markers MAY be omitted or reused but MUST NOT be relied upon as the sole distinguishing feature.
Rendering Constraints
- Legend MUST be placed outside the plot area on the right.
- Sufficient vertical and horizontal space MUST be reserved to avoid label overlap.
- Line width SHOULD be consistent across series to avoid implied importance.
Interpretive Constraint
- This projection indicates reader entry and navigation behaviour only.
- High lead-in ranking MUST NOT be interpreted as quality, authority, or endorsement.
- Ordering reflects accumulated human access, not epistemic priority.
Periodic Regeneration
- This projection is intended to be regenerated periodically.
- Cross-run comparisons MUST preserve all invariants to allow valid temporal comparison.
- Changes in lead-in dominance (e.g. Plain-Language Summary vs. CM-1 foundation paper) are observational signals only and do not alter corpus structure.
Accumulated human_get time series projection
Corpus Lead-In Projection: Deterministic Colour Map
This table provides the visual encoding for the core corpus pages. Titles not included in the colour map receive provisional encodings under the New Page Admission Rule until a Colour Map entry exists.
Colours are drawn from the Matplotlib tab20 palette.
Line styles are assigned to ensure that no (colour + line-style) pair is reused. Legend ordering is governed separately by accumulated human_get_ok.
Corpus Lead-In Projection: Colour-Map Hardening Invariants
This section hardens the visual determinism of the Corpus Lead-In Projection while allowing controlled corpus growth.
Authority
- This Colour Map is **authoritative** for all listed corpus pages.
- The assisting system MUST NOT invent, alter, or substitute colours or line styles for listed pages.
- Visual encoding is a governed property, not a presentation choice.
Binding Rule
- For any page listed in the Deterministic Colour Map table:
- The assigned (colour index, colour hex, line style) pair MUST be used exactly.
- Deviation constitutes a projection violation.
Legend Ordering Separation
- Colour assignment and legend ordering are orthogonal.
- Legend ordering MUST continue to follow the accumulated human_get_ok invariant.
- Colour assignment MUST NOT be influenced by hit counts, rank, or ordering.
New Page Admission Rule
- Pages not present in the current Colour Map MUST appear in a projection.
- New pages MUST be assigned styles in strict sequence order:
- Iterate line style first, then colour index, exactly as defined in the base palette.
- Previously assigned pairs MUST NOT be reused.
- The assisting system MUST NOT reshuffle existing assignments to “make space”.
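A rough sketch of the admission rule, under several loudly stated assumptions: the line-style list is hypothetical (the source does not enumerate it), "line style first" is read as line style varying fastest within each colour index, and new titles are processed in sorted order for determinism. Existing assignments are never touched, and the function raises instead of reusing a pair, in keeping with the Failure Mode Detection rule.

```python
LINE_STYLES = ["-", "--", "-.", ":"]   # assumed set of Matplotlib line styles
COLOUR_INDICES = range(20)             # tab20 provides 20 colour indices

class PaletteExhausted(RuntimeError):
    """Raised when no unused (colour index, line style) pair remains."""

def admit_new_pages(colour_map, new_titles):
    """Return provisional {title: (colour_index, line_style)} for unlisted titles."""
    used = set(colour_map.values())
    # Assumed sequence: for each colour index, step through every line style.
    sequence = [(c, s) for c in COLOUR_INDICES for s in LINE_STYLES]
    free = iter(pair for pair in sequence if pair not in used)
    provisional = {}
    for title in sorted(set(new_titles) - set(colour_map)):
        pair = next(free, None)
        if pair is None:
            raise PaletteExhausted(f"no unused pair left for {title!r}")
        provisional[title] = pair
    return provisional
```

The returned mapping is provisional only; `colour_map` itself is left unmodified so that ratification stays with the curator.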
Provisional Encoding Rule
- Visual assignments for newly admitted pages are **provisional** until recorded.
- A projection that introduces provisional encodings MUST:
- Emit a warning note in the run metadata, and
- Produce an updated Colour Map table for curator review.
Curator Ratification
- Only the human curator may ratify new colour assignments.
- Ratification occurs by appending new rows to the Colour Map table with a date stamp.
- Once ratified, assignments become binding for all future projections.
Backward Compatibility
- Previously generated projections remain valid historical artefacts.
- Introduction of new pages MUST NOT retroactively alter the appearance of older projections.
Failure Mode Detection
- If a projection requires more unique (colour, line-style) pairs than the declared palette provides:
- The assisting system MUST fail explicitly.
- Silent reuse, substitution, or visual approximation is prohibited.
Rationale (Non-Normative)
- This hardening ensures:
- Cross-run visual comparability
- Human recognition of lead-in stability
- Detectable drift when corpus structure changes
- Visual determinism is treated as part of epistemic governance, not aesthetics.
title: MW-page projection invariants (normative) — daily hits scatter (multi-agent / multi-method)
format: MWDUMP
name: mw-page daily hits scatter (invariants)
purpose:
- Produce deterministic, human-meaningful MediaWiki page-level analytics from nginx bucket TSVs, folding all URL variants to canonical resources and rendering a scatter projection across agents, HTTP methods, and outcomes.
Authority
- These invariants are normative.
- The assisting system MUST follow them exactly.
- Visual encoding is a governed semantic property, not a presentation choice.
Inputs
- Bucket TSVs produced by page_hits_bucketfarm_methods.pl
- Required columns:
- server_name
- path (or page_category)
- <agent>_<METHOD>_<outcome> numeric bins
- Other columns MAY exist and MUST be ignored unless explicitly referenced.
Scope
- Projection MUST bind to exactly ONE nginx virtual host at a time.
- Example: publications.arising.com.au
- Cross-vhost aggregation is prohibited.
Wiki Farm Canonicalisation (Mandatory)
- Each MediaWiki instance in a farm is identified by a (vhost, root) pair.
- Each instance exposes paired URL forms:
- /<x>/<Title>
- /<x-dir>/index.php?title=<Title>
- For the bound vhost:
- /<x>/ and /<x-dir>/index.php MUST be treated as equivalent roots.
- All page hits MUST be folded to a single canonical resource per title.
- Canonical resource key:
- (vhost, canonical_title)
Resource Extraction Order (Mandatory)
- URL-decode the request path.
- Extract title candidate:
- If path matches ^/<x>/<title>, extract <title>.
- If path matches ^/<x-dir>/index.php?<query>:
- Use title=<title> if present.
- Else MAY use page=<title> if present.
- Otherwise the record is NOT a page resource.
- Canonicalise title:
- "_" → space
- UTF-8 dashes (–, —) → "-"
- Collapse whitespace
- Trim leading/trailing space
- Apply namespace exclusions.
- Apply infrastructure exclusions.
- Apply canonical folding.
- Aggregate.
Namespace Exclusions (Mandatory)
Exclude titles with case-insensitive prefix:
- Special:
- Category:
- Category talk:
- Talk:
- User:
- User talk:
- File:
- Template:
- Help:
- MediaWiki:
- Obvious misspellings (e.g. Catgeory:) SHOULD be excluded.
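The prefix test above can be sketched as a case-insensitive `startswith` check. The function name is hypothetical, and the misspelling rule (e.g. Catgeory:) is deliberately not covered here because the source does not define which misspellings count.

```python
EXCLUDED_PREFIXES = (
    "special:", "category:", "category talk:", "talk:", "user:",
    "user talk:", "file:", "template:", "help:", "mediawiki:",
)

def is_excluded_namespace(title):
    """True if the canonical title starts with an excluded namespace prefix."""
    return title.casefold().startswith(EXCLUDED_PREFIXES)
```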
Infrastructure Exclusions (Mandatory)
Exclude:
- /
- /robots.txt
- Any path containing "sitemap"
- Any path containing /resources or /resources/
- /<x-dir>/index.php
- /<x-dir>/load.php
- /<x-dir>/api.php
- /<x-dir>/rest.php/v1/search/title
Exclude static resources by extension:
- .png .jpg .jpeg .gif .svg .ico .webp
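A possible shape for the infrastructure filter, with assumptions marked: `x_dir` stands in for the `<x-dir>` placeholder and defaults to `pub-dir` purely for illustration, the "sitemap" match is made case-insensitive, and `/<x-dir>/index.php` is read as the bare endpoint with no query string.

```python
STATIC_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".webp")

def is_infrastructure(path, x_dir="pub-dir"):
    """True if the request path is infrastructure rather than a page resource."""
    bare = path.split("?", 1)[0]                      # path without query string
    if path in ("/", "/robots.txt", f"/{x_dir}/index.php"):
        return True
    if "sitemap" in path.lower() or "/resources" in path:
        return True
    if bare in (f"/{x_dir}/load.php", f"/{x_dir}/api.php"):
        return True
    if bare.startswith(f"/{x_dir}/rest.php/v1/search/title"):
        return True
    return bare.lower().endswith(STATIC_EXTENSIONS)   # static resources by extension
```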
Aggregation Invariant (Mandatory)
- Aggregate across ALL rollup buckets in the selected time span.
- GROUP BY canonical resource.
- SUM all numeric <agent>_<METHOD>_<outcome> bins.
- Each canonical resource MUST appear exactly once.
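The GROUP BY and SUM above can be illustrated with nested dictionaries. The input shape, an iterable of `(canonical_key, {bin_name: count})` pairs, is an assumption made for this sketch; real bucket TSV rows would first need to be parsed and folded to their canonical key.

```python
from collections import defaultdict

def aggregate_bins(rows):
    """Sum every <agent>_<METHOD>_<outcome> bin per canonical resource."""
    totals = defaultdict(lambda: defaultdict(int))
    for key, bins in rows:
        for name, count in bins.items():
            totals[key][name] += count
    # Plain dicts: each canonical resource appears exactly once as a key.
    return {key: dict(bins) for key, bins in totals.items()}
```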
Human Success Spine (Mandatory)
- Define ordering metric:
- HUMAN_200_304 := human_GET_ok + human_GET_redir
- This metric is used ONLY for vertical ordering.
Ranking and Selection
- Sort resources by HUMAN_200_304 descending.
- Select Top-N resources (default N = 50).
- Any non-default N MUST be declared in run metadata.
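A short sketch of the ordering metric and Top-N selection. It assumes the aggregated data is a mapping from canonical key to summed bins, and it breaks ties by key; the tie-break is an assumption added so that identical inputs always give the same order.

```python
def human_200_304(bins):
    """HUMAN_200_304 := human_GET_ok + human_GET_redir (missing bins count as 0)."""
    return bins.get("human_GET_ok", 0) + bins.get("human_GET_redir", 0)

def top_n(aggregated, n=50):
    """Canonical keys sorted by HUMAN_200_304 descending, truncated to n."""
    ranked = sorted(aggregated, key=lambda key: (-human_200_304(aggregated[key]), key))
    return ranked[:n]
```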
Rendering Invariants (Scatter Plot)
Axes
- X axis MUST be logarithmic.
- X axis MUST include log-paper verticals:
- Major: 10^k
- Minor: 2..9 × 10^k
- Y axis lists canonical resources ordered by HUMAN_200_304.
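The log-paper verticals can be computed directly from the decade range. The helper below is a hypothetical sketch that takes integer exponents `k_min` and `k_max` and returns the major and minor gridline positions; how these are handed to the plotting library is left open.

```python
def log_paper_ticks(k_min, k_max):
    """Major ticks at 10^k and minor ticks at 2..9 x 10^k for k_min <= k <= k_max."""
    major = [10 ** k for k in range(k_min, k_max + 1)]
    # Minor ticks fill each decade below the last major tick.
    minor = [m * 10 ** k for k in range(k_min, k_max) for m in range(2, 10)]
    return major, minor
```

With Matplotlib, the same minor positions are commonly obtained via `LogLocator(base=10, subs=range(2, 10))` on a log-scaled axis.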
Baseline Alignment
- The resource label baseline MUST align with human_GET_ok.
- human_GET_ok points MUST have vertical offset = 0.
- Draw faint horizontal baseline guides for each resource row.
Category Plotting
- ALL agent/method/outcome bins present MUST be plotted.
- No category elision, suppression, or collapsing is permitted.
Intra-Row Separation
- Apply deterministic vertical offsets per METHOD_outcome key.
- Offsets MUST be stable and deterministic.
- human_GET_ok is exempt (offset = 0).
Redirect Jitter
- human_GET_redir MUST receive a small fixed positive offset (e.g. +0.35).
- Random jitter is prohibited.
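One way to satisfy the separation and jitter rules is to derive each offset from a fixed enumeration of METHOD_outcome keys, so the result never depends on the data. The step size, the key order, and the choice to skip offset 0 for non-baseline keys are all assumptions of this sketch; only the 0 and +0.35 values for human_GET_ok and human_GET_redir come from the invariants above.

```python
METHODS = ("GET", "POST", "PUT", "HEAD", "OTHER")
OUTCOMES = ("ok", "redir", "client_err", "server_err", "other")
KEYS = [f"{m}_{o}" for m in METHODS for o in OUTCOMES]   # fixed, data-independent order
STEP = 0.03                                              # assumed spacing between slots

def row_offset(agent, method, outcome):
    """Deterministic vertical offset for one bin within a resource row."""
    if (agent, method, outcome) == ("human", "GET", "ok"):
        return 0.0                                       # baseline, exempt from offsets
    if (agent, method, outcome) == ("human", "GET", "redir"):
        return 0.35                                      # fixed positive redirect offset
    method = method if method in METHODS else "OTHER"
    outcome = outcome if outcome in OUTCOMES else "other"
    slot = KEYS.index(f"{method}_{outcome}") - len(KEYS) // 2
    if slot >= 0:
        slot += 1                                        # keep offset 0 for the baseline only
    return round(slot * STEP, 2)
```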
Encoding Invariants
Agent Encoding
- Agent encoded by colour.
- badbot MUST be red.
Method Encoding
- GET → o
- POST → ^
- PUT → v
- HEAD → D
- OTHER → .
Outcome Overlay
- ok → no overlay
- redir → diagonal slash (/)
- client_err → x
- server_err → x
- other/unknown → +
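The method and outcome encodings above map naturally onto lookup tables. The sketch below assumes bin names follow `<agent>_<METHOD>_<outcome>` with no underscore inside the agent or method parts; `encode_bin` is a hypothetical helper, and agent colours are omitted because only badbot's colour is fixed by the source.

```python
METHOD_MARKERS = {"GET": "o", "POST": "^", "PUT": "v", "HEAD": "D"}
OUTCOME_OVERLAYS = {"ok": None, "redir": "/", "client_err": "x", "server_err": "x"}

def encode_bin(bin_name):
    """Split a bin name and return (agent, marker, overlay)."""
    agent, method, outcome = bin_name.split("_", 2)   # outcome may contain "_"
    marker = METHOD_MARKERS.get(method, ".")          # OTHER methods use "."
    overlay = OUTCOME_OVERLAYS.get(outcome, "+")      # other/unknown outcomes use "+"
    return agent, marker, overlay
```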
Legend Invariants
- Legend MUST be present.
- Legend title MUST be exactly: Legend
- Legend MUST explain:
- Agent colours
- Method shapes
- Outcome overlays
- Legend MUST NOT overlap resource labels.
Determinism
- No random jitter.
- No data-dependent styling.
- Identical inputs MUST produce identical outputs.
Validation Requirements
- No duplicate logical pages after canonical folding.
- HUMAN_200_304 ordering is monotonic.
- All plotted points trace back to bucket TSV bins.
- /<x> and /<x-dir> variants MUST fold to the same canonical resource.
- END_MWDUMP

