Forum backup (only metadata, for now)

#1
The Start

It would be a shame to see this place disappear into oblivion. I don't think there's been any indication of impending closure; in fact, as far as I'm aware there has been no indication at all, and that absence is worrying in itself. Therefore, as alluded to in The End (page 36), I ran wget (*) to download everything public on this forum. This post doesn't provide the backup; instead, let's start with an outline of how this can be accomplished, and some metadata.

(*) (Actually, there was more than one run, at least if you count partial runs. A great many redundant requests were showing up (such as per-post views, even though posts were already included in per-topic/per-page views), so the crawl was restarted 2-3 times with tightened URL rejection patterns.)

For reference, this wget parameter should filter out the redundant pages:

--reject-regex 'posting|search|print|ucp|feed|viewprofile|p=|hilit|memberlist'
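To get a feel for what that pattern throws away, here is a minimal Python sketch that applies the same alternation to a few URLs. Two assumptions to keep in mind: the `forum.example` URLs are made up for illustration (they only mimic typical phpBB paths), and Python's `re` stands in for wget's own regex engine, which should behave the same for a plain alternation like this one.

```python
import re

# Same alternation as the wget parameter above, compiled with Python's `re`
# (an assumption: wget uses its own regex engine, not Python's).
REJECT = re.compile(r"posting|search|print|ucp|feed|viewprofile|p=|hilit|memberlist")


def is_rejected(url: str) -> bool:
    """Return True if the pattern matches anywhere in the URL."""
    return REJECT.search(url) is not None


# Hypothetical URLs, only shaped like phpBB links.
sample_urls = [
    "https://forum.example/viewtopic.php?f=2&t=100",           # per-topic view
    "https://forum.example/viewtopic.php?f=2&t=100&start=15",  # per-page view
    "https://forum.example/viewtopic.php?p=12345",             # per-post view
    "https://forum.example/memberlist.php?mode=viewprofile&u=7",
    "https://forum.example/search.php?keywords=backup",
]

kept = [url for url in sample_urls if not is_rejected(url)]

for url in sample_urls:
    print("reject" if is_rejected(url) else "keep  ", url)
```

Only the per-topic and per-page views survive here, which is the point: the posts are already contained in those.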

There are probably better alternatives than wget, such as those mentioned in Rad's reply in The End page shortly after the post linked to above.



The Crawl

Raw results:

  • 1.9 GB raw crawl data (337 MB as a .tar.gz)
  • 27 831 HTML files deemed non-redundant (i.e., the rejection pattern does not match their URLs)

Retrieving this would likely take approximately 10-12 hours with the final rejection regex pattern above and a 1 second delay between requests.
(It's an inexact estimate because the restarts to tighten rejection left partial results, which the subsequent runs built on.)
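As a rough sanity check on that estimate, the arithmetic can be sketched like this. It assumes one request per non-redundant HTML file (27 831 of them) and ignores images and any other extra requests, so treat the numbers as a lower bound rather than a measurement.

```python
# Back-of-the-envelope crawl timing, assuming one request per HTML file.
HTML_FILES = 27_831   # non-redundant HTML files from the crawl
DELAY_SECONDS = 1.0   # delay between requests


def crawl_hours(requests: int, seconds_per_request: float) -> float:
    """Total crawl time in hours for a given per-request cost."""
    return requests * seconds_per_request / 3600


def implied_seconds_per_request(requests: int, hours: float) -> float:
    """Per-request cost (delay plus transfer) implied by a total duration."""
    return hours * 3600 / requests


# Time spent purely waiting between requests.
delay_only = crawl_hours(HTML_FILES, DELAY_SECONDS)

# Per-request cost implied by the 10-12 hour estimate.
low = implied_seconds_per_request(HTML_FILES, 10)
high = implied_seconds_per_request(HTML_FILES, 12)

print(f"delay alone: {delay_only:.2f} h")
print(f"implied cost per request: {low:.2f} to {high:.2f} s")
```

So the delay alone accounts for a bit under 8 hours, and the remainder of the estimate corresponds to roughly 0.3 to 0.55 seconds of transfer and server time per page.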



The Scrape

Some JavaScript code then parsed these files one by one into a handier JSON representation of the forum structure, topics and posts, which yielded this:

  • 195 MB on disk (42 MB as a .zip, or 39 MB as a .tar.gz - excluding avatars and other images)
  • 4875 topics (about 75% of the topic count shown on the board index)
  • 167 319 posts (about 98% of the post count shown on the board index)

Interestingly enough, some integrity checks showed that nothing public is missing; explicit topic and post counts listed on forum and topic views do match up with the actual numbers of topics and posts extracted from the raw crawl. In other words, every forum page seems to be accounted for, and in turn all topics linked to on those pages also seem to be present. Presumably, the "missing" topics and posts belong to a hidden forum (or several), though I have no way of telling for sure.
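The idea behind such a check can be sketched in a few lines of Python. This is not the attached integrity script (that one is JavaScript); the field names `declared_topics`, `declared_posts`, `topics` and `posts` are hypothetical stand-ins for whatever the scraped JSON actually uses.

```python
# Compare the counts a page claims with the number of items actually extracted.
# All field names below are hypothetical, chosen for illustration only.


def find_mismatches(forums):
    """Return (kind, name, declared, found) tuples wherever counts disagree."""
    problems = []
    for forum in forums:
        found_topics = len(forum["topics"])
        if forum["declared_topics"] != found_topics:
            problems.append(("forum", forum["name"], forum["declared_topics"], found_topics))
        for topic in forum["topics"]:
            found_posts = len(topic["posts"])
            if topic["declared_posts"] != found_posts:
                problems.append(("topic", topic["title"], topic["declared_posts"], found_posts))
    return problems


# Tiny made-up data set: one consistent topic, one with a missing post.
example = [
    {
        "name": "General",
        "declared_topics": 2,
        "topics": [
            {"title": "Complete topic", "declared_posts": 2, "posts": ["a", "b"]},
            {"title": "Incomplete topic", "declared_posts": 3, "posts": ["a", "b"]},
        ],
    }
]

for problem in find_mismatches(example):
    print(problem)
```

An empty result means every forum and topic view is accounted for, which is what the real check reported for the public part of the forum.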



The Woe

A personal backup of public fora might fall within fair use, or thereabouts. And I would gladly share this copy with all the rest of you, if not for two issues:

  • The legalities: personal use seems fair, but distribution at scale seems rather like it'd run afoul of some copyright consideration (since all participating forum users have of their own volition agreed to let the forum display their posts, but not to let anyone else repackage and reupload anywhere and everywhere). My user name is easily linked to other parts of my identity (and while I'm a nobody right now, I'm also a hopelessly aspiring indie game developer), so I am loath to risk association with copyright infringement by hosting this on my website. Though, for the record: I detest copyright law.
  • I'm resigned to the mediocrity of an unreliable wireless internet connection in a fibreless forest, which does not permit much self-hosting from home.

Maybe it can be uploaded somewhere, with a link finding its way to one or several moderators so they may readily partake of the data and do anything whatsoever (hopefully including a good way of sharing with the community at large). Unburdening my conscience, so to speak.
Of course, this place is evidently still around; as long as that remains true, anyone could crawl anew.



The Plots (and this is the fun part, with oversized pictures)

With that out of the way... let's indulge in some statistics:
(plotted with a Python script, using the JSON data assembled from the raw crawl data)
Plots (images omitted):

  • Posts per week
  • Registered users per week (counting only users who have posted at least once in the public fora)
  • Top 100 posting users
  • Posts per week in the General forum
  • Posts per week in the Everything & Anything forum
  • Posts per week in the Creative Writing forum
  • Posts per week in the Games forum
  • Posts per week in the Dev Logs forum
(Edited to remove one General forum plot too many.)
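For anyone wanting to reproduce a "posts per week" series, the binning step might look like the sketch below. It is not the attached plotting script. It assumes post timestamps are stored as strings in the format the forum displays (for example "Thu Nov 26, 2020 2:19 pm"), that an English locale is active for parsing, and that weeks start on Monday; the actual JSON may well store dates differently.

```python
from collections import Counter
from datetime import date, datetime, timedelta

# Assumed timestamp format, taken from how the forum displays post dates.
STAMP_FORMAT = "%a %b %d, %Y %I:%M %p"


def week_start(stamp: str) -> date:
    """Return the Monday of the week containing the given timestamp."""
    moment = datetime.strptime(stamp, STAMP_FORMAT)
    return moment.date() - timedelta(days=moment.weekday())


def posts_per_week(stamps) -> Counter:
    """Count posts per week, keyed by the Monday that starts each week."""
    return Counter(week_start(stamp) for stamp in stamps)


# A handful of made-up timestamps in the assumed format.
stamps = [
    "Thu Nov 26, 2020 2:19 pm",
    "Thu Nov 26, 2020 3:33 pm",
    "Thu Nov 26, 2020 7:03 pm",
    "Mon Nov 30, 2020 9:00 am",
]

weekly = posts_per_week(stamps)
for monday in sorted(weekly):
    print(monday.isoformat(), weekly[monday])
```

The sorted keys and counts can then be handed to any plotting library as the x and y values.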



The Code

The crawl itself can be carried out with any tool of your choosing.
The attachment contains the JavaScripts for extracting forum data from crawl results and for checking integrity, as well as a Python script for plotting.
I'm really more of a C++ (and graphics programming) person; that's my excuse for how shabby these scripts are.
Attachment: scripts.zip (7.82 KiB)

Re: Forum backup (only metadata, for now)

#2
Rexirl wrote:
Thu Nov 26, 2020 2:19 pm
Interestingly enough, some integrity checks showed that nothing public is missing; explicit topic and post counts listed on forum and topic views do match up with the actual numbers of topics and posts extracted from the raw crawl. In other words, every forum page seems to be accounted for, and in turn all topics linked to on those pages also seem to be present. Presumably, the "missing" topics and posts belong to a hidden forum (or several), though I have no way of telling for sure.
Local policy is that "Deleted" posts and threads are just moved into the trashcan subforum, which isn't public :)

and on the conscience part: do it like any good copyright-fighting person, put up a torrent, wait until your seed ratio reaches 1.1 and bug out.

#4
Cornflakes_91 wrote:
Thu Nov 26, 2020 3:33 pm
Local policy is that "Deleted" posts and threads are just moved into the trashcan subforum, which isn't public :)

and on the conscience part: do it like any good copyright-fighting person, put up a torrent, wait until your seed ratio reaches 1.1 and bug out.
Oh, I see (or, well, I see why I didn't see, you see). That explains things.

Hmm-mmh. You do have a point there, cereal poster, and I have been - and shall be - considering it. Though the backup might yet come to flow via non-torrent streams, so to say.

Silverware wrote:
Thu Nov 26, 2020 7:03 pm
... *FUCK* you've made me want to remake the limit-theory table of elements!
You know, you can't just show something that cool and expect to get away without having the idea and much of the style shamelessly appropriated. Not that you claimed to be against that, in fairness.

Way too much time was spent on this - I guess I'm just obsessive like that:

[Image]


And see what happens when an element is selected and the mouse hovers above it:
[Three images]

(edited to add a substitution rule for "oxidizing", producing "silverwizing" with the current ordering)
(Those images might have stepped just a little over to the heavy side, but... who can stand image compression artefacts?)


If you want to play with it yourselves: https://rexirl.net/miscellanea/limit-th ... -elements/
(Less than a MB to load, though this page is - relatively - horribly inefficient; it reloads all user data from the scrape and then rebuilds the table in JavaScript, regenerating element names and running various heuristic substitution rules to produce wacky descriptions with some superficial likeness of internal consistency.)

There's keyboard navigation (using the arrow keys and escape). Maybe I ought to have added something for keyboarding one's way to the tooltips. Hmm...

The source code is neither obfuscated nor minified, so feel free to look. I won't promise cleanliness.

#5
Seeing as that was EXACTLY how I did it, and was intending to redo it using the same method... (But the scraping tool I originally used is lost, and I couldn't get my new one to play nice with the phpBB login process)

This though? This is awesome, better than my original one in every conceivable way! :D