It would be a shame to see this place disappear into oblivion. I don't think there's been any indication of impending closure; in fact, as far as I'm aware there has been no indication at all, and that absence is worrying in itself. Therefore, as alluded to in The End (page 36), I ran wget (*) to download everything public on this forum. This post doesn't provide the backup; instead, let's start with an outline of how this can be accomplished, and some metadata.

(*) Actually, there was more than one run, at least if you count partial runs. A great many redundant requests were showing up (such as per-post views, even though posts were already included in per-topic/per-page views), so the crawl was restarted 2-3 times with tightened URL rejection patterns.
For reference, this wget parameter should filter out the redundant pages:
Code:
--reject-regex 'posting|search|print|ucp|feed|viewprofile|p=|hilit|memberlist'
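As a quick sanity check, the pattern can be exercised against some phpBB-style URL shapes (the URLs below are illustrative examples, not taken from the actual crawl):

```python
import re

# The rejection pattern from the wget parameter above.
reject = re.compile(r'posting|search|print|ucp|feed|viewprofile|p=|hilit|memberlist')

# Illustrative phpBB-style URLs (hypothetical, not from the crawl).
urls = [
    "viewtopic.php?f=3&t=123",          # per-topic view: keep
    "viewtopic.php?p=4567#p4567",       # per-post view: redundant
    "search.php?keywords=backup",       # search results: redundant
    "memberlist.php?mode=viewprofile",  # member list: redundant
    "viewforum.php?f=3&start=25",       # forum page: keep
]

kept = [u for u in urls if not reject.search(u)]
print(kept)  # only the topic and forum views survive
```

Note that wget applies the regex to the full URL, so a plain substring match like `p=` is enough to drop every per-post permalink.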
There are probably better alternatives to wget, such as those mentioned in Rad's reply on The End page shortly after the post linked to above.

The Crawl
Raw results:
- 1.9 GB raw crawl data (337 MB as a .tar.gz)
- 27 831 HTML files deemed non-redundant (i.e., the rejection pattern does not match their URLs)
Retrieving this would take roughly 10-12 hours with the final rejection regex above and a 1-second delay between requests.
(It's an inexact estimate because the restarts to tighten rejection left partial results, which the subsequent runs built on.)
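That figure can be roughly reconstructed from the file count alone. The per-request fetch overhead of 0.3-0.5 seconds below is an assumption, not a measured value; only the page count and the 1-second delay come from the numbers above:

```python
# Back-of-the-envelope crawl duration. The fetch overhead per request
# (0.3-0.5 s) is an assumed range, not measured.
pages = 27831   # non-redundant HTML files from the raw results
delay = 1.0     # seconds of politeness delay between requests

lower = pages * (delay + 0.3) / 3600  # optimistic fetch overhead
upper = pages * (delay + 0.5) / 3600  # pessimistic fetch overhead
print(f"{lower:.1f}-{upper:.1f} hours")
```

That lands in the 10-12 hour range, consistent with the estimate, before counting the requests the rejection pattern filtered out anyway.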
The Scrape
Some JavaScript code then parsed these files one by one into a handier JSON representation of the forum structure, topics, and posts, which came out to:
- 195 MB on disk (42 MB as a .zip, or 39 MB as a .tar.gz - excluding avatars and other images)
- 4875 topics (about 75% of the index number)
- 167 319 posts (about 98% of the index number)
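The actual extraction was done in JavaScript, but the idea can be sketched in Python against a simplified phpBB-like snippet. The markup and class names below are stand-ins; real phpBB templates vary by version and theme:

```python
import re
import json

# Simplified stand-in for one crawled topic page (hypothetical markup,
# not actual phpBB output).
html = """
<h2 class="topic-title"><a>Backing up the forum</a></h2>
<div class="post"><p class="author">alice</p><div class="content">First post.</div></div>
<div class="post"><p class="author">bob</p><div class="content">A reply.</div></div>
"""

# Pull out the topic title and each (author, body) pair.
title = re.search(r'class="topic-title"><a>(.*?)</a>', html).group(1)
posts = [
    {"author": a, "body": b}
    for a, b in re.findall(
        r'class="author">(.*?)</p><div class="content">(.*?)</div>', html
    )
]

topic = {"title": title, "posts": posts}
print(json.dumps(topic, indent=2))
```

One JSON object per topic, with the forum tree reassembled from the index pages, is enough structure for both the integrity checks and the plots.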
Interestingly enough, some integrity checks suggest that nothing public is missing: the explicit topic and post counts listed on forum and topic views match the actual numbers of topics and posts extracted from the raw crawl. In other words, every forum page seems to be accounted for, and in turn all topics linked to on those pages also seem to be present. Presumably, the "missing" topics and posts belong to a hidden forum (or several), though I have no way of telling for sure.
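The integrity check boils down to comparing the counts phpBB prints on its views against what was actually extracted. A minimal sketch, with made-up placeholder figures (the real check runs per forum and per topic over the scraped JSON):

```python
# Hedged sketch of the integrity check. "Declared" counts come from the
# "X topics / Y posts" labels printed on forum views; "extracted" counts
# come from the scraped JSON. The figures here are illustrative only.
declared = {"general": {"topics": 120, "posts": 4321}}
extracted = {"general": {"topics": 120, "posts": 4321}}

mismatches = {
    forum: (declared[forum], extracted.get(forum))
    for forum in declared
    if declared[forum] != extracted.get(forum)
}
print("OK" if not mismatches else f"mismatches: {mismatches}")
```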
The Woe
A personal backup of public fora might fall within fair use, or thereabouts. And I would gladly share this copy with all the rest of you, if not for two issues:
- The legalities: personal use seems fair, but distribution at scale seems rather like it'd run afoul of some copyright consideration (since all participating forum users have of their own volition agreed to let the forum display their posts, but not to let anyone else repackage and reupload anywhere and everywhere). My user name is easily linked to other parts of my identity (and while I'm a nobody right now, I'm also a hopelessly aspiring indie game developer), so I am loath to risk association with copyright infringement by hosting this on my website. Though, for the record: I detest copyright law.
- I'm resigned to the mediocrity of an unreliable wireless internet connection in a fibreless forest, which does not permit much self-hosting from home.
Maybe it can be uploaded somewhere, with a link finding its way to one or several moderators so they may take the data and do with it as they see fit (hopefully including a good way of sharing it with the community at large). Unburdening my conscience, so to speak.
Of course, this place is evidently still around; as long as that remains true, anyone could crawl anew.
The Plots (and this is the fun part, with oversized pictures)
With that out of the way... let's indulge in some statistics:
(plotted with a Python script, using the JSON data assembled from the raw crawl data)
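For a flavour of what such a plotting script does with the JSON before any chart is drawn, here is the kind of aggregation involved; the field names and dates below are assumptions for illustration, not from the actual data:

```python
from collections import Counter

# Hypothetical slice of the scraped JSON: each post carries an ISO date.
posts = [
    {"author": "alice", "date": "2019-03-02"},
    {"author": "bob",   "date": "2019-03-15"},
    {"author": "alice", "date": "2019-04-01"},
]

# Bucket posts by month -- the x-axis of a posts-over-time plot.
per_month = Counter(p["date"][:7] for p in posts)
print(dict(per_month))  # {'2019-03': 2, '2019-04': 1}
```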
The Code
The crawl itself can be carried out with any tool of your choosing.
The attachment contains the JavaScript code for extracting forum data from crawl results and for checking integrity, as well as a Python script for plotting.
I'm really more of a C++ (and graphics programming) person; that's my excuse for how shabby these scripts are.