As I mentioned in the final recap, one thing occupying me over the past few months has been the pursuit of classic website restoration. We already have car restoration and computer restoration, but despite websites being similarly satisfying and full of moving parts, I don’t see anyone trying to rebuild old websites and return them to their original browsable condition. With the Somnolescent Archives, I have the perfect reason to do just that.
I wanna ramble about that for a bit, tell you my working methods for getting assets (from the Wayback Machine or otherwise), reassembling them, cleaning things up, and why I find it so enjoyable. Hopefully, you do as well!
Why the time machine?
archives was originally a file dump for formerly very online things that were no longer available online. Archived websites have since become its main focus, specifically the websites we make here at Somnolescent. It only makes sense; web hosting is just uploading some files to a server and then browsing to the right directory on the server with a web browser. These might’ve originally been available on entirely different domains, but they can just as easily be put in a folder on the same domain and exist side-by-side with one another.
In July 2019, I started to keep monthly (and now seasonal) backups of every site on our site network. Thing is, only I have access to these backups, so while I can retrieve files for people, they can’t browse through them themselves. Given how much time we spend building these sites, debugging them, and drawing and creating assets for them, it seemed like a total waste to effectively lose all that work every time we came up with a new design and replaced the old one.
Now, most people use the Wayback Machine through the Internet Archive for looking at earlier versions of web pages. This is a fantastic resource, with saved copies of pages no one else has, but it’s not without its limitations.
- The Wayback Machine is slow. By necessity, to keep the world at large from taxing the infrastructure, download speeds through the Wayback Machine are throttled. This makes already bloated pages feel that much more sluggish.
- The Wayback Machine doesn’t perfectly grab websites. Web spiders do their best to download an entire page, but when you have to download millions upon millions of sites, invariably, you only grab half of one sometimes. At best, images go missing. At worst, entire sections of sites don’t get saved.
Enter the Great Somnolescent Time Machine. The goal is to gather up all our old sites and site designs, keep links equal, make them function as close to how they did when they were live as possible, if not exactly like it, and keep them loading as fast as they did at the time, if not faster. The results? See for yourself.
That said, this isn’t exclusive to Somnolescent sites. I also used this method to restore a backup of an old site called The Forge, which was dedicated to Quake level design using a level editor called Worldcraft. I paired my Wayback grabs with the downloads from the old Planet Quake FTP server and now have the only fully working version of the site in existence. For the purposes of this blog post, though, I’m focusing on making our old sites extant again.
Gathering up assets
So, if the site design I want to archive is available in my files, I just start from that. I then already have all the images, scripts, and other things, and can just go about patching them, optimizing images, and so forth. If not, I have to rely on the Wayback Machine, which is functional, if not the speediest thing. I’ll assume I don’t have the files for this first bit.
I have to take two passes at the site design with wget, one for the original, unadulterated page markup and then one for the assets. It’s not visible from the frontend, but there is a way to get the original page the Wayback Machine originally saved, and that’s by appending id_ to the end of the timestamp in the URL. If this is the page you normally get out of the Wayback Machine (page source helps prove my point), this is the URL you’d grab for the unedited page markup. The links and image URLs are all exact for when the grab was done.
Often, the unadulterated page will not look right at all, as the URLs haven’t been patched to their Wayback Machine equivalents, but that’s fine. I then run wget once for each page (with a wait time of three seconds to be courteous):
wget --page-requisites -w 3 https://web.archive.org/web/20190126225223id_/http://somnolescent.net/
wget --page-requisites -w 3 https://web.archive.org/web/20190126225223/http://somnolescent.net/
And end up with all the files needed to make that particular page tick on my computer. (The edited page, which gets downloaded with the asset grab, I simply toss as it’s not really needed and it’s bloated with scripts and long URLs that make it much harder to read.)
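If you’d rather not hand-edit the timestamp for every page, here is a minimal Python sketch of the same two-pass idea. It is not the tooling used for these restorations; the function names raw_capture_url and wget_args are made up for illustration, and the assumption that any existing flag after the timestamp is two lowercase letters plus an underscore is mine.

```python
import re

# Matches a Wayback Machine capture URL: the prefix and timestamp, an optional
# two-letter flag such as id_ (assumed shape), then the original URL.
_CAPTURE = re.compile(
    r"^(https?://web\.archive\.org/web/\d{1,14})(?:[a-z]{2}_)?(/.*)$"
)


def raw_capture_url(wayback_url: str) -> str:
    """Return the capture URL with id_ appended to the timestamp."""
    match = _CAPTURE.match(wayback_url)
    if match is None:
        raise ValueError(f"not a Wayback Machine capture URL: {wayback_url}")
    return f"{match.group(1)}id_{match.group(2)}"


def wget_args(url: str, wait_seconds: int = 3) -> list[str]:
    """Build the wget invocation from the post as an argument list."""
    return ["wget", "--page-requisites", "-w", str(wait_seconds), url]


if __name__ == "__main__":
    capture = "https://web.archive.org/web/20190126225223/http://somnolescent.net/"
    # First pass: unedited markup. Second pass: the assets.
    for url in (raw_capture_url(capture), capture):
        print(" ".join(wget_args(url)))
```

The argument lists could be handed to subprocess.run() as-is, though printing them first and running them by hand keeps the piecemeal, page-by-page approach.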
There’s probably a way to grab across an entire site, but I prefer to do things piecemeal so I know where everything goes.
Sometimes, I can get assets from another place, especially if the Wayback Machine didn’t get them all. As said, all the downloads for The Forge were on an FTP server, so the Wayback Machine couldn’t grab them. Thankfully, a separate archive of that server existed (over HTTP this time), so I was able to grab the files from that person’s mirror and marry them to the grabbed pages. The result is a fully working site, even though I had to pull from two separate mirrors.
Another good example comes from when I was trying to put back together Caby’s old Neocities sites. Sometimes, the full versions of her drawings (as she’s always had an art gallery in some form or fashion) didn’t get saved, but she’d still have her own copies that she could then pass onto me. Sometimes, you just have to get lucky. Sometimes, a full restoration isn’t possible now, but then we discover an extra copy of something missing in some files and it will one day be possible. Just gotta work with what you can get for now.
Reassembling the assets
Now that I have the files for the site on my computer, I can start putting it back together. This is partially a matter of looking through the unadulterated page source and seeing where the page author expected the assets to be. Often, the Wayback Machine will toss things into
js directories. If I wanted to, I could just change the page to look for those, but I err on the side of accuracy instead and try to recreate the original folder structure for the site in question.
This just takes time. Often, I’ll restore a page, test out all the links I can, find dead ones, and then fix them in a second pass. It’s akin to editing a document generated through OCR. There’s errors in there a computer can’t fix, even if it can catch them. It’s up to me to reassemble the site so it works as it should.
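The "find dead ones" pass can be partly automated for links that point inside the restoration itself. Below is a rough Python sketch, assuming the site sits in a local folder and that a link to a directory should resolve to an index.html inside it; missing_local_links and LinkCollector are hypothetical names, and offsite links are skipped rather than checked.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlsplit


class LinkCollector(HTMLParser):
    """Collects every non-empty href and src value in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def missing_local_links(page: Path, site_root: Path) -> list[str]:
    """Return the links in `page` whose local target file does not exist."""
    collector = LinkCollector()
    collector.feed(page.read_text(encoding="utf-8", errors="replace"))
    missing = []
    for link in collector.links:
        parts = urlsplit(link)
        if parts.scheme or parts.netloc or not parts.path:
            continue  # offsite, mailto:, or fragment-only link
        rel = unquote(parts.path)
        if rel.startswith("/"):
            target = site_root / rel.lstrip("/")  # root-relative link
        else:
            target = page.parent / rel  # link relative to the page
        if target.is_dir():
            target = target / "index.html"  # assumed directory index
        if not target.exists():
            missing.append(link)
    return missing
```

This only catches what a computer can catch. Deciding where a missing file was supposed to live is still the manual part.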
Links, both onsite and offsite, are a constant problem with site restorations. If someone links to something offsite, depending on the quality of the grab and the stability of that other site, it’s either likely there when nothing else is, or it’s not there when everything else is. Going back to the example of Caby using Tumblr, even when the rest of her gallery went missing, the Tumblr copies still existed, meaning I could still grab them and properly inline them into the restoration.
Conversely, if the other site is long gone, and there’s not an equivalent grab on the Wayback Machine, I might be left with a total inability to link to anything. In that case, I’ll usually keep the link marked as a link, but set the href attribute to go nowhere:
<a href="">This makes, effectively, a dummy link.</a>
I’ll also do that for links to parts of a site that didn’t get grabbed, so as to not affect the look of the restored page.
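If there are a lot of dead links to blank out, the swap can be scripted. This is a small Python sketch under stated assumptions: blank_dead_links is a hypothetical helper, it only understands quoted, lowercase href attributes, and the list of dead URLs is something you would compile by hand.

```python
import re

# Quoted href attributes only; unquoted or uppercase ones are left alone.
_HREF = re.compile(r'href=(["\'])(.*?)\1')


def blank_dead_links(html: str, dead_urls: set[str]) -> str:
    """Empty the href of every link whose URL is listed in dead_urls."""

    def swap(match: re.Match) -> str:
        quote, url = match.group(1), match.group(2)
        if url in dead_urls:
            return f"href={quote}{quote}"  # keep the link, drop the target
        return match.group(0)

    return _HREF.sub(swap, html)
```

The anchor text and any styling stay untouched, so the page looks the same as before.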
For the Great Somnolescent Time Machine, where I’ve restored so much of our old web presences that I can link between them, this gets extra fun. I made myself a small chart to keep track of when folks updated their sites, and because all the restorations exist in the same directory on archives (/web/), I can link between restorations like I’d link to sites on other domains. Before this, I’d simply link to the current site of the person in question, but people leave, domains change, and that’s a little anachronistic anyway.
So for example, Caby had the first version of capy.somnol up when I had the first version of mari.somnol up, in late 2018. Therefore, I simply link to /web/caby_v1/ instead of capy.somnolescent.net (which only exists as a redirect now anyway), and the timeline is kept equal. A Find in Files (find and replace across an entire directory, basically) in Notepad++ makes this a cinch, especially if I’ve already got all the files handy through a backup rather than a Wayback Machine grab.
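For anyone without Notepad++, the same find-and-replace across a directory is a few lines of Python. This is a sketch rather than a replacement for Find in Files: patch_links is a hypothetical name, the list of file extensions is an assumption, and the exact old URL string (with or without http://) depends on how the original pages wrote it.

```python
from pathlib import Path


def patch_links(
    root: Path,
    old: str,
    new: str,
    suffixes: tuple[str, ...] = (".html", ".htm", ".css", ".js"),
) -> list[Path]:
    """Replace `old` with `new` in every matching file under `root`.

    Works on raw bytes so encodings and line endings are left untouched.
    Returns the files that were actually changed.
    """
    old_bytes, new_bytes = old.encode("utf-8"), new.encode("utf-8")
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue  # skip folders, images, and other binaries
        data = path.read_bytes()
        patched = data.replace(old_bytes, new_bytes)
        if patched != data:
            path.write_bytes(patched)
            changed.append(path)
    return changed
```

Something like patch_links(Path("mari_v1"), "http://capy.somnolescent.net/", "/web/caby_v1/") would cover the example above, where mari_v1 is a made-up folder name. Run it on a copy first, since it overwrites files in place.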
I think my favorite example of link patching came when I was uploading Devon’s old matfloor sites to archives. She was using somnol.net’s hotlinked ad script, which has changed massively since the version she used in 2021. Because of my Find in Files link patching, any links to somnolescent.net already pointed to /web/somnol_v3/ instead, where an original copy of the script she was using resides. Even if we kill the ads on our live sites, the ad script can still be found there, and will still work like the hotlinked one from back then.
document.write() or something.
The goal is experience accuracy, not markup or script accuracy. For our WordPress blogs, I’ll render out a static version of them using a plugin and then upload the static pages to archives. They work identically on the user’s end, but nothing is actually grabbed from a database, which theoretically speeds up page loads massively and, of course, means no database is required.
If necessary (say, a chunk of site like the aforementioned blogs being moved to archives), I’ll set up .htaccess redirects on the original sites to make sure all links stay working, even when the files themselves have been moved. Thankfully, .htaccess makes this really simple:
Redirect 301 /blog/ http://archives.somnolescent.net/web/scratchpad/
This single line is all that’s needed to make any URLs that point to anything inside
/blog/ redirect properly to the archived copy. Any posts, any images, even other redirects, so long as the folder structure remains intact, will go to the correct place. I think this is a majorly important and underrated part of being a webmaster. If you move something, you should make sure it can always be found at the original links, even if you moved the literal files.
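To see why one line covers every post and image, it helps to model what a prefix redirect does: anything after the matched prefix gets carried over onto the target. The Python below is a simplified model of that behavior, not Apache’s actual matching code; apply_redirect is a hypothetical name, and the post path in the usage note is invented.

```python
from typing import Optional


def apply_redirect(request_path: str, url_path: str, target: str) -> Optional[str]:
    """Model a prefix redirect such as `Redirect 301 /blog/ <target>`.

    Returns the URL the visitor is sent to, or None if the rule does not
    apply. Assumes url_path ends in a slash, as in the rule above.
    """
    if not request_path.startswith(url_path):
        return None
    # Whatever followed the prefix is appended to the target unchanged.
    return target + request_path[len(url_path):]
```

With the rule above, apply_redirect("/blog/2020/01/some-post/", "/blog/", "http://archives.somnolescent.net/web/scratchpad/") gives http://archives.somnolescent.net/web/scratchpad/2020/01/some-post/, which is why the folder structure has to remain intact on the archived side.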
One last trick involving redirects: fake directories. Caby’s currently on version five of her Somnolescent site, and she hasn’t updated the design since early 2021, I think. There’s some sites on archives that, when live, linked to version five of her site. It’d make sense to link to the live caby.somnol as a result (since we’re still on version five as of writing), but when she eventually redesigns her site, not only will those links then point to a later site revision, but I’ll have to pull the archived sites back down, patch the links to the archived version five, and reupload.
Instead, on the currently archived sites, I’ll link to a nonexistent directory on archives (in this case,
/web/caby_v5/), and set that to redirect to caby.somnol in archives’ .htaccess:
Redirect 301 /web/caby_v5/ http://caby.somnolescent.net/
Thus, right now, it’ll go to her live site (seriously, try it). When she redesigns, I’ll put her current design in /web/caby_v5/, remove the redirect, and the version of the site will still be accurate.
The result of all that work is…a restored site! It works exactly as it used to, perhaps even a little better. Sometimes, I’ll correct very blatant errors in someone’s markup or a bad or malformed link and unlock functionality that didn’t work in the original version that was live. Better still, if I’m working off a backup, that means that anything that was unlisted or hidden in the original site is preserved in the restoration. Stuff Caby and I hid in our old sites for each other? Still there, and they’ll be there for all eternity.
And that’s the goal here. No matter what gets moved, no matter what we decide to do with our sites–the Great Somnolescent Time Machine will be there, safe and sound away from people accidentally deleting stuff or intentionally deleting stuff (note whose sites are up there <w<).