July 15th, 2014 10:10 pm by Vincent Flanders
Vincent Flanders’ comments: There has been a lot of positive press about the Internet Archive’s Wayback Machine (IAWM) grabbing its 400 billionth web page. Well, it should be negative press. The Wayback Machine does a crappy job of grabbing pages. It’s a freaking joke. I know you don’t believe it, but I’m restraining my condemnation of this sh*t-hole website.
Here’s The New Jersey State Policemen’s Benevolent Association’s home page back in May 2012 per the Wayback Machine. What a sucky job. There are graphics that didn’t get captured. Oh. Here’s the Wayback Machine’s first capture of the website from June 1, 1997. Sucks. IAWM claims the problem is that the site uses a robots.txt file to exclude graphics. I don’t know if this is true. If it is, then why are they grabbing the pages? They’re worthless without images.
Actually, I hesitate to call these people liars, but I went to a recent Daily Sucker (MGBD Parts and Services), looked for their robots.txt file, and didn’t find one. It could be hidden (and it’s also a WordPress-based site), so maybe they’re telling the truth. However, Google doesn’t like graphics and text to be hidden by robots.txt, and I can’t imagine any website would hide pictures on purpose (I can’t see anybody hotlinking the images because they want them on their own website). If you go to the IAWM 2014 MGBD capture page, you’ll see some pictures came through and others didn’t.
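If you want to check a robots.txt claim yourself, here’s a rough Python sketch using the standard library’s `urllib.robotparser`. The sample rules and the paths are made up for illustration, and `ia_archiver` is my assumption for the name the archive’s crawler answers to; swap in whatever the real file and crawler use. It only tells you what the rules say, not what IAWM actually does with them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# Paste in the real file's text if the site actually has one.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""


def blocked_paths(robots_txt, paths, user_agent="ia_archiver"):
    """Return the paths that the given rules forbid this user agent to fetch.

    The default user agent is an assumption, not a confirmed crawler name.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


# Hypothetical paths: one page and one graphic.
print(blocked_paths(SAMPLE_ROBOTS_TXT, ["/index.html", "/images/logo.gif"]))
```

With the made-up rules above, only the graphic gets flagged. With an empty robots.txt, nothing does, which means a missing file can’t be the excuse for missing pictures.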
I don’t know what is causing the problems, but whatever is going on really makes the site useless.