on may 15th, 2005
tagged nerd, web
i really don't like the wayback machine. it says that to remove things from its archive, you can create a robots.txt file which will:
except that it doesn't remove any documents from their archive; it just temporarily blocks access to them. every time someone requests pages from their cache for your domain, their spider checks your site's robots.txt file in real time. if robots.txt doesn't exclude their "ia_archiver" at that specific moment, access to the archives is granted.
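to make the check concrete, here is a rough sketch of how one could test whether a given robots.txt body excludes "ia_archiver". it uses python's standard urllib.robotparser, which is not necessarily how archive.org evaluates the file. the ROBOTS_TXT contents and the is_excluded helper are my own illustrations, not anything taken from their documentation.

```python
from urllib.robotparser import RobotFileParser

# illustrative robots.txt body that excludes only the "ia_archiver" spider
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_excluded(robots_body, user_agent="ia_archiver", url="/"):
    """return True if robots_body disallows user_agent from fetching url.

    hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed
    parser.parse(robots_body.splitlines())
    return not parser.can_fetch(user_agent, url)
```

calling is_excluded(ROBOTS_TXT) reports whether the spider is shut out at the moment of the check, which is the only thing that seems to matter for access to the archives.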
so, if you are blocking access to their archives of your site and your site becomes unavailable at any point, their site fails open and allows access to the archives. i tested this by blocking their spider's IP and then trying to access the archives of a specific site that excludes them in robots.txt. sure enough, they couldn't reach my site and allowed access to all of the archives. i unblocked them and reloaded the cached page; it checked robots.txt again and gave me an error saying that access was denied.
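the behavior i saw can be modeled roughly like this. it is only a guess at the logic based on what i observed from the outside, not their actual code. archive_access_allowed, fetch_robots, and the fail_open flag are all hypothetical names, and robots.txt parsing is again done with python's urllib.robotparser as a stand-in.

```python
from urllib.robotparser import RobotFileParser


def archive_access_allowed(fetch_robots, fail_open=True):
    """decide whether to serve archived pages, checking robots.txt at request time.

    fetch_robots is a hypothetical callable that returns the site's robots.txt
    body as a string, or raises OSError when the site can't be reached.
    """
    try:
        body = fetch_robots()
    except OSError:
        # site unreachable: failing open serves the archive anyway,
        # failing closed would keep it blocked
        return fail_open
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    # access is allowed only if "ia_archiver" is not excluded right now
    return parser.can_fetch("ia_archiver", "/")


def site_down():
    # simulates the spider being blocked or the site being offline
    raise ConnectionError("could not reach site")


def site_up():
    # simulates a reachable site that excludes the spider
    return "User-agent: ia_archiver\nDisallow: /\n"
```

with these stand-ins, archive_access_allowed(site_up) denies access while archive_access_allowed(site_down) grants it, which matches what i saw. passing fail_open=False shows the fail-closed alternative, where an unreachable site keeps its archives blocked.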
and so, if you had a site that no longer exists, but you did have a robots.txt file preventing them from spidering it while it was still up, all of those old archives are available now (and indefinitely, if a domain squatter snatches up the expired domain and sits on it without blocking access to the archiving spider).
i e-mailed a removal request for that old site to the archive.org people to permanently delete the archives. i am curious to see how long it will take and whether it will really be permanent.