Saturday, December 26, 2009
Google thinks I've been blogging for 173 years. I know this because, on a regular basis, the Google bot fetches blog entries from dates far in the past and far in the future. The Google bot is slowly working its way through time, month by month (Google is just being very thorough).
I noticed this in my web server logs (you don't scan your logs once in a while?). I found many entries generated by Google bot fetches, and a bunch of them looked like this:
"GET /?month=11&year=2039 HTTP/1.1"
"GET /?month=12&year=2039 HTTP/1.1"
"GET /?month=1&year=2040 HTTP/1.1"
The earliest year was 1926 and I was about to break into Y2.1K on the high end. I was amused at first: Google was proactively fetching future blog entries in eager anticipation of their high quality and entertainment value. I fetched one of the future-date pages and, sure enough, a valid page came back with no errors, the calendar showing the correct future month along with links to the previous and next months. OK, that explains why Google is crawling forward in time: my silly blog software is publishing future links. I checked my front page and was surprised to see that the link to the future date was not there. Apparently the software is intelligent enough not to link into a future month when you are on the current month. It's just not smart enough to leave out the future link when you happen to fetch a future month. Sounds like a simple fix to me.
But we do have a mystery: how did Google get that first link into the future if it's not published on my main page? I first assumed that Google was smart enough to manipulate patterns like "year=2009", increment to "year=2010", and walk through the entries that way, but that's just asking for bad links, so I doubt it. I think it's more likely that there was a timing window (daylight saving time, or possibly when I moved from the old server HW to the new server) in which a date mismatch caused a future link to be generated, which started Google on the path forward (and backwards). It does make you wonder, what do these guys do? They generate future links from year to year, and I suspect they (and many other calendar web pages) will have Google and other sites fetching pages through time.
So, Google follows these links to the next month and the previous month, and by the time I caught up with it, 173 years of blogging had passed. For reference, in the full log set (6 months' worth), there were 5300 GETs with past or future dates.
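A count like that can be sketched in a few lines of JavaScript. This is only an illustration, not the command actually used on the logs: the function name `countOutOfRange`, the `first`/`last` bounds, and the January 2005 start date are all assumptions, and the sample lines simply mirror the log format shown above (the 1926 line is made up for the example).

```javascript
// Convert a (year, month) pair into a single comparable month number.
function monthIndex(year, month) {
  return year * 12 + (month - 1);
}

// Hypothetical helper: classify logged "?month=..&year=.." requests as
// before the first real post, after the latest real month, or in range.
function countOutOfRange(logLines, first, last) {
  const lo = monthIndex(first.year, first.month);
  const hi = monthIndex(last.year, last.month);
  const pattern = /GET \/\?month=(\d{1,2})&year=(\d{1,4}) HTTP/;
  const counts = { past: 0, future: 0, inRange: 0 };

  for (const line of logLines) {
    const match = pattern.exec(line);
    if (!match) continue; // not a calendar request
    const index = monthIndex(Number(match[2]), Number(match[1]));
    if (index < lo) counts.past++;
    else if (index > hi) counts.future++;
    else counts.inRange++;
  }
  return counts;
}

// Sample input in the same shape as the log entries above.
const sampleLines = [
  '"GET /?month=11&year=2039 HTTP/1.1"',
  '"GET /?month=12&year=2039 HTTP/1.1"',
  '"GET /?month=1&year=2040 HTTP/1.1"',
  '"GET /?month=12&year=2009 HTTP/1.1"',
  '"GET /?month=6&year=1926 HTTP/1.1"', // made-up past request
];

const sampleCounts = countOutOfRange(
  sampleLines,
  { year: 2005, month: 1 },  // assumed first month with real posts
  { year: 2009, month: 12 }  // month this post was written
);
console.log(sampleCounts); // { past: 1, future: 3, inRange: 1 }
```

Comparing a single month number instead of separate month and year fields keeps the range test to two comparisons and avoids the year-boundary mistakes that are easy to make otherwise.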
I decided that instead of waiting to see "year=3001" in my blogs at some point, I should fix the code. Should be easy, right? Well, yes it was. It was so easy I fixed it three times. I check the main page and see a juicy-looking "calendar.js" script reference, so I grab that file. I take a couple of peeks and notice the code has a check for the current date and doesn't insert a future link if you are in the current month (m==currentMonth && y==currentYear). Clearly, excluding only the current month is bogus. I tack on the obvious (y > currentYear) and think to myself, "that was easy". Time to test, and of course the change has no effect. I futz with it a few times and nope, the changes have no effect. I hack in some obvious visible changes ("December" -> "HACK DECEMBER") so I can see them and know I am in the right file. Nope, this code is not running.
So, grep around (again) and voilà! I find a *second* file that looks "calendar like" but is not called calendar (hidden code, crafty!), and again I find the exact same code, apply the same changes, and retest. Nope, no effect. This is getting silly. So, first, to prove the files are the real ones, I use the old trick: rename the two files to be sure they are being used at all, then rerun the tests. Nope, taking the files away has no effect. Obi-Wan speaks to me: "These are not the files you're looking for."
Grep around some more, and hey, even more calendar code! (Must have been a three-for-one sale.) And voilà, the same code is there too, with an incorrect comment that says "prevent future dates from being displayed" and, of course, the exact same bogus code.
Third time's the charm, and the link into the future is gone, so the next time Google grabs the last cached link, it will stop the eternal march into the future (and it did). I also fixed the eternal past links, choosing an arbitrary cutoff so that Google doesn't march its way down to 1/1/0001 or something, since those old blog entries are soooooo embarrassing.
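For anyone with the same bug, here is a minimal sketch of the kind of guard that closes both ends. It is not the blog software's actual code: `calendarLinks`, `monthIndex`, and the `EARLIEST` cutoff of January 2000 are hypothetical names and values chosen for illustration. The idea is to test the month a link *points to*, rather than only checking whether the page being shown is the current month.

```javascript
// Assumed cutoff, purely illustrative; the real one was picked arbitrarily.
const EARLIEST = { year: 2000, month: 1 };

// Convert a (year, month) pair into a single comparable month number.
function monthIndex(year, month) {
  return year * 12 + (month - 1);
}

// Hypothetical helper: return the prev/next calendar links for the month
// being displayed, or null where the target month is outside the allowed
// range [earliest, current month].
function calendarLinks(year, month, today, earliest = EARLIEST) {
  const shown = monthIndex(year, month);
  // Date.getMonth() is 0-based, so add 1 to get a calendar month.
  const current = monthIndex(today.getFullYear(), today.getMonth() + 1);
  const oldest = monthIndex(earliest.year, earliest.month);

  const inRange = (index) => index >= oldest && index <= current;
  const toLink = (index) =>
    `/?month=${(index % 12) + 1}&year=${Math.floor(index / 12)}`;

  return {
    prev: inRange(shown - 1) ? toLink(shown - 1) : null,
    next: inRange(shown + 1) ? toLink(shown + 1) : null,
  };
}

const today = new Date(2009, 11, 26); // December 26, 2009

console.log(calendarLinks(2009, 12, today));
// { prev: '/?month=11&year=2009', next: null }

console.log(calendarLinks(2039, 11, today));
// { prev: null, next: null }
```

With this shape, a crawler that still holds a cached link to November 2039 gets a page with no onward links in either direction, so the walk stops there instead of continuing month by month.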