Thursday, February 15
The crawl rate keeps rising, and it’s starting to hammer the site pretty hard. One section of the site is getting hit by SemrushBot, which isn’t well-behaved, and server load is rising as a result. I’m seriously thinking about rewriting that part of the site to be less of a spider trap.
Sometimes, Google Search Console tells you what to fix. Shortly after correcting thousands of news URLs, I saw this in my reports of duplicate pages:
Those are URLs I had already started to fix, so they may drop off the report. Many of the other URLs in the reports were from the Calendar section, too.
The Cost of Fixing URLs
The last couple of days were spent recovering from personal things and working on the Calendar section of the site. There were a number of problems, but the biggest challenge was simplifying the event page URLs so they had only one parameter, “event_id”. The old URLs had the event_id plus three more fields of date information. That date information was read from the URL to construct the link from the event back to the calendar. In other words, the event URL carried everything needed to build its own backlink. (That’s so 1990s. It’s also lame.)
(It may also have been an XSS or CSRF risk, if someone received a link with malicious code in it, opened it, and then clicked on the “back to Calendar” link. That’s not likely to happen, but I think there was a little bit of risk.)
This long URL was causing a problem, because the page was based on the event_id, not the other parameters. That meant that the query strings ?event_id=1234&day18 and ?event_id=1234&day999 and ?event_id=1234 all produced the same page, causing Google to see duplicate content. I wanted to make the shortest URL the canonical URL.
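To make the idea concrete, here is a minimal sketch in Python of collapsing every variant to the short form. It is not the site's actual code: the host and path (example.com/calendar/event), the function names, and the assumption that event_id is always a plain number are all made up for illustration.

```python
import html
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def canonical_event_url(url: str) -> str:
    """Return the URL with every query parameter except event_id removed."""
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    event_ids = params.get("event_id")
    # Assumption: event IDs are plain ASCII digits, like 1234.
    if not event_ids or not (event_ids[0].isascii() and event_ids[0].isdigit()):
        raise ValueError(f"missing or malformed event_id in {url!r}")
    query = urlencode({"event_id": event_ids[0]})
    # Drop the fragment as well; only scheme, host, path and event_id survive.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def canonical_link_tag(url: str) -> str:
    """Build a <link rel="canonical"> tag pointing at the short URL."""
    href = html.escape(canonical_event_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'


if __name__ == "__main__":
    # Hypothetical host and path, used only for the demonstration.
    print(canonical_event_url("https://example.com/calendar/event?event_id=1234&day18"))
    print(canonical_link_tag("https://example.com/calendar/event?event_id=1234&day999"))
```

The same function could feed either a canonical link tag in the page head or a 301 redirect from the long URLs to the short one. Which of the two is the better fit depends on how many old links are still floating around.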
The challenge was recovering the date information from the event, rather than from the URL. To this end, I could have queried the database for every event, but that could consume a lot of time, because several tables are joined. The application was already set up to cache snippets of HTML to files, so I rewrote the caching layer to encapsulate all the caching code.
In addition to the HTML file, I saved out a JSON file with the date information, so I could recover the date info without much effort.
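A rough sketch of that sidecar idea, in Python, might look like the following. The file naming (1234.html next to 1234.json), the year/month/day fields, the function names, and the /calendar backlink format are assumptions for the example, not what the site really uses.

```python
import json
from datetime import date
from pathlib import Path


def write_event_cache(cache_dir, event_id: int, html: str, event_date: date) -> None:
    """Write the HTML fragment plus a small JSON file holding the event date."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / f"{event_id}.html").write_text(html, encoding="utf-8")
    # Assumed layout: three date fields, stored next to the HTML file.
    meta = {"year": event_date.year, "month": event_date.month, "day": event_date.day}
    (cache_dir / f"{event_id}.json").write_text(json.dumps(meta), encoding="utf-8")


def read_event_date(cache_dir, event_id: int) -> date:
    """Recover the event date from the JSON file, without touching the database."""
    path = Path(cache_dir) / f"{event_id}.json"
    meta = json.loads(path.read_text(encoding="utf-8"))
    return date(meta["year"], meta["month"], meta["day"])


def calendar_backlink(event_date: date) -> str:
    """Build the link back to the calendar (hypothetical URL format)."""
    return f"/calendar?year={event_date.year}&month={event_date.month}"
```

With this in place, the event page only needs the event_id: it reads the JSON file, rebuilds the date, and generates the “back to Calendar” link itself instead of trusting whatever arrived in the URL.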
There were also a few other improvements: spreading out the cache files across subdirectories, so no single directory hits the 15,000 file slowdown; rendering the HTML fragments on demand, when they’re requested; and adding exceptions and exception handling to deal with bad event_ids more gracefully.
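Those three improvements could be sketched roughly like this in Python. The shard count of 100, the BadEventId exception name, and the render callback are all invented for the example; the real code may split directories and report errors quite differently.

```python
from pathlib import Path


class BadEventId(Exception):
    """Raised when an event_id is malformed or does not match any event."""


def shard_path(cache_root, event_id: int, suffix: str = ".html", shards: int = 100) -> Path:
    """Spread cache files over `shards` subdirectories, keyed on the event_id."""
    bucket = event_id % shards
    return Path(cache_root) / f"{bucket:02d}" / f"{event_id}{suffix}"


def get_fragment(cache_root, raw_event_id, render) -> str:
    """Return the cached HTML fragment, rendering it on first request.

    `render` is a hypothetical callback that builds the HTML for an
    event_id and raises BadEventId if the event does not exist.
    """
    try:
        event_id = int(raw_event_id)
    except (TypeError, ValueError):
        raise BadEventId(f"not a number: {raw_event_id!r}") from None
    if event_id <= 0:
        raise BadEventId(f"out of range: {event_id}")

    path = shard_path(cache_root, event_id)
    if path.exists():
        return path.read_text(encoding="utf-8")

    html = render(event_id)  # only runs on a cache miss
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

The page handler can then catch BadEventId in one place and answer with a 404, rather than letting a bad ID turn into a database error or a blank page.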
It was an ugly, time-consuming retrofit, but it worked after a while.
Lesson: get the URLs right at the start of a project… not 18 years later.