Sunday, February 18
Time to look at some stats. These were just copied into the spreadsheet by hand. The dates are when the data was copied, and not the date of the latest data in the system. The data lags by around four days.
The overall size of the index has held pretty steady, growing just 0.35%, but some of the reasons for exclusion have changed considerably.
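For anyone repeating this from their own spreadsheet, the percentages in this post are plain relative changes between two snapshots. A minimal sketch follows; the function name and the numbers are hypothetical, chosen for illustration, and are not the real values from the spreadsheet.

```python
def percent_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    if old == 0:
        raise ValueError("percent change is undefined when the old value is 0")
    return (new - old) / old * 100.0


# Hypothetical snapshot values, for illustration only.
previous_count = 1000
current_count = 1035
print(f"{percent_change(previous_count, current_count):+.2f}%")
```

A negative result means the count went down between the two snapshots.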
Duplicate pages without canonical tags have decreased 7.6%. This is probably due to adding canonical tags to pages.
Pages with redirects have increased 26%! Of course, that’s because we added a ton of redirects for /news/ URLs that didn’t end in .php or .html.
I also added a redirect from www.la to plain la. That may have spiked the value.
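The www rule boils down to mapping every URL on the www host to the same URL on the bare host. Here is a rough sketch of that mapping in Python. It is only an illustration of the logic, not the actual server rule; the function name and the example.com host are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www(url: str) -> str:
    """Return the URL with a leading 'www.' removed from the host.

    Simplification: any user:password@ part of the URL is dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host.startswith("www."):
        return url  # already on the bare host, no redirect needed
    netloc = host[len("www."):]
    if parts.port:
        netloc = f"{netloc}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Hypothetical host, for illustration only.
print(strip_www("https://www.example.com/news/story.html?page=2"))
```

Every www URL that a crawler already knew about becomes a redirect under a rule like this, which would be consistent with a jump in the redirect count.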
A number of URLs also moved from one month to another. I’m not sure why this is, but it probably has something to do with timezone settings, or the system time. Those moves would have caused redirects.
The number of pages excluded by “noindex” shot up, to nearly four times its previous value. That’s because we have more “noindex” pages now. All the calendar_edit_delete.php pages are now noindex. That script needs to be modified to use a POST form, so the URL doesn’t carry a parameter. I also added “noindex” to the tags pages (but not to the article lists for a tag).
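One way to spot-check that a page really carries the tag is to parse its HTML and look for a robots meta tag. This is a small sketch using only the Python standard library. The class and function names are made up for the example, and it only checks the meta tag, not an X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a robots (or googlebot) meta tag contains 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


print(has_noindex('<head><meta name="robots" content="noindex, follow"></head>'))
```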
Pages that are indexed but blocked by robots.txt have declined 37%. I’m not sure exactly why, but I have removed most of the URLs from robots.txt, which probably accounts for it.
Soft 404s spiked, due to server load issues. I eventually found a fix: adding a database table index.
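To show why one index can make that much difference, here is a self-contained sketch using SQLite from Python. Everything in it is an assumption for illustration: the articles table, the slug column, and the index name are made up, and the real site uses a different database and schema. The point is only that the query plan changes from scanning the whole table to searching the index.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, slug TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles (slug, body) VALUES (?, ?)",
    [(f"story-{i}", "text") for i in range(1000)],
)


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's plan for looking up an article by slug."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM articles WHERE slug = ?",
        ("story-500",),
    ).fetchall()
    # The fourth column of each row is the human-readable plan detail.
    return " | ".join(row[3] for row in rows)


plan_before = query_plan(conn)  # no index yet: a full table scan
conn.execute("CREATE INDEX idx_articles_slug ON articles (slug)")
plan_after = query_plan(conn)   # with the index: a search using idx_articles_slug

print(plan_before)
print(plan_after)
```

With a scan, every lookup reads the whole table, so the cost grows with the table and with the number of simultaneous requests. That kind of load is what can turn into slow or empty responses.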
The number of indexed pages has increased steadily, growing by 7%. Given that the overall total has held steady, it means more URLs are in the usable index, and fewer are excluded. Yay!
The crawl rate also appears to have increased, but we can’t be sure about that until more data comes in.
Overall, I think things are getting better. I hope, as Google revisits more pages, it’ll make better decisions about which page is the real canonical page, and which is a duplicate, and retain one page.