November Linkscape Update is Live; Binary Files Issue Dramatically Reduced

Posted by randfish

On Thursday (November 3rd) of this past week, Linkscape’s index updated (in record time – just 3 weeks). New link data is once again available in OpenSiteExplorer, via the SEOmoz API and in the Mozbar. Here are the stats for this latest update (our 46th):

  • 43,077,387,028 (43 billion) URLs
  • 480,597,551 (480 million) Subdomains
  • 105,570,741 (105 million) Root Domains
  • 356,255,241,471 (356 billion) Links
  • Followed vs. Nofollowed
    • 2.18% of all links found were nofollowed
    • 58.21% of nofollowed links are internal, 41.79% are external
  • Rel Canonical – 10.46% of all pages now employ a rel=canonical tag
  • The average page has 77.28 links on it (down 0.19 from the last index)
    • 64.86 internal links on average
    • 12.42 external links on average
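
For anyone who wants to sanity-check these figures or turn the percentages into rough absolute counts, here is a minimal Python sketch. The variable names are ours, not anything from Linkscape, and because the published percentages are rounded, the derived counts are only estimates.

```python
# Figures copied from the index stats above; variable names are illustrative.
total_links = 356_255_241_471
nofollow_share = 0.0218            # 2.18% of all links were nofollowed
nofollow_internal_share = 0.5821   # share of nofollowed links that are internal
nofollow_external_share = 0.4179   # share of nofollowed links that are external
avg_total = 77.28                  # average links per page
avg_internal = 64.86
avg_external = 12.42

# The internal and external averages should add up to the per-page total.
assert round(avg_internal + avg_external, 2) == avg_total

# The internal/external split of nofollowed links should cover 100%.
assert round(nofollow_internal_share + nofollow_external_share, 4) == 1.0

# Rough absolute counts implied by the (rounded) percentages.
nofollowed_links = total_links * nofollow_share
external_nofollowed = nofollowed_links * nofollow_external_share

# "Down 0.19 from the last index" implies the previous per-page average.
previous_avg_total = round(avg_total + 0.19, 2)

print(f"Estimated nofollowed links: {nofollowed_links / 1e9:.2f} billion")
print(f"Estimated external nofollowed links: {external_nofollowed / 1e9:.2f} billion")
print(f"Implied previous average links per page: {previous_avg_total}")
```

Note that dividing total links by total URLs will not reproduce the 77.28 average; presumably the per-page average is computed over a different set of pages than the full URL count, though the stats above don't spell that out.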

Since August, we’ve been struggling with a particularly devious problem: binary files in the index that throw off link counts and surface links that Google + Bing probably are not counting. In September’s crawl, we blacklisted these files and saw a reduction of ~40% in binary files. This time, we’ve made even more progress (though it’s tough to know exactly how much – we’re continuing to evaluate), and you should see a significant reduction in these binary files.

Reduction in Binary Files Means More Accurate Link Counts

In part because of the reduction in these files, processing time for the Linkscape index dropped, enabling us to produce a much faster index update. However, we’re planning to produce a much larger index in December and thus expect processing time to rise again. On the plus side, this will mean a lot more link data. In 2012, we’re aiming for an index of 100 billion+ URLs, closer to what we’ve heard Bing + Google keep in their main indices (~120-140 billion URLs).

As always, feedback on the new index is greatly appreciated – if you’re seeing stuff we’ve missed, files we shouldn’t have crawled or metrics that feel wrong, please let us know. Our engineers would love to hear from you.
