58 billion URLs in the Latest, Largest Linkscape Index Update Yet

Posted by randfish

I've got good news. Today marks a new Linkscape index (only 14 days after our previous index rollout) which means new data in Open Site Explorer, the Mozbar, the Web App and the Moz API. It's also more than 60% larger than our previous update in early January and shows better correlations with rankings in Google.com; I'm pretty excited.

For the past couple of years, SEOmoz has focused on surfacing quality links and high-quality metrics that correlate well with rankings, aiming to provide a large, representative sample of the web's link graph. However, we've heard feedback that this isn't enough and may not be exactly what many who research links are seeking (or at least, it's not fulfilling all the functions you need). We're responding by moving, starting with today's launch, to a new, consistently larger link index.

Today's index is built differently from past Linkscape updates. Rather than taking only those pages we've crawled in the past 3-4 weeks, we're using all of the pages we've found since October 2011 and replacing anything that's been recrawled with its newest version. The result is an index more like what you'd see from Google or Bing (where "fresh" content gets recrawled more frequently and static content is crawled/updated less often). This new index format will let us expose a much larger section of the web on an ongoing basis, and it reduces the redundancy of recrawling web pages that haven't been updated in months or years.
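
For the curious, here's a minimal sketch of the "keep the newest version of each URL" idea described above. It is not our actual Linkscape code; the `CrawledPage` record, the `merge_crawls` function, and the example URLs and dates are all hypothetical and exist only to illustrate the merge rule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class CrawledPage:
    """Hypothetical record for one crawled URL (illustration only)."""
    url: str
    crawled_on: date
    outlinks: Tuple[str, ...]  # URLs this page links to


def merge_crawls(batches: Iterable[Iterable[CrawledPage]]) -> Dict[str, CrawledPage]:
    """Combine several crawl batches, keeping one record per URL:
    whichever version was crawled most recently."""
    index: Dict[str, CrawledPage] = {}
    for batch in batches:
        for page in batch:
            current = index.get(page.url)
            # A page enters the index if it is new, or replaces the
            # stored copy if this crawl is more recent.
            if current is None or page.crawled_on > current.crawled_on:
                index[page.url] = page
    return index


# Made-up example data: one static page and one page recrawled in December.
october = [
    CrawledPage("example.com/about", date(2011, 10, 28), ("example.com/",)),
    CrawledPage("example.com/blog", date(2011, 10, 30), ("example.com/",)),
]
december = [
    CrawledPage("example.com/blog", date(2011, 12, 20),
                ("example.com/", "example.org/")),
]

index = merge_crawls([october, december])
for url, page in sorted(index.items()):
    print(url, page.crawled_on, len(page.outlinks))
```

In this toy run, the static "about" page survives from its October crawl, while the "blog" page is represented by its December version. That is the behavior that lets older, unchanged pages stay in the index without being recrawled every cycle.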

Below are two graphs showing the last year of Linkscape updates and their respective sizes in terms of individual URLs (at top) and root domains (at bottom):

Linkscape Index Size Over Time

As you can see, this latest index is considerably larger than anything we've produced recently. We had some success growing URL counts over the summer, but this actually lowered our domain diversity (and hurt the correlation numbers for some metrics), so we rolled back to a previous index format until now.

This means you'll see more links pointing to your sites (on average, at least) and to those of your competitors. We worried that a much larger index would hurt our metrics' correlations, but they've actually increased slightly (I hope to show off more detailed data on that in a future post with help from our data scientist, Matt). We believe we've managed to retain mostly quality stuff, though I would expect there'll be more "junk" in this index than usual. The oldest crawled URLs included here were seen 82 days ago, and the newest stuff is as fresh as the New Year.

Despite this mix of old + new, the percentage of "fresh" material is actually quite high. You can see a histogram below (ignore the green line) showing the distribution of URLs from various timeframes going into this new index. The most recent portion, crawled in the last two-thirds of December, represents a solid majority.

Histogram of crawl for Index 49

Let's take a look at the raw stats for index 49:

  • 58,316,673,893 (58 billion) URLs
  • 639,806,598 (639 million) Subdomains
  • 135,392,083 (135 million) Root Domains
  • 617,554,278,005 (617 billion) Links
  • Followed vs. Nofollowed
    • 2.10% of all links found were nofollowed
    • 56.50% of nofollowed links are internal
    • 43.50% are external
  • Rel Canonical – 11.79% of all pages now employ a rel=canonical tag
  • The average page has 87.36 links on it
    • 73.06 internal links on average
    • 14.29 external links on average
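
If you'd like to turn those percentages into rough link counts, here's a small worked sketch using only the figures listed above. The variable names are my own, and because the published percentages are rounded to two decimals, the derived counts are approximations rather than exact index numbers.

```python
# Figures taken from the index 49 stats above.
TOTAL_LINKS = 617_554_278_005
NOFOLLOW_SHARE = 0.0210            # 2.10% of all links are nofollowed
NOFOLLOW_INTERNAL_SHARE = 0.5650   # 56.50% of nofollowed links are internal
AVG_INTERNAL_PER_PAGE = 73.06
AVG_EXTERNAL_PER_PAGE = 14.29

# Approximate counts derived from the rounded percentages.
nofollowed = TOTAL_LINKS * NOFOLLOW_SHARE
nofollowed_internal = nofollowed * NOFOLLOW_INTERNAL_SHARE
nofollowed_external = nofollowed - nofollowed_internal

# The two per-page averages should add up to roughly the reported
# 87.36 total; any small gap comes from rounding in the published values.
avg_links_per_page = AVG_INTERNAL_PER_PAGE + AVG_EXTERNAL_PER_PAGE
external_share_per_page = AVG_EXTERNAL_PER_PAGE / avg_links_per_page

print(f"Nofollowed links:      ~{nofollowed / 1e9:.2f} billion")
print(f"  internal nofollowed: ~{nofollowed_internal / 1e9:.2f} billion")
print(f"  external nofollowed: ~{nofollowed_external / 1e9:.2f} billion")
print(f"Internal + external per page: {avg_links_per_page:.2f}")
print(f"External share of a page's links: {external_share_per_page:.1%}")
```

By this arithmetic, roughly 13 billion of the 617 billion links are nofollowed, and about one in six links on the average page points to another site.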

In addition to this good news, I have some potentially more hilarious and/or tragic stuff to share. I've made a deal with our Linkscape engineering group that if they release an index with 100+ billion URLs by March 30th (just 72 days away), I will shave/grow my facial hair to whatever style they collectively approve*. Thus, you may be seeing a Whiteboard Friday with a beardless or otherwise peculiar-looking presenter in the early Spring. 🙂

As always, feedback is welcome and appreciated on this new index. If some of the pages or links are looking funny, please let us know.

* 20th century European dictator mustaches excluded
