Matt Cutts on BigDaddy, linking, crawling etc

Matt Cutts recently wrote a major post, "Indexing timeline". In it he explains a great deal about the new BigDaddy infrastructure, the new rules for crawling, and how the relevancy of links has become a lot more important.

Below I quote the parts of his post and his comments that I personally find most interesting.

- After looking at the example sites, I could tell the issue in a few minutes. The sites that fit "no pages in Bigdaddy" criteria were sites where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling. The Bigdaddy update is independent of our supplemental results, so when Bigdaddy didn't select pages from a site, that would expose more supplemental results for a site.

Considering the amount of code that changed, I consider Bigdaddy pretty successful in that I only saw two complaints. The first was one that I mentioned, where we didn't index pages from sites with less trusted links, and we responded and started indexing more pages from those sites pretty quickly.

- Okay, let's check one from May 11th. The owner sent only a url, with no text or explanation at all, but let's tackle it. This is also a real estate site, this time about an Eastern European country. I see 387 pages indexed currently. Aha, checking out the bottom of the page, I see this:

[Image of low quality footer links]

Linking to a free ringtones site, an SEO contest, and an Omega 3 fish oil site? I think I've found your problem. I'd think about the quality of your links if you'd prefer to have more pages crawled. As these indexing changes have rolled out, we've been improving how we handle reciprocal link exchanges and link buying/selling.

Some folks that were doing a lot of reciprocal links might see less crawling. If your site has very few links where you'd be on the fringe of the crawl, then it's relatively normal that changes in the crawl may change how much of your site we crawl. And if you've got an affiliate site, it makes sense to think about the amount of value-add that your site provides; you want to provide a reason why users would prefer your site.

Quotes from his comments:

With Bigdaddy, it's expected behavior that we'll crawl some more pages than we index. That's done so that we can improve our crawling and indexing over time, and it doesn't mean that we don't like your site.

arubicus, typically the depth of the directory doesn't make any difference for us; PageRank is a much larger factor. So without knowing your site, I'd look at trying to make sure that your site is using your PageRank well. A tree structure with a certain fanout at each level is usually a good way of doing it.
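To get a feel for what "a tree structure with a certain fanout" means in practice, here is a minimal sketch of my own. It assumes an idealized site where every page links to the same number of child pages; the function name `levels_needed` and the page counts are hypothetical, and this is only arithmetic about site structure, not anything Google has published.

```python
def levels_needed(total_pages: int, fanout: int) -> int:
    """Return the smallest link depth at which a tree-shaped site with the
    given fanout can hold total_pages pages (the root page is depth 0)."""
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")

    depth = 0
    level_size = 1   # number of pages on the current level (just the root)
    capacity = 1     # number of pages on all levels so far
    while capacity < total_pages:
        level_size *= fanout      # each page links to `fanout` children
        capacity += level_size
        depth += 1
    return depth


if __name__ == "__main__":
    # Hypothetical site of 1000 pages with two different fanouts.
    print(levels_needed(1000, 10))  # 3
    print(levels_needed(1000, 32))  # 2
```

A wider fanout keeps every page closer to the root page, at the cost of more links on each page.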

BTW, CrankyDave, your site seems like an example of one of those sites that might have been crawled more before because of link exchanges. I picked five at random and they were all just traded links. Google is less likely to give those links as much weight now. That's the simple explanation for why we don't crawl you as deeply, in my opinion.

To the degree that search engines reflect reputation on the web, the best way to gather links is to offer services or information that attract visitors and links on your own. Things like blogs are a great way to attract links because you're offering a look behind the curtain of whatever your subject is, for example.

On the other hand, don't expect that just listing a sitemap is enough to get a domain crawled. If no one ever links to your site, that makes Googlebot less likely to crawl your pages.

graywolf, it's true that if you had N backlinks and some fraction of those are considered lower quality, we'd crawl your site less than if all N were fantastic. Hope that makes sense. Light crawling can also mean "we just didn't see many links to your domain" as well though.

There's SEO and there's QUALITY and there's also finding the hook or angle that captivates a visitor and gets word-of-mouth or return visits. First I'd work on QUALITY. Then there's factual SEO. Things like: are all of my pages reachable with a text browser from a root page without going through exotic stuff. Or having a site map on your site. After your site is crawlable, then I'd work on the HOOK that makes your site interesting/useful.
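The "reachable from a root page" check is easy to try on your own site. The sketch below is my own illustration, not Google's method: it assumes you already have a map of plain HTML links per page (the `links` dictionary and the example paths are made up), and it walks that map breadth-first from the root to list the pages that no chain of links reaches.

```python
from collections import deque


def unreachable_pages(links: dict[str, list[str]], root: str) -> list[str]:
    """Return the pages that cannot be reached from root by following links.

    links maps each page to the pages it links to with ordinary text links.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)

    # Every page we know about: link sources plus link targets.
    all_pages = set(links)
    for targets in links.values():
        all_pages.update(targets)
    return sorted(all_pages - seen)


if __name__ == "__main__":
    # Hypothetical site: "/orphan" links out, but nothing links to it.
    site = {
        "/": ["/about", "/listings"],
        "/listings": ["/listings/1", "/"],
        "/orphan": ["/", "/hidden"],
    }
    print(unreachable_pages(site, "/"))  # ['/hidden', '/orphan']
```

Any page this reports is one a crawler starting at the root would have to discover some other way, for example through an external link.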

BTW, just because Yahoo reports nofollow links in the Site Explorer, I wouldn't assume that those links are counting for Yahoo!Rank (or whatever you want to call it).

For example, if you don't have as much PageRank relative to other sites, you may see fewer pages crawled/indexed in our main results; that would often be visible by having more supplemental results listed.

"Sounds like Google is now actually penalizing for poor quality inbound links." Mike, that isn't what's happening in the examples that I mentioned. It's just that those links aren't helping the site.

David Burdon, no, off-topic links wouldn't cause a penalty by themselves. Now if the off-topic links are spammy, that could cause a problem. But if a hardware company links to a software package, that's often a good link even though some people might think of the link as off-topic.

To go to your other question. I wouldn't be thinking in terms of "if I link to Yahoo/Google/ODP/whatever, I'll get some cred because those sites are good." If it's natural and good to link to a particular site because it would help your users, I'd do it. But I wouldn't expect to get a lot of benefit from linking to a bunch of high-PageRank sites.

Jack Mitchell, you said "Saying you can't do reciprocal linking is just sheer idiocy. How does Google expect you to get back links?" I'm not saying not to do reciprocal links. I only said that in the cases that I checked out, some of the sites were probably being crawled less because those reciprocal links weren't counting as much. As far as how to get back links, things like offering tools (robots.txt checkers), information (newsletters, blogs), services, or interesting hooks (e.g. seobuzzbox doing interviews) can really jumpstart links. Building up a reputation with a community helps (doing forums on your own site or participating in other forums can help). As far as hooks, I'd study things like digg, slashdot, reddit, techmeme, tailrank to get an idea of what captures people's attention. For example, contests and controversy attract links, but can be overused. That would be my quick take.

Spam Reporter, a lot of the time we'll give a relatively short penalty (e.g. 30 days) for the first instance of hidden text. You might submit again because sometimes we'll decide to take stronger action.

