New Insights into Googlebot
07/23/10
Posted by rolfbroer
This post was originally in YOUmoz, and was promoted to the main blog because it provides great value and interest to our community. The author’s views are entirely his or her own and may not reflect the views of SEOmoz, Inc.
Google has found an intelligent way to arrange the results for a search query. But an interesting question is: where can we find that intelligence? A lot of people have researched the indexing process, and even more have tested the weight of ranking factors, but we wondered how smart Googlebot itself is. To make a start, we took some statements and commonly used principles and tested how Googlebot handled them. Some results are questionable and should be tested on a few hundred domains to be sure, but they can give you some ideas.
Speed of The Crawler
The first statement we tested was Matt Cutts’s: “… the number of pages that we crawl is roughly proportional to your PageRank”.
This brings us to one of the challenges large content sites are facing – the problem of getting all pages indexed. You can imagine if Amazon.com was a new website, it would take a while for Google to crawl all 48 million pages and if Matt Cutts’s statement is true, it would be impossible without any incoming links.
To test it, we took a domain with no history (never registered, no backlinks) and made a page with 250 links on it. Those links refer to pages that also have 250 links (and so on…). The links and URLs were numbered from 1 to 250, in the same order as they appeared in the source code. We submitted the URL via “addurl” and waited. Because the domain has no incoming links, it has no PageRank, or at most a negligible one. If Matt Cutts’s statement is correct, Googlebot would soon stop crawling.

As you can see in the graph, Googlebot started crawling the site with a crawl rate of approximately 2500 nodes per hour. After three hours, it slowed down to a crawl rate of approximately 25 pages per hour and maintained that rate for months. To verify this result we did the same test with two other domains. Both tests came up with nearly the same results. The only difference is the lower peak at the beginning of Googlebot’s visit.

Impact of Sitemaps
During the tests, the sitemap proved to be a very useful tool for influencing the crawl rate. We added a sitemap with 50,000 uncrawled pages in it (indexation level 0). Googlebot placed the pages that were submitted through the sitemap at the top of the crawl queue, which means those pages got crawled before the F-levelled pages. But what’s really remarkable is the extreme increase in crawl rate. At first, the number of visits had stabilized at a rate of 20-30 pages per hour. As soon as the sitemap was uploaded through Webmaster Central, the crawler accelerated to approximately 500 pages per hour. In just a few days it reached a peak of 2224 pages per hour. Where at first the crawler visited 26.59 pages per hour on average, it grew to an average of 1257.78 pages per hour, which is an increase of no less than 4630.27%. The increase in crawl rate doesn’t stop at the pages included in the sitemap: the other F- and 0-levelled pages also benefit from it.
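The percentage above can be checked from the two reported averages. A quick sketch (the function and variable names are ours, chosen for illustration):

```python
def percent_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Average crawl rates reported in the test, in pages per hour.
before_sitemap = 26.59
after_sitemap = 1257.78

increase = percent_increase(before_sitemap, after_sitemap)
print(f"{increase:.2f}%")
```

Put differently, the average crawl rate after the sitemap upload was roughly 47 times the rate before it.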

It’s quite remarkable that Google suddenly uses more of its crawl capacity to crawl the website. At the point where we submitted the sitemap, the crawl queue was filled with F-pages. Google probably attaches a lot of value to the submitted sitemap.
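For readers who have not built one, a sitemap is a plain XML file that lists URLs. Below is a minimal sketch using Python’s standard library. The example.com URLs are placeholders rather than the pages from our test, and the limit of 50,000 URLs per file comes from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # limit from the sitemaps.org protocol


def build_sitemap(urls):
    """Return sitemap XML (as bytes) listing the given URLs."""
    urls = list(urls)
    if len(urls) > MAX_URLS_PER_FILE:
        raise ValueError("split the list across several sitemap files")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical URLs, numbered like the pages in the test.
pages = [f"http://example.com/{n}/" for n in range(1, 251)]
sitemap_xml = build_sitemap(pages)
```

The resulting bytes can be written to a file such as sitemap.xml and submitted through Webmaster Central.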

This brings us back to Matt Cutts’s statement. After only 31 days Googlebot crawled about 375,000 pages of the website. If this is proportional to its PageRank (which is 0), this would mean that it will crawl 140,625,000,000 pages of a PageRank 1 website in just 31 days. Remember that PageRank is exponential. In other words, you would never have to worry about your PageRank, even if you own the largest website on the web. So don’t simply accept everything Matt says.
Amount of Links
Rand Fishkin says: “…you really can go above Google’s recommended of 100 links per page, with a PageRank 7.5 you can think about 250-300 links” ( http://www.seomoz.org/blog/whiteboard-friday-flat-site-architecture )
The 100 links per page advice has always been a hot topic, especially for websites with a lot of pages. The advice was originally given because Google used to index only 100 kilobytes per page. On a 100 KB page, 100 links seemed reasonable. If a page was any longer, there was a chance that Google would truncate it and not index the entire page. These days, Google will index more than 1.5 MB, and user experience is the main reason for Google to keep the “100 links” recommendation in their guidelines.
As described in the previous section, Google does crawl 250 links, even on sites with no incoming links. But is there a limit? We used the same set-up as for the 250-link websites, but this time with 5,000 links per page. When Googlebot visited that website something remarkable happened. Googlebot requested the following pages:
- http://example.com/1/
- http://example.com/10/
- http://example.com/100/
- http://example.com/1000/
On every level Google visits, we see the same page requests. It seems like Googlebot doesn’t know how to handle such a large number of links and tries to solve it as a computer.
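One observation about those four URLs, offered as a guess rather than something the test established: 1, 10, 100 and 1000 are exactly the first four page numbers when 1 to 5000 are sorted as text instead of as numbers. The sketch below only demonstrates that ordering; it says nothing about how Googlebot actually picks URLs.

```python
# Page numbers used in the test, 1 through 5000.
page_numbers = range(1, 5001)

# Sort them as text (character by character) instead of as numbers.
as_text = sorted(str(n) for n in page_numbers)

first_four = as_text[:4]
print(first_four)  # ['1', '10', '100', '1000']
```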
Semantic Intelligence
One of the SEO myths applied on almost every optimised website is that links placed in heading tags carry extra weight. Recently it was mentioned again as one of the factors of the “Reasonable Surfer patent”. If Google respects semantics, it definitely attaches more value to those “heading” links. We had our doubts and put it to the test. We took a page with 250 links on it and marked some with heading tags. This was done a few levels deep. After a few weeks of waiting, nothing indicated that Googlebot preferred the “heading” links. This doesn’t mean Googlebot doesn’t use semantics in its algorithm; it just doesn’t use headings to give links more weight than others.
Crawling JavaScript
Google says it keeps getting better at recognizing and executing JavaScript. Although JavaScript is not a good technique to use if you want to be sure that Google follows your links, it’s used quite a lot to reach the opposite goal. When used for PageRank sculpting, the purpose of JavaScript links is to make those links visible only to users. If you use this technique for this purpose, it’s good to keep yourself updated on what Google can and can’t recognize and execute. To test Googlebot’s JavaScript capabilities, we took the JavaScript codes described in “The professional’s guide to PageRank optimization” and put them to the test.
The only code Googlebot executed and followed during our test was the link in a simple “document.write” line. This doesn’t exclude the possibility that Googlebot is capable of recognizing and executing the more advanced script. It is possible that Google needs an extra trigger (like incoming links) to put more effort into the JavaScript crawling.
Crawling Breadcrumbs
Breadcrumbs are a typical element on a webpage, specially created for users. Sometimes they are used to support the site structure as well. Last month we encountered some problems where Googlebot was not able to crawl its way up, so we did some tests.
We made a page a few levels deep with some content and links to higher levels on it ( http://example.com/lvl1/lvl2/lvl3/ ). We gave the page some incoming links and waited for Googlebot. Although the deep page itself was visited 3 times by the crawler, the higher pages didn’t get a visit.

To verify this result, we did the same test on another domain. This time the test page was a few levels deeper in the site structure (http://example.com/lvl1/lvl2/lvl3/lvl4/lvl5/). This time Googlebot did follow some links which referred to pages higher in the site structure. Despite the fact that Googlebot does follow the links, it doesn’t seem to be a good method to support a site structure. After a few weeks Google still hadn’t crawled all the higher pages. It looks like Googlebot would rather crawl deeper into the site structure than crawl higher pages.
Takeaways
In short, the lesson learned is that one can influence the crawl rate with a sitemap. This doesn’t mean that you should always upload a sitemap for your websites. You only want to increase the crawl rate if the bulk of your crawled pages get indexed. It takes longer for a crawler to return to an “F”-levelled page than to return to an indexed page. So if most of your pages get crawled but are dropped from the index, you might want to consider getting more incoming links before using a sitemap. The best thing to do is to monitor when Googlebot last visited each page. With this method you can always identify problems in your site structure.
The number of links isn’t limited to 250 (even if you have no incoming links), although 5000 seems to be too many. We haven’t found the exact limit yet, but if we do, we will give you an update.
Links in heading tags for crawl purposes seem to be a waste of time. You can still use them for usability purposes, because you’re used to it, or because WordPress does it anyway, and maybe, if you’re lucky, it’s still a ranking factor.
Another conclusion we can draw is that Googlebot isn’t very good at crawling breadcrumbs. So don’t use them for site structure purposes. Google just doesn’t crawl up as well as it crawls down. In contrast to breadcrumbs, you can use JavaScript for sculpting purposes. Googlebot isn’t top of the bill when it comes to recognizing and executing JavaScript links. Remember to keep yourself updated on this subject, but for now you definitely can use some “advanced” JavaScript to do sculpting.
A last result that came up while researching the crawl process was the influence of URL length. Short URLs get crawled earlier than long URLs, so always weigh the need to be crawled and indexed when you choose your URLs.
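The monitoring suggested above can be done from your server’s access logs. Here is a minimal sketch, assuming log lines in the Apache/Nginx “combined” format and identifying Googlebot by its user-agent string only (a simplification, since that string can be spoofed). The sample lines and the function name are made up for illustration.

```python
import re
from datetime import datetime

# Assumed format: Apache/Nginx "combined" access log lines.
LINE_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def last_googlebot_visits(lines):
    """Map each requested path to its most recent Googlebot visit."""
    last_seen = {}
    for line in lines:
        match = LINE_RE.search(line)
        if not match or "Googlebot" not in match["agent"]:
            continue
        when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        path = match["path"]
        if path not in last_seen or when > last_seen[path]:
            last_seen[path] = when
    return last_seen


# Made-up sample lines: two Googlebot visits to /1/ and one human visit to /2/.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [20/Jul/2010:10:00:00 +0000] "GET /1/ HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [22/Jul/2010:08:30:00 +0000] "GET /1/ HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    '10.0.0.5 - - [22/Jul/2010:09:00:00 +0000] "GET /2/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]
visits = last_googlebot_visits(sample)
for path, when in sorted(visits.items()):
    print(path, when.isoformat())
```

Pages that exist on your site but never show up in this mapping, or whose last visit is far in the past, are the ones worth investigating.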
Matt Cutts Movie Marathon
06/25/10
Posted by Dr. Pete
This post is the culmination of two of my lifelong dreams: (1) To spend an entire day on YouTube and call it "work", and (2) To Photoshop Matt Cutts’ face on cartoon food. Early in 2009, Matt Cutts, Google’s most visible anti-spam engineer, began releasing a series of short Webmaster Help videos. You’ve probably seen some of these videos, but what you may not know is that there are currently over 200 of them, with more than 70 posted in 2010 alone.
From time to time, I’ve been amazed at the details that slip out during these videos, many of which don’t get much play in the blogosphere. So, I decided to watch all of the 2010 videos and report back on what I learned. This post contains my Top 10 picks along with a few interesting tidbits and one SHOCKING CONSPIRACY.
Obligatory Disclaimers
Let’s get this out of the way, as Matt seems to be a lightning rod for controversy. I’m a nice guy, but if you don’t read this section, don’t expect me to reply to your comments.
I don’t speak for Matt
Other than having played a couple of hands of Search Spam with Matt over the years (I think we’re 1-and-1), I don’t know him and I’m not trying to put words in his mouth. I’ve used the original video titles, for reference, but the rest is paraphrased. I strongly encourage you to watch the originals.
Don’t believe everything you hear
Matt, like everyone, has vested interests, and Google doesn’t have any motivation to tell us every detail about how the algorithm works.
Don’t disbelieve everything, either
I don’t think Matt stays up nights scheming about how to deceive SEOs. I think he’s a smart, decent guy who cares about search quality.
My Top 10 Picks
One quick note, before I reveal my picks (counting down from 10 to 1). If you want to get Matt to answer your questions, it apparently helps to have a cool-sounding name, like "Magico" or "Youser". From now on, I will have my Muppet Intern Yoozer submit all of my help questions.
10. Should I spend time on meta keywords tags? (Apr 19)
Matt says: "I wouldn’t spend even 0 minutes on it, personally".
I know most of you know this, but it’s good to hear it from the source. Google does not use the keywords meta tag for ranking. Meta description still has value for other reasons (Watch the video – 1:21).
9. How does URL structure affect PageRank? (Apr 6)
Matt says: "Google doesn’t worry so much about how deep a set of directories is."
This post raises an important distinction – URL structure is not link structure. We get this confusion frequently in Q&A. Let’s say you have a URL like this:
http://www.example.com/year/month/day/topic/blog-post-title
That page isn’t 5 levels deep, just because it’s 5 /s behind the root domain in the URL. The depth of the page is determined by your internal architecture and link structure. URL length may affect the power of keywords in the URL and the click-through of the URL, but the crawlers don’t really care when it comes to finding your pages. What matters is if this page is one hop from the home-page or 10 hops away (Watch the video - 2:04).
Note: SEOmoz correlation data has shown that deeper folder structure may correlate with worse rankings. Deep folder structures can be an indication of other issues, including information architecture problems.
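To make the distinction between URL depth and link depth concrete, here is a small sketch. The link graph is hypothetical: it assumes the home page links directly to the post. URL depth counts path segments, while click depth is the fewest link hops from the home page, found with a breadth-first search.

```python
from collections import deque
from urllib.parse import urlsplit


def url_depth(url):
    """Number of path segments in the URL."""
    return len([part for part in urlsplit(url).path.split("/") if part])


def click_depths(links, home):
    """Breadth-first search: fewest link hops from `home` to each page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical site: the home page links straight to a deeply nested URL.
home = "http://www.example.com/"
post = "http://www.example.com/year/month/day/topic/blog-post-title"
links = {home: [post]}

print(url_depth(post))                   # 5
print(click_depths(links, home)[post])   # 1
```

The same URL is five folders deep but only one hop from the home page, and it is the hop count that matters for discovery.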
8. Can I make sure Google always uses my meta description tags? (Mar 24)
Matt says: "The short answer is ‘no’."
I hear this complaint a lot. Google will sometimes rewrite its own snippets for relevance. You can block the ODP and you can write relevant, unique meta descriptions, but you can’t completely control what Google does (Watch the video – 1:52).
7. Can having dofollow comments on my blog affect its reputation? (Feb 22)
This is an interesting two-parter. First off, outbound links to spammy sites can have a negative impact on your reputation. Manage your outbound links and nofollow if you have to. Individual, inbound spammy links will typically not harm you, on the other hand, because they’re beyond your control (although, in my experience, a pattern of inbound spammy links can be a different story). Matt has some great comments at the end about the value of commenting on dofollow blogs (Watch the video – 2:35).
6. Is cross-linking websites bad? (Jan 25)
Matt says: "I would ask yourself: are these websites really related in any kind of sense?"
When Matt wants to read cartoons, links to auto insurance and coffee tables make him sad. Cross-linking 3 sites probably isn’t a big deal, but 30 or 300 could likely get you into trouble. Relevance is the key, and footer cross-links are often low-value (Watch the video – 2:00).
5. How can I get Google to index more of my Sitemap URLS? (Mar 23)
Matt says: "I wouldn’t get hung up on just how many pages have been indexed…"
We hear this one from frustrated webmasters every day. Google does not guarantee that pages in your XML sitemap will be indexed. Indexation has a lot to do with your authority and trust – an authoritative site will get more love from the crawlers, plain and simple (Watch the video – 1:31). Check out Rand’s recent post diving deeper into Matt’s comments on the indexation cap.
4. Will changing hosts cause any SEO concerns? (Feb 9)
Matt says: "Most people can switch their IP address and never have any issue whatsoever."
This is a common fear that is usually unfounded. As long as your domain name and hosting country stay the same, switching from one reliable host to another should have no SEO impact. Matt gives a nice briefing on how to change DNS servers and set your TTL that’s worth watching (Watch the video – 1:53).
Note: Although I implied this in the recap, it deserves repeating. If you’re changing your domain name and/or hosting country, that can definitely affect your ranking and is a much more complex issue. Consider the risks and plan accordingly, in those cases.
3. Is Google Analytics data a factor in a page’s ranking? (Feb 2)
Matt says: "I promise you, my team will never ask the analytics team to use their data."
I don’t think you’ll hear a more direct answer from Matt than that. Conspiracy theories abound, but there are 3 separate videos in 2010 where Matt states that the quality team does not use Google Analytics data. Of course, that doesn’t mean that user metrics (click-through rate, etc.) aren’t a factor, but these are more likely coming from other sources, such as SERP tracking (Watch the video – 1:17).
2. Can you give us an update on rankings for long-tail searches? (May 30)
This is a discussion of the so-called "Mayday" update. Matt clearly states that Mayday is a deliberate, algorithmic change to improve the quality of long-tail searches, and it is not temporary. It is not related to Caffeine, although the roll-out timeline overlaps somewhat (Watch the video – 2:39).
1. Should I be obsessing about load times? (May 5)
Matt says: "We have considered in 2010 using page speed…"
There are a couple of important points here. First, Google hadn’t even finalized the decision to use page speed as a ranking factor until this spring*. Second, page speed is just one of over 200 ranking factors. All else being equal, a fast site is good for users and good for search, but an occasional server glitch isn’t going to kill your rankings. If you can speed up your site with a few simple changes, though, why not do it (Watch the video – 2:28)?
*Edit: As Lindsay points out below, Matt’s April 9th blog post does suggest that page speed was incorporated as a ranking factor. One of the issues with the dates on the videos is that they’re often recorded a bit before they’re released. On the May 5th video, Matt suggests that Google hadn’t made a final decision on using page speed, but the reality is that that decision was probably made in March or April.
Honorable Mentions
3. How many bots does Google have? (Feb 30)
This is a nice review of what bots/spiders actually are. They aren’t real robots that come knocking on your door. It’s a good, short primer for new SEOs (Watch the video – 1:30).
2. State of the Index 2009 (Jan 20)
This is a long one, and it’s slightly out of date, but it’s a good review of some of what happened in 2009. It has a solid explanation of rel=canonical, as well as the parameter blocking and fetch as Googlebot features in Webmaster Tools. It ends with a brief explanation of what Caffeine is all about (Watch the video – 25:59).
1. How many search algorithm changes were made in 2009? (Apr 22)
Google makes a change to the algorithm on the order of ONCE PER DAY. These changes may be batched and rolled out in chunks, but another video confirmed a number of roughly 400 algorithm changes in 2009. If you think Mayday and Caffeine are the only things that have happened in 2010, think again. Google is constantly evolving. This video also includes a statement you don’t hear from Matt every day: good content is necessary, but not sufficient (Watch the video – 1:53).
The Shocking Conspiracy
Of course, it wouldn’t be a post about Matt Cutts without a conspiracy. If you watch the 2010 videos, you’ll see a shocking transformation, where Matt goes from having hair to no hair back to hair again almost instantaneously. I’ve graphed this phenomenon below:

Matt claims this has something to do with the timing of the videos and filming them in batches, blah blah blah, but those of us who are savvy are forced to reach one of two conclusions:
- Google has discovered the secret of re-growing hair and refuses to share it.
- Matt is, as I’ve often suspected, a cybernetic extension of the Google algorithm.
So, there you have it. My Top 10 picks of 2010 (so far), a few highlight reels, and one shocking conspiracy, as promised. By the way, if you’re a beginner or are interested in general SEO tips like these, make sure to check out our completely revised, free Beginner’s Guide to SEO.
When people talk about the future of search, they often include factors such as mobile, social, real-time, and other buzz-type words. But it is not very often that they offer an explanation as to how these elements will impact search moving forward. However, in this interview with WebProNews, search veteran Bruce Clay tells that side of the story.
In the early days, Clay says SEO was easy. He goes on to say that it was somewhat defined even 5 years ago, but social, mobile, and local are not defined at all. Now, SEO is more difficult and targeted and will get even harder over time. He calls the top 3 search results the new first page.
“You can’t be good at SEO, you have to be great,” says Clay.
In the next 18 months, he believes the hottest topics in SEO circles will be local, social media, conversions, and, somewhat surprisingly, only some discussion about mobile. The reason for this lack of mobile discussion is that people do not like the mobile browser.
Clay thinks the mobile device will become an operating system with the ability to connect apps directly to the Web, which would eliminate the need for a browser. Although he believes a “find” app will be dominant over a search app, he doesn’t believe that mobile will replace search.
In regards to Google’s recent MayDay update, Clay says he saw nothing but good results for sites that optimized for the long tail. While sites that had casual long tail results lost some traffic, he pointed out that it didn’t impact their conversions.
Google Caffeine is another update that has been receiving a lot of attention of late and Clay had a lot to share about it as well. Last year, Google said that it was rolling Caffeine out to one data center and would slowly roll it out to the others. After having a conversation with Google’s Matt Cutts, Clay believes Caffeine is completely rolled out now but just not in 100 percent of the queries.
He goes on to say that the advantages of Caffeine are the near real-time page index updates and increased spam filters. In addition, he says there are several behind-the-scenes factors that make it even more interesting. Although Google has not officially announced it, users can now buy Unicode characters in URLs and the search engine supports it.
He also brings up a point about how Google recently said that it has 200 variables in the algorithm. As a result, search results were slower and behavioral search was penalized. Moving forward, Clay believes that multiple disjointed queries will determine search results but says it can’t be done without a faster index.
One of the big details that Google has emphasized about Caffeine is its faster index. According to Clay, if behavioral search works, PPC ads will be better and more targeted, which means that ROI will increase. As the ROI increases, the bid will also increase, which would ultimately generate more revenue for Google. All that said, the searchers would win as well since they would be getting better results.
Clay has given us a lot to think about. How do you feel about his projections?
Posted by randfish
This past week during the SMX Advanced conference in Seattle, I presented some correlation data alongside Janet Driscoll-Miller, Sasi Parthasarathy of Bing & Matt Cutts of Google. Matt in particular was quite vocal in expressing a desire to see additional data points from our research, primarily around the prominence/visibility of particular elements in the results. This post is intended to help make that available.

I must say that I don’t agree with Matt on the importance of the raw visibility/counts over the ranking correlations. My feeling is that SEOs in these spaces are more interested in answering the question – "what features predict a result will rank higher vs. lower on page 1?" – rather than the more straightforward – "does this feature appear more frequently on page 1 at Google or Bing?" However, I certainly agree that both are relevant and interesting.
If you’re trying to wrap your head around how to understand this prominence/visibility data vs. our earlier data on the correlation with rankings, here’s how we’d best describe it:
- Correlation w/ rankings data helps to answer the question, "when this feature appears in results on the first page of Google/Bing, who ranks it higher and by what amount?" Those correlation numbers were derived by looking at the likelihood that a result would rank above another when it contained the target attribute.
- Visibility/prominence of an element helps to answer the question, "is this element more likely to appear on the first page of Google’s/Bing’s results?" This simply looks at the number of times we saw a result (or multiple results) ranking on page 1 containing the target attribute.
We’re looking at the latter one in this post, but before we dive in, there are a few critical items to understand:
- This isn’t correlation data and there’s no standard error or deviation numbers here. It’s simply how many times we saw the element in the results we gathered, divided by the total number of results (SERPs or URLs depending on the chart) to get a percentage.
- This data is from page 1 of the results for 11,351 search queries, gathered from Google’s AdWords categories. This means the terms and phrases vary somewhat in search quantity (from sub-100 searches per month to tens or hundreds of thousands) but generally have a commercial focus and intent. They generally don’t include brand names, long tail phrases or vanity name searches. Overall, we picked them because they’re precisely the kinds of queries most SEOs care about when they’re doing competitive SEO for their companies and clients. We also ignore the second result in a SERP from the same domain to avoid effects of indented results (which was important for our earlier statistics, but not those in this post).
- The results were collected the week of May 31st and thus, include post-"Mayday" update SERPs and likely results from after the "caffeine" launch as well (though Google did not announce when exactly that rollout occurred – it may not have much bearing as caffeine supposedly is an infrastructure, rather than an algorithmic change).
- Each feature contains two pie charts, one showing the percentage of results that contained at least 1 URL with this feature and another showing the percentage of total URLs in all results (102,296 for Google and 109,966 for Bing – note that some SERPs will fluctuate the quantity of standard web results they show on page 1). These are labeled as "(feature) in SERPs" and "(feature) in URLs," respectively.
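The two percentages behind each pair of pie charts can be sketched as follows. The data here is a toy example, not from our study, and the function name and the feature test are made up for illustration.

```python
def prominence(serps, has_feature):
    """Return (share of SERPs, share of URLs) showing a feature, in percent.

    `serps` is a list of result pages, each a list of URLs.
    `has_feature` is a function that tests a single URL.
    """
    total_urls = sum(len(serp) for serp in serps)
    urls_with = sum(1 for serp in serps for url in serp if has_feature(url))
    serps_with = sum(1 for serp in serps if any(has_feature(u) for u in serp))
    return 100 * serps_with / len(serps), 100 * urls_with / total_urls


# Toy data: two SERPs with three results each.
serps = [
    ["http://a.com/", "http://b.org/", "http://c.com/"],
    ["http://d.net/", "http://e.net/", "http://f.net/"],
]
in_serps, in_urls = prominence(serps, lambda url: ".com/" in url)
print(f"in SERPs: {in_serps:.1f}%, in URLs: {in_urls:.1f}%")
```

In this toy case one of the two SERPs contains the feature (50%), while two of the six URLs do (about 33%), which shows why the two charts for a feature can differ.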
In gathering this data, we did not optimize to share it in this fashion. In fact, Ben & I both feel that if we wanted to do it this way, we should gather the first 3-5 pages of results, not just the 1st page. That way, one could compare the counts on page 1 with the counts on page 2. However, since we’ve got the data and Matt, Sasi and several other folks expressed interest, we’re sharing anyway. Hopefully in the future we can do more on this front.
Let’s dive in!
Exact Match Domains
These are domains that precisely matched the keywords in the query – e.g. for the query "dog collars" only a domain that matched *.dogcollars.* would be included.
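As an illustration of the definition, here is a rough sketch of an exact-match test. It is not the code used for the study. It assumes URLs include a scheme, and it treats the last hostname label as the TLD, so it mishandles suffixes such as .co.uk.

```python
from urllib.parse import urlsplit


def split_host(url):
    """Return (name, tld) for a URL, e.g. ('dogcollars', 'com').

    Simplification: the last label is treated as the TLD.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) < 2:
        return "", ""
    return labels[-2], labels[-1]


def is_exact_match(query, url, tld=None):
    """True if the query, with spaces removed, equals the domain name."""
    name, url_tld = split_host(url)
    wanted = "".join(query.lower().split())
    return name == wanted and (tld is None or url_tld == tld)


print(is_exact_match("dog collars", "http://www.dogcollars.com/shop"))     # True
print(is_exact_match("dog collars", "http://www.mydogcollars.com/"))       # False
print(is_exact_match("dog collars", "http://dogcollars.net/", tld="com"))  # False
```

The optional `tld` argument is there to express the TLD-specific variants, such as exact match .com domains.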

You can see that Bing has slightly more exact match domains appearing in at least one result of the SERPs we collected and in the overall count of results (all the URLs from all the SERPs).
Exact Match .com Domains
Similar to exact match domains, exact match .com domains had to contain the exact query in the domain name and have a .com TLD extension.


Again, Bing showed a slight preference for displaying results from these sites in the SERPs and URLs we observed.
Exact Match .net Domains
As above, but replace ".com" with ".net."


The similarity is much closer in the number of total URLs we saw with .net exact match, but Bing is showing a preference in the SERPs count.
Exact Match .org Domains
In the .org TLDs, we start to see a bit of what we observed in the ranking correlation data:


This is the first exact match domain TLD where Google actually had more SERPs containing a result of this type. Bing, however, had a very tiny amount more URLs with this feature.
Exact Hyphenated Match Domains
One of Matt Cutts’ complaints centered around how Google vs. Bing handled exact hyphenated match domains. When we observed them in ranking correlations, it appeared that, when Google listed them, they would rank them higher than Bing did when they appeared on that first page of results. However…


As I called out in the presentation and the prior post, Bing has quite a few more SERPs where exact match domains appear and somewhat more URLs, too. This is another data point that should make us all think carefully about the fallacy of presuming correlation = causation. Bing might have a preference for exact hyphenated match domains, but the ranking correlations suggest to me there’s more going on here – maybe something to do with anchor text or where those types of sites tend to get links or something else we haven’t considered?
It’s critical to keep in mind that we’re just looking at individual factors here – not trying to explain why they exist or correlate (at least, not in the data).
Results that Include All Keywords in the Domain Name
Here we looked for domains that contained the keyword query in the domain, even if the match wasn’t exact. For example, mydogcollar.com would now match for the phrase "dog collar."
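A looser test along these lines might look like the sketch below. It checks that every word of the query occurs somewhere in the domain name. That is one reading of "contained the keyword query" and may not match the study’s exact rule. As before, the second-to-last hostname label is taken as the domain name, which is a simplification.

```python
from urllib.parse import urlsplit


def keywords_in_domain(query, url):
    """True if every word of the query occurs in the domain name."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    name = labels[-2] if len(labels) >= 2 else ""
    return all(word in name for word in query.lower().split())


print(keywords_in_domain("dog collar", "http://mydogcollar.com/"))  # True
print(keywords_in_domain("dog collar", "http://mydog.com/collar"))  # False
```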


Again, it’s Bing that shows a higher number of these types of domains in their results.
Results that Include All Keywords in the Subdomain Name
We’ve previously shown some data suggesting that subdomains might have some ranking influence, but not as much as root domains (this was done using our rank modeling / machine learning process). Here’s some raw data on the number of times we observed keyword matching subdomains:


Perhaps not surprisingly, Bing again is showing more of these results in their SERPs and individual URLs.
.com Domains
For this feature and all the TLDs below, we’re just looking at any URL that has the domain extension.


It looks like Bing has very slightly more .coms in their results vs. Google.
.org Domains
Let’s see what happens for .org domains, recalling Google’s apparent preference for them in the ranking correlations.


Oddly, Bing again seems to have more .org pages in the SERPs and URLs.
.net Domains
URLs with .net probably won’t surprise you much:


Yet again, Bing is showing a small number more than their Googly competitors.
.edu Domains
Recall how, in the correlation data, the numbers were small(ish) but negatively correlated? Let’s see what the number of results shows:


True to the stereotype, Google is slightly ahead on number of .edu domains in the SERPs & URLs.
.gov Domains
Given the previous charts, this one likely won’t surprise you:


Google has more .edus and more .govs, too.
Keywords in the Title Element
Not surprisingly, nearly every set of SERPs had at least one result where the title tag contained the keywords:


Bing shows up with more results that contain title tag to keyword matching. One thing that is worth mentioning is that we didn’t observe the titles the engines chose to show, but rather the page titles from the results themselves. Hence, if a result was showing a DMOZ title or a brand title (which Google will sometimes insert), we ignored those and just saw the title element on the page itself.
Keywords in the URL
This one actually surprised me, if only because there were even fewer results with keywords in the URL than in the title!


Bing again has more results with keyword-matching URLs, though remember that some of that is probably from keyword matching domains, too.
Keywords in the H1
The ranking correlations suggested that the H1 tag isn’t much of a differentiator, yet lots of people still swear by them:


The results would bear out that this is a much less frequent item than URLs or Titles for those ranking on page 1. Bing seems to show more of them than Google, though.
Keywords in the Alt Attribute
Alt attributes looked interesting last fall when we collected ranking information and once again proved worth a look in the correlation data from SMX Advanced. Let’s see what the raw counts show:


Bing is showing slightly more of these, but if the positive correlation means something, these numbers certainly suggest there’s lots of opportunity left for good alt attribute practices.
Homepages
Who lists homepages vs. deep pages in the results more?


My word! It’s Google by a good margin. Bing’s show of internal pages actually surprises me a bit, though perhaps that’s an old stereotype I need to abolish.
And with that, we’re done!
One important point to notice is that I’ve not included data on link results, as these would be hard to interpret and likely non-useful. Every page of results had pages with links to them and nearly every individual ranking URL also had links (a good sign for Linkscape’s index, but not super valuable as a data point). There were a few other data pieces like this that wouldn’t make sense here (keyword prominence in the body tag, word tokens in the body tag, domain name length, etc) and have thus been excluded.
I’ve done less analysis on these results in general, as I think the data is a bit less ideal for the purpose, but it’s still interesting and hopefully, illustrative of general prominence. I look forward to seeing your interpretations and discussion!
p.s. If you email Ben at SEOmoz dot org, he will send you a lot of numbers in a TSV which is for each query the metrics for each result that we used in these posts. You can also find raw results in a public Google spreadsheet doc here. Feel free to play around and let us know if you see anything else cool and interesting.
Is Google MayDay Affecting You?
06/11/10
Any time Google makes an update to its algorithm, it’s a big deal for webmasters. Following this trend, the Web community has reacted strongly to MayDay, a recent algorithm update from the search engine.
As Google’s Matt Cutts explains to WebProNews, one of the primary goals of MayDay was to address the people who do the “bare minimum” to avoid being classified as spammers. This type of content is often referred to as content farms. Due to the many complaints Google received about these content farms, the search company made changes to its algorithm to ensure that it returns the best sites for users.
“We’re trying to spot what are the signals for quality for pages or sites that really are going to be good for users,” says Cutts.
If webmasters find themselves affected by these changes, Cutts suggests that they re-evaluate their content to make sure they are providing the highest quality content. According to Cutts, the sites most readily affected are those with auto-generated pages.
On the topic of Caffeine, Cutts compares the index changes to moving from a bus to a limo. Back in 2003, the updates were slow, but with Caffeine, the index is faster, fresher, and richer. He says as soon as a document is crawled, it is indexed.
In this interview, Cutts also encourages webmasters to submit video sitemaps. Just as regular sitemaps are important to help Google discover pages, the search company wants to be able to have a comprehensive view of all the videos on the Web as well.
Posted by great scott!
We’ve got a very special bonus video for you today. Our buddy (and the Googliest spam cop to ever walk the webz) Matt Cutts stopped by to do a quick interview in front of ye olde whiteboard. Watch in wonder and amazement as Rand and Matt discuss headers, status codes, how much of the web is worth indexing, porn, redirect chains, URL structures, geo targeting, leaking link juice, and amateur beekeeping!

Before you get all cynical on me and assume all you’ll hear in this interview is, "design content for users, not for engines," give it a chance. Matt only brings up his trademark catchphrase once in the whole ~20 minute interview, and he is exceedingly candid and forthcoming throughout. I promise you’re gonna walk away from this knowing some things about Google you didn’t know before. If you don’t, I’ll stand on my head. Maybe. Not really. BUT I won’t have to because you’re going to be all super-smart and educated by the end of the video. So put on your learning pants and hit play, you uppity whipper-snapper, or, if you’re like Steve Jobs and are incompatible with Flash video, read the recap below…
If you need a refresher or you’re scared of moving images and prefer the company of fluffy, harmless typing, here’s a little recap of what Matt and Rand discussed.
Should Webmasters Use the ‘If Modified Since’ Header?
The ‘If Modified Since’ header can be used to manually indicate to Google whether or not you’ve made changes to content on the page. According to Matt, they started supporting it in 2003 when bandwidth was a big issue, but nowadays, it’s not very important. That said, he still advises it as a good standard practice, but also notes that it won’t necessarily help you get crawled faster.
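To make the mechanics concrete, here is a minimal sketch of how a server might answer a request that carries this header. The function name and the example dates are made up for illustration, and a real web server or framework will usually handle this for you.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the page is unchanged since the client's copy, else 200.

    if_modified_since: raw request header value, or None if it was not sent.
    last_modified: timezone-aware datetime of the page's last change.
    """
    if not if_modified_since:
        return 200
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: just serve the page
    if client_time.tzinfo is None:
        # Treat a date without a usable offset as UTC (an assumption).
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304
    return 200


# Hypothetical page last changed on 1 July 2010 at noon UTC.
last_change = datetime(2010, 7, 1, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_status("Thu, 01 Jul 2010 12:00:00 GMT", last_change))  # 304
```

A 304 response is sent without a body, which is where the bandwidth saving Matt mentions comes from.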
Should Webmasters Use 503 Status Codes for Downtime?
503s can help avoid getting a page that’s under construction or experiencing problems crawled and indexed, which can be a big problem especially for large, popular sites (watch the video for Rand’s example of Disney running into this issue). Matt advocates using 503s in this case. You can’t specify when you’d like Google to re-crawl, but they will come back and won’t index the maintenance content of the page.
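As a rough illustration, a maintenance switch in a small WSGI application might look like the sketch below. The `make_app` helper, the flag, and the page text are all hypothetical, and most sites would set this up in their web server configuration instead.

```python
from http import HTTPStatus


def make_app(maintenance_mode):
    """Build a tiny WSGI app that answers 503 while maintenance_mode is True."""

    def app(environ, start_response):
        if maintenance_mode:
            # 503 Service Unavailable signals a temporary outage.
            status = HTTPStatus.SERVICE_UNAVAILABLE
            body = b"Down for maintenance, please check back soon."
        else:
            status = HTTPStatus.OK
            body = b"Regular page content."
        headers = [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ]
        start_response(f"{status.value} {status.phrase}", headers)
        return [body]

    return app
```

The point of the sketch is only that the maintenance page goes out with a 503 rather than a 200, so the placeholder text is not treated as the real content of the URL.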

Does the Number of Outbound Links from a Page Affect PageRank?
For instance, to conserve "link juice" and/or funnel it more discretely, does it matter whether I have three outbound links versus two? In the original PageRank formula, yes, juice flowed out in a simple formula of Passable PR divided by number of outbound links. But nowadays, Matt says it is a much more cyclical, iterative analysis and, "it really doesn’t make as much difference as people suspect." There’s no need to hoard all of your link juice on your page and, in fact, there may be benefit to generously linking out (not the least of which is the link-building power of good will).
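For reference, the simple split described for the original formula can be sketched in a couple of lines. The function name and the value of 1.0 are illustrative only, and as Matt notes, this is not how the calculation works today.

```python
def pr_per_link(passable_pr, outbound_links):
    """Share of passable PageRank each link receives under the simple split."""
    if outbound_links <= 0:
        raise ValueError("a page needs at least one outbound link")
    return passable_pr / outbound_links


# With a made-up passable PR of 1.0, two links pass half each
# and three links pass a third each.
two_links = pr_per_link(1.0, 2)
three_links = pr_per_link(1.0, 3)
```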
If Google’s seen a Trillion URLs, How Many Do They Pay Attention To?
Since Google crawls in PageRank order, they see the "best" stuff first and avoid a lot of the serious crap. The biggest issue is discovering duplicate or previously banned content. Matt said that about 28% of what they see is duplicate. He also made the careful distinction between "quality" content and "popular" content, further illustrating that traffic isn’t a significant ranking factor: "PR does not reflect popularity in the sense that porn is very popular, but nobody links to porn…(those sites) don’t have the PageRank you’d expect if you went by usage."
Is a Trailing / Important in URL Structure?
Seems like a minor thing, right? Do you use url.com/folder or url.com/folder/ in your URL structure? Matt says he would slightly advocate for using a trailing slash simply because it clearly indicates that a URL is a folder and not a document. That said, Google is quite good at differentiating so it’s not a huge deal.

Does Google Crawl from Multiple Geographic Locations?
Should I be displaying geo-specific content based on user IP? It’s a very popular question among SEOs dealing with international sites and users; but how does it affect what Google sees and what shows up in the SERPs?
Matt confirmed that, "Google basically crawls from one IP address range worldwide because (they) have one index worldwide. (They) don’t build different indices, one for each country."
This means it’s very important to avoid showing significantly different content to users from different countries. As Matt says, "The problem is if you’re showing different content, like French content to French IPs, Googlebot may not see that."
Thus, you want to be sure to send everyone to the same content initially and allow them to navigate to geo-specific areas of your site. While Google has gotten better at submitting dropdowns, working with JavaScript, etc., it is still strongly advised that you provide this geo-targeted navigation via static links.
Is It a Bad Idea to Chain Redirects (e.g. 301–>301–>301)?
"It is, yeah."
Matt was very clear that Google can and usually will deal with one or two redirects in a series, but three is pushing it and anything beyond that probably won’t be followed. He also reiterated that 302s should only be used for temporary redirects…but you already knew that, right?
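If you keep your redirects in a lookup table, a quick audit for long chains could look something like this sketch. The `redirects` mapping, the example paths, and the limit of two hops (taken from Matt's "one or two" remark) are assumptions for illustration.

```python
def redirect_chain(start_url, redirects):
    """Follow a {source: target} redirect map and return every URL visited."""
    chain = [start_url]
    seen = {start_url}
    url = start_url
    while url in redirects:
        url = redirects[url]
        if url in seen:
            raise ValueError(f"redirect loop at {url}")
        seen.add(url)
        chain.append(url)
    return chain


def is_too_long(chain, max_hops=2):
    """True when the chain needs more hops than we are comfortable with."""
    return len(chain) - 1 > max_hops


# Hypothetical site where /old has been moved three times.
redirects = {"/old": "/older", "/older": "/newer", "/newer": "/new"}
chain = redirect_chain("/old", redirects)
print(chain, is_too_long(chain))
```

Any chain this flags can usually be fixed by pointing the first URL straight at the final destination.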
What’s with the Bees?
It’s true, there are bees in Mountain View. A rash of amateur apiculture has sprung up on the Google campus and a few members of the Web Spam Team have caught stinger fever (though not Matt, he prefers cats). Apparently they’ve even gone so far as to color all of the hives in the apiary in Google’s traditional primary colors…what a bunch of geeks.
Well, that was a whole pile of great stuff we were able to get out of Mr. Cutts (and we didn’t even have to ply him with booze)! Now, go venture forth and use your new nuggets of searchy goodness to clobber your competitors.
Another huge thanks to Matt for taking the time to answer our questions so thoughtfully!
Google’s Matt Cutts always offers helpful advice, and our conversation with him at Google I/O was no exception. Cutts catches us up on a variety of search items including Google Squared, PageRank, and the recent redesign to Google’s search results page.
Google Squared is a new tool that puts search results into a spreadsheet-like list. It essentially organizes the results into facts, so users don’t have to click on multiple sites to find what they need. Cutts refers to it as a “sideways query” and points out that it could provide new information for users that they would not have previously found using traditional search.
When we spoke with Cutts earlier this year, he mentioned the growing obsession that SEOs and webmasters have with PageRank. We asked him about it in the above video, and while he did say it was important, he was quick to point out that it was only one of the more than 200 signals Google takes into consideration. He says content, title, url, and proximity are a few of the factors that have additional influence.
Users have also probably noticed the new redesign to the search results page. Cutts says the left-handed navigation was present for a while before the company decided to surface it for all search results.
Interestingly enough, the options are different based on each query. A search for Tom Cruise, for example, would probably return image results in the navigation. On the other hand, a search for President Obama would return real-time results and updates. Cutts says it creates more opportunities for webmasters and SEOs.
Lastly, Cutts did say that Caffeine was coming along nicely and indicated that there would be some announcements regarding it coming soon. Keep watching WebProNews for all the latest details on it.
Late last year, Google’s Matt Cutts told WebProNews that site performance would be a critical factor this year. Since that time, site performance has been a hot topic in the SEO community. Incidentally, Maile Ohye, also from Google, calls this area an “uncharted SEO territory” in a recent interview with WPN.
According to her, simple changes to the front end, such as how you order the style sheets and JavaScript files, can have a big impact on speed and, ultimately, conversions. She references a test that Strangeloop conducted in which it compared the site performance of an optimized site to the site performance on a non-optimized site. The test found that the optimized site had a 16 percent increase in conversions over the non-optimized site.
Ohye explains the importance of ordering style sheets and JavaScript files since it could save visitors seconds when visiting your site. She suggests having statements at the top that bring in the style sheets first followed by the JavaScript files.
For images, she advises webmasters to use image sprites, which are essentially single files that can have multiple images listed throughout the file. This eliminates the task of making file requests for each image. With sprites, webmasters can use CSS to choose which images should display where.
Although the topic of “speed as a ranking factor” has also been getting a lot of press lately, Ohye says users will not wake up one day and find the fastest sites with the highest rankings. She goes on to say that this element of ranking is more suited for sites that are so slow the users are dissatisfied.
Ohye also tells webmasters they can check their own site’s performance by applying the site performance feature in the labs section of Google’s webmaster tools. This tool will tell users how their site compares with all the other sites on the Web.
How is your site’s performance?
Posted by Tom_C
There’s been some talk recently in the SEO industry about ‘crawl allowance’ – it’s not a new concept but Matt Cutts recently talked about it openly with Eric Enge at StoneTemple (and you can see Rand’s illustrated guide too). One big question however is how do you understand how Google is crawling your site? While there are a variety of different ways of measuring this (log files is one obvious solution) the process I’m outlining in this post can be done with no technical knowledge – all you need is:
- A verified Google webmaster central account
- Google Analytics
- Excel
If you want to go down the log-file route then these two posts from Ian Lurie on how to read log files & analysing log files for SEO might be useful. It’s worth pointing out however that just because Googlebot crawled a page it doesn’t necessarily mean that it was actually indexed. This might seem weird but if you’ve ever looked in log files you’ll see that sometimes Googlebot will crawl an insane number of pages but it often takes more than one visit to actually take a copy of the page and store it in its cache. That’s why I think the below method is actually quite accurate, by using a combination of URLs receiving at least 1 visit from Google and pages with internal links as reported by webmaster central. Still, taking your log file data and adding it into the below process as a 3rd data set would make things better (more data = good!).
Anyway, enough theory, here’s a non technical step by step process to help you understand which pages Google is crawling on your site and compare that to which pages are actually getting traffic.
Step 1 – Download the internal links
Go to webmaster central and navigate to the "internal links" section:
Then, once you’re on the internal links page click "download this table":
This will give you the table of pages which Google sees internal links to. Note – for the rest of this post I’m going to be treating this data as an estimate of Google’s crawl. See a brief discussion about this at the top of the post. I feel it’s more accurate than using a site: search in Google. It does have some pitfalls however since what this report is actually telling you is the number of pages with links to them, not the pages which Google has crawled. Still, it’s not a bad measure of Google’s index and only really becomes inaccurate when there are a lot of nofollowed internal links or pages blocked by robots.txt (which you link to).
Step 2 – Grab your landing pages from Google Analytics
This step should be familiar to all of you who have Google Analytics – go into your organic Google traffic report from the last 30 days, display the landing pages and download the data.
Note that you need to add "&limit=50000" into the URL before you hit "export as CSV" to ensure you get as much data as possible. If you have more than 50000 landing pages then I suggest you either try a shorter date range or a more advanced method (see my reference to log files above).
Step 3 – Put both sets of data in excel
Now you need to put both of these sets of data into excel – I find it helpful to put all of the data into the same sheet in Excel but it’s not actually necessary. You’ll have something like this with link data for your URLs from webmaster central on the left and the visits data from Google Analytics on the right:
Step 4 – Vlookup ftw
Gogo gadget vlookup! The vlookup function was made for data sets like this and easily lets you look up the values in one data set against another data set. I advise running a vlookup twice for each data set so we get something like this:
Note – that there may be some missing data in here depending on how fresh the content is on your site (this is possibly enough room for a whole separate post on this topic) so you should then find and replace ‘#N/A’ with 0.
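For anyone who would rather script this step than do it in Excel, the same lookup can be sketched with plain dictionaries. The function name and the sample URLs and numbers are invented for illustration; the real inputs would be the two exports from steps 1 and 2.

```python
def combine(links_by_url, visits_by_url):
    """Join the two exports on URL, filling missing values with 0.

    Returns {url: (internal_links, visits)} for every URL in either export,
    which mirrors running the lookup both ways and replacing '#N/A' with 0.
    """
    all_urls = set(links_by_url) | set(visits_by_url)
    return {
        url: (links_by_url.get(url, 0), visits_by_url.get(url, 0))
        for url in sorted(all_urls)
    }


# Made-up sample data.
links = {"/": 120, "/category/": 45, "/search?q=x": 10}
visits = {"/": 900, "/category/": 300, "/new-page/": 5}
for url, (link_count, visit_count) in combine(links, visits).items():
    print(url, link_count, visit_count)
```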
Step 5 – Categorise your urls
Now, for the purposes of this post we’re not interested in a URL by URL approach, we’re instead looking at a high level analysis of what’s going on so we want to categorise our URLs. Now, the more detail you can go into at this step the better your final data output will be. So go ahead and write a rule in excel to assign a category to your URLs. This could be anything from just following a folder structure or it could be more complex based on query string etc. It really depends on how your site structure works as to the best way of doing it so I can’t write this rule for you unfortunately. Still, once this is done you should see something like this:
If you’re struggling to build an excel rule for your pages and your site follows a standard site.com/category/sub-category/product URL template then a really simple categorisation would be to just count the number of ‘/’s in the URL. It won’t tell you which category the URL belongs to but it will at least give you a basic categorisation of which level the page sits at. I really do think it’s worth the effort to a) learn excel and b) categorise your URLs well. The better data you can add at this stage the better your results will be.
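The slash-counting idea can also be sketched in code. This variation counts path segments rather than raw slashes, so the two slashes in "http://" and any trailing slash do not skew the result. It assumes full URLs including the scheme, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def url_depth(url):
    """Number of non-empty path segments, used as a rough category level."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


print(url_depth("http://site.com/category/sub-category/product"))  # 3
```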
Step 6 – Pivot table Excel Ninja goodness
Now, we need the magic of pivot tables to come to our rescue and tell us the aggregated information about our categories. I suggest that you pivot both sets of data separately to get the data from both sources. Your pivot should look something like this for both sets of data:
It’s important to note here that what we’re interested in is the COUNT of the links from webmaster central (i.e. the number of pages indexed) rather than the SUM (which is the default). Doing this for both sets of data will give you something like the following two pivots:
And:
Step 7 – Combine the two pivots
Now what we want to do is take the count of links from the first pivot (from webmaster central) and the sum of the visits from the second pivot (from Google Analytics), to produce something like this:
Generating the 4 columns on the right is really easy by just looking at the percentages and ratios of the first 3 columns.
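Steps 6 and 7 can be approximated in code as well: count the pages per category from the internal links export, sum the visits per category from analytics, then work out each category's share. This is only a sketch; the column names and sample numbers are made up, and the percentage columns are my guess at a useful subset of the ratios.

```python
from collections import Counter


def combine_pivots(crawled_pages, landing_visits):
    """Build a per-category report of crawl share versus traffic share.

    crawled_pages: list of category labels, one per URL in the links export.
    landing_visits: list of (category, visits) pairs from analytics.
    """
    pages = Counter(crawled_pages)          # COUNT of pages per category
    visits = Counter()
    for category, count in landing_visits:  # SUM of visits per category
        visits[category] += count
    total_pages = sum(pages.values())
    total_visits = sum(visits.values())
    report = {}
    for category in sorted(set(pages) | set(visits)):
        p, v = pages[category], visits[category]
        report[category] = {
            "pages": p,
            "visits": v,
            "pct_crawl": 100 * p / total_pages if total_pages else 0.0,
            "pct_visits": 100 * v / total_visits if total_visits else 0.0,
        }
    return report


# Made-up sample: four crawled pages and 100 organic visits.
crawled = ["category", "category", "category", "search"]
traffic = [("category", 90), ("category", 8), ("search", 2)]
for name, row in combine_pivots(crawled, traffic).items():
    print(name, row)
```

A category whose share of the crawl is much larger than its share of visits is the kind of row worth a closer look.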
Conclusions
25% of the crawl allowance accounts for only 2% of the overall organic traffic
So, what should jump out at us from this site here is that the ‘search’ pages and ‘other’ pages are being quite aggressively crawled with 25% of the overall site crawl between them yet they only account for 2% of the overall search traffic. Now in this particular example this might seem like quite a basic thing to highlight. After all, a good SEO will be able to spot search pages being crawled by doing a site review, but being able to back this up with data makes for good management-friendly reports and will also help analyse the scope of the problem. What this report also highlights is that if your site is maxing out its crawl allowance then reclaiming that 25% of your crawl allowance from search pages may lead to an increase in the number of pages crawled from your category pages, which are the pages that pull in good search traffic.
Update: Patrick from Branded3 has just written a post on this very topic – Patrick’s approach using separate XML sitemaps for different site sections is well worth a read and complements what I’ve written about here very nicely.
Earlier this week, Google’s Matt Cutts announced that he was challenging himself to not answer outside email for 30 days. To keep you from going into Cutts withdrawal, then, here’s an interview conducted at SMX West. Cutts even took questions from a virtual crowd during the course of it.
This wasn’t quite an all-search, all-the-time study session, though. Cutts and Mike McDonald spent a little while talking about college basketball at the beginning, and by the end of the talk, had both expressed their appreciation of Star Wars.
As for what happened in between, Cutts fielded questions individuals posed via Twitter. One question related to local business center quality, and Cutts indicated that members of his team are working on it and steady improvements can be expected.
Another inquiry touched on the idea of meta tags at the bottom of a page, and Cutts responded, “Normally, your meta tags need to go in the head section of the HTML. That’s definitely the preferred way to do it.” Otherwise, random people can leave metatags in the comments.
Then one more question addressed the importance of backlinks for SEO. Cutts said, “They don’t hurt. . . . I definitely recommend having good page content, as well, but backlinks can certainly help with your search rankings.”
