The Google Panda update – as well as the hundreds of other algorithm tweaks Google has been making lately – was ostensibly done in the name of improving the “quality” of their search results. But although Google has given us plenty of clues as to what constitutes a quality site (see Amit Singhal’s now iconic list of 23 quality-based questions for more details), the question remains whether or not Google’s engineers will ever be able to effectively codify the concept of site quality into measurable metrics.
Think about this challenge for a second… Imagine a high quality website in your mind. What makes the site high quality? Is it the number of words on the page? Maybe, but if this metric were the only one built into Google’s algorithm as the basis for a quality analysis, spammers and black hat site owners would have a field day launching pages full of gibberish.
Instead, you might use terms like “relevant”, “applicable”, “interesting” or “authoritative” to describe content that you define as good. But keep in mind that the search engine spiders don’t have a “gut feeling” and they don’t have the luxury of classifying sites based on these qualitative assessments. Instead, their decisions must be based on numbers or other filters that can be applied automatically and uniformly to every page.
This challenge is at the heart of search science, and the number of scraper sites that continue to dominate some search queries shows that while Google Panda may have made a few improvements to the quality of the SERPs, the overall problem hasn’t been resolved yet.
But in order to understand how Google is approaching this problem of codifying search quality, we first need to look at some of the processes the company uses when determining the order of the different results that appear in the SERPs. From that, we can extrapolate some variables that could eventually play a role in quantifying search quality. By attempting to understand what Google’s future moves will be, we can better prepare our sites to succeed in the search engines both in the present and in the future.
The key to understanding this process is to understand the role of machine learning in Google’s algorithm. According to Ninja Bonnie of the Internet Marketing Ninjas:
“Machine learning is using a computer to recognize patterns in data, to then make predictions about new data, based on the pattern recognized or learned from prior chosen training datasets.”
Essentially, machine learning involves training an algorithm based on defined datasets. In Google’s case, there are two types of information that can be used in this process – the vast history of usage and behavior data that Google has accumulated over the years, as well as new datasets created by the groups of test users and raters that Google uses to manually identify quality sites.
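To make the idea of “training on a defined dataset” concrete, here is a minimal sketch of one of the simplest learning methods, a nearest-centroid classifier. Everything in it is an assumption made for illustration: the three features, the numbers and the “good”/“bad” labels from an imaginary rater are invented, and nothing here reflects how Google’s models actually work.

```python
from math import dist

# Hypothetical training data: (word count in hundreds, ads per page,
# seconds to load), each labeled by an imaginary human rater.
TRAINING_PAGES = [
    ((12.0, 1.0, 1.5), "good"),
    ((9.0, 2.0, 2.0), "good"),
    ((15.0, 1.0, 1.0), "good"),
    ((2.0, 8.0, 6.0), "bad"),
    ((3.0, 9.0, 5.0), "bad"),
    ((1.0, 7.0, 7.0), "bad"),
]


def train(pages):
    """Learn one average feature vector (centroid) per label."""
    sums, counts = {}, {}
    for features, label in pages:
        totals = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            totals[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(total / counts[label] for total in totals)
        for label, totals in sums.items()
    }


def predict(model, features):
    """Give a new page the label of the closest learned centroid."""
    return min(model, key=lambda label: dist(model[label], features))


model = train(TRAINING_PAGES)
print(predict(model, (10.0, 2.0, 2.0)))  # a page the model has never seen
print(predict(model, (2.0, 10.0, 6.0)))
```

The point is the workflow rather than the method: a pattern is learned from pages that people have already rated, and that pattern is then used to make a prediction about a page nobody has rated.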
Interestingly enough, Google hasn’t always used the process of machine learning in its live search results. In a 2008 post on Datawocky, Anand Rajaraman describes a meeting he had with Google’s former Director of Search Quality, Peter Norvig, saying:
“The big surprise is that Google still uses the manually-crafted formula for its search results. They haven’t cut over to the machine learned model yet.”
In the article, Rajaraman posits that the reason machine learning hadn’t yet gone live was that testing had revealed weaknesses in the model, resulting in “catastrophic errors” when it was used to predict results for phenomena that fall outside of the standard bell curve model. Basically, because the algorithm could only be trained on a small set of potential and past searches, Google’s engineers weren’t confident the model would hold up under real-world circumstances.
Obviously, things have changed, as the launch of the Panda algorithm represents the biggest shift thus far in Google’s integration of machine learning models into its search results. According to Rand Fishkin, speaking in a video on the SEOmoz blog, much of this change can be attributed to Google engineer Navneet Panda, who helped to make the machine learning model more scalable and applicable to a wider range of potential search queries.
“Basically before Panda, machine learning scalability at Google was at level X, and after it was at the much higher level Y.”
So what we know about Google’s machine learning integration is this: Google has begun to use machine learning models to generate live search results, and it is basing these models, in part, on data generated by user groups assigned to assess the quality of a site using Singhal’s 23 questions (although there are undoubtedly other factors at play as well).
But because Singhal’s questions revolve around qualitative assessments (for example, “Would you trust the information presented in this article?” or “Is the site a recognized authority on its topic?”), it’s safe to assume that the machine learning models are translating the different elements found on the “good” pages versus the “bad” ones into numeric signals that can be applied across large volumes of search queries.
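As a rough illustration of what “translating page elements into numeric signals” could look like, the sketch below reduces a page to a handful of numbers. The function name `extract_signals` and every signal in it are hypothetical stand-ins chosen for this example. Google has not published its signals, so treat this as a thought experiment rather than a description of the real algorithm.

```python
import re


def extract_signals(text, ad_count, load_seconds):
    """Turn a page into numbers. Every signal here is a hypothetical stand-in."""
    words = re.findall(r"[a-z']+", text.lower())
    # Share of distinct words: repetitive, keyword-stuffed copy scores low.
    unique_ratio = len(set(words)) / len(words) if words else 0.0
    # Ad density relative to the amount of actual copy on the page.
    ad_density = 100 * ad_count / len(words) if words else 0.0
    return {
        "word_count": len(words),
        "unique_word_ratio": round(unique_ratio, 2),
        "ads_per_100_words": round(ad_density, 2),
        "load_seconds": load_seconds,
    }


thin_page = "Cheap pills cheap pills cheap pills buy now"
print(extract_signals(thin_page, ad_count=4, load_seconds=5.5))
```

Once pages are expressed this way, a rater’s “I wouldn’t trust this” can be paired with a row of numbers, which is the kind of input a machine learning model can actually work with.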
Of course, Google won’t release these exact signals, but we can make some extrapolations based on the sites that suffered most in the past six months of Panda rollouts. Keep in mind that these are only assumptions and that what works on one site may not work on another. When making changes to your site, always test and track your results to ensure that these suppositions hold true for you.
Spectrum of Topics on Your Site
The Google Panda update is supposed to be all about “authority” – that is, rewarding pages built by legitimate authorities in their industries. Since it’s nearly impossible to be an authority in every single industry, sites that focus on one narrow topic could be ranked above those addressing a wide spectrum of issues. This could be one of the reasons for the major hit to content aggregators like HubPages, EzineArticles and Demand Media.
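One way topical focus could be put into a number is Shannon entropy over the topics a site covers: a site that sticks to one subject scores 0, and the score rises as articles spread evenly across more subjects. This is purely a sketch of how such a measurement might work. The function name `topic_entropy` and the topic labels are made up for the example, and there is no evidence that Google measures focus this way.

```python
from collections import Counter
from math import log2


def topic_entropy(article_topics):
    """Shannon entropy (in bits) of a site's topic mix; 0.0 means one topic only."""
    counts = Counter(article_topics)
    total = len(article_topics)
    return sum((n / total) * log2(total / n) for n in counts.values())


# Hypothetical topic labels, one per published article.
niche_site = ["seo", "seo", "seo", "seo"]
aggregator = ["seo", "cooking", "cars", "travel"]

print(topic_entropy(niche_site))   # 0.0
print(topic_entropy(aggregator))   # 2.0
```

If you want to test this idea on your own site, tagging each article with a single primary topic and tracking the score over time is a cheap way to see how far your content has drifted from its core subject.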
The natural result of applying Singhal’s 23 questions to a set of websites will be that those sites that focus on providing a good experience for their visitors will be ranked highest. And according to most SEO experts, this comes down to three things: site design, site navigation and site speed (all of which could be easily quantified by the search engine algorithms).
Put simply, if your site hasn’t been touched in the 10 years since you put it up, chances are you aren’t catering to your users’ experience as much as Google would like you to. And if your site is difficult to navigate and slow to load, you can bet that you won’t fare well when machine learning algorithms trained on user-generated quality datasets decide where you should rank.
Social, Social, Social!
We’ve beaten it to death here, but there’s no doubt that social signals play a role in both the current post-Panda SERPs and in Google’s machine learning models. Imagine that you were one of Google’s website testers. Wouldn’t you be more likely to rate a site higher on the “authority” scale if the webmaster had an active social networking presence than if he or she had no presence on these sites?
Social signals can be quantified in terms of the number of inbound links pointing to your site from social media sites, the number of fans or followers you have on a given network, or even how frequently you post to these sites. And again, although this is all speculation as to what factors Google could be quantifying in order to shape its new post-Panda algorithm, it’s worth testing changes to these elements in order to bring your site into better alignment with Google’s long-term goals.
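If you want to track these three numbers for your own site while you test, a small script is enough. The sketch below is only a starting point under stated assumptions: the list of social domains, the function names and the 30-day posting window are all arbitrary choices made for illustration, not anything Google has confirmed.

```python
from urllib.parse import urlparse

# Illustrative list only; extend it with whichever networks matter to you.
SOCIAL_DOMAINS = {"twitter.com", "facebook.com", "linkedin.com"}


def count_social_links(inbound_urls):
    """Count inbound links whose host is one of the listed social sites."""
    count = 0
    for url in inbound_urls:
        host = urlparse(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]
        if host in SOCIAL_DOMAINS:
            count += 1
    return count


def social_signals(inbound_urls, followers, posts_last_30_days):
    """Bundle the three measurements discussed above into one record."""
    return {
        "social_inbound_links": count_social_links(inbound_urls),
        "followers": followers,
        "posts_per_week": round(posts_last_30_days * 7 / 30, 2),
    }


links = [
    "https://twitter.com/someone/status/1",
    "https://www.facebook.com/somepage",
    "https://example.com/blog/post",
]
print(social_signals(links, followers=1500, posts_last_30_days=12))
```

Recording a snapshot like this before and after each change gives you a baseline to compare against your rankings, which is the only reliable way to find out whether any of these factors matter for your site.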