Google has multiple ways to detect duplicate content
Google does not like duplicate content and often penalizes pages and websites that publish it.
But how does Google detect duplicate content?
Well, the obvious method is for search engine crawlers to crawl each web page, read and analyze the contents of the page, and decide if the page has duplicate content.
But that is not the only method Google uses.
To prevent unnecessary crawling by search engine crawlers, Google also uses a predictive method that detects likely duplicate content based on URL patterns alone.
This piece of information was shared by Google’s John Mueller in a recent Google Search Central SEO hangout.
In this blog post, we share what John Mueller said, how Google’s predictive detection method works, and what SEO professionals and content marketers can do to ensure their content does not get incorrectly flagged as duplicate content.
John Mueller on detecting duplicate content
Here is what Google’s John Mueller said while explaining how Google predicts duplicate content:
“What tends to happen on our side is we have multiple levels of trying to understand when there is duplicate content on a site. And one is when we look at the page’s content directly and we kind of see, well, this page has this content, this page has different content, we should treat them as separate pages.
The other thing is kind of a broader predictive approach that we have where we look at the URL structure of a website where we see, well, in the past, when we’ve looked at URLs that look like this, we’ve seen they have the same content as URLs like this. And then we’ll essentially learn that pattern and say, URLs that look like this are the same as URLs that look like this.”
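To make the first level concrete, here is a minimal Python sketch of content-level duplicate grouping. It assumes a simple normalized-text checksum, which only catches exact duplicates; Google’s actual systems are far more sophisticated and also handle near-duplicates.

```python
import hashlib


def content_fingerprint(page_text: str) -> str:
    """Reduce a page's visible text to a fingerprint for duplicate comparison.

    A plain hash of normalized text is a simplified stand-in: it only catches
    exact duplicates, not near-duplicates.
    """
    normalized = " ".join(page_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_exact_duplicates(pages: dict[str, str]) -> dict[str, list[str]]:
    """Group URLs whose page text produces the same fingerprint."""
    groups: dict[str, list[str]] = {}
    for url, text in pages.items():
        groups.setdefault(content_fingerprint(text), []).append(url)
    return {fp: urls for fp, urls in groups.items() if len(urls) > 1}
```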
As mentioned earlier, John Mueller explained that the purpose of this predictive method is to save crawling resources:
“Even without looking at the individual URLs we can sometimes say, well, we’ll save ourselves some crawling and indexing and just focus on these assumed or very likely duplication cases.”
John also shared an example: automobile websites that reuse nearly identical content across pages that differ only by the city name in the URL. Google’s predictive algorithm can detect such patterns (using city names in the URL when the content does not actually change) and flag those pages as duplicate content.
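As an illustration of the pattern-based approach, the rough Python sketch below groups URLs that differ only by a city segment. The /used-cars/{city}/... URL structure is a hypothetical example; this is not how Google’s crawler actually works, only the general idea of predicting duplicates from URLs alone, before fetching any content.

```python
import re
from collections import defaultdict

# Hypothetical URL structure for illustration: /used-cars/{city}/{listing-slug}
CITY_SEGMENT = re.compile(r"^/used-cars/([a-z-]+)/(.+)$")


def url_pattern(url_path: str) -> str | None:
    """Collapse the city segment so URLs that differ only by city share a pattern."""
    match = CITY_SEGMENT.match(url_path)
    if not match:
        return None
    _city, rest = match.groups()
    return f"/used-cars/{{city}}/{rest}"


def predict_duplicates(url_paths: list[str]) -> dict[str, list[str]]:
    """Group URLs by pattern; large groups are "likely duplicate" candidates
    that a crawler might deprioritize without ever comparing page content."""
    groups: defaultdict[str, list[str]] = defaultdict(list)
    for path in url_paths:
        pattern = url_pattern(path)
        if pattern:
            groups[pattern].append(path)
    return {p: urls for p, urls in groups.items() if len(urls) > 1}
```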
What can SEOs do?
Now comes the big question: what can SEOs do to make sure their content is not incorrectly flagged as duplicate?
John shared a few best practices:
“What I would try to do in a case like this is to see if you have this kind of situation where you have strong overlaps of content and to try to find ways to limit that as much as possible …
That could be by using something like a rel=canonical on the page and saying, well, this small city that is right outside the big city [in case you have an events website with each page discussing multiple events happening nearby], I’ll set the canonical to the big city because it shows exactly the same content.
So that really every URL that we crawl on your website and index, we can see, well, this URL and its content are unique and it’s important for us to keep all of these URLs indexed.”
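As a rough illustration of that advice, the sketch below generates rel=canonical tags that point small-city event pages at the nearby big-city page showing the same content. The city slugs and the example.com domain are hypothetical; which pages should share a canonical is a decision the site owner has to make.

```python
# Hypothetical mapping for illustration: small-city event pages that show the
# same listings as the nearby big city point their canonical at the big-city URL.
CANONICAL_CITY = {
    "smallville": "metropolis",
    "shelbyville": "springfield",
}


def canonical_link_tag(city_slug: str, base: str = "https://example.com/events") -> str:
    """Return the <link rel="canonical"> tag to place in the page's <head>."""
    target = CANONICAL_CITY.get(city_slug, city_slug)
    return f'<link rel="canonical" href="{base}/{target}/">'


print(canonical_link_tag("smallville"))
# <link rel="canonical" href="https://example.com/events/metropolis/">
```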