Accidentally creating duplicate content in Drupal is like... a cold:
Catching it is as easy as falling off a log.
All it takes is to:
- republish your valuable content on other websites, thus challenging Google with 2 or more identical pieces of content
- move your website from HTTP to HTTPS but skip some key steps in the process, so that the HTTP version of your Drupal site is still there, “lurking in the dark”
- have printer-friendly versions of your Drupal pages and thus dare Google to face another duplicate content “dilemma”
So, what are the “lifebelts” or prevention tools that Drupal “arms” you with for handling this thorny issue?
Here are the 4 modules to use for boosting your site's immune system against duplicate content and for getting it fixed once the harm has already been done.
1. But How Does It Crawl into Your Website? Main Sources of Duplicate Content
Let's get down to the nitty-gritty of how Drupal 8 duplicate content “infiltrates” your website.
But first, here are the 2 major categories that these sources fall into: malicious and non-malicious.
The malicious ones include all those scenarios where spammers post content from your website without your consent.
The non-malicious duplicate content can come from:
- discussion forums that create both standard and stripped-down pages (for mobile devices)
- printer-only web page versions, as already mentioned
- items displayed on multiple pages of the same e-commerce site
Also, duplicate content in Drupal can be either:
- identical
- or appreciably similar
And since it comes in “many stripes and colors”, here are the 7 most common types of duplicate content:
1.1. Scraped Content
Has someone copied content from your website and republished it? Do not expect Google to reliably distinguish the copy from its source.
That said, it's your job and yours only to stay diligent and protect the content on your Drupal site from scrapers.
1.2. WWW and non-WWW Versions of Your Website
Are there 2 identical versions of your Drupal website available? A www and a non-www one?
Now, that's enough to ring Google's “duplicate content in Drupal” alarm.
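To make the www vs. non-www issue more tangible, here is a minimal Python sketch that maps both variants of a URL onto one preferred host. The function name `canonical_url` and the `example.com` host are hypothetical, chosen only for illustration. On a real Drupal site you would typically enforce this with a server-level or module-level redirect rather than a script.

```python
from urllib.parse import urlsplit, urlunsplit


def _strip_www(host: str) -> str:
    """Return the host without a leading 'www.' prefix."""
    return host[4:] if host.startswith("www.") else host


def canonical_url(url: str, preferred_host: str = "example.com") -> str:
    """Map the www and non-www variants of a URL onto one preferred host.

    'preferred_host' is an assumption: pick whichever variant your site uses.
    URLs that belong to other hosts are returned unchanged.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if _strip_www(host) == _strip_www(preferred_host.lower()):
        host = preferred_host.lower()
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    for candidate in ("https://www.example.com/blog", "https://example.com/blog"):
        print(candidate, "->", canonical_url(candidate))
```

Both URLs in the usage example end up as the same address, which is exactly what a search engine needs to see: one page, one URL.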
1.3. Widely Syndicated Content
So, you've painstakingly put together a list of article submission sites to give your valuable content (blog post, video, article etc.) more exposure.
And now what? Should you just cancel promoting it?
Not at all! Widely syndicated content risks landing on Google's “Drupal 8 duplicate content” radar only if you set no guidelines for those third-party websites.
That is, when these publishers don't place any canonical tags on the republished pages, pointing to the original source of your content.
What happens when you overlook such a content syndication agreement? You leave it entirely to Google to track down the source. To scan through all those websites and blogs that your piece of content gets republished on.
And oftentimes it fails to tell the original from its copy.
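If you want to spot-check whether a republished page carries a canonical tag pointing back to your original, a small script can help. The sketch below uses only Python's standard library; the names `CanonicalFinder` and `points_to_original`, as well as the sample URLs, are hypothetical and serve only as an illustration.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def points_to_original(html: str, original_url: str) -> bool:
    """Return True if the page's canonical tag matches the original URL exactly."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical == original_url


if __name__ == "__main__":
    # Hypothetical markup of a syndicated copy of your article.
    republished = (
        '<html><head><link rel="canonical" '
        'href="https://example.com/blog/my-post" /></head></html>'
    )
    print(points_to_original(republished, "https://example.com/blog/my-post"))
```

The comparison is a strict string match, so a trailing slash or a different protocol counts as a mismatch. That strictness is a deliberate simplification of this sketch.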
1.4. Printer-Friendly Versions
This is probably one of the sources of duplicate content in Drupal that seems most... harmless to you, right?
And yet, for search engines, multiple printer-friendly versions of the same content translate as: duplicate pages.
1.5. HTTP and HTTPS Pages
Have you made the switch from HTTP to HTTPS, yet both versions of your pages are still reachable?
Or are there:
- backlinks from other websites still leading to the HTTP version of your website?
- internal links on your current HTTPS website still carrying the old protocol?
Make sure you detect all these less obvious sources of identical content served under different URLs on your Drupal website.
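Internal links that still carry the old protocol are easy to miss by eye. Here is a minimal sketch, standard library only, that lists the `http://` links to your own domain found in a page's HTML. The names `InsecureLinkFinder` and `find_insecure_links` and the `example.com` domain are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class InsecureLinkFinder(HTMLParser):
    """Collect <a href> values that still point at the http:// version of a site."""

    def __init__(self, site_host: str):
        super().__init__()
        self.site_host = site_host.lower()
        self.insecure_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parts = urlsplit(href)
        own_hosts = (self.site_host, "www." + self.site_host)
        if parts.scheme == "http" and parts.netloc.lower() in own_hosts:
            self.insecure_links.append(href)


def find_insecure_links(html: str, site_host: str) -> list:
    """Return every internal link in 'html' that uses plain HTTP."""
    finder = InsecureLinkFinder(site_host)
    finder.feed(html)
    return finder.insecure_links


if __name__ == "__main__":
    sample = (
        '<a href="http://example.com/about">About</a>'
        '<a href="https://example.com/blog">Blog</a>'
        '<a href="/contact">Contact</a>'
    )
    print(find_insecure_links(sample, "example.com"))
```

Relative links such as `/contact` are ignored on purpose, since they inherit the protocol of the page they sit on.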
1.6. Appreciably Similar Content
Your site is particularly vulnerable to this type of duplicate content “threat” if it's an e-commerce one.
Just think of all those all too common scenarios where you display highly similar product descriptions on several different pages of your online store.
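How similar is “appreciably similar”? Search engines do not publish a threshold, but you can get a rough feel for it by comparing two descriptions yourself. The sketch below computes a Jaccard similarity over 3-word “shingles”. This is a common near-duplicate heuristic, not Google's actual method, and the function names and sample texts are made up for illustration.

```python
def shingles(text: str, size: int = 3) -> set:
    """Split a text into overlapping word sequences ("shingles") of a given size."""
    words = text.lower().split()
    count = max(len(words) - size + 1, 1)
    return {tuple(words[i:i + size]) for i in range(count)}


def jaccard_similarity(text_a: str, text_b: str, size: int = 3) -> float:
    """Share of shingles that two texts have in common (0.0 to 1.0)."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    red = "soft cotton t-shirt available in red"
    blue = "soft cotton t-shirt available in blue"
    print(round(jaccard_similarity(red, blue), 2))
```

For the two sample descriptions, 3 of the 5 distinct shingles are shared, which gives a score of 0.6. Where you draw the line between “similar” and “too similar” is a judgment call for your own catalog.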
1.7. User Session IDs
Users themselves can unintentionally generate duplicate content on your Drupal site.
How? Each visitor may get a different session ID appended to the URL, so the very same page ends up available under an ever-growing number of URLs.
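To see why session IDs multiply URLs, consider this minimal sketch that strips session parameters from a query string. The parameter names in `SESSION_PARAMS` are assumptions (common conventions, not a list taken from Drupal), and `strip_session_id` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: typical session parameter names; adjust to what your site emits.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def strip_session_id(url: str) -> str:
    """Return the URL without any session-related query parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    visitor_a = "https://example.com/shop?item=7&PHPSESSID=abc123"
    visitor_b = "https://example.com/shop?item=7&PHPSESSID=xyz789"
    # Two "different" URLs collapse into one once the session ID is gone.
    print(strip_session_id(visitor_a) == strip_session_id(visitor_b))
```

The usage example shows two visitors' URLs collapsing into one, which is the single address you want search engines to index.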
2. 4 Modules at Hand to Identify and Fix Duplicate Content in Drupal
What are the tools that Drupal puts at your disposal to detect and eliminate all duplicate content?
2.1. Redirect Module
Imagine all the functionality of the former Global Redirect module (Drupal 7) “injected” into this Drupal 8 module!
In fact, you can still define your Global Redirect features by just:
- accessing the Redirect module's configuration page
- clicking on “URL redirects”
And here is what the module enables you to do:
- create new redirects
- identify broken URL paths (you'll need to enable the “Redirect 404” sub-module for that)
- set up domain-level redirects (use the “Redirect Domain” sub-module)
- import redirects
Summing up: when it comes to handling duplicate content in Drupal, this module helps you redirect all your old URLs to the new paths that you have set up.
This way, you avoid the risk of having the very same content displayed on multiple URL paths.
2.2. Taxonomy Unique Module
How about “fighting” duplicate content on your website at a vocabulary level?
In this respect, this Drupal 8 module:
- prevents you from saving a taxonomy term that already exists in that vocabulary
- is configurable for every vocabulary on your Drupal site
- allows you to set custom error messages that would pop up whenever a duplicate taxonomy term is detected in the same vocabulary
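The rule described above is simple enough to sketch in a few lines. The Python snippet below is only an illustration of the idea, not the module's code: the `Vocabulary` class, the `DuplicateTermError` exception and the case-insensitive comparison are all assumptions made for this example.

```python
class DuplicateTermError(ValueError):
    """Raised when a term already exists in the vocabulary."""


class Vocabulary:
    """A toy vocabulary that refuses to store the same term twice."""

    def __init__(self, name: str,
                 error_message: str = "Term '{term}' already exists in '{vocabulary}'."):
        self.name = name
        self.error_message = error_message  # custom message, set per vocabulary
        self._terms = {}

    def add_term(self, term: str) -> None:
        # Assumption: terms are compared ignoring case and surrounding spaces.
        key = term.strip().casefold()
        if key in self._terms:
            raise DuplicateTermError(
                self.error_message.format(term=term.strip(), vocabulary=self.name)
            )
        self._terms[key] = term.strip()


if __name__ == "__main__":
    tags = Vocabulary("Tags", error_message="Oops: '{term}' is already a tag.")
    tags.add_term("Drupal")
    try:
        tags.add_term("drupal ")
    except DuplicateTermError as error:
        print(error)
```

Note that uniqueness is checked per vocabulary: the same term can still live in two different vocabularies.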
2.3. PathAuto Module
Just admit it now:
How much do you hate the /node/125 type of URL paths?
They're anything but user-friendly.
And this is precisely the role that Pathauto has been invested with:
To automatically generate content-friendly path aliases (e.g. /blog/my-node-title) for a whole variety of content.
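To illustrate how a title can turn into a friendly alias, here is a minimal Python sketch. It is not Pathauto's actual algorithm: the `slugify` and `build_alias` helpers are hypothetical, and the `[node:title]` token is handled with a plain string replacement rather than Drupal's token system.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Lowercase a title, drop accents and replace other characters with hyphens."""
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def build_alias(pattern: str, title: str) -> str:
    """Fill a simplified alias pattern with the slugified title."""
    return pattern.replace("[node:title]", slugify(title))


if __name__ == "__main__":
    print(build_alias("/blog/[node:title]", "My Node Title!"))
```

With the sample pattern, the title “My Node Title!” becomes /blog/my-node-title, the kind of alias mentioned above.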
Let's say that you want to modify the current “path scheme” on your website without breaking the existing URLs (you don't want the change to affect users' bookmarks or to “intrigue” the search engines).
Paired with the Redirect module, Pathauto can automatically redirect the old URLs to the new paths, using the HTTP redirect status of your choice.
2.4. Intelligent Content Tools Module
Personalization is key when you strive to prevent duplicate content in Drupal, right?
And this is precisely what this module here does: it helps you personalize content on your website.
How? Through its 3 main functionalities delivered to you as sub-modules:
- auto tagging
- text summarizing
- detecting plagiarized content
Leveraging Natural Language Processing, this last sub-module scans the content on your website and alerts you to any signs of duplication that it detects.
Word of caution: keep in mind that the module is not yet covered by Drupal's security advisory policy!
3. To Sum Up
Setting a goal to ensure 100% unique content on your website is as realistic as... learning a new language in a week.
Instead, you should consider setting up a solid strategy “fueled” by (at least) the 4 modules presented here. One that would help you avoid specific scenarios where entire pages or clusters of pages get duplicated.
Now, that's a far less utopian goal to set, don't you think?