Let me skip the waffle – show me the template

Have you ever looked for one of your pages in the search results and been unable to find it?

And then you head over to Google Search Console and realise it’s not indexed? And then discover that many of your pages aren’t indexed?

A screenshot of Google Search Console showing 11.3k URLs not indexed and 2.1k URLs indexed.

In GSC, Google shares many reasons why a page might not have been indexed, but all of those rely on Google having found the page in the first place.

To clear up some immediate issues, we at PushON use our indexability analysis template to plug in information from a crawl and GSC.

Keep reading to learn more about what may be happening, what it all means, and how to fix it!

So, why isn’t Google indexing my page?

Or any other search engine, for that matter.

The trouble is there are many reasons your page might not be indexed. They range from technical reasons, like whether or not your page can be found by search engine crawlers, all the way through to content reasons. Today, we are only going to focus on the technical reasons your page might have been left out of the index, so if you realise it’s not any of those, you may want to revisit your content!

An important thing to remember here is that even if a page has an index meta robots tag (more on that later), other things may still be stopping it from getting indexed. Getting everything out on one page – or one sheet – helps determine where the problem may be.

How do search engines index pages?

Before we get ahead of ourselves, let’s briefly discuss how search engines visit and index pages in the first place. The emphasis here is on brief, so if you want more in-depth guidance, please look at Google’s documentation or this great breakdown from Lumar.

Like a regular human user, search engines will visit a web page and look at the content. They may not see all of the visual beauty of your web page, like a regular human will, but they will see the page code. There, they’ll look for links (both internal and external to your website) and head over to look at those. Outside of those regular web pages, search engines will also visit your robots.txt file and sitemap.

With each page they visit, search engines make a note of the content so they know what the page is about, but they also pay attention to a series of technical signals that identify whether the page can even be indexed in their search results.

What are those technical indexing signals?

There are many ways you can directly and indirectly tell search engines whether to index a page, and we’ll cover some of the core ones below. We’ve split these into hard signals, i.e. search engines tend to obey these explicitly, and soft signals, i.e. search engines take these as a hint or a suggestion. Both are important!

Hard Indexing Signals

Status Codes

We won’t dwell on these, as explaining the specifics of each status code is its own article.

Essentially, a status code is the server’s response when a page is requested. Only one is firmly indexable: a 200 status, which says, “Yes! Here is the page; the request is successful”.

Client error codes are in the 400-499 range, and the one you’ll be most familiar with is the 404 “Not Found” response. These won’t be indexable.

Redirects fall under 300-399, and some of these are indexable. Kinda. A 302 or a 307 means the requested page has a temporary change or redirect, so search engines carry on as usual; a 304 (“Not Modified”) relates to caching and simply means the page hasn’t changed since the last visit, so again, carry on as normal.

Anything in the 500-599 range indicates a server issue, and it certainly isn’t indexable.
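
If you want to check a page’s status code for yourself, a quick request from the command line will show it. A minimal sketch using curl – the URL here is just a placeholder, so swap in your own page:

```
# Request only the response headers; the first line returned shows the status code (e.g. HTTP/2 200)
curl -I https://www.example.com/clothes/dress
```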

Meta Robots & X-Robots-Tag

These tags are snippets of code that provide the crawler with instructions on how, and whether, to crawl and index a page.

The meta robots tag sits within the HTML code, whereas the X-Robots-Tag lives in the HTTP header.

They will usually list a series of rules that bots like Googlebot will obey (although if they can’t reach the page code, they won’t be able to obey them). It’s worth noting that rules not to index a page are followed more firmly than rules to index a page, as other reasons may be affecting a page’s indexability – after all, that’s what this article is about!

Some of the rules include:

Rule | Meaning
Noindex | Do not show this page in search results.
Nofollow | Do not follow the links included on this page.
Index | Please do index this page.
Follow | Please do follow the links on this page.
Nosnippet | Do not show a text preview or video preview of this page.
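
To make these concrete, here’s roughly what the two versions look like in practice. The values are just examples; the header version would normally be set in your server or CMS configuration:

```
<!-- Meta robots tag, placed in the page's HTML <head> -->
<meta name="robots" content="noindex, nofollow">

<!-- The equivalent sent as an HTTP response header -->
X-Robots-Tag: noindex, nofollow
```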

Robots.txt

At PushON, we like to explain this as the highway code of your website. This file sits at the root of your site and contains rules bots must follow. Here, you can “disallow” a page, which will stop a crawler from ever visiting the page.
Search engine crawlers like Googlebot will obey this completely. However, other crawlers and bots – especially those with nefarious intent – may disregard it.
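
As an illustration, a simple robots.txt might look like the sketch below. The domain and paths are made up for the example; the Sitemap line is optional but a handy way to point crawlers at your sitemap:

```
# https://www.example.com/robots.txt
User-agent: *
Disallow: /checkout/
Disallow: /my-account/

Sitemap: https://www.example.com/sitemap.xml
```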

Canonicals

Canonicals can sit in the HTML of the page or within the HTTP header and are essentially a link that points to a page’s “main” version.

Sometimes, that’s the page we are currently on, in which case the canonical is self-referential.

Other times, it may point towards a different URL, in which case the current page is “canonicalised”. When a page is canonicalised, search engines will disregard it and instead look at the page it is pointing at.

Why might we need to do this? Well, sometimes we may have two similar pages, and we only really need to have one being indexed.

For example, on an eCommerce site that sells dresses, a user may have used the filters to show only black dresses or to sort by price, creating URLs that look like this:

/clothes/dress?colour=black
/clothes/dress?sort=price:low

Both of these should canonicalise back to /clothes/dress, the main page on which we’d want to appear in search.
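
In practice, the canonical on those filtered URLs would look something like the snippet below (the domain is a placeholder). The first line is the HTML version; the second shows the same thing as an HTTP header, which is handy for non-HTML files like PDFs:

```
<!-- In the <head> of /clothes/dress?colour=black -->
<link rel="canonical" href="https://www.example.com/clothes/dress">

<!-- Or as an HTTP response header -->
Link: <https://www.example.com/clothes/dress>; rel="canonical"
```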

Soft Indexing Signals

Crawl Depth

Crawl depth is a figure you can get from tools like Screaming Frog that will tell you how many links it took to reach that page. A crawl depth of 0 means it took no links to get there, likely your homepage. A crawl depth of 1 means it took one link, so this was reachable from your homepage – and so forth.

Internal Links

Sometimes referred to as inlinks, internal links fall into the same realm as crawl depth. Where crawl depth measures how many clicks it takes to reach a page, internal links measure how many links across your site point to that page.
The more links there are, the more important a page seems. This signals to search engines that these pages are your priority, and the higher the priority, the more likely they are to appear in search.

Sitemap

The sitemap goes hand in hand with the robots.txt. Where the robots.txt is the highway code, the sitemap is the literal map of our website.

This isn’t an Ordnance Survey map, though; it’s more like a map of a theme park, where you highlight all of the fantastic and lovely attractions you have. You want to include all of the URLs you want people to visit, but you don’t want to include the CMS pages your employees use to keep the site running. You also don’t want to include the private accounts section, which may contain private information.

The sitemap should only include the pages you want to appear in search; however, this is still a soft signal. Search engines may or may not respect it.
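
For reference, an XML sitemap is just a structured list of the URLs you want in search. A stripped-down example (placeholder domain and dates) looks like this:

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/clothes/dress</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/clothes/shoes</loc>
  </url>
</urlset>
```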

Why not just report on “Indexability” from Screaming Frog?

In your Screaming Frog crawls, you’ll have columns labelled “Indexability” and “Indexability Status”:

A screenshot from Screaming Frog showing their indexability metric.

And this is useful! The trouble is, these columns only reflect the hard signals, and only some of them. They will not report on the sitemap, inlinks or crawl depth.

To remedy this, we use a template to input this data and look at all the things that may affect it at once, comparing it against indexation data from Google Search Console.

How to conduct your own indexability analysis

The tools you’ll need:

A crawling tool such as Screaming Frog
Access to the site’s Google Search Console property
A copy of our indexability analysis template (linked at the top of this article)

Step 1: Crawl the website

Use your chosen tool and ensure you have “Ignore robots.txt but report status” selected, found under Configuration > Crawl Config > Content > Robots.txt. Don’t forget to crawl the sitemap as well, found under Spider > Crawl. Once the crawl is done, run a Crawl Analysis as well to get the sitemap info.

Step 2: Export the HTML URLs & paste them into the sheet

Navigate to Internal > HTML and export that list of URLs.

A Screaming Frog screenshot of their Internal URL report

Paste the entire table into the “Insert Crawl Here” tab and make sure your headings line up with the headings already there. This may mean deleting or adding some columns.

A screenshot of PushON's template, showing how to use the headers. Place your pasted data underneath them, headers included.

You should see that some of the “Analysis” tab has been populated!

Step 3: Export “URLs in Sitemap” & paste them into the sheet

As above, navigate to the Sitemap > URLs in Sitemap section, export those URLs, and place them into the spreadsheet under the “Insert Sitemap File” tab. Make sure those headings line up as well.

A screenshot of PushON's template demonstrating the headers, paste the data underneath, headers included.

Step 4: Export indexed URLs from Google Search Console and paste them into the sheet

Head over to your Google Search Console property and navigate to Indexing > Pages > “View data about indexed pages”:

A screenshot of the Indexed Pages report from Google Search Console, demonstrating where to get the indexed URLs.

Hit export as Excel in the top right, and copy and paste the “table” tab into the “Insert Indexed Pages” tab in the sheet.

A screenshot of the PushON template demonstrating how to use the headers

And that’s all of our data! Now, we head into analysis.

Step 5: Analysing your data

Now that everything has been added, the “Analysis” tab should be fully populated, and anything coloured red identifies a potential problem with indexing.

First, we recommend filtering out everything that is indeed indexed according to Google Search Console – there are clearly no problems there!

A screenshot of PushON's template in action, with data shown in red when it relates to reduced indexability.

Now you’ll have a list of URLs that aren’t indexed and the potential reasons why. Some you may be able to cull straight away, for example redirected URLs, but others may be URLs that serve a 200 status and, for some reason, aren’t indexed.

For example, we can identify an article above that is indexable but not indexed. It looks like the crawl depth is relatively high, and there aren’t many internal links pointing to it, so we may want to investigate there and make the article easier to find. It also doesn’t seem to be in the sitemap, so we’d need to examine why that may be the case.

If nothing seems to be the issue, it’s time to investigate your content – which is another article!

Good luck!