Ever wonder how sites get into search engines? And how do search engines manage to give us tons of information in seconds? The secret of this lightning-fast work is the search index. It can be compared to a huge, perfectly organized catalog-archive of all web pages. Getting into the index means that the search engine has seen, evaluated, and remembered your page, so it can show it in search results.
Let's walk through the indexing process from scratch: how sites get into Google, whether you can manage this process, and what you need to know about indexing sites built with different technologies.
What are crawling and indexing?
Crawling a site's pages is the process in which a search engine sends its special programs (we know them as search robots, crawlers, or spiders) to collect data from new and changed pages of sites.
Indexing a site's pages means crawling them, reading the data, and adding it to the index (catalog). The search engine uses the information it receives to find out what your site is about and what is on its pages. It can then assign keywords to each crawled page and save a copy of it in the search index. For each page it stores the URL and information about the content.
As a result, when users enter a search query, the search engine quickly looks through its list of indexed pages and shows only the relevant ones in the results, just as a librarian finds the right books in a catalog: alphabetically, thematically, and by exact title.
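The "catalog" at the heart of this speed is an inverted index: a map from every word to the pages that contain it, so answering a query is a lookup rather than a scan of the whole web. A toy sketch in Python (the URLs and texts are made up for illustration):

```python
# A toy "search index": a minimal inverted index, the data structure
# that lets a search engine find pages by keyword in milliseconds.
from collections import defaultdict

def build_index(pages):
    """Map every word to the set of URLs whose content contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

pages = {
    "example.com/cats": "cats are great pets",
    "example.com/dogs": "dogs are loyal pets",
}
index = build_index(pages)

# Answering a query is now a single dictionary access,
# not a scan of every page:
print(sorted(index["pets"]))
```

Real search engines add ranking, compression, and distribution on top, but the lookup idea is the same.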
Website indexing on Google
When we google something, the search does not run across live sites in real time; it runs over the Google index, which contains hundreds of billions of pages. Various factors are taken into account when you search: your location, language, device type, and so on.
In 2019, Google changed its core principle of indexing a site – you've probably heard of Mobile-first. The main difference of the new method is that the search engine now stores the mobile version of pages in the index. Previously, the desktop version was considered first; now Googlebot for smartphones visits your site first, especially if the site is new. All other sites are gradually moving to the new way of indexing, and owners learn about it in Google Search Console.
A few other features distinguish indexing in Google:
– the index is constantly updated;
– the process of indexing the site takes from a few minutes to a week;
– poor quality pages are usually downgraded, but not removed from the index.
The index includes all crawled pages, but only the highest-quality sites and pages appear at the top of search results. Before showing the user a page for a query, the search engine checks its relevance against more than 200 criteria (ranking factors) and selects the most suitable ones.
How search robots learn about your site
If this is a new resource that has never been indexed, you should "introduce" it to search engines. Once invited, they will send their crawlers to the site to collect data.
You can invite search bots to the site by placing a link to it on a third-party resource. But keep in mind: for search engines to find your site, they must crawl the page on which this link is placed.
You can also use one of the options below.

For Google:
Create a Sitemap file, add a link to it in robots.txt, and submit the Sitemap to Google. You can also submit a request in Search Console to index a page with changes.

Every site owner dreams of having the site indexed as quickly as possible, covering as many pages as possible. But no one, not even your best friend who works at Google, can force this. Crawling and indexing speed depends on many factors, including the number of pages on the site, the speed of the site itself, settings in the webmaster tools, and the crawl budget. In short, the crawl budget is the number of URLs on your site that the search robot wants to, and can, crawl.
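A Sitemap file itself is a small XML document listing your URLs. A minimal sketch (example.com and the dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
```

Save it as sitemap.xml at the site root and reference it from robots.txt.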
How to control a search robot
The search engine downloads information from the site while taking robots.txt and the sitemap into account. These files are where you can suggest to the search engine what to crawl on your site and what to skip.
robots.txt is a plain text file that contains basic information: for example, which search robots we are addressing (User-agent) and what we do not allow them to crawl (Disallow). The instructions in robots.txt help search robots find their way around and avoid spending their resources crawling unimportant pages (system files, authorization pages, shopping cart contents, and so on).

For example, the line Disallow: /admin will prevent search robots from viewing pages whose URL starts with admin, and Disallow: /*.pdf$ will keep them away from PDF files on the site. You should also specify the address of the sitemap in robots.txt so that search robots know where to find it. To check whether robots.txt is correct, use the dedicated tool in Google Search Console.
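Putting those directives together, a robots.txt might look like this (example.com is a placeholder):

```text
# Rules apply to all crawlers
User-agent: *
# Block URLs that start with /admin
Disallow: /admin
# Block PDF files anywhere on the site
Disallow: /*.pdf$
# Tell crawlers where the sitemap lives
Sitemap: https://example.com/sitemap.xml
```

The file must live at the site root (https://example.com/robots.txt) for crawlers to find it.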
Another file that helps optimize crawling of the site is the Site Map (Sitemap). It indicates how the content on your site is organized, which pages should be indexed, and how often the information on them is updated. If your site has only a few pages, the search engine will probably find them by itself. But when the site has millions of pages, the engine has to choose which of them to crawl and how often, and the sitemap helps with that prioritization, among other factors.
To ensure that no important page of your site escapes the search robot's attention, menu navigation, "breadcrumbs", and internal linking come into play. But if you have a page that no external or internal link points to, the sitemap is what will help it get found.
You can also specify in the Sitemap:
– how often a particular page is updated, via the <changefreq> tag;
– the canonical version of a page (a sitemap should list only canonical URLs; the rel="canonical" attribute itself belongs in the page's HTML);
– versions of the page in other languages, via hreflang attributes.
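These extras can be expressed in a sitemap entry like this (example.com is a placeholder; note the extra xhtml namespace declaration that hreflang links require):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/page</loc>
    <changefreq>weekly</changefreq>
    <!-- German-language version of the same page -->
    <xhtml:link rel="alternate" hreflang="de"
                href="https://example.com/de/page"/>
  </url>
</urlset>
```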
A sitemap is also a great help in understanding why your site has indexing difficulties. For example, if your site is very large, create several sitemaps by category or page type. Then it is easier to see in Google Search Console which pages are not indexed and to deal with them. You can check whether a Sitemap file is correct in Google Search Console, in the Sitemaps section for your site.
So, your site has been submitted for indexing and robots.txt and the sitemap have been checked; it is time to find out how the indexing went and what the search engine found on the resource.
How to check the indexing of the site
There are several ways to check the site’s indexing:
1. Via the site: operator, e.g. site:yoursite.com. This operator does not give an exhaustive list of pages, but it gives a general idea of which pages are in the index. It returns results for the main domain and its subdomains.
2. Via Google Search Console. The console for your site has detailed information on all pages: which of them are indexed, which are not, and why.
Why is my website not indexed in Google?
Reason #1: the site is closed for indexing
The most popular reason is this text in the robots.txt file:
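The directives in question tell every crawler (User-agent: *) to stay away from the entire site (Disallow: /):

```text
User-agent: *
Disallow: /
```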
With this setting, no search engine will find its way to your site. The Disallow: / line needs to be removed.
Other reasons the site may be hidden from search robots:
- incorrect use of the noindex tag: needed pages get closed off from indexing along with the unnecessary ones;
- privacy settings in the CMS;
- crawling blocked in the .htaccess file.
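To spot the first of these problems, you can check a page's HTML for a robots noindex directive. A minimal sketch in Python using only the standard library (the function name and sample markup are illustrative):

```python
# Detect a <meta name="robots" content="noindex"> directive in page HTML.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.extend(
                d.strip().lower() for d in attrs.get("content", "").split(",")
            )

def has_noindex(html: str) -> bool:
    """Return True if the page carries a robots noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page))
```

Run a check like this over your important pages to confirm none of them are accidentally excluded. Note that noindex can also be sent as an X-Robots-Tag HTTP header, which this sketch does not cover.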
Reason #2: the search robot does not know about the existence of the site or page
First of all, this is typical of young sites: if that is your case, it is no surprise that the site is poorly indexed by Google, especially if submitting the site to the search engine took a long time and it is not even in the queue for crawling yet. Give search engines time to discover it, at least two weeks.
The robot may also not know about your site because it is rarely updated and nothing links to it. So when you add new pages, do not forget about internal linking and about links from reputable external resources.
Reason #3: the site is banned
Google imposes sanctions for various "search crimes": such web resources are blacklisted by the bots, and no one comes to index them.
The problem is that this is not always obvious to site owners and webmasters. In Google's case, determining that sanctions are the cause of poor indexing will not be easy without an SEO specialist.
Filters are usually imposed for:
– irrelevant and poor quality content;
– annoying advertising blocks;
– link sales or link spam;
– spammed semantic kernel;
– malicious code.
Reason #4: technical errors
Some technical parameters are elementary and critical at the same time: fixing them immediately neutralizes poor site indexing. For example:
- incorrect HTTP headers;
- incorrect redirects (using 302 instead of 301, or rel="canonical" pointing to the same canonical page for everything);
- incorrect encoding, which the robot sees as a set of unreadable characters;
- crawl errors reported by the search engines themselves in their webmaster panels (Google Search Console);
- unstable operation of the hosting;
- incorrect setting of the sitemap.xml file.
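For the redirect point in particular, a hypothetical .htaccess sketch for an Apache server shows the permanent (301) form search engines expect (example.com and the paths are placeholders):

```apache
# Redirect a moved page with 301 (permanent), not 302 (temporary),
# so search engines transfer the old URL's signals to the new one.
Redirect permanent /old-page /new-page

# Force HTTPS, also with a 301.
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://example.com/$1 [L,R=301]
```

Reserve 302 for genuinely temporary moves; for anything permanent, 301 is the safe default.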
Reason #5: poor page quality
In some cases, poor quality, of the content for example, will be so blatant that Google sanctions the site during crawling, and that is it: the site is not indexed because of the ban.
More often, though, the "poor quality pages" because of which Google indexes the site badly simply mean that your competitors have better websites. That is, your resource loses positions in the results not on its own, but by comparison.
What grounds do search engines have for demoting a site:
- non-unique content (there is no sense in adding pages whose content is already in the SERP, the Search Engine Results Page);
- non-unique heading structure, identical meta tags;
- lots of 404 errors;
- slow loading speed due to heavy images and generally unoptimized content.
To sum up
Search engines are ready to index as many pages of your site as needed. Just think: Google's index exceeds 100 million gigabytes, hundreds of billions of indexed pages, and it grows every day.
But the success of this undertaking is often up to you. By understanding how search engine indexing works, you will not hurt your site. If you have specified everything correctly in robots.txt and the sitemap, met the technical requirements of the search engines, and taken care of quality, useful content, search engines will not leave your site without attention.
Remember that indexing is not just about whether your site gets indexed or not. Much more important is how many pages, and which ones, end up in the index, what content on them gets crawled, and how it is ranked in search.
The reasons listed above most often explain why a site is poorly indexed. If this list does not contain what caused the demotion of your resource, it is better to turn to an SEO specialist, or directly to me: I am always ready to help! 🙂 So do not hesitate to contact me!