In Search Podcast: 5 Ways to Use Logfiles for SEO With Gerry White
How are you taking advantage of logfiles to improve your SEO?
That’s what we’re going to be talking about today with a man with over 20 years of experience in the SEO industry working at brands and agencies, including the BBC, Just Eat, and Rise at Seven. A warm welcome to the In Search SEO podcast, Gerry White.
In this episode, Gerry shares five ways to use logfiles for SEO, including:
- Seeing how Google looks at your site
- Parameters
- Whether subdomains are consuming your crawl budget
- JavaScript and CSS files
- Response codes
How to Use Logfiles for SEO
Gerry: Hey, glad to be here.
D: Good to have you on. You can find Gerry by searching Gerry White on LinkedIn. So Gerry, should every SEO be using logfiles?
G: No, and I know that sounds controversial, because logfiles give us huge amounts of information. But honestly, a lot of the time it’s diminishing returns, and you can often find a lot of that information before you go into logfiles. What I mean by that is, if you take a look at Google Search Console, there’s a huge amount of information there. When I’ve been looking into logfiles, it’s only after I’ve exhausted a lot of other places first. I always recommend crawling the site with something like Screaming Frog or whichever desktop crawler you’ve got, and then looking at Google Search Console, before you start to look at logfiles.
The reason I say that, and the reason I sound almost anti-logfiles when I’m about to talk about how useful they are, is that they’re actually quite challenging to work with initially. It takes a bit of skill, knowledge, and experience to really get your hands on them, and even to get access to them. But one great thing about today is that we actually have more access to logfiles than almost ever before. When I started out, we didn’t have Google Analytics or any analytics software like we have today, and logfile analysis was how we looked at how people visited websites. Now we rarely look at logfiles to see how people use websites, unless we’re doing something with InfoSec, or diagnosing something really weird and wonderful.
A lot of the time, we simply have much better analytics software. That might change, though, because one weird thing is that a lot of websites can’t track how many people reach a 404 page, since a lot of the time you never click to accept cookies on a 404 page. Suddenly, logfiles are coming back to answer some very strange questions like that.
But the main reason I’m talking about logfiles today is for SEO purposes. So yes, if you’ve got problems with large sites, if you’ve got a large e-commerce website, or an international, multilingual, huge site with faceted navigation, then logfiles are something that should definitely be taken into account and looked at as soon as possible.
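As a purely illustrative starting point (not something Gerry describes doing himself), here is a minimal sketch of that first pass over a logfile: pulling out Googlebot requests and counting which URLs it hits most. It assumes a standard “combined” format access log; the `access.log` filename and the regex are assumptions you would adjust for your own server.

```python
import re
from collections import Counter

# Minimal sketch: count which URLs Googlebot requests most often in a
# standard "combined" format access log. The filename and log format
# are assumptions -- adjust the regex to match your server's config.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]+" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

hits = Counter()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        m = LOG_LINE.match(line)
        if m and "Googlebot" in m.group("agent"):
            hits[m.group("url")] += 1

for url, count in hits.most_common(20):
    print(f"{count:>8}  {url}")
```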
D: So today, you’re sharing five ways that SEO should be using logfiles. Starting off with number one, seeing how Google looks at your site.
1. Seeing How Google Looks at Your Site
G: Yeah, Google is fairly unpredictable, almost like an unruly child. It’s strange, because although we can use crawling tools to look at how Google should be looking at a site, we’re often surprised to find that Google has got obsessed with one set of pages or gone down some strange route somewhere. More recently, for the last year I’ve been working with a supermarket called Odor, and one of the things we found was that Googlebot has been looking very closely at the analytics configuration and creating artificial links from it. Google was finding broken links, and for a long time I was trying to figure out why it was finding tens of thousands of 404s that were not on the page at all. It turns out it had been looking at the analytics configuration and creating links from that. So we’re looking at how much of an impact that’s had. The fact that Google is finding all of these 404s might not be a massive problem, but what we want to know now is how much time it’s spending on those 404s, and whether fixing this one tiny problem will mean that crawling of the rest of the site increases by 20-30%. What’s the opportunity if we fix it? It’s all about looking at why Google is looking at the site like that, and what it’s finding that it really shouldn’t be finding.
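To make that “how much time is Googlebot wasting on 404s” question concrete, here is a rough sketch rather than Gerry’s actual setup. It assumes the logs have been exported to a CSV with `url`, `status`, and `user_agent` columns; the file and column names are assumptions.

```python
import csv
from collections import Counter

# Hedged sketch: measure what share of Googlebot's requests end in 404,
# and which missing URLs it keeps coming back to.
status_counts = Counter()
not_found = Counter()

with open("googlebot_hits.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if "Googlebot" not in row["user_agent"]:
            continue
        status_counts[row["status"]] += 1
        if row["status"] == "404":
            not_found[row["url"]] += 1

total = sum(status_counts.values())
if total:
    share = status_counts.get("404", 0) / total
    print(f"{total} Googlebot requests, {share:.1%} of them 404s")
for url, count in not_found.most_common(10):
    print(f"{count:>6}  {url}")
```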
2. Parameters
The other thing that we often look at is parameters. I don’t know if you know, but SEO folks always want to link through to the canonical version of a page. What I mean is that there are often multiple versions of a page, sometimes with some kind of internal or external tracking. There are so many ways in which we can link through to a page, and a product, for instance, can often sit in multiple places on a site. A good example of this is a site I worked on that was built on Magento. Every product seemed to sit under every single category, so we found there were about 20 versions of every product, and every version was crawlable. From there, we knew that Google was spending a huge amount of time crawling through the site. And what’s interesting is that if you remove a product, Google will kind of go, “Oh, but I’ve got 19 other versions of this product,” so it’ll take a while for the page to actually disappear if you’ve used a 404 or something like that, because of the way Google works. Google will see that this is the canonical version of the page, but if you remove the canonical version, it will start to use different ones. And this is the kind of information that logfiles give us: the ability to look at the site the way Google does.
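One simple way to surface that parameter duplication, sketched here with illustrative URLs rather than real log output, is to group everything Googlebot has requested by path and see how many parameterised variants of each page it is crawling.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Illustrative sketch: group crawled URLs by path so parameterised
# duplicates of the same page become visible.
crawled_urls = [
    "/product/red-shoes?utm_source=email",
    "/product/red-shoes?sort=price&colour=red",
    "/product/red-shoes",
]  # sample data, not real log output

variants = defaultdict(Counter)
for url in crawled_urls:
    parts = urlsplit(url)
    variants[parts.path][parts.query or "(no parameters)"] += 1

for path, queries in variants.items():
    if len(queries) > 1:
        print(f"{path}: {len(queries)} crawled variants")
        for query, count in queries.most_common():
            print(f"    {count:>5}  {query}")
```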
It also allows us to look at things like status codes. A great example is that there’s a status code that says “I have not been modified,” and for the life of me right now I can’t think what it is; I should have written it down before this podcast. But basically, the “I’ve not been modified” response massively improves the crawling rate of a website. When I found out that this was something Google was respecting, what I could do was this: with all of the images, all of the products, and all of these bits and pieces that don’t get modified very regularly, if we can serve a “not modified” response, we can improve the speed at which Google is crawling, improve the effectiveness, and reduce the load on the server, and then significantly improve the way Google finds all of the different products.
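The status code Gerry is reaching for here is HTTP 304 Not Modified. As a quick, hedged check (the URL is a placeholder and the `requests` library is assumed to be installed), you can see whether a server honours conditional requests by fetching a resource once, then repeating the request with the validators it returned.

```python
import requests

# Sketch: does this server answer conditional requests with 304 Not Modified?
url = "https://www.example.com/static/logo.png"  # placeholder URL

first = requests.get(url, timeout=10)
headers = {}
if "ETag" in first.headers:
    headers["If-None-Match"] = first.headers["ETag"]
if "Last-Modified" in first.headers:
    headers["If-Modified-Since"] = first.headers["Last-Modified"]

second = requests.get(url, headers=headers, timeout=10)
print("First response:", first.status_code)
print("Conditional response:", second.status_code)  # 304 means caching is respected
```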
What we want, what server admins want, and what everybody wants is for the server to be as fast and as efficient as possible. Going back to logfiles, for many years we couldn’t use them effectively at all, because with CDNs you’d often find there were multiple places where a page would be hit, and the CDN often didn’t have a logfile itself. So we’d be looking at all these different places, trying to see how much load was on this server and how much load was on that server, trying to piece everything together, and the logfiles would all be in different formats. Now, with CDNs, we can actually start to understand how effective the CDN is. Suddenly, things like page speed are massively improved, because with logfiles we can start to understand things like canonicalization of images: if one image is being used across multiple pages, then as long as the URL is consistent, the CDN works and Google crawls it better. There are so many different ways in which logfiles help improve page speed, caching, and serving users and search engines much more efficiently.
D: I’m reviewing the five points that you were going to share, and there are different elements of them that you’ve covered already. You remind me of someone I can ask just one question and they’ll give me a 15-minute podcast episode without any further questions. There’s one person who can probably do that even more than you, and that’s Duane Forrester. Duane and I have joked about that, me just asking him one question, walking off, and leaving him to share the content for the rest of the episode. But you talked about parameters a little bit. I don’t know if you’ve touched upon point number three, which is discovering whether there are subdomains consuming crawl budget when they shouldn’t be.
3. Are there subdomains consuming your crawl budget?
G: This actually goes back to Just Eat. At one point, we discovered that the website was replicated on multiple different subdomains, and all of these were crawlable. Interestingly, these had no visibility according to tools like Sistrix, and the reason they didn’t was that everything was canonicalized. But when we looked, we found that although these were just duplicates, Google was spending something like 60 to 70% of its budget crawling these subdomains. And because these weren’t cached in the same way, because of the CDNs and other technology, this was actually creating a lot of server load. It was fascinating for us, because we had been ignoring this as a problem to be fixed at some point in the future. We knew about the problem, we knew there was an issue, and I’d spoken about it, but I’d deprioritized it until we started looking at the logfiles.
We saw that Google was spending a lot of energy, time, and resources there. How much server load was it creating? How much of an impact was it having? We couldn’t tell, because of the way the server wasn’t able to separate the different sources. So it was fascinating that once we got the logfiles, we could improve the reliability of the website by a considerable amount. We knew about the subdomains, we just didn’t know how much of a problem they were until we started looking into the logfiles. Then suddenly we saw that this needed to be fixed up ASAP. It was one of those things where we knew how to fix it, it was just prioritization. It had been at the bottom of the queue and it got bumped up to number two.
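A hedged sketch of that kind of check, assuming the CDN or load-balancer logs have been exported to a CSV with `host`, `url`, and `user_agent` columns (file and column names are assumptions), is simply to work out what share of Googlebot’s requests each hostname is absorbing.

```python
import csv
from collections import Counter

# Sketch: how much of Googlebot's crawl goes to each subdomain?
by_host = Counter()
with open("edge_logs.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if "Googlebot" in row["user_agent"]:
            by_host[row["host"]] += 1

total = sum(by_host.values())
if total:
    for host, count in by_host.most_common():
        print(f"{count / total:6.1%}  {count:>8}  {host}")
```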
4. JavaScript and CSS files
D: You touched upon canonicalization but you also said that, specifically, JavaScript and CSS files can be an issue. Why is that?
G: One of the things we often do is break the cache by adding a parameter to the CSS file. The reason we do this is that if you use a CDN or something similar, the CSS and JavaScript files have long cache times, so whenever you update the CSS or create new pages, the problem is that the cached CSS file is stale and the new pages can’t use the updated version. So within the page, as soon as we add something that needs the JavaScript or the CSS to be updated, we just change the parameter slightly. From there, what we had to make sure was that all of the different servers were using the same parameter version going forwards. If you’re working across multiple different teams and multiple different websites, with the one bit of JavaScript that powers the entire thing, we always had to make sure it was the right version. Logfiles were one way we made sure that all of the different pages were consistently hitting the right JavaScript version, because maybe we had to update an API key or something similar. There were so many different ways in which we had to do it, and it was a massive task for the developers.
One of the things we were looking at in the logfiles was whether the old one was still being hit, where it was being hit from, and whether we could fix it up. We also found there were many different ways in which you could write the path to the JavaScript file. For instance, was it on a subdomain, did we use a different hostname? Interestingly, if you work across multiple different websites, you often find there are different URLs or different domain names that actually access the same server, and if you’re using a CDN or a subdirectory, it can sometimes be very inconsistent. From a user point of view, if you’re hitting that same JavaScript file six or seven different ways within a journey, then you’re loading it up six or seven times. While that might not seem like a lot, cumulatively it adds megabytes to the journey, which slows down the whole experience and makes the servers less efficient. And there’s much more to it. So ensure that the right version of the JavaScript, CSS, and other bits and pieces is always being hit, and also make sure there’s no reason for the JavaScript to be hit with parameters or something. There are so many ways in which spider traps can be created that include the JavaScript files, where, for instance, something gets tagged onto them, or the right absolute reference to the JavaScript isn’t used, so it’s referenced from a different directory than at other times. It’s surprising how many different ways you can spot JavaScript being loaded slightly differently by multiple different pages. So yeah, it’s a very simple one, but it’s surprisingly expensive when it comes to analysis.
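One small, illustrative way to spot that inconsistency (the input list below is made up, not real log output) is to group every requested .js or .css URL by its file name and flag assets that are being loaded from more than one hostname, directory, or cache-busting version.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit

# Sketch: flag .js / .css assets requested in more than one way.
requested = [
    "https://cdn.example.com/assets/app.js?v=1.4.2",
    "https://www.example.com/assets/app.js?v=1.3.9",
    "//static.example.com/js/app.js",
]  # illustrative sample data

by_asset = defaultdict(set)
for url in requested:
    parts = urlsplit(url)
    name = posixpath.basename(parts.path)
    if name.endswith((".js", ".css")):
        by_asset[name].add(url)

for name, variants in by_asset.items():
    if len(variants) > 1:
        print(f"{name} is being requested {len(variants)} different ways:")
        for variant in sorted(variants):
            print("   ", variant)
```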
5. Response codes
D: Also ensuring that response codes are being delivered in the manner that you would want. An example of that is 200s sometimes being seen or not being seen by Google when they should or shouldn’t be. So why would that happen?
G: Again, we always visit web pages using the same browser, the same technology, the same experience, and everything. I try to make sure I use tools other than what I usually use, since everybody does a Screaming Frog audit, so I try all sorts of bits and pieces. But we always pretend that we’re kind of like a computer. We never pretend we’re Googlebot, we never pretend we’re all of these different things. But if you look at how Googlebot is accessing a particular file from a different IP address… a lot of technology like Cloudflare, if you pretend you’re Googlebot while trying to access the site with Screaming Frog, knows you’re not actually Googlebot, and so it treats you differently to how it would treat Googlebot. And servers are often configured to pre-render content and do all sorts of bits and pieces. It’s about making sure that everybody gets the right response code from the server at that point.
And it seems quite simple, but when you’re scaling up internationally… When you’ve got geo redirects, if a user or search engine can’t access a particular page because somebody has put in a geo redirect saying that if you visit this website from Spain, then load this subdirectory instead… then it can’t look at the root versions or the alternative versions. That’s why things like response codes being correct is absolutely critical. And it’s surprising how often you go through these things and assume everything is correctly set up. Time and time again, we know how it should be set up, we give it to somebody, somebody interprets it, another person implements it, and somebody else goes through it. And then somebody else clicks a button on the CDN that says, “Oh, we can geolocate somebody at this particular place.” It’s not so much that any one person has done something wrong, it’s more that something down the chain has effectively broken it slightly.
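Related to Gerry’s point about Cloudflare knowing you are not really Googlebot: Google’s documented way to verify a genuine Googlebot request is a reverse DNS lookup on the requesting IP, followed by a forward lookup to confirm it resolves back. Here is a minimal sketch of that check; the sample IP is only an illustration.

```python
import socket

# Sketch of the documented Googlebot verification: reverse-DNS the IP,
# check the hostname, then forward-resolve it back to the same IP.
def is_real_googlebot(ip: str) -> bool:
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
    return ip in forward_ips

print(is_real_googlebot("66.249.66.1"))  # illustrative IP from a commonly cited Googlebot range
```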
The Pareto Pickle – Low-Hanging Fruit
D: Let’s finish off with the Pareto Pickle. Pareto says that you can get 80% of your results from 20% of your efforts. What’s one SEO activity that you would recommend that provides incredible results for modest levels of effort?
G: My favorite thing at the moment is a very basic Google Data Studio dashboard that lets me take a look at what I call the low-hanging fruit. Everybody hates buzzword bingo, but this is my thing where I look at pages that are not quite ranking as well as they should. I look at all of the keywords a particular set of pages, recipes, or products is ranking for. A good example is that at the moment I’m working across tens of thousands of products, so I look at all the pages which have high impressions but might be at position six, and I can work them up to position three. Nine times out of ten, you can do this by just making sure the title tag is improved and the internal linking is improved. Very simple stuff: find the keywords with high search volume that can be bumped up just a little bit more to increase the click-through rate.
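Gerry does this in a Data Studio dashboard; as a hedged alternative sketch, the same low-hanging-fruit filter can be run over a Search Console performance export. The file name and column names (query, page, impressions, position) are assumptions about how your export is laid out.

```python
import csv

# Sketch: keep queries with plenty of impressions that rank just off the
# top positions (positions 4-10), sorted by impressions.
candidates = []
with open("gsc_performance_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        impressions = int(row["impressions"])
        position = float(row["position"])
        if impressions >= 1000 and 4 <= position <= 10:
            candidates.append((impressions, position, row["query"], row["page"]))

for impressions, position, query, page in sorted(candidates, reverse=True)[:25]:
    print(f"{impressions:>8}  pos {position:4.1f}  {query}  ->  {page}")
```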
D: I’ve been your host, David Bain. You can find Gerry by searching Gerry White on LinkedIn. Gerry, thanks so much for being on the In Search SEO podcast.
G: My pleasure. Thank you for your time.
D: And thank you for listening.