There are many reasons you might need to find all of the URLs on a website, but your exact goal will determine what you’re searching for. For instance, you might want to:
Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won’t give you everything you need. Unfortunately, Google Search Console isn’t exhaustive, and a “site:example.com” search is limited and hard to extract data from.
In this post, I’ll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site’s size.
Old sitemaps and crawl exports
If you’re looking for URLs that disappeared from the live site recently, there’s a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven’t already, check for these files; they can often provide what you need. But, if you’re reading this, you probably didn’t get so lucky.
Archive.org
Archive.org is a valuable tool for SEO tasks, funded by donations. If you search for a domain and select the “URLs” option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn’t a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn’t indicate whether Google indexed a URL, but if Archive.org found it, there’s a good chance Google did, too.
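If you’re comfortable with a little code, you can also sidestep the export problem by querying the Wayback Machine’s public CDX API directly. Here’s a minimal Python sketch; the endpoint and parameters follow the public CDX documentation, and example.com is a placeholder for your own domain:

```python
import requests

# Minimal sketch: query the Wayback Machine's CDX API for captured URLs.
# Parameters follow the public CDX documentation; example.com is a placeholder.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",
        "matchType": "domain",   # include subdomains; use "exact" or "prefix" to narrow
        "output": "json",
        "fl": "original",        # return only the original URL field
        "collapse": "urlkey",    # deduplicate repeated captures of the same URL
        "filter": ["statuscode:200", "mimetype:text/html"],  # skip errors and resource files
        "limit": 10000,
    },
    timeout=60,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # the first row is the field header
print(f"{len(urls)} URLs retrieved")
```

The statuscode and mimetype filters also help with the quality issue mentioned above, since redirects, errors, and resource files get excluded server-side.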
Moz Pro
While you would typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you’re dealing with a massive website, consider using the Moz API to export data beyond what’s manageable in Excel or Google Sheets.
It’s important to note that Moz Pro doesn’t confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz’s bots as they do to Google’s, this method generally works well as a proxy for Googlebot’s discoverability.
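For the API route, something like the sketch below can page through your inbound links and collect the target URLs. Fair warning: the endpoint, request fields, and response shape shown here are assumptions based on Moz’s v2 Links API and should be verified against the current documentation before use:

```python
import requests

# Hedged sketch: page through inbound links and collect target URLs on your
# own site. NOTE: the endpoint, request fields, and response shape here are
# assumptions based on Moz's v2 Links API; verify against the current docs.
AUTH = ("YOUR_ACCESS_ID", "YOUR_SECRET_KEY")    # placeholder credentials
ENDPOINT = "https://lsapi.seomoz.com/v2/links"  # assumed v2 endpoint

target_urls, next_token = set(), None
while True:
    body = {"target": "example.com/", "target_scope": "root_domain", "limit": 50}
    if next_token:
        body["next_token"] = next_token  # assumed pagination field
    data = requests.post(ENDPOINT, auth=AUTH, json=body, timeout=60).json()
    for link in data.get("results", []):
        # each link's target should be a page on your own site
        target_urls.add(link.get("target", {}).get("page", ""))
    next_token = data.get("next_token")
    if not next_token:
        break

target_urls.discard("")
print(f"{len(target_urls)} unique target URLs")
```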
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don’t apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.
Performance → Search Results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
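As a concrete example, here’s a minimal Python sketch that pages through the Search Analytics endpoint of the Search Console API to collect every page with impressions. The site URL, date range, and service-account key file are placeholders, and the service account must be added as a user on the property:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Minimal sketch: page through the Search Analytics endpoint to collect
# every page with impressions. Site URL, dates, and key file are placeholders.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    resp = service.searchanalytics().query(
        siteUrl="https://example.com/",
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,  # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = resp.get("rows", [])
    pages.update(row["keys"][0] for row in rows)
    if len(rows) < 25000:  # last page of results
        break
    start_row += 25000

print(f"{len(pages)} pages with search impressions")
```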
Indexing → Pages report:
This section provides exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps (a scripted alternative follows this list):
Step 1: Add a segment to the report
Step 2: Click “Create a new segment.”
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they provide valuable insights.
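If you’d rather script this than click through the UI, the GA4 Data API can produce the same filtered list. Below is a minimal sketch using the official google-analytics-data Python client; the property ID and date range are placeholders, and it assumes credentials are available via the GOOGLE_APPLICATION_CREDENTIALS environment variable:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Filter,
    FilterExpression,
    Metric,
    RunReportRequest,
)

# Minimal sketch: pull page paths containing /blog/ from a GA4 property.
# The property ID and date range are placeholders.
client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",  # hypothetical GA4 property ID
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,  # mirrors the report's 100k row ceiling
)
response = client.run_report(request)
paths = sorted(row.dimension_values[0].value for row in response.rows)
print(f"{len(paths)} blog page paths")
```

Run one request per segment (e.g., per top-level directory) to work around the per-report row limit, just as you would with segments in the UI.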
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.
Things to consider:
Data size: Log files can be massive, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process.
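If you want a quick first pass before reaching for a dedicated log analyzer, a few lines of Python will extract the requested paths. This sketch assumes a gzipped access log in the common/combined format; the file name is a placeholder, and the regex should be adjusted to your server’s or CDN’s actual format:

```python
import gzip
import re

# Minimal sketch: pull requested paths from an access log in the
# common/combined format. Adjust the regex to your actual log format.
LINE_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with gzip.open("access.log.gz", "rt", errors="replace") as f:  # placeholder file
    for line in f:
        # optionally restrict to Googlebot: if "Googlebot" not in line: continue
        match = LINE_RE.search(line)
        if match:
            paths.add(match.group(1).split("?")[0])  # drop query strings

print(f"{len(paths)} unique paths requested")
```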
Combine, and good luck
Once you’ve gathered URLs from all these sources, it’s time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
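For the Jupyter Notebook route, a short pandas script handles the merge, normalization, and deduplication. The file names below are placeholders for whatever each tool exported, and the normalization rules are a starting point rather than a standard:

```python
import pandas as pd

# Minimal sketch: merge, normalize, and deduplicate URL lists. The file
# names are placeholders, and each file's first column is assumed to
# hold the URL.
files = ["archive_org.csv", "moz.csv", "gsc.csv", "ga4.csv", "logs.csv"]
frames = []
for f in files:
    df = pd.read_csv(f, usecols=[0])
    df.columns = ["url"]
    frames.append(df)

urls = pd.concat(frames, ignore_index=True)

# consistent formatting: trim whitespace and trailing slashes
# (only lowercase full URLs if your paths are case-insensitive)
urls["url"] = urls["url"].astype(str).str.strip().str.rstrip("/")

urls = urls.drop_duplicates().sort_values("url")
urls.to_csv("all_urls_deduplicated.csv", index=False)
print(f"{len(urls)} unique URLs")
```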
And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!