Originally posted on August 3, 2006 3:08 PM
This mini howto covers getting started with Google Sitemaps. Using Google Sitemaps isn’t difficult, but I felt like writing a simple walkthrough in case anyone is having problems getting started. Please note these instructions assume your web site is hosted on Linux and you have shell access to the server.
Google offers a free service called Google Sitemaps that lets you submit all your web site URLs so its spiders know when pages have changed and can produce smarter and fresher search results. Google Sitemaps actually does more than that. It also tells you if you have any broken links, which search terms your pages show up under in the search engine result pages (SERPs), and which terms people are clicking on. It’s a neat little service.
http://www.google.com/webmasters/sitemaps/
Here’s how it works in a nutshell.
1. Create a free account with Google. If you already have a Gmail account, you can use those credentials to log in.
2. Once you log in, you tell Sitemaps your web site address and then it asks you to verify it’s your web site.
3. To verify that the web site is indeed yours, Google gives you a couple of options. One verification method is creating an HTML file in your webspace. The other is adding a meta tag to a web page. I opt for the file creation so I don’t have to do any editing.
Create an empty file and give it the file name Google specifies. In this case it’s google4c293907f2980933.html. If you have SSH or Telnet access to your web site, you can also log in and use the touch command to create the file.
touch google4c293907f2980933.html
After you upload or touch the HTML file, click the Verify button. Once verified, click on the Sitemaps tab.
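If you would rather script this step, here is a minimal Python sketch of the same thing touch does. The function name create_verification_file is made up for illustration, and the docroot argument is assumed to be the top-level directory of your webspace.

```python
from pathlib import Path


def create_verification_file(docroot, filename):
    """Create an empty verification file inside docroot, much like `touch`.

    Hypothetical helper: `docroot` is assumed to be your web root directory
    and `filename` is the name Google gave you.
    """
    target = Path(docroot) / filename
    # Creates the file if it is missing; an existing file keeps its content.
    target.touch(exist_ok=True)
    return target
```

For example, create_verification_file("/var/www/docroot", "google4c293907f2980933.html") would create the file from this walkthrough, assuming that directory is your web root.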
4. Now you need to add a sitemap for your web site. You can create your sitemap by hand or use the Google Sitemap Generator Python script. It’s easier to use the script because all you do is create a URL list and run the sitemap generator and it’ll create the XML file for you.
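To give an idea of what the by-hand route involves, here is a rough Python sketch that writes a minimal gzip-compressed sitemap from a list of URLs. This is not how the Google script works internally. The function names are made up, and the namespace is the sitemaps.org 0.9 one, which is an assumption on my part; the service may expect a different schema version, so check the current documentation.

```python
import gzip
from xml.sax.saxutils import escape

# Assumption: the sitemaps.org 0.9 namespace. Verify which schema the
# service expects before relying on this.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_xml(urls):
    """Return a minimal sitemap document listing each URL in a <loc> element."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="%s">' % SITEMAP_NS,
    ]
    for url in urls:
        # escape() turns & into &amp; so query strings stay valid XML.
        lines.append("  <url><loc>%s</loc></url>" % escape(url))
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def write_sitemap_gz(urls, path):
    """Write the sitemap to `path` as gzip-compressed UTF-8 text."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(build_sitemap_xml(urls))
```

This only covers the bare URL list. Optional fields such as lastmod, changefreq, and priority are left out to keep the sketch short.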
You can download the Google Sitemap Generator (sitemap_gen-1.4.tar.gz) to your webspace, extract the contents, and then configure it.
tar zxvf sitemap_gen-1.4.tar.gz
cd sitemap_gen-1.4
5. You need to configure the sitemap generator for your web site. First, copy the example files (example_config.xml and example_urllist.txt) to names without the ‘example_’ prefix.
cp example_config.xml config.xml
cp example_urllist.txt urllist.txt
Since this is a mini howto I won’t go over every single option or advanced configurations. The purpose is to set up a simple, yet working, sitemap configuration. Open config.xml in a text editor. You’ll see a LOT of comments which you should read. To make your life easier, here is a simple configuration that will work. You can delete everything in config.xml and paste this in its place.
<?xml version="1.0" encoding="UTF-8"?>
<site
  base_url="http://www.example.com/"
  store_into="/var/www/docroot/sitemap.xml.gz"
  verbose="1"
  >
  <urllist path="urllist.txt" encoding="UTF-8" />
  <filter action="drop" type="wildcard" pattern="*~" />
  <filter action="drop" type="regexp" pattern="/\.[^/]*" />
</site>
Replace the base_url and store_into values with your own information. To get your webspace root directory path you can use the pwd command or ask your hosting provider. When you’re done, save your changes.
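Before running the generator, it can help to sanity-check the two values you edited. The sketch below reads config.xml and reports a few obvious mistakes. The function name check_config and the checks themselves are my own suggestions, not rules enforced by the Google script.

```python
import os
import xml.etree.ElementTree as ET


def check_config(config_path):
    """Return a list of problems found in a sitemap generator config file.

    Hypothetical helper: the checks are common-sense suggestions only.
    """
    problems = []
    site = ET.parse(config_path).getroot()
    base_url = site.get("base_url", "")
    store_into = site.get("store_into", "")

    if not base_url.startswith(("http://", "https://")):
        problems.append("base_url should be an absolute http(s) URL: %r" % base_url)
    if "example.com" in base_url:
        problems.append("base_url still points at the example domain")

    # The sitemap can only be written if its target directory exists.
    store_dir = os.path.dirname(store_into)
    if not os.path.isdir(store_dir):
        problems.append("store_into directory does not exist: %r" % store_dir)
    return problems
```

An empty list means none of these checks tripped. It does not guarantee the generator will accept the file.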
Now edit your URL list. Open urllist.txt in your text editor and start adding your URLs. Here is an example.
...
# Example contents of the file:
#
# http://www.example.com/foo/bar
# http://www.example.com/foo/xxx.pdf lastmod=2003-12-31T14:05:06+00:00
# http://www.example.com/foo/yyy?x=12&y=23 changefreq=weekly priority=0.3

# Your new URLs
http://www.example.com/
http://www.example.com/index.html
http://www.example.com/contact.html
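Typing a long URL list by hand gets tedious. Here is a rough Python sketch that walks your web root and builds the list from the files it finds. The function names, the choice of .html and .php extensions, and the rule of skipping dot-files are all assumptions for illustration; adjust them to match how your site actually maps files to URLs.

```python
import os
from urllib.parse import quote


def collect_urls(docroot, base_url, extensions=(".html", ".php")):
    """Map files under docroot to URLs under base_url.

    Hypothetical helper: assumes every matching file is publicly reachable
    at the same relative path. Hidden files and directories are skipped.
    """
    urls = []
    for dirpath, dirnames, filenames in os.walk(docroot):
        # Prune hidden directories in place and keep the walk order stable.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        for name in sorted(filenames):
            if name.startswith(".") or not name.endswith(extensions):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), docroot)
            # Percent-encode unsafe characters but keep the path separators.
            urls.append(base_url.rstrip("/") + "/" + quote(rel.replace(os.sep, "/")))
    return urls


def write_urllist(urls, path):
    """Write one URL per line, the plain format shown in the example list."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
```

Note this would also pick up the Google verification file, since it ends in .html. Review the list before building the sitemap.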
If you require more advanced usage then follow the examples in the urllist.txt file. When you’re done adding your URLs, save your changes. It’s time to generate the actual Sitemap XML file.
6. To build your Sitemap XML file you must have Python on your server since the Google Sitemap Generator is a Python script. Here is the command to build your XML.
python sitemap_gen.py --config=config.xml
Here you can see it in action:
[root@develbox sitemap_gen]# python sitemap_gen.py --config=config.xml
Reading configuration file: config.xml
Opened URLLIST file: urllist.txt
Sorting and normalizing collected URLs.
Writing Sitemap file "/web/domain/html/sitemap.xml.gz" with 3990 URLs
Notifying search engines.
Notifying: www.google.com
Count of file extensions on URLs:
2999 .html
990 .php
1 /
Number of errors: 0
Number of warnings: 0
[root@develbox sitemap_gen]#
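If you want to double-check the result, a small Python sketch like the one below opens the compressed file and counts the URLs in it, which should match the number the generator reported. The function name count_sitemap_urls is made up. It assumes each URL sits in a loc element, as in the standard sitemap format.

```python
import gzip
import xml.etree.ElementTree as ET


def count_sitemap_urls(path):
    """Count <loc> entries in a gzip-compressed sitemap file.

    Hypothetical helper: matches the element name only, so it works
    whichever sitemap namespace the file declares.
    """
    with gzip.open(path, "rb") as handle:
        root = ET.parse(handle).getroot()
    return sum(
        1
        for element in root.iter()
        if isinstance(element.tag, str) and element.tag.rsplit("}", 1)[-1] == "loc"
    )
```

It also doubles as a quick well-formedness test, since a broken XML file makes ET.parse raise an error.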
To make life easier, you can do what I did and put the build command in a bash script, then just execute that. Save it to a file named build_sitemap and chmod 755 it.
#!/bin/bash
python sitemap_gen.py --config=config.xml
7. Back on the Google Sitemaps web site, you need to specify the URL of the Sitemap XML file so Google can check it on a periodic basis.
On the Sitemaps tab, select Add General Web Sitemap from the dropdown control and in item 3, type in the URL of your sitemap. Google provides you an example below. Click the Add Web Sitemap button.
When you build your sitemap, the XML file is gzip compressed, so a .gz extension is added, which makes it sitemap.xml.gz. The path you specified for store_into in the site element of config.xml determines your XML file’s URL. It’s easier to keep your Sitemap XML file in the top level of your webspace so its URL can be http://www.yourdomain.com/sitemap.xml.gz. Besides, it’s not like you’re giving away national secrets if someone other than Google downloads your sitemap file.
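The relationship between the store_into path and the public URL can be sketched in a few lines of Python. The function name sitemap_url is hypothetical, and it assumes your web root maps directly onto the site's base URL with no rewrites or aliases in between.

```python
import posixpath


def sitemap_url(base_url, docroot, store_into):
    """Work out the public URL of a sitemap stored under the web root.

    Hypothetical helper: assumes files under docroot are served at the
    same relative path below base_url.
    """
    rel = posixpath.relpath(store_into, docroot)
    if rel == ".." or rel.startswith("../"):
        raise ValueError("store_into is outside the web root, so it has no public URL")
    return base_url.rstrip("/") + "/" + rel
```

With the values from the sample configuration, sitemap_url("http://www.example.com/", "/var/www/docroot", "/var/www/docroot/sitemap.xml.gz") gives http://www.example.com/sitemap.xml.gz, which is the address you would type into the Sitemaps tab.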
8. You’re done. Now you can play the waiting game. It may take a while for Google’s spiders to get to all of your pages, but they will eventually. Browse around Google Sitemaps and check out the features. It’s a nifty tool.
Remember, whenever you update your site you should update your URL list (urllist.txt) too, such as removing deleted pages and adding the new ones. Anytime you make a change you should rebuild the Sitemap XML file. Once it’s rebuilt, the Python script will notify Google and your fresh sitemap will be downloaded in a minute or so.
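If you tend to forget the rebuild, one option is a small check that compares file timestamps. The sketch below is only one way to do it. The function name needs_rebuild is made up, and it assumes you always edit urllist.txt when the site changes.

```python
import os


def needs_rebuild(urllist_path, sitemap_path):
    """Return True when the sitemap is missing or older than the URL list.

    Hypothetical helper: relies on file modification times only.
    """
    if not os.path.exists(sitemap_path):
        return True
    return os.path.getmtime(urllist_path) > os.path.getmtime(sitemap_path)
```

You could call this from a scheduled job and run the build command only when it returns True, so Google is not notified about a sitemap that hasn’t changed.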
