You'll be better able to see what's happening if you turn on debugging.
This is a URL which I have set up to permanently redirect to my Atom feed at http://diveintomark.org/xml/atom.xml.
Sure enough, when you try to download the data at that address, the server sends back a 301 status code, telling you that the resource has moved permanently.
The server also sends back a Location: header that gives the new address of this data.
urllib2 notices the redirect status code and automatically tries to retrieve the data at the new location specified in the Location: header.
The object you get back from the opener contains the new permanent address and all the headers returned from the second request (retrieved from the new permanent
address). But the status code is missing, so you have no way of knowing programmatically whether this redirect was temporary
or permanent. And that matters very much: if it was a temporary redirect, then you should continue to ask for the data at
the old location. But if it was a permanent redirect (as this was), you should ask for the data at the new location from
now on.
This is suboptimal, but easy to fix. urllib2 doesn't behave exactly as you want it to when it encounters a 301 or 302, so let's override its behavior. How? With a custom URL handler, just like you did to handle 304 codes.
Example 11.11. Defining the redirect handler
This class is defined in openanything.py.
import urllib2

class SmartRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        # Let the ancestor follow the redirect, then record the status code.
        result = urllib2.HTTPRedirectHandler.http_error_301(
            self, req, fp, code, msg, headers)
        result.status = code
        return result

    def http_error_302(self, req, fp, code, msg, headers):
        # Same approach for temporary redirects.
        result = urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)
        result.status = code
        return result
Redirect behavior is defined in urllib2 in a class called HTTPRedirectHandler. You don't want to override the behavior completely; you just want to extend it a little, so you subclass HTTPRedirectHandler and call the ancestor class to do all the hard work.
When it encounters a 301 status code from the server, urllib2 will search through its handlers and call the http_error_301 method. The first thing ours does is just call the http_error_301 method in the ancestor, which handles the grunt work of looking for the Location: header and following the redirect to the new address.
Here's the key: before you return, you store the status code (301), so that the calling program can access it later.
Temporary redirects (status code 302) work the same way: override the http_error_302 method, call the ancestor, and save the status code before returning.
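The lookup described above, where urllib2 searches its handlers for a method named after the status code, can be pictured with a toy model. This is only a simplified sketch of the idea, not urllib2's actual implementation; ToyOpener, LoggingHandler, and the example.com URL are hypothetical names made up for illustration.

```python
class ToyOpener:
    """Toy model of dispatching on a status code; NOT urllib2's real code."""

    def __init__(self, handlers):
        self.handlers = handlers

    def error(self, code, *args):
        # Build the method name from the status code, e.g. "http_error_301".
        method_name = "http_error_%d" % code
        for handler in self.handlers:
            method = getattr(handler, method_name, None)
            if method is not None:
                # The first handler that defines the method gets the call.
                return method(*args)
        return None  # no handler was interested in this status code


class LoggingHandler:
    """Hypothetical handler that only knows about 301."""

    def http_error_301(self, url):
        return "301 handled for " + url


opener = ToyOpener([LoggingHandler()])
print(opener.error(301, "http://example.com/"))
```

SmartRedirectHandler plugs into this kind of lookup simply by defining methods with the right names.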
So what has this bought us? You can now build a URL opener with the custom redirect handler, and it will still automatically
follow redirects, but now it will also expose the redirect status code.
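If you want to try this without depending on a remote server, here is a minimal sketch ported to Python 3, where urllib2's functionality lives in urllib.request. Several things in it are assumptions rather than part of openanything.py: the throwaway local server (DemoFeedServer), its paths /example301.xml, /example302.xml, and /atom.xml, the helper fetch_redirect_info, and the attribute name redirect_status. The separate attribute is used because Python 3 response objects already carry a status attribute holding the final response code, and overwriting it would hide that value.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SmartRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Follow redirects as usual, but remember which code triggered them."""

    def http_error_301(self, req, fp, code, msg, headers):
        result = super().http_error_301(req, fp, code, msg, headers)
        if result is not None:
            # Hypothetical attribute name; .status is already taken in Python 3.
            result.redirect_status = code
        return result

    def http_error_302(self, req, fp, code, msg, headers):
        result = super().http_error_302(req, fp, code, msg, headers)
        if result is not None:
            result.redirect_status = code
        return result


class DemoFeedServer(BaseHTTPRequestHandler):
    """Throwaway local server standing in for the real feed (an assumption)."""

    def do_GET(self):
        if self.path == "/example301.xml":
            self._redirect(301)
        elif self.path == "/example302.xml":
            self._redirect(302)
        else:
            body = b"<feed/>"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def _redirect(self, code):
        self.send_response(code)
        self.send_header("Location", "/atom.xml")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_redirect_info(path):
    """Return (redirect status or None, final path) for a request to path."""
    server = HTTPServer(("127.0.0.1", 0), DemoFeedServer)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    # An empty ProxyHandler keeps the local request away from any system proxy.
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}), SmartRedirectHandler())
    try:
        with opener.open(base + path, timeout=5) as f:
            return getattr(f, "redirect_status", None), f.url[len(base):]
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_redirect_info("/example301.xml"))
    print(fetch_redirect_info("/example302.xml"))
```

In both cases the opener still ends up at /atom.xml; the only difference the caller sees is the recorded redirect code.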
Example 11.12. Using the redirect handler to detect permanent redirects
First, build a URL opener with the redirect handler you just defined.
You sent off a request, and you got a 301 status code in response. At this point, the http_error_301 method gets called. You call the ancestor method, which follows the redirect and sends a request to the new location (http://diveintomark.org/xml/atom.xml).
This is the payoff: now, not only do you have access to the new URL, but you have access to the redirect status code, so you
can tell that this was a permanent redirect. The next time you request this data, you should request it from the new location
(http://diveintomark.org/xml/atom.xml, as specified in f.url). If you stored the old location in a configuration file or a database, you need to update it so you don't keep pounding
the server with requests at the old address. It's time to update your address book.
The same redirect handler can also tell you that you shouldn't update your address book.
Example 11.13. Using the redirect handler to detect temporary redirects
This is a sample URL I've set up that is configured to tell clients to temporarily redirect to http://diveintomark.org/xml/atom.xml.
The server sends back a 302 status code, indicating a temporary redirect. The temporary new location of the data is given in the Location: header.
urllib2 calls your http_error_302 method, which calls the ancestor method of the same name in urllib2.HTTPRedirectHandler, which follows the redirect to the new location. Then your http_error_302 method stores the status code (302) so the calling application can get it later.
And here you are, having successfully followed the redirect to http://diveintomark.org/xml/atom.xml. f.status tells you that this was a temporary redirect, which means that you should continue to request data from the original address
(http://diveintomark.org/redir/example302.xml). Maybe it will redirect next time too, but maybe not. Maybe it will redirect to a different address. It's not for you
to say. The server said this redirect was only temporary, so you should respect that. And now you're exposing enough information
that the calling application can respect that.
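What the calling application does with that information might look like the following sketch. The function name url_for_next_request and the example.com addresses are hypothetical; the rule itself is the one described in this section: adopt the new address only after a permanent redirect.

```python
def url_for_next_request(original_url, final_url, redirect_status):
    """Pick the address to use the next time this resource is requested."""
    if redirect_status == 301:
        # Permanent redirect: switch to the new location from now on.
        return final_url
    # Temporary redirect (302) or no redirect: keep asking at the old address.
    return original_url


# Temporary redirect from this section: keep using the original address.
print(url_for_next_request(
    "http://diveintomark.org/redir/example302.xml",
    "http://diveintomark.org/xml/atom.xml",
    302))
```

A program that keeps feed addresses in a configuration file or database would store the returned value after each fetch.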