20.5.4. urllib Restrictions
Currently, only the following protocols are supported: HTTP (versions 0.9 and 1.0), FTP, and local files.

The caching feature of urlretrieve() has been disabled until I find the time to hack proper processing of Expiration time headers.

There should be a function to query whether a particular URL is in the cache.
For backward compatibility, if a URL appears to point to a local file but the file can’t be opened, the URL is re-interpreted using the FTP protocol. This can sometimes cause confusing error messages.

The urlopen() and urlretrieve() functions can cause arbitrarily long delays while waiting for a network connection to be set up. This means that it is difficult to build an interactive Web client using these functions without using threads.
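One way to bound those delays, since urlopen() in this module takes no timeout argument, is to set a global default timeout for new socket connections before opening the URL. A minimal sketch (the 15-second value is arbitrary):

>>> import socket
>>> import urllib
>>> socket.setdefaulttimeout(15)   # seconds; applies to new connections
>>> f = urllib.urlopen("http://www.python.org/")

With the default timeout in place, a stalled connection attempt raises socket.timeout instead of blocking indefinitely.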
The data returned by urlopen() or urlretrieve() is the raw data returned by the server. This may be binary data (such as an image), plain text or (for example) HTML. The HTTP protocol provides type information in the reply header, which can be inspected by looking at the Content-Type header. If the returned data is HTML, you can use the module htmllib to parse it.
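For instance, the headers are available through the info() method of the object returned by urlopen(), which for HTTP responses is a mimetools.Message instance. A brief sketch (the exact values naturally depend on the server):

>>> import urllib
>>> f = urllib.urlopen("http://www.python.org/")
>>> f.info().gettype()                   # parsed Content-Type, parameters stripped
'text/html'
>>> f.info().getheader('Content-Type')   # raw header value
'text/html'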
The code handling the FTP protocol cannot differentiate between a file and a directory. This can lead to unexpected behavior when attempting to read a URL that points to a file that is not accessible. If the URL ends in a /, it is assumed to refer to a directory and will be handled accordingly. But if an attempt to read a file leads to a 550 error (meaning the URL cannot be found or is not accessible, often for permission reasons), then the path is treated as a directory in order to handle the case when a directory is specified by a URL but the trailing / has been left off. This can cause misleading results when you try to fetch a file whose read permissions make it inaccessible; the FTP code will try to read it, fail with a 550 error, and then perform a directory listing for the unreadable file. If fine-grained control is needed, consider using the ftplib module, subclassing FancyURLopener, or changing _urlopener to meet your needs.
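As an illustration of the ftplib route, the sketch below retrieves a file with an explicit RETR command, so no guessing between file and directory takes place (ftp.example.com and README are placeholders):

>>> from ftplib import FTP
>>> ftp = FTP('ftp.example.com')    # placeholder host
>>> ftp.login()                     # anonymous login
>>> ftp.retrbinary('RETR README', open('README', 'wb').write)
>>> ftp.quit()

A 550 reply from the server then surfaces as an ftplib.error_perm exception rather than being reinterpreted as a directory listing.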
This module does not support the use of proxies which require authentication. This may be implemented in the future.

Although the urllib module contains (undocumented) routines to parse and unparse URL strings, the recommended interface for URL manipulation is in module urlparse.
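For example, splitting a URL into its components and reassembling it with urlparse is straightforward (a short sketch using a placeholder URL):

>>> import urlparse
>>> parts = urlparse.urlparse('http://www.example.com/path?query=1#frag')
>>> parts.scheme, parts.netloc, parts.path
('http', 'www.example.com', '/path')
>>> urlparse.urlunparse(parts)
'http://www.example.com/path?query=1#frag'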
20.5.5. Examples

Here is an example session that uses the GET method to retrieve a URL containing parameters:

>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
>>> print f.read()

The following example uses the POST method instead:

>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
>>> print f.read()

The following example uses an explicitly specified HTTP proxy, overriding environment settings:

>>> import urllib
>>> proxies = {'http': 'http://proxy.example.com:8080/'}
>>> opener = urllib.FancyURLopener(proxies)
>>> f = opener.open("http://www.python.org")
>>> f.read()

The following example uses no proxies at all, overriding environment settings:

>>> import urllib
>>> opener = urllib.FancyURLopener({})
>>> f = opener.open("http://www.python.org/")
>>> f.read()