First, import the module and use help() to read its documentation.
>>> from urlparse import urljoin
>>> help(urljoin)
Help on function urljoin in module urlparse:

urljoin(base, url, allow_fragments=True)
    Join a base URL and a possibly relative URL to form an absolute
    interpretation of the latter.
In other words, it combines a base URL with a relative URL to produce an absolute URL. That description is rather abstract, though.
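The session above uses the Python 2 module name urlparse. In Python 3 the same function lives in urllib.parse. A minimal sketch that works on either version, assuming nothing beyond the standard library (the variable name absolute is only illustrative):

```python
# Import urljoin from wherever the running interpreter provides it.
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Resolve a relative link against the page it was found on.
absolute = urljoin("http://www.google.com/1/aaa.html", "bbbb.html")
print(absolute)
```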
So let's look at a few examples and work out the rules from them.
>>> urljoin("http://www.google.com/1/aaa.html","bbbb.html")
'http://www.google.com/1/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html","2/bbbb.html")
'http://www.google.com/1/2/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html","/2/bbbb.html")
'http://www.google.com/2/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html","http://www.google.com/3/ccc.html")
'http://www.google.com/3/ccc.html'
>>> urljoin("http://www.google.com/1/aaa.html","http://www.google.com/ccc.html")
'http://www.google.com/ccc.html'
>>> urljoin("http://www.google.com/1/aaa.html","javascript:void(0)")
'javascript:void(0)'
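The rules are easy to spot, but that is not the whole job: special cases still need handling, such as a link that points back to the page itself or a link that contains invalid characters.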
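To make the "link points to the page itself" case concrete, here is a small sketch using Python 3's urllib.parse. The base URL is an example value and the variable names are illustrative. An empty href or a fragment-only href both resolve to the current page, and urldefrag is one standard-library way to separate the fragment.

```python
from urllib.parse import urljoin, urldefrag

base = "http://www.google.com/1/aaa.html"

# An empty href resolves to the base URL itself.
same_page = urljoin(base, "")

# A fragment-only href is also the same page, just with an anchor attached.
with_fragment = urljoin(base, "#top")

# urldefrag splits a URL into (url_without_fragment, fragment).
stripped, fragment = urldefrag(with_fragment)

print(same_page)      # the base URL, unchanged
print(with_fragment)  # the base URL plus "#top"
print(stripped)       # the base URL again once the fragment is removed
```

Both forms collapse to the base URL once the fragment is dropped, so a crawler that strips fragments can treat them as links it has already seen.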
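### This fragment sits inside a loop, hence the continue
url = urljoin("****","****")
### Skip links that contain a single quote; find() returns the index of the first match, or -1 if there is none
if url.find("'")!=-1:
    continue
### Keep only the part before the '#'
url = url.split('#')[0]
### isindexed() is a function I wrote myself; the check passes only when the link is not yet in the database of saved links
if url[0:4]=='http' and not self.isindexed(url):
    ### newpages = set(), an unordered collection of unique elements
    newpages.add(url)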
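The fragment above depends on a surrounding loop and class that are not shown. As a self-contained sketch of the same filtering steps, the version below wraps them in a function. The name collect_new_pages, its parameters, the sample hrefs, and the in-memory already_indexed set (standing in for the author's database-backed isindexed()) are all assumptions made for illustration, not the author's actual code.

```python
from urllib.parse import urljoin  # Python 3 home of urljoin


def collect_new_pages(page, hrefs, is_indexed):
    """Resolve each href against page and return the URLs worth crawling.

    is_indexed is a hypothetical callable that reports whether a URL
    has already been stored.
    """
    newpages = set()  # a set keeps each URL only once
    for href in hrefs:
        url = urljoin(page, href)
        if "'" in url:           # skip links containing a single quote
            continue
        url = url.split('#')[0]  # keep only the part before the '#'
        if url.startswith('http') and not is_indexed(url):
            newpages.add(url)
    return newpages


# Hypothetical stand-in for the database of saved links.
already_indexed = {"http://www.google.com/3/ccc.html"}

found = collect_new_pages(
    "http://www.google.com/1/aaa.html",
    ["bbbb.html", "/2/bbbb.html#section", "javascript:void(0)",
     "http://www.google.com/3/ccc.html", "it's.html"],
    lambda url: url in already_indexed,
)
print(sorted(found))
```

Of the five sample links, only the two relative ones survive: the javascript: link fails the 'http' prefix check, the /3/ccc.html link is already indexed, and the last one contains a single quote.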