  • A basic Baidu crawler

    from bs4 import BeautifulSoup
    import urllib2
    import urllib
    import re
    import urlparse
    
    param = raw_input('Please input what you want to search: ')
    #   www.baidu.com/s?&wd=kkkkkkkkkkkk
    yeshu = int(raw_input('Please input page number 1-10: '))
    #www.baidu.com/s?wd=11111&pn=20
    for page in range(yeshu):
        # Baidu paginates in steps of 10: pn=0, 10, 20, ...
        url = 'http://www.baidu.com/s?&wd='+param+'&pn='+str(page * 10)
        try:
            req = urllib2.urlopen(url)
        except urllib2.URLError:
            continue
        content = req.read()
    
        soup = BeautifulSoup(content, 'html.parser')

        # Each search-result title sits in a node with class "t"
        link = soup.find_all(class_ = 't')
    
        href = []
        pattern = re.compile('href="(.+?)"')
        for item in link:
            rs = pattern.findall(str(item))
            if len(rs) == 0:
                continue    # this result has no link; keep scanning the rest
            href.append(str(rs[0]))
    
        fp = open('url.txt', 'a+')
        for t in range(len(href)):
            try:
                ss = urllib2.urlopen(href[t])
            except urllib2.URLError:
                continue
            real = ss.geturl()    # follow Baidu's redirect to the real URL
            realdomain = urlparse.urlparse(real).netloc
            fp.write(realdomain + '\n')
        fp.close()
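The three pure steps of the crawler above (building the paginated search URL, pulling an `href` out of each result fragment, and reducing a redirect target to its domain) can be isolated into small functions and exercised without any network access. A minimal Python 3 sketch, assuming the same `href="…"` regex and the `pn = page * 10` pagination scheme; the function names are illustrative, not part of the original script:

```python
import re
from urllib.parse import urlparse

HREF_RE = re.compile(r'href="(.+?)"')

def build_search_url(query, page):
    """Build a Baidu search URL; pn advances in steps of 10 per page."""
    return 'http://www.baidu.com/s?&wd=%s&pn=%d' % (query, page * 10)

def extract_hrefs(fragments):
    """Pull the first href out of each HTML fragment, skipping fragments without one."""
    hrefs = []
    for frag in fragments:
        m = HREF_RE.search(frag)
        if m:
            hrefs.append(m.group(1))
    return hrefs

def domain_of(url):
    """Return just the host part of a URL, as the script writes to url.txt."""
    return urlparse(url).netloc
```

Keeping these steps pure makes the skip-on-missing-link behavior easy to verify: a result node with no `href` is simply dropped instead of ending the scan.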
    
        
  • Original post: https://www.cnblogs.com/elliottc/p/5024437.html