    Python3 Web Crawler

    1. Using Python 3 directly

    A simple piece of pseudocode

    The simple pseudocode below relies on two classic data structures: a set and a queue. The set records the pages that have already been visited, and the queue drives the breadth-first search.

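    A minimal sketch of that pseudocode in Python might look like this; the fetch_links(url) helper is hypothetical and stands in for whatever code extracts a page's outgoing links:

    from collections import deque

    def crawl(seed_url, max_pages, fetch_links):
        visited = set()                  # pages already seen
        frontier = deque([seed_url])     # FIFO queue of pages still to fetch
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()     # dequeue the next page (FIFO order = BFS)
            if url in visited:
                continue                 # the set prevents revisiting pages
            visited.add(url)
            for link in fetch_links(url):
                if link not in visited:
                    frontier.append(link)   # enqueue newly discovered links
        return visited
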
    The set used here is backed internally by a hash table. For a crawler, a conventional hash-based set takes too much memory, so a data structure called a Bloom filter is better suited to replace the hash-based set in this role.
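
    Below is a rough, hand-rolled sketch of such a Bloom filter; the bit-array size and number of hashes are illustrative only, and a real crawler would more likely use an existing library such as pybloom:

    import hashlib

    class BloomFilter:
        """Probabilistic 'have I seen this URL?' set: false positives are
        possible, false negatives are not, and memory use is a fixed bit array."""

        def __init__(self, size_bits=1 << 20, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)   # one bit per slot

        def _positions(self, item):
            # Derive num_hashes bit positions from the md5 digest of the item
            digest = hashlib.md5(item.encode("utf-8")).hexdigest()
            for i in range(self.num_hashes):
                yield int(digest[i * 6:(i + 1) * 6], 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    Usage mirrors a plain set: bf.add(url) records a page and url in bf tests membership, at the cost of an occasional false positive.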

    A simple webSpider implementation

    from html.parser import HTMLParser
    from urllib.request import urlopen
    from urllib import parse

    class LinkParser(HTMLParser):
        """Collect absolute URLs from the href attributes of <a> tags."""

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for (key, value) in attrs:
                    if key == 'href':
                        # Resolve relative links against the page URL
                        newUrl = parse.urljoin(self.baseUrl, value)
                        self.links = self.links + [newUrl]

        def getLinks(self, url):
            self.links = []
            self.baseUrl = url
            response = urlopen(url)
            # The header is usually "text/html; charset=utf-8", so check the
            # media type instead of comparing the whole string for equality
            if 'text/html' in response.getheader('Content-Type', ''):
                htmlBytes = response.read()
                htmlString = htmlBytes.decode("utf-8")
                self.feed(htmlString)
                return htmlString, self.links
            else:
                return "", []

    def spider(url, word, maxPages):
        pagesToVisit = [url]    # queue of URLs waiting to be crawled
        numberVisited = 0
        foundWord = False
        while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
            numberVisited = numberVisited + 1
            # Take the next URL from the front of the queue (breadth-first)
            url = pagesToVisit[0]
            pagesToVisit = pagesToVisit[1:]
            try:
                print(numberVisited, "Visiting:", url)
                parser = LinkParser()
                data, links = parser.getLinks(url)
                if data.find(word) > -1:
                    foundWord = True
                pagesToVisit = pagesToVisit + links
                print("**Success!**")
            except Exception:
                print("**Failed!**")

        # Report the result once the crawl loop has finished
        if foundWord:
            print("The word", word, "was found at", url)
        else:
            print("Word never found")

    Appendix: (Python assignment and module usage)

    • Assignment
    # Assign values directly
    a, b = 0, 1
    assert a == 0
    assert b == 1
      
    # Assign values from a list
    (r,g,b) = ["Red","Green","Blue"]
    assert r == "Red"
    assert g == "Green"
    assert b == "Blue"
      
    # Assign values from a tuple
    (x,y) = (1,2)
    assert x == 1
    assert y == 2
    

      

    • Using the module

            Open a Python interpreter in the same directory as the module file and run the following statements:

    >>> import WebSpider
    >>> WebSpider.spider("http://baike.baidu.com", '羊城', 1000)
    

      

    2. Using the Scrapy framework

    Installation

    Dependencies:

    OpenSSL, libxml2

    Install them with: pip install pyOpenSSL lxml

     $ pip install scrapy
     $ cat > myspider.py <<EOF
    import scrapy

    class BlogSpider(scrapy.Spider):
        name = 'blogspider'
        start_urls = ['http://blog.scrapinghub.com']

        def parse(self, response):
            # Follow archive links whose path ends in /YYYY/MM/
            for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
                yield scrapy.Request(response.urljoin(url), self.parse_titles)

        def parse_titles(self, response):
            # Emit one item per post title found on the archive page
            for post_title in response.css('div.entries > ul > li a::text').extract():
                yield {'title': post_title}
    EOF
     $ scrapy runspider myspider.py
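    If you want the scraped titles written to a file rather than only printed to the console, runspider accepts Scrapy's standard output flag, e.g. scrapy runspider myspider.py -o titles.json.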

      

    References:

    https://jecvay.com/2014/09/python3-web-bug-series1.html

    http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/

    http://www.jb51.net/article/65260.htm

    http://scrapy.org/

    https://docs.python.org/3/tutorial/modules.html
