  • Python web crawler

    1. Simple approach (breadth-first traversal): https://fossbytes.com/how-to-build-a-basic-web-crawler-in-python/

    # Python 2 code: thread, Queue, urlparse and urllib.urlopen were renamed or moved in Python 3
    import sys, thread, Queue, re, urllib, urlparse, time
    dupcheck = set()  
    q = Queue.Queue(100) 
    q.put(sys.argv[1]) 
    def queueURLs(html, origLink): 
        for url in re.findall('''<a[^>]+href=["'](.[^"']+)["']''', html, re.I): 
            link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0] 
            if link in dupcheck:
                continue
            dupcheck.add(link)
            if len(dupcheck) > 99999: 
                dupcheck.clear()
            q.put(link) 
    def getHTML(link): 
        try:
            html = urllib.urlopen(link).read() 
            open(str(time.time()) + ".html", "w").write("%s" % link + "\n" + html) 
            queueURLs(html, link) 
        except (KeyboardInterrupt, SystemExit): 
            raise
        except Exception:
            pass
    while True:
        thread.start_new_thread( getHTML, (q.get(),)) 
        time.sleep(0.5)

    Idea: use a queue (Queue) to traverse the links breadth-first.
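The same queue-based idea can be sketched in Python 3 without threads. This is a minimal illustration, not the original author's code: the names `crawl_bfs`, `fetch_html`, the `fetch` parameter and the `max_pages` limit are assumptions introduced here. A FIFO queue holds the URLs still to visit, and a set keeps every URL already queued so that no page is fetched twice.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

# Same kind of href regex as the listing above; good enough for a sketch only.
HREF_RE = re.compile(r'''<a[^>]+href=["']([^"']+)["']''', re.I)


def fetch_html(url):
    """Download a page and return its text (assumes UTF-8, replaces bad bytes)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_bfs(start_url, fetch=fetch_html, max_pages=50):
    """Visit pages breadth-first and return the URLs in visiting order."""
    seen = {start_url}           # every URL that has ever been queued
    queue = deque([start_url])   # FIFO queue of URLs waiting to be fetched
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()    # taking from the front gives breadth-first order
        try:
            html = fetch(url)
        except Exception:
            continue             # skip pages that cannot be downloaded
        visited.append(url)
        for href in HREF_RE.findall(html):
            # Resolve relative links against the current page and drop #fragments.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the download function in as `fetch` is a design choice made for this sketch: it lets the traversal be tried against an in-memory dictionary of pages before pointing it at a real site, for example `crawl_bfs("http://example.test/", fetch=pages.__getitem__)`.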

    2. Simple approach, searching for a given word: http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/

    from html.parser import HTMLParser  
    from urllib.request import urlopen  
    from urllib import parse
    
    # We are going to create a class called LinkParser that inherits some
    # methods from HTMLParser which is why it is passed into the definition
    class LinkParser(HTMLParser):
    
        # This is a function that HTMLParser normally has
        # but we are adding some functionality to it
        def handle_starttag(self, tag, attrs):
            # We are looking for the beginning of a link. Links normally look
            # like <a href="www.someurl.com"></a>
            if tag == 'a':
                for (key, value) in attrs:
                    if key == 'href':
                        # We are grabbing the new URL. We are also adding the
                        # base URL to it. For example:
                        # www.netinstructions.com is the base and
                        # somepage.html is the new URL (a relative URL)
                        #
                        # We combine a relative URL with the base URL to create
                        # an absolute URL like:
                        # www.netinstructions.com/somepage.html
                        newUrl = parse.urljoin(self.baseUrl, value)
                        # And add it to our collection of links:
                        self.links.append(newUrl)
    
        # This is a new function that we are creating to get links
        # that our spider() function will call
        def getLinks(self, url):
            self.links = []
            # Remember the base URL which will be important when creating
            # absolute URLs
            self.baseUrl = url
            # Use the urlopen function from the standard Python 3 library
            response = urlopen(url)
            # Make sure that we are looking at HTML and not other things that
            # are floating around on the internet (such as
            # JavaScript files, CSS, or .PDFs for example)
            # The header often carries a charset, e.g. 'text/html; charset=utf-8'
            if 'text/html' in (response.getheader('Content-Type') or ''):
                htmlBytes = response.read()
                # Note that feed() handles Strings well, but not bytes
                # (A change from Python 2.x to Python 3.x)
                htmlString = htmlBytes.decode("utf-8")
                self.feed(htmlString)
                return htmlString, self.links
            else:
                return "",[]
    
    # And finally here is our spider. It takes in an URL, a word to find,
    # and the number of pages to search through before giving up
    def spider(url, word, maxPages):  
        pagesToVisit = [url]
        numberVisited = 0
        foundWord = False
        # The main loop. Create a LinkParser and get all the links on the page.
        # Also search the page for the word or string
        # In our getLinks function we return the web page
        # (this is useful for searching for the word)
        # and we return a list of links from that web page
        # (this is useful for where to go next)
        while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
            numberVisited = numberVisited +1
            # Start from the beginning of our collection of pages to visit:
            url = pagesToVisit[0]
            pagesToVisit = pagesToVisit[1:]
            try:
                print(numberVisited, "Visiting:", url)
                parser = LinkParser()
                data, links = parser.getLinks(url)
                if data.find(word)>-1:
                    foundWord = True
                # Add the links found on this page to the end of our
                # collection of pages to visit:
                pagesToVisit = pagesToVisit + links
                print(" **Success!**")
            except Exception:
                print(" **Failed!**")
        if foundWord:
            print("The word", word, "was found at", url)
        else:
            print("Word never found")

    This approach makes full use of some of HTMLParser's features.
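The HTMLParser features in question can be seen in isolation, without any network access. The following is a small sketch under stated assumptions: `LinkCollector` and `extract_links` are hypothetical names introduced here, not part of the listing above. It relies on two documented behaviours of `html.parser.HTMLParser`: `feed()` calls `handle_starttag()` for every opening tag, and tag and attribute names arrive already lowercased.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this hook,
        # so <A HREF=...> and <a href=...> are handled the same way.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative URLs into absolute ones using the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Parse an HTML string and return the absolute URLs of its links."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Because the parser works on a string, it can be checked against a hand-written snippet such as `extract_links('<a href="somepage.html">x</a>', "http://www.netinstructions.com/")` before it is wired into a crawler.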

  • Original article: https://www.cnblogs.com/Tommy-Yu/p/6412277.html