  • Python Web Crawler

    If you're creating a search engine, you'll need a way to collect documents. In this excerpt from Toby Segaran's Programming Collective Intelligence, the author shows you how to set up a simple web crawler using existing tools.


    I'll assume for now that you don't have a big collection of HTML documents sitting on your hard drive waiting to be indexed, so I'll show you how to build a simple crawler. It will be seeded with a small set of pages to index and will then follow any links on those pages to find other pages, whose links it will also follow. This process is called crawling or spidering.

    To do this, your code will have to download the pages, pass them to the indexer (which you'll build in the next section), and then parse the pages to find all the links to the pages that have to be crawled next. Fortunately, there are a couple of libraries that can help with this process.

    For the examples in this chapter, I have set up a copy of several thousand files from Wikipedia, which will remain static at http://kiwitobes.com/wiki.

    You're free to run the crawler on any set of pages you like, but you can use this site if you want to compare your results to those in this chapter.

    Using urllib2

    urllib2 is a library bundled with Python that makes it easy to download pages—all you have to do is supply the URL. You'll use it in this section to download the pages that will be indexed. To see it in action, start up your Python interpreter and try this:

    >>> import urllib2
    >>> c = urllib2.urlopen('http://kiwitobes.com/wiki/Programming_language.html')
    >>> contents = c.read()
    >>> print contents[0:50]
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Trans

    All you have to do to store a page's HTML in a string is open a connection and read its contents.
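
    urllib2 exists only in Python 2. As a minimal sketch, assuming Python 3, the same download step might look like the following, using urllib.request (where urlopen now lives). The helper name fetch and its timeout parameter are illustrative assumptions, not part of the original code. Note that read() returns bytes in Python 3, not a text string.

```python
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a URL and return its body as bytes (hypothetical helper)."""
    # urlopen returns a file-like response; the with-block closes it for us.
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Example use (needs network access, so it is left commented out):
# contents = fetch('http://kiwitobes.com/wiki/Programming_language.html')
# print(contents[0:50])
```

    Wrapping the call in a small function keeps the timeout in one place, which matters for a crawler that may hit slow or unresponsive servers.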

    Crawler Code

    The crawler will use the Beautiful Soup API, an excellent library that builds a structured representation of web pages. It is very tolerant of web pages with broken HTML, which is useful when constructing a crawler because you never know what pages you might come across.

    Using urllib2 and Beautiful Soup you can build a crawler that will take a list of URLs to index and crawl their links to find other pages to index. First, add these import statements to the top of searchengine.py:

    import urllib2
    from BeautifulSoup import *
    from urlparse import urljoin

    # Create a list of words to ignore
    ignorewords = set(['the', 'of', 'to', 'and', 'a', 'in', 'is', 'it'])

    Now you can fill in the code for the crawler function. It won't actually save anything it crawls yet, but it will print the URLs as it goes so you can see that it's working. You need to put this at the end of the file (so it's part of the crawler class):

    def crawl(self, pages, depth=2):
        for i in range(depth):
            newpages = set()
            for page in pages:
                try:
                    c = urllib2.urlopen(page)
                except Exception:
                    # Exception (not a bare except) so Ctrl-C still stops the crawl
                    print "Could not open %s" % page
                    continue
                soup = BeautifulSoup(c.read())
                self.addtoindex(page, soup)

                # Find every <a> tag on the page
                links = soup('a')
                for link in links:
                    if 'href' in dict(link.attrs):
                        # Resolve relative links against the current page
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # remove location portion
                        if url[0:4] == 'http' and not self.isindexed(url):
                            newpages.add(url)
                        linkText = self.gettextonly(link)
                        self.addlinkref(page, url, linkText)

                self.dbcommit()

            # The links found at this depth become the next pages to crawl
            pages = newpages

    This function loops through the list of pages, calling addtoindex on each one (right now this does nothing except print the URL, but you'll fill it in the next section). It then uses Beautiful Soup to get all the links on that page and adds their URLs to a set called newpages. At the end of the loop, newpages becomes pages, and the process repeats.
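
    The link-handling steps inside crawl (resolve each href against the page URL, skip URLs containing a quote, drop the fragment, keep only http links) can be tried in isolation. The sketch below is a Python 3 illustration that swaps in the standard library's html.parser for Beautiful Soup, so it is not the chapter's code; LinkCollector and extract_links are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.hrefs.append(href)


def extract_links(page_url, html):
    """Return the absolute http(s) URLs linked from html, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        url = urljoin(page_url, href)   # resolve relative links
        if "'" in url:                  # same quote filter as crawl
            continue
        url = url.split('#')[0]         # remove location portion
        if url.startswith('http'):
            urls.append(url)
    return urls


sample = ('<a href="Perl.html#History">Perl</a>'
          '<a href="mailto:someone@example.com">mail</a>'
          '<a name="top">no href</a>')
print(extract_links('http://kiwitobes.com/wiki/Programming_language.html', sample))
# prints ['http://kiwitobes.com/wiki/Perl.html']
```

    Only the first anchor survives: the mailto link does not start with http, and the last anchor has no href at all.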

    This function can be defined recursively so that each link calls the function again, but doing a breadth-first search allows for easier modification of the code later, either to keep crawling continuously or to save a list of unindexed pages for later crawling. It also avoids the risk of overflowing the stack.
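
    To make the breadth-first shape concrete, here is a minimal Python 3 sketch that crawls a toy in-memory link graph instead of the real web. The names crawl_breadth_first, get_links, and toy_web are assumptions for illustration. Unlike crawl, it tracks seen pages in a local set and uses lists so the visiting order is predictable.

```python
def crawl_breadth_first(seeds, get_links, depth=2):
    """Visit pages level by level.

    Returns (visited, frontier): the pages visited in order, and the
    pages discovered at the last level but not yet visited.
    """
    visited = []
    seen = set(seeds)
    pages = list(seeds)
    for _ in range(depth):
        newpages = []
        for page in pages:
            visited.append(page)
            for url in get_links(page):
                if url not in seen:
                    seen.add(url)
                    newpages.append(url)
        pages = newpages  # the next level replaces the current one
    return visited, pages


# A toy "web": each page maps to the pages it links to.
toy_web = {
    'A': ['B', 'C'],
    'B': ['C', 'D'],
    'C': ['A'],
    'D': [],
}

visited, frontier = crawl_breadth_first(['A'], lambda p: toy_web.get(p, []), depth=2)
print(visited)   # prints ['A', 'B', 'C']
print(frontier)  # prints ['D']
```

    The returned frontier is the "list of unindexed pages" mentioned above: it could be saved and passed back in as seeds to resume the crawl later, and no call stack grows with the depth.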

    You can test this function in the Python interpreter (there's no need to let it finish, so press Ctrl-C when you get bored):

    >>> import searchengine
    >>> pagelist = ['http://kiwitobes.com/wiki/Perl.html']
    >>> crawler = searchengine.crawler('')
    >>> crawler.crawl(pagelist)
    Indexing http://kiwitobes.com/wiki/Perl.html
    Could not open http://kiwitobes.com...ramming%29.html
    Indexing http://kiwitobes.com...ry_Project.html
    Indexing http://kiwitobes.com...face.html

    You may notice that some pages are repeated. There is a placeholder in the code for another function, isindexed, which will determine if a page has been indexed recently before adding it to newpages. This will let you run this function on any list of URLs at any time without worrying about doing unnecessary work.
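
    The excerpt leaves isindexed as a placeholder. Purely as an illustration of the "indexed recently" idea, an in-memory version in Python 3 might record a timestamp per URL. The IndexLog class, its mark_indexed method, and the one-day default are all assumptions, not the chapter's implementation.

```python
import time


class IndexLog:
    """Remembers when each URL was last indexed (hypothetical helper)."""

    def __init__(self, max_age_seconds=24 * 60 * 60, clock=time.time):
        self.max_age_seconds = max_age_seconds
        self.clock = clock        # injectable so the logic can be tested
        self.last_indexed = {}    # url -> timestamp of last indexing

    def mark_indexed(self, url):
        self.last_indexed[url] = self.clock()

    def isindexed(self, url):
        """True if url was indexed within the last max_age_seconds."""
        stamp = self.last_indexed.get(url)
        return stamp is not None and self.clock() - stamp < self.max_age_seconds


log = IndexLog()
log.mark_indexed('http://kiwitobes.com/wiki/Perl.html')
print(log.isindexed('http://kiwitobes.com/wiki/Perl.html'))    # prints True
print(log.isindexed('http://kiwitobes.com/wiki/Python.html'))  # prints False
```

    A check like this only lasts as long as the process does; remembering indexed pages between runs would need persistent storage.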

  • Original article: https://www.cnblogs.com/UnGeek/p/2700684.html