  • Fetching a Web Page and Extracting Information with Python (Annotated Program)

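    I recently needed to process web pages with Python for a project, so I studied the relevant material. The program below uses Python to fetch a web page and extract a piece of information from it: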

    #------------------------------------------------------------------------------
    
    import urllib2 # extensible library for opening URLs
    import re  # regular expression module
    
    #------------------------------------------------------------------------------
    def main():
        userMainUrl = "http://www.songtaste.com/user/351979/"
        req = urllib2.Request(userMainUrl)  # request
        resp = urllib2.urlopen(req)         # response
        respHtml = resp.read()     # read html
        print "respHtml =", respHtml
        # Target element in the page source: <h1 class="h1user">crifan</h1>
        foundH1user = re.search(r'<h1\s+?class="h1user">(?P<h1user>.+?)</h1>', respHtml)
        print "foundH1user =", foundH1user
        if foundH1user:
            h1user = foundH1user.group("h1user")
            print "h1user =", h1user
        
    ###################################################################################
    if __name__=='__main__':
        main()

    The goal of this program is to find the following element in the source of the page http://www.songtaste.com/user/351979/:

    <h1 class="h1user">crifan</h1>

    and then extract the text "crifan" from it.
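
    For readers on Python 3, where the Request and urlopen functions of urllib2 are provided by urllib.request, the same fetch-and-extract flow might look like the minimal sketch below. The helper names fetch_html and extract_h1user are illustrative and not part of the original program. So that the sketch runs without network access, the sample page is served through a data: URL; that the real page still contains the same h1 element is an assumption.

```python
import re
import urllib.request
from urllib.parse import quote

# Same pattern as in the program above, compiled once for reuse.
H1USER_RE = re.compile(r'<h1\s+?class="h1user">(?P<h1user>.+?)</h1>')


def fetch_html(url):
    """Send the request, read the response and return the body as text."""
    req = urllib.request.Request(url)                # request
    with urllib.request.urlopen(req) as resp:        # response
        # Fall back to UTF-8 when the response declares no charset (assumption).
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_h1user(html):
    """Return the text of the h1user heading, or None if it is absent."""
    found = H1USER_RE.search(html)
    return found.group("h1user") if found else None


# Hypothetical stand-in for the real page, delivered via a data: URL.
sample_page = '<html><body><h1 class="h1user">crifan</h1></body></html>'
sample_url = "data:text/html;charset=utf-8," + quote(sample_page)

print("h1user =", extract_h1user(fetch_html(sample_url)))
```

    To try it against a live site, pass an http:// or https:// URL to fetch_html instead of sample_url.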

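    Reading a page from the network takes two steps: sending a request to the web server (urllib2.Request) and reading the server's response (urllib2.urlopen followed by read()). The core statement of the program is analyzed below:
    foundH1user = re.search(r'<h1\s+?class="h1user">(?P<h1user>.+?)</h1>', respHtml)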

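    This statement uses a regular expression to match the string <h1 class="h1user">crifan</h1>. The text between the opening <h1 ...> tag and the closing </h1> tag is captured as a group named h1user.
    Note: the "1" in "h1user" is the digit 1, not the lowercase letter l.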
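
    The effect of the statement, and of the digit-versus-letter pitfall noted above, can be checked on a literal string without fetching anything. This is a small sketch written for Python 3; the variable names are illustrative.

```python
import re

html = '<h1 class="h1user">crifan</h1>'
pattern = r'<h1\s+?class="h1user">(?P<h1user>.+?)</h1>'

found = re.search(pattern, html)
print("whole match :", found.group(0))          # the full <h1>...</h1> element
print("named group :", found.group("h1user"))   # only the text between the tags

# Same pattern with the letter l typed instead of the digit 1.
typo_pattern = pattern.replace("h1user", "hluser")
print("typo pattern:", re.search(typo_pattern, html))
```

    With the typo, the class attribute in the pattern no longer equals the one in the page, so re.search finds nothing.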
    The program draws on the following concepts:

    1. re.search

    re.search(pattern, string, flags=0)

    Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

    class re.MatchObject

    Match objects always have a boolean value of True.

    Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:

    match = re.search(pattern, string)
    if match:
        process(match)
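
    The difference between "no match" and "a zero-length match" mentioned in the quoted documentation is easy to miss, so here is a short Python 3 sketch of it. The patterns and the input string are made up for illustration.

```python
import re

text = "abc"

# \d+ needs at least one digit, so nothing in "abc" matches: the result is None.
no_match = re.search(r"\d+", text)

# \d* may match zero digits, so it succeeds at position 0 with an empty string.
empty_match = re.search(r"\d*", text)

print("no_match    :", no_match)
print("empty_match :", repr(empty_match.group()), empty_match.span())

# A match object is always true, even when the matched text is empty,
# so `if match:` distinguishes the two cases correctly.
print("truthiness  :", bool(no_match), bool(empty_match))
```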

     

    2. group([group1, ...])

    Match objects support the following methods and attributes:

    group([group1, ...])

    Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.
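
    The cases listed in the description above can be exercised one by one. The following Python 3 sketch uses an invented pattern and input string purely for illustration.

```python
import re

# Three groups; the last one is optional and does not take part in this match.
m = re.search(r"(\w+)@(\w+)(\.com)?", "contact: alice@example")

whole = m.group()         # no argument: same as m.group(0), the entire match
user = m.group(1)         # one argument: a single string
pair = m.group(1, 2)      # several arguments: a tuple, one item per argument
unmatched = m.group(3)    # the group exists but matched nothing: None

try:
    m.group(4)            # the pattern defines only three groups
    out_of_range = False
except IndexError:
    out_of_range = True

# A group that matches several times keeps only its last match.
last_digit = re.search(r"(\d)+", "id 123").group(1)

print(whole, user, pair, unmatched, out_of_range, last_digit)
```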

     

    3. (?P<name>...)

    (?P<name>...) names a group: the group is called name, so it can be accessed with group('name'). In the program, for example:

    foundH1user.group("h1user")

    Here foundH1user is a MatchObject instance and h1user is the group name.

    The Python documentation describes the construct as follows:

    Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.
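
    As a sketch of the two properties stated above (a named group is also a numbered group, and a name may be defined only once), consider the Python 3 snippet below. The second group name, cls, is a hypothetical addition that does not appear in the original program.

```python
import re

html = '<h1 class="h1user">crifan</h1>'

# Two named groups: cls (hypothetical) is group 1, h1user is group 2.
m = re.search(r'<h1\s+class="(?P<cls>\w+)">(?P<h1user>.+?)</h1>', html)

print(m.group("h1user"), m.group(2))   # the same substring, by name and by number
print(m.groupdict())                   # all named groups as a dictionary

# Defining the same group name twice is rejected when the pattern is compiled.
try:
    re.compile(r"(?P<dup>a)(?P<dup>b)")
    duplicate_rejected = False
except re.error:
    duplicate_rejected = True

print("duplicate name rejected:", duplicate_rejected)
```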

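    4. Regular expression symbols used in the program

    Common metacharacters

    \s      matches any whitespace character

    .       matches any character except a newline

    Common quantifiers

    +       repeat one or more times

    ?       repeat zero or one time

    +?      repeat one or more times, but as few times as possible (non-greedy)

    In the program, "\s+?" is the non-greedy form of "\s+". Because it is immediately followed by the literal text class, both forms match exactly the same whitespace, so "\s+?" can be replaced by "\s+". "\s?" also matches this particular page, where a single space separates h1 from class, but it is not equivalent in general: it accepts at most one whitespace character and also accepts none.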
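
    To make the comparison of the quantifiers concrete, the Python 3 sketch below tries each whitespace sub-pattern on two variants of the heading and then contrasts ".+?" with ".+". The helper h1user and all sample strings are illustrative assumptions.

```python
import re


def h1user(ws, html):
    """Extract the heading text, using `ws` as the whitespace sub-pattern."""
    found = re.search(r'<h1' + ws + r'class="h1user">(?P<h1user>.+?)</h1>', html)
    return found.group("h1user") if found else None


one_space = '<h1 class="h1user">crifan</h1>'
three_spaces = '<h1   class="h1user">crifan</h1>'

# \s+? and \s+ behave identically here; \s? fails once there is more than one space.
for ws in (r"\s+?", r"\s+", r"\s?"):
    print(ws, "->", h1user(ws, one_space), "|", h1user(ws, three_spaces))

# Greedy versus non-greedy matters for the captured text itself.
two_headings = '<h1 class="h1user">crifan</h1><h1 class="h1user">other</h1>'
lazy = re.search(r'">(.+?)</h1>', two_headings).group(1)
greedy = re.search(r'">(.+)</h1>', two_headings).group(1)
print("lazy  :", lazy)
print("greedy:", greedy)
```

    The non-greedy ".+?" stops at the first closing tag, whereas the greedy ".+" runs on to the last one, which is why the program uses ".+?" for the captured group.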

     

    References:

    1. http://www.crifan.com/crawl_website_html_and_extract_info_using_python/

    2. https://docs.python.org/2/library/re.html#re.MatchObject

    3. http://deerchao.net/tutorials/regex/regex.htm

  • Original article: https://www.cnblogs.com/klchang/p/4508377.html