  • Python: A Small Web Crawler Example (55)

    import re
    from urllib.request import urlopen
    
    def getPage(url):
        response = urlopen(url)
        return response.read().decode('utf-8')
    
    def parsePage(s):
        ret = re.findall(
            r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
            r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>',s,re.S)
        return ret
    
    def main(num):
        url = 'https://movie.douban.com/top250?start=%s&filter=' % num
        response_html = getPage(url)
        ret = parsePage(response_html)
        print(ret)
    
    count = 0
    for i in range(10):   # 10 pages, 25 entries each
        main(count)
        count += 25
    
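    # urlopen downloads the page source from the URL
    # Decoding the bytes as UTF-8 gives the page content, which is the string to match against
    # ret = re.findall(pattern, string_to_match)  # ret is a list of everything that matched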
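The notes above say that `re.findall` returns a list of everything that matched. A minimal offline sketch of what that list looks like when the pattern has several groups, next to what `finditer` gives for the same pattern. `SAMPLE_HTML` is a hand-written, hypothetical snippet that only imitates one list entry; it is not real page data, and the real markup may differ.

```python
import re

# Hypothetical snippet imitating a single entry; not taken from the real page.
SAMPLE_HTML = """
<div class="item">
  <div class="pic"><em class="">1</em></div>
  <span class="title">Example Movie</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span>12345人评价</span>
</div>
"""

# Same pattern as in the scripts, written as raw strings so \d stays intact.
PATTERN = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?'
    r'<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>'
    r'.*?<span>(?P<comment_num>.*?)评价</span>',
    re.S,
)

# findall: one tuple per match, holding the groups in order (names are lost).
rows = PATTERN.findall(SAMPLE_HTML)

# finditer: one match object per match, so the named groups are still available.
record = next(PATTERN.finditer(SAMPLE_HTML)).groupdict()

print(rows)
print(record)
```

With several groups in the pattern, `findall` yields plain tuples, which is why the later version switches to `finditer` to build dictionaries keyed by group name.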
    
    # Second version: fetch with requests, parse with finditer, write one JSON object per line
    import requests
    
    import re
    import json
    
    
    def getPage(url):
        response = requests.get(url)
        return response.text
    
    
    def parsePage(s):
        com = re.compile(
            r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
            r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)
    
        ret = com.finditer(s)
        for i in ret:
            yield {
                "id": i.group("id"),
                "title": i.group("title"),
                "rating_num": i.group("rating_num"),
                "comment_num": i.group("comment_num"),
            }
    
    
    def main(num):
        url = 'https://movie.douban.com/top250?start=%s&filter=' % num
        response_html = getPage(url)
        ret = parsePage(response_html)
        # parsePage returns a generator, so iterate over it rather than printing it directly
        with open("move_info7", "a", encoding="utf8") as f:
            for obj in ret:
                print(obj)
                data = json.dumps(obj, ensure_ascii=False)
                f.write(data + "\n")
    
    
    if __name__ == '__main__':
        count = 0
        for i in range(10):
            main(count)
            count += 25
    
    # Third version: urlopen and findall again, writing each result tuple to a file
    import re
    from urllib.request import urlopen
    
    def getPage(url):
        response = urlopen(url)
        return response.read().decode('utf-8')
    
    def parsePage(s):
        ret = re.findall(
            r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
            r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>',s,re.S)
        return ret
    
    def main(num):
        url = 'https://movie.douban.com/top250?start=%s&filter=' % num
        response_html = getPage(url)
        ret = parsePage(response_html)
        #print(ret)
        # The with block closes the file even if writing fails
        with open("move_info", "a", encoding="utf8") as f:
            for obj in ret:
                print(obj)
                data = str(obj)
                f.write(data + "\n")
    
    count = 0
    for i in range(10):   # 10 pages, 25 entries each
        main(count)
        count += 25
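    The flags argument accepts many optional values:
    
    re.I (IGNORECASE): ignore case; the name in parentheses is the full spelling
    re.M (MULTILINE): multi-line mode; changes the behavior of ^ and $
    re.S (DOTALL): the dot matches any character, including a newline
    re.L (LOCALE): locale-aware matching; makes the special sequences \w, \W, \b, \B, \s, \S depend on the current locale; not recommended
    re.U (UNICODE): makes \w, \W, \s, \S, \d, \D follow the Unicode character properties; this flag is the default for str patterns in Python 3
    re.X (VERBOSE): verbose mode; the pattern string can span several lines, whitespace is ignored, and comments can be added
    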
  • Original article: https://www.cnblogs.com/LXL616/p/10721135.html