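  • Matching Maoyan Movie HTML with Python Regular Expressions

    A crawler project that scrapes movie information from the Maoyan TOP100 board.

    The project code comes from: https://github.com/Germey/MaoYan/blob/master/spider.py

    Because the crawler has to extract many fields (movie title, poster image, cast, release date, and so on), the regular expression is fairly complex.

    Calling print(html) inside parse_one_page(html) shows the HTML text the function receives. Each movie entry looks like this: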

    # Matched content was underlined in the original post; the # notes below name each captured field
    <dd>
        <i class="board-index board-index-1">1</i>    # movie rank (index)
        <a href="/films/1203" title="霸王别姬" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
            <img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
            <img data-src="http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c" alt="霸王别姬" class="board-img" />    # image
        </a>
        <div class="board-item-main">
            <div class="board-item-content">
                <div class="movie-item-info">
                    <p class="name"><a href="/films/1203" title="霸王别姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王别姬</a></p>    # title
                    <p class="star"> 主演:张国荣,张丰毅,巩俐 </p>    # actor
                    <p class="releasetime">上映时间:1993-01-01(中国香港)</p>    # time
                </div>
                <div class="movie-item-number score-num">
                    <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>    # integer and fraction parts of the score
                </div>
            </div>
        </div>
    </dd>

    The regular expression in detail

    pattern = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'    # matches the movie rank (index) and the poster URL (image)
    +
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'         # matches the movie title, the cast (actor) and the release date (time)
    +
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>'              # matches the integer and fraction parts of the score
    , re.S)
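Two details carry most of the weight in this pattern: the lazy quantifier `.*?` and the `re.S` flag. The sketch below illustrates both on a hypothetical two-item snippet reduced to the rank tag only. The snippet and the variable names are illustrative and are not part of the original spider.

```python
import re

# Hypothetical two-item snippet, reduced to the rank tag only.
html = (
    '<dd>\n'
    '  <i class="board-index board-index-1">1</i>\n'
    '</dd>\n'
    '<dd>\n'
    '  <i class="board-index board-index-2">2</i>\n'
    '</dd>\n'
)

rank = r'<dd>.*?board-index.*?>(\d+)</i>.*?</dd>'

# Without re.S, "." does not match "\n", so the pattern cannot span lines.
no_dotall = re.findall(rank, html)

# With re.S, "." also matches "\n"; each lazy ".*?" stops at the nearest match.
lazy = re.findall(rank, html, re.S)

# Greedy ".*" runs as far as possible and merges both items into one match.
greedy = re.findall(r'<dd>.*board-index.*>(\d+)</i>.*</dd>', html, re.S)

print(no_dotall)  # []
print(lazy)       # ['1', '2']
print(greedy)     # ['2']
```

The greedy variant returns a single result because the first `<dd>` is paired with the last `</dd>`, which is why every gap in the full pattern uses `.*?`.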

    The regular expression as used in the parsing function:

    import re

    def parse_one_page(html):
        # re.S lets "." match newlines, so one pattern can span a whole <dd> block
        pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                             + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                             + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
        items = re.findall(pattern, html)
        for item in items:
            yield {
                'index': item[0],
                'image': item[1],
                'title': item[2],
                'actor': item[3].strip()[3:],    # drop the 3-character "主演:" prefix
                'time': item[4].strip()[5:],     # drop the 5-character "上映时间:" prefix
                'score': item[5] + item[6]       # e.g. "9." + "6" gives "9.6"
            }
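As a quick check, the function can be run against a trimmed copy of the sample <dd> block shown earlier. This is only a minimal sketch: the sample string is shortened by hand (wrapper <div> elements and some attributes removed), and the block defines its own copy of parse_one_page, written with raw strings and `\d+` for the rank digits, so that it runs on its own.

```python
import re

def parse_one_page(html):
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    for item in pattern.findall(html):
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],   # strip whitespace, drop "主演:"
            'time': item[4].strip()[5:],    # strip whitespace, drop "上映时间:"
            'score': item[5] + item[6],     # "9." + "6"
        }

# Hand-trimmed sample entry (an assumption for illustration, not live page data).
sample = '''
<dd>
    <i class="board-index board-index-1">1</i>
    <a href="/films/1203" title="霸王别姬" class="image-link">
        <img data-src="http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c" alt="霸王别姬" class="board-img" />
    </a>
    <p class="name"><a href="/films/1203" title="霸王别姬">霸王别姬</a></p>
    <p class="star">
        主演:张国荣,张丰毅,巩俐
    </p>
    <p class="releasetime">上映时间:1993-01-01(中国香港)</p>
    <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>
'''

movie = next(parse_one_page(sample))
print(movie['index'], movie['title'], movie['actor'], movie['time'], movie['score'])
```

The slices `[3:]` and `[5:]` depend on the exact label text on the page, so they would need adjusting if Maoyan changed the "主演" or "上映时间" prefixes.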

    After a successful match, the output written to result.txt looks like this:

    {"title": "霸王别姬", "image": "http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c", "actor": "张国荣,张丰毅,巩俐", "time": "1993-01-01(中国香港)", "score": "9.6", "index": "1"}
    {"title": "肖申克的救赎", "image": "http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c", "actor": "蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿", "time": "1994-10-14(美国)", "score": "9.5", "index": "2"}
    {"title": "本杰明·巴顿奇事", "image": "http://p0.meituan.net/movie/48/2207789.jpg@160w_220h_1e_1c", "actor": "布拉德·皮特,凯特·布兰切特,塔拉吉·P·汉森", "time": "2008-12-25(美国)", "score": "8.8", "index": "71"}
    {"title": "哈利·波特与死亡圣器(下)", "image": "http://p0.meituan.net/movie/76/612928.jpg@160w_220h_1e_1c", "actor": "丹尼尔·雷德克里夫,鲁伯特·格林特,艾玛·沃森", "time": "2011-08-04", "score": "9.0", "index": "72"}
    {"title": "这个杀手不太冷", "image": "http://p0.meituan.net/movie/fc9d78dd2ce84d20e53b6d1ae2eea4fb1515304.jpg@160w_220h_1e_1c", "actor": "让·雷诺,加里·奥德曼,娜塔莉·波特曼", "time": "1994-09-14(法国)", "score": "9.5", "index": "3"}
    {"title": "大话西游之大圣娶亲", "image": "http://p0.meituan.net/movie/b429501a792ae227deaa16bc25c2e07a122042.jpg@160w_220h_1e_1c", "actor": "周星驰,朱茵,罗家英", "time": "2014-10-24", "score": "9.4", "index": "73"}
    {"title": "致命魔术", "image": "http://p0.meituan.net/movie/12/2130469.jpg@160w_220h_1e_1c", "actor": "休·杰克曼,克里斯蒂安·贝尔,迈克尔·凯恩", "time": "2006-10-20(美国)", "score": "8.8", "index": "61"}
    {"title": "罗马假日", "image": "http://p0.meituan.net/movie/23/6009725.jpg@160w_220h_1e_1c", "actor": "格利高利·派克,奥黛丽·赫本,埃迪·艾伯特", "time": "1953-09-02(美国)", "score": "9.1", "index": "4"}
    {"title": "阿甘正传", "image": "http://p0.meituan.net/movie/53/1541925.jpg@160w_220h_1e_1c", "actor": "汤姆·汉克斯,罗宾·怀特,加里·西尼斯", "time": "1994-07-06(美国)", "score": "9.4", "index": "5"}
    {"title": "十二怒汉", "image": "http://p0.meituan.net/movie/86/2992612.jpg@160w_220h_1e_1c", "actor": "亨利·方达,李·科布,马丁·鲍尔萨姆", "time": "1957-04-13(美国)", "score": "9.1", "index": "62"}
    {"title": "倩女幽魂", "image": "http://p0.meituan.net/movie/85/3966083.jpg@160w_220h_1e_1c", "actor": "张国荣,王祖贤,午马", "time": "2011-04-30", "score": "9.1", "index": "74"}
    # (remaining entries omitted)
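Each line of result.txt is one JSON object. The sketch below shows one way such a file could be written, assuming one `json.dumps` call per movie. The helper name `write_to_file`, the temporary path, and the sample item are illustrative assumptions and are not taken from this post.

```python
import json
import os
import tempfile

def write_to_file(item, path):
    # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes;
    # the file must then be opened with an explicit UTF-8 encoding.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

# Illustrative item; a real run would pass each dict yielded by parse_one_page.
item = {'index': '1', 'title': '霸王别姬', 'score': '9.6'}
path = os.path.join(tempfile.mkdtemp(), 'result.txt')
write_to_file(item, path)

with open(path, encoding='utf-8') as f:
    line = f.readline()
print(line, end='')
```

Opening the file in append mode means repeated runs add to the same file, so an old result.txt would need to be deleted before a fresh crawl.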
    GitHub: https://github.com/kumataahh
  • Original article: https://www.cnblogs.com/kumata/p/9078784.html