  • python+requests+re: scraping movie listings from Maoyan

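    Use python+requests to fetch the movies listed on Maoyan, then match the page with re regular expressions to extract each movie's rank, image URL, title, starring cast, release date, and score.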

    import requests
    import re, json
    
    
    def get_html(url):
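        """
        Fetch the HTML source of a page.
        :param url: address of the page to fetch
        :return: the response body as text
        """
        # Browser User-Agent string; the parentheses join the two literals into one
        user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36")
        headers = {
            "User-Agent": user_agent
        }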
        r = requests.get(url, headers=headers)  
        html = r.text
        # print(html)
        return html
    
    
    def parse_one_page(html):
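        """
        Extract the required fields with a regular expression.
        :param html: HTML source of one board page
        :return: generator of dicts, one per movie
        """
        # Capture groups: rank, image URL, title, starring cast, release date,
        # and the integer and fractional parts of the score
        pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                             r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                             r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)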
    
        items = re.findall(pattern, html)
    
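        for item in items:
            yield {
                # rank
                "排名": item[0],
                # image URL
                "图片地址": item[1],
                # title
                "片名": item[2],
                # starring cast; [3:] drops the leading 3-character "主演：" label
                "主演": item[3].strip()[3:],
                # release date; [5:] drops the leading "上映时间：" label,
                # assumed to be 5 characters including the colon
                "上映时间": item[4].strip()[5:],
                # score: integer part + fractional part
                "分数": item[5] + item[6]
            }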
    
    
    # Data storage
    
    def write_file(content):
        # Append one JSON object per line; ensure_ascii=False keeps Chinese text readable
        with open("result.txt", 'a+', encoding='utf-8') as f:
            f.write(json.dumps(content, ensure_ascii=False) + "\n")
    
    
    def main():
        """
        Main entry point: fetch the board page, then print and save each movie.
        :return:
        """
        url = "http://maoyan.com/board/4"
        html = get_html(url)
        for item in parse_one_page(html):
            print(item)
            write_file(item)
    
    
    if __name__ == '__main__':
        main()
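
The regular expression can be tried without touching the network. The sketch below runs the same pattern against a hand-written `<dd>` fragment. The fragment, its placeholder values, the English field names, and the helpers `strip_label` and `parse_entries` are all assumptions made for illustration; the live page's markup may differ. Instead of slicing by a fixed character count, it removes the "主演：" and "上映时间：" labels by splitting on the full-width colon.

```python
import re

# Hypothetical <dd> entry with placeholder values. It only mirrors the class
# names the pattern relies on and is not copied from the real page.
SAMPLE_HTML = """
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/0" class="image-link">
    <img data-src="https://example.com/poster.jpg" class="board-img" />
  </a>
  <p class="name"><a href="/films/0">Example Movie</a></p>
  <p class="star">
    主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：2000-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
"""

# Raw strings keep the backslash in \d intact. re.S lets "." match newlines,
# which is needed because one entry spans several lines.
ENTRY_PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,
)


def strip_label(text):
    """Drop a leading 'label：' prefix by splitting on the full-width colon."""
    return text.strip().split("：", 1)[-1]


def parse_entries(html):
    """Return one dict per <dd> entry found in html."""
    return [
        {
            "rank": rank,
            "image": image,
            "title": title,
            "stars": strip_label(stars),
            "release_date": strip_label(release),
            "score": integer + fraction,
        }
        for rank, image, title, stars, release, integer, fraction
        in ENTRY_PATTERN.findall(html)
    ]


entries = parse_entries(SAMPLE_HTML)
print(entries[0])
```

Splitting on the colon still works if a label's length changes, and it returns the text unchanged when no label is present.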
  • Original article: https://www.cnblogs.com/CesareZhang/p/11027772.html