  • pyppeteer crawler example

    If you run this on CentOS, install the following dependencies first:

    yum install pango.x86_64 libXcomposite.x86_64 libXcursor.x86_64 libXdamage.x86_64 libXext.x86_64 libXi.x86_64 libXtst.x86_64 cups-libs.x86_64 libXScrnSaver.x86_64 libXrandr.x86_64 GConf2.x86_64 alsa-lib.x86_64 atk.x86_64 gtk3.x86_64 -y
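
    If Chromium still fails to start after the install, one way to see which shared libraries are missing is to run ldd against the Chromium binary and look for "not found" entries. The following is only a rough sketch: the helper names missing_libs and check_binary are hypothetical, the binary path is something you supply yourself, and the sample text is illustrative rather than real output from any machine.

```python
import subprocess


def missing_libs(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # the library name is the part before the "=>" arrow
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    # path: the Chromium binary you intend to launch (supplied by the caller)
    result = subprocess.run(["ldd", path], capture_output=True, text=True, check=False)
    return missing_libs(result.stdout)


# illustrative ldd-style text, not captured from a real system
sample = """\
    libX11.so.6 => /lib64/libX11.so.6 (0x00007f1c2a000000)
    libXcomposite.so.1 => not found
    libatk-1.0.so.0 => not found
"""
print(missing_libs(sample))
```

    Each name in the returned list can then be matched to the package that provides it and added to the yum command.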
    

    Then run the following code:

    import asyncio
    import pyppeteer
    from collections import namedtuple
    
    Response = namedtuple("Response", "title url html cookies headers history status")
    
    
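    async def get_html(url, timeout=30):
        # timeout is given in seconds (default 30); pyppeteer expects milliseconds
        browser = await pyppeteer.launch(headless=True, args=['--no-sandbox'])
        try:
            page = await browser.newPage()
            res = await page.goto(url, options={'timeout': int(timeout * 1000)})
            data = await page.content()
            title = await page.title()
            resp_cookies = await page.cookies()
            # goto() may return None (e.g. for about:blank), so guard the lookups
            resp_headers = res.headers if res else None
            resp_history = None  # redirect history is not collected here
            resp_status = res.status if res else None
            return Response(title=title, url=url,
                            html=data,
                            cookies=resp_cookies,
                            headers=resp_headers,
                            history=resp_history,
                            status=resp_status)
        finally:
            # always close the browser so Chromium processes do not pile up
            await browser.close()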
    
    
    async def main(url_list):
        # fetch all pages concurrently; each get_html call launches its own browser
        return await asyncio.gather(*(get_html(url) for url in url_list))
    
    
    if __name__ == '__main__':
        url_list = ["http://www.10086.cn/index/tj/index_220_220.html", "http://www.10010.com/net5/011/",
                    "http://python.jobbole.com/87541/"]
        # asyncio.run creates and closes the event loop for us
        results = asyncio.run(main(url_list))
        for res in results:
            print(res.title)
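
    The example above starts one headless browser per URL, all at the same time, which can get heavy with a long URL list. A possible way to cap the number of simultaneous fetches is an asyncio.Semaphore. This is a minimal sketch, not part of the original code: gather_limited and fake_fetch are hypothetical names, and fake_fetch only stands in for get_html so that the sketch runs without a browser.

```python
import asyncio


async def gather_limited(fetch, urls, limit=2):
    """Run fetch(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(url):
        async with semaphore:
            try:
                return await fetch(url)
            except Exception as exc:
                # return the error so one bad URL does not discard the other results
                return exc

    # gather() preserves input order, so results line up with urls
    return await asyncio.gather(*(worker(url) for url in urls))


async def fake_fetch(url):
    # hypothetical stand-in for get_html
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError(url)
    return url.upper()


if __name__ == "__main__":
    print(asyncio.run(gather_limited(fake_fetch, ["a", "bad", "c"])))
```

    With the crawler above, the call would look like gather_limited(get_html, url_list, limit=2); failed URLs come back as exception objects in the result list, so check each entry before reading res.title.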
    
    
  • Original article: https://www.cnblogs.com/c-x-a/p/10001353.html