  • Python Crawlers: Writing Your First Crawler

    A crawler's job comes down to two things: requesting web pages and parsing them to extract information.

    The three main crawler libraries: Requests, Lxml, and BeautifulSoup.

    The Requests library: requests a website and fetches its page data.

    import requests

    # Request header; the User-Agent makes the request look like a normal browser
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"}
    try:
        res = requests.get("http://bj.xiaozhu.com/", headers=headers)
        print(res)        # <Response [200]> means the request succeeded
        print(res.text)   # the raw HTML source of the page
    except requests.exceptions.ConnectionError:
        print("Connection refused")

    Here <Response [200]> indicates that the page was requested successfully.
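
    If you prefer a numeric check, the status code is also available directly. A minimal sketch, reusing the res object from the example above:

    if res.status_code == 200:
        print("Request succeeded")
    else:
        print("Request failed with status code", res.status_code)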

    You can look up your browser's User-Agent at http://www.user-agent.cn/

    The request header:
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"}
    Pass the request header to the get() method:
    res = requests.get("http://bj.xiaozhu.com/", headers=headers)

    The post() method is used to submit forms, so you can crawl sites whose data is only available after logging in. A minimal sketch is shown below.
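
    In this sketch the URL and the form field names (username, password) are placeholders for illustration, not a real site:

    import requests

    # Placeholder URL and form field names -- replace them with the target site's real login form
    login_url = "http://example.com/login"
    form_data = {"username": "your_name", "password": "your_password"}

    session = requests.Session()            # a Session keeps the login cookies for later requests
    res = session.post(login_url, data=form_data)
    print(res.status_code)                  # 200 usually means the form was accepted
    # pages behind the login can now be fetched with session.get(...)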

    The BeautifulSoup library: easily parses the pages fetched with the Requests library into a Soup document, so the data you need can be filtered out and extracted.

    import requests
    from bs4 import BeautifulSoup

    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"}
    res = requests.get("http://bj.xiaozhu.com/", headers=headers)
    # Parse the page source into a Soup document with the built-in html.parser
    soup = BeautifulSoup(res.text, 'html.parser')
    print(soup.prettify())    # print the parsed document with indentation

    Pros and cons of BeautifulSoup's main parsers: Python's built-in html.parser needs no extra install and is reasonably fast; lxml is the fastest but requires the lxml package; html5lib is the most tolerant of broken HTML but the slowest.
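
    The parser is chosen by the second argument to BeautifulSoup. A minimal sketch; the second call assumes the lxml package is installed:

    from bs4 import BeautifulSoup

    html = "<ul><li>first<li>second</ul>"              # deliberately sloppy HTML
    # Built-in parser: no extra dependency, moderate speed
    print(BeautifulSoup(html, "html.parser").prettify())
    # lxml parser: faster, but needs "pip install lxml"
    print(BeautifulSoup(html, "lxml").prettify())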

    A Soup document can be searched with find(), find_all(), and select() to locate the elements you need, as in the sketch below.
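
    A minimal sketch of the three methods; the HTML snippet here is made up purely for illustration:

    from bs4 import BeautifulSoup

    html = '<div><span class="price">300</span><span class="price">450</span></div>'
    soup = BeautifulSoup(html, "html.parser")

    print(soup.find("span"))                      # first matching tag only
    print(soup.find_all("span", class_="price"))  # list of every matching tag
    print(soup.select("div > span.price"))        # CSS selector syntax, also returns a list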

    Example of locating elements in the Soup document of a live page:

    import requests
    from bs4 import BeautifulSoup

    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"}
    try:
        res = requests.get("http://bj.xiaozhu.com/", headers=headers)
        soup = BeautifulSoup(res.text, 'html.parser')
        # CSS selector copied from the browser's developer tools, generalized to every <li> item
        price = soup.select("#page_list > ul > li > div.result_btm_con.lodgeunitname > div > span > i")
        print(price)    # a list of <i> tags containing the nightly prices
    except requests.exceptions.ConnectionError:
        print("Connection refused")

    Note that the li:nth-child(1) part of a selector copied from the browser will raise an error when run in Python; change it to li:nth-of-type(1).

    You can also use the get_text() method to extract just the text inside the tags.
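
    For example, looping over the price list returned by select() above and keeping only the text (a minimal sketch):

    # each element returned by select() is a Tag; get_text() strips the surrounding markup
    for item in price:
        print(item.get_text())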
