  • Python web scraping with requests + selenium + BeautifulSoup

    Preface:

    • Environment: 64-bit Windows, Python 3.4
    • Basics of the requests library:

    1. Install: pip install requests

    2. What it does: requests sends network requests, letting you issue the same kinds of HTTP requests a browser would and retrieve a site's data.

    3. Common operations:

    import requests  # import the requests module

    r = requests.get("https://api.github.com/events")  # fetch a page

    # Set a timeout: stop waiting after `timeout` seconds. Note that if the
    # server does not respond in time, requests raises
    # requests.exceptions.Timeout rather than returning a response.
    r2 = requests.get("https://api.github.com/events", timeout=0.001)

    payload = {'key1': 'value1', 'key2': 'value2'}
    r1 = requests.get("http://httpbin.org/get", params=payload)  # pass query-string parameters

    print(r.url)  # print the final URL

    print(r.text)  # the response body, decoded to text

    print(r.encoding)  # the encoding used to decode r.text

    print(r.content)  # the response body as raw bytes

    print(r.status_code)  # the HTTP status code
    print(r.status_code == requests.codes.ok)  # compare against the built-in status-code lookup

    print(r.headers)  # the response headers as a dict-like object
    print(r.headers['content-type'])  # header lookup is case-insensitive

    print(r.history)  # a list of Response objects from any redirects followed

    print(type(r))  # the type of the response object
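    As noted above, a too-small timeout (such as the 0.001 seconds in the example) does not return an empty response; it raises an exception. A minimal sketch of catching it, using the same illustrative URL:

    ```python
    import requests

    try:
        resp = requests.get("https://api.github.com/events", timeout=0.001)
        status = resp.status_code
    except requests.exceptions.Timeout:
        status = None  # the server did not respond within the timeout
    except requests.exceptions.RequestException:
        status = None  # any other network failure (DNS error, refused connection, ...)

    print(status)  # an int status code on success, None on failure
    ```

    Catching the broader RequestException last keeps the script alive on any network failure, not just timeouts.
    
    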
    • Basics of the BeautifulSoup4 library:

    1. Install: pip install BeautifulSoup4

    2. What it does: Beautiful Soup is a Python library for extracting data from HTML and XML files.

    3. Common operations:

    import requests
    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    ss = BeautifulSoup(html_doc, "html.parser")
    print(ss.prettify())         # pretty-print the document with standard indentation
    print(ss.title)              # <title>The Dormouse's story</title>
    print(ss.title.name)         # title
    print(ss.title.string)       # The Dormouse's story
    print(ss.title.parent.name)  # head
    print(ss.p)                  # <p class="title"><b>The Dormouse's story</b></p>
    print(ss.p['class'])         # ['title']
    print(ss.a)                  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    print(ss.find_all("a"))      # a list of all <a> tags
    print(ss.find(id="link3"))   # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    for link in ss.find_all("a"):
        print(link.get("href"))  # print the link of every <a> tag in the document

    print(ss.get_text())         # all text content in the document
    import requests
    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    soup = BeautifulSoup(html_doc, 'html.parser')        # build a BeautifulSoup object
    find = soup.find('p')                                # find the first <p> tag
    print("find's return type is ", type(find))          # the return value's type
    print("find's content is", find)                     # what find returned
    print("find's Tag Name is ", find.name)              # the tag's name
    print("find's Attribute(class) is ", find['class'])  # the tag's class attribute value

    print(find.string)  # the text inside the tag

    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    soup1 = BeautifulSoup(markup, "html.parser")
    comment = soup1.b.string
    print(type(comment))  # a Comment object: the content of the HTML comment
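    Besides find and find_all, BeautifulSoup also accepts CSS selectors via select() and select_one(). A short sketch on the same three-sisters markup:

    ```python
    from bs4 import BeautifulSoup

    html_doc = """
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    """

    soup = BeautifulSoup(html_doc, "html.parser")

    # select() takes a CSS selector and returns a list of matching tags
    links = [a["href"] for a in soup.select("p.story a.sister")]
    print(links)  # ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']

    # #id selectors work too; select_one returns the first match or None
    tillie = soup.select_one("#link3")
    print(tillie.string)  # Tillie
    ```

    CSS selectors are often more compact than nested find_all calls when the target is identified by class or id.
    
    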
    • Trying it out:
    import requests
    import io
    import sys
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # change stdout's default encoding

    r = requests.get('https://unsplash.com')  # send a GET request to the target URL; returns a Response object

    print(r.text)  # r.text is the HTML of the HTTP response
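    In practice the two libraries are combined: requests fetches the page and BeautifulSoup parses it. A hedged sketch of that pipeline (the helper name is illustrative, not from the original post):

    ```python
    import requests
    from bs4 import BeautifulSoup

    def page_title(url):
        """Fetch `url` and return the text of its <title> tag, or None on failure."""
        try:
            r = requests.get(url, timeout=5)
            r.raise_for_status()  # turn 4xx/5xx responses into exceptions
        except requests.exceptions.RequestException:
            return None  # timeout, DNS failure, HTTP error, ...
        soup = BeautifulSoup(r.text, "html.parser")
        return soup.title.string if soup.title else None

    print(page_title("https://unsplash.com"))
    ```

    Returning None on any network or HTTP error keeps a crawling loop running instead of crashing on the first bad URL.
    
    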

    References:

    https://blog.csdn.net/u012662731/article/details/78537432

    http://www.cnblogs.com/Albert-Lee/p/6276847.html

    https://blog.csdn.net/enohtzvqijxo00atz3y8/article/details/78748531

  • Original post: https://www.cnblogs.com/sunshine-blog/p/9268906.html