  • Simple data fetching with the requests module

    I. Basic steps for using requests:

    • Specify the URL
    • Send the request
    • Extract the data from the response object
    • Persist the data to storage
    import requests

    # 1. Specify the URL
    url = 'https://www.sogou.com/'
    # 2. Send the request
    response = requests.get(url=url)
    # 3. Extract the data from the response object
    page_text = response.text
    # 4. Persist it to storage
    with open('./sogou.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
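
    A note on step 3: response.text is the decoded string (using response.encoding, which requests guesses from the response headers), while response.content is the raw bytes. A small sketch of checking the response before saving it; overriding the encoding is only needed if the detected charset is wrong:

    import requests

    response = requests.get('https://www.sogou.com/')
    print(response.status_code)   # 200 on success
    print(response.encoding)      # charset guessed from the response headers
    # response.text decodes response.content with response.encoding;
    # override it if the saved page comes out garbled
    response.encoding = 'utf-8'
    page_text = response.text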

    II. Crawling Sogou search results for a given keyword

    import requests

    url = 'https://www.sogou.com/web'
    wd = input("Enter a search keyword: ")
    # query parameters are appended to the URL by requests
    param = {
        'query': wd
    }

    # use .text so we get a decoded string that can be written with encoding='utf8'
    page_text = requests.get(url=url, params=param).text
    filename = wd + '.html'
    with open(filename, 'w', encoding='utf8') as f1:
        f1.write(page_text)
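
    requests URL-encodes the params dict and appends it to the URL as a query string. A quick, purely illustrative sketch for confirming the final URL that was actually requested:

    import requests

    resp = requests.get('https://www.sogou.com/web', params={'query': '爬虫'})
    # the keyword is percent-encoded into the query string
    print(resp.url)   # e.g. https://www.sogou.com/web?query=%E7%88%AC%E8%99%AB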

    III. Ajax requests

    Use packet capture (the browser's network tools) to find the parameters that the request carries.

    For example, with paginated data, clicking "next page" sends an Ajax request while the page URL itself does not change; the URL of that Ajax request can be captured from the network panel. Here we define the request parameters in param, dynamically specifying the page number and the number of items per page, and the Ajax request returns a set of JSON data.

    We store the ID of each record from every page, then build new_url to fetch the detail information for each ID.

    import requests

    # URL of the Ajax endpoint captured from the browser's network panel
    url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
    }
    # page and pageSize can be changed to fetch other pages
    param = {
        "on": "true",
        "page": 1,
        "pageSize": "15",
        "productName": "",
        "conditionType": "1",
        "applyname": "",
        "applysn": "",
    }
    id_list = []
    json_object = requests.post(url=url, headers=headers, params=param).json()
    print(json_object['list'])
    for i in json_object['list']:
        id_list.append(i['ID'])

    # second Ajax endpoint: fetch the detail record for each ID
    new_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    filename = 'yaojians.txt'
    f = open(filename, 'w', encoding='utf8')
    for id in id_list:
        param = {
            'id': id
        }
        content_json = requests.post(url=new_url, params=param, headers=headers).json()
        f.write(str(content_json) + '\n')
    f.close()
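
    str(content_json) writes the dict's Python repr rather than valid JSON. A minimal alternative sketch (illustrative only, with a made-up record standing in for one content_json result) serializes each record with the standard json module; ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes:

    import json

    # hypothetical record standing in for one content_json result
    record = {'id': 1, 'epsName': '测试企业'}
    with open('yaojians.jsonl', 'w', encoding='utf8') as f:
        # ensure_ascii=False keeps Chinese characters readable in the output file
        f.write(json.dumps(record, ensure_ascii=False) + '\n')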

    IV. Scraping data with BeautifulSoup

    bs4 parsing:
      pip install bs4
      pip install lxml

    Parsing workflow:
      1. Load the source code to be parsed into a bs object.

      2. Call the relevant methods or attributes on the bs object to locate the target tags in the source.

      3. Extract the text or attribute values contained in the located tags (a short sketch of these three steps follows the list).
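
    A minimal sketch of the three steps on an inline HTML string (the snippet below is made up for illustration):

    from bs4 import BeautifulSoup

    html = '<div class="book-mulu"><ul><li><a href="/book/1.html">第一回</a></li></ul></div>'
    # 1. load the source into a bs object
    soup = BeautifulSoup(html, 'lxml')
    # 2. locate the tag with a CSS selector (soup.find / soup.find_all also work)
    a_tag = soup.select('.book-mulu > ul > li > a')[0]
    # 3. extract the text and attribute values from the located tag
    print(a_tag.string)    # 第一回
    print(a_tag['href'])   # /book/1.html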

    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    }
    res = requests.get(url=url, headers=headers).text
    soup = BeautifulSoup(res, 'lxml')
    # locate every chapter link in the table of contents
    a_tags_list = soup.select('.book-mulu > ul > li > a')
    filename = 'sanguo.txt'
    fp = open(filename, 'w', encoding='utf-8')
    for a_tag in a_tags_list:
        title = a_tag.string
        detail_url = "http://www.shicimingju.com" + a_tag["href"]
        detail_content = requests.get(url=detail_url, headers=headers).text
        soup = BeautifulSoup(detail_content, "lxml")
        # the chapter body lives in a div with class "chapter_content"
        detail_text = soup.find('div', class_="chapter_content").text
        fp.write(title + '\n' + detail_text)
        print(title, 'downloaded')
    print('over')
    fp.close()

    V. Simple image scraping with regular expressions

    import os
    import re
    import requests

    url = 'https://www.qiushibaike.com/pic/page/%d/?s=5170552'
    start_page = int(input("Enter the start page: "))
    end_page = int(input("Enter the end page: "))
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
    }
    # create the output directory once, before the loop
    if not os.path.exists('qiutu'):
        os.mkdir('qiutu')
    for page in range(start_page, end_page + 1):
        new_url = url % page
        response = requests.get(url=new_url, headers=headers).text
        # image URLs on the current page
        images_url = re.findall('<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>', response, re.S)

        for image_url in images_url:
            detail_url = 'http:' + image_url
            # fetch the raw bytes of the image
            content = requests.get(url=detail_url, headers=headers).content
            # use the last segment of the URL path as the file name
            image_name = image_url.split('/')[-1]
            with open('./qiutu/' + image_name, 'wb') as f1:
                f1.write(content)
    print('over')
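
    The re.S (DOTALL) flag is what lets .*? match across line breaks, since the <div class="thumb"> block in the page source spans several lines. A tiny illustration on a made-up two-line string:

    import re

    # made-up snippet spanning multiple lines, mimicking the page structure
    html = '<div class="thumb">\n<img src="//pic.example.com/a.jpg" alt="pic">\n</div>'
    pattern = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    print(re.findall(pattern, html))        # [] -- '.' does not cross '\n' without re.S
    print(re.findall(pattern, html, re.S))  # ['//pic.example.com/a.jpg']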


