  • Simple web scraper examples

    Code tool: Jupyter

    Packet capture tool: Fiddler

    1: Scraping the Sogou home page

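    import requests

    url = 'https://www.sogou.com/'
    # Fetch the Sogou home page with a plain GET request
    response = requests.get(url=url)
    # response.text is the response body decoded to a str
    text = response.text
    text
    Sogou content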
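`response.text` is the body decoded to `str`, so what you see depends on which encoding is applied. The following is a minimal offline sketch, assuming a UTF-8 page whose headers do not declare a charset (in that case `requests` may fall back to ISO-8859-1 for text responses). The sample bytes are made up for illustration and no network call is made.

```python
# Hypothetical body bytes standing in for response.content
raw = '<title>搜狗搜索</title>'.encode('utf-8')

# Decoding with the wrong charset produces mojibake
wrong = raw.decode('iso-8859-1')
# Decoding with the page's actual charset recovers the text
right = raw.decode('utf-8')

print(wrong)
print(right)
```

With a real response, setting `response.encoding = 'utf-8'` (or `response.encoding = response.apparent_encoding`) before reading `response.text` should have the same effect.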

    2: Scraping Douban movies by genre

    import requests

    url = 'https://movie.douban.com/j/new_search_subjects'
    # Query-string parameters, as captured from the browser request
    param = {
        'sort': 'U',
        'range': '0,10',
        'tags': '',
        'start': '0',
        'genres': '爱情'  # genre filter: romance
    }
    # Send a browser User-Agent so the request is not rejected as a bot
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    }
    response = requests.get(
        url=url,
        headers=headers,
        params=param,
    )
    # The endpoint returns JSON, so parse it directly
    text = response.json()
    text
    Douban movies
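The `params` dict above ends up URL-encoded in the query string, and `start` acts as the offset of the first result. A rough sketch of how further pages could be requested, using only the standard library to show the encoded URLs. The helpers `build_params` and `page_urls` are hypothetical, and `page_size=20` is an assumption about how many items the endpoint returns per request.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/j/new_search_subjects'

def build_params(genre, start):
    """Return the query parameters for one page of results (hypothetical helper)."""
    return {
        'sort': 'U',
        'range': '0,10',
        'tags': '',
        'start': str(start),
        'genres': genre,
    }

def page_urls(genre, pages, page_size=20):
    """Build the URLs for the first `pages` pages; page_size=20 is an assumption."""
    return [
        BASE_URL + '?' + urlencode(build_params(genre, page * page_size))
        for page in range(pages)
    ]

urls = page_urls('爱情', pages=2)
for u in urls:
    print(u)
```

In practice you would pass `build_params(...)` straight to `requests.get(..., params=...)` and let `requests` do the encoding; the explicit URLs here only make the encoding visible.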

    3: Scraping the results for a search term and writing them to a file

    import requests

    url = 'https://www.sogou.com/web'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    }
    # The search term is passed as the 'query' parameter
    param = {
        'query': '校花'
    }
    response = requests.get(
        url=url,
        headers=headers,
        params=param
    )
    # response.content is the raw body as bytes, so open the file in binary mode
    text = response.content
    with open('xh.html', 'wb') as f:
        f.write(text)
    Scrape and write to a file
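The example above writes `response.content` (bytes) to a file opened with `'wb'`, which stores the page exactly as the server sent it. A small offline sketch of that round trip follows; the payload and the temporary path are made up for illustration, and no request is sent.

```python
import os
import tempfile

# Hypothetical payload standing in for response.content (bytes)
body = '<html>校花</html>'.encode('utf-8')

path = os.path.join(tempfile.mkdtemp(), 'xh.html')

# Bytes go to a file opened in binary mode, unchanged
with open(path, 'wb') as f:
    f.write(body)

# Reading back in binary mode returns the same bytes
with open(path, 'rb') as f:
    saved = f.read()

print(saved == body)
```

If you wanted to save `response.text` instead, the file would need to be opened in text mode with an explicit `encoding`, since a `str` cannot be written to a binary file.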

    4: Scraping the National Medical Products Administration (国家药监总局) site: fetching dynamically generated content

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    }
    # The page loads its list through this POST endpoint (found with the packet capture tool)
    url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'

    # Form data for the list request
    data = {
        'on': 'true',
        'page': '3',
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': ''
    }
    response = requests.post(url=url, headers=headers, data=data)
    conn_list = response.json()['list']
    comm_info_list = []
    for i in conn_list:
        # Each list entry's detail record comes from a second POST endpoint, keyed by ID
        url_c = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById"
        data_c = {"id": ""}
        if i["XC_DATE"]:
            data_c["id"] = i["ID"]
            res = requests.post(url=url_c, data=data_c, headers=headers)
            comm_info_list.append(res.json())
    comm_info_list
    How to scrape dynamically generated data
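The example above follows a two-step pattern: request the list, then request one detail record per list entry. The sketch below isolates that loop so it can be run without the network. `collect_details` and `fake_fetch_detail` are hypothetical names, and the list data is fabricated purely to exercise the logic; only the `ID` and `XC_DATE` keys come from the code above.

```python
def collect_details(items, fetch_detail):
    """Fetch the detail record for every list item that has a non-empty XC_DATE."""
    details = []
    for item in items:
        if item.get('XC_DATE'):
            details.append(fetch_detail(item['ID']))
    return details

# Fake list data and a fake detail fetcher, used in place of the two POST requests
fake_list = [
    {'ID': 'a1', 'XC_DATE': '2023-01-01'},
    {'ID': 'b2', 'XC_DATE': ''},
    {'ID': 'c3', 'XC_DATE': '2024-06-30'},
]

def fake_fetch_detail(record_id):
    return {'id': record_id, 'detail': 'placeholder'}

result = collect_details(fake_list, fake_fetch_detail)
print(result)
```

In a real run, `fetch_detail` would wrap the second `requests.post` call with `data={'id': record_id}` and return `res.json()`. Note that this sketch uses `item.get('XC_DATE')`, so a missing key is skipped rather than raising `KeyError` as the indexing in the original loop would.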
  • Original article: https://www.cnblogs.com/duanhaoxin/p/10098639.html