  • Simple web scraper examples

    Code tool: Jupyter

    Packet capture tool: Fiddler

    1: Scraping the Sogou home page

    import requests

    # Fetch the Sogou home page with a plain GET request
    url = 'https://www.sogou.com/'
    response = requests.get(url=url)
    # response.text is the response body decoded to a str
    text = response.text
    text

    2: Scraping Douban movies by category

    import requests

    url = 'https://movie.douban.com/j/new_search_subjects'
    # Query-string parameters for the search endpoint
    param = {
        'sort': 'U',
        'range': '0,10',
        'tags': '',
        'start': '0',
        'genres': '爱情',  # genre filter; 爱情 means "romance"
    }
    # Browser-like User-Agent header
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers, params=param)
    # The endpoint returns JSON; parse it into Python objects
    text = response.json()
    text
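To see what the `params` argument does, here is a minimal sketch that builds the final URL by hand with the standard library. It assumes `requests` percent-encodes the dictionary in roughly the same way; the helper name `build_query_url` is made up for illustration and is not part of `requests`.

```python
from urllib.parse import urlencode


def build_query_url(base_url, params):
    """Append params to base_url as a percent-encoded query string."""
    return base_url + '?' + urlencode(params)


# Same parameters as in the Douban example above
param = {
    'sort': 'U',
    'range': '0,10',
    'tags': '',
    'start': '0',
    'genres': '爱情',
}
full_url = build_query_url('https://movie.douban.com/j/new_search_subjects', param)
print(full_url)
```

The comma and the Chinese characters come out percent-encoded (`%2C`, `%E7%88%B1%E6%83%85`). After a real request, `response.url` shows the URL that was actually sent, which is a quick way to compare against the one captured in Fiddler.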

    3: Scraping search results for a query term and writing them to a file

    import requests

    url = 'https://www.sogou.com/web'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    }
    # The search term goes in the "query" parameter
    param = {
        'query': '校花'
    }
    response = requests.get(url=url, headers=headers, params=param)
    # response.content is the raw body as bytes, so open the file in binary mode
    text = response.content
    with open('xh.html', 'wb') as f:
        f.write(text)
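The example writes `response.content` (bytes) rather than `response.text` (a decoded str), which is why the file is opened with `'wb'`. The sketch below shows the same round trip without a network call. The `payload` value is a made-up stand-in for `response.content`, and the temporary directory is only there so the example does not touch the working directory.

```python
import os
import tempfile

# Hypothetical stand-in for response.content: the page body as raw bytes
payload = '<html><body>校花</body></html>'.encode('utf-8')

# Write the bytes unchanged, as the example above does with 'wb'
path = os.path.join(tempfile.mkdtemp(), 'xh.html')
with open(path, 'wb') as f:
    f.write(payload)

# Read the file back: the bytes on disk are exactly what was received
with open(path, 'rb') as f:
    saved = f.read()

# Decoding is a separate step, done only when a str is needed
decoded = saved.decode('utf-8')
```

Saving the bytes as-is keeps the page in whatever encoding the server sent, so no guess about the encoding is needed at write time.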

    4: Scraping the national drug administration site (dynamically generated content)

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    }
    # List endpoint: POST the form data to get one page of records as JSON
    url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
    data = {
        'on': 'true',
        'page': '3',
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': ''
    }
    response = requests.post(url=url, headers=headers, data=data)
    conn_list = response.json()['list']

    # Detail endpoint: look up each record by its ID
    url_c = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    comm_info_list = []
    for i in conn_list:
        # Only fetch details for records with a non-empty XC_DATE
        if i['XC_DATE']:
            data_c = {'id': i['ID']}
            res = requests.post(url=url_c, data=data_c, headers=headers)
            comm_info_list.append(res.json())
    comm_info_list
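The example above works in two steps: fetch a page of list records, then request the detail record for each ID. A minimal sketch of the data-preparation side, without any network calls, might look like this. The helper names `build_list_form` and `detail_forms` and the sample records are made up for illustration. The sketch assumes that changing only the `page` field selects another page, and it uses `.get` so a record missing `XC_DATE` is skipped instead of raising `KeyError` as the original indexing would.

```python
def build_list_form(page, page_size=15):
    """Form data for one page of the list endpoint (same fields as above)."""
    return {
        'on': 'true',
        'page': str(page),
        'pageSize': str(page_size),
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': '',
    }


def detail_forms(records):
    """One {'id': ...} form per list record that has a non-empty XC_DATE."""
    return [{'id': r['ID']} for r in records if r.get('XC_DATE')]


# Hypothetical records shaped like the entries of response.json()['list']
sample = [
    {'ID': 'abc123', 'XC_DATE': '2018-12-01'},
    {'ID': 'def456', 'XC_DATE': ''},
]
forms = detail_forms(sample)

# Form data for pages 1 to 3 of the list endpoint
page_forms = [build_list_form(p) for p in range(1, 4)]
```

Each dictionary in `page_forms` could be passed as `data=` to `requests.post` on the list URL, and each one in `forms` as `data=` on the detail URL.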
  • Original article: https://www.cnblogs.com/duanhaoxin/p/10098639.html