  • Web crawlers

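    1. First, download and install Anaconda3.

    I. Basic usage

    In a newly created project, every module (cell) that has been run is kept in memory, so its names can be used from anywhere regardless of where it sits on the page. Each module must be run once, however, before it is loaded into memory.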

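    Keyboard shortcuts:

    • Insert a cell: a (above), b (below)
    • Delete (cut) a cell: x
    • Run a cell: shift+enter
    • tab: code completion
    • Switch cell mode: y (Markdown -> code), m (code -> Markdown)
    • shift+tab: open the help documentation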

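    What is a crawler?

    A crawler is a program that imitates a browser going online and then fetches data from the internet.

    Types of crawler:

    General-purpose crawler: fetches the full content of whole pages.

    Focused crawler: fetches only specific content on a page, such as a particular keyword or tag.

    Incremental crawler: for data that keeps changing, repeatedly fetches only what has been updated since the last run.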
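As a rough illustration of the incremental idea, the sketch below remembers a fingerprint of every item it has already seen and returns only the new ones on each pass. The helper name `filter_new`, the SHA-256 fingerprint, and the in-memory `set` are assumptions made for this example; the post itself does not say how seen items should be tracked.

```python
import hashlib


def filter_new(items, seen):
    """Return the items whose fingerprint is not in `seen` yet, and record them."""
    fresh = []
    for item in items:
        # Hypothetical choice: fingerprint each item by hashing its text
        fingerprint = hashlib.sha256(item.encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            fresh.append(item)
    return fresh


seen = set()
first_pass = filter_new(["post-1", "post-2"], seen)
second_pass = filter_new(["post-2", "post-3"], seen)
print(first_pass)
print(second_pass)
```

On the second pass only the item that was not present during the first pass is returned. A real crawler would probably keep the fingerprints in a file or database so that they survive between runs.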

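    - Anti-crawling mechanisms:
    - Counter-strategies against anti-crawling:

    - The robots.txt protocol: a crawler may choose to comply with it or not.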
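For a crawler that chooses to comply, the standard library's `urllib.robotparser` can answer whether a URL may be fetched. The sketch below parses an inline rule set instead of downloading one; the rules, the `example.com` URLs, and the helper name `is_allowed` are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: everything under /private/ is off limits to all crawlers
SAMPLE_RULES = """User-agent: *
Disallow: /private/
"""


def is_allowed(robots_text, user_agent, url):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(SAMPLE_RULES, "*", "https://example.com/private/data.html")
permitted = is_allowed(SAMPLE_RULES, "*", "https://example.com/index.html")
print(blocked, permitted)
```

Against a live site, `set_url()` followed by `read()` would load the real file before calling `can_fetch()`.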

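    Examples:

    GET request: fetch a whole page

    import requests
    
    # 1. Specify the URL
    url = 'https://www.sogou.com/'
    # 2. Send the request
    response = requests.get(url=url)
    # 3. Read the response body as text
    page_text = response.text
    # 4. Save it to disk
    with open('./sogou.html','w',encoding='utf-8') as fp:
        fp.write(page_text)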

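    GET request with parameters: fetch dynamic data

    # Goal: scrape the Sogou results page for a given search term
    import requests
    url = 'https://www.sogou.com/web'
    # Wrap the query parameters in a dict
    wd = input('enter a word:')
    param = {
        'query':wd
    }
    response = requests.get(url=url,params=param)
    
    # .content is the raw bytes, so the file is opened in binary mode
    page_text = response.content
    fileName = wd+'.html'
    with open(fileName,'wb') as fp:
        fp.write(page_text)
        print('over')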
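The `params` dictionary ends up as a query string appended to the URL. The sketch below reproduces that step with the standard library so the result can be inspected without sending a request. `build_search_url` is a hypothetical helper, and requests may differ in small encoding details.

```python
from urllib.parse import urlencode


def build_search_url(base_url, word):
    """Append a URL-encoded `query` parameter to base_url."""
    return base_url + "?" + urlencode({"query": word})


simple = build_search_url("https://www.sogou.com/web", "python")
spaced = build_search_url("https://www.sogou.com/web", "web crawler")
print(simple)
print(spaced)
```

Passing the dictionary through `params` is usually preferable to concatenating strings by hand, because spaces and non-ASCII characters are escaped for you.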

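    POST request: fetch dynamic JSON data

    # Scrape Baidu Translate suggestions
    import requests
    url = 'https://fanyi.baidu.com/sug'
    wd = input('enter a word:')
    data = {
        'kw':wd
    }
    response = requests.post(url=url,data=data)
    
    print(response.json())
    
    # response.text   : the body as a string
    # response.content: the body as bytes
    # response.json() : the body parsed into Python objects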
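The relationship between the three views of a response body can be imitated offline. In the sketch below a byte string stands in for `.content`, decoding it gives the equivalent of `.text`, and parsing that gives the equivalent of `.json()`. The payload is made up and is not the real reply from the translation endpoint.

```python
import json

# Invented payload standing in for the bytes a server might send back
raw_bytes = '{"data": [{"k": "dog", "v": "n. 狗"}]}'.encode("utf-8")

text = raw_bytes.decode("utf-8")   # comparable to response.text
parsed = json.loads(text)          # comparable to response.json()

print(type(raw_bytes).__name__, type(text).__name__, type(parsed).__name__)
print(parsed["data"][0]["k"])
```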

    GET request returning JSON: fetch paginated data

    # Scrape movie detail data from the Douban movie category chart, https://movie.douban.com/
    import requests
    url = 'https://movie.douban.com/j/chart/top_list'
    param = {
        "type": "5",
        "interval_id": "100:90",
        "action": '',
        "start": "60",
        "limit": "100",
    }
    
    movie_data = requests.get(url=url,params=param).json()
    
    print(movie_data)
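To walk through several pages rather than a single one, the parameter dictionary can be generated in a loop. The sketch below assumes that `start` is a zero-based offset and `limit` is the page size, which is a guess from the parameter names rather than something the post states. `page_params` is a hypothetical helper.

```python
def page_params(total, limit, type_id="5", interval_id="100:90"):
    """Yield one parameter dict per page, stepping `start` by `limit`."""
    for start in range(0, total, limit):
        yield {
            "type": type_id,
            "interval_id": interval_id,
            "action": "",
            "start": str(start),   # assumed to be an offset into the ranking
            "limit": str(limit),   # assumed to be the number of items per page
        }


pages = list(page_params(total=60, limit=20))
for params in pages:
    print(params["start"], params["limit"])
```

Each dictionary could then be passed as `params=` to `requests.get`, ideally with a short pause between requests.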

    Fetch record IDs first, then use each ID to request the detail record (the IDs here are read from the JSON response, so no regular expression is needed)

    # Goal: scrape the cosmetics production licence data published by China's National Medical Products Administration, http://125.35.6.84:81/xk/
    # Anti-crawling mechanism: User-Agent check  --> counter it by sending a browser User-Agent
    import requests
    url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    }
    # Stage 1: collect the IDs from the first 10 list pages
    id_list = []
    for page in range(1,11):
        data = {
            "on": "true",
            "page": str(page),
            "pageSize": "15",
            "productName": "",
            "conditionType": "1",
            "applyname": "",
            "applysn": "",
        }
        json_data = requests.post(url=url,data=data,headers=headers).json()
        for dic in json_data['list']:
            id = dic['ID']
            id_list.append(id)
        
    # Stage 2: request the detail record for every collected ID
    detail_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    for id in id_list:
        detail_data = {
            'id':id
        }
        detail_json = requests.post(url=detail_url,data=detail_data,headers=headers).json()
        print(detail_json)

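    Download an image

    # Download a picture
    import requests
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    }
    url = 'https://ss2.bdstatic.com/70cFvnSh_Q1YnxGkpoWK1HF6hhy/it/u=806201715,3137077445&fm=26&gp=0.jpg'
    # Images are binary data, so use .content and write in 'wb' mode
    img_data = requests.get(url=url,headers=headers).content
    with open('./xiaohua.jpg','wb') as fp:
        fp.write(img_data)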

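    Download an image in one line with the urllib module

    # Import the submodule explicitly; a bare "import urllib" does not reliably load urllib.request
    import urllib.request
    
    url = 'https://ss2.bdstatic.com/70cFvnSh_Q1YnxGkpoWK1HF6hhy/it/u=806201715,3137077445&fm=26&gp=0.jpg'
    urllib.request.urlretrieve(url=url,filename='./123.jpg')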

    Regular expressions for crawling

    Further reading:

    https://www.cnblogs.com/bobo-zhang/p/9682516.html

    import re
    string = '''fall in love with you
    i love you very much
    i love she
    i love her'''
    
    # re.M (MULTILINE): ^ matches at the start of every line, not only at the start of the string
    re.findall('^i.*',string,re.M)
    # Output: ['i love you very much', 'i love she', 'i love her']
    
    string1 = """细思极恐
    你的队友在看书
    你的敌人在磨刀
    你的闺蜜在减肥
    隔壁老王在练腰
    """
    # re.S (DOTALL): . also matches newlines, so the whole text is matched as one piece
    re.findall('.*',string1, re.S)
    # Output: ['细思极恐\n你的队友在看书\n你的敌人在磨刀\n你的闺蜜在减肥\n隔壁老王在练腰\n', '']
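The effect of each flag is easiest to see by running the same pattern with and without it. The short text below is invented for the comparison.

```python
import re

text = "first line\nsecond line\n"

# Without re.M, ^ only matches at the very start of the string
no_flag = re.findall("^s.*", text)
# With re.M, ^ also matches right after each newline
multiline = re.findall("^s.*", text, re.M)

# Without re.S, . stops at newlines, so each line is a separate match
per_line = re.findall(".+", text)
# With re.S, . crosses newlines and the whole text is a single match
dotall = re.findall(".+", text, re.S)

print(no_flag, multiline)
print(per_line, dotall)
```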

    Extract data with re

    import os
    import re
    import urllib.request
    
    import requests
    
    # %d is replaced with the page number
    url = 'https://www.qiushibaike.com/pic/page/%d/?s=5170552'
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    }
    # Create the output directory if it does not exist yet
    if not os.path.exists('./qiutu'):
        os.mkdir('./qiutu')
        
    start_page = int(input('enter a start pageNum:'))
    end_page = int(input('enter a end pageNum:'))
    
    for page in range(start_page,end_page+1):
        new_url = url % page
        page_text = requests.get(url=new_url,headers=headers).text
        # re.S lets .*? span the line breaks inside each <div class="thumb"> block
        img_url_list = re.findall('<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>',page_text,re.S)
        for img_url in img_url_list:
            # The src value has no scheme, so prepend one
            img_url = 'https:'+img_url
            imgName = img_url.split('/')[-1]
            imgPath = 'qiutu/'+imgName
            urllib.request.urlretrieve(url=img_url,filename=imgPath)
            print(imgPath,'downloaded successfully!')
            
    print('over!!!')
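The extraction pattern can be tried offline against a small piece of HTML before pointing it at a live page. The markup below is invented to resemble the structure the pattern expects; the real page may look different.

```python
import re

# Made-up markup with the same shape the pattern targets
sample_html = '''<div class="thumb">
<a href="/article/1"><img src="//pic.example.com/a.jpg" alt="first"></a>
</div>
<div class="thumb">
<a href="/article/2"><img src="//pic.example.com/b.png" alt="second"></a>
</div>'''

pattern = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'

# re.S is required because the <img> tag sits on a different line from the <div>
srcs = re.findall(pattern, sample_html, re.S)
without_flag = re.findall(pattern, sample_html)

full_urls = ['https:' + src for src in srcs]
names = [img_url.split('/')[-1] for img_url in full_urls]

print(full_urls)
print(names)
```

The non-greedy `.*?` keeps each match inside its own `div`; a greedy `.*` combined with `re.S` would run from the first `div` to the last closing tag.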

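    Parsing with bs4:

    Installation:

    pip install bs4
    pip install lxml

    How parsing works:

    • 1. Load the page source to be parsed into a BeautifulSoup object
    • 2. Call the object's methods or attributes to locate the relevant tags in the source
    • 3. Extract the text between the located tags, or their attribute values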
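The three steps above can be sketched on an inline HTML string, with no network access needed. The markup is invented, and the built-in `html.parser` backend is used here instead of `lxml` only so that the example has one dependency fewer.

```python
from bs4 import BeautifulSoup

# Invented table-of-contents markup for illustration
sample_html = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/1.html">Chapter 1</a></li>
    <li><a href="/book/2.html">Chapter 2</a></li>
  </ul>
</div>
"""

# Step 1: load the source into a BeautifulSoup object
soup = BeautifulSoup(sample_html, "html.parser")

# Step 2: locate the tags with a CSS selector
links = soup.select(".book-mulu > ul > li > a")

# Step 3: read the text and an attribute value from each located tag
chapters = [(a.string, a["href"]) for a in links]
print(chapters)
```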
    import requests
    from bs4 import BeautifulSoup
    
    url = "http://www.shicimingju.com/book/sanguoyanyi.html"
    headers={
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"
    }
    
    page_text = requests.get(url=url,headers=headers).text
    
    # Step 1: load the table-of-contents page into a BeautifulSoup object
    soup = BeautifulSoup(page_text,'lxml')
    
    # Step 2: locate the chapter links
    a_list = soup.select('.book-mulu > ul > li > a')
    
    # The with block closes the file even if a request fails midway
    with open('sanguo.txt', "w", encoding='utf-8') as fp:
        for a in a_list:
            title = a.string
            detail_url = "http://www.shicimingju.com"+a["href"]
            detail_page_text = requests.get(url=detail_url,headers=headers).text
            
            detail_soup = BeautifulSoup(detail_page_text,'lxml')
            
            # Step 3: extract the chapter text from the located tag
            content = detail_soup.find('div',class_="chapter_content").text
            
            fp.write(title+"\n"+content)
            print(title,"downloaded")
        
    print("over")
  • Original article: https://www.cnblogs.com/zhangqing979797/p/10440431.html