  • Selenium3 + Python3 Automation (27): Scraping the Page Source (page_source)

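    Preface

    Sometimes an element on a page is hard to locate through its attributes. In that case you can pull the information you want straight out of the page source. Selenium's page_source property returns the source of the current page.

    What scraping the page source is good for: for example, you can extract every URL on the page and then request those URLs in bulk to check whether any of them return errors such as 404.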

    I. page_source

    1. Selenium's page_source property returns the page source directly. In Python 3 the value is a str.

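    II. Non-greedy matching with re

    1. Import the re module first.

    2. Match with a regular expression in non-greedy mode: in 'href="(.+?)"' the ? after .+ makes the group stop at the first closing double quote.

    3. findall returns a list.

    4. Some of the matches turn out not to be URL links, so they need to be filtered.

    findall finds every substring of the string that the regular expression matches and returns them as a list. If nothing matches, it returns an empty list.

    Syntax: re.findall(pattern, string, flags=0)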
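
    To make the difference between greedy and non-greedy matching concrete, here is a minimal sketch that runs without a browser. The HTML snippet is a made-up sample (an assumption, not taken from the blog page); only re.findall from the standard library is used.

```python
import re

# Hypothetical HTML snippet, used only for illustration.
html = '<a href="/a.html" class="x">A</a> <a href="https://example.com/b">B</a>'

# Greedy: .+ runs on to the LAST double quote in the string,
# so the two links are swallowed into a single match.
greedy = re.findall('href="(.+)"', html)

# Non-greedy: .+? stops at the FIRST closing double quote after href=".
non_greedy = re.findall('href="(.+?)"', html)

# No match: findall returns an empty list rather than None.
no_match = re.findall('src="(.+?)"', html)

print(greedy)
print(non_greedy)  # ['/a.html', 'https://example.com/b']
print(no_match)    # []
```

    Because the greedy pattern returns one oversized match, the non-greedy form is the one to use when pulling attribute values out of HTML with a regular expression.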

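    Reference code:

    from selenium import webdriver
    import re

    driver = webdriver.Chrome()
    driver.get("https://www.cnblogs.com/canglongdao")
    html = driver.page_source  # already a str in Python 3
    rs = html.encode("utf-8")  # encoding it yields bytes
    print(type(rs), type(str(rs)))
    # Search the str itself: str(rs) would be the "b'...'" repr of the bytes
    aurl = re.findall('href="(.+?)"', html)
    print(aurl)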
    

    Output:

    <class 'bytes'> <class 'str'>
    ['//common.cnblogs.com/favicon.ico?v=20200522', '/css/blog-common.min.css?v=7Pwqzj5EBy4dBv4DJNI181rFKP8_OF0hT7jO3o8jAa0', '/skins/book/bundle-book-2.min.css', '/skins/book/bundle-book-mobile.min.css?v=XFoR99E4sMNWcYA_LxWBPY7uXp4-8NCPb1RnsUN1Mwo', 'https://www.cnblogs.com/canglongdao/rss', 'https://www.cnblogs.com/canglongdao/rsd.xml', 'https://www.cnblogs.com/canglongdao/wlwmanifest.xml', 'https://www.cnblogs.com/canglongdao/', 'https://www.cnblogs.com/canglongdao/archive/2020/09/01.html', 'https://www.cnblogs.com/canglongdao/p/13595372.html', 'https://www.cnblogs.com/canglongdao/p/13595372.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13595372', 'https://www.cnblogs.com/canglongdao/p/13594914.html', 'https://www.cnblogs.com/canglongdao/p/13594914.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13594914', 'https://www.cnblogs.com/canglongdao/p/13594459.html', 'https://www.cnblogs.com/canglongdao/p/13594459.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13594459', 'https://www.cnblogs.com/canglongdao/p/13590722.html', 'https://www.cnblogs.com/canglongdao/p/13590722.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13590722', 'https://www.cnblogs.com/canglongdao/archive/2020/08/31.html', 'https://www.cnblogs.com/canglongdao/p/13590348.html', 'https://www.cnblogs.com/canglongdao/p/13590348.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13590348', 'https://www.cnblogs.com/canglongdao/p/13589720.html', 'https://www.cnblogs.com/canglongdao/p/13589720.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13589720', 'https://www.cnblogs.com/canglongdao/p/13587969.html', 'https://www.cnblogs.com/canglongdao/p/13587969.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13587969', 'https://www.cnblogs.com/canglongdao/archive/2020/08/30.html', 'https://www.cnblogs.com/canglongdao/p/13587061.html', 'https://www.cnblogs.com/canglongdao/p/13587061.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13587061', 
'https://www.cnblogs.com/canglongdao/p/13586938.html', 'https://www.cnblogs.com/canglongdao/p/13586938.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13586938', 'https://www.cnblogs.com/canglongdao/p/13585477.html', 'https://www.cnblogs.com/canglongdao/p/13585477.html', 'https://i.cnblogs.com/EditPosts.aspx?postid=13585477', 'https://www.cnblogs.com/canglongdao/default.html?page=2', 'https://www.cnblogs.com/', 'javascript:void(0);', 'javascript:void(0);', 'https://www.cnblogs.com/canglongdao/archive/2020/09/01.html', 'https://www.cnblogs.com/', 'https://www.cnblogs.com/canglongdao/', 'https://i.cnblogs.com/EditPosts.aspx?opt=1', 'https://msg.cnblogs.com/send/%E6%98%9F%E7%A9%BA6', 'javascript:void(0)', 'https://www.cnblogs.com/canglongdao/rss/', 'https://i.cnblogs.com/', 'https://home.cnblogs.com/u/canglongdao/', 'https://home.cnblogs.com/u/canglongdao/', 'https://home.cnblogs.com/u/canglongdao/followers/', 'https://home.cnblogs.com/u/canglongdao/followees/', 'javascript:void(0)', 'https://www.cnblogs.com/canglongdao/p/', 'https://www.cnblogs.com/canglongdao/MyComments.html', 'https://www.cnblogs.com/canglongdao/OtherPosts.html', 'https://www.cnblogs.com/canglongdao/RecentComments.html', 'https://www.cnblogs.com/canglongdao/tag/', 'https://www.cnblogs.com/canglongdao/category/1593317.html', 'https://www.cnblogs.com/canglongdao/category/1694849.html', 'https://www.cnblogs.com/canglongdao/category/1633461.html', 'https://www.cnblogs.com/canglongdao/category/1616592.html', 'https://www.cnblogs.com/canglongdao/category/1609028.html', 'https://www.cnblogs.com/canglongdao/category/1633189.html', 'https://www.cnblogs.com/canglongdao/category/1750002.html', 'https://www.cnblogs.com/canglongdao/category/1566249.html', 'https://www.cnblogs.com/canglongdao/category/1606140.html', 'https://www.cnblogs.com/canglongdao/category/1629226.html', 'https://www.cnblogs.com/canglongdao/category/1588735.html', 'https://www.cnblogs.com/canglongdao/category/1815562.html', 
'https://www.cnblogs.com/canglongdao/category/1588084.html', 'https://www.cnblogs.com/canglongdao/category/1589277.html', 'https://www.cnblogs.com/canglongdao/category/1834572.html', 'https://www.cnblogs.com/canglongdao/category/1611757.html', 'https://www.cnblogs.com/canglongdao/category/1589392.html', 'https://www.cnblogs.com/canglongdao/category/1627263.html', 'https://www.cnblogs.com/canglongdao/category/1619655.html', 'https://www.cnblogs.com/canglongdao/category/1657195.html', 'https://www.cnblogs.com/canglongdao/category/1612257.html', 'https://www.cnblogs.com/canglongdao/category/1769926.html', 'https://www.cnblogs.com/canglongdao/category/1635972.html', 'https://www.cnblogs.com/canglongdao/category/1630667.html', 'https://www.cnblogs.com/canglongdao/archive/2020/09.html', 'https://www.cnblogs.com/canglongdao/archive/2020/08.html', 'https://www.cnblogs.com/canglongdao/archive/2020/07.html', 'https://www.cnblogs.com/canglongdao/archive/2020/06.html', 'https://www.cnblogs.com/canglongdao/archive/2020/05.html', 'https://www.cnblogs.com/canglongdao/archive/2020/04.html', 'https://www.cnblogs.com/canglongdao/archive/2020/03.html', 'https://www.cnblogs.com/canglongdao/archive/2020/02.html', 'https://www.cnblogs.com/canglongdao/archive/2020/01.html', 'https://www.cnblogs.com/canglongdao/archive/2019/12.html', 'https://www.cnblogs.com/canglongdao/archive/2019/11.html', 'https://www.cnblogs.com/canglongdao/archive/2019/10.html', 'https://www.cnblogs.com/canglongdao/p/13380505.html', 'https://www.cnblogs.com/canglongdao/p/12636403.html', 'https://www.cnblogs.com/canglongdao/p/11973931.html', 'https://www.cnblogs.com/canglongdao/p/12013291.html', 'https://www.cnblogs.com/canglongdao/p/12722846.html', 'https://www.cnblogs.com/canglongdao/p/12606952.html', 'https://www.cnblogs.com/canglongdao/p/12019714.html', 'https://www.cnblogs.com/canglongdao/p/12436272.html', 'https://www.cnblogs.com/canglongdao/p/12726642.html', 
'https://www.cnblogs.com/canglongdao/p/11973931.html', 'https://www.cnblogs.com/canglongdao/p/12013291.html', 'https://www.cnblogs.com/canglongdao/p/13380505.html', 'https://www.cnblogs.com/canglongdao/p/12636403.html', 'https://www.cnblogs.com/canglongdao/p/12067902.html', 'https://www.cnblogs.com/canglongdao/p/13380505.html', 'https://www.cnblogs.com/canglongdao/p/12636403.html', 'https://www.cnblogs.com/canglongdao/p/12601894.html', 'https://www.cnblogs.com/canglongdao/p/13414829.html']
    

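    III. Filtering out the URLs

    1. Add an if statement: if 'http' appears in the matched string, treat it as a normal URL. This drops relative paths such as '/skins/book/bundle-book-2.min.css' and entries such as 'javascript:void(0);'.

    2. Put all the URLs that pass the check into one list. That list is the result we want.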
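
    The two steps above can be sketched on a hard-coded list, so the filter can be tried without opening a browser. Two things here are assumptions that go beyond the original: the sample list is made up from entries of the kind shown in the output above, and the hypothetical helper filter_urls uses startswith instead of 'http' in i (so a relative link that merely contains "http" somewhere is not kept) and also drops duplicates.

```python
# Hypothetical sample of what re.findall might return for a page.
aurl = [
    "//common.cnblogs.com/favicon.ico",
    "/skins/book/bundle-book-2.min.css",
    "https://www.cnblogs.com/canglongdao/",
    "javascript:void(0);",
    "https://www.cnblogs.com/canglongdao/rss",
    "https://www.cnblogs.com/canglongdao/",
]


def filter_urls(links):
    """Keep links that start with http:// or https://, without duplicates.

    The order of first appearance is preserved.
    """
    seen = set()
    result = []
    for link in links:
        if link.startswith(("http://", "https://")) and link not in seen:
            seen.add(link)
            result.append(link)
    return result


url = filter_urls(aurl)
print(len(url), url)
```

    Dropping duplicates is optional, but the page lists many links twice, and a shorter list means fewer requests if the URLs are checked afterwards.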

    Reference code:

    # coding:utf-8
    from selenium import webdriver
    import re

    driver = webdriver.Chrome()
    driver.get("https://www.cnblogs.com/canglongdao")
    # page_source is already a str, so it can be searched directly
    aurl = re.findall('href="(.+?)"', driver.page_source)
    driver.quit()  # close the browser once the source has been read
    print(aurl)

    # Keep only the entries that contain 'http'
    url = []
    for i in aurl:
        if 'http' in i:
            url.append(i)
    # Final list of URLs
    print(len(url), url)
    

  • Original article: https://www.cnblogs.com/canglongdao/p/13596364.html