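  • [Selenium2+python2.7][Scrap] Getting a Jianshu author's article list with a crawler and with Selenium scrolling, and generating a Markdown table of contents

    Estimated reading time: 15 minutes

    Environment: Win7 + Selenium 2.53.6 + Python 2.7 + Firefox 45.2 (for setup details see http://www.cnblogs.com/yoyoketang/p/selenium.html)

    FF45.2 official download: http://ftp.mozilla.org/pub/firefox/releases/45.2.0esr/win64/en-US/ 

    Pain point: a friend of my dad's recently published more than 20 articles on Jianshu and asked me to add a table of contents. Looking up every link by hand and pairing it with its title is tedious: 30-odd articles take over half an hour, and the links may change later.

    Solution: Jianshu supports Markdown, so scrape the author's article list and generate the table of contents as a Markdown document.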

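    Original approach 1: scrape the article list with urllib2

    Steps:

    1. Use urllib2 to send a request with a browser-like header and open the page

    2. Match the href links with a regular expression, then build the full links with a list comprehension

    3. Extract the titles with a regular expression

    4. Generate the table of contents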

    # coding=utf-8
    import re
    import urllib2

    def getHtml(url):
        # Send a browser-like User-Agent so the request looks like a normal page view
        header = {"User-Agent":'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36'}
        request = urllib2.Request(url, headers=header)  # init user request with url and headers
        response = urllib2.urlopen(request)             # open url
        text = response.read()
        return text

    def getTitleLink(html):
        # Article ids, e.g. href="/p/6f543f43aaec"
        pattern1 = re.compile(r'<a class="title" target="_blank" href="/p/(\w{0,12})"', re.S)
        links = re.findall(pattern1, html)
        # Include the scheme so the Markdown links are absolute
        urls = ["http://www.jianshu.com/p/" + str(link) for link in links]

        # Title text of the same anchors
        pattern2 = re.compile(r'<a class="title" target="_blank" href="/p/.*?">(.*?)</a>', re.S)
        titles = re.findall(pattern2, html)
        for title, url in zip(titles, urls):
            # Skip the table-of-contents article itself ("目录" = "table of contents")
            if '目录' not in title:
                print "[" + title + "](" + url + ")"


    # sample test menu
    url = 'http://www.jianshu.com/u/73632348f37a'
    html = getHtml(url)
    getTitleLink(html)
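
    The function above runs two separate regular expressions and pairs their results with zip, so the two lists can drift out of step if one pattern matches an anchor the other skips. A minimal sketch of one way around that, assuming the same anchor markup, is a single pattern with two groups so that each id and title come from the same tag. The names TITLE_LINK, markdown_links and SAMPLE_HTML are hypothetical, and the sample markup is made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# One pattern, two groups: the article id and the title text are captured
# from the same <a> tag, so they cannot get out of step.
TITLE_LINK = re.compile(
    r'<a class="title" target="_blank" href="/p/(\w{1,12})"[^>]*>(.*?)</a>',
    re.S)


def markdown_links(html, skip=u'目录'):
    """Return one Markdown link line per title anchor found in html.

    Anchors whose title contains `skip` (the table-of-contents article
    itself) are left out.
    """
    lines = []
    for article_id, title in TITLE_LINK.findall(html):
        title = title.strip()
        if skip in title:
            continue
        lines.append(u"[%s](http://www.jianshu.com/p/%s)" % (title, article_id))
    return lines


# Hypothetical sample markup, not taken from a real Jianshu page.
SAMPLE_HTML = u'''
<a class="title" target="_blank" href="/p/aaa111">First post</a>
<a class="title" target="_blank" href="/p/bbb222">目录</a>
<a class="title" target="_blank" href="/p/ccc333">Second post</a>
'''

if __name__ == "__main__":
    for line in markdown_links(SAMPLE_HTML):
        print(line)
```

    The sketch expects unicode text. On Python 2.7 that means decoding the page first, for example response.read().decode('utf-8'), before passing it in.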

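    Testing shows that when the author has only five or six articles, the table of contents is generated correctly.

    With 20 or more articles, however, a problem appears:

    This approach only captures the article links present in the initially loaded page. Titles that are loaded dynamically as you drag the scroll bar are not in the HTML that urllib2 receives, so they cannot be extracted directly. The usual advice online is to use Selenium.

    Approach 2: open the page with Selenium and run JavaScript to scroll down until the whole page has loaded

    Steps:

    1. Open the page with Selenium

    2. Repeatedly run JavaScript to scroll to the bottom until the whole page has loaded

    3. Use find_elements_by_xpath to locate the title tags

    4. Parse the title tags, write them to the table of contents, and print them

    Note: step 3 returns objects of type WebElement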

    # coding=utf-8

    # refer to http://www.cnblogs.com/haigege/p/5492177.html
    # Step 1: scroll and generate a Markdown-format menu

    from selenium import webdriver
    import time

    # Scroll to the top
    def scroll_top(driver):
        if driver.name == "chrome":
            js = "var q=document.body.scrollTop=0"
        else:
            js = "var q=document.documentElement.scrollTop=0"
        return driver.execute_script(js)

    # Scroll to the bottom
    def scroll_foot(driver):
        if driver.name == "chrome":
            js = "var q=document.body.scrollTop=100000"
        else:
            js = "var q=document.documentElement.scrollTop=100000"
        return driver.execute_script(js)

    def write_text(filename, info):
        """
        :param filename: path of the output text file
        :param info: unicode text to append to the file
        :return: None
        """
        # Create/open the file and append the text followed by a blank line
        with open(filename, 'a+') as fp:
            fp.write(info.encode('utf-8'))
            fp.write('\n\n')

    def scroll_multi(driver, times=5, loopsleep=2):
        # about 3 scrolls are enough for 40 titles
        for i in range(times):
            time.sleep(loopsleep)
            print "Scroll foot %s time..." % i
            scroll_foot(driver)
        time.sleep(loopsleep)

    # Note: titles is a list of WebElement objects
    def write_menu(filename, titles):
        # Empty the output file left over from any previous run
        with open(filename, 'w') as fp:
            pass
        for title in titles:
            # title.text is unicode, so compare against a unicode literal
            # ("目录" = "table of contents", i.e. the menu article itself)
            if u'目录' not in title.text:
                href = title.get_attribute("href")
                print "[" + title.text + "](" + href + ")"
                # Drop ':' and '|' from the title; chain the calls so that
                # both replacements are applied to the same string
                t = title.text.replace(u":", u"").replace(u"|", u"")
                write_text(filename, u"[" + t + u"](" + href + u")")

    def main(url, filename):
        # e.g. <a class="title" href="/p/6f543f43aaec" target="_blank"> titleXXX</a>
        driver = webdriver.Firefox()
        try:
            driver.implicitly_wait(10)
            # driver.maximize_window()
            driver.get(url)
            scroll_multi(driver)
            # Attributes need the @ prefix in XPath; the class test alone
            # selects the title anchors
            titles = driver.find_elements_by_xpath('.//a[@class="title"]')
            write_menu(filename, titles)
        finally:
            # Close the browser even if scraping fails
            driver.quit()

    if __name__ == '__main__':
        # sample link
        url = 'http://www.jianshu.com/u/73632348f37a'
        filename = r'info.txt'
        main(url, filename)
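
    The fixed scroll count above (times=5) has to be guessed for each author. One possible refinement, sketched here as an assumption rather than as part of the original script, is to keep scrolling until document.body.scrollHeight stops growing. The names scroll_until_stable and FakeDriver are hypothetical. FakeDriver only imitates execute_script so the loop can be tried without a browser.

```python
import time


def scroll_until_stable(driver, max_rounds=20, pause=2.0):
    """Scroll to the bottom until the page height stops growing.

    `driver` is anything with an execute_script(script) method, such as a
    Selenium WebDriver. Returns the number of scrolls that loaded new content.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to append more articles
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # height unchanged: assume everything is loaded
        last_height = new_height
        loaded += 1
    return loaded


class FakeDriver(object):
    """Stand-in for a WebDriver: replays a list of page heights."""

    def __init__(self, heights):
        self.heights = list(heights)  # page height after each scroll
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # Any other script counts as a scroll: move to the next height
        self.index = min(self.index + 1, len(self.heights) - 1)
        return None
```

    With a real browser this would be called as scroll_until_stable(driver) in place of the fixed-count loop. If the site responds slowly, the height may not have changed yet when it is checked, so the pause may need to be longer than the default used here.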

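    Notes:

    1. Reference: http://www.cnblogs.com/haigege/p/5492177.html

    2. Environment download: Firefox 45: https://ftp.mozilla.org/pub/firefox/releases/45.0esr/win64/en-US/

    3. If you get an encoding error, add the following at the top of the script:

    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')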
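
    Calling sys.setdefaultencoding changes the default encoding for the whole process. A narrower alternative, offered as a sketch and not as the original author's method, is to open the output file with io.open and an explicit encoding, then write unicode strings directly. io.open is available on Python 2.7 and is the same function as open on Python 3. The helpers append_line, read_all and demo are hypothetical names, and the links in demo are placeholders.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def append_line(filename, text):
    """Append unicode `text` plus a blank line to `filename` as UTF-8."""
    with io.open(filename, 'a', encoding='utf-8') as fp:
        fp.write(text)
        fp.write(u'\n\n')


def read_all(filename):
    """Read the whole file back as unicode text."""
    with io.open(filename, 'r', encoding='utf-8') as fp:
        return fp.read()


def demo():
    """Write two placeholder entries to a temporary file and return its text."""
    path = os.path.join(tempfile.mkdtemp(), "info.txt")
    append_line(path, u"[标题一](http://www.jianshu.com/p/aaa111)")
    append_line(path, u"[Title two](http://www.jianshu.com/p/bbb222)")
    return read_all(path)
```

    Because the file object does the encoding, the caller passes unicode and never calls .encode('utf-8') by hand.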
  • Original post: https://www.cnblogs.com/carol2000/p/6737355.html