
Scraping Qiushibaike with selenium

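Motivation:

I recently came across an introduction to selenium, which said it can open web pages automatically, the way a person would.

That caught my interest, so I spent some time learning it.

Sure enough, interest is the best teacher.

Notes:

I picked Qiushibaike for practice because the site has no robots.txt rules restricting crawlers.

Please do not crawl it abusively.

The code is as follows: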

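#!/usr/bin/env python
# -*- coding: utf-8 -*-


import time

from pymongo import MongoClient
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


# find_element returns a single element.
# find_elements returns a list of elements.
#
# Get an element's text with .text
# Get its href attribute with .get_attribute("href")


def get_db():
    """Return the MongoDB collection that stores the scraped jokes."""
    client = MongoClient(host="localhost", port=27017)
    db = client.spider
    collection = db.qiushibaike_selenium
    return collection


def get_text():
    """Scrape every joke on the current page and write it to MongoDB."""
    content_list = driver.find_elements(By.CLASS_NAME, "main-list")
    # print(content_list)
    collection = get_db()
    for item in content_list:
        tm = item.find_element(By.CLASS_NAME, "fr").text
        title_element = item.find_element(By.CLASS_NAME, "title")
        title = title_element.text
        link = title_element.find_element(By.TAG_NAME, "a").get_attribute("href")
        text = item.find_element(By.CLASS_NAME, "content").text
        url = driver.current_url

        out_dict = {
            "发表时间": tm,        # time posted
            "文章标题": title,     # article title
            "文章完整连接": link,  # full link to the article
            "文章内容": text,      # article body
            "url": url             # page the joke was scraped from
        }

        # \033[31m ... \033[0m prints the message in red
        print("\033[31mWriting this joke to the database\033[0m")
        collection.insert_one(out_dict)
        # print(out_dict)


def get_next():
    """Click the "next" link. Return False when there is no next page."""
    try:
        next_page = driver.find_element(By.CLASS_NAME, "next")
        # \033[32m ... \033[0m prints the message in green
        print("\033[32mMoving to the next page\033[0m")
        next_page.click()
        return True
    except NoSuchElementException:
        print("This is the last page")
        return False


if __name__ == "__main__":
    driver = webdriver.Firefox()
    try:
        driver.get("http://qiushidabaike.com/text_280.html")
        get_text()
        time.sleep(2)

        while get_next():
            # Give the new page time to render before scraping it
            time.sleep(5)
            get_text()
    finally:
        # Close the browser even if scraping fails part-way
        driver.quit()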

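What you need to know:

1. Connecting to MongoDB and inserting data. If you have no background in this, you can save the scraped results to a text file instead;

2. How selenium locates elements. This requires some HTML and CSS basics; if you have none at all, see the tips below;

3. How to find the next page and keep crawling.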
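
The text-file option from point 1 can be sketched roughly as follows. This is only one possible approach, not part of the original script: the helper names `save_item` and `load_items` and the file name `qiushibaike.jsonl` are made up for illustration. Each record is appended as one line of JSON, so the file can be read back later without a database.

```python
import json


def save_item(item, path="qiushibaike.jsonl"):
    """Append one scraped record to a text file as a single JSON line.

    `path` is a hypothetical file name chosen for this sketch.
    """
    with open(path, "a", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the file
        f.write(json.dumps(item, ensure_ascii=False) + "\n")


def load_items(path="qiushibaike.jsonl"):
    """Read every saved record back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With these helpers, the call `collection.insert_one(out_dict)` in `get_text()` could be swapped for `save_item(out_dict)`, and the `pymongo` import would no longer be needed.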

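Tips:

1. How to locate an element:

Find the element you need on the page, then right-click it and choose Inspect Element, then Copy, then XPath.

2. When scraping, remember to sleep between page loads. This reduces the load on the site and also cuts down on errors caused by pages that have not finished rendering.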
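
Both tips can be combined in a small sketch. It assumes an XPath copied from the browser's developer tools and an already opened selenium `driver`. The helper names `polite_pause` and `texts_by_xpath` are hypothetical, and the retry count and delays are arbitrary starting values rather than recommended settings.

```python
import random
import time


def polite_pause(base=2.0, jitter=1.0):
    """Sleep for `base` seconds plus a random extra of up to `jitter` seconds."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay


def texts_by_xpath(driver, xpath, attempts=3):
    """Return the text of every element matching `xpath`.

    If nothing matches yet, pause and look again, up to `attempts` times.
    """
    # Imported here so polite_pause stays usable without selenium installed.
    from selenium.webdriver.common.by import By

    for _ in range(attempts):
        elements = driver.find_elements(By.XPATH, xpath)
        if elements:
            return [el.text for el in elements]
        polite_pause()  # the page may still be rendering; wait and retry
    return []
```

A call might look like `texts_by_xpath(driver, "//div[@class='content']")`, where the XPath is only an illustrative guess at the page structure and should be replaced by the one copied from the browser. Selenium also ships `WebDriverWait` for waiting on specific conditions, which may suit longer scripts better than fixed sleeps.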


Original article: https://www.cnblogs.com/lmt921108/p/12941716.html