
Scraping Tencent Comics

1. Summary

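　　The page loads asynchronously: each image's URL only appears as the page is scrolled. Selenium is therefore the recommended tool. Selenium also has to execute JavaScript to scroll the page, using the window.scrollTo() method.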

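　　When using the Scrapy framework, not every request needs to go through Selenium. Only the first page of a specific comic chapter needs Selenium to fetch the data and return a response. Put this logic in a downloader middleware and guard it with a conditional check.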

　　For details, see: https://jiayi.space/post/scrapy-phantomjs-seleniumdong-tai-pa-chong
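The conditional downloader middleware described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the class name `SeleniumMiddleware`, the helper `needs_selenium`, and the URL pattern (modelled on the chapter URL used in this post) are hypothetical and do not come from the original project. It relies on the Scrapy behaviour that returning a response from `process_request` skips the normal download.

```python
import re
import time

# Assumption: chapter pages follow the pattern of the URL used in this post,
# i.e. /ComicView/index/id/<comic id>/cid/<chapter id>.
CHAPTER_URL = re.compile(r'^https?://ac\.qq\.com/ComicView/index/id/\d+/cid/\d+')


def needs_selenium(url):
    """Return True only for chapter pages, whose images load while scrolling."""
    return CHAPTER_URL.match(url) is not None


class SeleniumMiddleware:
    """Hypothetical downloader middleware: render chapter pages in a browser."""

    def __init__(self):
        self.driver = None  # the browser is created on first use

    def process_request(self, request, spider):
        if not needs_selenium(request.url):
            return None  # let Scrapy's normal downloader handle the request

        # Imported here so the URL check above works without these packages.
        from scrapy.http import HtmlResponse
        from selenium import webdriver

        if self.driver is None:
            self.driver = webdriver.Chrome()
        self.driver.get(request.url)
        # Scroll down step by step so each image URL gets filled in.
        for i in range(1, 25):
            self.driver.execute_script('window.scrollTo(0,{})'.format(i * 1000))
            time.sleep(1)
        # Returning a response here short-circuits the real download.
        return HtmlResponse(url=request.url, body=self.driver.page_source,
                            encoding='utf-8', request=request)
```

To try it, the class would be registered under `DOWNLOADER_MIDDLEWARES` in the project's settings.py. The module path depends on the project layout, so it is not shown here.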

import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

START_URL = 'http://ac.qq.com/ComicView/index/id/505430/cid/920'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
}

# JavaScript template: scroll the window to a given vertical offset.
js = 'window.scrollTo(0,{})'

driver = webdriver.Chrome()
driver.get(START_URL)
# Scroll down in 1000px steps, pausing so the image URLs have time to appear.
for i in range(1, 25):
    driver.execute_script(js.format(i * 1000))
    time.sleep(1)
content = driver.page_source
driver.quit()  # the browser is no longer needed once the HTML is captured

soup = BeautifulSoup(content, 'lxml')
lis = soup.select('ul#comicContain > li')
i = 1
for li in lis:
    img = li.select_one('img')
    if img is None:
        continue  # skip list items that contain no image
    url = img.get('src') or ''
    # Keep only URLs served from the image download endpoint.
    if url.startswith('http://ac.tc.qq.com/store_file_download?'):
        r = requests.get(url, headers=headers)
        con = r.content
        with open('page{}.jpg'.format(i), 'wb') as f:
            f.write(con)
        i += 1


Original article: https://www.cnblogs.com/654321cc/p/8909615.html