Python Web Scraping with Beautifulsoup4: Downloading Images from a Website

Install (use whichever pip command belongs to your Python 3 installation):

pip3 install beautifulsoup4
pip install beautifulsoup4

Beautifulsoup4 is used here with the lxml parser because it parses quickly and tolerates malformed HTML well.

Install the parser:

pip install lxml

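Usage:

Import the BeautifulSoup class from the bs4 module
Import the urlopen function from urllib.request
Read the page with urlopen; read() returns bytes, so decode them as UTF-8 if the page contains Chinese (or other non-ASCII) text
Parse the page with BeautifulSoup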

#coding: utf8
#python 3.7

from bs4 import BeautifulSoup
from urllib.request import urlopen

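# read() returns bytes; decode as UTF-8 so Chinese text is handled correctly
html = urlopen("https://www.anviz.com/product/entries/1.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
# attrs must be a dict ({"class": ...}), not a set ({"class", ...})
all_li = soup.find_all("li", {"class": "product-subcategory-item"})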
for li_title in all_li:
  li_item_title = li_title.get_text()
  print(li_item_title)
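
The same steps can be tried without a network connection. The sketch below is only an illustration: the HTML snippet is made up (it just reuses the class name from the example above), and it uses Python's built-in html.parser instead of lxml so that it runs even where lxml is not installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page content as bytes, which is what urlopen(...).read() returns.
raw = """
<ul>
  <li class="product-subcategory-item">门禁 Access Control</li>
  <li class="product-subcategory-item">Time Attendance</li>
  <li class="other">Ignored</li>
</ul>
""".encode("utf-8")

# bytes -> str; without this step non-ASCII text may come out garbled
html = raw.decode("utf-8")
soup = BeautifulSoup(html, features="html.parser")

# Collect the text of every <li> that carries the target class
titles = [li.get_text(strip=True)
          for li in soup.find_all("li", class_="product-subcategory-item")]
print(titles)
```

Swapping features="html.parser" for features="lxml" should give the same result on well-formed input like this.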

Beautifulsoup4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id13

The search methods resemble jQuery:

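# Get all tags of a given type: soup.find_all('a'). find_all() and find() search all descendants of the current node (children, grandchildren, and so on)
find_all()
soup.find_all("a")  # find all <a> tags
soup.find_all(re.compile("a"))  # find tags whose name contains "a" (requires import re)
soup.find_all(id="link2")  # find tags whose id is "link2"
soup.find_all(href=re.compile("elsie"))  # match each tag's href attribute against the pattern
soup.find_all(id=True)  # find tags that have an id attribute, whatever its value
soup.find_all("a", class_="sister")  # find <a> tags whose class is sister
soup.find_all("p", class_="strikeout")
soup.find_all("p", class_="body strikeout")  # matches the exact class string "body strikeout"
soup.find_all(text="Elsie")  # find strings equal to "Elsie" (newer versions name this argument string=)
soup.find_all(text=["Tillie", "Elsie", "Lacie"])  # find strings equal to any item in the list
soup.find_all("a", limit=2)  # stop searching once 2 results have been found
# Get the text contained in a tag
get_text()
soup.get_text("|")  # join the text pieces with "|"
soup.get_text("|", strip=True)  # also strip whitespace around each piece
# Search the ancestors of the current node
find_parents()
find_parent()
# Search the siblings that follow the current node
find_next_siblings()  # returns all matching following siblings
find_next_sibling()  # returns only the first matching following sibling
# Search the siblings that precede the current node
find_previous_siblings()  # returns all matching preceding siblings
find_previous_sibling()  # returns only the first matching preceding sibling

find_all_next()  # returns all matching nodes that come after the current node
find_next()  # returns the first matching node after the current node

find_all_previous()  # returns all matching nodes that come before the current node
find_previous()  # returns the first matching node before the current node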
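
To see what these calls return, here is a small sketch run against a made-up document. The HTML is an assumption chosen to match the identifiers in the list above (link1, link2, sister, elsie), and html.parser is used only so the example has no lxml dependency.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document, modeled on the identifiers used in the list above.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# find_all() with different filters
link_ids = [a["id"] for a in soup.find_all("a")]
first_two = [a["id"] for a in soup.find_all("a", limit=2)]  # stops after 2 hits
elsie_links = [a["id"] for a in soup.find_all(href=re.compile("elsie"))]
tags_with_b = [t.name for t in soup.find_all(re.compile("b"))]  # names containing "b"

# Navigating relative to one node
link2 = soup.find(id="link2")
parent_classes = link2.find_parent("p")["class"]
after = [a["id"] for a in link2.find_next_siblings("a")]
before = [a["id"] for a in link2.find_previous_siblings("a")]

# get_text() with a separator, stripping whitespace around each piece
story_text = soup.find("p", class_="story").get_text("|", strip=True)

print(link_ids, first_two, elsie_links, tags_with_b)
print(parent_classes, after, before)
print(story_text)
```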

Pass a string to the .select() method to find tags using CSS selector syntax:
soup.select("body a")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("#link1 ~ .sister")
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select('a[href]')
soup.select('a[href="http://example.com/elsie"]')
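
A few of these selectors, applied to a made-up document, are sketched below. The HTML is an assumption for illustration only, and the example presumes a Beautiful Soup version whose select() supports these selectors (recent releases do, through the bundled soupsieve package).

```python
from bs4 import BeautifulSoup

# Hypothetical sample document for trying out CSS selectors.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list, even for a single match
titles = [t.get_text() for t in soup.select("head > title")]
second = [a["id"] for a in soup.select("p > a:nth-of-type(2)")]  # 2nd <a> child of a <p>
later = [a["id"] for a in soup.select("#link1 ~ .sister")]  # siblings after #link1
with_href = len(soup.select("a[href]"))  # <a> tags that have an href
exact = [a.get_text() for a in soup.select('a[href="http://example.com/elsie"]')]

print(titles, second, later, with_href, exact)
```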

The .wrap() method wraps an element in the tag you specify and returns the new wrapper.
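
A minimal sketch of wrap(), assuming a one-paragraph document invented for the purpose:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# Wrap the paragraph's text in a new <b> tag; wrap() returns the wrapper
bold = soup.p.string.wrap(soup.new_tag("b"))

# Wrap the whole paragraph in a new <div>
div = soup.p.wrap(soup.new_tag("div"))

print(bold.name, div.name)
print(soup)
```

The tree is modified in place, so printing soup afterwards shows the wrapped structure.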

Scraping the product list images from the anviz website (demo):

Libraries used:

BeautifulSoup

requests

os

# The following modules ship with Python; just import them when needed
import json
import random     # generate random numbers
import datetime
import time
import os         # create directories

#coding: utf8
#python 3.7

from bs4 import BeautifulSoup
import requests
import os

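URL = "https://www.anviz.com/product/entries/2.html"
html = requests.get(URL).text
# Create the output directory; exist_ok avoids an error if it already exists
os.makedirs("./imgs/", exist_ok=True)
soup = BeautifulSoup(html, features="lxml")

# Each product sits in an <li class="product-subcategory-item">
all_li = soup.find_all("li", class_="product-subcategory-item")
for li in all_li:
    imgs = li.find_all("img")
    for img in imgs:
        # The src attribute is relative, so prepend the site root
        imgUrl = "https://www.anviz.com/" + img["src"]
        # stream=True downloads the body lazily, chunk by chunk
        r = requests.get(imgUrl, stream=True)
        # Use the last path segment as the file name
        imgName = imgUrl.split('/')[-1]
        with open('./imgs/%s' % imgName, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        print('Saved %s' % imgName)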

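The URL being scraped is hard-coded. The site is actually divided into three main sections whose URLs differ only in the trailing ID, and I have not yet figured out how to crawl them all automatically.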
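
One possible way to cover several sections is to build the URLs from a list of IDs and run the same extraction on each page. The sketch below is only that, under stated assumptions: the IDs 1, 2 and 3 are a guess (only 1 and 2 appear above), the helper names section_urls, image_urls and crawl are hypothetical, and urljoin is swapped in for string concatenation so that both relative and absolute src values resolve.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "https://www.anviz.com/product/entries/{}.html"
SECTION_IDS = (1, 2, 3)  # assumption: the three sections use these IDs


def section_urls(ids=SECTION_IDS):
    """Build one listing URL per section ID."""
    return [BASE.format(i) for i in ids]


def image_urls(html, page_url):
    """Return absolute image URLs found in the product list of one page."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for li in soup.find_all("li", class_="product-subcategory-item"):
        for img in li.find_all("img"):
            src = img.get("src")
            if src:  # skip <img> tags without a src attribute
                urls.append(urljoin(page_url, src))
    return urls


def crawl(download):
    """Fetch every section page and pass each image URL to download(url)."""
    import requests  # imported here so the helpers above work without it
    for url in section_urls():
        html = requests.get(url, timeout=10).text
        for img_url in image_urls(html, url):
            download(img_url)
```

Calling crawl(print) would list the image URLs of all assumed sections; passing a function that saves each URL to disk, as in the demo above, would download them.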

Original article: https://www.cnblogs.com/baiyygynui/p/10813046.html