
Web Scraping with BeautifulSoup

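1. Installing BeautifulSoup

Install pip, Python's package manager, then run:

$ pip3 install beautifulsoup4

Then import it in a Python shell to check that the installation worked:

>>> from bs4 import BeautifulSoup

If no error is raised, the import succeeded.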
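The same check can be run non-interactively, which is handy in a setup script. The sketch below is only an illustration: `is_installed` is a hypothetical helper name, and it relies on the standard-library `importlib.util.find_spec`, which returns `None` when a top-level module cannot be found.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "bs4" is the import name of the beautifulsoup4 package.
    if is_installed("bs4"):
        print("bs4 is available")
    else:
        print("bs4 is missing; try: pip3 install beautifulsoup4")
```

Because `find_spec` only locates the module, this does not execute any of the package's code.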

A simple example: scraping the images from http://sc.chinaz.com/biaoqing/baozou.html

The code is as follows:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import re
import warnings

warnings.filterwarnings("ignore")

# Match absolute image URLs by file extension. The alternatives are grouped so
# that "|" applies to the extension only ("gig" was a typo for "gif").
IMG_PATTERN = re.compile(r"^https?://.*\.(jpg|jpeg|png|tiff|raw|bmp|gif)$")


def getTitle(url):
    """Return the src of every matching <img> on the page, or None if the fetch fails."""
    try:
        html = urlopen(url)
    except (HTTPError, URLError, ValueError):
        # ValueError covers strings that are not valid URLs at all.
        return None
    # Name the parser explicitly so the result does not depend on what is installed.
    bsObj = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in bsObj.findAll("img", {"src": IMG_PATTERN})]


def getHread(is_urls):
    """Follow every link on the page and print the image URLs found behind each one."""
    try:
        html = urlopen(is_urls)
    except (HTTPError, URLError, ValueError):
        return None
    bsObj = BeautifulSoup(html, "html.parser")
    # Resolve relative hrefs against the page URL; the set drops duplicate links.
    links = {urljoin(is_urls, a.attrs["href"])
             for a in bsObj.findAll("a") if "href" in a.attrs}
    for link in links:
        print(getTitle(link))
    return links


is_urls = "http://sc.chinaz.com/biaoqing/baozou.html"
a = getHread(is_urls)
print(a)
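One detail worth checking in the image filter is how `|` binds inside a regular expression. Without parentheses, every alternative stands on its own, so a bare `png` matches any string that merely contains those letters. The sketch below contrasts an ungrouped pattern with a grouped one; the `example.com` URLs and the names `loose`, `strict`, and `is_image_url` are made up for illustration.

```python
import re

# Ungrouped: parsed as "http://.*jpg" OR "png" OR "jpeg" OR "gif".
loose = re.compile(r"http://.*jpg|png|jpeg|gif")

# Grouped: scheme, any path, then a dot and one of the listed extensions at the end.
strict = re.compile(r"^https?://.*\.(?:jpg|jpeg|png|gif)$")


def is_image_url(src):
    """Return True if src looks like an absolute URL ending in an image extension."""
    return strict.search(src) is not None


samples = [
    "http://example.com/a.jpg",
    "/static/png-icons/readme.txt",
    "https://example.com/b.gif",
]
for s in samples:
    # Print each sample with the verdict of the loose and the strict pattern.
    print(s, bool(loose.search(s)), is_image_url(s))
```

The second sample is the interesting one: the ungrouped pattern accepts it because the path contains "png", while the grouped pattern rejects it.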
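################## Results ##################
There was no specific requirement behind this; it is just a simple example. The only extra processing is a set, used to remove duplicates so that the same images are not returned twice. It runs rather slowly, and I have not had time to optimize it. When I do, I will write it up properly.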
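On the speed point: each linked page is fetched one after another, so much of the run time is probably spent waiting on the network. One possible way to shorten that, offered as a sketch rather than a measured optimization, is to fetch pages from a thread pool. `fetch_all` and `fake_fetch` are hypothetical names and the `example.com` URLs are placeholders; in the crawler, `getTitle` would be passed as the `fetch` argument in place of the stand-in.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for each distinct URL using a pool of threads.

    Returns a dict mapping each URL to whatever fetch returned for it.
    """
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep first-seen order
    if not unique:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        results = list(pool.map(fetch, unique))
    return dict(zip(unique, results))


def fake_fetch(url):
    """Stand-in for a real page fetch, so the sketch runs without network access."""
    return [url + "/img.jpg"]


if __name__ == "__main__":
    pages = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
    for page, images in fetch_all(pages, fake_fetch).items():
        print(page, images)
```

Threads suit this case because the work is dominated by I/O waits. A modest `max_workers` also keeps the load on the target site reasonable.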

Original article: https://www.cnblogs.com/wxc1/p/6130079.html