Scraping the Baidu product list

import requests
from bs4 import BeautifulSoup

url = 'https://www.baidu.com/more/'

# A timeout keeps the script from hanging if the server does not answer
response = requests.get(url, timeout=10)
response.encoding = 'utf-8'  # without this the text is decoded as ISO-8859-1

# Parse the HTML
soup = BeautifulSoup(response.text, 'lxml')
res = soup.find_all('div', class_='con')

datas = []
for info in res:
    # find_next_sibling() returns the first following sibling that matches
    text_div = info.find('div').find_next_sibling('div')
    title = text_div.find('a').get_text()
    desc = text_div.find('span').get_text()
    datas.append({
        'title': title,
        'desc': desc
    })

with open('more.txt', 'w', encoding='utf-8') as f:
    for data in datas:
        f.write(str(data) + '\n')  # one record per line
print('done')
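
The sibling navigation inside the loop is easier to follow on a small offline sample. The sketch below uses a hypothetical HTML fragment shaped like what the loop expects (the real Baidu markup may differ, and the class names `icon`, `text` and `extra` are made up), and it uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: inside each div.con, a first <div> followed by a
# sibling <div> that holds the <a> (title) and the <span> (description).
html = """
<div class="con">
  <div class="icon"></div>
  <div class="text"><a href="#">Web Search</a><span>Find pages</span></div>
  <div class="extra"></div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
first = soup.find('div', class_='con').find('div')

one = first.find_next_sibling('div')     # nearest following sibling <div>
many = first.find_next_siblings('div')   # all following sibling <div>s, as a list

title = one.find('a').get_text()
desc = one.find('span').get_text()
print(title, desc, len(many))
```

Note the singular and plural forms: `find_next_sibling()` returns one tag (or `None`), while `find_next_siblings()` returns a list.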

Analysis:

1. Simulating the request directly from Python: it succeeds.

2. The text that comes back is garbled.

Fix:

response.encoding = 'utf-8'  # the encoding was originally iso-8859-1
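
Why the text comes out garbled can be reproduced without any network access. The sketch below simulates the situation with a sample string (the string itself is just an illustration): the server sends UTF-8 bytes, and decoding them as ISO-8859-1 turns every byte into a separate character. As far as I know, requests falls back to ISO-8859-1 when the Content-Type header is a text type with no charset, which matches what was seen here.

```python
# Bytes as a server would send them for a UTF-8 page
raw = '百度产品'.encode('utf-8')

garbled = raw.decode('iso-8859-1')   # wrong codec: one character per byte
fixed = raw.decode('utf-8')          # right codec: the original text

print(repr(garbled))
print(fixed)

# The wrong decode loses nothing: encoding back gives the same bytes
assert garbled.encode('iso-8859-1') == raw
```

requests keeps the raw bytes in `response.content` and decodes them with `response.encoding` whenever `response.text` is read, so assigning the correct encoding is enough. If the encoding is not known in advance, `response.apparent_encoding` gives a guess based on the body.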

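3. Parse the HTML

BeautifulSoup is used for parsing this time.

Searching by CSS class

Searching for tags by CSS class name is very useful, but class, the name of the CSS attribute, is a reserved word in Python, so passing class as a keyword argument is a syntax error. As of Beautiful Soup 4.1.2, you can search for tags with a given CSS class through the class_ argument: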

soup = BeautifulSoup(response.text, 'lxml')

res = soup.find_all('div', class_='con')
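
A minimal offline sketch of the class search, on a hypothetical fragment (the tags and text are invented, not taken from the Baidu page). It also shows that `class_` matches a tag when any one of its classes is the requested name, and that the same search can be written with `attrs`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page is larger and may differ
html = """
<div class="con"><a href="/a">Search</a></div>
<div class="con wide"><a href="/b">Maps</a></div>
<div class="other"><a href="/c">Ads</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# class_ matches any tag whose class list contains 'con'
by_keyword = soup.find_all('div', class_='con')

# Passing the class through attrs also avoids the reserved word
by_attrs = soup.find_all('div', attrs={'class': 'con'})

titles = [div.a.get_text() for div in by_keyword]
print(titles)
```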

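find_all() returns every matching tag in the document, even when only one result is wanted. If the document has a single <body> tag, for example, find_all() is a poor fit for locating it, and calling find_all() with limit=1 is clumsier than calling find() directly. The following two lines are nearly equivalent: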

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

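The only difference is that find_all() returns a list containing a single element, while find() returns the element itself. When nothing matches, find_all() returns an empty list and find() returns None.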
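
The difference matters most when nothing is found, because chained calls such as `info.find('div').find_next_sibling('div')` raise `AttributeError` as soon as one step returns `None`. A small sketch, using a sample document and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, 'html.parser')

as_list = soup.find_all('title', limit=1)   # list holding one Tag
as_tag = soup.find('title')                 # the Tag itself

missing_list = soup.find_all('table')       # empty list, safe to loop over
missing_tag = soup.find('table')            # None, not safe to chain on

# Guard before calling methods on a find() result
text = missing_tag.get_text() if missing_tag is not None else ''
print(as_tag.get_text(), len(missing_list), repr(text))
```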

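Scraping result:

One open issue: the images on the Baidu product list were not scraped.

Reason: the images are set through CSS background properties, and I have not yet worked out how to extract those.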
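
One possible direction, sketched under assumptions: if the stylesheet text can be obtained (from a `<style>` tag or a linked CSS file), the image addresses sit inside `url(...)` values and can be pulled out with a regular expression. The CSS below is hypothetical, as are the selector names and the `example.com` address; how Baidu's page actually defines its icons was not checked. If the icons are packed into one sprite image, the URL alone is not enough and the `background-position` values would also be needed.

```python
import re
from urllib.parse import urljoin

# Hypothetical stylesheet text, not copied from the real page
css = """
.icon-search { background: url("https://example.com/img/search.png") no-repeat; }
.icon-map { background-image: url(/img/map.png); }
.plain { background: #fff; }
"""

# Capture whatever sits inside url(...), with or without quotes
URL_IN_CSS = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""")

image_urls = URL_IN_CSS.findall(css)

# Relative URLs resolve against the file that contains the rule; here the
# CSS is assumed to be inline in the page, so the page URL is the base
css_base = 'https://www.baidu.com/more/'
absolute_urls = [urljoin(css_base, u) for u in image_urls]
print(absolute_urls)
```

Each resulting address could then be downloaded with `requests.get(...)` and written to disk from `response.content`.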

Original article: https://www.cnblogs.com/kun666/p/14879812.html