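Uses the BeautifulSoup module
Uses regular expressions
Can optionally crawl in parallel with multiprocessing.Pool (a pool of processes, not threads)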
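Usage
Install BeautifulSoup before use. The script also imports requests and uses the lxml parser, so both need to be installed as well.
Running the program creates a txt file (cars.txt) in the current directory. Its content is JSON-formatted records, for example: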
{"branch_first_letter": "S", "branch_name": "萨博", "branch_id": "64", "producer": "萨博", "producer_id": "", "car_series": "Saab 900", "car_series_id": "s2630", "car_price": }

Source code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2020/1/16 15:34
# @Author : wsx
# @Site :
# @File : cars.py
# @Software: PyCharm
import json
import re
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException


def get_one_page(url):
    """
    Request a web page.
    :param url: URL of the page to fetch
    :return: the page HTML as text, or None if the request fails
    """
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Firefox/68.0'}
    try:
        response = requests.get(url, headers=headers)
        print(response.status_code)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html, first_letter):
    """
    Parse one page. This function is a generator.
    :param html: HTML of one brand-listing page
    :param first_letter: first letter of the brands on this page
    :return: iterable of dicts, one per car series
    """
    # Load the page
    soup = BeautifulSoup(html, 'lxml')
    # Dictionary that holds the current record
    info = {'branch_first_letter': '', 'branch_name': '', 'branch_id': '',
            'producer': '', 'producer_id': '', 'car_series': '',
            'car_series_id': '', 'car_price': ''}
    # Find the tags that contain the information we need
    branches = soup.find_all('dl')
    # Level 1: brand
    for branch in branches:
        info['branch_name'] = branch.dt.div.a.string.strip()
        info['branch_id'] = branch['id']
        info['branch_first_letter'] = first_letter
        print('Scraping... brand:', info['branch_name'])
        # Build a new block to process
        block = branch.find_all('dd')
        block_soup = BeautifulSoup(str(block), 'lxml')
        # Level 2: all manufacturers under this brand
        producers = block_soup.find_all('div', attrs={'class': 'h3-tit'})
        for producer in producers:
            info['producer'] = producer.a.get_text().strip()
            # This parameter could not be found on the page.
            info['producer_id'] = ''
            print('Scraping... manufacturer:', info['producer'])
            cars = producer.find_next('ul')
            # Level 3: car series
            for car in cars.find_all('li', attrs={'id': True}):
                info['car_series_id'] = car['id']
                info['car_series'] = car.h4.a.get_text().strip()
                # The price is hard to extract, so filter roughly first
                price = car.find_all('a', attrs={'class': True, 'data-value': False})
                # Default to '暂无报价' (no quote available) so that a price
                # left over from the previous car is never reused
                info['car_price'] = '暂无报价'
                # Use a regular expression to check that the text really is a price
                if price:
                    print(price[0].get_text())
                    if re.match('.*?万.*?', price[0].get_text(), re.S):
                        info['car_price'] = price[0].get_text().strip()
                # Yield a copy so each record is independent of later updates
                yield dict(info)


def write_file(content):
    """
    Append the scraped data to a file as JSON.
    :param content: dict to serialize
    :return: None
    """
    with open('cars.txt', 'a', encoding='utf-8') as f:
        # One JSON object per line
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(first_letter):
    """
    Main function.
    :param first_letter: first letter of the brand page to scrape
    :return: None
    """
    html = get_one_page('https://www.autohome.com.cn/grade/carhtml/' + first_letter + '.html')
    if html is None:
        # The request failed, so skip this letter
        return
    soup = BeautifulSoup(html, 'lxml')
    html = soup.prettify()
    # While testing, save the page locally first to avoid hitting the site too often
    # with open('car_home.html', 'w', encoding='utf-8') as f:
    #     f.write(html)
    # with open('car_home.html', 'r', encoding='utf-8') as f:
    #     html = f.read()
    for item in parse_one_page(html, first_letter):
        write_file(item)


if __name__ == '__main__':
    # If alphabetical order is not required, uncomment the next two lines
    # to scrape in parallel with a process pool
    # pool = Pool()
    # pool.map(main, [chr(i + ord('A')) for i in range(26)])
    # If you use the pool above, comment out this loop
    for letter in [chr(i + ord('A')) for i in range(26)]:
        main(letter)
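
The output file holds JSON objects written one after another, so json.load() on the whole file will not work. Below is a minimal sketch for reading the records back, assuming they are separated only by whitespace (spaces or newlines). The helper name iter_json_objects and the sample records are hypothetical and are not part of the original script.

```python
import json


def iter_json_objects(text):
    """Yield each JSON object found in text, separated by any whitespace."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample in the same shape as the script's output.
sample = (
    '{"branch_name": "萨博", "car_series": "Saab 900", "car_price": "暂无报价"} '
    '{"branch_name": "萨博", "car_series": "Saab 9-3", "car_price": "暂无报价"}\n'
)
records = list(iter_json_objects(sample))
print(len(records), records[0]["car_series"])
```

To read the real file, pass the result of open('cars.txt', encoding='utf-8').read() instead of the sample string.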
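You may ask: why use three nested loops to scrape such simple data? The data is hierarchical: each brand has manufacturers, and each manufacturer has car series. Nesting the loops in the same order means every record keeps the brand and manufacturer it belongs to, so the relationships between levels are never mixed up.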
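
The idea can be shown without any scraping. The sketch below uses a hypothetical in-memory catalog (the brand, maker, and series names are made up) and flattens it with three nested loops, so each output record carries its parent brand and manufacturer.

```python
# Hypothetical nested data: brand -> manufacturer -> car series.
catalog = {
    "Brand A": {
        "Maker A1": ["Series 1", "Series 2"],
        "Maker A2": ["Series 3"],
    },
    "Brand B": {
        "Maker B1": ["Series 4"],
    },
}


def flatten(catalog):
    """Yield one flat record per car series, carrying its parent names."""
    for brand, makers in catalog.items():          # level 1: brand
        for maker, series_list in makers.items():  # level 2: manufacturer
            for series in series_list:             # level 3: car series
                yield {
                    "branch_name": brand,
                    "producer": maker,
                    "car_series": series,
                }


rows = list(flatten(catalog))
for row in rows:
    print(row)
```

Because the inner loops run inside the outer ones, a series can never be paired with the wrong brand or manufacturer.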
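One thing I ran into while writing the code: with BeautifulSoup's find_all() method, if you only need to test whether an attribute is present, without specifying its value, pass True to require the attribute and False to require that it be absent, like this: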
car.find_all('a', attrs={'class': True, 'data-value': False})
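
A small self-contained check of this filter, as a sketch only. The markup is hypothetical and not copied from the real site, and html.parser is used here so that lxml is not needed. The behavior of False (attribute must be absent) is what I have observed in bs4, so treat it as an assumption and verify it against your installed version.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three links inside one list item.
html = """
<li id="s1">
  <a class="price" href="#">12.5万</a>
  <a class="pic" data-value="1" href="#">图库</a>
  <a href="#">no class</a>
</li>
"""
car = BeautifulSoup(html, "html.parser")

# Keep <a> tags that have a class attribute and have no data-value attribute.
matches = car.find_all("a", attrs={"class": True, "data-value": False})
texts = [a.get_text() for a in matches]
print(texts)
```

Only the first link passes both conditions: the second has a data-value attribute and the third has no class.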