
009 Python Web Crawling and Information Extraction: Taobao Price-Comparison Targeted Crawler

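[A] Introduction to the Taobao price-comparison targeted crawler

　　　　Functional description

　　　　　　Goal:

　　　　　　　　Fetch Taobao search result pages and extract each product's name and price.

　　　　　　Analysis:

　　　　　　　　1. Taobao's search interface (the URL format of a search request); 2. Pagination handling (how the URL selects a result page).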
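
The two points above can be illustrated with a small sketch that builds one search URL per result page. It assumes the URL pattern used by the example code in section [B] (a `cpage` page number and a `q` keyword on www.yiwugo.com), not Taobao's actual interface, and `build_search_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed base URL, taken from the example code in section [B]
BASE_URL = 'http://www.yiwugo.com/search/s.html'


def build_search_urls(keyword, depth):
    """Return one search URL per result page, for pages 1..depth."""
    urls = []
    for page in range(1, depth + 1):
        # urlencode percent-encodes the keyword (spaces become '+')
        query = urlencode({'cpage': page, 'q': keyword})
        urls.append(BASE_URL + '?' + query)
    return urls


for url in build_search_urls('toy train', 2):
    print(url)
```

Building the query with `urlencode` keeps non-ASCII keywords and spaces safely encoded, which plain string concatenation does not guarantee.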

　　　　Technical approach:

　　　　　　requests, re (the example code in section [B] also uses BeautifulSoup from bs4 for parsing)
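
As a sketch of how `re` alone can pull names and prices out of page text, the following runs two patterns over a hard-coded sample string. The field names `raw_title` and `view_price`, the sample text, and the helper name `extract_goods` are assumptions made for illustration; the markup of a real search page may differ.

```python
import re

# Hypothetical page fragment; real search pages may use other field names
SAMPLE = (
    '{"raw_title":"Wooden toy train","view_price":"35.00"},'
    '{"raw_title":"Electric train set","view_price":"128.50"}'
)


def extract_goods(text):
    """Return a list of (title, price) tuples found in text."""
    # Non-greedy match so each title stops at its own closing quote
    titles = re.findall(r'"raw_title":"(.*?)"', text)
    prices = re.findall(r'"view_price":"([\d.]*)"', text)
    # zip pairs the n-th title with the n-th price
    return list(zip(titles, prices))


for title, price in extract_goods(SAMPLE):
    print(title, price)
```

This approach needs no HTML parser, but it depends on the two fields appearing in the same order and the same number of times.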

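　　　　Program structure:

　　　　　　Step 1: Submit the product search request and fetch the result pages in a loop

　　　　　　Step 2: For each page, extract the product names and prices

　　　　　　Step 3: Print the information to the screen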

[B] Writing the Taobao price-comparison targeted crawler

　　　　Example code:

import re

import requests
from bs4 import BeautifulSoup


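# 1. Fetch the page and return its HTML text ('' on any request failure)
def get_HTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # Use the encoding guessed from the content rather than the header default
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ''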


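# 2. Parse the page and return the product records as a list
def parsePage(html):
    ulist = []
    soup = BeautifulSoup(html, 'html.parser')
    # Select every <div> whose class is pro_list_product_img2
    divs = soup.find_all('div', class_='pro_list_product_img2')
    for k in divs:
        try:
            shop_id = k.attrs['shopid']
            # Skip the first 18 characters of the link text
            name = k.ul('li', 'titheight font_tit')[0].a.text[18:]
            try:
                # Price split across two <font> tags: integer and fractional part
                fonts = k.em('font')
                price = fonts[0].text + '.' + fonts[1].text
            except (AttributeError, IndexError, TypeError):
                price = k.em.text
            # Address: first run of 3+ CJK characters, digits, or hyphens
            address = re.findall(r'[\u4e00-\u9fa50-9-]{3,}', k('li', 'shshopname')[0].text)[0]
            ulist.append([shop_id, name, price, address])
        except (AttributeError, IndexError, KeyError, TypeError):
            # Skip entries that lack any of the expected fields
            continue
    return ulist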


# 3. Print the product records collected by parsePage
def printGoodsList(ulist, page=1):
    # The header and row templates use different widths for the name column
    splt1 = '{0:<5}\t{1:<10}\t{2:<60}\t{3:<20}\t{4:<20}'
    splt2 = '{0:<5}\t{1:<10}\t{2:<45}\t{3:<20}\t{4:<20}'
    if page == 1:
        print(splt1.format('No.', 'Shop ID', 'Product name', 'Price', 'Shop address'))
    for k, item in enumerate(ulist):
        # Running index across pages, assuming 50 results per page
        print(splt2.format(50 * (page - 1) + k + 1, item[0], item[1], item[2], item[3]))


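def main():
    # Example search URL: 'http://www.yiwugo.com/search/s.html?cpage=1&q=连衣裙'
    # Note: this example queries yiwugo.com rather than Taobao itself
    depth = 2  # number of result pages to fetch
    keyword = '小火车'  # search keyword ("toy train")
    for k in range(1, depth + 1):
        url = 'http://www.yiwugo.com/search/s.html?cpage=' + str(k) + '&q=' + keyword
        html = get_HTMLText(url)
        ulist = parsePage(html)
        printGoodsList(ulist, k)


if __name__ == '__main__':
    main()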
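
The address extraction in `parsePage` relies on a character class meant to cover CJK ideographs, digits, and hyphens. A minimal sketch of that pattern on a made-up string follows; the sample text and the helper name `first_address` are assumptions, and the range `\u4e00-\u9fa5` covers only the common CJK Unified Ideographs.

```python
import re

# CJK ideographs (\u4e00-\u9fa5), ASCII digits, and hyphens, at least 3 in a row
ADDRESS_PATTERN = re.compile(r'[\u4e00-\u9fa50-9-]{3,}')


def first_address(text):
    """Return the first run matching ADDRESS_PATTERN, or '' if there is none."""
    match = ADDRESS_PATTERN.search(text)
    return match.group(0) if match else ''


print(first_address('Shop: 义乌国际商贸城1区-2楼 (verified)'))
```

Note that the backslashes in `\u4e00` and `\u9fa5` are essential: without them the class matches the literal letters and digits of those escapes instead of Chinese characters.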

Original article: https://www.cnblogs.com/carreyBlog/p/14015459.html
