
Crawler 9: A Targeted Crawler for Taobao Product Information

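Feature description:

Goal: Fetch Taobao search result pages and extract the product names and prices from them.

Key points: Taobao's search interface and how it handles pagination.

Technical approach: requests, re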

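When we search Taobao for 书包 (backpack):

Looking at the result pages, each page lists 44 products, so the item offset passed in the search URL (the s parameter) advances by 44 from one page to the next.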
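The pagination rule can be sketched as a small URL builder. This is a minimal illustration that assumes the q and s query parameters used by the program in this post; the helper name build_search_urls and the constant ITEMS_PER_PAGE are hypothetical and not part of the original code.

```python
from urllib.parse import urlencode

ITEMS_PER_PAGE = 44  # page size observed on the search result pages


def build_search_urls(keyword, depth):
    """Return the search URL for each of the first `depth` result pages."""
    base = 'https://s.taobao.com/search'
    urls = []
    for page in range(depth):
        # `s` is the index of the first item on the page: 0, 44, 88, ...
        query = urlencode({'q': keyword, 's': ITEMS_PER_PAGE * page})
        urls.append(base + '?' + query)
    return urls


for url in build_search_urls('书包', 3):
    print(url)
```

Using urlencode percent-encodes the Chinese keyword explicitly instead of leaving that step to the HTTP library.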

The site's robots protocol (robots.txt) also shows that crawling these pages is not permitted.
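A robots.txt check like this can be done with the standard library. The sketch below parses robots.txt text held in a string so that it runs without network access; SAMPLE_ROBOTS_TXT is an illustrative sample, not a copy of Taobao's live file, and the helper name is_allowed and the user agent 'my-crawler' are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (an assumption, not Taobao's actual file).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, 'my-crawler',
                 'https://s.taobao.com/search?q=test'))
```

To check a live site, call set_url() with the address of its robots.txt and then read() on the parser instead of passing text to parse().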

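Program structure:

1. Submit the product search request and fetch the result pages in a loop

2. For each page, extract the product names and prices

3. Print the information to the screen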

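import requests
import re

def getHTMLText(url):
    print('')  # placeholder, filled in later

def parserPage(ilt, html):
    print('')  # placeholder, filled in later

def printGoodList(ilt):
    print('')  # placeholder, filled in later

def main():
    goods = '书包'  # search keyword ("backpack")
    depth = 2  # crawl depth: number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods  # initial URL
    infoList = []  # output list
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)  # 44 products per page
            html = getHTMLText(url)
            parserPage(infoList, html)
        except Exception:
            continue  # skip a page that fails and move on to the next
    printGoodList(infoList)
main()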

A good coding habit: write the skeleton first, then fill in the details.

Complete program:

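import requests
import re

def getHTMLText(url):
    # Fetch a page and return its text, or '' if the request fails
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ''

def parserPage(ilt, html):
    # Extract price/title pairs from the page and append them to ilt
    try:
        plt = re.findall(r'"view_price":"[\d.]*"', html)
        tlt = re.findall(r'"raw_title":".*?"', html)  # the ? makes the match non-greedy (shortest match)
        for i in range(len(plt)):
            # eval strips the outermost single or double quotes
            price = eval(plt[i].split(':', 1)[1])  # keep only the value part of the key-value pair
            title = eval(tlt[i].split(':', 1)[1])  # maxsplit=1 keeps any ':' inside the title
            ilt.append([price, title])
    except Exception:
        print('')

def printGoodList(ilt):
    # Print template: index, price, product name
    tplt = '{:4}\t{:8}\t{:16}'
    print(tplt.format('No.', 'Price', 'Product name'))
    count = 0
    for q in ilt:
        count = count + 1
        print(tplt.format(count, q[0], q[1]))

def main():
    goods = '书包'  # search keyword ("backpack")
    depth = 2  # crawl depth: number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods  # initial URL
    infoList = []  # output list
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)  # 44 products per page
            html = getHTMLText(url)
            parserPage(infoList, html)
        except Exception:
            continue  # skip a page that fails and move on to the next
    printGoodList(infoList)
main()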
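The program above relies on eval to strip the quotes from each matched value. As a hedged alternative, not the author's method, the extraction step could use regex capture groups together with json.loads, which avoids evaluating scraped text as Python code. The name parse_page and the SAMPLE_HTML fragment are hypothetical, and the sketch assumes the page embeds the values as JSON strings.

```python
import json
import re

# Hypothetical fragment imitating the JSON embedded in a search page.
SAMPLE_HTML = ('"raw_title":"Canvas backpack: large","view_price":"59.00",'
               '"raw_title":"School bag","view_price":"128.50"')

PRICE_RE = re.compile(r'"view_price":"([\d.]*)"')  # capture the digits only
TITLE_RE = re.compile(r'"raw_title":(".*?")')      # capture the quoted string


def parse_page(html):
    """Return [price, title] pairs found in `html`, in page order."""
    prices = PRICE_RE.findall(html)
    # json.loads decodes the quoted string, including any \uXXXX escapes
    titles = [json.loads(t) for t in TITLE_RE.findall(html)]
    # zip stops at the shorter list, so a missing title cannot raise IndexError
    return [[price, title] for price, title in zip(prices, titles)]


for price, title in parse_page(SAMPLE_HTML):
    print(price, title)
```

Because the value is captured directly, nothing is split on ':' and a colon inside a title cannot truncate it. Like the original pattern, the non-greedy title match still stops at the first double quote, so a title containing an escaped quote would need a stricter pattern.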

Output:


Original article: https://www.cnblogs.com/rayshaw/p/8620920.html