  • Scraping courier outlet information

    https://www.kuaidi100.com/network/net_4117_all_all_2.htm

    Get the outlet links on each listing page

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.kuaidi100.com/network/net_4117_all_all_1.htm"
    try:
        r = requests.get(url)
        r.raise_for_status()   ## raise an exception on 4xx/5xx responses
        r.encoding = r.apparent_encoding
        print(r.text[:1000])
    except requests.RequestException as e:
        print("Fetch failed:", e)
    else:
        ## parse only when the request succeeded, otherwise r is undefined
        soup = BeautifulSoup(r.text, "html.parser")
        networklist = soup.select(".networkListItem")
        for item in networklist:
            print(item.find("a").attrs['href'])
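If the `href` values printed above turn out to be relative paths (an assumption; the post does not show them), they can be resolved against the listing URL with the standard library's `urljoin`. This is a minimal sketch: `absolute_links` is a hypothetical helper name, and the example hrefs are assumed values that follow the detail-page URL pattern used later in this post.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve each href against the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "https://www.kuaidi100.com/network/net_4117_all_all_1.htm"
# Assumed example values: one relative href and one that is already absolute.
hrefs = [
    "/network/networkdt792925391984709.htm",
    "https://www.kuaidi100.com/network/networkdt792925391984709.htm",
]
links = absolute_links(page_url, hrefs)
print(links)
```

`urljoin` leaves absolute URLs unchanged, so the helper is safe to apply whichever form the site actually returns.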
    

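    The outlet listing URL is https://www.kuaidi100.com/network/net_4117_all_all_2.htm; only the page number before .htm (here 2) changes.

    So keep incrementing the number in https://www.kuaidi100.com/network/net_4117_all_all_2 and check whether the page returns 0 outlet links to decide when to stop.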

    url = "https://www.kuaidi100.com/network/net_4117_all_all_60.htm"
    try:
        r = requests.get(url)
        r.raise_for_status()   ## raise an exception on 4xx/5xx responses
        r.encoding = r.apparent_encoding
        print(r.text[:1000])
    except requests.RequestException as e:
        print("Fetch failed:", e)
        
    soup = BeautifulSoup(r.text, "html.parser")
    networklist = soup.select(".networkListItem")
    len(networklist)  
    0  ## length 0: /net_4117_all_all_60.htm contains no outlet information
    
    url = "https://www.kuaidi100.com/network/net_4117_all_all_1.htm"
    r = requests.get(url)   ## fetch page 1 before parsing it
    r.encoding = r.apparent_encoding
    soup = BeautifulSoup(r.text, "html.parser")
    networklist = soup.select(".networkListItem")
    len(networklist)  
    10  ## length 10: this page contains outlet information
    

    Match the URL with a regular expression so that the page number can be incremented step by step

    import re 
    pattern = re.compile(r"(.*_)(\d+)\.htm$", re.I)
    url = "https://www.kuaidi100.com/network/net_4117_all_all_1.htm"
    m = pattern.match(url)
    m.group(0)
    'https://www.kuaidi100.com/network/net_4117_all_all_1.htm'
    m.group(1)
    'https://www.kuaidi100.com/network/net_4117_all_all_'
    m.group(2)
    '1'
    m.groups()
    ('https://www.kuaidi100.com/network/net_4117_all_all_', '1')
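The two captured groups are enough to build the next page's URL. A minimal sketch, assuming every listing URL ends in `_<number>.htm` as in the example above; `next_page_url` and `PAGE_PATTERN` are hypothetical names introduced for illustration.

```python
import re

# Group 1: everything up to the last underscore; group 2: the page number.
PAGE_PATTERN = re.compile(r"(.*_)(\d+)\.htm$", re.I)


def next_page_url(url):
    """Return the URL of the following listing page, or None if the URL
    does not end in `_<number>.htm`."""
    m = PAGE_PATTERN.match(url)
    if m is None:
        return None
    prefix, page = m.groups()
    return f"{prefix}{int(page) + 1}.htm"


print(next_page_url("https://www.kuaidi100.com/network/net_4117_all_all_1.htm"))
```

Returning `None` for a non-matching URL lets the caller decide how to handle an unexpected address instead of failing with an `AttributeError` on `m.group`.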
    
    

    Keep incrementing the page number in the URL, and stop once a page contains fewer than 10 outlet links

    i = 1 
    while  True:
        if i < 50:
            i = i + 1
            print(i)
        else:
            print("you are over")
            break
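The counter loop above can be combined with the stopping rule (a page with fewer than 10 outlet links is the last one). The sketch below is one possible shape, not the author's final code: `collect_outlet_links` and the `fetch_links` callback are hypothetical, and `max_pages` is an assumed safety cap. A fake fetcher stands in for the real requests so the loop can be run offline.

```python
def collect_outlet_links(fetch_links, page_size=10, max_pages=100):
    """Walk listing pages 1, 2, 3, ... and gather outlet links.

    fetch_links(page_number) must return the list of outlet links found on
    that page (hypothetical callback; in practice it would wrap the
    requests + BeautifulSoup code).
    """
    all_links = []
    for page in range(1, max_pages + 1):
        links = fetch_links(page)
        all_links.extend(links)
        if len(links) < page_size:  # short or empty page: last page reached
            break
    return all_links


# Stand-in for the real site: 25 fake links served in pages of 10.
fake_site = [f"link-{n}" for n in range(25)]


def fake_fetch(page):
    start = (page - 1) * 10
    return fake_site[start:start + 10]


result = collect_outlet_links(fake_fetch)
print(len(result))
```

Passing the fetcher in as a parameter keeps the stopping logic testable without network access.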
    

    Analysis

    ## 1. Take the initial URL as input in order to get the outlet details
    
    ## 2. Based on step 1, fetch each URL
    
    ## 3. Scrape the information from each detail URL
    url = "https://www.kuaidi100.com/network/networkdt792925391984709.htm"
    try:
        r = requests.get(url)
        r.raise_for_status()   ## raise an exception on 4xx/5xx responses
        r.encoding = r.apparent_encoding
        print(r.text[:1000])
    except requests.RequestException as e:
        print("Fetch failed:", e)
        
    ## Extract the information
    soup = BeautifulSoup(r.text, "html.parser")
    kdinfo = soup.select(".kd-info")[0]   ## select() returns a list; take the first match
    print(kdinfo.prettify())   ## print the extracted markup
    ## kddlinfo = kdinfo.find_all("dl")  ## the dl tags inside kdinfo hold dt (field name) and dd (details)
    title = kdinfo.h1.text
    
    ddlist = kdinfo.find_all("dd")
    
    for dd in ddlist:
        print("\n-------------------------")
        print(dd.text)
    
        
        ## Output
    -------------------------
    河南,驻马店市,正阳县
    
    -------------------------
    正确路口东段北侧
    
    -------------------------
    查件电话:17744695161业务电话:17744695161
    
    -------------------------
    联系时,请一定说明是在快递100看到的信息,谢谢!
    
    -------------------------
    南环路以北、西环路以东,交警队、电视台以南,正付路-东环路以西。东、西、南、北大街,中心街、花园路、正大路、东、西顺河街、慎西路。县直各单位、局委、厂区、学校。铜钟街及铜钟全境。
    陡沟镇、傅寨乡、兰青乡、永兴镇、彭桥乡、新阮店乡、熊寨镇、吕河乡。
    
    延迟派送:岳城乡:1天,西严店乡:1天。
    
    -------------------------
    寒冻镇。
    
    -------------------------
    到付业务,代收货款
    
    -------------------------
    2019-11-04
    
    
    
    
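    Fields:
    h1: location; 驻马店正阳县
      title = kdinfo.h1.text
    Region (所在地区):
    
    Company address (公司地址):
    Contact phone (联系电话):
    Delivery area (派送范围):
    Delayed delivery (延迟派送):
    Delivery area (派送范围):
    Non-delivery area (不派送范围):
    Remarks (备注):
    Last updated (本站更新):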
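Printing the `dd` values alone loses the field names. The commented-out `find_all("dl")` line in the code hints that each `dl` holds a `dt` (name) and a `dd` (details), so the two can be paired into a dict. The sketch below assumes that structure; `sample_html` is a reduced, hypothetical stand-in for the real page, and `parse_outlet` is a name introduced for illustration.

```python
from bs4 import BeautifulSoup

# Assumed markup, reduced to the parts described above: a .kd-info block
# with an <h1> title and <dl> entries holding <dt> (label) and <dd> (value).
sample_html = """
<div class="kd-info">
  <h1>驻马店正阳县</h1>
  <dl><dt>所在地区:</dt><dd>河南,驻马店市,正阳县</dd></dl>
  <dl><dt>本站更新:</dt><dd>2019-11-04</dd></dl>
</div>
"""


def parse_outlet(html):
    """Return a dict with the title plus one entry per dt/dd pair."""
    soup = BeautifulSoup(html, "html.parser")
    kdinfo = soup.select_one(".kd-info")
    if kdinfo is None:
        return {}  # page without an outlet block
    record = {"title": kdinfo.h1.get_text(strip=True)}
    for dl in kdinfo.find_all("dl"):
        dt, dd = dl.find("dt"), dl.find("dd")
        if dt is None or dd is None:
            continue  # skip entries that lack a label or a value
        # Drop a trailing ASCII or full-width colon from the label.
        label = dt.get_text(strip=True).rstrip(":：")
        record[label] = dd.get_text(strip=True)
    return record


record = parse_outlet(sample_html)
print(record)
```

Note that a label appearing twice on a page would overwrite the earlier value in this sketch; collecting values in a list per label would avoid that.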
    
    
    
    
    
    
    ## Save the scraped information
    http://www.python-excel.org/
    https://www.jianshu.com/p/a8391a2b8c6c
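The links above point to Excel-oriented libraries. As a dependency-free alternative (a swapped-in technique, not what the links describe), the records can be written as CSV with the standard library's `csv` module. In this sketch `write_records` is a hypothetical helper, the English keys are illustrative, and the values are taken from the sample output in this post.

```python
import csv
import io


def write_records(records, fileobj):
    """Write a list of dicts as CSV, using the union of keys as columns."""
    fieldnames = []
    for rec in records:
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in columns that a given record does not have.
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)


records = [
    {"title": "驻马店正阳县", "region": "河南,驻马店市,正阳县", "updated": "2019-11-04"},
]

buffer = io.StringIO()  # in-memory stand-in for a real file
write_records(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass something like `open("outlets.csv", "w", newline="", encoding="utf-8-sig")` instead of the buffer (the file name is an assumption); the `utf-8-sig` encoding helps Excel detect the Chinese text correctly.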
    
    
    
    
    
    
    
    
    

    https://www.kancloud.cn/xmsumi/pythonspider/160081

  • Original post: https://www.cnblogs.com/g2thend/p/12382087.html