  • Scraping one page of second-hand computer listings from 58.com with requests + BeautifulSoup

    Scrape one page of second-hand computer listings from 58.com (35 items). Skip the site's promoted listings and scrape only ordinary individual sellers' items. For each item, scrape: 'title', 'price', 'category', 'area', and 'views' (the view count could not be fetched correctly with requests; it can be obtained with selenium, which is not covered in this post).
    58.com tablet listings page: https://bj.58.com/pbdn/0/
    Detail page:
    https://bj.58.com/pingbandiannao/41525432516648x.shtml?link_abtest=&psid=195682258207684760008335092&entinfo=41525432516648_p&slot=-1&iuType=p_1&PGTID=0d305a36-0000-1ad2-2601-c5a3525b5373&ClickID=1
    When an element's descendants contain several runs of text, use stripped_strings to collect them, i.e.: list(soup.select('div.nav')[0].stripped_strings)
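As a quick illustration of this point, here is a standalone sketch on an inline HTML snippet (the breadcrumb markup below is made up to resemble 58.com's div.nav, and it uses the built-in 'html.parser' so no lxml install is needed): stripped_strings yields every descendant text node with surrounding whitespace removed, skipping whitespace-only nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical breadcrumb nav, similar in shape to 58.com's div.nav
html = '''
<div class="nav">
  <a href="/">Home</a> &gt;
  <a href="/pbdn/">Tablets</a> &gt;
  <span> Detail </span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# stripped_strings walks all descendant text nodes and strips whitespace
parts = list(soup.select('div.nav')[0].stripped_strings)
print(parts)           # ['Home', '>', 'Tablets', '>', 'Detail']
print(''.join(parts))  # Home>Tablets>Detail
```

Joining the list with '' is what the script below does to turn the breadcrumb into a single category string.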
    import requests
    from bs4 import BeautifulSoup
    import time
    
    # Given a listings overview page, scrape it and return the detail-page URLs as a list
    def get_computer_urls(url):
        computer_urls = []
        raw_page = requests.get(url).text
        print('Opening page:', url)
        time.sleep(3)
        soup = BeautifulSoup(raw_page, 'lxml')
        eles = soup.select('tr[infotag*=commoninfo]>td>a')   # ordinary individual sellers' listings, not the site's promoted ones
        print('Locating elements with selector:', 'tr[infotag*=commoninfo]>td>a')
        for e in eles:
            computer_urls.append(e.get('href'))
        return computer_urls
    
    # Given a computer detail page, return its info as a dict
    def get_computer_info(url):
        raw_page = requests.get(url).text
        print('Opening page:', url)
        time.sleep(3)
        soup = BeautifulSoup(raw_page, 'lxml')

        title = soup.select('h1.detail-title__name')[0].get_text().strip()
        category = ''.join(soup.select('div.nav')[0].stripped_strings)
        price = soup.select('span.infocard__container__item__main__text--price')[0].get_text().strip()
        address = ''.join(soup.select('div.infocard__container__item__main')[1].stripped_strings)
        look_num = soup.select('span#totalcount')[0].get_text().strip()
        # print({'title': title, 'price': price, 'category': category, 'area': address, 'views': look_num})
        return {'title': title, 'price': price, 'category': category, 'area': address, 'views': look_num}
    
    
    start_url = 'https://bj.58.com/pbdn/0/'
    computer_urls = get_computer_urls(start_url)
    print(computer_urls)
    print('len(computer_urls):',len(computer_urls))
    time.sleep(3)
    for i, url in enumerate(computer_urls, start=1):
        info = get_computer_info(url)   # avoid shadowing the built-in name dict
        print(i, info)
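The key filtering step above is the CSS attribute-substring selector: tr[infotag*=commoninfo] matches only table rows whose infotag attribute contains the substring 'commoninfo', which is how promoted listings get dropped. A minimal offline sketch (the row markup and the promoted row's attribute value are assumptions modeled on the selector, not copied from 58.com):

```python
from bs4 import BeautifulSoup

# Hypothetical listing table: two ordinary rows and one promoted row
html = '''
<table>
  <tr infotag="commoninfo"><td><a href="https://bj.58.com/a.shtml">item A</a></td></tr>
  <tr infotag="promoted"><td><a href="https://bj.58.com/b.shtml">promoted B</a></td></tr>
  <tr infotag="commoninfo"><td><a href="https://bj.58.com/c.shtml">item C</a></td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
# [attr*=value] is a substring match on the attribute value,
# so only the rows tagged commoninfo survive
links = [a.get('href') for a in soup.select('tr[infotag*=commoninfo]>td>a')]
print(links)   # ['https://bj.58.com/a.shtml', 'https://bj.58.com/c.shtml']
```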
    

      

  • Original post: https://www.cnblogs.com/djlbolgs/p/12522380.html