
    Scraping JD.com products

    For learning purposes only.

    1. Using Selenium

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys  # keyboard key presses (ENTER)
    import time
    
    
    
    def get_goods(driver):
        try:
            goods=driver.find_elements_by_class_name('gl-item')  # one node per product card
    
    
            for good in goods:
                detail_url=good.find_element_by_tag_name('a').get_attribute('href')
    
                p_name=good.find_element_by_css_selector('.p-name em').text.replace('\n','')  # strip line breaks from the title
                price=good.find_element_by_css_selector('.p-price i').text
                p_commit=good.find_element_by_css_selector('.p-commit a').text
    
                msg = '''
                Product : %s
                Link    : %s
                Price   : %s
                Comments: %s
                ''' % (p_name, detail_url, price, p_commit)
    
                print(msg, end='\n\n')
    
    
    
            button=driver.find_element_by_partial_link_text('下一页')  # the "next page" link
            button.click()
            time.sleep(1)
            get_goods(driver)
        except Exception:
            pass  # no next-page button found: last page reached
    
    
    
    def spider(url,keyword):
        driver = webdriver.Chrome()
        driver.get(url)
        driver.implicitly_wait(3)
        try:
            input_tag=driver.find_element_by_id('key')
            input_tag.send_keys(keyword)
            input_tag.send_keys(Keys.ENTER)
            get_goods(driver)
        finally:
            driver.close()
    
    if __name__ == '__main__':
        spider('https://www.jd.com/', keyword='iPhone8手机')  # keyword: "iPhone 8 phone"
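`get_goods` pages through results by calling itself once per page, so a very long result list could in principle hit Python's recursion limit. A minimal sketch of the same pagination as a plain loop, with the driver lookups abstracted into callables (hypothetical helpers standing in for the `find_element`/`click` calls above):

```python
import time

def paginate(find_next_button, handle_page, max_pages=100):
    """Loop over result pages instead of recursing once per page.

    find_next_button: callable returning the "next page" button, or None
                      when there is no next page (it would wrap the
                      find_element_by_partial_link_text('下一页') lookup
                      from get_goods in a try/except returning None).
    handle_page:      callable doing the per-page scraping (the body of
                      get_goods above).
    Returns the number of pages visited.
    """
    for page in range(max_pages):
        handle_page(page)
        button = find_next_button()
        if button is None:  # no "next page" link: last page reached
            return page + 1
        button.click()
        time.sleep(1)       # same politeness pause as the original
    return max_pages
```

The `max_pages` cap is an extra safeguard not present in the original recursive version; drop it if you want the loop to run until the next-page link disappears.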
    

    2. Without Selenium

    from requests_html import HTMLSession
    session = HTMLSession()
    page=1
    while True:
        res = session.get(f'https://search.jd.com/Search?keyword=苹果8&enc=utf-8&page={page*2-1}')  # keyword: search term ("Apple 8"); enc: encoding; JD numbers half-pages, so logical page n -> page=2n-1
        res.html.encoding='utf8'
        info_list=res.html.xpath('//*[@class="gl-i-wrap"]')
        if not info_list:
            print(f'Scraped {page-1} pages in total')  # this page came back empty
            break
        print(f'url={res.url} page {page}', [info.text for info in info_list])
        page+=1
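The f-string above drops the raw Chinese keyword straight into the URL; requests will usually cope, but building the query string explicitly is safer. A minimal sketch using `urllib.parse.urlencode` with the `keyword`/`enc`/`page` parameters from the URL above (`jd_search_url` is an illustrative helper, not part of the original code):

```python
from urllib.parse import urlencode

def jd_search_url(keyword, page):
    """Build the JD search URL for logical result page `page` (1-based).

    JD numbers half-pages, so logical page n maps to page=2n-1,
    as noted in the comment on the session.get call above.
    """
    query = urlencode({"keyword": keyword, "enc": "utf-8", "page": 2 * page - 1})
    return f"https://search.jd.com/Search?{query}"

print(jd_search_url("苹果8", 1))
```

`urlencode` percent-encodes the keyword, so the same helper works for any search term.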
    

    3. Personal impressions

    Selenium really is slow.

  • Original post: https://www.cnblogs.com/pythonywy/p/12008430.html