Combining Selenium and Regular Expressions to Improve Crawler Efficiency

    Task

    Crawl the product details under https://www.aliexpress.com/wholesale?SearchText=cartoon+case&d=y&origin=n&catId=0&initiative_id=SB_20200523214041. Because the page is loaded asynchronously, Selenium has to drive a browser to collect the product URLs. But locating page elements directly with Selenium is slow, so it is combined with re or BeautifulSoup to improve crawling efficiency.


    Simulated Login

    Use Selenium to simulate the login; after a successful login the cookies are available on the driver.

    def login(username, password, driver=None):
        driver.get('https://login.aliexpress.com/')
        driver.maximize_window()
        name = driver.find_element_by_id('fm-login-id')
        name.send_keys(username)
        name1 = driver.find_element_by_id('fm-login-password')
        name1.send_keys(password)
        submit = driver.find_element_by_class_name('fm-submit')
        time.sleep(1)
        submit.click()
        return driver
    
    
    browser = webdriver.Chrome()
    browser = login('Wheabion1944@dayrep.com','ab123456',browser)
    browser.get('https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&SearchText=cartoon+case&ltype=wholesale&SortType=default&page=')
    

    The site's checks on users are loose: registering with an email address requires no verification, so you can grab a throwaway address from http://www.fakemailgenerator.com/ to sign up.

    In fact, when the crawler was actually run later it never logged in, and ten pages were scraped without hitting any anti-crawling measures.
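
    If you do log in and want to reuse the session, the cookies can be read off the driver afterwards. A minimal sketch, assuming you want to carry them over to a plain requests.Session (the original code never does this); Selenium's get_cookies() is the only API relied on here:

    import requests

    def cookies_from_driver(driver):
        # Selenium returns a list of cookie dicts like {'name': ..., 'value': ...}
        return {c['name']: c['value'] for c in driver.get_cookies()}

    # hypothetical usage: reuse the logged-in session outside Selenium
    session = requests.Session()
    session.cookies.update(cookies_from_driver(browser))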


    Getting the Product Detail-Page URLs

    The difficulty here is that the page is loaded asynchronously via Ajax: it does not load all of its data when first opened, but fetches and renders new data packets as you scroll down, so requests cannot read the full page source in one go. The approach is to use Selenium to simulate the scrolling so that all of a page's data is loaded, and then, for the time being, still locate the elements with Selenium.
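
    As a generic pattern, here is a sketch of a common alternative to fixed-step scrolling (the code below instead scrolls in 230-pixel steps): keep scrolling to the bottom until the page height stops growing. The helper name scroll_to_end is mine, not from the original post:

    def scroll_to_end(browser, pause=1, max_rounds=30):
        # scroll until document.body.scrollHeight stops increasing
        last_height = browser.execute_script('return document.body.scrollHeight')
        for _ in range(max_rounds):
            browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            time.sleep(pause)
            new_height = browser.execute_script('return document.body.scrollHeight')
            if new_height == last_height:
                break
            last_height = new_height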

    After logging in, opening the target page brings up an advertisement pop-up, which has to be closed first:

    def close_win(browser):
        time.sleep(10)
        try:
            closewindow = browser.find_element_by_class_name('next-dialog-close')
            browser.execute_script("arguments[0].click();", closewindow)
        except Exception as e:
            print(f"there is no popup window to close. e = {e}")
        return browser
    

    Simulate the scrolling and collect the URLs of every product on the page:

    def get_products(browser):
        wait = WebDriverWait(browser, 1)
        for i in range(30):
            browser.execute_script('window.scrollBy(0,230)')
            time.sleep(1)
            products = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME,"product-info")))
            if len(products) >= 60:
                break
            else:
                print(len(products))
                continue
        products = browser.find_elements_by_class_name('product-info')
        return products 

    Later, a senior classmate pointed out that none of this is necessary: although the product cards on the search page are only rendered by JS as you scroll, the product data is actually embedded in the HTML document and can be pulled out like this:

    url = 'https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&SearchText=cartoon+case&ltype=wholesale&SortType=default&page='
    driver = webdriver.Chrome()
    driver.get(url)
    info = re.findall('window.runParams = ({.*})',driver.page_source)[-1]
    infos = json.loads(info)
    items = infos['items']  

    From there you can take your time matching the fields.
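
    The keys inside each entry of items are not documented here, so before writing any matching code it helps to dump one entry and look. A minimal sketch that assumes nothing about the schema:

    for item in items[:3]:
        # inspect the structure first, then decide which fields to extract
        if isinstance(item, dict):
            print(sorted(item.keys()))
        else:
            print(type(item), item)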


    Getting Details from the Product Page

    The problem in this part is that there are many pages to crawl, and continuing to locate elements with Selenium would make the crawler very slow; besides, the data on the product page does not seem to be returned asynchronously. The solution is to use Selenium only to open the product page, pull down the whole page source, and then match the elements with regular expressions:

    def get_pro_info(product):
        url = product.find_element_by_class_name('item-title').get_attribute('href')
        driver = webdriver.Chrome()
        driver.get(url)
        page = driver.page_source
        driver.close()
        material=re.findall(r'"skuAttr":".*?#(.*?);',page)
        color=re.findall(r'skuAttr":".*?#.*?#(.*?)"',page)
        stock=re.findall(r'skuAttr":".*?"availQuantity":(.*?),',page)
        price=re.findall(r'skuAttr":".*?"actSkuCalPrice":"(.*?)"',page)
        pics = re.findall(r'<div class="sku-property-image"><img class="" src="(.*?)"', page)
        titles = re.findall(r'<img class="" src=".*?" title="(.*?)">', page)
        video = re.findall(r'id="item-video" src="(.*?)"', page)
        return material, color, stock, price, pics, titles, video
    

     


    Writing to MySQL

    The scraped data has to be stored in a database, so MySQL is used here; the database crawl and the table SKU were created in advance:

    conn = pymysql.connect(host='localhost', user='root', password='ab226690',db='crawl')
    mycursor = conn.cursor()
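
    For reference, a rough sketch of what the pre-built tables could look like; the column names are taken from the INSERT statements in this post, but the types, lengths and keys are my assumptions, not the author's actual schema:

    # assumed schema, reconstructed from the INSERT statements; adjust types as needed
    mycursor.execute("""
        CREATE TABLE IF NOT EXISTS SKU (
            skuID    VARCHAR(32) PRIMARY KEY,
            material VARCHAR(255),
            color    VARCHAR(255),
            stock    VARCHAR(32),
            price    VARCHAR(32),
            url      VARCHAR(64)
        )""")
    mycursor.execute("""
        CREATE TABLE IF NOT EXISTS image (
            url   VARCHAR(64),
            color VARCHAR(255),
            img   VARCHAR(512)
        )""")
    mycursor.execute("""
        CREATE TABLE IF NOT EXISTS product (
            url          VARCHAR(64),
            product_name VARCHAR(512),
            rating       VARCHAR(16),
            reviews      VARCHAR(16),
            video        VARCHAR(512),
            shipping     VARCHAR(32)
        )""")
    conn.commit()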

    The list data is written in a loop. One gotcha: pymysql's parameter conversion on INSERT is not exactly the same as Python's; just match every parameter with %s, with no need for integer or float placeholders for numeric columns.

        # write to the SKU table
        sql = "INSERT INTO SKU(skuID,material,color,stock,price, url) VALUES (%s,%s,%s,%s,%s,%s)"  # right here: even though some variables are numeric, they are still bound with %s
        for i in range(len(skuID)):
            if titles:
                params = (skuID[i], material[i], color[i], stock[i], price[i],url)
            else:
                params = (skuID[i], material[i], ' ', stock[i], price[i],url)
            try: 
                mycursor.execute(sql,params)
                conn.commit()
            except IntegrityError: # raised on a duplicate primary key; the intent is to skip the record when the key already exists, but in practice this still crashes. The lazy fix is to drop the primary key, which doesn't feel right; I'll update this once I know the proper solution
                conn.rollback()
                continue
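
    The likely reason the except clause still fails is that the complete code at the end imports IntegrityError from sqlalchemy.exc, while pymysql raises its own pymysql.err.IntegrityError, so the duplicate-key error never matches the handler. That is my reading of the problem, not something verified in the original post; a minimal sketch of the fix:

    from pymysql.err import IntegrityError  # the exception class pymysql actually raises

    for i in range(len(skuID)):
        params = (skuID[i], material[i], color[i], stock[i], price[i], url)
        try:
            mycursor.execute(sql, params)
            conn.commit()
        except IntegrityError:
            # duplicate primary key: skip this record
            conn.rollback()
            continue

    Alternatively, MySQL can be told to skip duplicates itself with INSERT IGNORE, which removes the need for the try/except altogether.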

    Another problem when implementing the writes: when re finds no match it returns an empty list, which makes the MySQL insert fail, so each variable has to be checked before writing and given a suitable value if it is an empty list:

        sql = "INSERT INTO product(url, product_name, rating, reviews, video, shipping) VALUES (%s,%s,%s,%s,%s,%s)"
    
        if rating:
            pass
        else:
            rating = '0.0'
    
        if review:
            pass
        else:
            review = '0'
        
        if video:
            pass
        else:
            video = ' '
        
        if shipping:
            pass
        else:
            shipping = '0.0'
            
        params = (url, pro_name, rating,review, video, shipping)
        mycursor.execute(sql,params)
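
    Since the same empty-list check repeats for every field, it can be folded into a small helper. A sketch of one way to do it; the helper name first_or is mine, not the author's:

        def first_or(matches, default):
            # re.findall returns a list; take the first hit or fall back to a default
            return matches[0] if matches else default

        params = (url, pro_name,
                  first_or(rating, '0.0'),
                  first_or(review, '0'),
                  first_or(video, ' '),
                  first_or(shipping, '0.0'))
        mycursor.execute(sql, params)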

    Commit and close the database connection:

    conn.commit()
    conn.close()

    Improving Speed

    Besides the trick above of fetching with Selenium and then matching with re, there is one more thing that improves crawler efficiency:

    browser = webdriver.Chrome()
    browser.get(source_url)
    browser = close_win(browser)  

    Repeatedly instantiating and closing the browser driver like this is very time-consuming, so the site should be visited with as few browser windows as possible.

    In this task only two webdrivers are instantiated, one for the product listing pages and one for the product detail pages; concretely, once instantiated these two drivers are never discarded, they just keep getting new pages. The original code started a fresh webdriver for every page it opened; after this change the running time dropped by half.

    def scratch_page(source_url):
        browser = webdriver.Chrome()
        browser.get(source_url)
        browser.maximize_window()
        browser = close_win(browser)
        pros = get_products(browser)
        # browser for the product detail pages
        browser2 = webdriver.Chrome()
        error_file = open('ERROR.txt','a+',encoding='utf8')
        for pro in pros:
            # get_pro_info is slightly modified from the earlier version to take a shared driver
            url, pro_name, skuID, material, color, stock, price, pics, titles, video,rating,shipping, review = get_pro_info(pro, browser2)
            if len(skuID)!=len(color):
                error_file.write('url:'+url+'\n')
                continue
            save_data_to_sql(url,pro_name, skuID, material, color, stock, price, pics, titles, video,rating,shipping,review)
        error_file.close()
        browser.close()
        browser2.close()

      


    Complete Code

    from selenium import webdriver
    import time 
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import re
    import pymysql
    from sqlalchemy.exc import IntegrityError # to catch the duplicate-primary-key exception
    
    def login(username, password, driver=None):
        driver.get('https://login.aliexpress.com/')
        driver.maximize_window()
        name = driver.find_element_by_id('fm-login-id')
        name.send_keys(username)
        name1 = driver.find_element_by_id('fm-login-password')
        name1.send_keys(password)
        submit = driver.find_element_by_class_name('fm-submit')
        time.sleep(1)
        submit.click()
        return driver
    
    def close_win(browser):
        time.sleep(5)
        try:
            closewindow = browser.find_element_by_class_name('next-dialog-close')
            closewindow.click()
        except Exception as e:
            print(f"searchKey: there is no suspond Page1. e = {e}")
        return browser
    
    def get_products(browser):
        wait = WebDriverWait(browser, 1)
        for i in range(30):
            browser.execute_script('window.scrollBy(0,230)')
            time.sleep(1)
            products = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME,"product-info")))
            if len(products) >= 60:
                break
            else:
                continue
        products = browser.find_elements_by_class_name('product-info')
        return products
    
    def get_pro_info(product, driver):
        url = product.find_element_by_class_name('item-title').get_attribute('href')
        driver.get(url)
        time.sleep(0.5)
        page = driver.page_source
        material=re.findall(r'"skuAttr":".*?#(.*?);',page)
        color=re.findall(r'"skuAttr":".*?#.*?#(.*?)"',page)
        stock=re.findall(r'"skuAttr":".*?"availQuantity":(.*?),',page)
        price=re.findall(r'"skuAttr":".*?"skuCalPrice":"(.*?)"',page)
        pics = re.findall(r'<div class="sku-property-image"><img class="" src="(.*?)"', page)
        titles = re.findall(r'<img class="" src=".*?" title="(.*?)">', page)
        video = re.findall(r'id="item-video" src="(.*?)"', page)
        skuID = re.findall(r'"skuId":(.*?),',page)
        pro_name = re.findall(r'"product-title-text">(.*?)</h1>', page)
        rating = re.findall(r'itemprop="ratingValue">(.*?)</span>', page)
        shipping = re.findall(r'<span class="bold">(.*?)&nbsp;', page)
        review = re.findall(r'"reviewCount">(.*?) Reviews</span>', page)
        # when the product has no colour options, the page structure changes and the fields have to be re-matched
        if titles:
            pass
        else:
            material = re.findall(r'"skuAttr":".*?#(.*?)"', page)
            color=[]
            pics = re.findall(r'"imagePathList":\["(.*?)",', page)
        return url, pro_name, skuID, material, color, stock, price, pics, titles, video,rating,shipping, review 
    
    def save_data_to_sql(url,pro_name, skuID, material, color, stock, price, pics, titles, video,rating,shipping,review):
        url = re.findall('/item/(.*?).html',url)
    #    try:
        conn = pymysql.connect(host='localhost', user='root', password='ab226690',db='crawl')
        mycursor = conn.cursor()
        # write to the SKU table
        sql = "INSERT INTO SKU(skuID,material,color,stock,price, url) VALUES (%s,%s,%s,%s,%s,%s)"
        for i in range(len(skuID)):
            if titles:
                params = (skuID[i], material[i], color[i], stock[i], price[i],url)
            else:
                params = (skuID[i], material[i], ' ', stock[i], price[i],url)
    #        mycursor.execute(sql,params)
    #        conn.commit()
            try:
                mycursor.execute(sql,params)
                conn.commit()
            except IntegrityError:
                conn.rollback()
                continue
        # write to the image table
        sql = "INSERT INTO image(url, color, img) VALUES (%s,%s,%s)"
        i = 0
        if titles:
            for i in range(len(titles)):
                params = (url, titles[i], pics[i])
    #            mycursor.execute(sql,params)
    #            conn.commit()
                try:
                    mycursor.execute(sql,params)
                    conn.commit()
                except IntegrityError:
                    conn.rollback()
                    continue
        else:
            params = (url, ' ', pics)
    #        mycursor.execute(sql,params)
    #        conn.commit()
            try:
                mycursor.execute(sql,params)
                conn.commit()
            except IntegrityError:
                conn.rollback()
        # write to the product table
        sql = "INSERT INTO product(url, product_name, rating, reviews, video, shipping) VALUES (%s,%s,%s,%s,%s,%s)"
    
        if rating:
            pass
        else:
            rating = '0.0'
    
        if review:
            pass
        else:
            review = '0'
        
        if video:
            pass
        else:
            video = ' '
        
        if shipping:
            pass
        else:
            shipping = '0.0'
            
        params = (url, pro_name, rating,review, video, shipping)
        mycursor.execute(sql,params)
        conn.commit()
    #    try:
    #        mycursor.execute(sql,params)
    #        conn.commit()
    #    except Exception:
    #        conn.rollback()
        conn.close()
    #    except Exception as e:
    #    conn.rollback()
    #    print(e)
            
    def scratch_page(source_url):
        browser = webdriver.Chrome()
        browser.get(source_url)
        browser.maximize_window()
        browser = close_win(browser)
        pros = get_products(browser)
        # browser for the product detail pages
        browser2 = webdriver.Chrome()
        error_file = open('ERROR.txt','a+',encoding='utf8')
        for pro in pros:
            url, pro_name, skuID, material, color, stock, price, pics, titles, video,rating,shipping, review = get_pro_info(pro, browser2)
            if len(skuID)!=len(color):
                error_file.write('url:'+url+'\n')
                continue
            save_data_to_sql(url,pro_name, skuID, material, color, stock, price, pics, titles, video,rating,shipping,review)
        error_file.close()
        browser.close()
        browser2.close()
        
    url = 'https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&SearchText=cartoon+case&ltype=wholesale&SortType=default&page='
    for p in range(1,11):
        url_ = url + str(p)
        start_time = time.time()
        scratch_page(url_)
        end_time = time.time()
        print('Page ' + str(p) + ' crawled successfully')
        print('Page ' + str(p) + ' took ' + str(end_time - start_time) + 's')