  • Pitfalls I ran into while using scrapy

    1. At first I wanted to use scrapy together with selenium to crawl 什么值得买 (smzdm.com), but ran into a strange problem. Here is the code:

        # module-level imports this snippet relies on
        import time
        from selenium import webdriver
        from scrapy.selector import Selector

        def start_requests(self):
            self.logger.info("starting")
            browser = webdriver.Firefox()
            browser.get(self.start_url)
            last_height = browser.execute_script("return document.body.scrollHeight")
            print(last_height)
            count = 0
            while True:
                print(count)
                if count == 2:
                    break
                # scroll to the bottom so more items get lazy-loaded
                browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(2)
                new_height = browser.execute_script("return document.body.scrollHeight")
                if new_height == last_height:
                    break
                last_height = new_height
                time.sleep(1.2)
                count = count + 1
            source = browser.page_source
            browser.close()

            scrapy_selector = Selector(text=source)
            items_selector = scrapy_selector.xpath('//div[@class="z-feed-content"]')
            self.logger.info('Theres a total of ' + str(len(items_selector)) + ' links.')
            try:
                s = 0
                for item_selector in items_selector:
                    print(s)
                    print(item_selector.getall())
                    # wrong (keep in mind that if you are nesting selectors and use an XPath
                    # that starts with /, that XPath will be absolute to the document and not
                    # relative to the Selector you're calling it from):
                    # url_selector = item_selector.xpath('//h5[@class="feed-block-title has-price"]/a/@href')

                    # wrong (multiple classes should be matched with:
                    # *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]):
                    # url_selector = item_selector.xpath('.//h5[@class="feed-block-title has-price"]/a/@href')

                    url_selector = item_selector.xpath(".//h5[contains(concat(' ', normalize-space(@class), ' '), 'feed-block-title')]/a/@href")

                    # assert isinstance(url_selector, scrapy.selector.Selector)
                    print(url_selector.extract())

                    url = url_selector.get()
                    s = s + 1
                    # self.logger.info("sss" + url)
            except Exception as e:
                self.logger.info('Reached last iteration #' + str(e) + str(s))

            return
    browser.page_source is the html of the entire page in the browser, scrapy_selector is a selector built on that whole page, and items_selector is the list of selectors for the div blocks that hold the product info on the front page. All of that works fine. The problem is the line originally highlighted in red:

    url_selector = item_selector.xpath('//h5[@class="feed-block-title has-price"]/a/@href')

    item_selector is the selector for a single product div block, and print(item_selector.getall()) prints the right content. The problem is url_selector, the selector for the product link inside that block: print(url_selector.extract()) turns out to be a list of 18 urls:
    ['https://www.smzdm.com/p/20610761/#hfeeds', 'https://www.smzdm.com/p/20601553/#hfeeds', 'https://www.smzdm.com/p/20597500/#hfeeds', 'https://www.smzdm.com/p/20603303/#hfeeds', 'https://www.smzdm.com/p/20613198/#hfeeds', 'https://www.smzdm.com/p/20601438/#hfeeds', 'https://www.smzdm.com/p/20615602/#hfeeds', 'https://www.smzdm.com/p/20596520/#hfeeds', 'https://www.smzdm.com/p/20617429/#hfeeds', 'https://www.smzdm.com/p/20607426/#hfeeds', 'https://www.smzdm.com/p/20615296/#hfeeds', 'https://www.smzdm.com/p/20618224/#hfeeds', 'https://www.smzdm.com/p/20603149/#hfeeds', 'https://www.smzdm.com/p/20604376/#hfeeds', 'https://www.smzdm.com/p/20603224/#hfeeds', 'https://www.smzdm.com/p/20615599/#hfeeds', 'https://www.smzdm.com/p/20615846/#hfeeds', 'https://www.smzdm.com/p/20586712/#hfeeds']
    There are 60 item_selectors in total, and the url_selector inside every one of them is the same 18-element list (why it is exactly the first 18 counting from the top, I have no idea).
    The cause turned out to be spelled out in the official documentation:
    Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you're calling it from.

    So the reason is clear: an XPath that starts with / or // is evaluated against the whole document, not relative to item_selector. I changed it to

    url_selector = item_selector.xpath('.//h5[@class="feed-block-title has-price"]/a/@href')

    This time no value came back at all. Then I found another passage in the official docs:

    Because an element can contain multiple CSS classes, the XPath way to select elements by class is the rather verbose: *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')].

    If you use @class='someclass' you may end up missing elements that have other classes, and if you just use contains(@class, 'someclass') to make up for that you may end up with more elements that you want, if they have a different class name that shares the string someclass.

    That is fairly verbose, but after changing it to the following the url is extracted correctly:
    url_selector = item_selector.xpath(".//h5[contains(concat(' ', normalize-space(@class), ' '), 'feed-block-title')]/a/@href")
     Since that is cumbersome, the official docs suggest selecting by class with CSS first and then switching to XPath:
    >>> sel.css('.shout').xpath('./time/@datetime').getall()
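    Applied to this page, the chained form would look roughly like the sketch below (treat the exact selector as an assumption about the current smzdm markup):

    # hypothetical sketch: match the class with CSS, then take the link with a relative XPath
    url = item_selector.css('h5.feed-block-title').xpath('./a/@href').get()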
    
    

     2. The page html is

    <h1 class="item-name">
                                    <span class="edit_interface"></span>
                                    闲鱼出售全新ipad pro 2018翻车日记                            </h1>
    but goods_scrapy_selector.xpath("//article/h1/text()") returns a two-element list instead of a single string:

    ['\n                                ', '\n                                闲鱼出售全新ipad pro 2018翻车日记                            ']
     At first I guessed that the text contains line breaks and every line break becomes a separate record? The actual explanation is in the XPath spec:
    -- text() selects all text node children of the context node. from https://www.w3.org/TR/1999/REC-xpath-19991116/#section-String-Functions
    In other words, text() returns every text-node child of the context node. Because the h1 also contains a span, the whitespace before the span is one text node and the string "闲鱼出售全新ipad pro 2018翻车日记" that follows the (empty) span is another, so two entries come back.
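    A sketch of how the title alone could be pulled out, assuming the markup stays as above (goods_scrapy_selector comes from the snippet; the rest is just one possible approach):

    # join all text-node children of the h1 and strip the surrounding whitespace
    title = "".join(goods_scrapy_selector.xpath("//article/h1/text()").getall()).strip()
    # or let XPath collapse the whitespace directly
    title = goods_scrapy_selector.xpath("normalize-space(//article/h1)").get()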

    3. scrapy 2.0.1
    Scrapy's original console log output was:
    2020-05-28 22:56:06,765 - smzdm_jingxuan - INFO - smzdm_jingxuan spider starting
    
    
    I wanted to change the log format so that the line number is printed. My first thought was to change logging.basicConfig:
      import logging

      import scrapy


      class SmzdmSpider(scrapy.Spider):

          name = 'smzdm_jingxuan'
          allowed_domains = ['spider.smzdm']
          start_urls = ("http://books.toscrape.com/",)
          # logging.basicConfig(level=logging.INFO,
          #     format='%(asctime)s %(pathname)s %(filename)s %(funcName)s %(lineno)d %(levelname)s - %(message)s',
          #     datefmt='%Y-%m-%d %H:%M')

          logging.basicConfig(
              format='%(asctime)s,%(msecs)d %(levelname)-8s [%(pathname)s:%(lineno)d in function %(funcName)s] %(message)s',
              datefmt='%Y-%m-%d:%H:%M:%S',
              level=logging.INFO)

          logger = logging.getLogger(__name__)

    The console output was now

    2020-05-28 22:53:25,197 - smzdmCrawler.spiders.smzdm_jingxuan - INFO - smzdm_jingxuan spider starting

    It changed a little, but no line number was printed and the output still didn't match the configured format. (Presumably Scrapy installs its own log handlers when it starts up, so a basicConfig call inside the spider module has little effect; at the time I didn't know the reason.)

    Searching online, I found that settings.py supports logging options such as LOG_FILE, LOG_ENABLED and LOG_FORMAT, so I set:

    IMAGES_STORE = '/Users/gaoxianghu/temp/image'
    
    LOG_FILE = '/Users/gaoxianghu/temp/scrapy_log.log'
    
    LOG_ENABLED = False
    
    LOG_FORMAT = '[%(asctime)s] p%(process)s {%(pathname)s:%(lineno)d} %(levelname)s - %(message)s'

    It turned out LOG_ENABLED = False had no effect: logs still appeared both on the console and in the log file. The log file did use the LOG_FORMAT I set, but the console format stayed unchanged (reason unknown).

    As for why LOG_ENABLED = False doesn't take effect, some people online say doing this works instead:

    
    
    logging.getLogger('scrapy').propagate = False
    The log format inside the log file was
    [2020-05-28 22:16:12] p41288 {/Users/gaoxianghu/git/cheap/smzdmCrawler/smzdmCrawler/spiders/smzdm_jingxuan.py:38} INFO - smzdm_jingxuan spider starting

     Later I found that writing the logging configuration in settings.py does change the console log:

    settings.py
    logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(pathname)s %(filename)s %(funcName)s %(lineno)d %(levelname)s - %(message)s')

    Console output

    2020-05-28 23:30:45,008 /Users/gaoxianghu/git/cheap/smzdmCrawler/smzdmCrawler/spiders/smzdm_jingxuan.py smzdm_jingxuan.py parse 34 INFO - smzdm_jingxuan spider starting
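    For completeness, the Scrapy logging docs also describe a way to take over logging entirely: disable Scrapy's root handler and call basicConfig yourself. A minimal sketch (the format string is only an example):

    import logging
    from scrapy.utils.log import configure_logging

    # keep Scrapy from installing its own root log handler, then configure the root logger
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s %(pathname)s:%(lineno)d %(funcName)s %(levelname)s - %(message)s')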

    4. When I used a relative import in a scrapy spider, running scrapy crawl smzdm_jingxuan failed with: attempted relative import with no known parent package. The original absolute import, from smzdmCrawler.items import SmzdmItem, had worked fine. The relative form was

    from .. import items

    The project layout is:

    smzdmCrawler
    |--model
    |--spider
    |--|--smzdm_jingxuan.py
    |--items.py
    |--__init__.py
    |--main.py

    Since smzdmCrawler already contains an __init__.py, it is a package. Online answers say whether a module is treated as part of a package depends on its __name__; in my case it came down to the directory scrapy crawl smzdm_jingxuan was run from. I had been running it from /Users/gaoxianghu/git/cheap/smzdmCrawler/smzdmCrawler; running it from /Users/gaoxianghu/git/cheap/smzdmCrawler instead made the relative import work.
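    The two import styles, side by side (a sketch using the module names above):

    # absolute import - works as long as the project root is on sys.path
    from smzdmCrawler.items import SmzdmItem

    # relative import - only resolves when smzdmCrawler is actually loaded as a package,
    # which here depended on the directory scrapy crawl was run from
    from .. import items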

    5. When deploying with scrapyd, be aware that the program cannot read your shell environment variables. After deploying with scrapyd-deploy, the code is interpreted once first, and if that import fails because an environment variable can't be read, you get an error. For example, with code like this:

    import os
    from logging.handlers import TimedRotatingFileHandler

    SCRAPY_ENV = os.environ.get('SCRAPY_ENV', None)

    # LOG_FILE is only passed in when running in production
    if LOG_FILE:
        log_file = LOG_FILE
        image_file = '/data/image/' + today_str
    else:
        if SCRAPY_ENV == None:
            log_file = '/Users/gaoxianghu/temp/scraping.log'
            image_file = '/Users/gaoxianghu/temp/image/' + today_str
        else:
            log_file = '/data/log/scrapy/scraping.log'
            image_file = '/data/image/' + today_str

    logHandler = TimedRotatingFileHandler(log_file, when='midnight', interval=1)

    Although I had set the SCRAPY_ENV environment variable on the server, it could not be read because the process wasn't running in my shell environment, so log_file fell back to '/Users/gaoxianghu/temp/scraping.log', a path that doesn't exist on the server, and it errored out. I then ran

    curl http://david_scrapyd:david_2021@42.192.51.99:6801/schedule.json -d project=smzdmCrawler -d spider=smzdm_single -d setting=LOG_FILE=/data/log/scrapy/scraping.log

    to start the crawl. With LOG_FILE passed as a setting, the code above should no longer set log_file to '/Users/gaoxianghu/temp/scraping.log', yet the same error appeared, except that the log shows the code running out of a temporary egg file. Why it still fails isn't entirely clear to me, because when I started the service locally it did pick up the LOG_FILE that was passed in. Maybe, since the interpretation at deploy time had already failed, the code still gets interpreted once before running, which triggers the error.

    File "/tmp/smzdmCrawler-1614340245-de7610pr.egg/smzdmCrawler/settings.py"
    FileNotFoundError: [Errno 2] No such file or directory: '/Users/gaoxianghu/temp/scraping.log'
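    One way to keep settings.py importable even when the log directory does not exist is to create it (or fall back) at import time. A rough sketch, not the fix actually used here; SCRAPY_LOG_FILE and the default path are placeholders:

    import os

    log_file = os.environ.get('SCRAPY_LOG_FILE', '/tmp/scraping.log')
    # make sure the directory exists so TimedRotatingFileHandler can open the file later
    os.makedirs(os.path.dirname(log_file), exist_ok=True)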

    6. Another thing to note with scrapyd: according to https://github.com/scrapy/scrapyd-client#scrapyd-deploy,

    You may want to keep certain settings local and not have them deployed to Scrapyd. To accomplish this you can create a local_settings.py file at the root of your project, where your scrapy.cfg file resides, and add the following to your project's settings:
    
    try:
        from local_settings import *
    except ImportError:
        pass
    scrapyd-deploy doesn't deploy anything outside of the project module, so the local_settings.py file won't be deployed.

    From my own test: when scrapyd is run locally and the project is deployed to that local scrapyd, local_settings.py can still be read, even though it is not inside the egg. Strangely, when the project is deployed to a remote scrapyd it cannot be read, so the statement above presumably applies to remote deployment.
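    A minimal sketch of what such a local_settings.py could contain (the values are just the local paths used earlier):

    # local_settings.py - lives next to scrapy.cfg and is not packaged by scrapyd-deploy
    LOG_FILE = '/Users/gaoxianghu/temp/scrapy_log.log'
    IMAGES_STORE = '/Users/gaoxianghu/temp/image'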
