  • Scrapy Item Loaders

    Item Loaders

    Basic usage

    # items.py: the item that the spider below imports and populates
    from dataclasses import dataclass, field
    from typing import Optional
    
    @dataclass
    class Product:
        name: Optional[str] = field(default=None)
        price: Optional[float] = field(default=None)
        stock: Optional[int] = field(default=None)
        last_updated: Optional[str] = field(default=None)
    
    
    # spider
    from scrapy.loader import ItemLoader
    from myproject.items import Product
    
    def parse(self, response):
        l = ItemLoader(item=Product(), response=response)
        l.add_xpath('name', '//div[@class="product_name"]')
        l.add_xpath('name', '//div[@class="product_title"]')
        l.add_xpath('price', '//p[@id="price"]')
        l.add_css('stock', 'p#stock')
        l.add_value('last_updated', 'today') # you can also use literal values
        return l.load_item()
    

    Declaring Item Loaders

    from itemloaders.processors import TakeFirst, MapCompose, Join
    from scrapy.loader import ItemLoader
    
    class ProductLoader(ItemLoader):
    
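        default_output_processor = TakeFirst()  # with multiple values, keep the first non-empty one
    
        name_in = MapCompose(str.title)   # input processor for the name field
        name_out = Join()  # output processor for the name field
    
        price_in = MapCompose(str.strip)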
    
        # ...
        
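    # Excerpt of the MapCompose constructor: it accepts any number of
    # functions plus default loader-context values as keyword arguments.
    class MapCompose:
    
        def __init__(self, *functions, **default_loader_context):
            self.functions = functions
            self.default_loader_context = default_loader_context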
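
    To make the roles of the input and output processors above more concrete, here is a minimal stdlib-only sketch of what MapCompose, TakeFirst and Join roughly do. The sketch_* names are hypothetical stand-ins written for this illustration, not the itemloaders implementation; loader-context handling and flattening of nested results are deliberately left out.

```python
# Simplified stand-ins for the itemloaders processors (illustration only,
# not the library's real code; loader-context handling is omitted).

def sketch_map_compose(*functions):
    """Apply each function to every value in turn; drop values that become None."""
    def process(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is not None:
                    next_values.append(result)
            values = next_values
        return values
    return process


def sketch_take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def sketch_join(values, separator=" "):
    """Concatenate the collected values with a separator."""
    return separator.join(values)


# Mirrors ProductLoader: name_in = MapCompose(str.title), name_out = Join()
name_in = sketch_map_compose(str.title)
name = sketch_join(name_in(["plasma", "tv"]))

# Mirrors price_in = MapCompose(str.strip) with the default TakeFirst() output
price_in = sketch_map_compose(str.strip)
price = sketch_take_first(price_in(["  1000 ", " 999"]))

print(name)   # Plasma Tv
print(price)  # 1000
```

    The input processor runs on the values as they are collected, while the output processor reduces the collected list to the single value stored on the item.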
    

    Declaring Input and Output Processors

    import scrapy
    from itemloaders.processors import Join, MapCompose, TakeFirst
    from w3lib.html import remove_tags
    
    def filter_price(value):
        if value.isdigit():
            return value
    
    class Product(scrapy.Item):
        name = scrapy.Field(
            input_processor=MapCompose(remove_tags),
            output_processor=Join(),
        )
        price = scrapy.Field(
            input_processor=MapCompose(remove_tags, filter_price),
            output_processor=TakeFirst(),
        )
        
    # demo:
    >>> from scrapy.loader import ItemLoader
    >>> il = ItemLoader(item=Product())
    >>> il.add_value('name', ['Welcome to my', '<strong>website</strong>'])
    >>> il.add_value('price', ['&euro;', '<span>1000</span>'])
    >>> il.load_item()
    {'name': 'Welcome to my website', 'price': '1000'}
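
    The demo drops '&euro;' from the price because filter_price returns None for it, and MapCompose discards None results. The steps can be traced with a stdlib-only sketch; strip_tags below is a hypothetical, much cruder stand-in for w3lib's remove_tags, used only so the example runs without extra packages.

```python
import re


def strip_tags(text):
    """Rough stand-in for w3lib.html.remove_tags: delete anything shaped like <...>."""
    return re.sub(r"<[^>]+>", "", text)


def filter_price(value):
    # Same function as in the article: implicitly returns None for non-numeric strings.
    if value.isdigit():
        return value


raw_prices = ["&euro;", "<span>1000</span>"]

# Step 1: strip tags from every collected value.
without_tags = [strip_tags(v) for v in raw_prices]

# Step 2: apply filter_price and discard None results, as MapCompose does.
filtered = [r for r in (filter_price(v) for v in without_tags) if r is not None]

# Step 3: the output processor (TakeFirst in the article) keeps the first value.
price = filtered[0] if filtered else None

print(without_tags)  # ['&euro;', '1000']
print(filtered)      # ['1000']
print(price)         # 1000
```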
    

    Nested Loaders

    # Works much like running a nested (relative) XPath query
    # html
    <footer>
        <a class="social" href="https://facebook.com/whatever">Like Us</a>
        <a class="social" href="https://twitter.com/whatever">Follow Us</a>
        <a class="email" href="mailto:whatever@example.com">Email Us</a>
    </footer>
    
    
    # spider (inside a callback, where `response` is available)
    loader = ItemLoader(item=Item(), response=response)
    # load stuff not in the footer
    footer_loader = loader.nested_xpath('//footer')
    footer_loader.add_xpath('social', 'a[@class = "social"]/@href')
    footer_loader.add_xpath('email', 'a[@class = "email"]/@href')
    # no need to call footer_loader.load_item()
    loader.load_item()
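
    The idea behind a nested loader is that the footer is selected once and the later XPath expressions are evaluated relative to it. A rough analogy using only the standard library's xml.etree.ElementTree (not Scrapy's selectors, and with the footer wrapped in an assumed html/body skeleton) might look like this:

```python
import xml.etree.ElementTree as ET

# The footer from the article, wrapped in a minimal page for illustration.
PAGE = """
<html><body>
<footer>
    <a class="social" href="https://facebook.com/whatever">Like Us</a>
    <a class="social" href="https://twitter.com/whatever">Follow Us</a>
    <a class="email" href="mailto:whatever@example.com">Email Us</a>
</footer>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Comparable in spirit to loader.nested_xpath('//footer'): select the footer once.
footer = root.find(".//footer")

# Relative queries, evaluated against the footer element only.
social = [a.get("href") for a in footer.findall("a[@class='social']")]
email = [a.get("href") for a in footer.findall("a[@class='email']")]

print(social)  # ['https://facebook.com/whatever', 'https://twitter.com/whatever']
print(email)   # ['mailto:whatever@example.com']
```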
    

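    Overriding and Extending Item Loaders

    Suppose a page contains the text ---Plasma TV--- and the leading and trailing dashes need to be removed after extraction. An approach like the following handles it: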

    from itemloaders.processors import MapCompose
    from myproject.ItemLoaders import ProductLoader
    
    def strip_dashes(x):
        return x.strip('-')
    
    class SiteSpecificLoader(ProductLoader):
        name_in = MapCompose(strip_dashes, ProductLoader.name_in)
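
    The effect of chaining strip_dashes in front of the parent's name_in can be checked with a small stdlib-only sketch. It assumes ProductLoader.name_in is MapCompose(str.title), as declared earlier; parent_name_in and site_specific_name_in are hypothetical stand-ins for the two processors, not Scrapy objects.

```python
def strip_dashes(x):
    return x.strip('-')


def parent_name_in(values):
    """Stand-in for ProductLoader.name_in, i.e. MapCompose(str.title)."""
    return [str.title(v) for v in values]


def site_specific_name_in(values):
    """Stand-in for MapCompose(strip_dashes, ProductLoader.name_in)."""
    stripped = [strip_dashes(v) for v in values]  # site-specific step runs first
    return parent_name_in(stripped)               # then the inherited processing


result = site_specific_name_in(["---Plasma TV---"])
print(result)  # ['Plasma Tv']
```

    Note that the inherited str.title step also lowercases the "V" in "TV", which is a side effect of the parent processor rather than of strip_dashes.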
    
  • Original article: https://www.cnblogs.com/xt12321/p/13880141.html