  • Using re with XPath in Scrapy

    When using XPath in Scrapy, XPath syntax alone does not always get you the data you want.

    Take the following HTML source as an example:

    <div class="db_contout">
      <div class="db_cont">
        <div class="details_nav">
          <a href="http://movie.mtime.com/79055/addimage.html" class="db_addpic" target="_blank">
            <strong class="px16">+</strong> 添加图片</a>
          <ul id="imageNavUl">
            <li><i>&nbsp;</i><a href="http://movie.mtime.com/79055/posters_and_images/">全部图片</a></li>
            <li><i>&nbsp;</i><a href="#">剧照</a></li>
            <li><i>&nbsp;</i><a href="#">海报</a></li>
            <li><i>&nbsp;</i><a href="#">工作照</a></li>
            <li><i>&nbsp;</i><a href="#">新闻图片</a></li>
            <li><i>&nbsp;</i><a href="#">桌面</a></li>
            <li><i>&nbsp;</i><a href="#">封套</a></li>
          </ul>
        </div>
        <div class="db_pictypeout">
          <div class="pictypenav clearfix">
            <ul id="imageSubNavUl" class="fl mt3">
            </ul>
            <div id="filters" class="db_selbox fr">
            </div>
          </div>
          <dl id="imagesDiv" class="db_pictypelist clearfix">
          </dl>
          <div id="pageDiv">
          </div>
        </div>
      </div>
    </div>
    <div id="M13_B_DB_Movie_FooterTopTG"></div>
    <script type="text/javascript">
        var imageList = [{"stagepicture":[{"officialstageimage":[
          {"id":1059362,"title":"官方剧照 #16","type":6,"subType":6001,"status":1,"img_220":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_220X220.jpg","img_1000":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_1000X1000.jpg","width":3233,"height":2000,"fileSize":5472,"enterTime":"2009-07-09","enterNickName":"jackali","description":"","commentCount":0,"imgDetailUrl":"http://movie.mtime.com/79055/posters_and_images/1059362/","topNum":4,"newIndex":37,"typeHotIndex":0,"typeNewIndex":37,"img_235":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_235X235.jpg"},
          {"id":829271,"title":"官方剧照 #06","type":6,"subType":6001,"status":1,"img_220":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_220X220.jpg","img_1000":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_1000X1000.jpg","width":842,"height":477,"fileSize":74,"enterTime":"2008-12-17","enterNickName":"边界","description":"","commentCount":0,"imgDetailUrl":"http://movie.mtime.com/79055/posters_and_images/829271/","topNum":0,"newIndex":51,"typeHotIndex":1,"typeNewIndex":51,"img_235":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_235X235.jpg"},
          {"id":625583,"title":"官方剧照 

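    The goal is to get the picture source path that follows each img_1000 key. These values sit inside the JavaScript text of a <script> element rather than in HTML elements or attributes, and I could not find a way to get them directly with XPath syntax. As a workaround, following http://www.cnblogs.com/Garvey/p/6697162.html, I select the <script> nodes with XPath and then use re to pull out the content I need.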
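    To illustrate what the regular expression captures, here is a minimal sketch that uses only Python's standard re module instead of Scrapy. The script_text string and its example.com URLs are made up for the illustration; only the pattern is the one used in this post.

```python
import re

# Hypothetical, shortened stand-in for the page's inline script (not the real page).
script_text = (
    'var imageList = [{"id":1,"img_220":"http://example.com/a_220X220.jpg",'
    '"img_1000":"http://example.com/a_1000X1000.jpg"},'
    '{"id":2,"img_220":"http://example.com/b_220X220.jpg",'
    '"img_1000":"http://example.com/b_1000X1000.jpg"}];'
)

# Lazy match: capture from just after the img_1000 key up to the first 'jpg"'.
pattern = r'"img_1000":"(.+?jpg)"'

# With a single capture group, findall returns a list of the captured strings.
urls = re.findall(pattern, script_text)
for url in urls:
    print(url)
```

    Because the key name is part of the pattern, the img_220 entries are skipped and only the img_1000 paths come back.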

    import scrapy

    # S0819MtimeTiantangItem is the Item class from this project's items.py;
    # import it from there (the module path depends on your project name).


    class MtimeSpider(scrapy.Spider):
        name = "mtime"
        # allowed_domains takes bare domain names, not URLs
        allowed_domains = ["mtime.com"]
        start_urls = (
            'http://movie.mtime.com/79055/posters_and_images/posters/hot.html',
        )

        def parse(self, response):
            # Select the inline <script> nodes with XPath, then run a regex over
            # them; .re() returns the captured groups as a list of strings.
            allpics = response.xpath("//script[@type='text/javascript']").re(r'"img_1000":"(.+?jpg)"')
            print(len(allpics))
            # The item name is simply the 1-based position of the picture
            for i, pic in enumerate(allpics, start=1):
                item = S0819MtimeTiantangItem()
                item['name'] = str(i)
                item['addr'] = pic
                print("+++++" + item['addr'])
                print("+++++" + item['name'])
                yield item
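
    A different approach, not the one used in this post, is to treat the imageList value as JSON and parse it rather than matching individual fields with a regex. The sketch below assumes the statement has the form `var imageList = [...];` and ends with `];` (the sample above is truncated, so this is a guess). The sample data and the collect_values helper are hypothetical.

```python
import json
import re

# Hypothetical sample that mimics the nesting of the page's imageList variable.
script_text = (
    'var imageList = [{"stagepicture":[{"officialstageimage":['
    '{"id":1,"img_1000":"http://example.com/a_1000X1000.jpg"},'
    '{"id":2,"img_1000":"http://example.com/b_1000X1000.jpg"}]}]}];'
)


def collect_values(node, key):
    """Recursively collect every value stored under `key` in nested dicts and lists."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            else:
                found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for child in node:
            found.extend(collect_values(child, key))
    return found


# Assumption: the array literal runs from the first '[' to the last '];'.
match = re.search(r'var imageList = (\[.*\]);', script_text, re.S)
image_list = json.loads(match.group(1))
urls = collect_values(image_list, "img_1000")
print(urls)
```

    Parsing the JSON also gives access to the other fields of each picture (title, width, height and so on), at the cost of depending on the exact shape of the script.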
  • Original article: https://www.cnblogs.com/v-BigdoG-v/p/7398787.html