  • pyspider

    #!/usr/bin/env python
    # -*- encoding: utf-8 -*-
    # Created on 2019-09-29 16:43:19
    # Project: test
    
    from pyspider.libs.base_handler import *
    
    
    class Handler(BaseHandler):
        crawl_config = {
        }
    
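        # Re-run on_start once a day.
        @every(minutes=24 * 60)
        def on_start(self):
            self.crawl('https://github.com/trending', callback=self.index_page)
    
        # Treat a fetched page as fresh for 10 days before re-crawling it.
        @config(age=10 * 24 * 60 * 60)
        def index_page(self, response):
            # response.doc is a PyQuery object; follow each trending repository link.
            for each in response.doc('h1.h3.lh-condensed a').items():
                self.crawl(each.attr.href, callback=self.detail_page)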
    
        @config(priority=2)
        def detail_page(self, response):
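            # Collect the text of every '.social-count' element, in page order.
            # (Named 'counts' to avoid shadowing the built-in 'list'.)
            counts = [each.text() for each in response.doc('.social-count').items()]
            # Pad with None so a page with fewer than three counters
            # does not raise IndexError.
            counts += [None] * (3 - len(counts))
    
            return {
                "url": response.url,
                "title": response.doc('title').text(),
                "watch": counts[0],
                "star": counts[1],
                "fork": counts[2],
            }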
    		
    

      

    Reference: https://www.cnblogs.com/lei0213/p/7676254.html


    An ID is written after #
    A class value is written after a dot .
    A tag name is written as-is
    A space denotes descendant nodes

    When the class attribute contains a space (several class names):
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    join the class names with .
    li = doc('.item-1.active')


    Two ways to get an attribute value
    print(item.attr.href)
    print(item.attr('href'))

    Get parent and child nodes
    print(item.parent())
    print(item.children())

    # Note: this selects only the children of the ul tag that have a class attribute
    print(item.children('[class]'))


    Get a tag's text content
    doc("a").text()

  • Original article: https://www.cnblogs.com/hushpa/p/11654203.html