  • pjscrape - a web-scraping framework written in Javascript, using PhantomJS and jQuery

    Tutorial

    Writing Scrapers
    The core of a pjscrape script is the definition of one or more scraper functions. Here's what you need to know:

    Scraper functions are evaluated in a full browser context. This means you not only have access to the DOM, you have access to Javascript variables and functions, AJAX-loaded content, etc.

    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        scraper: function() {
            return wgPageName; // variable set by Wikipedia
        }
    });
    // Output: ["List_of_towns_in_Vermont"]
    Scraper functions are evaluated in a sandbox. Closures will not work the way you might expect:

    var myPrivateVariable = "test";
    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        scraper: function() {
            return myPrivateVariable;
        }
    });
    // CLIENT: ReferenceError: Can't find variable: myPrivateVariable
    The best way to think about your scraper functions is to assume the code is being eval()'d in the context of the page you're trying to scrape.

    Scrapers have access to a set of helper functions in the _pjs namespace. See the Javascript API docs for more info. One particularly useful function is _pjs.getText(), which returns an array of text from the matched elements:

    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        scraper: function() {
            return _pjs.getText('#sortable_table_id_0 tr td:nth-child(2)');
        }
    });
    // Output: ["Addison","Albany","Alburgh", ...]
    For this case, there's actually a shorter syntax - if your scraper is a string instead of a function, pjscrape will assume it is a selector and use it in a function like the one above:

    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        scraper: '#sortable_table_id_0 tr td:nth-child(2)'
    });
    // Output: ["Addison","Albany","Alburgh", ...]
    Scrapers can return data in whatever format you want, provided it's JSON-serializable (so you can't return a jQuery object, for example). The following code returns the list of towns in the Django fixture syntax:

    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        scraper: function() {
            return $('#sortable_table_id_0 tr').slice(1).map(function() {
                var name = $('td:nth-child(2)', this).text(),
                    county = $('td:nth-child(3)', this).text(),
                    // convert relative URLs to absolute
                    link = _pjs.toFullUrl(
                        $('td:nth-child(2) a', this).attr('href')
                    );
                return {
                    model: "myapp.town",
                    fields: {
                        name: name,
                        county: county,
                        link: link
                    }
                }
            }).toArray(); // don't forget .toArray() if you're using .map()
        }
    });
    /* Output:
    [{"fields":{"link":"http://en.wikipedia.org/wiki/Addison,_Vermont",
    "county":"Addison","name":"Addison"},"model":"myapp.town"}, ...]
    */
    Scraper functions can always access the version of jQuery bundled with pjscrape (currently v.1.6.1). If you're scraping a site that also uses jQuery, and you want the latest features, you can set noConflict: true and use the _pjs.$ variable:

    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        noConflict: true,
        scraper: function() {
            return [
                window.$().jquery, // the version Wikipedia is using
                _pjs.$().jquery // the version pjscrape is using
            ];
        }
    });
    // Output: ["1.4.2","1.6.1"]
    Asynchronous Scraping
    Docs coming soon. For now, see:

    Test for the ready option - wait for a ready condition before starting the scrape.
    Test for asynchronous scrapes - the scraper function is expected to set _pjs.items when its scrape is complete (a sketch follows this list).
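
    As a rough illustration of those two tests, here is a minimal sketch of an asynchronous suite. The ready option is described above only as "wait for a ready condition before starting the scrape", so the signature shown here (a function evaluated in the page that returns true when the page is ready) is an assumption, as is the way the scraper signals completion via _pjs.items; the setTimeout work and the selectors are purely illustrative - confirm the details against the linked tests.

    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        // assumed signature: evaluated in the page; the scrape starts once this returns true
        ready: function() {
            return !!document.getElementById('sortable_table_id_0');
        },
        scraper: function() {
            // do some asynchronous work, then signal completion by setting
            // _pjs.items (the convention described above) instead of returning a value
            window.setTimeout(function() {
                _pjs.items = _pjs.getText('#sortable_table_id_0 tr td:nth-child(2)');
            }, 1000);
        }
    });
    // Expected output, if the assumptions hold: ["Addison","Albany","Alburgh", ...]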
    Crawling Multiple Pages
    Docs coming soon - the main thing is to set the moreUrls option to either a function or a selector that identifies more URLs to scrape. For now, see the tests below and the sketch that follows them:

    Test for basic recursive scrape
    Test for recursive scrape, setting maximum scrape depth
    Test for recursive scrape, using a selector to find more URLs
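
    As a rough sketch of the recursive options described above: moreUrls can be a selector that matches links to follow, or a function returning an array of URLs to add to the queue. The selector below and the maxDepth option name are assumptions made for illustration - check the linked tests for the exact option names and behavior.

    pjs.addSuite({
        url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
        // follow the town links in the second column of the table
        moreUrls: '#sortable_table_id_0 tr td:nth-child(2) a',
        maxDepth: 1, // assumed name for the maximum-scrape-depth option
        scraper: function() {
            // collect the page heading of the index page and of each linked town page
            return _pjs.getText('h1');
        }
    });
    // Expected output, roughly: ["List of towns in Vermont","Addison, Vermont", ...]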

  • Original article: https://www.cnblogs.com/lexus/p/2428701.html