  • gocrawl analysis

    1. gocrawl type structures

     

    // The crawler itself, the master of the whole process
    type Crawler struct {
        Options *Options

        // Internal fields
        logFunc         func(LogFlags, string, ...interface{})
        push            chan *workerResponse
        enqueue         chan interface{}
        stop            chan struct{}
        wg              *sync.WaitGroup
        pushPopRefCount int
        visits          int

        // keep lookups in maps, O(1) access time vs O(n) for slice. The empty struct value
        // is of no use, but this is the smallest type possible - it uses no memory at all.
        visited map[string]struct{}
        hosts   map[string]struct{}
        workers map[string]*worker
    }
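
    The visited and hosts maps above use the empty struct as the value type, so only the keys take space. A minimal, self-contained sketch of this set idiom (my example, not gocrawl code):

    package main

    import "fmt"

    func main() {
        // A set implemented as map[string]struct{}: struct{}{} occupies
        // zero bytes, so the map stores only its keys.
        visited := make(map[string]struct{})
        visited["http://0value.com/"] = struct{}{} // mark as visited

        // Membership test is an O(1) map lookup, vs O(n) for a slice scan.
        if _, ok := visited["http://0value.com/"]; ok {
            fmt.Println("already visited")
        }
    }
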
    // The Options available to control and customize the crawling process.
    type Options struct {
        UserAgent             string
        RobotUserAgent        string
        MaxVisits             int
        EnqueueChanBuffer     int
        HostBufferFactor      int
        CrawlDelay            time.Duration // Applied per host
        WorkerIdleTTL         time.Duration
        SameHostOnly          bool
        HeadBeforeGet         bool
        URLNormalizationFlags purell.NormalizationFlags
        LogFlags              LogFlags
        Extender              Extender
    }
    // Extension methods required to provide an extender instance.
    type Extender interface {
        // Start, End, Error and Log are not related to a specific URL, so they don't
        // receive a URLContext struct.
        Start(interface{}) interface{}
        End(error)
        Error(*CrawlError)
        Log(LogFlags, LogFlags, string)

        // ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
        // is related to a URLContext (holds a ctx field).
        ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

        // All other extender methods are executed in the context of an URL, and thus
        // receive an URLContext struct as first argument.
        Fetch(*URLContext, string, bool) (*http.Response, error)
        RequestGet(*URLContext, *http.Response) bool
        RequestRobots(*URLContext, string) ([]byte, bool)
        FetchedRobots(*URLContext, *http.Response)
        Filter(*URLContext, bool) bool
        Enqueued(*URLContext)
        Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
        Visited(*URLContext, interface{})
        Disallowed(*URLContext)
    }

    Entry point:

    func main() {
        ext := &Ext{&gocrawl.DefaultExtender{}}
        // Set custom options
        opts := gocrawl.NewOptions(ext)
        opts.CrawlDelay = 1 * time.Second
        opts.LogFlags = gocrawl.LogError
        opts.SameHostOnly = false
        opts.MaxVisits = 10

        c := gocrawl.NewCrawlerWithOptions(opts)
        c.Run("http://0value.com")
    }

    Three steps in main:

    1) get an Extender

    2) create Options with the given Extender

    3) create the Crawler with those Options and run it

    As the comments say, the Crawler controls the whole process, Options supplies the configuration, and the Extender does the real work.
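
    The Ext type used in main is not defined above. Here is a minimal sketch of how such an extender is typically written, assuming the usual gocrawl pattern of embedding DefaultExtender (which supplies no-op implementations of the whole Extender interface) and overriding only what you need; the ctx.URL() accessor is my assumption:

    // Ext embeds DefaultExtender so that only the methods we care
    // about need to be overridden.
    type Ext struct {
        *gocrawl.DefaultExtender
    }

    // Visit is called for each fetched page. Returning (nil, true)
    // asks gocrawl to harvest the links found in the document itself.
    func (e *Ext) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
        fmt.Println("visited:", ctx.URL()) // URL() is an assumed accessor
        return nil, true
    }

    // Filter decides whether a discovered URL gets enqueued; here we
    // enqueue only URLs that have not been visited yet.
    func (e *Ext) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
        return !isVisited
    }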

    2. Other key structs

    worker, workerResponse and sync.WaitGroup

    // Communication from worker to the master crawler, about the crawling of a URL
    type workerResponse struct {
        ctx           *URLContext
        visited       bool
        harvestedURLs interface{}
        host          string
        idleDeath     bool
    }
    // The worker is dedicated to fetching and visiting a given host, respecting
    // this host's robots.txt crawling policies.
    type worker struct {
        // Worker identification
        host  string
        index int

        // Communication channels and sync
        push    chan<- *workerResponse
        pop     popChannel
        stop    chan struct{}
        enqueue chan<- interface{}
        wg      *sync.WaitGroup

        // Robots validation
        robotsGroup *robotstxt.Group

        // Logging
        logFunc func(LogFlags, string, ...interface{})

        // Implementation fields
        wait           <-chan time.Time
        lastFetch      *FetchInfo
        lastCrawlDelay time.Duration
        opts           *Options
    }
    For background on sync.WaitGroup, see http://mindfsck.net/example-golang-makes-concurrent-programming-easy-awesome/ and http://soniacodes.wordpress.com/2011/02/28/channels-vs-sync-package/
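
    As a toy illustration of the master/worker channel pattern above (a sketch under my own names, not gocrawl's actual code): each worker drains its own input channel, reports results on a shared push channel, and signals completion through a shared *sync.WaitGroup, mirroring worker.pop, worker.push and worker.wg.

    package main

    import (
        "fmt"
        "sync"
    )

    // response mirrors the role of workerResponse: a worker-to-master message.
    type response struct {
        host, url string
    }

    func workerLoop(host string, pop <-chan string, push chan<- response, wg *sync.WaitGroup) {
        defer wg.Done() // lets the master's Wait() return once all workers exit
        for u := range pop {
            // fetching and visiting would happen here
            push <- response{host: host, url: u}
        }
    }

    func main() {
        var wg sync.WaitGroup
        push := make(chan response)
        pop := make(chan string, 2)

        wg.Add(1)
        go workerLoop("0value.com", pop, push, &wg)

        pop <- "http://0value.com/"
        pop <- "http://0value.com/about"
        close(pop) // no more work: the worker's range loop ends

        // Close push once every worker has called wg.Done().
        go func() { wg.Wait(); close(push) }()
        for r := range push {
            fmt.Println("visited", r.url, "on", r.host)
        }
    }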

    3. I will describe the whole workflow of gocrawl in a few days. (6/20/2014)
