  • gocrawl analysis

    1. gocrawl type structure

     

    // The crawler itself, the master of the whole process
    type Crawler struct {
        Options *Options

        // Internal fields
        logFunc         func(LogFlags, string, ...interface{})
        push            chan *workerResponse
        enqueue         chan interface{}
        stop            chan struct{}
        wg              *sync.WaitGroup
        pushPopRefCount int
        visits          int

        // keep lookups in maps, O(1) access time vs O(n) for slice. The empty struct value
        // is of no use, but this is the smallest type possible - it uses no memory at all.
        visited map[string]struct{}
        hosts   map[string]struct{}
        workers map[string]*worker
    }
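
    The map[string]struct{} fields are worth a note: this is the idiomatic Go way to build a set, because the empty struct value takes zero bytes, so only the keys consume memory. A quick standalone illustration (my own example, not gocrawl code):

    package main

    import "fmt"

    func main() {
        // A set of visited URLs: struct{}{} is a zero-byte value,
        // so the map effectively stores only its keys.
        visited := make(map[string]struct{})
        visited["http://example.com/"] = struct{}{}

        // Membership test is an O(1) map lookup, vs O(n) for a slice scan.
        if _, ok := visited["http://example.com/"]; ok {
            fmt.Println("already visited")
        }
    }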
    // The Options available to control and customize the crawling process.
    type Options struct {
        UserAgent             string
        RobotUserAgent        string
        MaxVisits             int
        EnqueueChanBuffer     int
        HostBufferFactor      int
        CrawlDelay            time.Duration // Applied per host
        WorkerIdleTTL         time.Duration
        SameHostOnly          bool
        HeadBeforeGet         bool
        URLNormalizationFlags purell.NormalizationFlags
        LogFlags              LogFlags
        Extender              Extender
    }
    // Extension methods required to provide an extender instance.
    type Extender interface {
        // Start, End, Error and Log are not related to a specific URL, so they don't
        // receive a URLContext struct.
        Start(interface{}) interface{}
        End(error)
        Error(*CrawlError)
        Log(LogFlags, LogFlags, string)

        // ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
        // is related to a URLContext (holds a ctx field).
        ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

        // All other extender methods are executed in the context of an URL, and thus
        // receive an URLContext struct as first argument.
        Fetch(*URLContext, string, bool) (*http.Response, error)
        RequestGet(*URLContext, *http.Response) bool
        RequestRobots(*URLContext, string) ([]byte, bool)
        FetchedRobots(*URLContext, *http.Response)
        Filter(*URLContext, bool) bool
        Enqueued(*URLContext)
        Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
        Visited(*URLContext, interface{})
        Disallowed(*URLContext)
    }

    Entry point:

    package main

    import (
        "time"

        "github.com/PuerkitoBio/gocrawl"
    )

    func main() {
        ext := &Ext{&gocrawl.DefaultExtender{}}
        // Set custom options
        opts := gocrawl.NewOptions(ext)
        opts.CrawlDelay = 1 * time.Second
        opts.LogFlags = gocrawl.LogError
        opts.SameHostOnly = false
        opts.MaxVisits = 10

        c := gocrawl.NewCrawlerWithOptions(opts)
        c.Run("http://0value.com")
    }

    Three steps in main:

    1) get an Extender

    2) create Options with the given Extender

    3) create the Crawler

    As the comment says, the Crawler controls the whole process, Options supplies the configuration, and the Extender does the real work.
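
    Note that the Ext type used in main is not defined in the snippet above. Here is a minimal sketch of what it typically looks like, following the DefaultExtender-embedding pattern from gocrawl's README; the regexp and the choice of overridden methods are illustrative assumptions, and the snippet belongs in the same package as main:

    package main

    import (
        "net/http"
        "regexp"

        "github.com/PuerkitoBio/gocrawl"
        "github.com/PuerkitoBio/goquery"
    )

    // Illustrative filter: only enqueue URLs under 0value.com.
    var rxOk = regexp.MustCompile(`http://0value\.com(/.*)?$`)

    // Ext embeds gocrawl's DefaultExtender, which provides default
    // implementations of every Extender method, so we only override
    // the methods we care about.
    type Ext struct {
        *gocrawl.DefaultExtender
    }

    // Visit is called for each fetched page; returning (nil, true)
    // tells gocrawl to harvest the page's links itself.
    func (e *Ext) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
        // Inspect doc or res.Body here.
        return nil, true
    }

    // Filter decides whether a discovered URL gets enqueued.
    func (e *Ext) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
        return !isVisited && rxOk.MatchString(ctx.NormalizedURL().String())
    }

    Because Ext embeds *gocrawl.DefaultExtender, it automatically satisfies the full Extender interface shown earlier.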

    2. Other key structs

    worker, workerResponse and sync.WaitGroup

    // Communication from worker to the master crawler, about the crawling of a URL
    type workerResponse struct {
        ctx           *URLContext
        visited       bool
        harvestedURLs interface{}
        host          string
        idleDeath     bool
    }

    // The worker is dedicated to fetching and visiting a given host, respecting
    // this host's robots.txt crawling policies.
    type worker struct {
        // Worker identification
        host  string
        index int

        // Communication channels and sync
        push    chan<- *workerResponse
        pop     popChannel
        stop    chan struct{}
        enqueue chan<- interface{}
        wg      *sync.WaitGroup

        // Robots validation
        robotsGroup *robotstxt.Group

        // Logging
        logFunc func(LogFlags, string, ...interface{})

        // Implementation fields
        wait           <-chan time.Time
        lastFetch      *FetchInfo
        lastCrawlDelay time.Duration
        opts           *Options
    }
    For more about sync.WaitGroup, see http://mindfsck.net/example-golang-makes-concurrent-programming-easy-awesome/ and http://soniacodes.wordpress.com/2011/02/28/channels-vs-sync-package/. The sketch below shows how these pieces fit together.
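
    A minimal standalone sketch (my own simplification, not gocrawl's actual code) of how the push/pop/stop channels and the WaitGroup tie the master and a per-host worker together:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // What a worker reports back to the master, loosely mirroring
    // workerResponse above.
    type response struct {
        url     string
        visited bool
    }

    // workerLoop pops URLs from its queue, honors the stop channel, and
    // reports results on push; wg.Done signals the master on exit.
    func workerLoop(pop <-chan string, push chan<- response, stop <-chan struct{}, wg *sync.WaitGroup) {
        defer wg.Done()
        for {
            select {
            case <-stop:
                return
            case u := <-pop:
                time.Sleep(10 * time.Millisecond) // stand-in for fetch + CrawlDelay
                push <- response{url: u, visited: true}
            }
        }
    }

    func main() {
        pop := make(chan string, 1)
        push := make(chan response)
        stop := make(chan struct{})
        var wg sync.WaitGroup

        wg.Add(1)
        go workerLoop(pop, push, stop, &wg)

        pop <- "http://example.com/"
        fmt.Println(<-push) // the master consumes the worker's report

        close(stop) // ask the worker to quit...
        wg.Wait()   // ...and wait until it actually has
    }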

    3. I will describe the whole workflow of gocrawl in a few days. (6/20/2014)
