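Colly is a fairly full-featured HTTP client tool (a scraping framework) for Go.
Installation
GoLand is used as the development environment.
GOROOT: the Go directory is located at /opt/go, so GOROOT points to /opt/go by default
GOPATH: under Settings->Go->GOPATH, configure the Global GOPATH to point to /home/milton/WorkGo
GOPROXY: under Settings->Go->Go Modules, set Environment to GOPROXY=https://goproxy.cn
In GoLand's built-in Terminal, run go env to check the environment variables and confirm the paths are correct, then run one of the following commands to install Colly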
# v1
go get -u github.com/gocolly/colly
# v2
go get -u github.com/gocolly/colly/v2
Basic usage
Add the imports (fmt is needed for the print calls below)
import (
	"fmt"

	"github.com/gocolly/colly/v2"
)
Usage
func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)
	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})
	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})
	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
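Using a proxy pool
See the example in the documentation: http://go-colly.org/docs/examples/proxy_switcher/ . Two issues need attention when using that example:
1. AllowURLRevisit must be set when the collector is created. Otherwise a second visit to the same URL is skipped: Visit returns an "already visited" error and no new request is sent.
c := colly.NewCollector(colly.AllowURLRevisit())
2. KeepAlive must also be disabled. Otherwise, when the same address is visited repeatedly, GetProxy is called only once, so the proxy pool is not actually rotated. Related issues: #392, #366, #339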
c := colly.NewCollector(colly.AllowURLRevisit())
c.WithTransport(&http.Transport{
	DisableKeepAlives: true,
})
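The rotation idea itself can be sketched with the standard library alone. The following is a minimal sketch, not Colly's own implementation: roundRobinProxy and the two local proxy addresses are hypothetical names chosen for illustration. It builds a function with the signature that http.Transport.Proxy expects and cycles through the given proxy URLs.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobinProxy (hypothetical helper) returns a function usable as
// http.Transport.Proxy that cycles through the given proxy URLs.
func roundRobinProxy(rawURLs ...string) (func(*http.Request) (*url.URL, error), error) {
	if len(rawURLs) == 0 {
		return nil, fmt.Errorf("no proxy URLs given")
	}
	proxies := make([]*url.URL, 0, len(rawURLs))
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		proxies = append(proxies, u)
	}
	var next uint32
	return func(*http.Request) (*url.URL, error) {
		// Atomic counter so concurrent requests can share the rotation.
		i := atomic.AddUint32(&next, 1) - 1
		return proxies[i%uint32(len(proxies))], nil
	}, nil
}

func main() {
	// The proxy addresses are placeholders, not real servers.
	pick, err := roundRobinProxy("http://127.0.0.1:8001", "http://127.0.0.1:8002")
	if err != nil {
		panic(err)
	}
	// Keep-alives are disabled, as point 2 above recommends.
	transport := &http.Transport{Proxy: pick, DisableKeepAlives: true}
	_ = transport // with Colly, this would be handed to c.WithTransport(transport)

	// Call the picker directly to show the rotation order.
	for i := 0; i < 4; i++ {
		u, _ := pick(nil)
		fmt.Println(u)
	}
}
```

Calling the picker four times prints the two addresses alternately, starting with the first one.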
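Goroutine synchronization in Go (the counterpart of locks in Java)
Mutex
To avoid race conditions and data races in a Go program, use a Mutex so that a resource can be accessed by only one goroutine at a time. Declare a sync.Mutex (its zero value is ready to use; a pointer can be created with &sync.Mutex{}), here as a field of a package-level struct variable, and call Lock() and Unlock() inside the methods to acquire and release it. Note the use of the defer keyword, which guarantees that Unlock() runs when the method returns.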
import (
	"strconv"
	"sync"
)

var myBalance = &balance{amount: 50.00, currency: "GBP"}

type balance struct {
	amount   float64
	currency string
	mu       sync.Mutex
}

func (b *balance) Add(i float64) {
	b.mu.Lock()
	b.amount += i
	b.mu.Unlock()
}

func (b *balance) Display() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return strconv.FormatFloat(b.amount, 'f', 2, 64) + " " + b.currency
}
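To see the locking pattern above under real contention, here is a small self-contained sketch. The counter type and the runIncrements helper are hypothetical names introduced only for this illustration; an integer is used instead of a float so the result is exact.

```go
package main

import (
	"fmt"
	"sync"
)

// counter keeps the mutex next to the data it guards,
// in the same way as the balance type above.
type counter struct {
	mu sync.Mutex
	n  int
}

// Inc adds one while holding the lock.
func (c *counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

// Value reads the current count while holding the lock.
func (c *counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// runIncrements starts `workers` goroutines that each call Inc
// `perWorker` times, waits for all of them, and returns the total.
func runIncrements(workers, perWorker int) int {
	var c counter
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.Inc()
			}
		}()
	}
	wg.Wait()
	return c.Value()
}

func main() {
	fmt.Println(runIncrements(100, 1000)) // prints 100000
}
```

Without the Lock()/Unlock() pair in Inc, the increments would race and the total could come out lower; running the program with go run -race would report the data race.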
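For a read-write lock, use RWMutex. It adds RLock() and RUnlock() on top of what Mutex offers. Lock() is still exclusive (it blocks both readers and other writers), but one RLock() does not exclude another RLock(), so several readers can hold the lock at the same time.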
import (
	"strconv"
	"sync"
)

var myBalance = &balance{amount: 50.00, currency: "GBP"}

type balance struct {
	amount   float64
	currency string
	mu       sync.RWMutex
}

func (b *balance) Add(i float64) {
	b.mu.Lock()
	b.amount += i
	b.mu.Unlock()
}

func (b *balance) Display() string {
	b.mu.RLock()
	defer b.mu.RUnlock()
	return strconv.FormatFloat(b.amount, 'f', 2, 64) + " " + b.currency
}
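The claim that readers do not exclude each other can be checked with a small experiment. This is only a sketch; peakReaders is a hypothetical helper. Every reader takes RLock() and then waits at a barrier until all other readers also hold the lock, which would deadlock if RLock() were exclusive.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// peakReaders starts `readers` goroutines that all take the read lock
// and returns the highest number of simultaneous lock holders observed.
func peakReaders(readers int) int {
	var (
		mu      sync.RWMutex
		holding int32          // readers currently between RLock and RUnlock
		barrier sync.WaitGroup // released once every reader holds the lock
		done    sync.WaitGroup
	)
	seen := make([]int32, readers) // holder count each reader saw on entry
	barrier.Add(readers)
	done.Add(readers)
	for i := 0; i < readers; i++ {
		go func(i int) {
			defer done.Done()
			mu.RLock()
			seen[i] = atomic.AddInt32(&holding, 1)
			barrier.Done()
			barrier.Wait() // would never return if readers excluded each other
			atomic.AddInt32(&holding, -1)
			mu.RUnlock()
		}(i)
	}
	done.Wait()

	peak := int32(0)
	for _, n := range seen {
		if n > peak {
			peak = n
		}
	}
	return int(peak)
}

func main() {
	fmt.Println(peakReaders(5)) // prints 5
}
```

Replacing RLock()/RUnlock() with Lock()/Unlock() in this sketch makes the program hang at the barrier, which is exactly the difference between the two kinds of lock.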
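Channel
A channel can play a role similar to Semaphore in Java: by setting the channel's capacity you limit how many goroutines work at the same time, because once the channel is full, any goroutine that sends to it blocks. The example below uses an unbuffered channel, so each send blocks until the receiver takes the value.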
package main

import (
	"fmt"
	"strconv"
	"time"
)

func makeCakeAndSend(cs chan string) {
	for i := 1; i <= 3; i++ {
		cakeName := "Strawberry Cake " + strconv.Itoa(i)
		fmt.Println("Making a cake and sending ...", cakeName)
		cs <- cakeName // send a strawberry cake
	}
}

func receiveCakeAndPack(cs chan string) {
	for i := 1; i <= 3; i++ {
		s := <-cs // get whatever cake is on the channel
		fmt.Println("Packing received cake: ", s)
	}
}

func main() {
	cs := make(chan string)
	go makeCakeAndSend(cs)
	go receiveCakeAndPack(cs)
	// sleep for a while so that the program doesn't exit immediately
	time.Sleep(4 * time.Second)
}
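The example above keeps the program alive with a fixed sleep. A possible alternative, shown here as a sketch with the hypothetical names makeCakes and packCakes, is to close the channel when the sender is finished and let the receiver range over it on the main goroutine, so no sleep is needed.

```go
package main

import (
	"fmt"
	"strconv"
)

// makeCakes sends n cakes and then closes the channel
// to signal that no more values are coming.
func makeCakes(n int, cs chan<- string) {
	for i := 1; i <= n; i++ {
		cs <- "Strawberry Cake " + strconv.Itoa(i)
	}
	close(cs)
}

// packCakes receives until the channel is closed and
// returns everything it received, in order.
func packCakes(cs <-chan string) []string {
	var packed []string
	for cake := range cs { // the loop ends when cs is closed and drained
		packed = append(packed, cake)
	}
	return packed
}

func main() {
	cs := make(chan string)
	go makeCakes(3, cs)
	// Receiving on the main goroutine means main cannot exit early.
	for _, cake := range packCakes(cs) {
		fmt.Println("Packed:", cake)
	}
}
```

The directional channel types (chan<- and <-chan) are optional, but they let the compiler check that only the sender sends and closes.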
The capacity of a channel can be set when it is created:
c := make(chan Type, n)
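A buffered channel created this way can serve as the semaphore described at the start of this section. The following is a minimal sketch; runLimited and its parameters are hypothetical names, and the sleep only simulates work. Each goroutine sends a token into the channel before working and takes it back out afterwards, so at most `limit` goroutines work at once.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runLimited starts `tasks` goroutines but lets at most `limit` of them
// work at the same time. It returns the highest number of goroutines
// that were observed working simultaneously.
func runLimited(tasks, limit int) int {
	sem := make(chan struct{}, limit) // capacity = maximum concurrent workers
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex // guards active and peak
		active int
		peak   int
	)
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire: blocks while the channel is full
			defer func() { <-sem }() // release the slot when done

			mu.Lock()
			active++
			if active > peak {
				peak = active
			}
			mu.Unlock()

			time.Sleep(10 * time.Millisecond) // simulated work

			mu.Lock()
			active--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// The reported peak never exceeds the limit of 3.
	fmt.Println("peak concurrency:", runLimited(10, 3))
}
```

Using struct{} as the element type is a common choice for such tokens because the values carry no data.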