Python: understanding general-purpose crawlers and focused crawlers

General-purpose crawlers: the crawlers behind search engines such as Baidu, 360, Sohu, Google, Bing, and so on.

How they work:

(1) Fetch web pages

(2) Collect the data

(3) Process the data

(4) Provide a search service
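
The four steps above can be illustrated with a toy example. This is only a minimal sketch, not how any real search engine is implemented: the pages are hard-coded instead of fetched, the example.com URLs are placeholders, and `tokenize`, `build_index` and `search` are hypothetical helper names. It shows the "process the data" step as building an inverted index and the "search service" step as a lookup in that index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Stand-in for steps (1) and (2): pages that were already fetched (made-up data).
pages = {
    "http://example.com/a": "Python crawler tutorial",
    "http://example.com/b": "Python web framework",
}
index = build_index(pages)              # step (3): process the data
print(sorted(search(index, "python crawler")))  # step (4): answer a query
```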

Baidu's crawler: Baiduspider

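How does a general-purpose crawler find a new website?

(1) The site owner submits the URL to the search engine

(2) The site arranges link exchanges, so the crawler reaches it by following links from sites it already knows

(3) Baidu cooperates with DNS providers to discover new websites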

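Search ranking

(1) Paid ranking (bidding for position)

(2) Ranking by PageRank value, which is computed from the link structure between pages: how many pages link to a page and how important those linking pages are. Traffic and click volume are separate signals that a search engine may also weigh. Improving this organic ranking is the work of SEO staff.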
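
To make the PageRank idea concrete, here is a minimal sketch of the textbook power-iteration form of the algorithm. The ranking used by real search engines is proprietary and combines many more signals; the link graph, the damping factor of 0.85, the iteration count and the function name `pagerank` are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank for a graph given as {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked to by A, B and D; nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C ends up with the highest rank because the most pages link to it, and D with the lowest because no page links to it.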

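If you do not want Baidu's crawler to crawl your website, add a robots.txt file. It declares which crawlers may crawl which parts of the site and which may not. For example, part of Taobao's robots.txt: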

User-agent:  Baiduspider
Allow:  /article
Allow:  /oshtml
Allow:  /ershou
Allow: /$
Disallow:  /product/
Disallow:  /

User-Agent:  Googlebot
Allow:  /article
Allow:  /oshtml
Allow:  /product
Allow:  /spu
Allow:  /dianpu
Allow:  /oversea
Allow:  /list
Allow:  /ershou
Allow: /$
Disallow:  /
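
This protocol is only a gentleman's agreement. Nothing enforces it technically, so a crawler that chooses to ignore it can still fetch those pages.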
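
A well-behaved crawler checks these rules itself before fetching. The sketch below uses `urllib.robotparser` from the Python standard library on a shortened copy of the rules above (the shortening and the helper name `allowed` are assumptions for this example). The `Allow: /$` line is left out because it relies on a `$` end-of-path extension that the standard-library parser may not interpret.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the Taobao rules quoted above.
ROBOTS_TXT = """\
User-agent: Baiduspider
Allow: /article
Disallow: /product/
Disallow: /

User-agent: Googlebot
Allow: /article
Allow: /product
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the given crawler may fetch the path under these rules."""
    return parser.can_fetch(agent, "https://www.taobao.com" + path)


print(allowed("Baiduspider", "/article/1"))  # matched by Allow: /article
print(allowed("Baiduspider", "/product/1"))  # matched by Disallow: /product/
print(allowed("Googlebot", "/product/1"))    # matched by Allow: /product
```

To check a live site instead, `set_url()` followed by `read()` downloads and parses its robots.txt.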
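
Focused crawler: fetches specific data according to a particular need.

Approach: let a program browse the web in place of a browser.

Properties of web pages:

(1) Every page has its own unique URL

(2) Page content is structured as HTML

(3) Pages are served over the HTTP or HTTPS protocol

Steps:

(1) Start from a URL

(2) Write a program that simulates a browser requesting that URL

(3) Parse the response and extract the data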
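
The three steps can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not a complete crawler: the `User-Agent` string, the choice to extract the page title and link targets, and the names `LinkExtractor`, `fetch` and `parse` are all hypothetical choices for this example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collect the page title and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url, timeout=10):
    """Step (2): request the URL while presenting a browser-like User-Agent."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse(html):
    """Step (3): parse the HTML and return (title, list of link targets)."""
    extractor = LinkExtractor()
    extractor.feed(html)
    return extractor.title.strip(), extractor.links


# Step (1) is choosing the URL, for example:
#     title, links = parse(fetch("https://example.com/"))
```

Third-party libraries such as requests for fetching and an HTML parsing library for extraction are common alternatives; the standard library is used here only to keep the sketch self-contained.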

Original article: https://www.cnblogs.com/lyxcode/p/11490064.html