<Web Crawler><Java><thread-safe queue>

Basic Solution
- The simplest approach is to build a web crawler that runs on a single machine with a single thread.
- A basic web crawler works like this:
  
  Start with a URL pool that contains all the websites we want to crawl.
  
  For each URL, issue an HTTP GET request to fetch the web page content.
  
  Parse the content (usually HTML) and extract potential URLs that we want to crawl.
  
  Add new URLs to the pool and keep crawling.
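The four steps above can be sketched as a single-threaded loop. This is a minimal sketch, not the author's code: the class name SimpleCrawler, the fetcher hook, the maxPages limit and the naive href pattern are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleCrawler {
    // Naive pattern (assumption): only absolute http(s) links in double-quoted href attributes.
    private static final Pattern LINK = Pattern.compile("href=\"(https?://[^\"]+)\"");

    /**
     * Crawls breadth-first from the seeds and returns the URLs in the order they were fetched.
     * fetcher is a hypothetical hook that maps a URL to its page body (null on failure).
     */
    static List<String> crawl(Collection<String> seeds, Function<String, String> fetcher, int maxPages) {
        Deque<String> pool = new ArrayDeque<>();   // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();        // every URL ever added to the pool
        for (String seed : seeds) {
            if (seen.add(seed)) pool.add(seed);
        }
        List<String> crawled = new ArrayList<>();
        while (!pool.isEmpty() && crawled.size() < maxPages) {
            String url = pool.poll();
            String body = fetcher.apply(url);      // step 2: fetch
            crawled.add(url);
            if (body == null) continue;
            Matcher m = LINK.matcher(body);        // step 3: parse and extract
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) pool.add(link); // step 4: add only unseen URLs
            }
        }
        return crawled;
    }

    /** One possible fetcher: a plain HTTP GET that returns the body as UTF-8 text. */
    static String httpGet(String url) {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException | IllegalArgumentException e) {
            return null; // unreachable or malformed URL
        }
    }
}
```

Calling SimpleCrawler.crawl(seeds, SimpleCrawler::httpGet, 100) would crawl real pages. Passing the fetcher as a parameter keeps the loop testable without network access.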
Scale issues
- Any system runs into new problems once it is scaled out.
- So what are the likely bottlenecks of a distributed web crawler, and how can they be solved?
Crawling frequency
- How often will you crawl a website?
- Small websites may run on servers that cannot handle overly frequent requests.
- One solution is to follow the site's robots.txt file.
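One way to avoid overloading a small site is to enforce a minimum delay between requests to the same host. The sketch below assumes a fixed delay chosen by the crawler. The class PolitenessScheduler and its reserve method are hypothetical names, and the delay could instead come from a site's robots.txt if it publishes one (Crawl-delay is a non-standard directive that only some sites use).

```java
import java.util.HashMap;
import java.util.Map;

class PolitenessScheduler {
    private final long minDelayMillis;                       // assumed per-host gap between requests
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessScheduler(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Reserves the next request slot for the host and returns how many
     * milliseconds the caller should wait before sending the request.
     */
    synchronized long reserve(String host, long nowMillis) {
        long earliest = Math.max(nowMillis, nextAllowed.getOrDefault(host, 0L));
        nextAllowed.put(host, earliest + minDelayMillis);
        return earliest - nowMillis;
    }
}
```

A worker would call reserve(host, System.currentTimeMillis()) and sleep for the returned time before fetching. Taking the current time as a parameter keeps the class deterministic under test.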
Dedup
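- On a single machine, you can keep the URL pool in memory and remove duplicate entries.
- However, things become more complicated in a distributed system.
- So how can we dedup these URLs?
- A common approach is a Bloom filter. A Bloom filter is a space-efficient data structure that tests whether an element is in a set. Its "in the pool" answers may be wrong (false positives), but its "not in the pool" answers are always accurate.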
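To make the Bloom filter idea concrete, here is a minimal in-memory sketch built on java.util.BitSet. The class name UrlBloomFilter, the bit-array size, the number of hash functions and the way the hashes are derived from String.hashCode are all assumptions for illustration. A distributed crawler would need a shared or partitioned filter, which this sketch does not cover.

```java
import java.util.BitSet;

class UrlBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of hash functions (k)

    UrlBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit index from two cheap hashes (double hashing).
    private int index(String url, int i) {
        int h1 = url.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // rotated copy used as a second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String url) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(url, i));
        }
    }

    /** false means definitely never added; true means probably added. */
    boolean mightContain(String url) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(url, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would skip a URL when mightContain returns true, accepting that a small fraction of new URLs are skipped by mistake in exchange for the memory savings.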
Parsing
- After fetching the response data, the next step is to parse the data (usually HTML) to extract the information we care about.
- This sounds like a simple thing, but it can be quite hard to make it robust.
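Part of what makes parsing hard to get right is that links come in many forms: relative paths, fragments, non-HTTP schemes and malformed values. The sketch below shows one way to handle those cases with java.net.URI. The class name LinkExtractor and the regular expression are assumptions, and a real crawler would more likely use a proper HTML parser than a regex.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Matches href attributes in single or double quotes, in any letter case.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns absolute http(s) links found in html, resolved against baseUrl. */
    static List<String> extract(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim()); // relative -> absolute
                String scheme = resolved.getScheme();
                if (!"http".equalsIgnoreCase(scheme) && !"https".equalsIgnoreCase(scheme)) {
                    continue; // skip mailto:, javascript:, and similar
                }
                String link = resolved.toString();
                int hash = link.indexOf('#');
                if (hash >= 0) {
                    link = link.substring(0, hash); // the fragment names the same page
                }
                links.add(link);
            } catch (IllegalArgumentException e) {
                // Malformed href value: skip it rather than abort the whole page.
            }
        }
        return links;
    }
}
```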
Other problems
- Detecting loops: many websites contain links like A -> B -> C -> A, so your crawler may end up running forever. How can this be fixed? [In practice, once URLs are deduplicated there should be no loops, as with the visited set in BFS.]
- DNS lookup: when the system is scaled to a certain level, DNS lookup can become a bottleneck, and you may need to build your own DNS server.
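Running a dedicated DNS server is beyond a short example, but a lighter mitigation for the lookup bottleneck is an in-process cache in front of the resolver. Note that this is a cache, not a DNS server. The class DnsCache and the resolver hook are hypothetical, and the sketch ignores TTLs and never evicts entries.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class DnsCache {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> resolver; // hypothetical hook: host -> IP, or null

    DnsCache(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** Returns the cached address, resolving the host only on the first request. */
    String lookup(String host) {
        // A null result from the resolver is not cached, so failed lookups are retried.
        return cache.computeIfAbsent(host, resolver);
    }

    /** One possible resolver, backed by the JVM's own name lookup. */
    static String systemResolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }
}
```

A crawler could create it as new DnsCache(DnsCache::systemResolve) and share the instance across worker threads.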
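A Java Web Crawler
- To set aside concerns I do not want to focus on for now, such as HTML parsing and cookie handling, the crawler calls an API endpoint directly, and the endpoint returns JSON.
- Multithreading:
  
  The queue is a BlockingQueue. offer is called with a wait time and returns false if it times out; poll likewise takes a wait time and returns null if it times out.
  
  visited uses Collections.synchronizedSet.
- Handling a URL:
  
  HTTP GET request: use new URL(requestUrl).openStream(). No extra headers need to be set, so this is simple.
  
  There are two cases:
  
  If the URL returns JSON, parse the JSON and put the extracted links into the queue.
  
  If the URL is an mp3, download it and save it locally.
- For the code, see github-wttttt.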
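The threading setup described above can be sketched as follows. This is not the author's GitHub code, only an illustration under stated assumptions: the class ConcurrentCrawler, the linksOf hook (standing in for "fetch the URL, parse the JSON, return the links"), the queue capacity and the 300 ms wait time are all made up for the example.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

class ConcurrentCrawler {
    private static final long WAIT_MILLIS = 300; // assumed timeout for offer and poll

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);
    private final Set<String> visited = Collections.synchronizedSet(new HashSet<>());
    private final Function<String, List<String>> linksOf; // hypothetical hook: URL -> outgoing links

    ConcurrentCrawler(Function<String, List<String>> linksOf) {
        this.linksOf = linksOf;
    }

    /** Crawls from the seed with the given number of worker threads and returns all visited URLs. */
    Set<String> crawl(String seed, int threads) {
        visited.add(seed);
        queue.offer(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(this::work);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        synchronized (visited) {
            return new HashSet<>(visited);
        }
    }

    private void work() {
        try {
            String url;
            // poll returns null on timeout; a worker treats that as "no more work" and exits.
            while ((url = queue.poll(WAIT_MILLIS, TimeUnit.MILLISECONDS)) != null) {
                for (String link : linksOf.apply(url)) {
                    // visited.add is atomic, so only one thread enqueues a given link.
                    if (visited.add(link) && !queue.offer(link, WAIT_MILLIS, TimeUnit.MILLISECONDS)) {
                        visited.remove(link); // offer returned false: queue stayed full, drop the link
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Stopping a worker after one empty poll is a simplification. A worker may exit while another is still producing links, in which case the remaining workers finish the crawl with less parallelism.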
Thread-safe Queue
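- Java provides two kinds of thread-safe queue:
- BlockingQueue: a blocking queue.
  
  Enqueue operations:
  
  add(e): throws an exception (IllegalStateException) when the queue is full;
  
  offer(e): neither throws nor blocks, and returns a boolean. When the queue is full it does not insert the element and returns false immediately;
  
  offer(e, timeout, unit): waits up to the given time for space, and returns false on timeout;
  
  put(e): blocks while the queue is full;
  
  Dequeue operations:
  
  remove(): throws an exception (NoSuchElementException) when called on an empty queue;
  
  poll(): neither throws nor blocks, the counterpart of offer(e); returns null when the queue is empty;
  
  poll(timeout, unit): waits up to the given time for an element, and returns null on timeout;
  
  take(): blocks while the queue is empty;
  
  Inspecting the head:
  
  element(): throws an exception (NoSuchElementException) when the queue is empty;
  
  peek(): neither throws nor blocks; returns the head element, or null when the queue is empty;
  
  Concrete implementations of the BlockingQueue interface:
  
  ArrayBlockingQueue: the constructor must take an int argument that gives the capacity;
  
  LinkedBlockingQueue: if the constructor is given a capacity argument, the queue is bounded by it; if not, the capacity defaults to Integer.MAX_VALUE;
  
  PriorityBlockingQueue: elements are not ordered FIFO but by their natural ordering or by the Comparator passed to the constructor.
- ConcurrentLinkedQueue: a non-blocking queue.
  
  ConcurrentLinkedQueue is a lock-free, thread-safe concurrent queue;
  
  Compared with lock-based implementations, the difficulty of a lock-free one lies in fully accounting for coordination among threads. In short, when several threads access the internal data structure and one of them stalls partway through an operation, the other threads must be able to detect this and help complete the remaining work. This requires dividing each operation on the data structure into fine-grained states or phases and considering what can happen under concurrent access in each of them.
  
  ConcurrentLinkedQueue has two volatile variables shared among threads: head and tail. Keeping the queue thread-safe means guaranteeing atomicity and visibility for accesses (updates and reads) to these two node references.
  
  Since volatile already guarantees visibility, what remains is to guarantee that modifications are atomic, which is done with compare-and-set (CAS) operations.
- In short, blocking algorithms are essentially lock-based, for example using the synchronized keyword (the java.util.concurrent blocking queues use ReentrantLock internally). Non-blocking algorithms are comparatively harder to design and implement, because they rely on low-level atomic operations to support concurrency.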
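Two small sketches may help here. The first exercises the non-blocking BlockingQueue methods on a queue of capacity 1 and records what each call returns or throws. The second shows the CAS retry loop that non-blocking algorithms are built on, using a lock-free stack (a Treiber stack) because it is much simpler than the queue algorithm inside ConcurrentLinkedQueue. The class names BoundedQueueOutcomes and LockFreeStack are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

class BoundedQueueOutcomes {
    /** Runs each non-blocking operation on a full, then an empty, queue and records the outcome. */
    static List<String> run() {
        List<String> out = new ArrayList<>();
        BlockingQueue<String> q = new ArrayBlockingQueue<>(1);
        out.add("offer(a) on empty queue -> " + q.offer("a"));
        out.add("offer(b) on full queue -> " + q.offer("b"));
        try {
            q.add("b");
        } catch (IllegalStateException e) {
            out.add("add(b) on full queue -> IllegalStateException");
        }
        out.add("peek() -> " + q.peek());
        out.add("poll() -> " + q.poll());
        out.add("poll() on empty queue -> " + q.poll());
        out.add("peek() on empty queue -> " + q.peek());
        try {
            q.remove();
        } catch (NoSuchElementException e) {
            out.add("remove() on empty queue -> NoSuchElementException");
        }
        try {
            q.element();
        } catch (NoSuchElementException e) {
            out.add("element() on empty queue -> NoSuchElementException");
        }
        return out;
    }
}

class LockFreeStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) {
            this.value = value;
            this.next = next;
        }
    }

    // The only shared mutable state; every update goes through compareAndSet.
    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = head.get();
            newHead = new Node<>(value, oldHead);
        } while (!head.compareAndSet(oldHead, newHead)); // retry if another thread changed head
    }

    /** Returns the most recently pushed value, or null when the stack is empty. */
    T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) {
                return null;
            }
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }
}
```

No thread ever waits for a lock in LockFreeStack. A thread that loses the race simply re-reads head and tries again, which is the pattern ConcurrentLinkedQueue applies to both its head and tail references.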

Original article: https://www.cnblogs.com/wttttt/p/6970558.html
