爬虫入门——02

zoukankan html css js c++ java

爬虫入门——02
爬虫入门——02
1. 引言
 在上一篇中，我们简单的了解了爬虫的工作流程，也简单的实现了一个爬虫，并且在文末简单分析了目前存在的问题。这一篇博客将会对上一篇分析出的问题，给出改进方法。我们将从以下几个方面加以改进。
2. 改进
(1) Bloom Filter
 我们首先利用Bloom Filet来改进UrlQueue中的visitedSet。
 在上一篇中，我们使用visitedSet（HashSet）来存放已经访问过的url。之所以使用HashSet是因为我们需要不断的插入url到visitedSet中，并且还需要频繁地判断某个url是否在其中，而采用Hash Table，在平均情况下，所有的字典操作在O(1)时间内都能完成（具体分析请看散列表(hash table)——算法导论(13)）。但不足之处在于我们需要花费大量的内存空间去维护hash table，我们是否可以减小它的空间复杂度呢？
 从visitedSet的作用入手，它只是用来判断某个url是否被包含在它内部，仅此而已。因此完全没有必要保存每个url的完整信息，保存指纹信息即可。这时我们可想到常用的md5和sha1摘要算法。但尽管对url做了压缩，我们还是需要去保存压缩后的信息。我们还有没有更好的方法呢？这时，Bloom Filter就派上用场了（关于Bloom Filter的介绍及实现在散列表(hash table)——算法导论(13)中的最后一小节）。
(2) Berkeley DB
 我们再来使用Berkeley DB改进我们UrlQueue中的unvisitedList。
 Berkeley DB是一个嵌入式数据库系统，简单、小巧、性能高（简单小巧没得说，至于性能，没验证过）。关于Berkeley DB的下载和使用请到其官网：http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/overview/index.html
 使用Berkeley DB后，我们会将从页面解析出的url直接存入DB中，而unvisitedList只是作为从DB中取url时的缓冲池。即我们会开启一个线程以一定的频率从DB中读取一定数量的url到unvisitedList中，执行页面请求的线程还是从unvisitedList读取url。
(3). 多线程
 最后我们引入多线程来提高爬虫的效率。
 多线程的关键在于同步与通信。这些内容请自行百度。
改进后的整个结构图如下：
3. 实现
(1) 代码
最后我们给出改进后的代码：
① 首先是改进后的UrlQueue.java，我们重命名为BloomQueue.java（其中的BloomFilter类，在散列表(hash table)——算法导论(13)中可找到）
BloomQueue.java
package craw;

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BloomQueue<T> {

 private BloomFilter<T> bloomFilter;
 private LinkedBlockingQueue<T> visitedList;
 private AtomicInteger flowedCount;
 private int queueCapacity;

 public BloomQueue() {
 this(0.000001, 10000000, 500);
 }

 public BloomQueue(double falsePositiveProbability, int filterCapacity, int queueCapacity) {
 this.queueCapacity = queueCapacity;
 bloomFilter = new BloomFilter<>(falsePositiveProbability, filterCapacity);
 visitedList = new LinkedBlockingQueue<>(queueCapacity);
 flowedCount = new AtomicInteger(0);
 }

 /**
 * 入队(当无法入队时，默认阻塞3秒)
 *
 * @param t
 * @return
 */
 public boolean enqueue(T t) {
 return enqueue(t, 3000);
 }

 /**
 * 入队
 *
 * @param t
 * @param timeout
 * 单位为毫秒
 */
 public boolean enqueue(T t, long timeout) {
 try {
 boolean result = visitedList.offer(t, timeout, TimeUnit.MILLISECONDS);
 if (result) {
 bloomFilter.add(t);
 flowedCount.getAndIncrement();
 }
 return result;
 } catch (InterruptedException e) {
 e.printStackTrace();
 }
 return false;
 }

 /**
 * 出队(当队列为空时，默认会阻塞3秒)
 *
 * @return
 */
 public T dequeue() {
 return dequeue(3000);
 }

 /**
 * 出队
 *
 * @param timeout
 * 单位为毫秒
 * @return
 */
 public T dequeue(long timeout) {
 try {
 return visitedList.poll(timeout, TimeUnit.MILLISECONDS);
 } catch (InterruptedException e) {
 }
 return null;
 }

 /**
 * 当前是否包含
 *
 * @return
 */
 public boolean contains(T t) {
 return visitedList.contains(t);
 }

 /**
 * 曾经是否包含
 *
 * @param t
 * @return
 */
 public boolean contained(T t) {
 return bloomFilter.contains(t);
 }

 public boolean isEmpty() {
 return visitedList.isEmpty();
 }

 public boolean isFull() {
 return visitedList.size() == queueCapacity;
 }

 public int size() {
 return visitedList.size();
 }

 public int flowedCount() {
 return flowedCount.get();
 }

 @Override
 public String toString() {
 return visitedList.toString();
 }
}
BloomFilter.java
package craw;
import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Collection;

public class BloomFilter<E> implements Serializable {
 private static final long serialVersionUID = -9077350041930475408L;

 private BitSet bitset;// 二进制向量
 private int slotSize; // 二进制向量的总位（槽）数（文中的m）
 private double loadFactor; // 装载因子（文中的α）
 private int capacity; // 布隆过滤器的容量（文中的n）
 private int size; // 装载的数目
 private int k; // 一个元素对应的位数（文中的k）

 static final Charset charset = Charset.forName("UTF-8");
 static final String hashName = "MD5";// 默认采用MD5算法加密，也可改为SHA1
 static final MessageDigest digestFunction;

 static {
 MessageDigest tmp;
 try {
 tmp = java.security.MessageDigest.getInstance(hashName);
 } catch (NoSuchAlgorithmException e) {
 tmp = null;
 }
 digestFunction = tmp;
 }

 public BloomFilter(int slotSize, int capacity) {
 this(slotSize / (double) capacity, capacity, (int) Math.round((slotSize / (double) capacity) * Math.log(2.0)));
 }

 public BloomFilter(double falsePositiveProbability, int capacity) {
 this(Math.log(2) / (Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)))),//loadFactor = ln2 / k;
 capacity, //
 (int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)))); //k = -ln p / ln2
 }

 public BloomFilter(int slotSize, int capacity, int size, BitSet filterData) {
 this(slotSize, capacity);
 this.bitset = filterData;
 this.size = size;
 }

 public BloomFilter(double loadFactor, int capacity, int k) {
 size = 0;
 this.loadFactor = loadFactor;
 this.capacity = capacity;
 this.k = k;
 this.slotSize = (int) Math.ceil(capacity * loadFactor);
 this.bitset = new BitSet(slotSize);
 }

 public static int createHash(String val, Charset charset) {
 return createHash(val.getBytes(charset));
 }

 public static int createHash(String val) {
 return createHash(val, charset);
 }

 public static int createHash(byte[] data) {
 return createHashes(data, 1)[0];
 }

 public static int[] createHashes(byte[] data, int hashes) {
 int[] result = new int[hashes];

 int k = 0;
 byte salt = 0;
 while (k < hashes) {
 byte[] digest;
 synchronized (digestFunction) {
 digestFunction.update(salt);
 salt++;
 digest = digestFunction.digest(data);
 }

 for (int i = 0; i < digest.length / 4 && k < hashes; i++) {
 int h = 0;
 for (int j = (i * 4); j < (i * 4) + 4; j++) {
 h <<= 8;
 h |= ((int) digest[j]) & 0xFF;
 }
 result[k] = h;
 k++;
 }
 }
 return result;
 }

 /**
 * Compares the contents of two instances to see if they are equal.
 *
 * @param obj
 * is the object to compare to.
 * @return True if the contents of the objects are equal.
 */
 @Override
 public boolean equals(Object obj) {
 if (obj == null) {
 return false;
 }
 if (getClass() != obj.getClass()) {
 return false;
 }
 final BloomFilter<?> other = (BloomFilter<?>) obj;
 if (this.capacity != other.capacity) {
 return false;
 }
 if (this.k != other.k) {
 return false;
 }
 if (this.slotSize != other.slotSize) {
 return false;
 }
 if (this.bitset != other.bitset && (this.bitset == null || !this.bitset.equals(other.bitset))) {
 return false;
 }
 return true;
 }

 /**
 * Calculates a hash code for this class.
 *
 * @return hash code representing the contents of an instance of this class.
 */
 @Override
 public int hashCode() {
 int hash = 7;
 hash = 61 * hash + (this.bitset != null ? this.bitset.hashCode() : 0);
 hash = 61 * hash + this.capacity;
 hash = 61 * hash + this.slotSize;
 hash = 61 * hash + this.k;
 return hash;
 }

 /**
 * Calculates the expected probability of false positives based on the
 * number of expected filter elements and the size of the Bloom filter.
 * 
 * 
 * The value returned by this method is the expected rate of false
 * positives, assuming the number of inserted elements equals the number of
 * expected elements. If the number of elements in the Bloom filter is less
 * than the expected value, the true probability of false positives will be
 * lower.
 *
 * @return expected probability of false positives.
 */
 public double expectedFalsePositiveProbability() {
 return getFalsePositiveProbability(capacity);
 }

 /**
 * Calculate the probability of a false positive given the specified number
 * of inserted elements.
 *
 * @param numberOfElements
 * number of inserted elements.
 * @return probability of a false positive.
 */
 public double getFalsePositiveProbability(double numberOfElements) {
 // (1 - e^(-k * n / m)) ^ k
 return Math.pow((1 - Math.exp(-k * (double) numberOfElements / (double) slotSize)), k);

 }

 /**
 * Get the current probability of a false positive. The probability is
 * calculated from the size of the Bloom filter and the current number of
 * elements added to it.
 *
 * @return probability of false positives.
 */
 public double getFalsePositiveProbability() {
 return getFalsePositiveProbability(size);
 }

 /**
 * Returns the value chosen for K. 
 * 
 * K is the optimal number of hash functions based on the size of the Bloom
 * filter and the expected number of inserted elements.
 *
 * @return optimal k.
 */
 public int getK() {
 return k;
 }

 /**
 * Sets all bits to false in the Bloom filter.
 */
 public void clear() {
 bitset.clear();
 size = 0;
 }

 /**
 * Adds an object to the Bloom filter. The output from the object's
 * toString() method is used as input to the hash functions.
 *
 * @param element
 * is an element to register in the Bloom filter.
 */
 public void add(E element) {
 add(element.toString().getBytes(charset));
 }

 /**
 * Adds an array of bytes to the Bloom filter.
 *
 * @param bytes
 * array of bytes to add to the Bloom filter.
 */
 public void add(byte[] bytes) {
 int[] hashes = createHashes(bytes, k);
 for (int hash : hashes)
 bitset.set(Math.abs(hash % slotSize));
 size++;
 }

 /**
 * Adds all elements from a Collection to the Bloom filter.
 *
 * @param c
 * Collection of elements.
 */
 public void addAll(Collection<? extends E> c) {
 for (E element : c)
 add(element);
 }

 /**
 * Returns true if the element could have been inserted into the Bloom
 * filter. Use getFalsePositiveProbability() to calculate the probability of
 * this being correct.
 *
 * @param element
 * element to check.
 * @return true if the element could have been inserted into the Bloom
 * filter.
 */
 public boolean contains(E element) {
 return contains(element.toString().getBytes(charset));
 }

 /**
 * Returns true if the array of bytes could have been inserted into the
 * Bloom filter. Use getFalsePositiveProbability() to calculate the
 * probability of this being correct.
 *
 * @param bytes
 * array of bytes to check.
 * @return true if the array could have been inserted into the Bloom filter.
 */
 public boolean contains(byte[] bytes) {
 int[] hashes = createHashes(bytes, k);
 for (int hash : hashes) {
 if (!bitset.get(Math.abs(hash % slotSize))) {
 return false;
 }
 }
 return true;
 }

 /**
 * Returns true if all the elements of a Collection could have been inserted
 * into the Bloom filter. Use getFalsePositiveProbability() to calculate the
 * probability of this being correct.
 *
 * @param c
 * elements to check.
 * @return true if all the elements in c could have been inserted into the
 * Bloom filter.
 */
 public boolean containsAll(Collection<? extends E> c) {
 for (E element : c)
 if (!contains(element))
 return false;
 return true;
 }

 /**
 * Read a single bit from the Bloom filter.
 *
 * @param bit
 * the bit to read.
 * @return true if the bit is set, false if it is not.
 */
 public boolean getBit(int bit) {
 return bitset.get(bit);
 }

 /**
 * Set a single bit in the Bloom filter.
 *
 * @param bit
 * is the bit to set.
 * @param value
 * If true, the bit is set. If false, the bit is cleared.
 */
 public void setBit(int bit, boolean value) {
 bitset.set(bit, value);
 }

 /**
 * Return the bit set used to store the Bloom filter.
 *
 * @return bit set representing the Bloom filter.
 */
 public BitSet getBitSet() {
 return bitset;
 }

 /**
 * Returns the number of bits in the Bloom filter. Use count() to retrieve
 * the number of inserted elements.
 *
 * @return the size of the bitset used by the Bloom filter.
 */
 public int slotSize() {
 return slotSize;
 }

 /**
 * Returns the number of elements added to the Bloom filter after it was
 * constructed or after clear() was called.
 *
 * @return number of elements added to the Bloom filter.
 */
 public int size() {
 return size;
 }

 /**
 * Returns the expected number of elements to be inserted into the filter.
 * This value is the same value as the one passed to the constructor.
 *
 * @return expected number of elements.
 */
 public int capacity() {
 return capacity;
 }

 /**
 * Get expected number of bits per element when the Bloom filter is full.
 * This value is set by the constructor when the Bloom filter is created.
 * See also getBitsPerElement().
 *
 * @return expected number of bits per element.
 */
 public double getLoadFactor() {
 return this.loadFactor;
 }
}
② 然后我们对Berkeley DB做一个简单的封装，便于使用。
package craw;

import java.io.File;

import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.persist.EntityCursor;
import com.sleepycat.persist.EntityStore;
import com.sleepycat.persist.PrimaryIndex;
import com.sleepycat.persist.StoreConfig;

public class DBHelper<T> {

 public static final String DEFAULT_DB_DIR = "E:/db/";
 public static final String DEFAULT_Entity_Store = "EntityStore";

 public Environment myEnv;
 public EntityStore store;
 public PrimaryIndex<Long, T> primaryIndex;

 public DBHelper(Class<T> clazz) {
 this(clazz, DEFAULT_DB_DIR, DEFAULT_Entity_Store, false);
 }

 public DBHelper(Class<T> clazz, String dbDir, String storeName, boolean isRead) {
 File dir = new File(dbDir);
 if (!dir.exists()) {
 dir.mkdirs();
 }
 EnvironmentConfig envConfig = new EnvironmentConfig();
 envConfig.setAllowCreate(!isRead);
 // Environment
 myEnv = new Environment(dir, envConfig);
 // StoreConfig
 StoreConfig storeConfig = new StoreConfig();
 storeConfig.setAllowCreate(!isRead);
 // store
 store = new EntityStore(myEnv, storeName, storeConfig);
 // PrimaryIndex
 primaryIndex = store.getPrimaryIndex(Long.class, clazz);
 }

 public void put(T t) {
 primaryIndex.put(t);
 store.sync();
 myEnv.sync();
 }

 public EntityCursor<T> entities() {
 return primaryIndex.entities();
 }

 public T get(long key) {
 return primaryIndex.get(key);
 }

 public void close() {
 if (store != null) {
 store.close();
 }
 if (myEnv != null) {
 myEnv.cleanLog();
 myEnv.close();
 }
 }
}

package craw;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.jsoup.nodes.Document;

public class CrawlerEngine {

 public static final String DEFAULT_SAVE_DIR = "E:/download/";
 private static final long FEEDER_SLEEP_TIME = 10;
 private static final long FEEDER_MAX_WAIT_TIME = 3 * 1000;// 当DB中取不到url时，feeder最长等待时间（即如果等待该时间后，DB还是为空，则feeder结束工作）
 private static final int FEEDER_MAX_WAIT_COUNT = (int) (FEEDER_MAX_WAIT_TIME / FEEDER_SLEEP_TIME);// 当DB中取不到url时，feeder最长等待时间（即如果等待该时间后，DB还是为空，则feeder结束工作）
 private static final boolean LOG = false;

 private BloomQueue<Url> urlQueue;
 private ExecutorService fetcherPool;
 private int fetcherCount;
 private boolean running;
 private DBHelper<Url> dbHelper;
 private JsoupDownloader downloader;
 private String parseRegex;
 private String saveRegex;
 private String saveDir;
 private String saveName;
 private long maxCount = 1000;
 private long startTime;
 private long endTime = Long.MAX_VALUE;

 public CrawlerEngine() {
 this(20, DEFAULT_SAVE_DIR, null);
 }

 public CrawlerEngine(int fetcherCount, String saveDir, String saveName) {
 this.fetcherCount = fetcherCount;
 urlQueue = new BloomQueue<>();
 fetcherPool = Executors.newFixedThreadPool(fetcherCount);
 dbHelper = new DBHelper<>(Url.class);
 downloader = JsoupDownloader.getInstance();
 this.saveDir = saveDir;
 this.saveName = saveName;
 }

 public void startUp(String[] seeds) {
 if (running) {
 return;
 }
 running = true;
 startTime = System.currentTimeMillis();
 for (String seed : seeds) {
 Url url = new Url(seed);
 urlQueue.enqueue(url);
 }
 for (int i = 0; i < fetcherCount; i++) {
 fetcherPool.execute(new Fetcher());
 }
 new Feeder().start();
 }

 public void shutdownNow() {
 running = false;
 fetcherPool.shutdown();
 }

 public void shutdownAtTime(long time) {
 if (time > startTime) {
 endTime = time;
 }
 }

 public void shutdownDelayed(long delayed) {
 shutdownAtTime(startTime + delayed);
 }

 public void shutdownAtCount(long count) {
 maxCount = count;
 }

 private boolean isEnd() {
 return urlQueue.flowedCount() > maxCount || System.currentTimeMillis() > endTime;
 }

 private long currId = 1;
 private int currWaitCount;

 /**
 * 饲养员
 * 
 * 从DB中获取一定数量的url到queue
 * 
 *
 * @author D.K
 *
 */
 private class Feeder extends Thread {
 @Override
 public void run() {
 while (!isEnd() && running && currWaitCount != FEEDER_MAX_WAIT_COUNT) {
 try {
 sleep(FEEDER_SLEEP_TIME);
 if (urlQueue.isFull()) {
 log("Feeder", "队列已满");
 continue;
 }
 Url url = dbHelper.get(currId);
 if (url == null) {
 currWaitCount++;
 log("Feeder", "url为null，currWaitCount = " + currWaitCount);
 } else {
 while (urlQueue.contained(url)) {
 currId++;
 url = dbHelper.get(currId);
 }
 if (url != null) {
 log("Feeder", "url准备入队");
 urlQueue.enqueue(url);
 currId++;
 log("Feeder", "url已经入队，currId = " + currId);
 currWaitCount = 0;
 }
 }
 } catch (Exception e) {
 e.printStackTrace();
 }
 }
 log("Feeder", "执行结束...");
 while (true) {
 try {
 sleep(100);
 log("Feeder", "等待Fetcher结束...");
 } catch (InterruptedException e) {
 }
 if (urlQueue.isEmpty()) {
 shutdownNow();
 System.out.println(">>>>>>>>>>>>爬取结束，共请求了" + urlQueue.flowedCount() + "个页面，用时" + (System.currentTimeMillis() - startTime) + "毫秒<<<<<<<<<<<<");
 return;
 }
 }
 }
 }

 /**
 * 抓取者
 * 
 * 从queue中取出url，下载页面，解析页面，并把解析出的新的url添加到DB中
 * 
 *
 * @author D.K
 *
 */
 private class Fetcher implements Runnable {

 @Override
 public void run() {
 while (!isEnd() && (running || !urlQueue.isEmpty())) {
 log("Fetcher", "开始从队列获取url，size=" + urlQueue.size());
 Url url = urlQueue.dequeue();
 if (url == null) {
 log("Fetcher", "url为null");
 continue;
 }
 log("Fetcher", "取出了url");
 Document doc = downloader.downloadPage(url.getUrl());
 Set<String> urlSet = downloader.parsePage(doc, parseRegex);
 downloader.savePage(doc, saveDir, saveName, saveRegex);
 for (String str : urlSet) {
 Url u = new Url(str);
 if (!urlQueue.contained(u)) {
 dbHelper.put(u);
 }
 }
 }
 }
 }

 private void log(String talker, String content) {
 if (LOG) {
 System.out.println("[" + talker + "] " + content);
 }
 }

 public String getParseRegex() {
 return parseRegex;
 }

 public void setParseRegex(String parseRegex) {
 this.parseRegex = parseRegex;
 }

 public String getSaveRegex() {
 return saveRegex;
 }

 public void setSaveRegex(String saveRegex) {
 this.saveRegex = saveRegex;
 }

 public void setSavePath(String saveDir, String saveName) {
 this.saveDir = saveDir;
 this.saveName = saveName;
 }
}
(2) 测试
 我们采用上一篇的测试例子来做同样的测试，以检验我们优化后的效果。下面是测试代码：
package craw;
public class Client {
 public static void main(String[] args) throws InterruptedException {
 CrawlerEngine crawlerEngine = new CrawlerEngine();
 crawlerEngine.setParseRegex("(http://www.cnblogs.com/artech/p|http://www.cnblogs.com/artech/default|http://www.cnblogs.com/artech/archive/\d{4}/\d{2}/\d{2}/).*");
 crawlerEngine.setSaveRegex("(http://www.cnblogs.com/artech/p|http://www.cnblogs.com/artech/archive/\d{4}/\d{2}/\d{2}/).*");
 crawlerEngine.startUp(new String[] { "http://www.cnblogs.com/artech/" });
 crawlerEngine.shutdownAtCount(1000);
 }
}
下面是运行结果：
(4) 总结
 对比我们上一篇中的测试时间61s，改进后用时14s，效率有明显的提升。
 在下一篇中，我们要对整个代码再次进行小的优化，完善一些细节，如对请求状态码的处理，抽取出一些接口以降低代码之间的耦合度，增强灵活性。
(5) 代码下载
http://files.cnblogs.com/files/walk-the-Line/crawler.zip
坚持，坚持，再坚持。
查看全文

相关阅读:
各种工具网站汇总
 Python中numpy库unique函数解析
 matlab中集合运算函数——解析
 hash算法搜索获得api函数地址的实现，"kernel32.dll", "CreateThread"
PEB及LDR链
 PE文件结构及其加载机制
 spring boot2.0一看就会的超详细快速入门(七)-访问静态资源
 spring boot2.0一看就会的超详细快速入门(六)-集成Thymeleaf模板引擎
 spring boot2.0一看就会的超详细快速入门(五)-开发人员工具devtools
spring boot2.0一看就会的超详细快速入门(四)-自动配置原理

原文地址：https://www.cnblogs.com/walk-the-Line/p/4868400.html