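Introduction
A recent project required scraping information from an old site and importing it into a new system. Nobody maintains the old system any more and its data is scattered, whereas the data to be extracted appears far more uniformly on its web pages, so the plan was to request the pages over the network and parse them. I seem to have done exactly the same thing around this time two years ago. Funny how things come around.
Initial idea
When collecting information, the most troublesome part is usually taking the different pages apart and extracting the data, because page design and structure vary enormously. Some pages also have to be requested in a roundabout way (ajax, iframe, and so on). This makes data extraction the most time-consuming and painful step, because a lot of logic has to be written to string the whole flow together. I vaguely remember thinking about this problem in July 2015, two years ago. Back then I introduced a type called CommonExtractor to solve it. The overall definition looked like this: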
public class CommonExtractor
{
    public CommonExtractor(PageProcessConfig config)
    {
        PageProcessConfig = config;
    }
    protected PageProcessConfig PageProcessConfig;
    public virtual void Extract(CrawledHtmlDocument document)
    {
        // Only handle documents whose URL matches one of the configured patterns
        if (!PageProcessConfig.IncludedUrlPattern.Any(i => Regex.IsMatch(document.FromUrl.ToString(), i)))
            return;
        var node = new WebHtmlNode { Node = document.Contnet.DocumentNode, FromUrl = document.FromUrl };
        ExtractData(node, PageProcessConfig);
    }
    protected Dictionary<string, ExtractionResult> ExtractData(WebHtmlNode node, PageProcessConfig blockConfig)
    {
        var data = new Dictionary<string, ExtractionResult>();
        foreach (var config in blockConfig.DataExtractionConfigs)
        {
            if (node == null)
                continue;
            /* The leading '.' makes the current node the XPath context */
            var selectedNodes = node.Node.SelectNodes("." + config.XPath);
            var result = new ExtractionResult(config, node.FromUrl);
            if (selectedNodes != null && selectedNodes.Any())
            {
                foreach (var sNode in selectedNodes)
                {
                    if (config.Attribute != null)
                    {
                        // Skip nodes that lack the attribute instead of dereferencing null
                        var attribute = sNode.Attributes[config.Attribute];
                        if (attribute != null) result.Fill(attribute.Value);
                    }
                    else
                        result.Fill(sNode.InnerText);
                }
                data[config.Key] = result;
            }
            else { data[config.Key] = null; }
        }
        if (DataExtracted != null)
        {
            var args = new DataExtractedEventArgs(data, node.FromUrl);
            DataExtracted(this, args);
        }
        return data;
    }
    public EventHandler<DataExtractedEventArgs> DataExtracted;
}
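The code is a bit messy (the crawl itself was done with Abot at the time), but the intent is clear: extract useful information from an HTML document, with a configuration describing how to extract it. The main problem with this approach is that it cannot cope with complex structures. Each special structure needs new configuration and a new flow, and that new flow is not very reusable.
Design
A simple start
To cope with real-world complexity, the most basic operation has to be kept simple. Taking a cue from the old code, what we really want from data extraction is:
- give the program an HTML document
- get a value back from the program
That leads to the most basic interface definition: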
public interface IContentProcessor
{
    /// <summary>
    /// Processes the content
    /// </summary>
    /// <param name="source">The content to process</param>
    /// <returns>The processing result</returns>
    object Process(object source);
}
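As a minimal sketch of how small an implementation of this interface can be, here is a hypothetical processor that only trims whitespace. TrimContentProcessor is an illustrative name that does not appear in the original project, and the interface is repeated so the snippet compiles on its own.

```csharp
using System;

public interface IContentProcessor
{
    object Process(object source);
}

// Hypothetical example processor: not part of the original code base.
public class TrimContentProcessor : IContentProcessor
{
    // Returns the trimmed text of the source, or null when there is no source.
    public object Process(object source)
    {
        return source?.ToString().Trim();
    }
}
```

A processor this small does very little on its own, which is exactly why the next section makes processors composable.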
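Composability
With the interface above, a sufficiently large implementation of IContentProcessor could extract data from any HTML page, but it would become less and less reusable and harder and harder to maintain. So we want each implementation to be small. The smaller it is, though, the less it can do, so to meet complex real-world requirements these processors must be composable. The interface therefore gets a new element: sub-processors.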
public interface IContentProcessor
{
    /// <summary>
    /// Processes the content
    /// </summary>
    /// <param name="source">The content to process</param>
    /// <returns>The processing result</returns>
    object Process(object source);
    /// <summary>
    /// Order of this processor; smaller values run earlier
    /// </summary>
    int Order { get; }
    /// <summary>
    /// Sub-processors
    /// </summary>
    IList<IContentProcessor> SubProcessors { get; }
}
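This lets the individual Processors cooperate. Their nesting, together with the Order property, determines the execution order. The whole flow also behaves like a pipeline: the result of one Processor can serve as the source for the next.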
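The base class is not shown at this point in the article, so the following is only a sketch of one way the pipeline behaviour described above could be implemented: sub-processors run in ascending Order, and each one receives the previous result. PipelineProcessor, UpperCaseProcessor, SuffixProcessor and PipelineDemo are hypothetical names used for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface IContentProcessor
{
    object Process(object source);
    int Order { get; }
    IList<IContentProcessor> SubProcessors { get; }
}

// Hypothetical base class: runs the children first, then its own step.
public class PipelineProcessor : IContentProcessor
{
    public int Order { get; set; }
    public IList<IContentProcessor> SubProcessors { get; } = new List<IContentProcessor>();

    public virtual object Process(object source)
    {
        var current = source;
        // Smaller Order runs earlier; each result feeds the next processor.
        foreach (var child in SubProcessors.OrderBy(p => p.Order))
            current = child.Process(current);
        return ProcessElement(current);
    }

    // Leaf behaviour; the default passes the value through unchanged.
    protected virtual object ProcessElement(object element) => element;
}

public class UpperCaseProcessor : PipelineProcessor
{
    protected override object ProcessElement(object element) => element?.ToString().ToUpperInvariant();
}

public class SuffixProcessor : PipelineProcessor
{
    public string Suffix { get; set; }
    protected override object ProcessElement(object element) => element + Suffix;
}

public static class PipelineDemo
{
    public static object Run()
    {
        var root = new PipelineProcessor();
        // Added out of order on purpose; Order decides the execution sequence.
        root.SubProcessors.Add(new SuffixProcessor { Suffix = "!", Order = 2 });
        root.SubProcessors.Add(new UpperCaseProcessor { Order = 1 });
        return root.Process("hello");
    }
}
```

Calling PipelineDemo.Run() upper-cases the text first and appends the suffix second, giving "HELLO!", even though the suffix processor was added first.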
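Composable results
The processing flow is now composable, but the results are not yet, so complex output structures still cannot be handled. To solve this, IContentCollector is introduced. It inherits from IContentProcessor and adds one extra requirement: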
public interface IContentCollector : IContentProcessor
{
    /// <summary>
    /// Key under which the value gathered by this collector is stored
    /// </summary>
    string Key { get; }
}
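The interface requires a Key that identifies the result. With it, a Dictionary<string,object> can hold a complex structure, because the value of a dictionary entry can itself be a Dictionary<string,object>. If JSON is used for serialization, the result can then easily be deserialized into a complex class.
As for why this interface inherits from IContentProcessor: it keeps the node types uniform, which makes it easy to build the whole processing flow from configuration.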
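To make the serialization point concrete, here is a small sketch of a nested Dictionary<string,object> being serialized and then deserialized into typed classes. It assumes the Newtonsoft.Json package; the Article and Author classes, CollectorResultDemo and the sample values are invented for illustration.

```csharp
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical target types for the collected result.
public class Author
{
    public string Name { get; set; }
    public string Url { get; set; }
}

public class Article
{
    public string Title { get; set; }
    public Author Author { get; set; }
}

public static class CollectorResultDemo
{
    public static Article Run()
    {
        // Shape a collector tree could produce: each key comes from a collector's Key.
        var collected = new Dictionary<string, object>
        {
            ["Title"] = "Sample title",
            ["Author"] = new Dictionary<string, object>
            {
                ["Name"] = "Sample author",
                ["Url"] = "/u/sample"
            }
        };

        // Round-trip through JSON to obtain a strongly typed object.
        string json = JsonConvert.SerializeObject(collected);
        return JsonConvert.DeserializeObject<Article>(json);
    }
}
```

The nested dictionary becomes a nested JSON object, so the Author property is populated without any hand-written mapping code.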
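Configuration
As the design above shows, the whole processing flow is really a tree with a very regular structure, which makes it feasible to describe in configuration. A ContentProcessorOptions type represents the type of each Processor node and the information needed to initialize it. It is defined as follows: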
public class ContentProcessorOptions
{
    /// <summary>
    /// Property values used to construct the Processor
    /// </summary>
    public Dictionary<string, object> Properties { get; set; } = new Dictionary<string, object>();
    /// <summary>
    /// Type name of the Processor
    /// </summary>
    public string ProcessorType { get; set; }
    /// <summary>
    /// A single sub-Processor type, used to initialize Children quickly and reduce nesting
    /// </summary>
    public string SubProcessorType { get; set; }
    /// <summary>
    /// Configuration of the child nodes
    /// </summary>
    public List<ContentProcessorOptions> Children { get; set; } = new List<ContentProcessorOptions>();
}
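Options includes a SubProcessorType property for quickly initializing a ContentCollector that has a single child processing node. This reduces the nesting depth of the configuration and keeps the configuration file clearer. The following method shows how a Processor is initialized from a ContentProcessorOptions. It uses reflection, but since initialization does not happen often this should not be much of a problem.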
public static IContentProcessor BuildContentProcessor(ContentProcessorOptions contentProcessorOptions)
{
    Type instanceType = null;
    try
    {
        // Works when the name is assembly-qualified or the type lives in the calling assembly
        instanceType = Type.GetType(contentProcessorOptions.ProcessorType, true);
    }
    catch
    {
        // Fall back to searching every loaded assembly by full type name
        foreach (var assembly in AppDomain.CurrentDomain.GetAssemblies())
        {
            if (assembly.IsDynamic) continue;
            instanceType = assembly.GetExportedTypes()
                .FirstOrDefault(i => i.FullName == contentProcessorOptions.ProcessorType);
            if (instanceType != null) break;
        }
    }
    if (instanceType == null) return null;
    var instance = Activator.CreateInstance(instanceType);
    // Copy each configured value onto the property of the same name, if there is one
    foreach (var property in contentProcessorOptions.Properties)
    {
        var instanceProperty = instance.GetType().GetProperty(property.Key);
        if (instanceProperty == null) continue;
        var propertyType = instanceProperty.PropertyType;
        var sourceValue = property.Value.ToString();
        var dValue = sourceValue.Convert(propertyType);
        instanceProperty.SetValue(instance, dValue);
    }
    var processorInstance = (IContentProcessor) instance;
    if (!contentProcessorOptions.SubProcessorType.IsNullOrWhiteSpace())
    {
        // Shorthand: the single sub-processor receives the same Properties as its parent
        var quickOptions = new ContentProcessorOptions
        {
            ProcessorType = contentProcessorOptions.SubProcessorType,
            Properties = contentProcessorOptions.Properties
        };
        var quickProcessor = BuildContentProcessor(quickOptions);
        processorInstance.SubProcessors.Add(quickProcessor);
    }
    // Build the explicitly configured children recursively
    foreach (var processorOption in contentProcessorOptions.Children)
    {
        var processor = BuildContentProcessor(processorOption);
        processorInstance.SubProcessors.Add(processor);
    }
    return processorInstance;
}
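The method above relies on a Convert extension method that is not shown in the article. As a sketch of what such a conversion might look like, the helper below turns a configuration string into the property's type and assigns it by reflection. PropertyBinder, ConvertTo, Apply, SampleMode and SampleOptions are hypothetical names; the real extension may behave differently.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public enum SampleMode { InnerText, OuterHtml }

// Hypothetical target with a few differently typed properties.
public class SampleOptions
{
    public int Order { get; set; }
    public int? NodeIndex { get; set; }
    public SampleMode Mode { get; set; }
    public string Xpath { get; set; }
}

public static class PropertyBinder
{
    // Converts a configuration string to the requested type.
    public static object ConvertTo(string value, Type targetType)
    {
        // Unwrap Nullable<T> so that "3" can be assigned to an int? property.
        var type = Nullable.GetUnderlyingType(targetType) ?? targetType;
        if (type.IsEnum) return Enum.Parse(type, value, true);
        return Convert.ChangeType(value, type, CultureInfo.InvariantCulture);
    }

    // Copies every matching entry onto a writable property of the instance.
    public static void Apply(object instance, IDictionary<string, string> properties)
    {
        foreach (var pair in properties)
        {
            var property = instance.GetType().GetProperty(pair.Key);
            if (property == null || !property.CanWrite) continue; // unknown keys are skipped
            property.SetValue(instance, ConvertTo(pair.Value, property.PropertyType));
        }
    }
}
```

Because unknown keys are skipped, one Properties dictionary can be shared between a collector and its shorthand sub-processor: each instance simply picks up the entries it has properties for.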
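Some constraints
Collections must converge
An example illustrates the problem. Suppose n p tags are extracted from an HTML document, giving a string[], which is passed as the source to the next processing node. That node handles each string correctly, but if it also returns a string[] for each string, that string[] should be joined by a Connector. Otherwise the result becomes a two-dimensional, three-dimensional or even higher-dimensional array, and the logic of every node becomes complicated and hard to control. So collections have to converge to a single dimension.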
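A minimal sketch of this convergence rule, under the assumption that a node is applied to each element of its source and that a connector string joins any collection produced for a single element. Convergence, ConvergingMap and its parameters are hypothetical names, not taken from the project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Convergence
{
    // Applies step to every element and keeps the overall result one-dimensional.
    public static List<string> ConvergingMap(
        IEnumerable<string> source,
        Func<string, IEnumerable<string>> step,
        string connector)
    {
        var output = new List<string>();
        foreach (var element in source)
        {
            // A step may yield several values for one element; join them
            // so the element still maps to exactly one string.
            var parts = step(element).ToList();
            output.Add(string.Join(connector, parts));
        }
        return output;
    }
}
```

With the source ["a b", "c"], a step that splits on spaces and the connector ";", the result is ["a;b", "c"]: a flat list of the same length as the source rather than a nested array.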
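Properties in the configuration file do not support complex structures
With the .NET Core configuration system used here, an entry of a Dictionary<string,object> cannot be set to a collection.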
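Some implementations
Implementing and testing Processors
HttpRequestContentProcessor
This processor downloads a piece of HTML text from the network and passes it on as the source for the next processor. The request URL can either be specified explicitly or taken from the source passed in by the previous node. The implementation: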
public class HttpRequestContentProcessor : BaseContentProcessor
{
    public bool UseUrlWhenSourceIsNull { get; set; } = true;
    public string Url { get; set; }
    public bool IgnoreBadUri { get; set; }
    protected override object ProcessElement(object element)
    {
        if (element == null) return null;
        // Reject the element only when it is NOT a well-formed absolute URL
        if (!Uri.IsWellFormedUriString(element.ToString(), UriKind.Absolute))
        {
            if (IgnoreBadUri) return null;
            throw new FormatException($"The address to request, {element}, is not well formed");
        }
        return DownloadHtml(element.ToString());
    }
    public override object Process(object source)
    {
        // With no incoming source, fall back to the configured Url
        if (source == null && UseUrlWhenSourceIsNull && !Url.IsNullOrWhiteSpace())
            return DownloadHtml(Url);
        return base.Process(source);
    }
    private static async Task<string> DownloadHtmlAsync(string url)
    {
        using (var client = new HttpClient())
        {
            var result = await client.GetAsync(url);
            var html = await result.Content.ReadAsStringAsync();
            return html;
        }
    }
    private string DownloadHtml(string url)
    {
        return AsyncHelper.Synchronize(() => DownloadHtmlAsync(url));
    }
}
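AsyncHelper.Synchronize is not shown in the article. One common way to write such a helper is sketched below; it is an assumption about its behaviour, not the project's actual code, and AsyncHelperSketch is a hypothetical name.

```csharp
using System;
using System.Threading.Tasks;

public static class AsyncHelperSketch
{
    // Runs an async function to completion and returns its result synchronously.
    public static T Synchronize<T>(Func<Task<T>> asyncFunc)
    {
        // Task.Run moves the work to the thread pool, which avoids deadlocks
        // when the caller holds a synchronization context. GetAwaiter().GetResult()
        // rethrows the original exception instead of an AggregateException.
        return Task.Run(() => asyncFunc()).GetAwaiter().GetResult();
    }
}
```

Blocking like this keeps the Process signature synchronous, at the cost of tying up the calling thread while the download runs.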
The test:
[TestMethod]
public void HttpRequestContentProcessorTest()
{
    var processor = new HttpRequestContentProcessor {Url = "https://www.baidu.com"};
    var result = processor.Process(null);
    Assert.IsTrue(result.ToString().Contains("baidu"));
}
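XpathContentProcessor
This processor takes an XPath expression and uses it to fetch the specified information. ValueProviderType and ValueProviderKey specify how the data is read from a node. The implementation: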
public class XpathContentProcessor : BaseContentProcessor
{
    /// <summary>
    /// XPath of the elements to select
    /// </summary>
    public string Xpath { get; set; }
    /// <summary>
    /// Key passed to the value provider (for example, an attribute name)
    /// </summary>
    public string ValueProviderKey { get; set; }
    /// <summary>
    /// Type of the value provider
    /// </summary>
    public XpathNodeValueProviderType ValueProviderType { get; set; }
    /// <summary>
    /// Index of the node
    /// </summary>
    public int? NodeIndex { get; set; }
    /// <summary>
    /// Separator used to join the values of multiple nodes into one string
    /// </summary>
    public string ResultConnector { get; set; } = Constants.DefaultResultConnector;
    public override object Process(object source)
    {
        var result = base.Process(source);
        return DeterminAndReturn(result);
    }
    protected override object ProcessElement(object element)
    {
        var result = base.ProcessElement(element);
        if (result == null) return null;
        var str = result.ToString();
        return ProcessWithXpath(str, Xpath, false);
    }
    protected object ProcessWithXpath(string documentText, string xpath, bool returnArray)
    {
        if (documentText == null) return null;
        var document = new HtmlDocument();
        document.LoadHtml(documentText);
        var nodes = document.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return null;
        if (returnArray && nodes.Count > 1)
        {
            // Return one value per matched node
            var result = new List<string>();
            foreach (var node in nodes)
            {
                var nodeResult = Helper.GetValueFromHtmlNode(node, ValueProviderType, ValueProviderKey);
                if (!nodeResult.IsNullOrWhiteSpace())
                {
                    result.Add(nodeResult);
                }
            }
            return result;
        }
        else
        {
            // Converge all matched nodes into a single string joined by ResultConnector
            var result = string.Empty;
            foreach (var node in nodes)
            {
                var nodeResult = Helper.GetValueFromHtmlNode(node, ValueProviderType, ValueProviderKey);
                if (!nodeResult.IsNullOrWhiteSpace())
                {
                    if (result.IsNullOrWhiteSpace()) result = nodeResult;
                    else result = $"{result}{ResultConnector}{nodeResult}";
                }
            }
            return result;
        }
    }
}
Combining this Processor with the previous one, we can fetch the title of the Baidu home page:
[TestMethod]
public void XpathContentProcessorTest()
{
    var xpathProcessor = new XpathContentProcessor
    {
        Xpath = "//title",
        ValueProviderType = XpathNodeValueProviderType.InnerText
    };
    var processor = new HttpRequestContentProcessor { Url = "https://www.baidu.com" };
    xpathProcessor.SubProcessors.Add(processor);
    var result = xpathProcessor.Process(null);
    Assert.AreEqual("百度一下,你就知道", result.ToString());
}
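Implementing and testing Collectors
The main job of a Collector is to handle complex output models. A Collector for a complex data structure can be implemented as follows: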
public class ComplexContentCollector : BaseContentCollector
{
    /// <summary>
    /// ComplexContentCollector needs each child extractor to supply a Key,
    /// so sub-processors that are not collectors are ignored
    /// </summary>
    /// <param name="source">The content to process</param>
    /// <returns>A dictionary with one entry per child collector</returns>
    protected override object ProcessElement(object source)
    {
        var result = new Dictionary<string, object>();
        foreach (var contentCollector in SubProcessors.OfType<IContentCollector>())
        {
            result[contentCollector.Key] = contentCollector.Process(source);
        }
        return result;
    }
}
The corresponding test:
[TestMethod]
public void ComplexContentCollectorTest2()
{
    var xpathProcessor = new XpathContentProcessor
    {
        Xpath = "//title",
        ValueProviderType = XpathNodeValueProviderType.InnerText
    };
    var xpathProcessor2 = new XpathContentProcessor
    {
        Xpath = "//p[@id='cp']",
        ValueProviderType = XpathNodeValueProviderType.InnerText,
        Order = 1
    };
    var processor = new HttpRequestContentProcessor {Url = "https://www.baidu.com", Order = -1};
    var complexCollector = new ComplexContentCollector();
    var baseCollector = new BaseContentCollector();
    baseCollector.SubProcessors.Add(processor);
    baseCollector.SubProcessors.Add(complexCollector);
    var titleCollector = new BaseContentCollector{Key = "Title"};
    titleCollector.SubProcessors.Add(xpathProcessor);
    var footerCollector = new BaseContentCollector {Key = "Footer"};
    footerCollector.SubProcessors.Add(xpathProcessor2);
    footerCollector.SubProcessors.Add(new HtmlCleanupContentProcessor{Order = 3});
    complexCollector.SubProcessors.Add(titleCollector);
    complexCollector.SubProcessors.Add(footerCollector);
    var result = (Dictionary<string,object>)baseCollector.Process(null);
    Assert.AreEqual("百度一下,你就知道", result["Title"]);
    Assert.AreEqual("©2014 Baidu 使用百度前必读 京ICP证030173号", result["Footer"]);
}
Using configuration for slightly more complex cases
Now test with the following code:
public void RunConfig(string section)
{
    var builder = new ConfigurationBuilder()
        .SetBasePath(AppDomain.CurrentDomain.BaseDirectory)
        .AddJsonFile("appsettings1.json");
    var configurationRoot = builder.Build();
    var options = configurationRoot.GetSection(section).Get<ContentProcessorOptions>();
    var processor = Helper.BuildContentProcessor(options);
    var result = processor.Process(null);
    var json = JsonConvert.SerializeObject(result);
    System.Console.WriteLine(json);
}
Fetching the titles of the cnblogs news list
The configuration used:
"newsListOptions": {
  "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
  "Properties": {},
  "Children": [
    {
      "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
      "Properties": {
        "Url": "https://www.cnblogs.com/news/",
        "Order": "0"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
      "Properties": {
        "Xpath": "//div[@class='post_item']",
        "Order": "1",
        "ValueProviderType": "OuterHtml",
        "OutputToArray": true
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentCollector.ComplexContentCollector",
      "Properties": {
        "Order": "2"
      },
      "Children": [
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//a[@class='titlelnk']",
            "Key": "Url",
            "ValueProviderType": "Attribute",
            "ValueProviderKey": "href"
          }
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//span[@class='article_comment']",
            "Key": "CommentCount",
            "ValueProviderType": "InnerText",
            "Order": "0"
          },
          "Children": [
            {
              "ProcessorType": "IC.Robot.ContentProcessor.RegexMatchContentProcessor",
              "Properties": {
                "RegexPartten": "[0-9]+",
                "Order": "1"
              }
            }
          ]
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//*[@class='digg']//span",
            "Key": "LikeCount",
            "ValueProviderType": "InnerText"
          }
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//a[@class='titlelnk']",
            "Key": "Title",
            "ValueProviderType": "InnerText"
          }
        }
      ]
    }
  ]
},
The result:
[
  {
    "Url": "//news.cnblogs.com/n/574269/",
    "CommentCount": "1",
    "LikeCount": "3",
    "Title": "刘强东:京东13年了,真正懂我们的人还是很少"
  },
  {
    "Url": "//news.cnblogs.com/n/574267/",
    "CommentCount": "0",
    "LikeCount": "0",
    "Title": "联想也开始大谈人工智能,不过它最迫切的目标是卖更多PC"
  },
  {
    "Url": "//news.cnblogs.com/n/574266/",
    "CommentCount": "0",
    "LikeCount": "0",
    "Title": "除了小米1几乎都支持 - 小米MIUI9升级机型一览"
  },
  ...
]
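Fetching the details of the most-commented news item in the list
This involves computation and collection operations, and the collection elements are dictionaries, so two new Processors are needed: one for filtering and one for mapping.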
public class ListItemPickContentProcessor : BaseContentProcessor
{
    public string Key { get; set; }
    /// <summary>
    /// Full name of the type used for the comparison
    /// </summary>
    public string OperatorTypeFullName { get; set; }
    /// <summary>
    /// Value to compare against
    /// </summary>
    public string OperatorValue { get; set; }
    /// <summary>
    /// Position used when PickMode is Index
    /// </summary>
    public int Index { get; set; }
    /// <summary>
    /// Pick mode
    /// </summary>
    public ListItemPickMode PickMode { get; set; }
    /// <summary>
    /// Operator
    /// </summary>
    public ListItemPickOperator PickOperator { get; set; }
    public override object Process(object source)
    {
        var preResult = base.Process(source);
        if (!Helper.IsEnumerableExceptString(preResult))
        {
            // A single dictionary: return the value stored under Key
            if (preResult is Dictionary<string, object> dictionary)
                return dictionary[Key];
            return preResult;
        }
        // Pick from the collection produced by the sub-processors
        return Pick((IEnumerable) preResult);
    }
    private object Pick(IEnumerable source)
    {
        var objCollection = source.Cast<object>().ToList();
        if (objCollection.Count == 0)
            return objCollection;
        var item = objCollection[0];
        // Maps each item to the comparable value it is ordered by
        var compareDictionary = new Dictionary<object, IComparable>();
        if (item is IDictionary)
        {
            foreach (Dictionary<string, object> dic in objCollection)
            {
                var key = (IComparable) dic[Key].ToString().Convert(ResolveType(OperatorTypeFullName));
                compareDictionary.Add(dic, key);
            }
        }
        else
        {
            foreach (var objItem in objCollection)
            {
                var key = (IComparable) objItem.ToString().Convert(ResolveType(OperatorTypeFullName));
                compareDictionary.Add(objItem, key);
            }
        }
        IEnumerable<object> result;
        switch (PickOperator)
        {
            case ListItemPickOperator.OrderDesc:
                result = compareDictionary.OrderByDescending(i => i.Value).Select(i => i.Key);
                break;
            default: throw new NotSupportedException();
        }
        switch (PickMode)
        {
            case ListItemPickMode.First:
                return result.FirstOrDefault();
            case ListItemPickMode.Last:
                return result.LastOrDefault();
            case ListItemPickMode.Index:
                // Index is treated as 1-based here
                return result.Skip(Index - 1).Take(1).FirstOrDefault();
            default:
                throw new NotImplementedException();
        }
    }
    private Type ResolveType(string typeName)
    {
        // Only Int32 comparisons are supported for now
        if (typeName == typeof(Int32).FullName)
            return typeof(Int32);
        throw new NotSupportedException();
    }
    public enum ListItemPickMode
    {
        First,
        Last,
        Index
    }
    public enum ListItemPickOperator
    {
        LittleThan,
        GreaterThan,
        Order,
        OrderDesc
    }
}
Quite a bit of reflection is used here, but performance is not a concern for now.
public class DictionaryPickContentProcessor : BaseContentProcessor
{
    public string Key { get; set; }
    protected override object ProcessElement(object element)
    {
        if (element is IDictionary)
        {
            return (element as IDictionary)[Key];
        }
        return element;
    }
}
This Processor extracts a single entry from a dictionary.
The configuration used:
"mostCommentsOptions": {
  "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
  "Properties": {},
  "Children": [
    {
      "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
      "Properties": {
        "Url": "https://www.cnblogs.com/news/",
        "Order": "0"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
      "Properties": {
        "Xpath": "//div[@class='post_item']",
        "Order": "1",
        "ValueProviderType": "OuterHtml",
        "OutputToArray": true
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentCollector.ComplexContentCollector",
      "Properties": {
        "Order": "2"
      },
      "Children": [
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//a[@class='titlelnk']",
            "Key": "Url",
            "ValueProviderType": "Attribute",
            "ValueProviderKey": "href"
          }
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//span[@class='article_comment']",
            "Key": "CommentCount",
            "ValueProviderType": "InnerText",
            "Order": "0"
          },
          "Children": [
            {
              "ProcessorType": "IC.Robot.ContentProcessor.RegexMatchContentProcessor",
              "Properties": {
                "RegexPartten": "[0-9]+",
                "Order": "1"
              }
            }
          ]
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//*[@class='digg']//span",
            "Key": "LikeCount",
            "ValueProviderType": "InnerText"
          }
        },
        {
          "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
          "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
          "Properties": {
            "Xpath": "//a[@class='titlelnk']",
            "Key": "Title",
            "ValueProviderType": "InnerText"
          }
        }
      ]
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.ListItemPickContentProcessor",
      "Properties": {
        "OperatorTypeFullName": "System.Int32",
        "Key": "CommentCount",
        "PickMode": "First",
        "PickOperator": "OrderDesc",
        "Order": "4"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.DictionaryPickContentProcessor",
      "Properties": {
        "Order": "5",
        "Key": "Url"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.FormatterContentProcessor",
      "Properties": {
        "Formatter": "https:{0}",
        "Order": "6"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
      "Properties": {
        "Order": "7"
      }
    },
    {
      "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
      "Properties": {
        "Xpath": "//div[@id='news_content']//p[2]",
        "Order": "8",
        "ValueProviderType": "InnerHtml",
        "OutputToArray": false
      }
    }
  ]
}
The result:
昨日,京东突然通知平台商户,将关闭天天快递服务接口。这意味着京东平台上的商户以后不能再用天天快递发货了。
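For comparison, the selection part of this configuration (pick the item with the highest CommentCount, take its Url, then format it) corresponds roughly to the following hand-written LINQ. The sample data reuses values from the list result shown earlier; MostCommentedDemo is a hypothetical name and the snippet does not call any of the project's processors.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MostCommentedDemo
{
    public static string Run()
    {
        var items = new List<Dictionary<string, object>>
        {
            new Dictionary<string, object> { ["Url"] = "//news.cnblogs.com/n/574269/", ["CommentCount"] = "1" },
            new Dictionary<string, object> { ["Url"] = "//news.cnblogs.com/n/574267/", ["CommentCount"] = "0" },
            new Dictionary<string, object> { ["Url"] = "//news.cnblogs.com/n/574266/", ["CommentCount"] = "0" }
        };

        // ListItemPickContentProcessor: OrderDesc on CommentCount (as Int32), PickMode First.
        var top = items
            .OrderByDescending(i => int.Parse(i["CommentCount"].ToString()))
            .First();

        // DictionaryPickContentProcessor: take the value stored under "Url".
        var url = top["Url"].ToString();

        // FormatterContentProcessor: the scraped links are protocol-relative.
        return string.Format("https:{0}", url);
    }
}
```

The hand-written version is shorter, but the configured pipeline can be changed without recompiling, which is the trade-off this design is making.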
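Possible improvements
- A GUI is needed for editing the configuration; the current configuration format is really not user-friendly.
- A scheduler needs to be introduced to handle Processor scheduling (depth-first, breadth-first, and so on).
- Constraints on the dependencies between the scheduled nodes (for example, the collection-convergence issue) need to be expressed at the code level, so that they guide the configuration better.
- The rules are not yet consistent enough, for example when a returned collection should be constrained and when it should not.
Writing code is still fun, isn't it?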