  • Writing a Configurable Web Page Information Extraction Component

    Introduction

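    A recent project required scraping information from an old site and importing it into a new system. The old system is no longer maintained and its data is scattered, while the data we need is presented far more uniformly on its web pages, so the plan was to fetch the pages over HTTP and extract the data by parsing them. Two years ago, around this same time of year, I seem to have done the very same thing. Funny how these things come around.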

    The Idea

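    When collecting information, the most troublesome part is usually taking different pages apart and extracting their data, because page designs and structures vary enormously. On top of that, some pages can only be reached in a roundabout way (ajax, iframes, and so on). This makes data extraction the most time-consuming and painful step, since a large amount of logic has to be written to string the whole flow together. I vaguely remember thinking about this problem in July 2015, two years ago around this time. Back then I introduced a type named CommonExtractor to address it. The overall definition looked like this: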

        public class CommonExtractor
        {
            public CommonExtractor(PageProcessConfig config)
            {
                PageProcessConfig = config;
            }
    
            protected PageProcessConfig PageProcessConfig;
    
            public virtual void Extract(CrawledHtmlDocument document)
            {
                if (!PageProcessConfig.IncludedUrlPattern.Any(i => Regex.IsMatch(document.FromUrl.ToString(), i)))
                    return;
                var node = new WebHtmlNode { Node = document.Contnet.DocumentNode, FromUrl = document.FromUrl };
                ExtractData(node, PageProcessConfig);
            }
    
            protected Dictionary<string, ExtractionResult> ExtractData(WebHtmlNode node, PageProcessConfig blockConfig)
            {
    
                var data = new Dictionary<string, ExtractionResult>();
                foreach (var config in blockConfig.DataExtractionConfigs)
                {
                    if (node == null)
                        continue;
                    /* Prefix '.' so the XPath is evaluated relative to the current node */
                    var selectedNodes = node.Node.SelectNodes("." + config.XPath);
                    var result = new ExtractionResult(config, node.FromUrl);
                    if (selectedNodes != null && selectedNodes.Any())
                    {
                        foreach (var sNode in selectedNodes)
                        {
                            if (config.Attribute != null)
                                result.Fill(sNode.Attributes[config.Attribute].Value);
                            else
                                result.Fill(sNode.InnerText);
                        }
                        data[config.Key] = result;
                    }
                    else { data[config.Key] = null; }
                }
    
                if (DataExtracted != null)
                {
                    var args = new DataExtractedEventArgs(data, node.FromUrl);
                    DataExtracted(this, args);
                }
    
                return data;
            }
    
            public event EventHandler<DataExtractedEventArgs> DataExtracted;
        }
    

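    The code is a bit messy (it was using Abot for crawling at the time), but the intent is clear enough: extract useful information from an HTML document, with a configuration specifying how to extract it. The main problem with this approach is that it cannot cope with complex structures. Handling any particular structure means introducing new configuration and a new flow, and that new flow is not very reusable.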

    Design

    A Simple Start

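    To cope with real-world complexity, the most basic building block has to be simple. Taking inspiration from the earlier code, what we actually want from data extraction is:

    • We give the program an HTML document
    • The program returns a value to us

    This leads to the most basic interface definition: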

        public interface IContentProcessor
        {
            /// <summary>
            /// Processes the content.
            /// </summary>
            /// <param name="source"></param>
            /// <returns></returns>
            object Process(object source);
        }
    

    Composability

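    With the interface above, a sufficiently large implementation of IContentProcessor could extract data from any HTML page. But the larger the implementation, the less reusable it is and the harder it is to maintain, so we want each implementation to be small. The smaller it is, however, the less it can do, so to meet complex real-world requirements these processors must be composable. The interface therefore gains a new element: sub-processors.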

        public interface IContentProcessor
        {
            /// <summary>
            /// Processes the content.
            /// </summary>
            /// <param name="source"></param>
            /// <returns></returns>
            object Process(object source);
    
            /// <summary>
            /// Order of this processor; lower values run first.
            /// </summary>
            int Order { get; }
    
            /// <summary>
            /// Sub-processors.
            /// </summary>
            IList<IContentProcessor> SubProcessors { get; }
        }
    

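    This lets processors cooperate. Their nesting relationship and their Order property together determine the execution order. The whole flow also behaves like a pipeline: the result of one processor can serve as the source for the next.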
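
    The article does not show the base class that drives this pipeline, so the following is only a minimal sketch under assumptions: PipelineProcessor and FuncProcessor are hypothetical names, and the real BaseContentProcessor may behave differently. It illustrates sub-processors running in ascending Order, each receiving the previous result as its source.

    ```csharp
    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;

    public interface IContentProcessor
    {
        object Process(object source);
        int Order { get; }
        IList<IContentProcessor> SubProcessors { get; }
    }

    // Hypothetical base class (an assumption, not the article's BaseContentProcessor).
    public class PipelineProcessor : IContentProcessor
    {
        public int Order { get; set; }
        public IList<IContentProcessor> SubProcessors { get; } = new List<IContentProcessor>();

        public virtual object Process(object source)
        {
            // A non-string collection is processed element by element.
            if (source is IEnumerable items && !(source is string))
                return items.Cast<object>().Select(e => ProcessElement(e)).ToList();
            return ProcessElement(source);
        }

        protected virtual object ProcessElement(object element)
        {
            // Run sub-processors in ascending Order, piping each result into the next.
            var current = element;
            foreach (var sub in SubProcessors.OrderBy(p => p.Order))
                current = sub.Process(current);
            return current;
        }
    }

    // Hypothetical leaf processor used only for this demonstration.
    public class FuncProcessor : PipelineProcessor
    {
        private readonly Func<object, object> _func;

        public FuncProcessor(Func<object, object> func, int order)
        {
            _func = func;
            Order = order;
        }

        // Sub-processors (if any) run first, then this processor's own logic.
        protected override object ProcessElement(object element) => _func(base.ProcessElement(element));
    }

    public static class PipelineDemo
    {
        public static object Run()
        {
            var root = new PipelineProcessor();
            // Added out of order on purpose: Order, not insertion order, decides the sequence.
            root.SubProcessors.Add(new FuncProcessor(s => (string)s + "c", order: 2));
            root.SubProcessors.Add(new FuncProcessor(s => (string)s + "b", order: 1));
            return root.Process("a");
        }
    }
    ```

    Here PipelineDemo.Run() appends "b" before "c" even though the processors were added in the opposite order.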

    Composability of Results

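    The processing flow is now composable, but the results are not yet: they still cannot represent complex structures. To solve this, IContentCollector is introduced. It inherits from IContentProcessor and adds one requirement: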

        public interface IContentCollector : IContentProcessor
        {
            /// <summary>
            /// Key under which the value gathered by this collector is stored.
            /// </summary>
            string Key { get; }
        }
    

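    The interface requires a Key that identifies the result. With it, a complex structure can be held in a Dictionary<string,object>, because a dictionary value can itself be a Dictionary<string,object>. If JSON is used for serialization, the result can then easily be deserialized into a complex class.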
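
    As a rough illustration of that round trip, the sketch below serializes a nested dictionary and reads it back as a typed object. Article, Author and the sample values are hypothetical. System.Text.Json (available from .NET Core 3.0) is used only to keep the example dependency-free; the article's own code uses Json.NET's JsonConvert.

    ```csharp
    using System.Collections.Generic;
    using System.Text.Json;

    // Hypothetical target classes, introduced only for this illustration.
    public class Author
    {
        public string Name { get; set; }
    }

    public class Article
    {
        public string Title { get; set; }
        public Author Author { get; set; }
    }

    public static class NestedResultDemo
    {
        public static Article Run()
        {
            // The kind of result a tree of collectors could produce:
            // each Key maps to a value, and a value may itself be a dictionary.
            var result = new Dictionary<string, object>
            {
                ["Title"] = "Hello",
                ["Author"] = new Dictionary<string, object> { ["Name"] = "Alice" }
            };

            // Serialize the nested dictionaries, then deserialize into the typed model.
            string json = JsonSerializer.Serialize(result);
            return JsonSerializer.Deserialize<Article>(json);
        }
    }
    ```

    The dictionary keys have to match the property names of the target class for the mapping to work.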

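    As for why this interface inherits from IContentProcessor: it keeps the node types uniform, which makes it easier to build the whole processing flow from configuration.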

    Configuration

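    As the design above shows, the whole processing flow is really a tree with a very regular structure, which makes configuration feasible. A ContentProcessorOptions type represents each processor node's type and the initialization data it needs. It is defined as follows: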

        public class ContentProcessorOptions
        {
            /// <summary>
            /// Properties used to construct the processor.
            /// </summary>
            public Dictionary<string, object> Properties { get; set; } = new Dictionary<string, object>();
    
            /// <summary>
            /// Type name of the processor.
            /// </summary>
            public string ProcessorType { get; set; }
    
            /// <summary>
            /// Specifies a single sub-processor type to quickly initialize Children and reduce nesting.
            /// </summary>
            public string SubProcessorType { get; set; }
    
            /// <summary>
            /// Child options.
            /// </summary>
            public List<ContentProcessorOptions> Children { get; set; } = new List<ContentProcessorOptions>();
        }
    

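    SubProcessorType lets a ContentCollector that has only one child processing node be initialized quickly. This reduces the nesting depth of the configuration and keeps the configuration file clearer. The following method shows how a processor is initialized from a ContentProcessorOptions. It uses reflection, but since initialization is infrequent this should not be much of a problem.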

            public static IContentProcessor BuildContentProcessor(ContentProcessorOptions contentProcessorOptions)
            {
                Type instanceType = null;
                try
                {
                    instanceType = Type.GetType(contentProcessorOptions.ProcessorType, true);
                }
                catch
                {
                    foreach (var assembly in AppDomain.CurrentDomain.GetAssemblies())
                    {
                        if (assembly.IsDynamic) continue;
                        instanceType = assembly.GetExportedTypes()
                            .FirstOrDefault(i => i.FullName == contentProcessorOptions.ProcessorType);
                        if (instanceType != null) break;
                    }
                }
    
                if (instanceType == null) return null;
    
                var instance = Activator.CreateInstance(instanceType);
                foreach (var property in contentProcessorOptions.Properties)
                {
                    var instanceProperty = instance.GetType().GetProperty(property.Key);
                    if (instanceProperty == null) continue;
                    var propertyType = instanceProperty.PropertyType;
                    var sourceValue = property.Value.ToString();
                    var dValue = sourceValue.Convert(propertyType);
                    instanceProperty.SetValue(instance, dValue);
                }
                var processorInstance = (IContentProcessor) instance;
                if (!contentProcessorOptions.SubProcessorType.IsNullOrWhiteSpace())
                {
                    var quickOptions = new ContentProcessorOptions
                    {
                        ProcessorType = contentProcessorOptions.SubProcessorType,
                        Properties = contentProcessorOptions.Properties
                    };
                    var quickProcessor = BuildContentProcessor(quickOptions);
                    processorInstance.SubProcessors.Add(quickProcessor);
                }
                foreach (var processorOption in contentProcessorOptions.Children)
                {
                    var processor = BuildContentProcessor(processorOption);
                    processorInstance.SubProcessors.Add(processor);
                }
                return processorInstance;
            }
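
    One way to picture the SubProcessorType shortcut handled above is to rewrite it into the explicit Children form. The Expand helper below is hypothetical and exists only for illustration. It mirrors two details of BuildContentProcessor: the quick child receives the parent's Properties, and it is added before the regular Children.

    ```csharp
    using System.Collections.Generic;

    public class ContentProcessorOptions
    {
        public Dictionary<string, object> Properties { get; set; } = new Dictionary<string, object>();
        public string ProcessorType { get; set; }
        public string SubProcessorType { get; set; }
        public List<ContentProcessorOptions> Children { get; set; } = new List<ContentProcessorOptions>();
    }

    public static class ShortcutDemo
    {
        // Hypothetical helper: turns the SubProcessorType shortcut into explicit Children.
        public static ContentProcessorOptions Expand(ContentProcessorOptions options)
        {
            var expanded = new ContentProcessorOptions
            {
                ProcessorType = options.ProcessorType,
                Properties = options.Properties
            };

            if (!string.IsNullOrWhiteSpace(options.SubProcessorType))
            {
                // The quick child shares the parent's Properties, as in BuildContentProcessor.
                expanded.Children.Add(new ContentProcessorOptions
                {
                    ProcessorType = options.SubProcessorType,
                    Properties = options.Properties
                });
            }

            // Regular children follow the quick child.
            expanded.Children.AddRange(options.Children);
            return expanded;
        }
    }
    ```

    Because both nodes read the same Properties, a key such as Xpath is applied to whichever node exposes a property of that name and is skipped by the other.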
    

    A Few Constraints

    Collections Must Converge

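    An example illustrates the problem. Suppose n p tags are extracted from an HTML document, returned as a string[], and passed as the source to the next processing node. That node handles each string correctly, but if it in turn returns a string[] for each string, that string[] should be joined by a connector. Otherwise the result becomes an array of two, three, or even more dimensions, and the logic of every node becomes more complicated and harder to control. Collections therefore need to converge to a single dimension.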
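
    The convergence rule can be sketched roughly as follows. Converge is a hypothetical helper, not part of the article's code: when a step returns a collection for a single input element, the collection is joined with a connector so the overall result stays one-dimensional.

    ```csharp
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;

    public static class ConvergeDemo
    {
        // Hypothetical helper: collapses a per-element collection into one string.
        public static object Converge(object stepResult, string connector)
        {
            if (stepResult is IEnumerable items && !(stepResult is string))
                return string.Join(connector, items.Cast<object>().Select(o => o?.ToString()));
            return stepResult;
        }

        public static List<object> Run()
        {
            // First step: one document yields several paragraphs.
            var paragraphs = new[] { "a b", "c" };

            // Second step: each paragraph yields a string[] (split into words),
            // which is converged back into a single string per paragraph.
            return paragraphs
                .Select(p => Converge(p.Split(' '), ","))
                .ToList();
        }
    }
    ```

    Without the Converge call the second step would return a list of arrays, which is exactly the extra dimension the constraint is meant to prevent.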

    Properties in the Configuration File Do Not Support Complex Structures

    Because the .NET Core configuration system is used here, an entry in a Dictionary<string,object> cannot be set to a collection.

    Some Implementations

    Processor Implementations and Tests

    HttpRequestContentProcessor

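    This processor downloads a piece of HTML text from the network and passes the text on as the source for the next processor. The request URL can be specified directly, or the source passed in from the previous node can be used as the URL. The implementation is as follows: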

        public class HttpRequestContentProcessor : BaseContentProcessor
        {
            public bool UseUrlWhenSourceIsNull { get; set; } = true;
    
            public string Url { get; set; }
    
            public bool IgnoreBadUri { get; set; }
    
            protected override object ProcessElement(object element)
            {
                if (element == null) return null;
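                // Reject the element only when it is NOT a well-formed absolute URI.
                if (!Uri.IsWellFormedUriString(element.ToString(), UriKind.Absolute))
                {
                    if (IgnoreBadUri) return null;
                    throw new FormatException($"The requested address {element} is not a well-formed URI");
                }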
                return DownloadHtml(element.ToString());
            }
    
            public override object Process(object source)
            {
                if (source == null && UseUrlWhenSourceIsNull && !Url.IsNullOrWhiteSpace())
                    return DownloadHtml(Url);
                return base.Process(source);
            }
    
            private static async Task<string> DownloadHtmlAsync(string url)
            {
                using (var client = new HttpClient())
                {
                    var result = await client.GetAsync(url);
                    var html = await result.Content.ReadAsStringAsync();
                    return html;
                }
            }
    
            private string DownloadHtml(string url)
            {
                return AsyncHelper.Synchronize(() => DownloadHtmlAsync(url));
            }
        }
    

    The test is as follows:

            [TestMethod]
            public void HttpRequestContentProcessorTest()
            {
                var processor = new HttpRequestContentProcessor {Url = "https://www.baidu.com"};
                var result = processor.Process(null);
                Assert.IsTrue(result.ToString().Contains("baidu"));
            }
    

    XpathContentProcessor

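    This processor takes an XPath expression and uses it to retrieve the specified information. ValueProviderType and ValueProviderKey specify how data is read from a node. The implementation is as follows: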

        public class XpathContentProcessor : BaseContentProcessor
        {
            /// <summary>
            /// XPath of the elements to select.
            /// </summary>
            public string Xpath { get; set; }
    
            /// <summary>
            /// Key for the value provider.
            /// </summary>
            public string ValueProviderKey { get; set; }
    
            /// <summary>
            /// Type of the value provider.
            /// </summary>
            public XpathNodeValueProviderType ValueProviderType { get; set; }
    
            /// <summary>
            /// Index of the node.
            /// </summary>
            public int? NodeIndex { get; set; }
    
            /// <summary>
            /// Connector used to join multiple results into one string.
            /// </summary>
            public string ResultConnector { get; set; } = Constants.DefaultResultConnector;
    
            public override object Process(object source)
            {
                var result = base.Process(source);
                return DeterminAndReturn(result);
            }
    
            protected override object ProcessElement(object element)
            {
                var result = base.ProcessElement(element);
                if (result == null) return null;
    
                var str = result.ToString();
                
                return ProcessWithXpath(str, Xpath, false);
            }
    
            protected object ProcessWithXpath(string documentText, string xpath, bool returnArray)
            {
                if (documentText == null) return null;
    
                var document = new HtmlDocument();
                document.LoadHtml(documentText);
                var nodes = document.DocumentNode.SelectNodes(xpath);
    
                if (nodes == null)
                    return null;
    
                if (returnArray && nodes.Count > 1)
                {
                    var result = new List<string>();
                    foreach (var node in nodes)
                    {
                        var nodeResult = Helper.GetValueFromHtmlNode(node, ValueProviderType, ValueProviderKey);
                        if (!nodeResult.IsNullOrWhiteSpace())
                        {
                            result.Add(nodeResult);
                        }
                    }
                    return result;
                }
                else
                {
                    var result = string.Empty;
                    foreach (var node in nodes)
                    {
                        var nodeResult = Helper.GetValueFromHtmlNode(node, ValueProviderType, ValueProviderKey);
                        if (!nodeResult.IsNullOrWhiteSpace())
                        {
                            if (result.IsNullOrWhiteSpace()) result = nodeResult;
                            else result = $"{result}{ResultConnector}{nodeResult}";
                        }
                    }
                    return result;
                }
            }
        }
    

    Combining this processor with the previous one, let's grab the title of the Baidu home page:

            [TestMethod]
            public void XpathContentProcessorTest()
            {
                var xpathProcessor = new XpathContentProcessor
                {
                    Xpath = "//title",
                    ValueProviderType = XpathNodeValueProviderType.InnerText
                };
                var processor = new HttpRequestContentProcessor { Url = "https://www.baidu.com" };
                xpathProcessor.SubProcessors.Add(processor);
    
                var result = xpathProcessor.Process(null);
                Assert.AreEqual("百度一下,你就知道", result.ToString());
            }
    

    Collector Implementations and Tests

    The main purpose of a collector is to handle complex output models. A collector for a complex data structure is implemented as follows:

        public class ComplexContentCollector : BaseContentCollector
        {
            /// <summary>
            /// ComplexContentCollector needs its child extractors to provide a Key, so plain processors are ignored.
            /// </summary>
            /// <param name="source"></param>
            /// <returns></returns>
            protected override object ProcessElement(object source)
            {
                var result = new Dictionary<string, object>();
    
                foreach (var contentCollector in SubProcessors.OfType<IContentCollector>())
                {
                    result[contentCollector.Key] = contentCollector.Process(source);
                }
    
                return result;
            }
        }
    

    The corresponding test is as follows:

            [TestMethod]
            public void ComplexContentCollectorTest2()
            {
                var xpathProcessor = new XpathContentProcessor
                {
                    Xpath = "//title",
                    ValueProviderType = XpathNodeValueProviderType.InnerText
                };
    
                var xpathProcessor2 = new XpathContentProcessor
                {
                    Xpath = "//p[@id=\"cp\"]",
                    ValueProviderType = XpathNodeValueProviderType.InnerText,
                    Order = 1
                };
                var processor = new HttpRequestContentProcessor {Url = "https://www.baidu.com", Order = -1};
                var complexCollector = new ComplexContentCollector();
                var baseCollector = new BaseContentCollector();
    
                baseCollector.SubProcessors.Add(processor);
                baseCollector.SubProcessors.Add(complexCollector);
                
                var titleCollector = new BaseContentCollector{Key = "Title"};
                titleCollector.SubProcessors.Add(xpathProcessor);
                var footerCollector = new BaseContentCollector {Key = "Footer"};
                footerCollector.SubProcessors.Add(xpathProcessor2);
                footerCollector.SubProcessors.Add(new HtmlCleanupContentProcessor{Order = 3});
    
                complexCollector.SubProcessors.Add(titleCollector);
                complexCollector.SubProcessors.Add(footerCollector);
    
                var result = (Dictionary<string,object>)baseCollector.Process(null);
                Assert.AreEqual("百度一下,你就知道", result["Title"]);
                Assert.AreEqual("©2014 Baidu 使用百度前必读 京ICP证030173号", result["Footer"]);
    
            }
    

    Handling Slightly More Complex Cases with Configuration

    The following code is used for testing:

            public void RunConfig(string section)
            {
                var builder = new ConfigurationBuilder()
                    .SetBasePath(AppDomain.CurrentDomain.BaseDirectory)
                    .AddJsonFile("appsettings1.json");
                var configurationRoot = builder.Build();
    
                var options = configurationRoot.GetSection(section).Get<ContentProcessorOptions>();
                var processor = Helper.BuildContentProcessor(options);
    
                var result = processor.Process(null);
                var json = JsonConvert.SerializeObject(result);
                System.Console.WriteLine(json);
            }
    

    Scraping the Titles of the cnblogs News List

    Configuration used:

    "newsListOptions": {
        "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
        "Properties": {},
        "Children": [
          {
            "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
            "Properties": {
              "Url": "https://www.cnblogs.com/news/",
              "Order": "0"
            }
          },
          {
            "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//div[@class=\"post_item\"]",
              "Order": "1",
              "ValueProviderType": "OuterHtml",
              "OutputToArray": true
            }
          },
          {
            "ProcessorType": "IC.Robot.ContentCollector.ComplexContentCollector",
            "Properties": {
              "Order": "2"
            },
            "Children": [
              {
                "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
                "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
                "Properties": {
                  "Xpath": "//a[@class=\"titlelnk\"]",
                  "Key": "Url",
                  "ValueProviderType": "Attribute",
                  "ValueProviderKey": "href"
                }
              },
              {
                "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
                "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
                "Properties": {
                  "Xpath": "//span[@class=\"article_comment\"]",
                  "Key": "CommentCount",
                  "ValueProviderType": "InnerText",
                  "Order": "0"
                },
                "Children": [
                  {
                    "ProcessorType": "IC.Robot.ContentProcessor.RegexMatchContentProcessor",
                    "Properties": {
                      "RegexPartten": "[0-9]+",
                      "Order": "1"
                    }
                  }
                ]
              },
              {
                "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
                "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
                "Properties": {
                  "Xpath": "//*[@class=\"digg\"]//span",
                  "Key": "LikeCount",
                  "ValueProviderType": "InnerText"
                }
              },
              {
                "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
                "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
                "Properties": {
                  "Xpath": "//a[@class=\"titlelnk\"]",
                  "Key": "Title",
                  "ValueProviderType": "InnerText"
                }
              }
            ]
          }
        ]
      },
    

    Result:

    [
            {
                "Url": "//news.cnblogs.com/n/574269/",
                "CommentCount": "1",
                "LikeCount": "3",
                "Title": "刘强东:京东13年了,真正懂我们的人还是很少"
            },
            {
                "Url": "//news.cnblogs.com/n/574267/",
                "CommentCount": "0",
                "LikeCount": "0",
                "Title": "联想也开始大谈人工智能,不过它最迫切的目标是卖更多PC"
            },
            {
                "Url": "//news.cnblogs.com/n/574266/",
                "CommentCount": "0",
                "LikeCount": "0",
                "Title": "除了小米1几乎都支持 - 小米MIUI9升级机型一览"
            },
            ...
    ]
    
    

    Fetching the Details of the Most-Commented News Item in the List

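    This involves computation and collection operations, and the collection elements are dictionaries, so two new processors are needed: one for filtering and one for mapping.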

        public class ListItemPickContentProcessor : BaseContentProcessor
        {
            public string Key { get; set; }
    
            /// <summary>
            /// Full name of the type used for the comparison.
            /// </summary>
            public string OperatorTypeFullName { get; set; }
    
            /// <summary>
            /// Value to compare against.
            /// </summary>
            public string OperatorValue { get; set; }
    
            /// <summary>
            /// Index.
            /// </summary>
            public int Index { get; set; }
    
            /// <summary>
            /// Pick mode.
            /// </summary>
            public ListItemPickMode PickMode { get; set; }
    
            /// <summary>
            /// Operator.
            /// </summary>
            public ListItemPickOperator PickOperator { get; set; }
    
            public override object Process(object source)
            {
                var preResult = base.Process(source);
    
                if (!Helper.IsEnumerableExceptString(preResult))
                {
                    // Inspect the processed result, not the raw source, before casting it.
                    if (preResult is Dictionary<string, object> dictionary)
                        return dictionary[Key];
                    return preResult;
                }
    
                return Pick((IEnumerable) preResult);
            }
    
            private object Pick(IEnumerable source)
            {
                var objCollection = source.Cast<object>().ToList();
                if (objCollection.Count == 0)
                    return objCollection;
                var item = objCollection[0];
                var compareDictionary = new Dictionary<object, IComparable>();
                if (item is IDictionary)
                {
    
                    foreach (Dictionary<string, object> dic in objCollection)
                    {
                        var key = (IComparable) dic[Key].ToString().Convert(ResolveType(OperatorTypeFullName));
                        compareDictionary.Add(dic, key);
                    }
                }
                else
                {
                    foreach (var objItem in objCollection)
                    {
                        var key = (IComparable) objItem.ToString().Convert(ResolveType(OperatorTypeFullName));
                        compareDictionary.Add(objItem, key);
                    }
                }
    
                IEnumerable<object> result;
    
                switch (PickOperator)
                {
                    case ListItemPickOperator.OrderDesc:
                        result = compareDictionary.OrderByDescending(i => i.Value).Select(i => i.Key);
                        break;
                    default: throw new NotSupportedException();
                }
    
                switch (PickMode)
                {
                    case ListItemPickMode.First:
                        return result.FirstOrDefault();
                    case ListItemPickMode.Last:
                        return result.LastOrDefault();
                    case ListItemPickMode.Index:
                        return result.Skip(Index - 1).Take(1).FirstOrDefault();
                    default:
                        throw new NotImplementedException();
                }
            }
    
            private Type ResolveType(string typeName)
            {
                if (typeName == typeof(Int32).FullName)
                    return typeof(Int32);
                throw new NotSupportedException();
            }
    
            public enum ListItemPickMode
            {
                First,
                Last,
                Index
            }
    
            public enum ListItemPickOperator
            {
                LittleThan,
                GreaterThan,
                Order,
                OrderDesc
            }
        }
    

    This uses a fair amount of reflection, but performance is not a concern for now.

        public class DictionaryPickContentProcessor : BaseContentProcessor
        {
            public string Key { get; set; }
    
            protected override object ProcessElement(object element)
            {
                if (element is IDictionary)
                {
                    return (element as IDictionary)[Key];
                }
                return element;
            }
        }
    
    

    This processor extracts a single entry from a dictionary.
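
    Taken together, the two processors amount to a filter followed by a map. The sketch below expresses the same idea directly in LINQ over hypothetical sample data (the example.com URLs and counts are made up); it is not how the processors themselves are implemented.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class PickDemo
    {
        public static object Run()
        {
            // Hypothetical sample shaped like the items produced by the collector.
            var items = new List<Dictionary<string, object>>
            {
                new Dictionary<string, object> { ["Url"] = "//example.com/a", ["CommentCount"] = "1" },
                new Dictionary<string, object> { ["Url"] = "//example.com/b", ["CommentCount"] = "12" },
                new Dictionary<string, object> { ["Url"] = "//example.com/c", ["CommentCount"] = "3" }
            };

            // Filter step (ListItemPickContentProcessor with OrderDesc + First):
            // compare as Int32, because as strings "3" would sort above "12".
            var top = items
                .OrderByDescending(d => Convert.ToInt32(d["CommentCount"]))
                .FirstOrDefault();

            // Map step (DictionaryPickContentProcessor with Key = "Url").
            return top?["Url"];
        }
    }
    ```

    The numeric conversion is the reason the configuration passes OperatorTypeFullName as System.Int32.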

    Configuration used:

    "mostCommentsOptions": {
        "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
        "Properties": {},
        "Children": [
          {
            "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
            "Properties": {
              "Url": "https://www.cnblogs.com/news/",
              "Order": "0"
            }
          },
          {
            "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//div[@class=\"post_item\"]",
              "Order": "1",
              "ValueProviderType": "OuterHtml",
              "OutputToArray": true
            }
          },
          {
            "ProcessorType": "IC.Robot.ContentCollector.ComplexContentCollector",
            "Properties": {
              "Order": "2"
            },
            "Children": [
              {
                "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
                "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
                "Properties": {
                  "Xpath": "//a[@class=\"titlelnk\"]",
                  "Key": "Url",
                  "ValueProviderType": "Attribute",
                  "ValueProviderKey": "href"
                }
              },
              {
                "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
                "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
                "Properties": {
                  "Xpath": "//span[@class=\"article_comment\"]",
                  "Key": "CommentCount",
                  "ValueProviderType": "InnerText",
                  "Order": "0"
                },
                "Children": [
                  {
                    "ProcessorType": "IC.Robot.ContentProcessor.RegexMatchContentProcessor",
                    "Properties": {
                      "RegexPartten": "[0-9]+",
                      "Order": "1"
                    }
                  }
                ]
              },
              {
                "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
                "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
                "Properties": {
                  "Xpath": "//*[@class=\"digg\"]//span",
                  "Key": "LikeCount",
                  "ValueProviderType": "InnerText"
                }
              },
              {
                "ProcessorType": "IC.Robot.ContentCollector.BaseContentCollector",
                "SubProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
                "Properties": {
                  "Xpath": "//a[@class=\"titlelnk\"]",
                  "Key": "Title",
                  "ValueProviderType": "InnerText"
                }
              }
            ]
          },
          {
            "ProcessorType":"IC.Robot.ContentProcessor.ListItemPickContentProcessor",
            "Properties":{
              "OperatorTypeFullName":"System.Int32",
              "Key":"CommentCount",
              "PickMode":"First",
              "PickOperator":"OrderDesc",
              "Order":"4"
            }
          },
          {
            "ProcessorType":"IC.Robot.ContentProcessor.DictionaryPickContentProcessor",
            "Properties":{
              "Order":"5",
              "Key":"Url"
            }
          },
          {
            "ProcessorType":"IC.Robot.ContentProcessor.FormatterContentProcessor",
            "Properties":{
              "Formatter":"https:{0}",
              "Order":"6"
            }
          },
          {
            "ProcessorType": "IC.Robot.ContentProcessor.HttpRequestContentProcessor",
            "Properties": {
              "Order": "7"
            }
          },
          {
            "ProcessorType": "IC.Robot.ContentProcessor.XpathContentProcessor",
            "Properties": {
              "Xpath": "//div[@id=\"news_content\"]//p[2]",
              "Order": "8",
              "ValueProviderType": "InnerHtml",
              "OutputToArray": false
            }
          }
        ]
      }
    
    

    Result:

      昨日,京东突然通知平台商户,将关闭天天快递服务接口。这意味着京东平台上的商户以后不能再用天天快递发货了。
    

    Possible Improvements

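    • A GUI is needed for editing the configuration; the current format is not user-friendly
    • A scheduler is needed to decide how processors are scheduled (depth-first, breadth-first, and so on)
    • Dependencies between processors should be constrained at the code level (for example, the collection convergence rule) so that configuration is better guided
    • The rules are not yet consistent enough, for example when a returned collection should be constrained and when it should not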

    Writing code is still a lot of fun, isn't it?

  • Original article: https://www.cnblogs.com/lightluomeng/p/7212577.html