  • Abot crawler and vis.js

    1. Introduction

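    I have been working with the Abot crawler for a few days now, and with some free time I decided to crawl some movie data from IMDB for fun. Captain America 3 happens to be in theaters, so the plan is to crawl Marvel's films from recent years and use the JS library vis to visualize the films of the Marvel universe.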

    Abot is an open-source C# crawler with a very lightweight codebase. For an introduction to Abot, see the article "Using Abot to crawl cnblogs news data".

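    Vis is a JS visualization library similar to D3. It provides visualizations such as Network (network graphs) and Timeline. This post uses Network: pass vis some simple node and edge information and it builds the network graph automatically.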
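    As a rough illustration of what "simple node and edge information" looks like, the sketch below builds the two arrays by hand. The ids and labels are made up for illustration, and the guarded call at the end only runs in a browser where the vis library is loaded and a container element with the assumed id "graph" exists.

```javascript
// Hypothetical sample data: one film node and two actor nodes.
const nodes = [
  { id: "film", label: "Film" },
  { id: "actor-1", label: "Actor 1" },
  { id: "actor-2", label: "Actor 2" }
];

// Each edge refers to its endpoints by node id.
const edges = [
  { from: "film", to: "actor-1" },
  { from: "film", to: "actor-2" }
];

// Only attempt to render when running in a page that has loaded vis.
if (typeof vis !== "undefined" && typeof document !== "undefined") {
  const container = document.getElementById("graph"); // assumed element id
  new vis.Network(container, { nodes: nodes, edges: edges }, {});
}
```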

    2. Implementation

    Start with the data: the names of all the films in the Marvel universe. Lists like this are easy to find online:

    [image: list of Marvel universe films]

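    Getting from a film name to its IMDB page normally involves a search step. Fortunately there are not many films, so I take a shortcut here and use the IMDB film links directly as seed URLs: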

            public static List<string> ImdbFeedMovies = new List<string>()
            {
                //Iron man 2008
                "http://www.imdb.com/title/tt1233205/",
                //Hulk 2008
                "http://www.imdb.com/title/tt0800080/",
                //Iron man 2 2010
                "http://www.imdb.com/title/tt1228705/",
                //Thor 2011
                "http://www.imdb.com/title/tt0800369/",
                //Captain America
                "http://www.imdb.com/title/tt0458339/",
                //Avengers
                "http://www.imdb.com/title/tt0848228/",
                //Iron man 3 
                "http://www.imdb.com/title/tt1300854/",
                //thor 2
                "http://www.imdb.com/title/tt1981115/",
                //Captain America 2
                "http://www.imdb.com/title/tt1843866/",
                //Guardians of the Galaxy;
                "http://www.imdb.com/title/tt2015381/",
                //Ultron
                "http://www.imdb.com/title/tt2395427/",
                //ant-man
                "http://www.imdb.com/title/tt0478970/",
                //Civil war
                "http://www.imdb.com/title/tt3498820/",
                //Doctor Strange
                "http://www.imdb.com/title/tt1211837/",
                //Guardians of the Galaxy 2;
                "http://www.imdb.com/title/tt3896198/",
                //Thor 3
                "http://www.imdb.com/title/tt3501632/",
                // Black Panther
                "http://www.imdb.com/title/tt1825683/",
                //Avengers: Infinity War - Part I
                "http://www.imdb.com/title/tt4154756/"
            };

    With the seed URLs in place, Abot can crawl the film data. Only the film name, poster image, and cast are crawled here.

    Here are the data structures that will be needed:

        public class MarvellItem
        {
            /// <summary>
            /// http://www.imdb.com/title/tt0800369/
            /// </summary>
            public string ImdbUrl { get; set; }
            public string Name { get; set; }
            public string Image { get; set; }
        }
    
        public class ImdbMovie
        {
            public string ImdbUrl { get; set; }
            public string Name { get; set; }
            public string Image { get; set; }
            public DateTime Date { get; set; }
     
            public List<MarvellItem> Actors { get; set; } 
        }
    
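        // Verbatim string so that \d and \. reach the regex engine instead of being treated as C# escapes.
        public static readonly Regex MovieRegex = new Regex(@"http://www\.imdb\.com/title/tt\d+", RegexOptions.Compiled);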
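    Note that MovieRegex is not anchored, so IsMatch also succeeds for sub-pages such as a title's /fullcredits page. Whether that matters depends on how the crawler is configured, which this post does not show. If only a film's main page should be processed, an anchored pattern is one option. The JavaScript sketch below illustrates the idea; the pattern and the helper name isTitlePage are assumptions, not the author's code.

```javascript
// Anchored variant of the title-page pattern (an assumption, not the author's regex):
// accepts http or https, and allows nothing after the title id except an optional slash.
const TITLE_PAGE = /^https?:\/\/www\.imdb\.com\/title\/tt\d+\/?$/;

// Hypothetical helper: true only for a film's main page.
function isTitlePage(url) {
  return TITLE_PAGE.test(url);
}
```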

    In Abot, the main handler that runs after a page has been crawled is PageCrawlCompletedAsync. Here is the completion callback invoked after each film page is crawled:

            private ConcurrentDictionary<string, ImdbMovie> movieResult; // crawled movie data
    
            public void Moviecrawler_ProcessPageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
            {
                if (MovieRegex.IsMatch(e.CrawledPage.Uri.AbsoluteUri))
                {
                    var csTitle = e.CrawledPage.CsQueryDocument.Select(".title_block > .title_bar_wrapper > .titleBar > .title_wrapper > h1");
                    string title = HtmlData.HtmlDecode(csTitle.Text().Trim());
    
                    var datetime =
                        e.CrawledPage.CsQueryDocument.Select(
                            ".title_block > .title_bar_wrapper > .titleBar > .title_wrapper > .subtext > a:last > meta");
    
                    var year = datetime.Attr("content").Trim();
    
                    var csImg = e.CrawledPage.CsQueryDocument.Select(".poster > a > img");
                    // Attr returns null when no poster is found, so guard before trimming.
                    string image = (csImg.Attr("src") ?? string.Empty).Trim();
    
                    if (!string.IsNullOrEmpty(image))
                    {
                        // Download the poster locally; dispose the response, stream, and bitmap when done.
                        HttpWebRequest webRequest = (HttpWebRequest) WebRequest.Create(image);
                        webRequest.Credentials = CredentialCache.DefaultCredentials;
                        using (var response = webRequest.GetResponse())
                        using (var stream = response.GetResponseStream())
                        {
                            if (stream != null)
                            {
                                using (Image bitmap = new Bitmap(stream))
                                {
                                    image = e.CrawledPage.Uri.AbsoluteUri.GetHashCode() + ".jpg";
                                    bitmap.Save(image);
                                }
                            }
                        }
                    }
    
                    var csTable = e.CrawledPage.CsQueryDocument.Select("#titleCast > table");
                    var csTrs = csTable.Select("tr", csTable);
    
                    List<MarvellItem> actors = new List<MarvellItem>();
                    foreach (var tr in csTrs)
                    {
                        var csTr = new CsQuery.CQ(tr);
                        var cslink = csTr.Select("td > a", csTr);
                        if (cslink.Any())
                        {
                            string url = NormUrl(cslink.Attr("href").Trim());
                            string actorTitle = cslink.Select("img", cslink).Attr("title").Trim();
                            string actorImage = cslink.Select("img", cslink).Attr("src").Trim();
    
                            actors.Add(new MarvellItem()
                            {
                                Name = actorTitle,
                                ImdbUrl = url,
                                Image = actorImage
                            });
                        }
                    }
    
                    this.movieResult.TryAdd(e.CrawledPage.Uri.AbsoluteUri, new ImdbMovie()
                    {
                        Name = title,
                        Image = image,
                        Date = DateTime.Parse(year),
                        ImdbUrl = e.CrawledPage.Uri.AbsoluteUri,
                        Actors = actors
                    });
                }
            }

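    The main job of this function is to parse the film page and extract the film name, poster image, and cast information. There is also a small trick here: because of IMDB's restrictions, the crawled images have to be downloaded; otherwise, in production, an <img src=""/> that points at them will not display.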

    For more details on this trick, see "Some thoughts on img 403 forbidden".
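    One caveat with the download step: the file name comes from string.GetHashCode(), which .NET does not guarantee to be stable across runtime versions or processes, so the same page may map to a different file name on another run. A possible alternative, sketched here in JavaScript with the hypothetical helper name posterFileName, is to derive the name from the IMDB title id in the page URL.

```javascript
// Hypothetical helper: derive a stable file name such as "tt0800369.jpg"
// from an IMDB title URL. Returns null when the URL contains no title id.
function posterFileName(pageUrl) {
  const match = /\/title\/(tt\d+)/.exec(pageUrl);
  return match ? match[1] + ".jpg" : null;
}
```

    With this scheme, rerunning the crawl overwrites the same file for a given film instead of possibly creating a new one.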

    All the film links can be crawled in parallel using Tasks:

                Task[] movieTasks = new Task[ImdbFeedMovies.Count];
    
                System.Console.WriteLine("Start crawl Movies");
    
                for (var i = 0; i < ImdbFeedMovies.Count; i++)
                {
                    var url = ImdbFeedMovies[i];
                    movieTasks[i] = new Task(() =>
                    {
                        System.Console.WriteLine("Start crawl:" + url);
                        var crawler = GetManuallyConfiguredWebCrawler();
                        ConfigMovieCrawl(crawler);
    
                        crawler.Crawl(new Uri(url));
                        System.Console.WriteLine("End crawl:" + url);
                    });
    
                    movieTasks[i].Start();
                }
    
                Task.WaitAll(movieTasks);
    
                System.Console.WriteLine("End crawl Movies");

    When the crawl finishes we have a pile of JSON data:

    [image: the resulting JSON data]

    Pass it to the front end:

    @model List<ImdbMovie>
    
    <div class="clearfix" style=" position: relative">
        <div id="marvel-graph">
        </div>
    </div>
    
    @section PostScripts{
        <script type="text/javascript">
            $(function () {
                var nodes = [];
                var edges = [];
    
                @for (int i = 0; i < Model.Count; i++)
                {
                    var film = Model[i];
                    <text>
                    nodes.push({
                        id: '@film.ImdbUrl',
                        title: '@film.Name',
                        borderWidth: 4,
                        shapeProperties: {useBorderWithImage: true},
                        shape: "image",
                        image: '@(string.IsNullOrEmpty(film.Image) ? "" : (film.Image.StartsWith("http") ? film.Image : Href("../../Images/marvel/"+film.Image)))',
                        color: { border: '#4db6ac', background: '#009688' }
                    });
    
                    @if (i != Model.Count - 1)
                    {
                        <text>
                        edges.push({
                            from: '@film.ImdbUrl',
                            to: '@Model[i+1].ImdbUrl',
                            arrows: { to: true },
                            width: 4,
                            length:360,
                            color: "red"
                        });
                        </text>
                    }
    
                    @foreach (var actor in film.Actors)
                    {
                        <text>
                        nodes.push({
                            id: '@film.ImdbUrl' + '@actor.ImdbUrl',
                            title: '@actor.Name',
                            borderWidth: 4,
                            shapeProperties: { useBorderWithImage: true },
                            shape: "circularImage",
                            image: '@(string.IsNullOrEmpty(actor.Image) ? "" : (actor.Image.StartsWith("http") ? actor.Image : Href("../../Images/marvel/"+actor.Image)))',
                        });
    
                        edges.push({
                            from: '@film.ImdbUrl',
                            to: '@film.ImdbUrl' + '@actor.ImdbUrl',
                            arrows: { to: true }
                        });
                        </text>
                    }
                    
                        </text>
                }
    
                var container = document.getElementById("marvel-graph");
         
                var visNodes = new vis.DataSet(nodes);
                var data = {
                    nodes: visNodes,
                    edges: edges
                };
    
                var options = {
                    layout: { improvedLayout: false },
                    nodes: {
                        borderWidth: 3,
                        font: {
                            color: '#000000',
                            size: 12,
                            face: 'Segoe UI'
                        },
                        color: { background: '#4db6ac', border: '#009688' }
                    },
                    edges: {
                        color: '#c1c1c1',
                        width: 2,
                        font: {
                            color: '#2d2d2d',
                            size: 12
                        },
                        smooth: {
                            enabled: false,
                            type: 'continuous'
                        }
                    }
                };
    
                var network = new vis.Network(container, data, options);
            });
        </script>
    }

    Using vis Network mostly comes down to new vis.Network(container, data, options): pass in the nodes and edges and that is all.
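    The Razor loops in the view emit one nodes.push or edges.push call per film and actor. If the crawl result were instead delivered to the page as a JSON array, the same graph could be assembled in plain JavaScript. The sketch below assumes each element carries the ImdbUrl, Name, Image, and Actors properties of the ImdbMovie class; buildGraph is a hypothetical name, and styling options are left out for brevity.

```javascript
// Build vis-style node and edge arrays from crawled films.
// Assumes each film looks like { ImdbUrl, Name, Image, Actors: [{ ImdbUrl, Name, Image }] }.
function buildGraph(films) {
  const nodes = [];
  const edges = [];

  films.forEach(function (film, i) {
    nodes.push({ id: film.ImdbUrl, title: film.Name, shape: "image", image: film.Image });

    // Chain consecutive films together, as the view does with its red edges.
    if (i < films.length - 1) {
      edges.push({ from: film.ImdbUrl, to: films[i + 1].ImdbUrl, arrows: { to: true } });
    }

    // An actor gets a separate node per film, so the id combines both URLs.
    (film.Actors || []).forEach(function (actor) {
      const actorId = film.ImdbUrl + actor.ImdbUrl;
      nodes.push({ id: actorId, title: actor.Name, shape: "circularImage", image: actor.Image });
      edges.push({ from: film.ImdbUrl, to: actorId, arrows: { to: true } });
    });
  });

  return { nodes: nodes, edges: edges };
}
```

    The returned object can be passed as the data argument of new vis.Network(container, data, options). Because actor ids are prefixed with the film URL, an actor who appears in two films shows up as two separate nodes, which matches what the view does.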

    The final result looks like this:

    [image: the final network graph]

  • Original article: https://www.cnblogs.com/GmrBrian/p/6212249.html