  • C# WPF: Building a Simple Baidu Tieba Crawler Client

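    Without further ado, here is a screenshot first.

    (screenshot of the finished client)

    Crawling 10 pages (about 500 threads) takes roughly 10 s, and crawling 500 pages (a little over 20,000 threads) takes roughly 2 min. So performance is not especially good, but it is not bad either.

    With that out of the way, let's build this simple client step by step.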

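    1. Create the project

    Create an empty WPF project and reference the DevExpress DLLs it needs.

    DevExpress can be downloaded from the official site; pretty much any version from 16 up will do. The trial version works too. In my experience it does not stop working once the trial expires; it only shows a pop-up during development, which you can simply close. That is fairly generous.

    Download: https://www.devexpress.com/

    2. Build the UI

    This is mostly a matter of writing XAML, and the DevExpress Demo Center has plenty of samples to borrow from. Here is the code.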

    <dx:ThemedWindow x:Class="SearchAnyWay.MainWindow"
            xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
            xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
            xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
            xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
            xmlns:dx="http://schemas.devexpress.com/winfx/2008/xaml/core"
            xmlns:dxmvvm="http://schemas.devexpress.com/winfx/2008/xaml/mvvm"
            xmlns:dxe="http://schemas.devexpress.com/winfx/2008/xaml/editors"
            xmlns:dxlc="http://schemas.devexpress.com/winfx/2008/xaml/layoutcontrol"   
            xmlns:dxg="http://schemas.devexpress.com/winfx/2008/xaml/grid"
            xmlns:local="clr-namespace:SearchAnyWay"
            mc:Ignorable="d"
            Title="Baidu Tieba Search Tool (v1.0)" Height="600" Width="800">
        <Grid>
            <dxlc:LayoutControl VerticalAlignment="Stretch" Orientation="Vertical" TextBlock.FontSize="11">
                <Label VerticalAlignment="Top" FontWeight="Bold" Content="Enter the keyword to search for"></Label>
                <dxlc:LayoutGroup Orientation="Horizontal">
                    <dxlc:LayoutItem Label="Keyword (K)" AddColonToLabel="True">
                        <dxe:TextEdit EditValue="{Binding Path=Name, Mode=TwoWay, UpdateSourceTrigger=PropertyChanged, ValidatesOnDataErrors=True}" >
                            <dxmvvm:Interaction.Triggers>
                                <dxmvvm:KeyToCommand KeyGesture="Enter" Command="{Binding SearchCommand}"></dxmvvm:KeyToCommand>
                            </dxmvvm:Interaction.Triggers>
                        </dxe:TextEdit>
                    </dxlc:LayoutItem>
                    <dxlc:LayoutItem Label="Tieba name (N)" AddColonToLabel="True">
                        <dxe:TextEdit EditValue="{Binding Path=HubName, Mode=TwoWay, UpdateSourceTrigger=PropertyChanged, ValidatesOnDataErrors=True}">
                        </dxe:TextEdit>
                    </dxlc:LayoutItem>
                    <dxlc:LayoutItem Label="Pages to crawl (P)" AddColonToLabel="True">
                        <dxe:ComboBoxEdit ItemsSource="{Binding PageRange}"
                                          SelectedItem="{Binding Page}"
                                          ShowSizeGrip="False"
                                          IsTextEditable="False">
                        </dxe:ComboBoxEdit>
                    </dxlc:LayoutItem>
                    <dxlc:LayoutGroup HorizontalAlignment="Right" VerticalAlignment="Center">
                        <dx:SimpleButton x:Name="btnSearch" Content="Search (S)" Width="80" Command="{Binding SearchCommand}"></dx:SimpleButton>
             
                    </dxlc:LayoutGroup>
                </dxlc:LayoutGroup>
                <dxg:TreeListControl x:Name="treeList"  Margin="0,10" ItemsSource="{Binding Source}"
                                             SelectionMode="Row" SelectedItem="{Binding SelectedRow}">
                    <dxg:TreeListControl.Columns>
                        <dxg:TreeListColumn  FieldName="Title" Header="Title"  Width="2*"/>
                        <dxg:TreeListColumn  FieldName="Brief" Width="2*" Header="Details"/>
                        <dxg:TreeListColumn Header="Replies" FieldName="CommentCount" Width="*"/>
                        <dxg:TreeListColumn Header="Author" FieldName="AuthorName" Width="*"/>
                    </dxg:TreeListControl.Columns>
                    <dxg:TreeListControl.View>
                        <dxg:TreeListView x:Name="view" VerticalScrollbarVisibility="Auto" AutoExpandAllNodes="True"  AllowEditing="False" NavigationStyle="Row" ShowIndicator="False"  TreeDerivationMode="ChildNodesSelector" ChildNodesPath="ICDItems">
                            <dxmvvm:Interaction.Triggers>
                                <dxmvvm:EventToCommand EventName="SourceUpdated" Command="{Binding Commands.ExpandAllNodes, ElementName=view}" />
                                <dxmvvm:EventToCommand EventName="RowDoubleClick" Command="{Binding SearchCommand}" CommandParameter="{Binding ElementName=treeList,Path=SelectedItem}" />
                            </dxmvvm:Interaction.Triggers>
                        </dxg:TreeListView>
                    </dxg:TreeListControl.View>
                </dxg:TreeListControl>
                <dxlc:LayoutGroup VerticalAlignment="Bottom" Orientation="Horizontal">
                    <Label Content="Total threads:" HorizontalAlignment="Right"/>
                    <Label Content="{Binding Source.Count, UpdateSourceTrigger=PropertyChanged}"  HorizontalAlignment="Right">
                        </Label>
                </dxlc:LayoutGroup>
                <dxlc:LayoutGroup VerticalAlignment="Bottom" Orientation="Horizontal">
                    <dxe:CheckEdit IsChecked="{Binding IsAll}"  Content="Include All" HorizontalAlignment="Left"/>
                    <dx:SimpleButton Content="Copy VLPath To Clipboard" IsEnabled="{Binding CanNext}" Command="{Binding CopyVLPathCommand}" HorizontalAlignment="Left"></dx:SimpleButton>
                    <dxlc:LayoutGroup HorizontalAlignment="Right">
                        <dx:SimpleButton Content="Download (D)" Width="80" IsEnabled="{Binding CanNext}" Command="{Binding NextCommand}"></dx:SimpleButton>
                        <dx:SimpleButton Content="Clear (C)" Width="80" IsEnabled="{Binding CanNext}" Command="{Binding OKCommand}"></dx:SimpleButton>
                        <dx:SimpleButton Content="Partner (P)" Width="80" Command="{Binding CancelCommand}"></dx:SimpleButton>
                    </dxlc:LayoutGroup>
                </dxlc:LayoutGroup>
            </dxlc:LayoutControl>
            <dx:WaitIndicator  DeferedVisibility="{Binding IsLoading}" />
        </Grid>
    </dx:ThemedWindow>

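    3. Implement the MVVM pattern

    This uses the MVVM framework that ships with DevExpress, which works much like one you would build on WPF's own facilities. If you are not familiar with MVVM, look up the related articles on cnblogs.

    (1) In the code-behind, set the theme and bind the view model.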

    using System;
    using System.Windows;
    using System.Windows.Media.Imaging;
    using DevExpress.Xpf.Core;

    public partial class MainWindow
    {
        public MainWindow()
        {
            InitializeComponent();
            // Set the theme and window style
            ApplicationThemeHelper.UseLegacyDefaultTheme = true;
            ApplicationThemeHelper.ApplicationThemeName = Theme.VisualStudioCategory;
            this.WindowStyle = System.Windows.WindowStyle.SingleBorderWindow;
            this.Icon = new BitmapImage(new Uri("../../debug.png", UriKind.Relative));
            this.BorderThickness = new Thickness(0);
            this.Margin = new Thickness(0);
            this.Padding = new Thickness(0);
            // Bind the view model
            this.DataContext = new MainViewModel();
        }
    }

    (2) Design the entity class for a thread.

    Shape it around whatever information you want to crawl.

     public class ArticleModel
        {
            public string Title { get; set; }
            public string Brief { get; set; }
            public int CommentCount { get; set; }
            public string AuthorName { get; set; }
        }

    (3) Declare the page count, the thread collection, and the other properties in the ViewModel.

            // Loading indicator
            private bool _loading;
            public bool IsLoading
            {
                get { return this._loading; }
                set
                {
                    SetProperty(ref _loading, value, () => IsLoading);
                }
            }
            // Tieba name
            private string _hub;
            public string HubName
            {
                get { return this._hub; }
                set
                {
                    SetProperty(ref _hub, value, () => HubName);
                }
            }
            // Number of pages to crawl
            private int _page;
            public int Page
            {
                get { return this._page; }
                set
                {
                    SetProperty(ref _page, value, () => Page);
                }
            }
            // Thread collection
            private ObservableCollection<ArticleModel> _source;
            public ObservableCollection<ArticleModel> Source
            {
                get { return _source; }
                set { SetProperty(ref _source, value, ()=>Source); }
            }
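
For readers new to MVVM, here is a minimal sketch of what a `SetProperty`-style helper typically does: store the new value and raise `PropertyChanged` only when the value actually changed. `ObservableObject` and `DemoViewModel` are hypothetical names used for illustration. This is not DevExpress's implementation, and it uses `[CallerMemberName]` instead of the lambda expression form shown above.

```csharp
using System.Collections.Generic;
using System.ComponentModel;
using System.Runtime.CompilerServices;

// Hypothetical base class for illustration only; not DevExpress's ViewModelBase.
public abstract class ObservableObject : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    // Stores the new value and raises PropertyChanged only if it differs from the old one.
    protected bool SetProperty<T>(ref T field, T value, [CallerMemberName] string name = null)
    {
        if (EqualityComparer<T>.Default.Equals(field, value)) return false;
        field = value;
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(name));
        return true;
    }
}

// Hypothetical view model mirroring the Page property from the article.
public class DemoViewModel : ObservableObject
{
    private int _page;
    public int Page
    {
        get { return _page; }
        set { SetProperty(ref _page, value); }
    }
}
```

A bound control subscribes to `PropertyChanged`, which is how the ComboBoxEdit and the labels in the XAML pick up new values.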

    (4) Bind the search logic to the button's Command, set up the combo box binding, and so on.

    public AsyncCommand SearchCommand { get; set; }
    
    public IEnumerable<int> PageRange { get; private set; }
    public MainViewModel()
            {
                Page = 10;
                PageRange = new List<int>() { 10,50, 100, 200, 500, 1000, 10000 };
                Source = new ObservableCollection<ArticleModel>();
                SearchCommand = new AsyncCommand(Search);
            }

    4. A simple implementation of the crawler logic

    We use HttpClient to send the request and get the HTML of the page.

    We use AngleSharp to parse the HTML (press Ctrl+Shift+P to quickly install the NuGet package): Install-Package AngleSharp

    Basic usage:

    // Fetch the HTML of the response
    string pageData = await http.GetStringAsync($"https://tieba.baidu.com/f?kw={HubName}&ie=utf-8&pn={pnIndex}");
    // Parse the HTML with AngleSharp
    IHtmlDocument doc = await parser.ParseDocumentAsync(pageData);

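    5. Analyze Baidu Tieba

    As you can see, the URLs are essentially identical. The one URL parameter that changes with the page is pn (Page Number), and the rule is pn = (Page - 1) * 50, where 50 is roughly the number of threads per page. Page 1 therefore gives pn=0, page 2 gives pn=50, and so on.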
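
The pn rule can be sketched as a small helper. `TiebaUrl` and `ForPage` are hypothetical names, and the value of 50 threads per page is the article's own estimate rather than a documented constant. The helper also escapes the tieba name with `Uri.EscapeDataString`, which the original snippet does not do, so that non-ASCII names survive in the query string.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original project.
public static class TiebaUrl
{
    // Assumption taken from the article: each list page holds about 50 threads.
    public const int ThreadsPerPage = 50;

    // Builds the list URL for a 1-based page number using pn = (page - 1) * 50.
    public static string ForPage(string hubName, int page)
    {
        if (page < 1) throw new ArgumentOutOfRangeException(nameof(page));
        int pn = (page - 1) * ThreadsPerPage;
        // Escape the name so that non-ASCII characters are percent-encoded.
        return $"https://tieba.baidu.com/f?kw={Uri.EscapeDataString(hubName)}&ie=utf-8&pn={pn}";
    }
}
```

For example, `TiebaUrl.ForPage("csharp", 3)` yields a URL ending in `pn=100`.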

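    That makes things easy: get the node for each thread, then look up the data we need inside it, one field at a time.

    The core crawling code is as follows: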

    await Task.Run(() =>
    {
        var http = new HttpClient();
        var parser = new HtmlParser();
        var result = Enumerable.Range(0, Page)
            .AsParallel()
            .AsOrdered()
            .SelectMany(page =>
            {
                return Task.Run(async () =>
                {
                    // page is 0-based here, so pn = page * 50
                    var pnIndex = page * 50;
                    // Fetch the HTML of the response
                    string pageData = await http.GetStringAsync($"https://tieba.baidu.com/f?kw={HubName}&ie=utf-8&pn={pnIndex}");
                    // Parse the HTML with AngleSharp
                    IHtmlDocument doc = await parser.ParseDocumentAsync(pageData);
                    // One ArticleModel per thread node; ?. guards against missing child nodes
                    return doc.QuerySelectorAll(".t_con.cleafix").Select(tag => new ArticleModel()
                    {
                        Title = tag.QuerySelector(".j_th_tit")?.TextContent?.Trim(),
                        Brief = tag.QuerySelector(".threadlist_abs.threadlist_abs_onlyline")?.TextContent?.Trim(),
                        CommentCount = Convert.ToInt32(tag.QuerySelector(".threadlist_rep_num.center_text")?.TextContent),
                        AuthorName = tag.QuerySelector(".frs-author-name.j_user_card")?.TextContent?.Trim(),
                    });
                }).GetAwaiter().GetResult();
            });
        Source = new ObservableCollection<ArticleModel>(result);
    });
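
The code above blocks a PLINQ worker thread on each request through `GetAwaiter().GetResult()`. One possible alternative, shown here only as a sketch and not as the author's method, is to start all page tasks and await them together with `Task.WhenAll`, which returns results in task order and so keeps the pages ordered. `PageCrawler`, `CrawlAsync`, and the `fetchPage` delegate are hypothetical names; in the real client `fetchPage` would wrap the HttpClient and AngleSharp calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of the original project.
public static class PageCrawler
{
    // fetchPage is an assumed delegate: given a 0-based page index, it downloads
    // and parses that page and returns the items found on it.
    public static async Task<List<T>> CrawlAsync<T>(
        int pageCount, Func<int, Task<IReadOnlyList<T>>> fetchPage)
    {
        // Start every page task up front; no thread is blocked while waiting.
        Task<IReadOnlyList<T>>[] tasks = Enumerable.Range(0, pageCount)
            .Select(fetchPage)
            .ToArray();

        // WhenAll returns results in the order of the tasks array,
        // so the page order is preserved regardless of completion order.
        IReadOnlyList<T>[] pages = await Task.WhenAll(tasks);
        return pages.SelectMany(p => p).ToList();
    }
}
```

Note that this starts every request at once, so for large page counts some form of throttling would probably be needed.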

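    One small detail: when a DOM element's class attribute contains spaces, it actually lists several class names, and the selector has to join them with '.' instead. For example, if the element's class is 'ftt poot', query it with tag.QuerySelector(".ftt.poot"). A space inside a CSS selector means "descendant of", so ".ftt poot" would match something else entirely. This trapped me for a long time!!! Probably because I had not worked much in this area...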
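
The class-to-selector rule above can be sketched as a tiny helper that turns a class attribute value into a compound class selector. `CssSelectors` and `FromClassAttribute` are hypothetical names, not part of AngleSharp.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of AngleSharp or the original project.
public static class CssSelectors
{
    // Turns a class attribute value such as "ftt poot" into the selector ".ftt.poot".
    public static string FromClassAttribute(string classAttribute)
    {
        // Class names are separated by whitespace; drop empty entries from repeated spaces.
        string[] names = classAttribute.Split(
            new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        // Prefix each class name with '.' and join them without spaces.
        return string.Concat(names.Select(n => "." + n));
    }
}
```

The result can be passed straight to `QuerySelector`, for example `tag.QuerySelector(CssSelectors.FromClassAttribute("ftt poot"))`.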

    That's it, the crawling feature is done. The remaining odds and ends are yours to play with, haha.

    Code download: https://github.com/BruceQiu1996/WPF-/tree/master

  • Original article: https://www.cnblogs.com/qwqwQAQ/p/12014383.html