  • [Crawler] Scraping a Million Zhihu Users: The Crawler Module

    The source code for this project is on GitHub (stars welcome): https://github.com/wangqifan/ZhiHu

    UserManage is the crawler module that fetches user information.

    public   class UserManage
        {
            private string html;
    
            private string url_token;
    
         }

    Constructor

     

    A user's home page URL has the form "https://www.zhihu.com/people/" + url_token + "/following".

     public UserManage(string urltoken)
             {
                 url_token = urltoken;
             }

    First, wrap up a method that fetches the HTML page.

     

     private bool GetHtml()
            {
                string url = "https://www.zhihu.com/people/" + url_token + "/following";
                html = HttpHelp.DownLoadString(url);
                return !string.IsNullOrEmpty(html);
            }

    With the HTML page in hand, the next step is to extract the JSON embedded in it, using HtmlAgilityPack.

    public  void  analyse()
            {
                    if (GetHtml())
                    {
                        try
                        {
                            Stopwatch watch = new Stopwatch();
                            watch.Start();
                            HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
                            doc.LoadHtml(html);
                            HtmlNode node = doc.GetElementbyId("data");
                            StringBuilder stringbuilder =new StringBuilder(node.GetAttributeValue("data-state", ""));
                            stringbuilder.Replace("&quot;", "'");
                            stringbuilder.Replace("&lt;", "<");
                            stringbuilder.Replace("&gt;", ">");

                            watch.Stop();
                            Console.WriteLine("Parsing the HTML took {0} ms", watch.ElapsedMilliseconds.ToString());
                           
                        }
                        catch (Exception ex)
                        {
                            Console.WriteLine(ex.ToString());
                        }
                    }
                
                }    
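    The replacements are needed because the JSON sits inside an HTML attribute, so its quotes and angle brackets arrive escaped as entities. As a rough illustration, the sketch below decodes a made-up attribute value. It uses WebUtility.HtmlDecode from the standard library instead of chained Replace calls, which is a substitution and not what the original code does; the name DataStateDecoder is hypothetical.

```csharp
using System;
using System.Net;

public static class DataStateDecoder
{
    // Turns an HTML-escaped attribute value back into plain JSON text.
    // WebUtility.HtmlDecode handles &quot;, &lt;, &gt;, &amp; and numeric entities in one pass.
    public static string Decode(string escaped)
    {
        return WebUtility.HtmlDecode(escaped ?? string.Empty);
    }
}
```

    Calling DataStateDecoder.Decode on {&quot;name&quot;:&quot;a &lt;b&gt;&quot;} yields {"name":"a <b>"}. Unlike the Replace approach, this keeps double quotes, so the result is standard JSON.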

    Queue the links to the user's follower and followee lists

     private void  GetUserFlowerandNext(string json)
            {
                     // People who follow this user (first page, 20 entries per page).
                     string followers = "https://www.zhihu.com/api/v4/members/" + url_token + "/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20";
                     // People this user follows (first page, 20 entries per page).
                     string following = "https://www.zhihu.com/api/v4/members/" + url_token + "/followees?include=data%5B%2A%5D.answer_count%2Carticles_count%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=0";
                     RedisCore.PushIntoList(1, "nexturl", following);
                     RedisCore.PushIntoList(1, "nexturl", followers);
            }
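    The long include parameter in these two URLs is just a percent-encoded field list, and both variants decode to the same text. A minimal sketch of building such a URL from the readable form follows; FollowListUrl and Build are hypothetical names, and the exact escaping produced by Uri.EscapeDataString can vary between .NET versions.

```csharp
using System;

public static class FollowListUrl
{
    // The include expression in readable form; both encoded URLs above decode to this text.
    public const string Include =
        "data[*].answer_count,articles_count,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics";

    // kind is "followers" or "followees"; offset and limit control paging.
    public static string Build(string urlToken, string kind, int offset = 0, int limit = 20)
    {
        return "https://www.zhihu.com/api/v4/members/" + Uri.EscapeDataString(urlToken)
             + "/" + kind
             + "?include=" + Uri.EscapeDataString(Include)
             + "&offset=" + offset
             + "&limit=" + limit;
    }
}
```

    For example, FollowListUrl.Build(url_token, "followees") would stand in for the hand-written followee URL, and passing a larger offset requests a later page.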

    Drill further into the JSON and keep only the user's information, using the JSON library Newtonsoft.Json.

    private void  GetUserInformation(string json)
            {  
                    JObject obj = JObject.Parse(json);
                    string xpath = "['" + url_token + "']";
                    JToken tocken = obj.SelectToken("['entities']").SelectToken("['users']").SelectToken(xpath);
                    RedisCore.PushIntoList(2, "User", tocken.ToString());
                  
            }  
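    To make the lookup path concrete, here is a small sketch that pulls one user out of a payload shaped like entities → users → url_token. The sample JSON in the usage note is invented and far smaller than the real page state, and UserExtractor is a hypothetical name. It uses Json.NET indexers with null propagation instead of chained SelectToken calls, so a missing level returns null rather than throwing.

```csharp
using System;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class UserExtractor
{
    // Returns the compact JSON for one user from { entities: { users: { <token>: {...} } } },
    // or null when the user (or a parent object) is missing.
    public static string ExtractUser(string json, string urlToken)
    {
        JObject root = JObject.Parse(json);
        JToken user = root["entities"]?["users"]?[urlToken];
        return user?.ToString(Formatting.None);
    }
}
```

    With the input {'entities':{'users':{'alice':{'name':'Alice','follower_count':3}}}}, ExtractUser(json, "alice") returns {"name":"Alice","follower_count":3}, and asking for a token that is not present returns null.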
     
    Now let's complete the analyse function.
     public  void  analyse()
            {
                    if (GetHtml())
                    {
                        try
                        {
                            Stopwatch watch = new Stopwatch();
                            watch.Start();
                            HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
                            doc.LoadHtml(html);
                            HtmlNode node = doc.GetElementbyId("data");
                            StringBuilder stringbuilder =new StringBuilder(node.GetAttributeValue("data-state", ""));
                            stringbuilder.Replace("&quot;", "'");
                            stringbuilder.Replace("&lt;", "<");
                            stringbuilder.Replace("&gt;", ">");
                            GetUserInformation(stringbuilder.ToString());
                            GetUserFlowerandNext(stringbuilder.ToString());
                            watch.Stop();
                            Console.WriteLine("Parsing the HTML took {0} ms", watch.ElapsedMilliseconds.ToString());
                           
                        }
                        catch (Exception ex)
                        {
                            Console.WriteLine(ex.ToString());
                        }
                    }
                
                }    
            }
        
    

      

    UrlTask takes a follow-list URL from the nexturl queue and fetches that list. The server responds with JSON data.

    Wrap object serialization and deserialization in a helper class.

    public   class SerializeHelper
        {
            /// <summary>
            /// Serializes an object to a JSON string.
            /// </summary>
            /// <param name="value"></param>
            /// <returns></returns>
            public static string SerializeToString(object value)
            {
                return JsonConvert.SerializeObject(value);
            }
            /// <summary>
            /// Deserializes a JSON string into an object of type T.
            /// </summary>
            /// <typeparam name="T"></typeparam>
            /// <param name="str"></param>
            /// <returns></returns>
            public static T DeserializeToObject<T>(string str)
            {
              
                return JsonConvert.DeserializeObject<T>(str);
            }
    }

    Define the UrlTask class

    
    
     public class UrlTask
        {
            private  string url { get; set; }
            private string JSONstring { get; set; }
            public UrlTask(string _url)
            {
                url = _url;  
            }
    }
    
    

    Add a method that fetches the resource

    
    
     private bool GetHtml()
            {
                JSONstring= HttpHelp.DownLoadString(url);
                Console.WriteLine("JSON download finished");
                return !string.IsNullOrEmpty(JSONstring);
            }
    The method that parses the JSON
    
    
     public  void  Analyse() 
            {
                try
                {
                    if (GetHtml())
                    {
                        Stopwatch watch = new Stopwatch();
                        watch.Start();
                   
                        followerResult result = SerializeHelper.DeserializeToObject<followerResult>(JSONstring);
                         if (!result.paging.is_end)
                         {
                             RedisCore.PushIntoList(1, "nexturl", result.paging.next);                
                          }                           
                        foreach (var item in result.data)
                        {
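                             // Hash the url_token rather than the object, so equal tokens always map to the same Redis database (3 to 5),
                             // assuming the item type does not override GetHashCode(); & int.MaxValue avoids Math.Abs(int.MinValue) overflowing.
                             int type = (item.url_token.GetHashCode() & int.MaxValue) % 3 + 3;
                             if (RedisCore.InsetIntoHash(type, "urltokenhash", item.url_token, "exists"))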
                             {
                                 RedisCore.PushIntoList(1, "urltoken", item.url_token);
                               
                             }
                          
                        }
                        watch.Stop();
                        Console.WriteLine("Parsing the JSON took {0} ms",watch.ElapsedMilliseconds.ToString());
                    }
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex.ToString());
                }
              
       }
    
    

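    Explanation: if result.paging.is_end is true, this is the last page of the user's follow list and there is nothing more to queue; otherwise the page's next URL is pushed onto the nexturl queue. The user entries in the returned array are incomplete, so they are discarded apart from each url_token, which is then used to visit that user's home page for the full details.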
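    The followerResult type is not shown in this article. Below is a minimal sketch of classes that would satisfy the way Analyse reads it (paging.is_end, paging.next, and data[].url_token); everything else about these types, including the names Paging, FollowItem, and FollowPageReader, is an assumption, and the real API response carries many more fields.

```csharp
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical DTOs: only the members that Analyse reads are modelled.
public class Paging
{
    public bool is_end { get; set; }
    public string next { get; set; }
}

public class FollowItem
{
    public string url_token { get; set; }
}

public class followerResult
{
    public Paging paging { get; set; }
    public List<FollowItem> data { get; set; }
}

public static class FollowPageReader
{
    // Returns the url_tokens on this page; nextUrl is null when this is the last page.
    public static List<string> Read(string json, out string nextUrl)
    {
        followerResult page = JsonConvert.DeserializeObject<followerResult>(json);
        nextUrl = (page.paging != null && !page.paging.is_end) ? page.paging.next : null;
        var tokens = new List<string>();
        foreach (FollowItem item in page.data ?? new List<FollowItem>())
        {
            if (!string.IsNullOrEmpty(item.url_token)) tokens.Add(item.url_token);
        }
        return tokens;
    }
}
```

    Json.NET ignores properties that have no matching member by default, so the extra fields in each user entry are dropped during deserialization.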

    
    

     Putting the modules together

    Wrap up a method that pops a nexturl from the queue, visits that follow list, and collects more user IDs.

      private static void GetNexturl()
            {
                string nexturl = RedisCore.PopFromList(1, "nexturl");
                if (!string.IsNullOrEmpty(nexturl))
                {
                    UrlTask task = new UrlTask(nexturl);
                    task.Analyse();
                }
            }

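    Wrap up a method that loops: pop a user's urltoken from the queue (calling GetNexturl when the queue is empty), visit the user's home page, and fetch the information.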

    private static void GetUser(object data)
            {
    
                while (true)
                {
                    string url_token = RedisCore.PopFromList(1, "urltoken");
                    Console.WriteLine(url_token);
                    if (!string.IsNullOrEmpty(url_token))
                    {
                        UserManage manage = new UserManage(url_token);
                        manage.analyse();
                    }
                    else
                    {
                        GetNexturl();
                    }
                }
                  
            }
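    The scheduling idea in this loop can be tried out without Redis. The sketch below is an in-memory stand-in, not the RedisCore implementation: two queues play the role of the urltoken and nexturl lists, and a set plays the role of the urltokenhash dedup check. CrawlFrontier and its members are hypothetical names, and the class is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-in for the Redis lists and hash, for illustration only.
public class CrawlFrontier
{
    private readonly Queue<string> urlTokens = new Queue<string>();
    private readonly Queue<string> nextUrls = new Queue<string>();
    private readonly HashSet<string> seen = new HashSet<string>();

    // Queues a user only the first time its token is seen; returns whether it was new.
    public bool AddUser(string urlToken)
    {
        if (!seen.Add(urlToken)) return false;
        urlTokens.Enqueue(urlToken);
        return true;
    }

    public void AddNextUrl(string url)
    {
        nextUrls.Enqueue(url);
    }

    // Mirrors the loop body: prefer a user token, fall back to a follow-list URL.
    // Returns "user:<token>", "list:<url>", or null when both queues are empty.
    public string NextJob()
    {
        if (urlTokens.Count > 0) return "user:" + urlTokens.Dequeue();
        if (nextUrls.Count > 0) return "list:" + nextUrls.Dequeue();
        return null;
    }
}
```

    When both queues are empty the loop in GetUser keeps polling without a pause; a worker built on a frontier like this one might sleep briefly whenever NextJob returns null.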

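    Run these methods from the main function. Because the workload is large, use multiple threads; pick the thread count to suit your situation.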

     

    for (int i = 0; i < 10; i++)
                {
                    ThreadPool.QueueUserWorkItem(GetUser);
                }
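    One thing to keep in mind: thread-pool threads are background threads, so the process ends as soon as Main returns, and Main therefore has to block in some way (how the original does this is not shown). As an alternative sketch, dedicated foreground threads keep the process alive on their own. WorkerHost is a hypothetical helper, not part of the project.

```csharp
using System;
using System.Threading;

public static class WorkerHost
{
    // Starts `count` dedicated foreground threads that each run `work` once.
    public static Thread[] Start(int count, Action work)
    {
        var threads = new Thread[count];
        for (int i = 0; i < count; i++)
        {
            threads[i] = new Thread(() => work());
            threads[i].IsBackground = false; // keep the process alive until the worker returns
            threads[i].Start();
        }
        return threads;
    }

    // Blocks until every worker has finished.
    public static void JoinAll(Thread[] threads)
    {
        foreach (Thread t in threads) t.Join();
    }
}
```

    From inside the class that defines GetUser, a call such as WorkerHost.Start(10, () => GetUser(null)) would replace the loop above.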

    Add seed data: the queues are all empty at the start, so they need to be seeded. There are two ways:

    1. Add it manually by typing commands in redis-cli.exe
    2. Add the following to the main function

        

     UserManage task = new UserManage("some user's url_token");

                   task.analyse();

    Comment this out after running it once, to avoid seeding the same data again.

  • Original article: https://www.cnblogs.com/zuin/p/6261745.html