  • Crawling Book Information from Dangdang: Conclusion

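    Dangdang lists a huge number of books, and crawling all of them would be a lot of work, so only one category is crawled here.

    In the Main() method:

    First, the user enters the seed URL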

     string starturl = Console.ReadLine();

    Create the database context object

       BookStoreEntities storeDB = new BookStoreEntities();
    

    Get the URLs of the book categories

     string html = Tool.GetHtml(starturl);
     // Tool.GetList returns the category URLs found on the seed page
     ArrayList list = Tool.GetList(html);
     foreach (var item in list)
     {
         BookClass bookclass = new BookClass();
         bookclass.Url = item.ToString();
         storeDB.BookClass.Add(bookclass);
     }
     // Save all categories in a single batch
     storeDB.SaveChanges();

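    Crawl the book information with multiple threads

      Each book category gets its own thread, which crawls the books in that category

    Wrap the work in a process class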

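     public class process
     {
         // Each process instance owns its own context, so worker threads never share one
         BookStoreEntities storeDB = new BookStoreEntities();

         public BookClass BookClass;

         public process(int BookClassId)
         {
             // Load the category this worker is responsible for
             BookClass = storeDB.BookClass.Find(BookClassId);
         }
     }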

    Next, the crawling itself is implemented in this class

     public void threads()
     {
     }

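    Implementing pagination

    The category listing pages follow a regular pattern:

    http://category.dangdang.com/cp01.54.06.00.00.00.html
    http://category.dangdang.com/pg2-cp01.54.06.00.00.00.html
    http://category.dangdang.com/pg3-cp01.54.06.00.00.00.html

    Split the first page's URL into two parts: the front part, http://category.dangdang.com/, and the rear part, cp01.54.06.00.00.00.html

    Pages 2 through 100 are all: front part + "pg" + page number + "-" + rear part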
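The URL rule above can be sketched as a small standalone helper. This is only an illustration: the class name `PageUrls` and the method `BuildPageUrl` are hypothetical and do not appear in the original code, which performs the same split inline with `IndexOf("cp")`.

```csharp
using System;

public static class PageUrls
{
    // Hypothetical helper (not part of the original post): builds the URL of
    // listing page `page` from the URL of the category's first page.
    public static string BuildPageUrl(string firstPageUrl, int page)
    {
        if (page < 1)
            throw new ArgumentOutOfRangeException(nameof(page));

        // The first page has no "pgN-" prefix.
        if (page == 1)
            return firstPageUrl;

        // Split at the first "cp", as the post does.
        int split = firstPageUrl.IndexOf("cp", StringComparison.Ordinal);
        if (split <= 0)
            throw new FormatException("Not a category listing URL: " + firstPageUrl);

        string front = firstPageUrl.Substring(0, split); // http://category.dangdang.com/
        string rear = firstPageUrl.Substring(split);     // cp01.54.06.00.00.00.html

        return front + "pg" + page + "-" + rear;
    }
}
```

Calling `PageUrls.BuildPageUrl` with the first URL listed above and page 2 or 3 reproduces the second and third URLs. Throwing on an unexpected URL is a design choice of this sketch; the post's own code simply returns from the method in that case.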

     for (int i = 1; i <= BookClass.Pages; i++)
     {
         // Page 1: http://category.dangdang.com/cp01.54.06.00.00.00.html
         // Page n: http://category.dangdang.com/pg2-cp01.54.13.00.00.00.html
         string tempurl = BookClass.Url;
         int p1 = tempurl.IndexOf("cp");
         if (p1 <= 0)
         {
             // No "cp" part (e.g. http://book.dangdang.com/01.54.htm?ref=book-01-A),
             // so this is not a category listing URL
             return;
         }
         string fast = tempurl.Substring(0, p1); // front part
         string rear = tempurl.Substring(p1);    // rear part
         // The first page has no "pgN-" prefix
         string url = (i == 1) ? tempurl : fast + "pg" + i.ToString() + "-" + rear;
     }

    Continue adding to this method

     public void threads()
     {
         for (int i = 1; i <= BookClass.Pages; i++)
         {
             // Page 1: http://category.dangdang.com/cp01.54.06.00.00.00.html
             // Page n: http://category.dangdang.com/pg2-cp01.54.13.00.00.00.html
             string tempurl = BookClass.Url;
             int p1 = tempurl.IndexOf("cp");
             if (p1 <= 0)
             {
                 // No "cp" part (e.g. http://book.dangdang.com/01.54.htm?ref=book-01-A),
                 // so this is not a category listing URL
                 return;
             }
             string fast = tempurl.Substring(0, p1); // front part
             string rear = tempurl.Substring(p1);    // rear part
             // The first page has no "pgN-" prefix
             string url = (i == 1) ? tempurl : fast + "pg" + i.ToString() + "-" + rear;

             // Download the listing page and extract the product page URLs
             string internet = Tool.GetHtml(url);
             ArrayList L = Tool.GetProduct(internet);
             foreach (var item in L)
             {
                 Console.WriteLine(item.ToString());
                 // Download the product page and parse its fields, keyed by number
                 string html = Tool.GetHtml(item.ToString());
                 Dictionary<int, string> dict = Tool.analysis(html);
                 Book book = new Book
                 {
                     BookName = dict[1],
                     Price = Convert.ToDecimal(dict[2]),
                     AuthorName = dict[3],
                     Publisher = dict[4],
                     PictureUrl = dict[5],
                     BookContent = dict[6]
                 };
                 // Attach the book to this category and save it immediately
                 BookClass.Books.Add(book);
                 storeDB.SaveChanges();
             }
         }
     }
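The method above indexes `dict` directly and calls `Convert.ToDecimal`, so a product page with a missing field or an unparsable price would throw and end that category's thread. A minimal sketch of more defensive field access follows. The class `BookFields` and both helpers are hypothetical, and the assumption that a price string may carry a leading currency symbol is mine, not something the post states.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class BookFields
{
    // Hypothetical helper: returns the trimmed field stored under `key`,
    // or an empty string when the key is missing or the value is null.
    public static string GetOrEmpty(IDictionary<int, string> fields, int key)
    {
        string value;
        return fields.TryGetValue(key, out value) && value != null ? value.Trim() : "";
    }

    // Hypothetical helper: parses a price such as "23.50" or "¥23.50".
    // Returns false instead of throwing when the text is not a number.
    // Assumption: the price may start with a currency symbol.
    public static bool TryParsePrice(string raw, out decimal price)
    {
        price = 0m;
        if (string.IsNullOrWhiteSpace(raw))
            return false;

        string cleaned = raw.Trim().TrimStart('¥', '￥', '$');
        return decimal.TryParse(cleaned, NumberStyles.Number,
                                CultureInfo.InvariantCulture, out price);
    }
}
```

With helpers like these, the loop could skip a book whose price does not parse rather than letting the exception escape, at the cost of silently dropping some records. Whether that trade-off is acceptable depends on how the data will be used.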

    Back in the Main function

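     var items = storeDB.BookClass;

     foreach (var bookclass in items)
     {
         // One worker (with its own context) and one thread per category
         process p = new process(bookclass.BookClassId);
         Thread th = new Thread(p.threads);
         th.IsBackground = true;
         th.Start();
         // Stagger the thread starts by one second
         Thread.Sleep(1000);
     }
     storeDB.SaveChanges();
     // Background threads end when the process exits, so block here while they work
     Console.ReadLine();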
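The loop above keeps the program alive with `Console.ReadLine()`, so it is up to the user to decide when crawling has finished. One alternative, sketched here under my own assumptions, is to keep the `Thread` objects and call `Join` on each. The class `Workers` and the method `RunAll` are hypothetical names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class Workers
{
    // Hypothetical helper: starts one thread per job, optionally pausing
    // between starts, then blocks until every thread has finished.
    public static void RunAll(IEnumerable<Action> jobs, int staggerMs)
    {
        var threads = new List<Thread>();
        foreach (Action job in jobs)
        {
            Action current = job; // local copy captured by the lambda
            var th = new Thread(() => current());
            th.IsBackground = true;
            th.Start();
            threads.Add(th);

            if (staggerMs > 0)
                Thread.Sleep(staggerMs);
        }

        // Wait for all workers instead of waiting for a key press.
        foreach (Thread th in threads)
            th.Join();
    }
}
```

In Main, the jobs could be built by adding `new process(bookclass.BookClassId).threads` to a `List<Action>` for each category and then calling `Workers.RunAll(jobs, 1000)`. An unhandled exception in a worker thread would still terminate the process, so this sketch does not replace error handling inside `threads()`.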
  • Original article: https://www.cnblogs.com/zuin/p/6106468.html