  • A First Look at Python Web Crawlers

    This article summarizes the Python web crawling in practice course on NetEase Cloud Classroom. Interested readers can watch the video course. Course link

    What is a crawler

    A program that automatically fetches information from the Internet.
    

    Unstructured data

    Data with no fixed format, such as web pages.
    It must be converted into structured data through an ETL (Extract, Transform, Load) process before it can be used.
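    As a minimal sketch of the transformation step (the raw lines and field names below are invented for illustration), turning unstructured text into structured records might look like:

```python
# Hypothetical raw lines scraped from a page, in "time|title|link" form.
raw = [
    "2018-06-02|Some headline|http://example.com/a.shtml",
    "2018-06-03|Another headline|http://example.com/b.shtml",
]

# Transform: parse each line into a structured record (a dict with named fields).
records = [dict(zip(["time", "title", "link"], line.split("|"))) for line in raw]

print(records[0]["title"])  # Some headline
```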
    

    Installing the tools

    Anaconda

    pip install requests
    pip install BeautifulSoup4
    pip install jupyter
    

    Launching Jupyter

    jupyter notebook
    

    requests: a library for fetching web resources

    Fetching a page

    import requests
    url = ''  # fill in the URL of the page to fetch
    res = requests.get(url)
    res.encoding = 'utf-8'
    print (res.text)
    

    Loading the page into BeautifulSoup

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(res.text, 'html.parser')
    print (soup.text)
    

    Use the select method to find HTML elements by tag name, id, or class; it returns a list.

    select('h1')        # by tag name
    select('a')
    select('#thehead')  # by id, for an element with id="thehead"
    select('.fl')       # by class
    
    alink = soup.select('a')
    for link in alink:
        print (link['href'])
    

    Examples

    • 1. Get the time, title, and link of the news items on Sina Shaanxi

      import requests
      from bs4 import BeautifulSoup
      res = requests.get('http://sx.sina.com.cn/')
      res.encoding = 'utf-8'
      soup = BeautifulSoup(res.text, 'html.parser')
      
      for newslist in soup.select('.news-list.cur'):
          # select('li') already searches all descendants, so no extra
          # loop over the tag's children is needed (iterating children
          # directly can also hit bare text nodes, which have no .select)
          for li in newslist.select('li'):
              title = li.select('h2')[0].text
              href = li.select('a')[0]['href']
              time = li.select('.fl')[0].text
              print(time, title, href)
      
    • 2. Get an article's title, source, time, and body

      import requests
      from bs4 import BeautifulSoup
      from datetime import datetime
      res = requests.get('http://sx.sina.com.cn/news/b/2018-06-02/detail-ihcikcew5095240.shtml')
      res.encoding = 'utf-8'
      soup = BeautifulSoup(res.text, 'html.parser')
      
      h1 = soup.select('h1')[0].text
      source = soup.select('.source-time span span')[0].text
      timesource = soup.select('.source-time')[0].contents[0].text
      date = datetime.strptime(timesource, '%Y-%m-%d %H:%M')
      
      article = []
      for p in soup.select('.article-body p')[:-1]:
          article.append(p.text.strip())
      
      ' '.join(article)
      

      Or, more compactly:

      ' '.join([p.text.strip() for p in soup.select('.article-body p')[:-1]])
      

      Notes:

      the datetime module is used to parse the time string
      [:-1] drops the last element of the list
      strip() removes the given leading and trailing characters (whitespace and newlines by default)
      ' '.join(article) joins the list into a single string separated by spaces
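      Each of these pieces can be tried in isolation (the strings below are made up):

```python
from datetime import datetime

# strptime parses a string into a datetime according to the given format.
date = datetime.strptime('2018-06-02 10:30', '%Y-%m-%d %H:%M')
print(date.year, date.hour)  # 2018 10

# [:-1] drops the last list element; strip() trims whitespace;
# ' '.join(...) glues the remaining pieces together with spaces.
paragraphs = ['first ', ' second', 'trailing junk']
text = ' '.join(p.strip() for p in paragraphs[:-1])
print(text)  # first second
```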
      
    • 3. Get the article's comment count. The count is written into the page by JavaScript, so it cannot be fetched with the method above; in the browser developer tools, find the JS request that loads the comments

      import requests
      import json
      
      comments = requests.get('http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-hcikcew5095240:0')
      # split off the 'var data=' prefix before parsing the JSON
      jd = json.loads(comments.text.split('=', 1)[1])
      
      jd['result']['count']['sx:comos-hcikcew5095240:0']['total']
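      The response is JSONP-style text of the form var data={...}, not pure JSON, which is why the prefix has to be removed before json.loads. An offline sketch with an invented payload:

```python
import json

# Invented response in the same 'var data=...' shape as the comment API.
jsonp = 'var data={"result": {"count": {"total": 42}}}'

# Split on the first '=' and parse the JSON part. Note that
# str.strip('var data=') strips *characters* from both ends rather than
# the literal prefix, so splitting on '=' is the safer way to drop it.
jd = json.loads(jsonp.split('=', 1)[1])
print(jd['result']['count']['total'])  # 42
```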
      
    • 4. Wrap the comment-count logic into a function

      import re
      import json
      import requests
      
      commenturl = 'http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-{}:0'
      
      def getCommentCounts(url):
          # capture the news id between 'detail-i' and '.shtml'
          m = re.search(r'detail-i(.+)\.shtml', url)
          newsid = m.group(1)
          comments = requests.get(commenturl.format(newsid))
          # split off the 'var data=' prefix before parsing the JSON
          jd = json.loads(comments.text.split('=', 1)[1])
          return jd['result']['count']['sx:comos-'+newsid+':0']['total']
      
      news = 'http://sx.sina.com.cn/news/b/2018-06-01/detail-ihcikcev8756673.shtml'
      getCommentCounts(news)
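      The regex capture that extracts the news id can be checked on its own with a URL from the article:

```python
import re

url = 'http://sx.sina.com.cn/news/b/2018-06-01/detail-ihcikcev8756673.shtml'

# Capture everything between 'detail-i' and '.shtml': the news id.
m = re.search(r'detail-i(.+)\.shtml', url)
newsid = m.group(1)
print(newsid)  # hcikcev8756673
```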
      
    • 5. A function that takes a URL and returns all of an article's information (title, time, source, body, etc.) (complete version)

      import requests
      import json
      import re
      from bs4 import BeautifulSoup
      from datetime import datetime
      
      commenturl = 'http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-{}:0'
      
      def getCommentCounts(url):
          m = re.search(r'detail-i(.+)\.shtml', url)
          newsid = m.group(1)
          comments = requests.get(commenturl.format(newsid))
          jd = json.loads(comments.text.split('=', 1)[1])
          return jd['result']['count']['sx:comos-'+newsid+':0']['total']
      
      def getNewsDetail(newsurl):
          result = {}
          res = requests.get(newsurl)
          res.encoding = 'utf-8'
          soup = BeautifulSoup(res.text, 'html.parser')
          result['title'] = soup.select('h1')[0].text
          result['newssource'] = soup.select('.source-time span span')[0].text
          timesource = soup.select('.source-time')[0].contents[0].text
          result['date'] = datetime.strptime(timesource, '%Y-%m-%d %H:%M')
          result['article'] = ' '.join([p.text.strip() for p in soup.select('.article-body p')[:-1]])
          result['comments'] = getCommentCounts(newsurl)
          return result
          
      news = 'http://sx.sina.com.cn/news/b/2018-06-02/detail-ihcikcew8995238.shtml'
      getNewsDetail(news)
      
  • Original article: https://www.cnblogs.com/ghq120/p/9160214.html