  • Python demo: scraping web page content

    # Extract text from HTML
    from bs4 import BeautifulSoup

    # A multi-line string literal needs triple quotes
    html_sample = '''
    <html>
    <body>
    <h1 id = "title">Hello world</h1>
    <a href = "#www.baidu.com" class = "link"> This is link1</a>
    <a href = "#link2" class = "link"> This is link2</a>
    </body>
    </html>'''
    soup = BeautifulSoup(html_sample, 'html.parser')

    # All text in the document, with tags stripped
    print(soup.text)

    # select() takes a CSS selector and returns a list of matching tags
    print(soup.select('h1'))
    print(soup.select('h1')[0].text)
    print(soup.select('a')[0].text)
    print(soup.select('a')[1].text)

    # Iterate over every <a> tag
    for alink in soup.select('a'):
        print(alink.text)

    # Select by id (#title) and by class (.link)
    print(soup.select('#title')[0].text)
    print(soup.select('.link')[0].text)

    # Attributes are read like dictionary keys
    alinks = soup.select('a')
    for link in alinks:
        print(link['href'])
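The lookups above index into the list returned by `select`, which raises `IndexError` when nothing matches, and `link['href']` raises `KeyError` when a tag has no `href`. A minimal sketch of a more defensive variant follows, using `select_one` and `Tag.get`. The sample HTML (one link without an `href`) and the `.missing` selector are made up for illustration and are not part of the original demo.

```python
from bs4 import BeautifulSoup

# Illustrative sample: the second link deliberately has no href attribute
html_sample = """
<html>
<body>
<h1 id="title">Hello world</h1>
<a href="#link1" class="link"> This is link1</a>
<a class="link"> This is link2</a>
</body>
</html>"""

soup = BeautifulSoup(html_sample, 'html.parser')

# select_one returns the first match, or None when nothing matches
title = soup.select_one('#title')
title_text = title.get_text(strip=True) if title is not None else ''

# No element carries this (hypothetical) class, so the result is None
missing = soup.select_one('.missing')

# Tag.get returns None instead of raising KeyError for an absent attribute
hrefs = [a.get('href') for a in soup.select('a.link')]

print(title_text)
print(missing)
print(hrefs)
```

Checking for `None` before reading `.text` or an attribute keeps the script running when a page's markup differs from what the selector expects.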

    demo2:

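    import pandas
    import requests
    from bs4 import BeautifulSoup

    # Download the news front page and parse it
    res = requests.get('http://news.qq.com/')
    soup = BeautifulSoup(res.text, 'html.parser')

    # Collect the title and URL of each headline block.
    # The selector depends on the page's markup and may need updating.
    newsary = []
    for news in soup.select('.Q-tpWrap .text'):
        newsary.append({'title': news.select('a')[0].text, 'url': news.select('a')[0]['href']})

    # Save the results to an Excel file (requires an Excel writer such as openpyxl)
    newsdf = pandas.DataFrame(newsary)
    newsdf.to_excel('news.xlsx')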
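The extraction loop in demo2 can be tried without a network connection by moving it into a function that takes an HTML string. The sketch below assumes the same `.Q-tpWrap .text` selector; the function name `extract_news`, the sample markup, and the `example.com` URLs are hypothetical and only imitate the structure the selector expects. Blocks without a usable link are skipped rather than raising an error.

```python
from bs4 import BeautifulSoup


def extract_news(html, selector='.Q-tpWrap .text'):
    """Return a list of {'title', 'url'} dicts, one per block that contains a link."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for block in soup.select(selector):
        link = block.select_one('a')
        # Skip blocks with no <a> tag or with an <a> tag lacking href
        if link is None or not link.get('href'):
            continue
        rows.append({'title': link.get_text(strip=True), 'url': link['href']})
    return rows


# Hypothetical markup imitating the structure the selector targets
sample = """
<div class="Q-tpWrap">
  <div class="text"><a href="http://example.com/a">First headline</a></div>
  <div class="text"><span>no link here</span></div>
  <div class="text"><a href="http://example.com/b"> Second headline </a></div>
</div>
"""

rows = extract_news(sample)
print(rows)
```

With a live page, `res.text` from demo2 would be passed in place of `sample`, and the returned list can go straight into `pandas.DataFrame` as before.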

    Recommended: use Jupyter Notebook for practice; it is very convenient.

    Why fear that truth is endless? Every inch gained brings an inch of joy. --- Hu Shi
  • Original article: https://www.cnblogs.com/hujianglang/p/9650329.html