  • Web scraping: collecting links from Baidu News

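    1. Install beautifulsoup4: open cmd and run pip install beautifulsoup4
    Python ships with urllib, a built-in package for opening and working with URLs; BeautifulSoup is used to parse the HTML that comes back.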

    Verify that the installation succeeded
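One way to check the installation, as a minimal sketch: import the package, print its version, and parse a tiny hard-coded snippet. The HTML string and the variable names below are made up for illustration and are not part of the original post.

```python
import bs4
from bs4 import BeautifulSoup

# If this import fails with ImportError, the package is not installed
# for the interpreter that is running the script.
print(bs4.__version__)

# Parse a small hard-coded document (illustrative only) to confirm that
# the built-in "html.parser" backend works.
soup = BeautifulSoup('<a href="page.html">demo</a>', "html.parser")
first_href = soup.find("a").get("href")
print(first_href)
```

If both print statements run without an error, the package is importable from the current interpreter.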

    2. Configure PyCharm

    3. The code is as follows

    import urllib.request
    from bs4 import BeautifulSoup


    class Scraper:
        def __init__(self, site):
            self.site = site

        def scrape(self):
            # Download the page and parse it with Python's built-in HTML parser
            r = urllib.request.urlopen(self.site)
            html = r.read()
            parser = "html.parser"
            sp = BeautifulSoup(html, parser)
            # Print the href of every <a> tag whose value contains "html"
            for tag in sp.find_all("a"):
                url = tag.get("href")
                if url is None:
                    continue
                if "html" in url:
                    print(" " + url)


    news = "http://news.baidu.com/"
    Scraper(news).scrape()


    4. Running the script prints the links found on the Baidu News page
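The "html" in url check is a plain substring test, so it can also match query strings, and some href values may be relative paths. As a hedged sketch of one possible refinement, the hypothetical helper normalize_links below resolves each href against the page URL with the standard library's urljoin and keeps only paths ending in .html or .htm. The sample href values are invented for illustration and are not real scraper output.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base, hrefs):
    """Resolve hrefs against base and keep unique .html/.htm pages.

    Hypothetical helper for illustration; not part of the original script.
    """
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # skip None and empty strings
        absolute = urljoin(base, href)
        path = urlparse(absolute).path.lower()
        if not path.endswith((".html", ".htm")):
            continue  # drop scripts, search pages, and other non-article links
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Invented sample values covering relative, absolute, empty, and non-page hrefs
sample = [
    "/a/story.html",
    "http://example.com/b.html",
    None,
    "javascript:void(0);",
    "/a/story.html",
    "/search?q=html",
]
links = normalize_links("http://news.baidu.com/", sample)
print(links)
```

Checking the parsed path instead of the whole string means a link such as /search?q=html is no longer kept just because "html" appears in its query.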

    5. How can the collected links be saved to a file?

    import urllib.request
    from bs4 import BeautifulSoup


    class Scraper:
        def __init__(self, site):
            self.site = site

        def scrape(self):
            response = urllib.request.urlopen(self.site)
            html = response.read()
            soup = BeautifulSoup(html, 'html.parser')
            # Open the output file once and write one link per line
            with open("output.txt", "w", encoding="utf-8") as f:
                for tag in soup.find_all('a'):
                    url = tag.get('href')
                    # Skip tags without an href and links that lack "html"
                    if url and 'html' in url:
                        print(" " + url)
                        f.write(url + "\n")


    Scraper('http://news.baidu.com/').scrape()
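Once the links are in output.txt, they can be read back for later processing. The sketch below uses a hypothetical helper, load_links, that splits the file on any whitespace (so it works whether the links are separated by spaces or by newlines) and drops duplicates while keeping the original order. The demo writes made-up URLs to a temporary file, so it does not touch a real output.txt.

```python
import tempfile
from pathlib import Path


def load_links(path):
    """Return the unique links stored in a file, in first-seen order.

    Hypothetical helper for illustration; not part of the original script.
    """
    text = Path(path).read_text(encoding="utf-8")
    # str.split() with no arguments splits on spaces, tabs, and newlines alike;
    # dict.fromkeys() removes duplicates while preserving insertion order.
    return list(dict.fromkeys(text.split()))


# Demo with a throwaway file and invented URLs
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "output.txt"
    demo_file.write_text(
        "http://a.example/1.html http://a.example/2.html\nhttp://a.example/1.html\n",
        encoding="utf-8",
    )
    demo_links = load_links(demo_file)

print(demo_links)
```

To use it on the scraper's results, call load_links("output.txt") after the script has run.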




  • Original article: https://www.cnblogs.com/JacquelineQA/p/12977380.html