  • Python web scraping notes

    1. Libraries used: requests, BeautifulSoup

    2. requests

    response = requests.get(
        url='https://www.autohome.com.cn/news/'
    )

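    response.encoding = response.apparent_encoding  # use the encoding detected from the body
    response.text         # body decoded to str using response.encoding
    response.content      # raw body as bytes
    response.status_code  # HTTP status code, e.g. 200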
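
    Why the encoding is reassigned before reading response.text can be shown without any network access. The following is a minimal sketch, assuming a page whose bytes are GBK-encoded; the sample string and the choice of GBK are illustrative assumptions, not taken from the site. The bytes play the role of response.content, and the decoded strings play the role of response.text under a wrong and a correct encoding.

```python
# Illustrative only: sample_text and the GBK encoding are assumptions for this sketch.
sample_text = "汽车新闻"
raw_bytes = sample_text.encode("gbk")      # plays the role of response.content (bytes)

# requests falls back to ISO-8859-1 for text/* responses that declare no charset,
# so decoding with that encoding mimics response.text before the fix.
wrong = raw_bytes.decode("iso-8859-1")     # garbled characters

# Decoding with the real encoding mimics response.text after
# response.encoding has been set to the detected encoding.
right = raw_bytes.decode("gbk")

print(repr(wrong))
print(right)
```

    The same bytes produce readable text only when the decoder matches the encoding the server actually used, which is what assigning apparent_encoding tries to achieve.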
    
    
    3. BeautifulSoup
    Convert the response into a soup object
    soup = BeautifulSoup(response.text, features='html.parser')  # html.parser ships with Python; lxml is faster and often used in production
    Find an element by id
    soup.find(id="chazy")
    Find li, div, img and other HTML tags nested under an element
    target = soup.find(id="auto-channel-lazyload-article").find('li')  # first li
    li_list = soup.find(id="auto-channel-lazyload-article").find_all('li')  # all li
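
    The difference between find and find_all is easier to see on a small, fixed document. A minimal sketch, assuming the hypothetical markup below (the real page structure may differ); only the id auto-channel-lazyload-article comes from the notes above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page structure may differ.
html = """
<div id="auto-channel-lazyload-article">
  <ul>
    <li><a href="/news/1.html"><h3>First</h3></a></li>
    <li><a href="/news/2.html"><h3>Second</h3></a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, features="html.parser")
container = soup.find(id="auto-channel-lazyload-article")

first_li = container.find("li")      # first matching tag, or None if there is none
all_li = container.find_all("li")    # list-like result holding every match

print(first_li.find("a").get("href"))
print([li.get_text(strip=True) for li in all_li])
print(soup.find(id="missing"))       # find returns None when nothing matches
```

    Because find returns None for a missing element, chained calls such as soup.find(...).find('li') raise AttributeError when the first lookup fails, so checking the intermediate result is worthwhile on pages that may change.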


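    4. A simple example
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(
        url='https://www.autohome.com.cn/news/'
    )
    response.encoding = response.apparent_encoding
    print(response.status_code)
    soup = BeautifulSoup(response.text, features='html.parser')  # html.parser ships with Python; lxml is faster and often used in production

    # Look up the article list by id
    target = soup.find(id="auto-channel-lazyload-article").find('li')  # first li
    li_list = soup.find(id="auto-channel-lazyload-article").find_all('li')  # all li

    for li in li_list:
        a = li.find('a')  # find the a tag
        if not a:
            continue  # skip entries without a link
        print(a.attrs)
        print(a.attrs.get('href'))

        img_tag = li.find('img')
        if not img_tag or not img_tag.get('src'):
            continue  # skip entries without an image
        img = img_tag.get('src')
        res = requests.get(img)
        # The original never defines title; assumption: the headline sits in an <h3> inside the link
        h3 = a.find('h3')
        if not h3:
            continue
        title = h3.get_text(strip=True)
        file_name = "%s.jpg" % (title,)
        with open(file_name, 'wb') as f:
            f.write(res.content)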
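
    The download loop passes the img src straight to requests.get and builds the file name from the title. Two things can go wrong there: the src may be relative or protocol-relative (starting with //), and a title may contain characters that are not valid in file names. A rough sketch of two helpers for these cases; resolve_image_url, safe_file_name and the example.com address are hypothetical names introduced here, not part of the original notes.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "https://www.autohome.com.cn/news/"


def resolve_image_url(src, page_url=PAGE_URL):
    """Hypothetical helper: make a relative or protocol-relative src absolute."""
    return urljoin(page_url, src)


def safe_file_name(title, extension=".jpg"):
    """Hypothetical helper: replace characters that common file systems reject."""
    # Collapse each run of reserved characters or whitespace into one underscore.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return (cleaned or "untitled") + extension


print(resolve_image_url("//img.example.com/a/b.jpg"))
print(safe_file_name('A/B: "test"?'))
```

    In the loop, these would be applied as requests.get(resolve_image_url(img)) and file_name = safe_file_name(title). The set of replaced characters is a conservative guess covering Windows and Unix restrictions, not an exhaustive rule.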


  • Original article: https://www.cnblogs.com/yoyo008/p/9284051.html