  • Python web scraping notes

    1. Libraries used: requests, BeautifulSoup

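    2. requests

    response = requests.get(
        url='https://www.autohome.com.cn/news/'
    )

    response.encoding = response.apparent_encoding  # decode the text with the encoding detected from the body
    response.text         # body decoded to str
    response.content      # raw body as bytes
    response.status_code  # HTTP status code, e.g. 200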
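Why the `response.encoding = response.apparent_encoding` line matters can be seen without any network access. The sketch below uses only the standard library; the sample string and the choice of GBK are assumptions for illustration, not something taken from the target site. requests falls back to ISO-8859-1 for text responses whose Content-Type header carries no charset, so decoding Chinese text that way garbles it, while decoding with the encoding the bytes were actually written in gives the text back.

```python
# Raw bytes as a server might send them (assumed sample, GBK-encoded).
content = "汽车之家新闻".encode("gbk")

# Decoding with the wrong encoding does not raise, it just yields mojibake.
wrong = content.decode("iso-8859-1")

# Decoding with the real encoding recovers the original text.
right = content.decode("gbk")

print(repr(wrong))
print(right)
```

`response.content` corresponds to `content` here, and `response.text` to the decoded string, which is only readable when `response.encoding` matches the real encoding.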
    
    
    3. BeautifulSoup
    Convert the response into a soup object:
    soup = BeautifulSoup(response.text, features='html.parser')  # html.parser ships with Python; lxml performs better and suits production use
    Find an element by id:
    soup.find(id="chazy")
    Find HTML tags such as li, div, img inside an element:
    target = soup.find(id="auto-channel-lazyload-article").find('li')  # the first li
    li_list = soup.find(id="auto-channel-lazyload-article").find_all('li')  # all li elements
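The difference between `find` and `find_all` is easier to see on a small inline document than on a live page. A minimal sketch, assuming a made-up HTML fragment; the id `demo-list` and the links are hypothetical and only stand in for the real page structure:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page (assumed structure, for illustration only).
html = """
<div id="demo-list">
  <ul>
    <li><a href="/news/1.html">First</a></li>
    <li><a href="/news/2.html">Second</a></li>
    <li>No link here</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, features="html.parser")
container = soup.find(id="demo-list")

first_li = container.find("li")    # a single Tag, or None if nothing matches
all_li = container.find_all("li")  # a list-like ResultSet, possibly empty

hrefs = []
for li in all_li:
    a = li.find("a")
    if a is None:                  # find returns None when there is no match
        continue
    hrefs.append(a.get("href"))

print(first_li.get_text())
print(len(all_li), hrefs)
```

Because `find` returns `None` on a miss, chained calls such as `soup.find(id=...).find('li')` raise `AttributeError` when the id is absent, so checking the intermediate result is worthwhile on pages whose layout may change.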


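    4. Simple example
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    response = requests.get(
        url='https://www.autohome.com.cn/news/'
    )
    response.encoding = response.apparent_encoding
    print(response.status_code)
    soup = BeautifulSoup(response.text, features='html.parser')  # html.parser ships with Python; lxml performs better and suits production use

    # Find the container by id, then the li tags inside it
    target = soup.find(id="auto-channel-lazyload-article").find('li')  # the first li
    li_list = soup.find(id="auto-channel-lazyload-article").find_all('li')  # all li elements

    for index, li in enumerate(li_list):
        a = li.find('a')  # find the a tag
        if not a:
            continue  # skip li elements without a link
        print(a.attrs)
        print(a.attrs.get('href'))

        img = li.find('img')
        if not img or not img.get('src'):
            continue  # skip entries without an image
        img_url = urljoin(response.url, img.get('src'))  # src may be relative or protocol-relative
        res = requests.get(img_url)
        title = index  # assumption: name each file by its position in the list
        file_name = "%s.jpg" % (title,)
        with open(file_name, 'wb') as f:
            f.write(res.content)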
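The download step above needs an absolute image URL and a file name. A minimal sketch using only the standard library, assuming the `src` values may be relative or protocol-relative (starting with `//`); the helper names `absolute_url` and `image_file_name` and the example.com address are hypothetical:

```python
import os
from urllib.parse import urljoin, urlparse


def absolute_url(page_url, src):
    """Resolve a src attribute (relative or protocol-relative) against the page URL."""
    return urljoin(page_url, src)


def image_file_name(img_url, fallback="image.jpg"):
    """Use the last path segment of the URL as the file name, or the fallback if it is empty."""
    name = os.path.basename(urlparse(img_url).path)
    return name or fallback


page = "https://www.autohome.com.cn/news/"
url = absolute_url(page, "//img.example.com/pics/car_01.jpg")
print(url)
print(image_file_name(url))
```

Taking the name from the URL path keeps the original file extension and avoids characters from article titles that are not valid in file names.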


  • Original article: https://www.cnblogs.com/yoyo008/p/9284051.html