
Python web scraping with BeautifulSoup

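The two BeautifulSoup methods used most often are findAll and find. It helps to be clear on how they relate (find returns only the first match, findAll returns every match) and how each is used.
There is also the .children attribute, which iterates over a tag's direct children: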

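from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
# Name the parser explicitly so bs4 does not have to guess one (and warn)
bsObj = BeautifulSoup(html, "html.parser")
# .children yields only the direct children of the table
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)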
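To see what .children actually yields without hitting the network, here is a minimal sketch on an inline table. SAMPLE_HTML is a made-up stand-in for the giftList table, not the real page content. It also contrasts .children with .descendants, which walks the whole subtree.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for the giftList table on the real page.
SAMPLE_HTML = """
<table id="giftList">
<tr><th>Item</th><th>Cost</th></tr>
<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children yields direct children only; the newlines between rows
# show up as text nodes alongside the <tr> tags.
children = list(table.children)

# Keep only real tags, dropping the whitespace strings.
child_tags = [c for c in children if isinstance(c, Tag)]

# .descendants goes deeper: rows, cells, and the text inside the cells.
descendants = list(table.descendants)

print([t.name for t in child_tags])
print(len(children), len(descendants))
```

If the loop over .children prints blank lines, those are the whitespace text nodes; filtering with isinstance(c, Tag) is one way to skip them.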

Sibling tags: next_siblings (an attribute, so it is used without parentheses)

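from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
# Start from the first row (.tr) and walk every node that follows it;
# the first row itself is not included
for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)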
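The same idea can be tried offline. The sketch below uses an invented table (SAMPLE_HTML and the gift1/gift2 ids are assumptions for illustration) to show that next_siblings skips the starting row and that find_next_siblings("tr") returns only the tag siblings.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for the giftList table on the real page.
SAMPLE_HTML = """
<table id="giftList">
<tr><th>Item</th><th>Cost</th></tr>
<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
header_row = soup.find("table", {"id": "giftList"}).tr

# next_siblings yields everything after the header row at the same level,
# including whitespace strings, so filter down to tags.
rows_after_header = [s for s in header_row.next_siblings if isinstance(s, Tag)]
row_ids = [row.get("id") for row in rows_after_header]

# find_next_siblings("tr") does the tag filtering itself.
same_rows = header_row.find_next_siblings("tr")

print(row_ids)
```

Starting from the header row is a convenient way to iterate over data rows only, since the starting node is never part of its own siblings.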



Here the methods above are used to find each div with class=pl2 and read the title attribute of the a tag inside it:

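# coding=utf-8
# Python 3 version, consistent with the examples above
# (the original used Python 2's urllib2 and the print statement)
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://book.douban.com/top250?start=0")
bsObj = BeautifulSoup(html, "html.parser")

# One div.pl2 per book; find() returns the first <a> inside it
for link in bsObj.findAll("div", attrs={"class": "pl2"}):
    name = link.find("a")
    print(name.get('title'))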

If you change it to

for link in bsObj.findAll("div", attrs={"class": "pl2"}):
    # findAll() returns a list of every <a>; index 0 is the one find() returns
    name = link.findAll("a")
    print(name[0].get('title'))
the result is the same.
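The equivalence can be checked on a small inline document. This is a sketch under stated assumptions: SAMPLE_HTML and its example.com links are invented to imitate the div.pl2 structure, and the third div is deliberately empty to show where the two methods differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the div.pl2 blocks; not the real page.
SAMPLE_HTML = """
<div class="pl2"><a href="https://example.com/book/1" title="Book One">Book One</a></div>
<div class="pl2"><a href="https://example.com/book/2" title="Book Two">Book Two</a>
<a href="https://example.com/book/2/reviews">reviews</a></div>
<div class="pl2"></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

titles = []
same_first_match = []
for div in soup.findAll("div", attrs={"class": "pl2"}):
    first = div.find("a")        # a single tag, or None when nothing matches
    every = div.findAll("a")     # always a list, possibly empty
    if first is None:
        # Indexing every[0] here would raise IndexError instead.
        continue
    same_first_match.append(first is every[0])
    titles.append(first.get("title"))

print(titles)
```

The practical difference is in the no-match case: find returns None, while findAll returns an empty list, so name[0] fails with an IndexError and name.get fails with an AttributeError unless the result is checked first.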

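You can also get the text content of the a tag with name.text.
Other attributes are read the same way as title, for example name.get('href'); name.attrs returns all of the tag's attributes as a dict.
BeautifulSoup has no .val accessor (that is jQuery's); use .get() or name['attr'] instead.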
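A short sketch of the attribute accessors on a single link. The markup and the example.com URL are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the URL and title are placeholders.
SAMPLE_HTML = (
    '<div class="pl2">'
    '<a href="https://example.com/book/1" title="Book One"> Book One </a>'
    '</div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
name = soup.find("div", attrs={"class": "pl2"}).find("a")

text = name.text.strip()      # text inside the tag, whitespace trimmed
href = name.get("href")       # attribute value, or None if absent
title = name["title"]         # dict-style access; raises KeyError if absent
missing = name.get("id")      # no id attribute on this tag, so None
all_attrs = name.attrs        # every attribute as a plain dict

print(text, href, title, missing, all_attrs)
```

Using .get() is the safer choice when an attribute may be missing, since it returns None rather than raising.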


Original article: https://www.cnblogs.com/jinjidedale/p/6040368.html