Python Web - Week 4 - Programs that Surf the Web (Using Python to Access Web Data)

1.Understanding HTML

1.The simplest crawler

import urllib

# Open the page and print each line with surrounding whitespace removed
fhand = urllib.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print line.strip()

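The crawler above is Python 2. As a rough sketch of the same idea in Python 3 (an assumption; the post itself uses Python 2), `urlopen` lives in `urllib.request` and yields bytes, so each line is decoded before it is stripped. The helper names `stripped_lines` and `fetch_lines` and the sample HTML are made up for illustration, and the demonstration runs on an in-memory stand-in rather than the real page.

```python
import io
import urllib.request  # Python 3 home of urlopen (Python 2 used urllib.urlopen)


def stripped_lines(fhand, encoding="utf-8"):
    """Yield each line of a binary file-like object as stripped text."""
    for raw_line in fhand:
        yield raw_line.decode(encoding, errors="replace").strip()


def fetch_lines(url):
    """Open url over the network and return its stripped lines."""
    with urllib.request.urlopen(url) as fhand:
        return list(stripped_lines(fhand))


# Offline demonstration: an in-memory stand-in for the HTTP response.
# The HTML here is made-up sample data, not the content of the real page.
sample_response = io.BytesIO(b"<h1>Sample Page</h1>\n   <p>Hello</p>   \n")
lines = list(stripped_lines(sample_response))
for line in lines:
    print(line)
```

To fetch the real page, call `fetch_lines('http://www.dr-chuck.com/page1.htm')`, which needs network access.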
2.Crawling a page with Python versus visiting it directly in a browser

3.Scrape

2.Parsing HTML with BeautifulSoup

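1.This time we take the easy route: BeautifulSoup

2.Installing BeautifulSoup

1.Download it from http://www.crummy.com/software/BeautifulSoup/#Download

2.Unzip the downloaded file and copy it into the C:\Python27 directory

3.In CMD, cd into that directory and run python setup.py install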

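3.First try with BeautifulSoup (also a first try with a third-party Python library)

import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags and print each href
tags = soup('a')
for tag in tags:
    print tag.get('href', None)

Notes:

1.BeautifulSoup takes a second argument after the page content, here "html.parser", which names the parser to use

2.How BS is imported: from bs4 import BeautifulSoup

More BS tutorials: http://cuiqingcai.com/1319.html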
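The "html.parser" argument in the code above selects the parser from Python's standard library. To get a feel for what that parser does on its own, here is a minimal sketch that collects href values without BeautifulSoup. It is written for Python 3 (an assumption; the post uses Python 2), the class name `LinkCollector` and the sample HTML are invented for illustration, and it is not a description of how BeautifulSoup works internally.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, or None when an <a> has no href."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


# Made-up sample HTML; the example.com URL is a placeholder
sample_html = (
    '<p><a href="http://example.com/page2.htm">Second Page</a> '
    '<a name="top">no href here</a></p>'
)
collector = LinkCollector()
collector.feed(sample_html)
collector.close()
print(collector.hrefs)
```

As with tag.get('href', None), an anchor without an href shows up as None. BeautifulSoup remains the more convenient choice, since it builds a searchable tree instead of requiring a handler per event.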

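4.raw_input() versus input()

raw_input() reads the console input directly and returns it as a string (it accepts any kind of input).

input(), by contrast, expects to read a valid Python expression,

so when you enter a string you must wrap it in quotes; otherwise it raises an error (a NameError for a bare word, a SyntaxError for most other unquoted text).

Unless you have a special need, prefer raw_input().

input() accepts any valid Python expression: typing 1 + 3 at the prompt makes input() return the int 4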
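Python 2's input() behaves like eval(raw_input()), so the difference can be sketched without a console by evaluating the typed text directly. The snippet below is written for Python 3 (an assumption; the post uses Python 2), and `py2_style_input` is a made-up helper name that stands in for Python 2's input().

```python
def py2_style_input(typed_text):
    """Mimic Python 2's input(): evaluate the typed text as an expression."""
    return eval(typed_text)


results = {}
for typed in ["1 + 3", "'hello'", "hello", "hello world"]:
    try:
        value = py2_style_input(typed)
        # raw_input() would have returned the text itself, always as a str
        results[typed] = (type(value).__name__, value)
    except (NameError, SyntaxError) as err:
        results[typed] = type(err).__name__
    print(repr(typed), "->", results[typed])
```

Because the typed text is evaluated as code, Python 2's input() is also unsafe on untrusted input, which is one more reason to prefer raw_input().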

5.Advanced use of BS (homework 1)

http://python-data.dr-chuck.net/comments_222777.html

Sum the comments in the page at the URL above

import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Find every span whose class is "comments"
sc = soup.select('span[class="comments"]')
Sum = 0
Count = 0
for span in sc:
    # print 'span', span
    # print 'Attr:', span.attrs
    # print 'Contents:', span.contents[0]
    Sum += int(span.contents[0])  # extract the text inside the span
    Count += 1
print 'Count:', Count
print 'Sum:', Sum

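To check the counting logic without network access or BeautifulSoup, the same sum can be sketched with the standard library parser on a small inline table. This is Python 3 (an assumption; the post uses Python 2). The class name `CommentSummer`, the names, and the numbers in the sample are made up, and the row layout is only a guess at what such a page might look like.

```python
from html.parser import HTMLParser


class CommentSummer(HTMLParser):
    """Count and sum the integer text inside <span class="comments">."""

    def __init__(self):
        super().__init__()
        self.buffer = None  # list of text chunks while inside a matching span
        self.count = 0
        self.total = 0

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "comments":
            self.buffer = []

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.buffer is not None:
            self.total += int("".join(self.buffer))
            self.count += 1
            self.buffer = None


# Made-up sample data standing in for the assignment page
sample_html = """
<table>
<tr><td>Alice</td><td><span class="comments">97</span></td></tr>
<tr><td>Bob</td><td><span class="comments">90</span></td></tr>
<tr><td>Carol</td><td><span class="comments">3</span></td></tr>
</table>
"""
summer = CommentSummer()
summer.feed(sample_html)
summer.close()
print("Count:", summer.count)
print("Sum:", summer.total)
```

The BeautifulSoup version is shorter because select() does the matching and span.contents[0] hands back the text directly; the sketch only makes the bookkeeping explicit.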
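PS:

After switching from Python 3 to Python 2, a "Non-ASCII character" error appeared, because Python 2 assumes ASCII source files unless an encoding is declared.

Add this as the first line of the source file:
#coding:utf-8
or alternatively:
#-*- coding: UTF-8 -*-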

Original article: https://www.cnblogs.com/moonache/p/5112088.html
