
Data Collection and Web Crawlers: Data Parsing

1. Extracting data from a web page with string searching

# Extract every city name on the page using plain string searching
import requests

url = "http://www.yb21.cn/post/"
response = requests.get(url)
response.encoding = "GBK"  # this site serves its pages in GBK
html = response.text
'''
    <a href="/post/city/1301.html"><strong>石家庄市</strong></a>
    <a href="/post/city/1302.html"><strong>唐山市</strong></a>
'''
temp = html
str_begin = "<strong>"
str_end = "</strong>"
list_city = []
while True:
    pos_begin = temp.find(str_begin)  # position of <strong>
    if pos_begin == -1:
        break
    pos_end = temp.find(str_end, pos_begin)  # position of the matching </strong>
    if pos_end == -1:
        break  # unclosed tag: stop instead of slicing with -1
    city = temp[pos_begin + len(str_begin):pos_end]  # text between <strong> and </strong>
    list_city.append(city)  # add it to the list
    temp = temp[pos_end + len(str_end):]  # the next iteration searches after </strong>

# Cleaning: drop every '辖区' and '辖县' entry.
# Build a new list rather than calling remove() while iterating, which can skip items.
list_remove = ['辖区', '辖县']
list_city = [city for city in list_city if city not in list_remove]
print(list_city)
print(len(list_city))  # 362
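
The search loop above can also be packaged as a reusable function. The following is a minimal sketch, not part of the original listing: the name `extract_between` and the inline `sample_html` string are assumptions made for illustration, so the example runs without a network request. It relies on the optional start argument of `str.find` instead of repeatedly slicing the page.

```python
def extract_between(text, begin, end):
    """Return every substring of text that sits between a begin and an end marker."""
    results = []
    pos = 0  # where the next search starts
    while True:
        start = text.find(begin, pos)
        if start == -1:
            break  # no further opening marker
        start += len(begin)
        stop = text.find(end, start)
        if stop == -1:
            break  # opening marker without a closing one
        results.append(text[start:stop])
        pos = stop + len(end)  # continue after the closing marker
    return results


# Hypothetical sample input, copied from the snippet shown in the listing
sample_html = '''
    <a href="/post/city/1301.html"><strong>石家庄市</strong></a>
    <a href="/post/city/1302.html"><strong>唐山市</strong></a>
'''
cities = extract_between(sample_html, "<strong>", "</strong>")
print(cities)
```

Passing a start offset to `find` avoids copying the remaining page text on every iteration, which matters little here but keeps the loop simple.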

2. Extracting data from a web page with regular expressions

Example 1:

# Extract every city name on the page using a regular expression
import requests
import re  # Python's regular expression library

url = "http://www.yb21.cn/post/"
response = requests.get(url)
response.encoding = "GBK"
html = response.text
'''
    <a href="/post/city/1301.html"><strong>石家庄市</strong></a>
    <a href="/post/city/1302.html"><strong>唐山市</strong></a>
'''
list_city = re.findall("<strong>(.+?)</strong>", html)
# Note: the parentheses mark the part to capture; the ? after + makes the match
# non-greedy, so it consumes as few characters as possible.
# Cleaning: build a new list rather than calling remove() while iterating.
list_remove = ['辖区', '辖县']
list_city = [city for city in list_city if city not in list_remove]
print(list_city)
print(len(list_city))  # 362
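
To see why the non-greedy `?` matters, it may help to compare both forms on a single line that contains two tags. This is a small sketch on a made-up input string (`line` is an assumption, not taken from the real page):

```python
import re

# Hypothetical one-line input with two <strong> elements
line = "<strong>石家庄市</strong> <strong>唐山市</strong>"

# Greedy: .+ runs to the LAST </strong> on the line, swallowing the tags in between
greedy = re.findall("<strong>(.+)</strong>", line)

# Non-greedy: .+? stops at the FIRST </strong> it can reach
lazy = re.findall("<strong>(.+?)</strong>", line)

print(greedy)  # ['石家庄市</strong> <strong>唐山市']
print(lazy)    # ['石家庄市', '唐山市']
```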

Conclusion: the string-search approach is fairly tedious, while the regular-expression approach is comparatively simple.
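
Both city listings end with a cleaning step that drops the placeholder entries '辖区' and '辖县'. One thing worth knowing about that step: calling `remove()` on a list while a `for` loop is iterating over the same list can skip elements, because the remaining items shift left under the loop's index. The sketch below uses a made-up `sample` list (an assumption for illustration) with two adjacent placeholders to show the difference from filtering into a new list.

```python
placeholders = ['辖区', '辖县']
# Hypothetical input with two adjacent '辖区' entries
sample = ['辖区', '辖区', '石家庄市', '辖县', '唐山市']

# Variant 1: remove() inside the loop over the same list
looped = list(sample)
for name in placeholders:
    for city in looped:
        if city == name:
            looped.remove(city)  # shifts later items left; the next one is skipped

# Variant 2: build a new list that keeps only the wanted entries
filtered = [city for city in sample if city not in placeholders]

print(looped)    # ['辖区', '石家庄市', '唐山市']  (one placeholder survives)
print(filtered)  # ['石家庄市', '唐山市']
```

When placeholders never appear next to each other, both variants happen to give the same result, so the problem can go unnoticed.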

Example 2:

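# Extract every school (secondary college) on the page using a regular expression
import requests
import re  # Python's regular expression library

# 1. Fetch the HTML response
url = "https://www.whit.edu.cn/jgsz.htm"
response = requests.get(url)
response.encoding = "UTF-8"
html = response.text

# 2. Narrow the search range: only look inside the div with id="jx"
str_begin = 'id="jx"'
str_end = "</ul>"
pos_begin = html.find(str_begin)
temp = html[pos_begin + len(str_begin):]
pos_end = temp.find(str_end)
temp = temp[:pos_end]

'''
    <a href="https://jxgc.whit.edu.cn/" target="_blank" onclick="_addDynClicks(&#34;wburl&#34;, 1655460640, 66257)">机械工程学院</a>
'''
# 3. Search with a regular expression
list_department = re.findall(r"<a href=.*\)\">(.+?)</a>", temp)
# Note: \) matches a literal closing parenthesis, because parentheses are special
# characters in a regular expression; \" keeps the double quote from ending the
# Python string literal (inside the regex it simply matches a double quote).
print(list_department)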
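
The escaping in step 3 can be checked offline. The sketch below runs against the sample anchor line from the listing rather than the live page. Two choices here are assumptions, not the original author's code: the pattern is written in a single-quoted raw string so the double quote needs no backslash, and `[^>]*` replaces `.*` so the match cannot run past the end of the opening tag.

```python
import re

# Sample anchor copied from the listing above
sample = '''
    <a href="https://jxgc.whit.edu.cn/" target="_blank" onclick="_addDynClicks(&#34;wburl&#34;, 1655460640, 66257)">机械工程学院</a>
'''

# \) is a literal ")"; the double quote is an ordinary character in a regex
pattern = r'<a href=[^>]*\)">(.+?)</a>'
names = re.findall(pattern, sample)
print(names)  # ['机械工程学院']

# Without the backslash, ")" is read as the end of a group that was never opened
try:
    re.compile('<a href=.*)">(.+?)</a>')
    unescaped_ok = True
except re.error as exc:
    unescaped_ok = False
    print("invalid pattern:", exc)
```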


Original article: https://www.cnblogs.com/beast-king/p/14526698.html