
Scraping email addresses from web pages with Python 3

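1. Crawler analysis

The analysis showed that the data to scrape is returned by URLs of the following form, which differ only in the zero-padded gesnum parameter:
http://xxx.com?method=getrequest&gesnum=00000001
http://xxx.com?method=getrequest&gesnum=00000002
http://xxx.com?method=getrequest&gesnum=00000003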

The returned JSON contains a lone escape character "\" that Python 3 could not parse, so calling .json() on the response failed:
req = requests.get(url=url, headers=headers, verify=False, timeout=60).json()

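The response body is therefore read as raw bytes (req.content), decoded as UTF-8, and handled as a plain string. Note that json.dumps does not parse anything here; it only serializes the decoded text as a JSON string literal, which is enough for the regular expression search that follows.
req = requests.get(url=url, headers=headers, verify=False, allow_redirects=False, timeout=60)
data = json.dumps(bytes.decode(req.content, 'UTF-8'))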
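
The parsing failure described in this section can be reproduced without any network access. The sketch below is one possible workaround, not the approach used in this post: it doubles every backslash that does not start a valid JSON escape, so that json.loads accepts the text. The helper name escape_lone_backslashes and the sample payload are assumptions made for illustration.

```python
import json
import re

# Characters that may legally follow a backslash inside a JSON string.
_VALID_ESCAPES = set('"\\/bfnrtu')


def escape_lone_backslashes(text):
    """Double every backslash that does not begin a valid JSON escape.

    Hypothetical helper. It does not validate the four hex digits
    after a \\u escape, so a malformed \\u sequence would still fail.
    """
    def fix(match):
        follower = match.group(1)
        if follower and follower in _VALID_ESCAPES:
            return match.group(0)      # valid escape: keep unchanged
        return "\\\\" + follower       # lone backslash: double it

    # Consume each backslash together with the character after it, so an
    # already escaped backslash ("\\\\") is treated as one unit.
    return re.sub(r"\\(.?)", fix, text, flags=re.DOTALL)


# Made-up payload containing the invalid escape sequence \_
raw = '{"name": "a\\_b"}'

try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    print("strict parse failed:", exc)

parsed = json.loads(escape_lone_backslashes(raw))
print(parsed["name"])
```

Whether this is preferable to searching the raw text depends on whether the other JSON fields are needed; for pulling out email addresses alone, a regular expression over the decoded text is sufficient.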

2. Writing the Python 3 crawler

#!/usr/bin/python3
#-*- coding:utf-8 -*-

# Environment: Windows 7 x64, Notepad++, Python 3.5.0

import urllib3
urllib3.disable_warnings()
import sys
import requests
import re
import json

cookie = '''JSESSIONID=1B7407076DE01727BC48DCD56FF9BA70; entsoft=entsoft; JSESSIONID=4877B5AC1DF6307E90CF1641D3863A6C; radId=45991FBF-0BC4-3BA4-08E2-00072022FB2C'''

headers ={
    'Accept': 'application/json, text/plain, */*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': cookie,
}

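# Write 00000001-00000300 to num.txt, one number per line
def getNum():
    # Raw string, so the backslashes in the Windows path are not treated as escapes
    filename = r'C:\Users\Administrator\Desktop\脚本\num.txt'
    with open(filename, 'w') as file:
        for i in range(1, 301):  # range() excludes the end value
            file.write(("%08d" % i) + '\n')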
    
    
def main():
    #url ='http://xxx.com?method=getrequest&gesnum=00000001'
    
    getNum()
    
    filename = r'C:\Users\Administrator\Desktop\脚本\num.txt'
    with open(filename,'r') as file:
        for line in file:
            # Strip the trailing newline so it does not end up in the URL
            url ='http://xxx.com?method=getrequest&gesnum={line}'.format(line=line.strip())
            #print(url)
            
            #req =requests.get(url=url,headers=headers,verify=False,timeout=60).json()
            # Problem: the JSON data contains a lone escape character "\" that .json() could not parse, so the approach below is used instead
            req =requests.get(url=url,headers=headers,verify=False,allow_redirects=False,timeout=60)
            
            # json.dumps serializes the decoded text as a JSON string
            #print(req.content)
            # response.text returns the body as Unicode text
            # response.content returns the body as raw bytes
            # Reading the Unicode text raised an error, so the raw bytes are decoded explicitly
            data= json.dumps(bytes.decode(req.content,'UTF-8'))
            #print(data)
            
            # Regular expression that matches an email address
            emailRegex = r"[-_\w\.]{0,64}@([-\w]{1,63}\.)*[-\w]{1,63}"
            email = re.search(emailRegex,data)
            
            print(email)
       
if __name__ == '__main__':
    main()
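
Writing the sequence numbers to num.txt and reading them back is one way to drive the request loop. A possible alternative is to build the URLs in memory. The following is a minimal sketch under that assumption; build_urls and its parameters are hypothetical names that do not appear in the original script, and the host is the same placeholder used throughout this post.

```python
def build_urls(first=1, last=300, base="http://xxx.com"):
    """Return the request URLs for sequence numbers first..last inclusive.

    Hypothetical helper: the number is zero-padded to 8 digits, matching
    the gesnum values shown in the analysis section.
    """
    template = "{base}?method=getrequest&gesnum={num:08d}"
    return [template.format(base=base, num=i) for i in range(first, last + 1)]


urls = build_urls()
print(urls[0])
print(len(urls))
```

Because the URLs never pass through a text file, there is no trailing newline to strip before the request is sent.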

3. The output looks like this:

<_sre.SRE_Match object; span=(158, 184), match='xxxx@hotmail.com'>
<_sre.SRE_Match object; span=(145, 170), match='xxxx@nordictelecom.net'>
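
The lines above are the printed form of match objects, because re.search returns a match object and not the matched text. As a hedged alternative to printing the object, the sketch below calls group(0) to get the address itself and uses finditer so that every address in a response is collected. The function name extract_emails and the sample text are assumptions for illustration; the pattern is the one used by the crawler.

```python
import re

# Same pattern as in the crawler, compiled once for reuse.
EMAIL_REGEX = re.compile(r"[-_\w\.]{0,64}@([-\w]{1,63}\.)*[-\w]{1,63}")


def extract_emails(text):
    """Return every substring of text that matches the email pattern."""
    # group(0) is the full matched text, without the match-object wrapper.
    return [match.group(0) for match in EMAIL_REGEX.finditer(text)]


# Made-up response text using the placeholder addresses from this post.
sample = '{"mail": "xxxx@hotmail.com", "backup": "xxxx@nordictelecom.net"}'
for address in extract_emails(sample):
    print(address)
```

With this approach the output file already contains bare addresses, so no further clean-up of the printed form is needed.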

4. Post-processing the output:

#!/usr/bin/python3
#-*- coding:utf-8 -*-

# Environment: Windows 7 x64, Notepad++, Python 3.5.0
def main():
    
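    # Raw strings, so the backslashes in the Windows paths are not treated as escapes
    filename = r"C:\Users\Administrator\Desktop\脚本\email_handle.txt"
    filename1 = r"C:\Users\Administrator\Desktop\脚本\email_handle_handle.txt"

    # Both files are closed automatically when the with block ends
    with open(filename,'r') as file, open(filename1,'w') as file1:
        for line in file:
            # Drop the 48-character prefix "<_sre.SRE_Match object; span=(158, 184), match='"
            # (this offset assumes both span values have three digits)
            data=line[48:]
            print(data)
            file1.write(data)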
   

if __name__ == '__main__':
    main()

5. After processing, the addresses look like this; to finish, use find and replace in the text file to replace '> with nothing:

xxxx@hotmail.com'>
xxxx@nordictelecom.net'>
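
The fixed offset of 48 characters and the manual find-and-replace both depend on the exact printed form of the match object. That form varies: the offset changes when the span values have a different number of digits, and Python 3.7 and later print re.Match in place of _sre.SRE_Match. A minimal sketch of a more tolerant approach is shown below; address_from_repr is a hypothetical helper, not part of the original scripts.

```python
import re

# Capture whatever sits between match=' and the closing '> of the line.
MATCH_REPR = re.compile(r"match='([^']*)'>")


def address_from_repr(line):
    """Return the address inside a printed match object, or None."""
    found = MATCH_REPR.search(line)
    return found.group(1) if found else None


lines = [
    "<_sre.SRE_Match object; span=(158, 184), match='xxxx@hotmail.com'>",
    "<re.Match object; span=(5, 27), match='xxxx@nordictelecom.net'>",
    "None",
]
for line in lines:
    print(address_from_repr(line))
```

Lines where the crawler found no address are printed as None by the original script, and the helper returns None for those so they can be skipped.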

6. References

Two ways to use cookies in a Python crawler
https://blog.csdn.net/weixin_38706928/article/details/80376572
Python 3 text encoding problems such as UnicodeDecodeError/UnicodeEncodeError: 'gbk' codec can't decode/encode bytes
https://www.cnblogs.com/worstprogrammer/p/5189758.html
Simulating a login with Python (using the requests library)
https://blog.csdn.net/majianfei1023/article/details/49927969
Certificate verification and disabling warnings in Python's urllib3 package
https://blog.csdn.net/taiyangdao/article/details/72825735
Online JSON parser, formatter, and validator
https://www.json.cn/

Original article: https://www.cnblogs.com/wmiot/p/11409738.html