  • Scraping the top 10 boy and girl names and related information from https://www.parenting.com/babynames/boys/earl

    The scraper source code is as follows:

    import io
    import re
    import sys

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    # Re-wrap stdout so the results print on a GB18030 (Chinese Windows) console.
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

    # URL prefixes of the boy-name and girl-name list pages we want to follow.
    PREFIXES = (
        'https://www.parenting.com/pregnancy/baby-names/baby-boy-names/',
        'https://www.parenting.com/pregnancy/baby-names/girl-baby-names/',
    )

    # Raw strings keep "\s" a regex escape instead of a Python string escape.
    origin_re = re.compile(r'<p><strong>Origin:</strong>\s(.*?)</p>', re.S)
    meaning_re = re.compile(r'<p><strong>Meaning:</strong>\s(.*?)</p>', re.S)
    why_re = re.compile(r'<p><strong>Why it’s big:</strong>\s(.*?)</p>', re.S)

    # Step 1: collect the links to the name list pages from the start page.
    r = requests.get('https://www.parenting.com/baby-names/boys/earl')
    soup = BeautifulSoup(r.text, "lxml")
    links = []
    for a in soup.find_all('a', href=True):
        href = a.get("href")
        if any(prefix in href for prefix in PREFIXES):
            links.append(href)

    names = []
    origins = []
    meanings = []
    descriptions = []

    # Step 2: visit each distinct list page and extract the fields.
    for url in set(links):
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")
        # Only the elements with class "description" hold the fields we need.
        source = str(soup.find_all(attrs={'class': 'description'}))

        # Names are the non-empty <h4> headings of the page.
        for name in re.findall('<h4>(.*?)</h4>', r.text):
            if name != '':
                names.append(name)
        origins += origin_re.findall(source)
        meanings += meaning_re.findall(source)
        descriptions += why_re.findall(source)

    print(names)
    print(meanings)
    print(origins)
    print(descriptions)

    # Step 3: write the results to CSV; all four lists must have equal length.
    data = {
        'EnName': names,
        'Meaning': meanings,
        'Origin': origins,
        'Description': descriptions
    }
    frame = pd.DataFrame(data)
    frame.to_csv('wt10.csv', encoding="gb18030")
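The script above gathers names, origins, meanings and descriptions into four separate lists, so a single entry that lacks one field would shift every later value into the wrong row or make the lists differ in length. A minimal sketch of a row-by-row alternative follows, using only the standard library. It assumes that each `<h4>` name is followed by its own `<p><strong>...</strong>` fields, which the original page may or may not guarantee. `SAMPLE_HTML`, `FIELDS` and `parse_entries` are hypothetical names, and the sample values are placeholders, not data from the site.

```python
import re

# Hypothetical markup modeled on the patterns the scraper searches for;
# the values are placeholders, not real data from the site.
SAMPLE_HTML = """
<div class="description">
<h4>Alpha</h4>
<p><strong>Origin:</strong> placeholder origin A</p>
<p><strong>Meaning:</strong> placeholder meaning A</p>
</div>
<div class="description">
<h4>Beta</h4>
<p><strong>Origin:</strong> placeholder origin B</p>
<p><strong>Meaning:</strong> placeholder meaning B</p>
<p><strong>Why it’s big:</strong> placeholder reason B</p>
</div>
"""

# CSV column name -> label used in the page markup.
FIELDS = {
    "Origin": "Origin",
    "Meaning": "Meaning",
    "Description": "Why it’s big",
}


def parse_entries(html):
    """Return one dict per name; a missing field becomes an empty string."""
    rows = []
    # Split at each <h4> so every chunk holds one name and its own fields.
    for chunk in re.split(r"<h4>", html)[1:]:
        name, _, rest = chunk.partition("</h4>")
        name = name.strip()
        if not name:
            continue  # skip empty headings, as the original script does
        row = {"EnName": name}
        for column, label in FIELDS.items():
            pattern = r"<p><strong>" + re.escape(label) + r":</strong>\s*(.*?)</p>"
            match = re.search(pattern, rest, re.S)
            row[column] = match.group(1).strip() if match else ""
        rows.append(row)
    return rows


for row in parse_entries(SAMPLE_HTML):
    print(row)
```

A list of dicts like this can be passed directly to `pd.DataFrame(rows)`, which keeps each name aligned with its own fields.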
    Screenshot of the CSV file:
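To check a file like this without opening it in a spreadsheet, it can be read back with the same `gb18030` encoding it was written with. The sketch below is only an illustration under stated assumptions: it uses the standard `csv` module rather than pandas, writes a temporary file instead of `wt10.csv`, and `write_rows`, `read_rows`, `round_trip` and the sample row are hypothetical.

```python
import csv
import os
import tempfile

COLUMNS = ["EnName", "Meaning", "Origin", "Description"]


def write_rows(path, rows):
    """Write the rows as CSV, encoded in GB18030 like the script's output."""
    with open(path, "w", newline="", encoding="gb18030") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read the CSV back; the encoding must match the one used for writing."""
    with open(path, newline="", encoding="gb18030") as f:
        return [dict(row) for row in csv.DictReader(f)]


def round_trip(rows):
    """Write the rows to a temporary file and return what is read back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.csv")
        write_rows(path, rows)
        return read_rows(path)


# Placeholder row with non-ASCII text, including the typographic apostrophe.
sample = [{
    "EnName": "Alpha",
    "Meaning": "示例含义",
    "Origin": "placeholder origin",
    "Description": "it’s a placeholder",
}]
print(round_trip(sample) == sample)
```

Note that `frame.to_csv` also writes the DataFrame index as an unnamed first column, so rows read from `wt10.csv` this way would carry one extra key, the empty string.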
     
     
     
     
  • Original article: https://www.cnblogs.com/c1q2s3/p/12078047.html