  • Lyrics crawler

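    I am building a conversational chat system, which needs a large corpus, so I decided to try song lyrics as training data. I wrote my own crawler and scraped the lyrics of roughly 230,000 songs.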

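    I used these lyrics as question-answer pairs and trained an LSTM-QA model for question-answer matching. After many experiments it reached a fairly good result: it can basically hold a normal conversation with you.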
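
The post does not say how the question-answer pairs were built from the lyrics. A minimal sketch of one plausible scheme follows, written in Python 3 with only the standard library. It assumes the downloaded files are in the common LRC format (`[mm:ss.xx]text` lines plus metadata tags such as `[ti:...]`), and it pairs each lyric line with the line after it. The names `TIME_TAG`, `lrc_to_lines` and `make_qa_pairs` are hypothetical and do not come from the original code.

```python
import re

# Assumed LRC time tag, e.g. [01:23.45] or [01:23]. Metadata tags such as
# [ti:Title] contain no leading digits and therefore do not match.
TIME_TAG = re.compile(r'\[\d{1,3}:\d{2}(?:[.:]\d{1,3})?\]')


def lrc_to_lines(lrc_text):
    """Return the lyric lines of an LRC document without time tags."""
    lines = []
    for raw in lrc_text.splitlines():
        if not TIME_TAG.search(raw):
            continue  # skip metadata and untimed lines
        text = TIME_TAG.sub('', raw).strip()
        if text:  # drop timed lines that carry no text
            lines.append(text)
    return lines


def make_qa_pairs(lines):
    """Pair each line with its successor: (question, answer). Hypothetical scheme."""
    return list(zip(lines, lines[1:]))
```

Calling `make_qa_pairs(lrc_to_lines(text))` on the contents of one file gives a list of `(question, answer)` tuples. A song with fewer than two lyric lines gives an empty list.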

    # Python 2 code; uses the legacy BeautifulSoup 3 package.
    import re
    import urllib
    from BeautifulSoup import BeautifulSoup
    
    
    url = u'http://www.lrcgc.com/'
    def find_singers():
        # Collect [singer name, song-list href] pairs from the artist index page.
        singers_list = []
        response = urllib.urlopen('http://www.lrcgc.com/artist-00.html')
        data = response.read()
        soup = BeautifulSoup(data)
        # The dot is escaped so that it matches a literal '.' before 'html'.
        links = soup.findAll('a', href = re.compile(r'songlist.*\.html'))
        for link in links:
            s = link.text
            l = link['href']
            singers_list.append([s, l])
        return singers_list
    
    def find_songs(singer):
        # Visit every song-list page of one singer and collect the lyric-page hrefs.
        singer_name, urls_0 = singer[0], singer[1]
        songs_href = []         # lyric-page hrefs found so far
        songs_list = [urls_0]   # song-list pages still to visit
        song_list_old = []      # song-list pages already visited
    
        while len(songs_list) > 0:
            url_i = songs_list.pop()
            song_list_old.append(url_i)
            response = urllib.urlopen(url+url_i)
            data = response.read()
            soup = BeautifulSoup(data)
            # Queue any further song-list (pagination) pages not seen before.
            songs_list_links = soup.findAll('a', href = re.compile(r'songlist.*\.html'))
            for link in songs_list_links:
                if link['href'] not in song_list_old:
                    if link['href'] not in songs_list:
                        songs_list.append(link['href'])
    
            songs_href_list = soup.findAll('a', href = re.compile(r'lyric-.*\.html'))
            for link in songs_href_list:
                songs_href.append(link['href'])
    
        # Remove duplicates before returning.
        return list(set(songs_href))
    singers_list = find_singers()
    dic = {}
    for singer in singers_list:
        try:
            ss = find_songs(singer)
            print singer[0].encode('utf-8') + '\t' + str(len(ss))
            dic[singer[0]] = ss
        except:
            continue
    
    def parse_song_href(singer, song_url):
        complete_url = url + song_url
        response = urllib.urlopen(complete_url)
        data = response.read()
        soup = BeautifulSoup(data)
        name = soup.findAll('a', id = 'J_downlrc')[0]['href']
        download_url = url + name
        try:
            content = urllib.urlopen(download_url.encode('utf-8')).read()
            with open('./' + name.encode('utf-8').split('/')[1], 'w') as f:
                f.write(content)
            return download_url
        except:
            return False
    
    for singer_name in dic.keys():
        for song_url in dic[singer_name]:
            print parse_song_href(singer_name, song_url)
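
The listing above is Python 2 and depends on BeautifulSoup 3. As an illustration only, the page traversal done by `find_songs` could be sketched in Python 3 with the standard library as below. This is not the author's code: `HrefCollector`, `extract_hrefs` and `collect_lyric_links` are hypothetical names, and `fetch` is an assumed caller-supplied function that maps a relative href to the page's HTML text, which keeps the sketch runnable without network access. Unlike the original, the result is returned sorted so that it is deterministic.

```python
import re
from html.parser import HTMLParser

# Same href patterns as the original code, with the dot escaped.
LIST_PAGE = re.compile(r'songlist.*\.html')
LYRIC_PAGE = re.compile(r'lyric-.*\.html')


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def extract_hrefs(html):
    """Return all link targets found in an HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


def collect_lyric_links(start_page, fetch):
    """Visit every song-list page reachable from start_page exactly once
    and return the sorted, de-duplicated lyric-page hrefs."""
    pending = [start_page]   # song-list pages still to visit
    seen = {start_page}      # song-list pages already queued or visited
    lyric_links = set()
    while pending:
        page = pending.pop()
        for href in extract_hrefs(fetch(page)):
            if LIST_PAGE.search(href) and href not in seen:
                seen.add(href)
                pending.append(href)
            if LYRIC_PAGE.search(href):
                lyric_links.add(href)
    return sorted(lyric_links)
```

For a real crawl, `fetch` could wrap `urllib.request.urlopen` on the site's base URL joined with the href and decode the response bytes. The right character encoding for the site is not stated in the post and would have to be checked.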
  • Original article: https://www.cnblogs.com/LarryGates/p/6559737.html