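Use XPath to scrape song information from the Kugou TOP500 chart: extract the rank, singer, song title, and song length, and save the results to a file. Reference URL: http://www.kugou.com/yy/rank/home/1-8888.html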
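Rough steps: press F12 in Chrome to open the developer tools --> find the information you want in Elements --> right-click and Copy XPath (or write the XPath by hand from the Response code in the Network tab) --> compare the XPaths of several items of the same kind, then generalize the path to locate and extract all of them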
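The "compare and generalize" step can be sketched on a small inline document. The markup below is a hypothetical, simplified stand-in for the chart page (the class names and values are made up, not taken from Kugou); only the XPath shapes mirror the ones used in this post.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the chart markup (not the real page).
SAMPLE = """
<div id="rankWrap">
  <div class="head"></div>
  <div class="list">
    <ul>
      <li title="Singer A - Song A">
        <span class="chk"></span>
        <span class="play"></span>
        <span class="num"><strong>1</strong></span>
        <span class="info"><span class="time"> 3:45 </span></span>
      </li>
      <li title="Singer B - Song B">
        <span class="chk"></span>
        <span class="play"></span>
        <span class="num"> 4 </span>
        <span class="info"><span class="time"> 4:01 </span></span>
      </li>
    </ul>
  </div>
</div>
"""

html = etree.HTML(SAMPLE)
base = '//*[@id="rankWrap"]/div[2]/ul'

# An XPath copied for a single item carries an index (li[1]) and matches one row.
first_only = html.xpath(base + '/li[1]/@title')

# Dropping the index generalizes the same path to every row.
all_titles = html.xpath(base + '/li/@title')

# /text() returns only direct text children of the span;
# //text() also reaches text nested deeper, e.g. inside <strong>.
direct = [t.strip() for t in html.xpath(base + '/li/span[3]/text()') if t.strip()]
nested = [t.strip() for t in html.xpath(base + '/li/span[3]//text()') if t.strip()]

print(first_only)
print(all_titles)
print(direct)
print(nested)
```

In this sample the rank wrapped in `<strong>` is only found by the `//text()` form, which is the reason the script uses `//text()` for the ranking column.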
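Note: what the script fetches is the code shown under Response in the Network tab. The HTML under Elements has already been rendered by the browser, so treat it as a reference only.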
import json
import time

import requests
from lxml import etree
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch one chart page; return its HTML text, or None on failure."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
        response = requests.get(url, headers=headers)
        response.encoding = response.apparent_encoding
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(text, id):
    """Yield one dict per song found on the page."""
    html = etree.HTML(text)
    # The text of the top 3 ranks sits inside a <strong> tag, a descendant of the span, so use //
    ranking = html.xpath('//*[@id="rankWrap"]/div[2]/ul/li/span[3]//text()')
    title = html.xpath('//*[@id="rankWrap"]/div[2]/ul/li/@title')
    length = html.xpath('//*[@id="rankWrap"]/div[2]/ul/li/span[4]/span/text()')
    if id == 1:
        # Drop the blank strings extracted on the first page
        ranking = [i for i in ranking if i.strip() != '']
    for i in range(len(length)):
        yield {
            'ranking': ranking[i].strip(),
            # Split on '-', take item 0, and strip surrounding whitespace
            'singer': title[i].split('-')[0].strip(),
            'song': title[i].split('-')[1].strip(),
            'length': length[i].strip()
        }


def write_to_file(content):
    """Append one record to kugou.txt as a single JSON line."""
    with open('kugou.txt', 'a', encoding='utf-8') as f:
        # ensure_ascii=False writes Chinese characters as-is instead of ASCII escapes
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(id):
    url = "http://www.kugou.com/yy/rank/home/" + str(id) + "-8888.html"
    text = get_one_page(url)
    if text is None:
        # The request failed or returned a non-200 status; skip this page
        return
    for item in parse_one_page(text, id):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    for id in range(1, 24):
        main(id)
        time.sleep(1)
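If each record is written as one JSON object on its own line, the saved file can be read back line by line. The sketch below shows that round trip with made-up records and a hypothetical scratch file (`demo.txt` in a temp directory), so it does not touch `kugou.txt` or the network.

```python
import json
import os
import tempfile

# Made-up records in the same shape as the items yielded by parse_one_page.
records = [
    {'ranking': '1', 'singer': '歌手甲', 'song': '歌曲甲', 'length': '3:45'},
    {'ranking': '2', 'singer': 'Singer B', 'song': 'Song B', 'length': '4:01'},
]

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Append one JSON object per line, keeping non-ASCII characters readable.
with open(path, 'a', encoding='utf-8') as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

# Read the records back, skipping any blank lines.
with open(path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f if line.strip()]

# Keep the raw text to confirm the characters were not escaped.
with open(path, encoding='utf-8') as f:
    raw = f.read()

print(loaded)
```

Because the file is opened in append mode, running the scraper twice adds the same songs twice; deleting the output file before a fresh run avoids duplicates.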