  • Web Crawler (1): Scraping FishC collection (淘贴) info

    Dug out one of my old practice exercises.

    From now on I'll review an old crawler exercise every day, trying to spot new problems and places that could be optimized.

    # -*- coding:utf-8 -*-
    import requests
    import chardet
    import csv
    from lxml import etree
    from urllib.parse import urljoin
    
    def get_page(url):
        """Fetch a page and return its decoded HTML text."""
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        # Detect the actual encoding from the bytes instead of trusting the HTTP header.
        r.encoding = chardet.detect(r.content)['encoding']
        return r.text
    
    def parse_data(page):
        """Extract one page of collection entries plus the URL of the next page."""
        result = []
        html = etree.HTML(page)
        # Discuz marks the "next page" link with class "nxt"; it is absent on the last page.
        next_url = html.xpath('//a[@class="nxt"]/@href')
        if next_url:
            next_url = next_url[0]
            print(next_url)  # simple progress trace
        else:
            next_url = None
        # Each collection entry sits in a block with class "xld xlda cl".
        sites = html.xpath('//*[@class="xld xlda cl"]')

        for site in sites:
            title = site.xpath('.//a[@class="xi2"]/text()')[0]
            author = site.xpath('.//p[@class="xg1"]/a/text()')[0]
            theme = site.xpath('.//strong[@class="xi2"]/text()')[0]
            # Subscription and comment counts arrive as one comma-separated string.
            counts = site.xpath('./dl/dd[2]/p[2]/text()')[0]
            sub_num, com_num = counts.split(',')
            result.append((title, author, theme, sub_num.strip(), com_num.strip()))
        return result, next_url
    
    def main():
        url = 'http://bbs.fishc.org/forum.php?mod=collection'
        results = []
        page = get_page(url)
        result, next_url = parse_data(page)
        results.extend(result)
        # Keep following the "next page" link until parse_data returns None.
        while next_url:
            # The href taken from the page may be relative, so resolve it first.
            page = get_page(urljoin(url, next_url))
            result, next_url = parse_data(page)
            results.extend(result)
        headers = ['title', 'author', 'theme', 'sub_num', 'com_num']
        # newline='' keeps csv from writing blank lines between rows on Windows.
        with open('taotie.csv', 'w', encoding='utf-8', newline='') as f:
            f_csv = csv.writer(f)
            f_csv.writerow(headers)
            f_csv.writerows(results)


    if __name__ == "__main__":
        main()
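
    One optimization worth noting on review: get_page opens a fresh connection for every page it fetches. A requests.Session reuses the same underlying connection across fetches and can carry the User-Agent header for all requests. A minimal sketch of that idea, meant as a drop-in replacement for get_page above (not part of the original script):

    import requests
    import chardet

    # One Session shared by all requests; it keeps the connection alive
    # and sends the same User-Agent on every fetch.
    session = requests.Session()
    session.headers['User-Agent'] = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; '
                                     'rv:61.0) Gecko/20100101 Firefox/61.0')

    def get_page(url):
        # Same interface as the original get_page, so parse_data and main
        # work unchanged; only the connection handling differs.
        r = session.get(url)
        r.encoding = chardet.detect(r.content)['encoding']
        return r.text

    The payoff is small for a crawl of this size, but it grows with the number of pages, and it is exactly the kind of tweak this daily review is meant to surface.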
  • Original post: https://www.cnblogs.com/Alexisbusyblog/p/9343875.html