  • Web Scraper (1): Scraping FishC "Taotie" (Thread Collection) Info

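    I dug out an old practice exercise.

    Starting now, I will revisit my old scraping exercises every day and try to spot new problems and places that can be optimized.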

    # -*- coding:utf-8 -*-
    import requests
    import chardet
    import csv
    from lxml import etree
    import re
    
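    def get_page(url):
        """Download url and return the decoded page text."""
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'
        header = {'User-Agent':user_agent}
        # A timeout keeps the script from hanging forever on a stalled connection.
        r = requests.get(url,headers=header,timeout=10)
        # Guess the encoding from the raw bytes, since the server may not declare it correctly.
        r.encoding = chardet.detect(r.content)['encoding']
        page = r.text
        return page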
    
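    def parse_data(page):
        """Return (rows, next_url) for one listing page; next_url is None on the last page."""
        result = []
        html = etree.HTML(page)
        # The "next page" link carries class "nxt" and is absent on the last page.
        next_links = html.xpath('//a[@class="nxt"]/@href')
        next_url = next_links[0] if next_links else None
        if next_url:
            print(next_url)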
        sites = html.xpath('//*[@class="xld xlda cl"]')
        
        for site in sites:
            title = site.xpath('.//a[@class="xi2"]/text()')[0]
            author = site.xpath('.//p[@class="xg1"]/a/text()')[0]
            theme = site.xpath('.//strong[@class="xi2"]/text()')[0]
            # dd[2]/p[2] holds both counts in one comma-separated string.
            counts = site.xpath('./dl/dd[2]/p[2]/text()')[0]
            sub_num,com_num = counts.split(',')
            com_num = com_num.strip()
            sub_num = sub_num.strip()
            content = (title,author,theme,sub_num,com_num)
            result.append(content)
        return result,next_url
    
    def main():
        url = 'http://bbs.fishc.org/forum.php?mod=collection'
        results = []
        page = get_page(url)
        result,next_url = parse_data(page)
        results.extend(result)
        # Keep following the "next page" link until parse_data reports none.
        while next_url:
            page = get_page(next_url)
            result,next_url = parse_data(page)
            results.extend(result)
        headers = ['title','author','theme','sub_num','com_num']
        # newline='' lets the csv module control line endings (avoids blank rows on Windows).
        with open(r'taotie.csv','w',encoding = 'utf-8',newline = '') as f:
            f_csv = csv.writer(f)
            f_csv.writerow(headers)
            try:
                f_csv.writerows(results)
            # Writing text can only fail while encoding, so catch UnicodeEncodeError.
            except UnicodeEncodeError as e:
                print(e)
                
                
    if __name__ == "__main__":
        main()
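
One possible optimization, as a minimal sketch rather than a drop-in replacement: the paging loop in main() assumes the "next page" href can be requested as-is and that the site never links back to a page already visited. The helper below, `crawl_all`, is a hypothetical name that is not in the original script. It resolves each href against the current URL with `urllib.parse.urljoin`, remembers visited URLs, and takes the fetch and parse functions as arguments so it can be tried without network access. The `example.com` pages are made-up test data, not real forum output.

```python
from urllib.parse import urljoin


def crawl_all(start_url, fetch, parse, max_pages=1000):
    """Follow "next page" links until none is left.

    fetch(url) returns the page text; parse(page) returns (rows, next_href).
    Both are injected so the loop can be exercised offline.
    """
    results = []
    seen = set()
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        rows, next_href = parse(fetch(url))
        results.extend(rows)
        # An empty list or None means there is no next page.
        # urljoin keeps absolute hrefs unchanged and resolves relative ones
        # against the URL of the page that contained them.
        url = urljoin(url, next_href) if next_href else None
    return results


# Offline demonstration with fake pages (hypothetical data).
fake_site = {
    'http://example.com/list?page=1': ([('t1', 'a1')], 'list?page=2'),
    'http://example.com/list?page=2': ([('t2', 'a2')], []),
}
rows = crawl_all(
    'http://example.com/list?page=1',
    fetch=lambda u: u,               # the "page" is just its own URL here
    parse=lambda p: fake_site[p],    # look up the canned parse result
)
print(rows)  # [('t1', 'a1'), ('t2', 'a2')]
```

With the script above, the call would presumably look like `results = crawl_all(url, get_page, parse_data)`, replacing the first fetch and the loop in main().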
  • Original article: https://www.cnblogs.com/Alexisbusyblog/p/9343875.html