  • 哗啦啦's Road to Python Quantitative Finance

    The first step in quantitative finance: data statistics and analysis.

    The textbook I chose: Python for Data Analysis, published by O'Reilly.


    Practical examples

    1. Processing the 1.usa.gov data from bit.ly.

      1) Data: http://www.usa.gov/About/developer-resources/1usagov.shtml

        The data is in the common JSON format.

      2) Converting the JSON into dictionaries

        Note: I saved the data locally as a TXT file before processing it. The blank separator lines need to be skipped, and because the file contains a BOM character, that has to be stripped as well; the resulting dictionaries are then collected into a list.

    import json

    from collections import defaultdict
    from collections import Counter

    records = []
    for line in open("haha6.txt", encoding="utf-8"):
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith("\ufeff"):
            line = line[1:]  # strip the BOM character
        records.append(json.loads(line))

    print(records[0])

    #output:
    # The first record looks like this:
    #{'u': 'http://today.lbl.gov/2016/06/24/saudi-minister-of-energy-visits-lab-on-june-20/#main',
    #'_id': '27e6808c-3750-e5ac-002a-cfb577e72a48', 'r': 'direct', 'sl': '2963Ceb', 'h': '2963Ceb',
    #'k': '', 'a': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML', 'c': 'FR',
    #'hc': 1466804416, 'nk': 0, 'll': [48.8582, 2.3387], 'g': '2963Fqo',
    #'t': 1467187377, 'hh': '1.usa.gov', 'l': 'anonymous', 'i': '', 'tz': 'Europe/Paris'}
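
    Incidentally, the manual BOM handling can be avoided: opening the file with the "utf-8-sig" codec makes Python strip a leading BOM automatically. A minimal alternative sketch, assuming the same local file and that the BOM only appears at the top of the file:

    records = []
    for line in open("haha6.txt", encoding="utf-8-sig"):  # "utf-8-sig" silently drops a leading BOM
        line = line.strip()
        if line:  # skip blank separator lines
            records.append(json.loads(line))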

      3) Finding all the time zones and counting them

    time_zones = [rec["tz"] for rec in records if "tz" in rec]
    
    ## Count the time zones, i.e. tally one key's values across the dicts in the list
    # Method 1: plain dict
    def get_counts(sequence):
        counts = {}
        for x in sequence:
            if x in counts:
                counts[x] += 1
            else:
                counts[x] = 1
        return counts
    counts = get_counts(time_zones)
    print(counts["America/New_York"])
    
    # Method 2: collections.defaultdict
    def get_counts1(sequence):
        counts = defaultdict(int)
        for x in sequence:
            counts[x] += 1
        return counts
    counts = get_counts1(time_zones)
    print(counts["America/New_York"])

    #output: 353

      4) Extracting the top ten time zones and their counts

    # Method 1: sort (count, tz) pairs and take the last n
    def top_counts(count_dict, n = 10):
        value_key_pairs = [(count,tz) for tz, count in count_dict.items()]
        value_key_pairs.sort()
        return value_key_pairs[-n:]
    print(top_counts(counts))
    # Method 2: collections.Counter
    counts = Counter(time_zones)
    print(counts.most_common(10))
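
    A third option (my addition, not in the original post) is heapq.nlargest from the standard library, which finds the top n without sorting the whole dict:

    import heapq
    top10 = heapq.nlargest(10, counts.items(), key=lambda kv: kv[1])  # works for dict or Counter
    print(top10)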

      5) Simplifying with pandas: count the time zones and plot a bar chart of the top ten

    # Count the time zones with pandas
    from pandas import DataFrame
    import pandas as pd
    import numpy as np
    frame = DataFrame(records)
    #print(frame)
    #tz_counts = frame["tz"].value_counts()
    #print(tz_counts[:10])
    clean_tz = frame["tz"].fillna("missing")  # handle missing values (NaN)
    clean_tz[clean_tz == ""] = "unknown"  # handle empty strings
    tz_counts = clean_tz.value_counts()
    tz_counts[:10].plot(kind="barh", rot=0)  # draw the chart (add matplotlib.pyplot.show() outside IPython)

    #output: a horizontal bar chart of the top ten time zones
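
    As an aside, the two cleanup steps can be collapsed into one chained expression; a minimal sketch over the same frame:

    tz_counts = frame["tz"].fillna("missing").replace("", "unknown").value_counts()
    print(tz_counts[:10])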

    2. Processing the MovieLens dataset

      1) Data: http://www.grouplens.org/node/73

        The data is split across three files:

        - Users file, one line per user: 1::F::1::10::48067

        - Ratings file, one line per rating: 1::1193::5::978300760

        - Movies file, one line per movie: 1::Toy Story (1995)::Animation|Children's|Comedy

      2) Loading the files into tables

       

    import pandas as pd
    
    usernames = ["user_id", "gender", "age", "occupation", "zip"]
    users = pd.read_table("ml-1m/users.dat", sep = "::", header = None, names = usernames, engine = "python")
    
    rnames = ["user_id", "movie_id", "rating", "timestamp"]
    ratings = pd.read_table("ml-1m/ratings.dat", sep = "::", header = None, names = rnames, engine = "python")
    
    movienames = ["movie_id", "title", "genres"]
    movies = pd.read_table("ml-1m/movies.dat", sep = "::", header = None, names = movienames, engine = "python")
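
    A quick sanity check (my addition) that all three files parsed correctly:

    print(users[:5])
    print(ratings[:5])
    print(movies[:5])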

      3) Merging the three tables

     

    data = pd.merge(pd.merge(ratings, users), movies)
    #print(data)
    print(data.iloc[0])  # .ix is long deprecated; use .iloc for positional access
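
    pd.merge joins on the overlapping column names by default (user_id for the first merge, movie_id for the second); spelling the keys out is equivalent and a bit more explicit:

    data = pd.merge(pd.merge(ratings, users, on="user_id"), movies, on="movie_id")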

      4) Computing each movie's mean rating by gender

    mean_rating = data.pivot_table("rating", index="title", columns="gender", aggfunc="mean")
    print(mean_rating[:5])
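
    As a quick usage example (my addition), a single title from the movies file can be looked up in the pivoted table:

    print(mean_rating.loc["Toy Story (1995)"])  # returns the F and M mean ratings for that title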