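1. Bike Sharing Demand
Kaggle: https://www.kaggle.com/c/bike-sharing-demand
Goal: predict the number of bike rentals from features such as date, time, weather, and temperature.
Preprocessing: 1. From the datetime field (year, month, day, hour, minute, second), extract the year, month, day of week, and hour.
2. season and weather are categorical labels, so they are encoded as dummy variables.
Model selection:
This is a regression problem: 1. RandomForestRegressor
2. GradientBoostingRegressor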
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

Y_train = train['count']

# Feature engineering: dummy-encode season and weather, then add the
# month, day of week and hour extracted from the datetime column.
def build_features(df):
    dt = pd.DatetimeIndex(df['datetime'])
    # Prefixes keep the dummy column names unique and string-typed.
    X = pd.concat([pd.get_dummies(df['season'], prefix='season'),
                   pd.get_dummies(df['weather'], prefix='weather')], axis=1)
    X['month'] = dt.month
    X['day'] = dt.dayofweek
    X['hour'] = dt.hour
    for col in ['holiday', 'workingday', 'temp', 'humidity', 'windspeed']:
        X[col] = df[col]
    return X

X_train = build_features(train)
# Align the test columns with the training columns in case a category
# value is missing from one of the two files.
X_test = build_features(test).reindex(columns=X_train.columns, fill_value=0)

# Model 1: gradient boosting. The target is fitted as log1p(count) so that
# expm1 below maps the predictions back to rental counts.
clf = GradientBoostingRegressor(n_estimators=200, max_depth=3)
clf.fit(X_train, np.log1p(Y_train))
result = np.expm1(clf.predict(X_test))
df = pd.DataFrame({'datetime': test['datetime'], 'count': result})
df.to_csv('results1.csv', index=False, columns=['datetime', 'count'])

# Model 2: random forest, fitted on the raw counts.
rf = RandomForestRegressor()
rf.fit(X_train, Y_train)
y_predict = rf.predict(X_test).astype(int)
df = pd.DataFrame({'datetime': test.datetime, 'count': y_predict})
df.to_csv('result2.csv', index=False, columns=['datetime', 'count'])

# Earlier version that wrote the file with the csv module (left commented out):
#predictions_file = open("RandomForestRegssor.csv", "wb")
#open_file_object = csv.writer(predictions_file)
#open_file_object.writerow(["datetime", "count"])
#open_file_object.writerows(zip(res_time, y_predict))
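The competition scores submissions with root mean squared logarithmic error (RMSLE), so it can help to score a model on held-out rows of train.csv with the same metric before submitting. Below is a minimal sketch in plain Python. The function name rmsle and the sample numbers are illustrative only and do not come from the original script or the competition data.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Negative predictions would break log1p, so clip them to zero first.
    squared_log_errors = [
        (math.log1p(max(p, 0.0)) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Illustrative numbers only (not taken from the competition data).
actual = [16, 40, 32, 13, 1]
predicted = [14.2, 45.0, 30.5, 10.0, 0.0]
print(round(rmsle(actual, predicted), 4))
```

Because the metric compares log1p values, a model fitted on log1p(count) with ordinary squared error is optimizing something close to the leaderboard score.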
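2. Daily News for Stock Market Prediction
Using historical data (the 25 most-clicked news items of each day, together with whether the stock market rose or fell that day), predict the future direction of the stock market.
Method 1:
1. Merge the 25 news items into a single document, preprocess every word (strip special characters, drop words containing digits, remove stop words, lowercase, and stem), then extract features with TF-IDF and train an SVM.
2. Extract features with word2vec.
Implementation: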
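The preprocessing in step 1 can be sketched with the standard library alone. This is only an illustration under stated assumptions: the stop-word set is a tiny hypothetical sample, and crude_stem is a rough stand-in for a real stemmer such as Porter, not what the linked implementation uses.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "for"}


def crude_stem(word):
    """Very rough suffix stripper, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(headlines):
    """Merge one day's headlines and return the cleaned text as one string."""
    merged = " ".join(headlines).lower()
    tokens = []
    for raw in merged.split():
        word = re.sub(r"[^a-z0-9]", "", raw)  # strip special characters
        if not word or any(ch.isdigit() for ch in word):
            continue  # drop empty tokens and words containing digits
        if word in STOP_WORDS:
            continue
        tokens.append(crude_stem(word))
    return " ".join(tokens)


day = [
    "Markets are falling in Asia!",
    "Oil stocks jumped 5% on Monday",
    "The bank's profits",
]
print(preprocess(day))
```

Each day then becomes one cleaned string, which is the input shape a TF-IDF vectorizer expects.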
https://github.com/yjfiejd/News_predict
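For orientation, the TF-IDF plus SVM step of Method 1 might look roughly like the sketch below. It is not taken from the linked repository: the toy documents and labels are made up, and the choice of LinearSVC with C=1.0 as the SVM variant is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for preprocessed daily documents and their labels
# (1 = market went up, 0 = market went down). Not real data.
train_docs = [
    "stock rally gain record",
    "market surge profit growth",
    "crash fear loss recession",
    "bank collapse crisis selloff",
]
train_labels = [1, 1, 0, 0]

# TF-IDF turns each document into a weighted term vector;
# the linear SVM then separates the "up" days from the "down" days.
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0, random_state=0))
model.fit(train_docs, train_labels)

new_docs = ["stock rally", "crash fear"]
predictions = model.predict(new_docs)
print(list(predictions))
```

Wrapping both steps in one pipeline means the TF-IDF vocabulary is learned only from the training days, which avoids leaking test-day vocabulary into the features.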
3.