I finally succeeded in optimizing the code of the LightFM model!
But the computational cost is very high, so I will start with only 1000 of the 227427 check-ins.
And the results turned out to be good!
Here is the original LightFM model running on my laptop:
    unique user&venue checkin combination in test 195
    unique user&venue checkin combination in test 778
    max num in matrix 2
    max num in train 4
    I am beginning to model
    model has been fitted
    this is the model that consider the checkin times
    Time used: 0.3982211436102695
    Train_auc is 0.932690
    Test_aus is 0.159056
    Collabrative Filtering testAUC is: 0.500707
    Hybrid train auc is 0.958416
    Hybrig test auc is 0.512641
    logistic train auc is 0.822063
    logistic test auc is 0.138891
We can see that, due to the loss of data, the test AUC is extremely low (0.159056), and using the hybrid model greatly improves it (0.512641).
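To make a value like 0.159056 easier to read, here is a minimal sketch of what an AUC measures for a single user, assuming the reported numbers are ordinary ROC AUC values. The function name `pairwise_auc` and the scores are made up for illustration; this is not the evaluation code used for the logs above.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, the usual convention for ROC AUC.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up scores for one user (illustrative only, not from the experiment).
visited = [0.9, 0.2]            # venues the user checked in to in the test set
not_visited = [0.8, 0.5, 0.1]   # venues the user did not visit

auc = pairwise_auc(visited, not_visited)

# Negating every score reverses the ranking, so without ties the AUC becomes 1 - auc.
flipped = pairwise_auc([-s for s in visited], [-s for s in not_visited])

print(auc, flipped)
```

An AUC of 0.5 corresponds to a random ranking, so a value well below 0.5 means the model tends to rank unvisited venues above visited ones on that split.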
Now let's see the results of the model that considers the domain-specific biases:
    this is test for new lightfm, 1000 checkins
    unique user&venue checkin combination in test 195
    unique user&venue checkin combination in test 778
    max num in matrix 4
    max num in train 3
    I am beginning to get negtive examples
    object preprocess created
    calculate neighbor for item 0
    calculate neighbor for item 1
    calculate neighbor for item 2
    calculate neighbor for item 3
    calculate neighbor for item 4
    ......
    calculate neighbor for item 914
    calculate neighbor for item 915
    calculate neighbor for item 916
    get neighbor time used: 31.218598
    0 1 2 3 ..... 774 775 776 777
    Time used for negative examples: 31.323323000000002
    I am beginning to model,this is the new model
    model has been fitted
    this is the model that consider the checkin times
    Time used: 0.04152100000000303
    Train_auc is 0.589729
    Test_aus is 0.329315
Although the train AUC drops, the test AUC increases a lot (it roughly doubles, from 0.159056 to 0.329315). That is a really good result, although it does not reach the result of the hybrid model (0.512641).
It still shows that the new model compensates for the information loss to some extent.
Here are the results for 50000 check-ins, running on my laptop:
    unique user&venue checkin combination in test 5010
    unique user&venue checkin combination in test 20036
    max num in matrix 35
    max num in train 48
    I am beginning to model
    model has been fitted
    this is the model that consider the checkin times
    Time used: 5.658149130902446
    Train_auc is 0.999952
    Test_aus is 0.465492
    Collabrative Filtering testAUC is: 0.554559
    Hybrid train auc is 0.596089
    Hybrig test auc is 0.529985
    logistic train auc is 0.774696
    logistic test auc is 0.42213
The new LightFM model is still running on the cluster... waiting for the results.
OK, here are the results:
    this is test for new lightfm, 50000 checkins
    unique user&venue checkin combination in test 5010
    unique user&venue checkin combination in test 20036
    max num in matrix 48
    max num in train 47
    I am beginning to get negtive examples
    object preprocess created
    get neighbor time used: 9331.736611
    Time used for negative examples: 9375.006032
    I am beginning to model,this is the new model
    model has been fitted
    this is the model that consider the checkin times
    Time used: 0.9198419999993348
    Train_auc is 0.553874
    Test_aus is 0.485107
    /home/s2013258/.local/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)
A slight improvement in the test AUC (from 0.465492 to 0.485107)...
And as for all the data, the AUC declined anyway.
Here is the result running on my laptop:
    unique user&venue checkin combination in test 18205
    unique user&venue checkin combination in test 72819
    max num in matrix 155
    max num in train 257
    I am beginning to model
    model has been fitted
    this is the model that consider the checkin times
    Time used: 28.566111388207524
    Train_auc is 0.999501
    Test_aus is 0.654774
    Collabrative Filtering testAUC is: 0.686022
    Hybrid train auc is 0.513596
    Hybrig test auc is 0.507019
and here is the result running on the cluster with the new model:
    this is test for new lightfm, all checkins
    unique user&venue checkin combination in test 18205
    unique user&venue checkin combination in test 72819
    max num in matrix 219
    max num in train 257
    I am beginning to get negtive examples
    object preprocess created
    get neighbor time used: 51382.303583
    Time used for negative examples: 51741.248447
    I am beginning to model,this is the new model
    model has been fitted
    this is the model that consider the checkin times
    Time used: 3.28872599999886
    Train_auc is 0.562395
    Test_aus is 0.543550
    /home/s2013258/.local/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)
Clearly, the test AUC of the hybrid model (0.507019) and of the new model (0.543550) is lower than the test AUC of the original CF model with WARP loss (0.686022). I think it may have something to do with overfitting... or with the redundancy of information.
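One quick way to look at the overfitting question is to put the train and test AUC of each run side by side. The sketch below only copies the `Train_auc` / `Test_aus` values from the logs in this post and computes the gap between them; the dictionary keys are labels I made up for readability.

```python
# (train AUC, test AUC) copied from the Train_auc / Test_aus lines of the logs.
runs = {
    "original lightfm, 1000 checkins": (0.932690, 0.159056),
    "new lightfm, 1000 checkins": (0.589729, 0.329315),
    "original lightfm, 50000 checkins": (0.999952, 0.465492),
    "new lightfm, 50000 checkins": (0.553874, 0.485107),
    "original lightfm, all checkins": (0.999501, 0.654774),
    "new lightfm, all checkins": (0.562395, 0.543550),
}

# A large positive gap (train much higher than test) is the usual sign of overfitting.
gaps = {name: train - test for name, (train, test) in runs.items()}

for name, gap in gaps.items():
    train, test = runs[name]
    print(f"{name:34s} train={train:.6f} test={test:.6f} gap={gap:.6f}")
```

In these numbers the large gaps belong to the original model, while the train and test AUC of the new model stay close together. That might mean the new model is held back by something other than overfitting, but this is only one possible reading of the logs.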
For now I have two possible improvements in mind:
1. Changing the radius of the neighbor area
2. Reducing the overfitting...
Solution 1 is easy, but it takes some time to see the results. As for solution 2, I do not have any specific ideas yet.
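For solution 1, a minimal sketch of a radius-based neighbor search might look like the following. It assumes venues are given as (latitude, longitude) pairs and that "neighbor" means "within a great-circle distance of `radius_km`"; the names `haversine_km` and `neighbors_within` and the coordinates are hypothetical and are not taken from the actual preprocessing code.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def neighbors_within(venues, radius_km):
    """Map each venue index to the indices of all venues within radius_km.

    Plain all-pairs loop: simple, but quadratic in the number of venues.
    """
    result = {i: [] for i in range(len(venues))}
    for i in range(len(venues)):
        for j in range(i + 1, len(venues)):
            if haversine_km(*venues[i], *venues[j]) <= radius_km:
                result[i].append(j)
                result[j].append(i)
    return result


# Illustrative coordinates only (not from the check-in data).
venues = [(40.7580, -73.9855), (40.7484, -73.9857), (40.6892, -74.0445)]

small = neighbors_within(venues, 2.0)    # tight radius: few neighbors
large = neighbors_within(venues, 10.0)   # wide radius: more neighbors
print(small)
print(large)
```

With the radius as a single parameter, trying a few values only means rerunning the neighbor step. If the real neighbor search is also an all-pairs loop, that step would dominate the run time, which would fit the long "get neighbor time used" values in the logs.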