
Python for Data Science

Chapter 6 - Other Popular Machine Learning Methods

Segment 1 - Association Rule Mining Using Apriori Algorithm

Association Rule Mining

Association rule mining is a process that deploys pattern recognition to identify and quantify relationships between different, yet related items.

A Simple Association Rules Use Case

Popular use case: product placement optimization at both brick-and-mortar and e-commerce stores.

Advantages of Association Rules

Fast

Works with small data

Feature Engineering

The term feature engineering refers to the process of engineering data into a predictive feature that fits the requirements (and/or improves the performance) of a machine learning model.

Three Ways to Measure Association

Support:

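Support is the relative frequency of an item (or itemset) within a dataset, that is, the fraction of transactions that contain it. The support of a rule A -> C is the support of the combined itemset:

\[ \text{support}(A \rightarrow C) = \text{support}(A \cup C) \]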
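As a minimal sketch of the support calculation, the snippet below counts how many baskets contain a given itemset. The `transactions` list and the `support` helper are made up for illustration and are not part of the original notebook.

```python
# Hypothetical toy baskets, invented for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


# Support of a single item and of a pair (the A ∪ C itemset of a rule).
support_milk = support({"milk"}, transactions)
support_milk_bread = support({"milk", "bread"}, transactions)
print(support_milk, support_milk_bread)
```

Here milk appears in 3 of the 5 baskets and milk together with bread in 2 of them, so the two values are 3/5 and 2/5.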

Confidence:

Confidence is the probability of seeing the consequent item (the "then" term) within the data, given that the data also contains the antecedent item (the "if" term).

In other words, confidence tells you:

THEN: how likely it is for one item to be purchased, given that

IF: another item is purchased.

Confidence measures how often the if-then statement holds true within a dataset.

\[ \text{confidence}(A \rightarrow C) = \frac{\text{support}(A \rightarrow C)}{\text{support}(A)} \]
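The confidence formula can be sketched in the same way. The toy baskets and the helper names `support` and `confidence` are assumptions made for this example only. Note that confidence is not symmetric: swapping antecedent and consequent generally gives a different value.

```python
# Hypothetical toy baskets, invented for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A ∪ C) / support(A); assumes the antecedent occurs at least once."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


conf_milk_bread = confidence({"milk"}, {"bread"}, transactions)  # (2/5) / (3/5)
conf_bread_milk = confidence({"bread"}, {"milk"}, transactions)  # (2/5) / (4/5)
print(conf_milk_bread, conf_bread_milk)
```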

Lift:

Lift measures how much more often the antecedent and consequent occur together than would be expected if they were statistically independent.

\[ \text{lift}(A \rightarrow C) = \frac{\text{confidence}(A \rightarrow C)}{\text{support}(C)} \]

Lift Scores

Lift score > 1: A and C occur together more often than expected under independence. If A is purchased, C is more likely to be purchased as well
Lift score < 1: A and C occur together less often than expected under independence. If A is purchased, C is less likely to be purchased
Lift score = 1: Indicates that there is no association between items A and C
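The three lift cases can be illustrated with a small sketch. The baskets below are invented so that each case shows up once; the helper names are hypothetical and not taken from any library.

```python
# Hypothetical toy baskets, chosen so that all three lift cases appear.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)


def lift(antecedent, consequent, transactions):
    """confidence(A -> C) / support(C); assumes both sides occur at least once."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    return confidence / support(consequent, transactions)


lift_bread_butter = lift({"bread"}, {"butter"}, transactions)  # always bought together
lift_bread_milk = lift({"bread"}, {"milk"}, transactions)      # rarely bought together
lift_milk_eggs = lift({"milk"}, {"eggs"}, transactions)        # independent in this toy data
print(lift_bread_butter, lift_bread_milk, lift_milk_eggs)
```

In this toy data the first rule has a lift above 1, the second a lift below 1, and the third a lift of 1.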

Where Apriori Fits In

Apriori is an algorithm commonly used to implement association rule mining over structured data. It finds frequent itemsets, from which the association rules are then derived.
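To give a feel for how a level-wise frequent-itemset search of this kind works, here is a simplified pure-Python sketch. It is an illustration under stated assumptions, not the mlxtend implementation: the function name `apriori_sketch` and the toy baskets are hypothetical. The key idea is that an itemset can only be frequent if all of its subsets are frequent, so candidates of size k are built only from frequent itemsets of size k - 1.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical toy baskets, invented for illustration only.
toy = [
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"bread"},
]
frequent = apriori_sketch(toy, min_support=0.3)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(sup, 2))
```

With this threshold the three single items and the three pairs are kept, while the three-item set (present in only one of five baskets) is dropped.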

Import the required libraries

pip install mlxtend

Defaulting to user installation because normal site-packages is not writeable
Collecting mlxtend
  Downloading mlxtend-0.18.0-py2.py3-none-any.whl (1.3 MB)
Requirement already satisfied: scipy>=1.2.1 in /home/ericwei/.local/lib/python3.7/site-packages (from mlxtend) (1.5.4)
Requirement already satisfied: pandas>=0.24.2 in /home/ericwei/.local/lib/python3.7/site-packages (from mlxtend) (1.1.5)
Requirement already satisfied: setuptools in /home/ericwei/.local/lib/python3.7/site-packages (from mlxtend) (51.1.0.post20201221)
Requirement already satisfied: matplotlib>=3.0.0 in /usr/local/lib/python3.7/dist-packages (from mlxtend) (3.1.1)
Requirement already satisfied: joblib>=0.13.2 in /home/ericwei/.local/lib/python3.7/site-packages (from mlxtend) (1.0.0)
Requirement already satisfied: numpy>=1.16.2 in /home/ericwei/.local/lib/python3.7/site-packages (from mlxtend) (1.19.4)
Requirement already satisfied: scikit-learn>=0.20.3 in /home/ericwei/.local/lib/python3.7/site-packages (from mlxtend) (0.24.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.0->mlxtend) (2.8.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.0->mlxtend) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.0->mlxtend) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.0->mlxtend) (2.4.2)
Requirement already satisfied: six in /home/ericwei/.local/lib/python3.7/site-packages (from cycler>=0.10->matplotlib>=3.0.0->mlxtend) (1.15.0)
Requirement already satisfied: pytz>=2017.2 in /home/ericwei/.local/lib/python3.7/site-packages (from pandas>=0.24.2->mlxtend) (2020.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/ericwei/.local/lib/python3.7/site-packages (from scikit-learn>=0.20.3->mlxtend) (2.1.0)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.18.0
WARNING: You are using pip version 20.3.3; however, version 21.0 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Data Format

address = '~/Data/groceries.csv'
data = pd.read_csv(address)

data.head()

	1	2	3	4	5	6	7	8	9
0	citrus fruit	semi-finished bread	margarine	ready soups	NaN	NaN	NaN	NaN	NaN
1	tropical fruit	yogurt	coffee	NaN	NaN	NaN	NaN	NaN	NaN
2	whole milk	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	pip fruit	yogurt	cream cheese	meat spreads	NaN	NaN	NaN	NaN	NaN
4	other vegetables	whole milk	condensed milk	long life bakery product	NaN	NaN	NaN	NaN	NaN

Data Conversion

basket_sets = pd.get_dummies(data)

basket_sets.head()

	1_Instant food products	1_UHT-milk	1_artif. sweetener	1_baby cosmetics	1_bags	1_baking powder	1_bathroom cleaner	1_beef	1_berries	1_beverages	...	9_sweet spreads	9_tea	9_vinegar	9_waffles	9_whipped/sour cream	9_white bread	9_white wine	9_whole milk	9_yogurt	9_zwieback
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 1113 columns
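One thing to be aware of: because pd.get_dummies is applied to the wide table, each item is prefixed with the column it happened to sit in, so "whole milk" in column 1 and "whole milk" in column 2 become two separate features. All tables in this article are based on that position-prefixed encoding. As an alternative sketch (not the approach used here), the snippet below builds one boolean column per item regardless of position. The miniature `data` frame is made up to mirror the wide layout, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the wide groceries layout (one basket per row).
data = pd.DataFrame({
    "1": ["citrus fruit", "whole milk", "yogurt"],
    "2": ["whole milk", None, "whole milk"],
    "3": [None, None, "coffee"],
})

# Position-prefixed encoding, as produced by get_dummies on the wide table.
dummies = pd.get_dummies(data)
print(list(dummies.columns))

# Alternative: collect each row into a set of items, ignoring empty slots...
baskets = [set(row.dropna()) for _, row in data.iterrows()]
items = sorted(set().union(*baskets))

# ...and build one boolean column per distinct item.
basket_sets = pd.DataFrame(
    [{item: item in basket for item in items} for basket in baskets]
)
print(basket_sets)
```

With this encoding "whole milk" is a single column that is True for every basket containing it, so supports are counted per item rather than per item-and-position.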

Support Calculation

apriori(basket_sets, min_support=0.02)

	support	itemsets
0	0.030421	(7)
1	0.034951	(17)
2	0.029126	(23)
3	0.049191	(26)
4	0.064401	(47)
5	0.044660	(83)
6	0.024272	(90)
7	0.040453	(92)
8	0.038835	(99)
9	0.033981	(100)
10	0.076052	(105)
11	0.028803	(111)
12	0.044984	(123)
13	0.073463	(130)
14	0.022977	(131)
15	0.028803	(159)
16	0.058900	(217)
17	0.022977	(224)
18	0.040129	(232)
19	0.036893	(233)
20	0.031068	(243)
21	0.034628	(256)
22	0.062136	(263)
23	0.028479	(264)
24	0.045955	(351)
25	0.033010	(366)
26	0.024272	(378)
27	0.057929	(397)
28	0.023301	(398)
29	0.020712	(479)
30	0.024595	(497)
31	0.024272	(510)
32	0.033333	(531)
33	0.023301	(532)
34	0.020065	(631)
35	0.021036	(217, 397)

apriori(basket_sets, min_support=0.02, use_colnames=True)

	support	itemsets
0	0.030421	(1_beef)
1	0.034951	(1_canned beer)
2	0.029126	(1_chicken)
3	0.049191	(1_citrus fruit)
4	0.064401	(1_frankfurter)
5	0.044660	(1_other vegetables)
6	0.024272	(1_pip fruit)
7	0.040453	(1_pork)
8	0.038835	(1_rolls/buns)
9	0.033981	(1_root vegetables)
10	0.076052	(1_sausage)
11	0.028803	(1_soda)
12	0.044984	(1_tropical fruit)
13	0.073463	(1_whole milk)
14	0.022977	(1_yogurt)
15	0.028803	(2_citrus fruit)
16	0.058900	(2_other vegetables)
17	0.022977	(2_pip fruit)
18	0.040129	(2_rolls/buns)
19	0.036893	(2_root vegetables)
20	0.031068	(2_soda)
21	0.034628	(2_tropical fruit)
22	0.062136	(2_whole milk)
23	0.028479	(2_yogurt)
24	0.045955	(3_other vegetables)
25	0.033010	(3_rolls/buns)
26	0.024272	(3_soda)
27	0.057929	(3_whole milk)
28	0.023301	(3_yogurt)
29	0.020712	(4_other vegetables)
30	0.024595	(4_rolls/buns)
31	0.024272	(4_soda)
32	0.033333	(4_whole milk)
33	0.023301	(4_yogurt)
34	0.020065	(5_rolls/buns)
35	0.021036	(3_whole milk, 2_other vegetables)

df = basket_sets

frequent_itemsets = apriori(basket_sets, min_support=0.002, use_colnames=True)

frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

	support	itemsets	length
0	0.006472	(1_UHT-milk)	1
1	0.030421	(1_beef)	1
2	0.011974	(1_berries)	1
3	0.008414	(1_beverages)	1
4	0.014887	(1_bottled beer)	1
...	...	...	...
844	0.002265	(6_whole milk, 3_pip fruit, 5_other vegetables)	3
845	0.002589	(4_other vegetables, 5_whole milk, 3_root vege...	3
846	0.002913	(5_yogurt, 4_curd, 3_whole milk)	3
847	0.003236	(6_whole milk, 4_root vegetables, 5_other vege...	3
848	0.002265	(6_whole milk, 7_butter, 5_other vegetables)	3

849 rows × 3 columns

frequent_itemsets[frequent_itemsets['length'] >= 3]

	support	itemsets	length
820	0.002589	(2_root vegetables, 1_beef, 3_other vegetables)	3
821	0.002589	(3_whole milk, 2_other vegetables, 1_chicken)	3
822	0.002589	(1_citrus fruit, 3_whole milk, 2_other vegetab...	3
823	0.003236	(1_citrus fruit, 2_tropical fruit, 3_pip fruit)	3
824	0.002589	(1_citrus fruit, 4_whole milk, 3_other vegetab...	3
825	0.002265	(1_frankfurter, 6_whole milk, 5_other vegetables)	3
826	0.002265	(3_other vegetables, 4_whole milk, 1_pork)	3
827	0.003560	(1_root vegetables, 3_whole milk, 2_other vege...	3
828	0.002589	(1_sausage, 2_rolls/buns, 3_soda)	3
829	0.002265	(3_other vegetables, 1_sausage, 4_whole milk)	3
830	0.002265	(1_sausage, 4_other vegetables, 5_whole milk)	3
831	0.002913	(1_tropical fruit, 3_whole milk, 2_other veget...	3
832	0.002265	(4_other vegetables, 5_whole milk, 2_citrus fr...	3
833	0.002265	(4_butter, 2_other vegetables, 3_whole milk)	3
834	0.003560	(4_curd, 3_whole milk, 2_other vegetables)	3
835	0.003883	(4_yogurt, 3_whole milk, 2_other vegetables)	3
836	0.002265	(3_whole milk, 2_other vegetables, 6_rolls/buns)	3
837	0.003236	(3_other vegetables, 4_whole milk, 2_pip fruit)	3
838	0.005825	(2_root vegetables, 4_whole milk, 3_other vege...	3
839	0.002265	(4_other vegetables, 2_tropical fruit, 3_pip f...	3
840	0.003560	(5_butter, 4_whole milk, 3_other vegetables)	3
841	0.002913	(3_other vegetables, 4_whole milk, 5_yogurt)	3
842	0.003560	(3_other vegetables, 6_yogurt, 4_whole milk)	3
843	0.002265	(3_pip fruit, 4_root vegetables, 5_other veget...	3
844	0.002265	(6_whole milk, 3_pip fruit, 5_other vegetables)	3
845	0.002589	(4_other vegetables, 5_whole milk, 3_root vege...	3
846	0.002913	(5_yogurt, 4_curd, 3_whole milk)	3
847	0.003236	(6_whole milk, 4_root vegetables, 5_other vege...	3
848	0.002265	(6_whole milk, 7_butter, 5_other vegetables)	3

Association Rules

Confidence

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules.head()

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(2_sausage)	(1_frankfurter)	0.011327	0.064401	0.011327	1.000000	15.527638	0.010597	inf
1	(7_pastry)	(1_frankfurter)	0.005178	0.064401	0.002589	0.500000	7.763819	0.002256	1.871197
2	(2_ham)	(1_sausage)	0.007120	0.076052	0.004531	0.636364	8.367505	0.003989	2.540858
3	(2_meat)	(1_sausage)	0.006796	0.076052	0.004854	0.714286	9.392097	0.004338	3.233819
4	(3_beef)	(1_sausage)	0.004854	0.076052	0.002589	0.533333	7.012766	0.002220	1.979889
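The rules table also reports leverage and conviction, which are not defined earlier. As a sanity check, the metrics of one row can be recomputed by hand from its three support values. The sketch below uses row 2, (2_ham) -> (1_sausage), and assumes the commonly used definitions leverage = support(A∪C) - support(A)·support(C) and conviction = (1 - support(C)) / (1 - confidence). The inputs are the rounded values printed in the table, so the results should agree with it only up to rounding.

```python
# Values copied from row 2 of the table above: (2_ham) -> (1_sausage).
antecedent_support = 0.007120
consequent_support = 0.076052
support = 0.004531

# Confidence and lift, following the formulas given earlier in the article.
confidence = support / antecedent_support
lift = confidence / consequent_support

# Assumed definitions for the two extra columns (see the lead-in).
leverage = support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 3), round(lift, 2), round(leverage, 5), round(conviction, 2))
```

When confidence is exactly 1, as in row 0, the conviction denominator is zero, which is why the table shows inf there.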

Lift

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.5)
rules.head()

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(1_beef)	(2_citrus fruit)	0.030421	0.028803	0.005502	0.180851	6.278986	0.004625	1.185618
1	(2_citrus fruit)	(1_beef)	0.028803	0.030421	0.005502	0.191011	6.278986	0.004625	1.198508
2	(1_beef)	(2_other vegetables)	0.030421	0.058900	0.003236	0.106383	1.806173	0.001444	1.053136
3	(2_other vegetables)	(1_beef)	0.058900	0.030421	0.003236	0.054945	1.806173	0.001444	1.025950
4	(2_root vegetables)	(1_beef)	0.036893	0.030421	0.005502	0.149123	4.902016	0.004379	1.139506

Lift and Confidence

rules[(rules['lift'] >= 5) & (rules['confidence'] >=0.5)]

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
94	(2_sausage)	(1_frankfurter)	0.011327	0.064401	0.011327	1.000000	15.527638	0.010597	inf
141	(7_pastry)	(1_frankfurter)	0.005178	0.064401	0.002589	0.500000	7.763819	0.002256	1.871197
243	(2_ham)	(1_sausage)	0.007120	0.076052	0.004531	0.636364	8.367505	0.003989	2.540858
247	(2_meat)	(1_sausage)	0.006796	0.076052	0.004854	0.714286	9.392097	0.004338	3.233819
262	(3_beef)	(1_sausage)	0.004854	0.076052	0.002589	0.533333	7.012766	0.002220	1.979889
...	...	...	...	...	...	...	...	...	...
962	(6_whole milk, 4_root vegetables)	(5_other vegetables)	0.003883	0.012621	0.003236	0.833333	66.025641	0.003187	5.924272
964	(4_root vegetables, 5_other vegetables)	(6_whole milk)	0.005178	0.009385	0.003236	0.625000	66.594828	0.003188	2.641640
968	(6_whole milk, 7_butter)	(5_other vegetables)	0.002913	0.012621	0.002265	0.777778	61.623932	0.002229	4.443204
970	(7_butter, 5_other vegetables)	(6_whole milk)	0.002589	0.009385	0.002265	0.875000	93.232759	0.002241	7.924919
972	(7_butter)	(6_whole milk, 5_other vegetables)	0.004207	0.007443	0.002265	0.538462	72.341137	0.002234	2.150539

76 rows × 9 columns

Original article: https://www.cnblogs.com/keepmoving1113/p/14321855.html
