Hands-on project: multi-class text classification with PySpark

Original article: https://cloud.tencent.com/developer/article/1096712

Building on the original author's work, I picked up some new concepts and added annotations along the way.

TARGET: classify San Francisco crime records (by their Crime Description) into 33 categories.

Source code and dataset: to be posted later.

1. Loading the dataset

import time
from pyspark.sql import SQLContext
from pyspark import SparkContext

# Load the CSV data directly via the spark-csv package
sc = SparkContext()
sqlContext = SQLContext(sc)
data = sqlContext.read.format('com.databricks.spark.csv') \
                 .options(header='true', inferschema='true') \
                 .load('train.csv')
# Sample ~1% of the rows (withReplacement=False, fraction=0.01, seed=100)
# to keep the runtime short
data = data.sample(False, 0.01, 100)
print(data.count())
Result:
    8703

1.1 Drop columns irrelevant to the task
# Drop the unneeded columns and show the first five rows
drop_list = ['Dates', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
data = data.select([column for column in data.columns if column not in drop_list])
data.show(5)

1.2 Show the schema

# printSchema() displays the structure of the data
data.printSchema()

Result:

    root
     |-- Category: string (nullable = true)
     |-- Descript: string (nullable = true)

1.3 The 20 most frequent crime categories
# The 20 crime categories with the highest counts
from pyspark.sql.functions import col
data.groupBy('Category').count().orderBy(col('count').desc()).show()

Result:

    +--------------------+-----+
    |            Category|count|
    +--------------------+-----+
    |       LARCENY/THEFT| 1725|
    |      OTHER OFFENSES| 1230|
    |        NON-CRIMINAL|  962|
    |             ASSAULT|  763|
    |       VEHICLE THEFT|  541|
    |       DRUG/NARCOTIC|  494|
    |           VANDALISM|  447|
    |            WARRANTS|  406|
    |            BURGLARY|  347|
    |      SUSPICIOUS OCC|  295|
    |      MISSING PERSON|  284|
    |             ROBBERY|  225|
    |               FRAUD|  159|
    |     SECONDARY CODES|  124|
    |FORGERY/COUNTERFE...|  109|
    |         WEAPON LAWS|   86|
    |            TRESPASS|   63|
    |        PROSTITUTION|   59|
    |  DISORDERLY CONDUCT|   54|
    |         DRUNKENNESS|   52|
    +--------------------+-----+
    only showing top 20 rows
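
As a quick sanity check on the 33 target classes, counting the distinct categories is cheap (a minimal sketch; note that a 1% sample may not contain all 33):

# Count the distinct crime categories; the full dataset has 33,
# but a small sample may contain fewer
print(data.select('Category').distinct().count())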
    

1.4 The 20 most frequent crime descriptions

# The 20 crime descriptions with the highest counts
data.groupBy('Descript').count().orderBy(col('count').desc()).show()
Result:

+--------------------+-----+
|            Descript|count|
+--------------------+-----+
|GRAND THEFT FROM ...|  569|
|       LOST PROPERTY|  323|
|             BATTERY|  301|
|   STOLEN AUTOMOBILE|  262|
|DRIVERS LICENSE, ...|  244|
|AIDED CASE, MENTA...|  223|
|      WARRANT ARREST|  222|
|PETTY THEFT FROM ...|  216|
|SUSPICIOUS OCCURR...|  211|
|MALICIOUS MISCHIE...|  184|
|   TRAFFIC VIOLATION|  168|
|THREATS AGAINST LIFE|  154|
|PETTY THEFT OF PR...|  152|
|      FOUND PROPERTY|  138|
|MALICIOUS MISCHIE...|  138|
|ENROUTE TO OUTSID...|  121|
|GRAND THEFT OF PR...|  115|
|MISCELLANEOUS INV...|  101|
|   DOMESTIC VIOLENCE|   99|
|        FOUND PERSON|   98|
+--------------------+-----+
only showing top 20 rows

2. Tokenizing the crime descriptions
2.1 Tokenize Descript: split into words, then remove stop words

The flow is similar to the scikit-learn version and has three steps:
1. regexTokenizer: split into words with a regular expression
2. stopwordsRemover: remove stop words
3. countVectors: build term-frequency vectors

RegexTokenizer: splits a document into words based on a regular expression
inputCol: input column name
outputCol: output column name
pattern: the regex pattern used to split the text

CountVectorizer: builds term-frequency vectors
vocabSize: maximum vocabulary size
minDF: if a float, a term must appear in at least this fraction of documents to enter the vocabulary;
if an int, it must appear in at least this many documents
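
A minimal sketch of how vocabSize and minDF behave on toy data (hypothetical values; sqlContext is the one created in section 1):

from pyspark.ml.feature import CountVectorizer

toy = sqlContext.createDataFrame(
    [(['stolen', 'automobile'],), (['stolen', 'bicycle'],), (['found', 'property'],)],
    ['filtered'])
# minDF=2 keeps only tokens that appear in at least 2 documents ('stolen')
toy_cv = CountVectorizer(inputCol='filtered', outputCol='features', vocabSize=10, minDF=2)
toy_model = toy_cv.fit(toy)
print(toy_model.vocabulary)        # ['stolen']
toy_model.transform(toy).show(truncate=False)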

from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression

# Split into words on non-word characters
# inputCol: input column name
# outputCol: output column name
regexTokenizer = RegexTokenizer(inputCol='Descript', outputCol='words', pattern='\\W')
# Stop words (note: setStopWords replaces the default list rather than extending it)
add_stopwords = ['http', 'https', 'amp', 'rt', 't', 'c', 'the']
stopwords_remover = StopWordsRemover(inputCol='words', outputCol='filtered').setStopWords(add_stopwords)
# Build term-frequency vectors
count_vectors = CountVectorizer(inputCol='filtered', outputCol='features', vocabSize=10000, minDF=5)

2.2 Encode the labels by frequency, with the most frequent label mapped to 0

StringIndexer
StringIndexer encodes a column of string labels into a column of label indices, ordered by how often each label occurs, so the most frequent label gets index 0.
In this example the labels are encoded as integers 0 through 32, with the most frequent label encoded as 0.
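
A minimal sketch of that encoding on toy data (hypothetical labels; sqlContext is the one created in section 1):

from pyspark.ml.feature import StringIndexer

toy = sqlContext.createDataFrame(
    [('ASSAULT',), ('ASSAULT',), ('FRAUD',)], ['Category'])
# ASSAULT is the most frequent label, so it is encoded as 0.0; FRAUD as 1.0
StringIndexer(inputCol='Category', outputCol='label').fit(toy).transform(toy).show()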

Pipeline is a high-level DataFrame-based API for building and debugging machine-learning workflows: it chains multiple stages so they run in sequence, which keeps the processing efficient and reproducible.

fit(): trains every stage on the DataFrame and returns a fitted PipelineModel (a Transformer)
transform(): applies the fitted stages, adding the new columns (tokens, features, label) to the DataFrame

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

label_stringIdx = StringIndexer(inputCol='Category', outputCol='label')
pipeline = Pipeline(stages=[regexTokenizer, stopwords_remover, count_vectors, label_stringIdx])
# Fit the pipeline to the training documents
pipeline_fit = pipeline.fit(data)
dataset = pipeline_fit.transform(data)
dataset.show(5)

Result:

    +---------------+--------------------+--------------------+--------------------+--------------------+-----+
    |       Category|            Descript|               words|            filtered|            features|label|
    +---------------+--------------------+--------------------+--------------------+--------------------+-----+
    |  LARCENY/THEFT|GRAND THEFT FROM ...|[grand, theft, fr...|[grand, theft, fr...|(309,[0,2,3,4,6],...|  0.0|
    |  VEHICLE THEFT|   STOLEN AUTOMOBILE|[stolen, automobile]|[stolen, automobile]|(309,[9,27],[1.0,...|  4.0|
    |   NON-CRIMINAL|      FOUND PROPERTY|   [found, property]|   [found, property]|(309,[5,32],[1.0,...|  2.0|
    |SECONDARY CODES|   JUVENILE INVOLVED|[juvenile, involved]|[juvenile, involved]|(309,[67,218],[1....| 13.0|
    | OTHER OFFENSES|DRIVERS LICENSE, ...|[drivers, license...|[drivers, license...|(309,[14,23,28,30...|  1.0|
    +---------------+--------------------+--------------------+--------------------+--------------------+-----+
    only showing top 5 rows
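
The fitted stages can also be inspected directly, which explains the 309-dimensional feature vectors and the label values shown above (a small sketch relying on the stage order defined in the pipeline):

# Stage 2 is the fitted CountVectorizerModel, stage 3 the StringIndexerModel
cv_model = pipeline_fit.stages[2]
label_model = pipeline_fit.stages[3]
print(len(cv_model.vocabulary))   # 309 terms in this sample's vocabulary
print(label_model.labels[:5])     # labels ordered by frequency, index 0 first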

3. Train/test split
# Set a seed for reproducibility
# Split into training and test sets, 7:3, with seed 100
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
print('Training Dataset Count: {}'.format(trainingData.count()))
print('Test Dataset Count: {}'.format(testData.count()))

Result:

    Training Dataset Count:6117
    Test Dataset Count:2586

4. Model training and evaluation
4.1 Logistic regression on term-frequency features
Score the model on the test set and look at the 10 predictions with the highest probabilities:

LogisticRegression: logistic regression model
maxIter: maximum number of iterations
regParam: regularization strength
elasticNetParam: elastic-net mixing ratio between the L1 and L2 penalties; 0 gives pure L2, 1 gives pure L1

start_time = time.time()
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(testData)
# Keep only the rows predicted as class 0, sorted by probability
predictions.filter(predictions['prediction'] == 0) \
    .select('Descript', 'Category', 'probability', 'label', 'prediction') \
    .orderBy('probability', ascending=False) \
    .show(n=10, truncate=30)

Result:

    +--------------------------+--------+------------------------------+-----+----------+
    |                  Descript|Category|                   probability|label|prediction|
    +--------------------------+--------+------------------------------+-----+----------+
    |        ARSON OF A VEHICLE|   ARSON|[0.1194196587417514,0.10724...| 26.0|       0.0|
    |        ARSON OF A VEHICLE|   ARSON|[0.1194196587417514,0.10724...| 26.0|       0.0|
    |        ARSON OF A VEHICLE|   ARSON|[0.1194196587417514,0.10724...| 26.0|       0.0|
    |           ATTEMPTED ARSON|   ARSON|[0.12978385966276762,0.1084...| 26.0|       0.0|
    |     CREDIT CARD, THEFT OF|   FRAUD|[0.21637136655265077,0.0836...| 12.0|       0.0|
    |     CREDIT CARD, THEFT OF|   FRAUD|[0.21637136655265077,0.0836...| 12.0|       0.0|
    |     CREDIT CARD, THEFT OF|   FRAUD|[0.21637136655265077,0.0836...| 12.0|       0.0|
    |     CREDIT CARD, THEFT OF|   FRAUD|[0.21637136655265077,0.0836...| 12.0|       0.0|
    |     CREDIT CARD, THEFT OF|   FRAUD|[0.21637136655265077,0.0836...| 12.0|       0.0|
    |ARSON OF A VACANT BUILDING|   ARSON|[0.22897903829071928,0.0980...| 26.0|       0.0|
    +--------------------------+--------+------------------------------+-----+----------+
    only showing top 10 rows

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# predictionCol: name of the prediction column
# Note: with no metricName set, this evaluator reports F1, not accuracy
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
print(evaluator.evaluate(predictions))
end_time = time.time()
print(end_time - start_time)

Result:

    0.9641817609126011
    8.245999813079834
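
The 0.964 above is therefore an F1 score; to report accuracy explicitly, pass metricName (a small sketch on the same predictions, assuming Spark 2.x where 'accuracy' is a supported metric):

acc_evaluator = MulticlassClassificationEvaluator(predictionCol='prediction',
                                                  metricName='accuracy')
print(acc_evaluator.evaluate(predictions))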

4.2 Logistic regression on TF-IDF features
from pyspark.ml.feature import HashingTF, IDF

start_time = time.time()
# numFeatures: number of hash buckets, i.e. the feature dimension
hashingTF = HashingTF(inputCol='filtered', outputCol='rawFeatures', numFeatures=10000)
# minDocFreq: ignore terms that appear in fewer documents than this
idf = IDF(inputCol='rawFeatures', outputCol='features', minDocFreq=5)
pipeline = Pipeline(stages=[regexTokenizer, stopwords_remover, hashingTF, idf, label_stringIdx])
pipeline_fit = pipeline.fit(data)
dataset = pipeline_fit.transform(data)
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)

lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
lr_model = lr.fit(trainingData)
predictions = lr_model.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
    .select('Descript', 'Category', 'probability', 'label', 'prediction') \
    .orderBy('probability', ascending=False) \
    .show(n=10, truncate=30)

Result:

    +----------------------------+-------------+------------------------------+-----+----------+
    |                    Descript|     Category|                   probability|label|prediction|
    +----------------------------+-------------+------------------------------+-----+----------+
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...|  0.0|       0.0|
    +----------------------------+-------------+------------------------------+-----+----------+
    only showing top 10 rows
    
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
print(evaluator.evaluate(predictions))
end_time = time.time()
print(end_time - start_time)

Result:

    0.9653361434618551
    12.998999834060669
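
Unlike CountVectorizer, HashingTF keeps no vocabulary: each token is hashed into one of numFeatures buckets, so distinct tokens can collide and the mapping cannot be inverted. A minimal sketch on toy data (reusing sqlContext from section 1):

from pyspark.ml.feature import HashingTF

toy = sqlContext.createDataFrame([(['stolen', 'automobile', 'stolen'],)], ['filtered'])
# With only 16 buckets the indices are hash values, not vocabulary positions;
# 'stolen' appears twice, so its bucket gets the value 2.0
HashingTF(inputCol='filtered', outputCol='rawFeatures', numFeatures=16) \
    .transform(toy).show(truncate=False)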

4.3 Cross-validation
Use cross-validation to tune the hyper-parameters, here for the logistic regression on term-frequency features.
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

start_time = time.time()
pipeline = Pipeline(stages=[regexTokenizer, stopwords_remover, count_vectors, label_stringIdx])
pipeline_fit = pipeline.fit(data)
# Re-derive the dataset from this pipeline so we tune on the
# term-frequency features, not the TF-IDF ones left over from 4.2
dataset = pipeline_fit.transform(data)
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
# Build the parameter grid for cross-validation
# ParamGridBuilder: builds a grid of parameters for grid-search model selection
# addGrid: registers a parameter together with its list of candidate values
# regParam: regularization strength
# maxIter: number of iterations
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.3, 0.5])
             .addGrid(lr.elasticNetParam, [0.0, 0.1, 0.2])
             .addGrid(lr.maxIter, [10, 20, 50])
             .build())

# Five-fold cross-validation
# estimator: the estimator to cross-validate
# estimatorParamMaps: the parameter grid to search
# evaluator: the metric used to pick the best parameters
# numFolds: number of folds
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=5)
cv_model = cv.fit(trainingData)
predictions = cv_model.transform(testData)

# Evaluate the tuned model
print(evaluator.evaluate(predictions))
end_time = time.time()
print(end_time - start_time)

Result:

    0.9807684755923513
    368.97300004959106
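
To see which combination won, the fitted CrossValidatorModel keeps one averaged metric per grid entry, in grid order (a small sketch):

# avgMetrics is aligned with the parameter grid, so the best index
# identifies the winning parameter combination
best_idx = max(range(len(cv_model.avgMetrics)), key=lambda i: cv_model.avgMetrics[i])
print(cv_model.avgMetrics[best_idx])
print(cv_model.getEstimatorParamMaps()[best_idx])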

4.4 Naive Bayes
from pyspark.ml.classification import NaiveBayes

start_time = time.time()
# smoothing: additive (Laplace) smoothing parameter
nb = NaiveBayes(smoothing=1)
model = nb.fit(trainingData)
predictions = model.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
    .select('Descript', 'Category', 'probability', 'label', 'prediction') \
    .orderBy('probability', ascending=False) \
    .show(n=10, truncate=30)

Result:

    +----------------------+-------------+------------------------------+-----+----------+
    |              Descript|     Category|                   probability|label|prediction|
    +----------------------+-------------+------------------------------+-----+----------+
    |   PETTY THEFT BICYCLE|LARCENY/THEFT|[1.0,1.236977662838925E-20,...|  0.0|       0.0|
    |   PETTY THEFT BICYCLE|LARCENY/THEFT|[1.0,1.236977662838925E-20,...|  0.0|       0.0|
    |   PETTY THEFT BICYCLE|LARCENY/THEFT|[1.0,1.236977662838925E-20,...|  0.0|       0.0|
    |GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...|  0.0|       0.0|
    |GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...|  0.0|       0.0|
    |GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...|  0.0|       0.0|
    |GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...|  0.0|       0.0|
    |GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...|  0.0|       0.0|
    |GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...|  0.0|       0.0|
    |GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...|  0.0|       0.0|
    +----------------------+-------------+------------------------------+-----+----------+
    only showing top 10 rows

evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
print(evaluator.evaluate(predictions))
end_time = time.time()
print(end_time - start_time)

Result:

    0.977432832447723
    5.371000051498413

4.5 Random forest
from pyspark.ml.classification import RandomForestClassifier

start_time = time.time()
# numTrees: number of trees to train
# maxDepth: maximum depth of each tree
# maxBins: maximum number of bins for discretizing continuous features
rf = RandomForestClassifier(labelCol='label',
                            featuresCol='features',
                            numTrees=100,
                            maxDepth=4,
                            maxBins=32)
# Train the model on the training data
rfModel = rf.fit(trainingData)
predictions = rfModel.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
    .select('Descript', 'Category', 'probability', 'label', 'prediction') \
    .orderBy('probability', ascending=False) \
    .show(n=10, truncate=30)

Result:

    +----------------------------+-------------+------------------------------+-----+----------+
    |                    Descript|     Category|                   probability|label|prediction|
    +----------------------------+-------------+------------------------------+-----+----------+
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...|  0.0|       0.0|
    |PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...|  0.0|       0.0|
    +----------------------------+-------------+------------------------------+-----+----------+
    only showing top 10 rows

evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
print(evaluator.evaluate(predictions))
end_time = time.time()
print(end_time - start_time)

Result:

    0.27929770811242954
    36.63699984550476
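
One way to see why the forest scores so poorly here is to map its feature importances back to the vocabulary terms (a sketch, assuming the current dataset came from the count-vector pipeline refit in 4.3):

# Pair each vocabulary term with the forest's importance for its feature index
vocab = pipeline_fit.stages[2].vocabulary
importances = rfModel.featureImportances
top10 = sorted(enumerate(vocab), key=lambda kv: importances[kv[0]], reverse=True)[:10]
print(top10)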

The results above show that the random forest is an excellent, robust general-purpose model, but it is not a good choice for high-dimensional sparse data such as these text features. The clear winner here is the cross-validated logistic regression.

One caveat about that choice: cross-validation makes training much slower, so in a real application pick the model that best fits the business constraints.









