zoukankan      html  css  js  c++  java
  • spark mllib lda 简单示例

    舆情系统每日热词用到了lda主题聚类

    原先的版本是python项目,分词应用Jieba,LDA应用Gensim

    项目工作良好

    有以下几点问题


    1 舆情产品基于elasticsearch大数据,es内应用lucene分词,python的jieba分词和lucene分词结果并不一致(或需额外的工作保持一致),早期需求只是展示每日热词,分词不一致并不是个问题,现在的新的需求,要求lda和数据无缝结合,es集成jieba,再把es内的数据全用全量数据重新分词,考虑工作量和技术难度上都不现实,只好改lda的分词算法了(实际应用上,不同的分词算法在lda提取主题和热词的场景下几乎没有影响)

    2 python项目限于单点计算,不好扩展

    应用lucene分词,再计算lda,切换至java的技术栈是最简单的办法

    java下也有很多lda的算法实现,只作可行性验证

    这篇主要调研的是spark-mllib,官方有lda实现

    http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda


    ls examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala

    root@wx-social-consume2:/usr/spark-2.4.0# ./bin/run-example mllib.LDAExample
    2019-03-21 10:28:07 WARN Utils:66 - Your hostname, wx-social-consume2 resolves to a loopback address: 127.0.0.1; using 10.10.3.47 instead (on interface em1)
    2019-03-21 10:28:07 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
    2019-03-21 10:28:07 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Error: Missing argument <input>...
    LDAExample: an example LDA app for plain text data.
    Usage: LDAExample [options] <input>...

    --k <value> number of topics. default: 20
    --maxIterations <value> number of iterations of learning. default: 10
    --docConcentration <value>
    amount of topic smoothing to use (> 1.0) (-1=auto). default: -1.0
    --topicConcentration <value>
    amount of term (word) smoothing to use (> 1.0) (-1=auto). default: -1.0
    --vocabSize <value> number of distinct word types to use, chosen by frequency. (-1=all) default: 10000
    --stopwordFile <value> filepath for a list of stopwords. Note: This must fit on a single machine. default:
    --algorithm <value> inference algorithm to use. em and online are supported. default: em
    --checkpointDir <value> Directory for checkpointing intermediate results. Checkpointing helps with recovery and eliminates temporary shuffle files on disk. default: None
    --checkpointInterval <value>
    Iterations between each checkpoint. Only used if checkpointDir is set. default: 10
    <input>... input paths (directories) to plain text corpora. Each text file line should hold 1 document.
    2019-03-21 10:28:10 INFO ShutdownHookManager:54 - Shutdown hook called
    2019-03-21 10:28:10 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-0022e162-fa2a-4e0e-8e89-c78ea7962dd1


    官方示例
    Refer to the LDA Scala docs and DistributedLDAModel Scala docs for details on the API.

    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.Vectors

    // Load and parse the data
    val data = sc.textFile("data/mllib/sample_lda_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
    // Index documents with unique IDs
    val corpus = parsedData.zipWithIndex.map(_.swap).cache()

    // Cluster the documents into three topics using LDA
    val ldaModel = new LDA().setK(3).run(corpus)

    // Output topics. Each is a distribution over words (matching word count vectors)
    println(s"Learned topics (as distributions over vocab of ${ldaModel.vocabSize} words):")
    val topics = ldaModel.topicsMatrix
    for (topic <- Range(0, 3)) {
    print(s"Topic $topic :")
    for (word <- Range(0, ldaModel.vocabSize)) {
    print(s"${topics(word, topic)}")
    }
    println()
    }

    // Save and load model.
    ldaModel.save(sc, "target/org/apache/spark/LatentDirichletAllocationExample/LDAModel")
    val sameModel = DistributedLDAModel.load(sc,
    "target/org/apache/spark/LatentDirichletAllocationExample/LDAModel")


    文件为"data/mllib/sample_lda_data.txt"

    root@wx-social-consume2:/usr/spark-2.4.0# ./bin/run-example mllib.LDAExample data/mllib/sample_lda_data.txt
    2019-03-21 10:29:55 WARN Utils:66 - Your hostname, wx-social-consume2 resolves to a loopback address: 127.0.0.1; using 10.10.3.47 instead (on interface em1)
    2019-03-21 10:29:55 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
    2019-03-21 10:29:56 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2019-03-21 10:29:58 INFO SparkContext:54 - Running Spark version 2.4.0
    2019-03-21 10:29:58 INFO SparkContext:54 - Submitted application: LDAExample with {
    input: List(data/mllib/sample_lda_data.txt),
    k: 20,
    maxIterations: 10,
    docConcentration: -1.0,
    topicConcentration: -1.0,
    vocabSize: 10000,
    stopwordFile: ,
    algorithm: em,
    checkpointDir: None,
    checkpointInterval: 10
    }
    2019-03-21 10:29:58 INFO SecurityManager:54 - Changing view acls to: root
    2019-03-21 10:29:58 INFO SecurityManager:54 - Changing modify acls to: root
    2019-03-21 10:29:58 INFO SecurityManager:54 - Changing view acls groups to:
    2019-03-21 10:29:58 INFO SecurityManager:54 - Changing modify acls groups to:
    2019-03-21 10:29:58 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
    2019-03-21 10:29:58 INFO Utils:54 - Successfully started service 'sparkDriver' on port 63504.
    2019-03-21 10:29:58 INFO SparkEnv:54 - Registering MapOutputTracker
    2019-03-21 10:29:58 INFO SparkEnv:54 - Registering BlockManagerMaster
    2019-03-21 10:29:58 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    2019-03-21 10:29:58 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
    2019-03-21 10:29:58 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-ac172871-a0e5-41af-9e76-a09ea01c3428
    2019-03-21 10:29:58 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB
    2019-03-21 10:29:58 INFO SparkEnv:54 - Registering OutputCommitCoordinator
    2019-03-21 10:29:59 INFO log:192 - Logging initialized @4022ms
    2019-03-21 10:29:59 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
    2019-03-21 10:29:59 INFO Server:419 - Started @4109ms
    2019-03-21 10:29:59 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    2019-03-21 10:29:59 INFO AbstractConnector:278 - Started ServerConnector@1386313f{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
    2019-03-21 10:29:59 INFO Utils:54 - Successfully started service 'SparkUI' on port 4041.
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1921994e{/jobs,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a8df0b3{/jobs/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7c112f5f{/jobs/job,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4fd05028{/jobs/job/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3a2d3909{/stages,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4fb392c4{/stages/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@194d329e{/stages/stage,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@541179e7{/stages/stage/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@24386839{/stages/pool,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7b32b129{/stages/pool/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@439e3cb4{/storage,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1c9fbb61{/storage/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7b81616b{/storage/rdd,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15d42ccb{/storage/rdd/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@279dd959{/environment,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@46383a78{/environment/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@36c281ed{/executors,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@244418a{/executors/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4b5a078a{/executors/threadDump,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4c361f63{/executors/threadDump/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6ed922e1{/static,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@35f3a22c{/,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a0c5e9{/api,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@32e652b6{/jobs/job/kill,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4ba02375{/stages/stage/kill,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://10.10.3.47:4041
    2019-03-21 10:29:59 INFO SparkContext:54 - Added JAR file:///usr/spark-2.4.0/examples/jars/scopt_2.11-3.7.0.jar at spark://10.10.3.47:63504/jars/scopt_2.11-3.7.0.jar with timestamp 1553164199210
    2019-03-21 10:29:59 INFO SparkContext:54 - Added JAR file:///usr/spark-2.4.0/examples/jars/spark-examples_2.11-2.4.0.jar at spark://10.10.3.47:63504/jars/spark-examples_2.11-2.4.0.jar with timestamp 1553164199211
    2019-03-21 10:29:59 INFO Executor:54 - Starting executor ID driver on host localhost
    2019-03-21 10:29:59 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 29540.
    2019-03-21 10:29:59 INFO NettyBlockTransferService:54 - Server created on 10.10.3.47:29540
    2019-03-21 10:29:59 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
    2019-03-21 10:29:59 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, 10.10.3.47, 29540, None)
    2019-03-21 10:29:59 INFO BlockManagerMasterEndpoint:54 - Registering block manager 10.10.3.47:29540 with 366.3 MB RAM, BlockManagerId(driver, 10.10.3.47, 29540, None)
    2019-03-21 10:29:59 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, 10.10.3.47, 29540, None)
    2019-03-21 10:29:59 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, 10.10.3.47, 29540, None)
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6587305a{/metrics/json,null,AVAILABLE,@Spark}

    Corpus summary:
    Training set size: 12 documents
    Vocabulary size: 10 terms
    Training set size: 62 tokens
    Preprocessing time: 5.008004889 sec

    Finished training LDA model. Summary:
    Training time: 3.565145577 sec
    Training data average log likelihood: -20.14862821928427

    20 topics:
    TOPIC 0
    0 0.2776616245970018
    1 0.2467437437143153
    2 0.162799944092254
    3 0.14357469330901143
    4 0.06928101473026803
    9 0.04699050355830519
    5 0.030203661449368247
    6 0.007640498460509008
    7 0.007552251606432597
    8 0.007552064482534436

    TOPIC 1
    0 0.36469051153508564
    1 0.18646127739213272
    2 0.15522432783704704
    3 0.13106181033451642
    4 0.06584596207691033
    9 0.04395627892970295
    5 0.030062711796482375
    8 0.007597985215275549
    7 0.00759765588135554
    6 0.007501479001491501

    TOPIC 2
    0 0.31931887563375516
    1 0.1962082806692423
    2 0.17231677857837657
    3 0.14159208272582544
    4 0.07269228847838642
    9 0.044804492369582984
    5 0.030074171215304125
    7 0.007707667829625457
    8 0.007707603060281631
    6 0.007577759439619957

    TOPIC 3
    0 0.33172427138380095
    1 0.2099908538476618
    2 0.1529315801366396
    3 0.13055578109706914
    4 0.07654434907166915
    9 0.045380994421043694
    5 0.03023101958436096
    8 0.007550809491464853
    7 0.007550747971239613
    6 0.00753959299505035

    TOPIC 4
    0 0.32678177168416583
    1 0.2207731082949569
    2 0.1515085115377772
    3 0.13531105489561282
    4 0.06780498015453475
    9 0.04539833136958728
    5 0.02983504958532268
    6 0.007557895637294794
    7 0.007514688233324302
    8 0.007514608607423405

    TOPIC 5
    0 0.3239146979457089
    1 0.21573894993478243
    2 0.14525808340440868
    3 0.14073654896536467
    4 0.07677900991703532
    9 0.04501240647651081
    5 0.03001971166224193
    6 0.007555937737508214
    7 0.00749238505224968
    8 0.007492268904189478

    TOPIC 6
    0 0.31540967245137175
    1 0.2455372478674326
    2 0.15387618434880607
    3 0.11776433445702658
    4 0.06958508618823488
    9 0.0454694806523595
    5 0.029862323423880312
    6 0.007550844343998941
    7 0.007472470469363186
    8 0.007472355797526257

    TOPIC 7
    0 0.3288373150630124
    1 0.1984831879713159
    2 0.14673351329227885
    3 0.13721294614635954
    4 0.09099655027113343
    9 0.04408525595608986
    5 0.031035301494144973
    8 0.007542328434232405
    7 0.007542260756614362
    6 0.007531340614818258

    TOPIC 8
    0 0.3193586979466173
    2 0.19372032290375615
    1 0.17385196290359722
    3 0.15067297324203144
    4 0.06431584983802775
    9 0.04459867057873216
    5 0.030109171006041484
    7 0.007889712095309039
    8 0.007889665041634011
    6 0.007592974444253261

    TOPIC 9
    0 0.30194242995872983
    1 0.19005321586219182
    2 0.18325896599060398
    3 0.13814239333008232
    4 0.08613769346253837
    9 0.04615261626883708
    5 0.031097388101704492
    8 0.007811116080942582
    7 0.00781095323650845
    6 0.007593227707861054

    TOPIC 10
    0 0.3018333962608689
    1 0.2101327107384649
    2 0.17976426989184977
    3 0.13028433691427393
    4 0.0785628713958075
    9 0.04589483819874114
    5 0.030473519782385595
    8 0.007732641246914153
    7 0.007732610576531091
    6 0.007588804994162942

    TOPIC 11
    0 0.2731844012758353
    1 0.2437828502953355
    2 0.15602289141674863
    3 0.14124037120394928
    4 0.08621787369766198
    9 0.04595175938704448
    5 0.03092529389836489
    6 0.007624331986973666
    8 0.007525230917852694
    7 0.007524995920233587

    TOPIC 12
    0 0.2754166406308875
    1 0.23790442921451685
    2 0.16407059546150252
    3 0.15224175554500483
    4 0.06911748308147463
    9 0.04812626048120002
    5 0.030286734901316135
    6 0.007658288924620451
    8 0.007588920737731278
    7 0.0075888910217458685

    TOPIC 13
    0 0.30106045484425825
    1 0.24631760902734479
    3 0.14410465573241527
    2 0.12524822006978883
    4 0.08639636802254201
    9 0.04423754911967344
    5 0.030456963732366213
    6 0.007563767806759953
    7 0.00730722647659661
    8 0.0073071851682546506

    TOPIC 14
    0 0.31562044416659885
    1 0.2526998226062497
    2 0.14535277752245734
    3 0.12218888456011873
    4 0.0662622160602047
    9 0.04569769393216447
    5 0.02981990651641202
    6 0.007553483942629782
    7 0.007402409879010876
    8 0.00740236081415358

    TOPIC 15
    0 0.32058662644832936
    1 0.20394629563372832
    2 0.14716518634378675
    3 0.1428368219219527
    4 0.08608013435759047
    9 0.046286592753669746
    5 0.030458243185553725
    6 0.007557253951254248
    8 0.007541483951171858
    7 0.007541361452962736

    TOPIC 16
    0 0.34720661993137836
    1 0.19335921917713203
    2 0.1574642973682748
    3 0.1315805755079897
    4 0.07262519571107433
    9 0.044460628102199806
    5 0.03055492723901837
    7 0.0076115120495338995
    8 0.007611307835359196
    6 0.0075257170780395

    TOPIC 17
    0 0.276113881630822
    1 0.23014600960245993
    2 0.1791490549120877
    3 0.13887566068288787
    4 0.0755460207226412
    9 0.04662339718458672
    5 0.03051544813778674
    7 0.007695876502113754
    8 0.007695757393934347
    6 0.00763889323067976

    TOPIC 18
    0 0.3143607517778961
    1 0.23372981830486098
    2 0.15462561577264133
    3 0.12748419732486796
    4 0.07250495549430869
    9 0.04461512010586714
    5 0.030101883279393397
    6 0.007561984484167818
    8 0.007507862285458234
    7 0.007507811170538331

    TOPIC 19
    0 0.27641365880548574
    1 0.25811740868556055
    2 0.15554256853273
    3 0.1299929833331776
    4 0.08206677473861888
    9 0.04536961458219285
    5 0.029948048909439938
    6 0.007602171935254187
    7 0.007473418458492894
    8 0.007473352019047361


    官方用例测试通过

    ./bin/run-example mllib.LDAExample data/mllib/sample_lda_data.txt --k=3

    Corpus summary:
    Training set size: 12 documents
    Vocabulary size: 10 terms
    Training set size: 62 tokens
    Preprocessing time: 3.952237148 sec

    Finished training LDA model. Summary:
    Training time: 2.096613705 sec
    Training data average log likelihood: -19.94356723200465

    3 topics:
    TOPIC 0
    0 0.3859910534194794
    1 0.27784612698328287
    3 0.11309885992982183
    2 0.07754302744583665
    4 0.05944898611629052
    9 0.03956337141492645
    5 0.026961271416303733
    6 0.00729179016688132
    7 0.006153234509962162
    8 0.006102278597215016

    TOPIC 1
    0 0.27760966762034767
    2 0.20133458026264603
    1 0.17674769647471528
    3 0.17119003866520263
    4 0.05856966790108555
    9 0.054142861315096644
    5 0.036444242827270906
    6 0.008131736037023057
    7 0.008064284551869805
    8 0.007765224344742546

    TOPIC 2
    0 0.26781477963691924
    1 0.20419276965972555
    2 0.1988283635339889
    3 0.12491741400170082
    4 0.10934994319103938
    9 0.0426866734196486
    5 0.027519743427499636
    8 0.008867804538854353
    7 0.008517400571616986
    6 0.007305108019006569

    看sample_lda_data.txt的内容
    root@wx-social-consume2:/usr/spark-2.4.0# cat data/mllib/sample_lda_data.txt
    1 2 6 0 2 3 1 1 0 0 3
    1 3 0 1 3 0 0 2 0 0 1
    1 4 1 0 0 4 9 0 1 2 0
    2 1 0 3 0 0 5 0 2 3 9
    3 1 1 9 3 0 2 0 0 1 3
    4 2 0 3 4 5 1 1 1 4 0
    2 1 0 3 0 0 5 0 2 2 9
    1 1 1 9 2 1 2 0 0 1 3
    4 4 0 3 4 2 1 3 0 0 0
    2 8 2 0 3 0 2 0 2 7 2
    1 1 1 9 0 2 2 0 0 3 3
    4 1 0 0 4 5 1 3 0 1 0

    矩阵11行,怎么只输出了10列,理解有错?
    TOPIC 2
    0 0.26781477963691924
    1 0.20419276965972555
    2 0.1988283635339889
    3 0.12491741400170082
    4 0.10934994319103938
    9 0.0426866734196486
    5 0.027519743427499636
    8 0.008867804538854353
    7 0.008517400571616986
    6 0.007305108019006569

    看代码
    val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 10)
    是只取出了top10的词,理解应该没问题


    离线单点计算

  • 相关阅读:
    关于网络调试助手
    阿里云之设备连接方法学习
    阿里云学习
    Jquery ThickBox的使用
    推荐几款制作网页滚动动画的 JavaScript 库
    Javascript动态操作CSS总结
    css3动画属性系列之transform细讲scale缩放
    JS函数重载解决方案
    从Java开发者的视角解释JavaScript
    理解JavaScript中的事件路由冒泡过程及委托代理机制
  • 原文地址:https://www.cnblogs.com/zihunqingxin/p/10591799.html
Copyright © 2011-2022 走看看