  • task04: Paper category classification


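    4.1 Task description

    • Topic: paper classification (a data-modeling task): build a model on existing data and use it to assign categories to new papers;
    • Task: classify papers into categories using their titles;
    • Outcome: learn the basic methods of text classification (TF-IDF, FastText, Word2Vec, BERT)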

    4.2 Data processing steps


    * Process the paper titles and abstracts;
    * Process the paper categories;
    * Build the text classification model;

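    4.3 Text classification approaches

    • Approach 1: TF-IDF + a machine-learning classifier


    • Approach 2: FastText


    • Approach 3: Word2Vec + a deep-learning classifier


    • Approach 4: BERT word embeddings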
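To make approach 1 more concrete, here is a minimal sketch of how TF-IDF weights can be computed by hand. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which is what scikit-learn's `TfidfVectorizer` applies with its default settings. The function name `tfidf_vectors` and the toy titles are illustrative assumptions, and the whitespace tokenisation is simpler than the library's.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Toy titles (hypothetical, not taken from the dataset).
titles = [
    "quantum field theory on a lattice",
    "graph algorithms for sparse graph decompositions",
    "lattice simulations of quantum spin systems",
]
vocab, rows = tfidf_vectors(titles)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first title, "field" occurs in one document only and therefore receives a larger weight than "quantum", which also appears in the third title. A classifier trained on such vectors sees rare, topic-specific words emphasised over common ones.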


    4.4 Code

    import seaborn as sns
    from bs4 import BeautifulSoup
    import re
    import requests
    import json
    import pandas as pd
    import matplotlib.pyplot as plt
    from tqdm import tqdm
    D:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3146: DtypeWarning: Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
       title                                              categories       abstract
    0  Calculation of prompt diphoton production cros...  hep-ph           A fully differential calculation in perturba...
    1  Sparsity-certifying Graph Decompositions           math.CO cs.CG    We describe a new algorithm, the $(k,\ell)$-...
    2  The evolution of the Earth-Moon system based o...  physics.gen-ph   The evolution of Earth-Moon system is descri...
    3  A determinant of Stirling cycle numbers counts...  math.CO          We show that a determinant of Stirling cycle...
    4  From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...  math.CA math.FA  In this paper we show how to compute the $\L...
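    # Assumed reconstruction: the source omits the merge and drop lines and cuts the replace call off after "replace('".
    # A space is used as the replacement so words stay separated; the sample output shows fused words, so the original run may have used ''.
    data['text'] = data['title'] + data['abstract']
    data['text'] = data['text'].apply(lambda x: x.replace('\n', ' '))
    data['text'] = data['text'].apply(lambda x: x.lower())
    data = data.drop(['abstract', 'title'], axis=1)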
       categories       text
    0  hep-ph           calculation of prompt diphoton production cros...
    1  math.CO cs.CG    sparsity-certifying graph decompositions we d...
    2  physics.gen-ph   the evolution of the earth-moon system based o...
    3  math.CO          a determinant of stirling cycle numbers counts...
    4  math.CA math.FA  from dyadic $\lambda_{\alpha}$ to $\lambda_{\a...
    'calculation of prompt diphoton production cross sections at tevatron and  lhc energies  a fully differential calculation in perturbative quantum chromodynamics ispresented for the production of massive photon pairs at hadron colliders. allnext-to-leading order perturbative contributions from quark-antiquark,gluon-(anti)quark, and gluon-gluon subprocesses are included, as well asall-orders resummation of initial-state gluon radiation valid atnext-to-next-to-leading logarithmic accuracy. the region of phase space isspecified in which the calculation is most reliable. good agreement isdemonstrated with data from the fermilab tevatron, and predictions are made formore detailed tests with cdf and do data. predictions are shown fordistributions of diphoton pairs produced at the energy of the large hadroncollider (lhc). distributions of the diphoton pairs from the decay of a higgsboson are contrasted with those produced from qcd processes at the lhc, showingthat enhanced sensitivity to the signal can be obtained with judiciousselection of events.'
    data['categories']=data['categories'].apply(lambda x:x.split(' '))
    data['categories_big']=data['categories'].apply(lambda x:[xx.split('.')[0] for xx in x])
       categories          text                                               categories_big
    0  [hep-ph]            calculation of prompt diphoton production cros...  [hep-ph]
    1  [math.CO, cs.CG]    sparsity-certifying graph decompositions we d...   [math, cs]
    2  [physics.gen-ph]    the evolution of the earth-moon system based o...  [physics]
    3  [math.CO]           a determinant of stirling cycle numbers counts...  [math]
    4  [math.CA, math.FA]  from dyadic $\lambda_{\alpha}$ to $\lambda_{\a...  [math, math]
    from sklearn.preprocessing import MultiLabelBinarizer
    mlb = MultiLabelBinarizer()
    data_label = mlb.fit_transform(data['categories_big'].iloc[:])
    array([[0, 0, 0, ..., 0, 0, 0],
           [0, 0, 1, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 1, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 1, 0, ..., 0, 0, 0]])
    [0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
    [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
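The 0/1 rows above are easier to read once each column is tied back to a category name. The sketch below reproduces the idea behind multi-label binarisation in plain Python: collect the distinct labels, sort them, and give each one a column. The function name `binarize` is a hypothetical helper; with scikit-learn itself, the sorted label names are available as `mlb.classes_` after fitting.

```python
def binarize(label_lists):
    """Return (classes, matrix) with one 0/1 column per distinct label."""
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: i for i, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            # A repeated label such as ['math', 'math'] just sets the same cell twice.
            row[column[label]] = 1
        matrix.append(row)
    return classes, matrix


# The five top-level category lists shown in the table above.
categories_big = [["hep-ph"], ["math", "cs"], ["physics"], ["math"], ["math", "math"]]
classes, matrix = binarize(categories_big)
print(classes)
for row in matrix:
    print(row)
```

Because the columns follow the sorted label names, the numeric labels in a classification report can be mapped back to categories by position.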



    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer(max_features=4000)
    data_tfidf = vectorizer.fit_transform(data['text'].iloc[:])
    # Split into training and validation sets
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(data_tfidf, data_label,test_size = 0.2,random_state = 1)
    # Build the multi-label classification model
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.naive_bayes import MultinomialNB
    clf = MultiOutputClassifier(MultinomialNB()).fit(x_train, y_train)


    from sklearn.metrics import classification_report
    print(classification_report(y_test, clf.predict(x_test)))
                  precision    recall  f1-score   support
               0       0.95      0.84      0.89      7925
               1       0.86      0.78      0.82      7339
               2       0.77      0.70      0.73      2944
               3       0.00      0.00      0.00         4
               4       0.73      0.44      0.55      2123
               5       0.52      0.64      0.58       987
               6       0.85      0.33      0.47       544
               7       0.71      0.67      0.69      3649
               8       0.77      0.58      0.66      3388
               9       0.85      0.88      0.86     10745
              10       0.46      0.10      0.16      1757
              11       0.90      0.04      0.07       729
              12       0.45      0.31      0.37       507
              13       0.55      0.32      0.41      1083
              14       0.68      0.12      0.20      3441
              15       0.82      0.16      0.27       655
              16       0.93      0.14      0.24       268
              17       0.87      0.40      0.55      2484
              18       0.84      0.34      0.49       692
       micro avg       0.82      0.63      0.71     51264
       macro avg       0.71      0.41      0.47     51264
    weighted avg       0.80      0.63      0.68     51264
     samples avg       0.71      0.70      0.69     51264

    D:\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
      _warn_prf(average, modifier, msg_start, len(result))
    D:\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.
      _warn_prf(average, modifier, msg_start, len(result))
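The report above lists both a micro and a macro average, and the macro figure is noticeably lower. The following sketch, written in plain Python with hypothetical helper names (`f1`, `micro_macro_f1`) and toy data, illustrates why: micro averaging pools the counts of all labels, so frequent labels dominate, while macro averaging gives every label the same weight, so a rare label that is never predicted pulls the score down.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; taken as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for 0/1 multi-label matrices."""
    n_labels = len(y_true[0])
    counts = []  # one (tp, fp, fn) triple per label column
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    macro = sum(f1(*c) for c in counts) / n_labels
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    return micro, macro


# Toy data: label 0 is frequent and always right, label 1 is rare and never predicted.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

The same pattern is visible in the report: labels with small support and near-zero recall barely affect the micro average but weigh heavily on the macro average.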




    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(data['text'].iloc[:], data_label,test_size = 0.2,random_state = 1)
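    # Hyperparameters
    max_features = 500
    max_len = 150
    embed_size = 100  # assumption: the Embedding layer uses embed_size, but the source never defines it
    batch_size = 128
    epochs = 1
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing import sequence
    tokens = Tokenizer(num_words=max_features)
    # Build the vocabulary first; without this call word_index is empty and every sequence comes out empty.
    tokens.fit_on_texts(list(x_train))
    x_sub_train = tokens.texts_to_sequences(x_train)
    x_sub_test = tokens.texts_to_sequences(x_test)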
    {'the': 1,
     'of': 2,
     'a': 3,
     'and': 4,
     'in': 5,
     'to': 6,
     'we': 7,
     'is': 8,
     'for': 9,
     'with': 10,
     'that': 11,
     'on': 12,
     'are': 13,
     'by': 14,
     'this': 15,
     'as': 16,
     'an': 17,
     'from': 18,
     'at': 19,
     'be': 20,
     '1': 21,
     '2': 22,
     'which': 23,
     '0': 24,
     'model': 25,
     'can': 26,
     'n': 27,
     'two': 28,
     'it': 29,
     'x': 30,
     'field': 31,
     'these': 32,
     'results': 33,
     'show': 34,
     'quantum': 35,
     'our': 36,
     'also': 37,
     '3': 38,
     'energy': 39,
     'using': 40,
     'have': 41,
     'one': 42,
     'time': 43,
     'theory': 44,
     'or': 45,
     'between': 46,
     'k': 47,
     'study': 48,
     'non': 49,
     'mass': 50,
     'data': 51,
     'm': 52,
     'has': 53,
     's': 54,
     'such': 55,
     'not': 56,
     'system': 57,
     't': 58,
     'i': 59,
     'new': 60,
     'p': 61,
     'high': 62,
     'based': 63,
     'paper': 64,
     'present': 65,
     'e': 66,
     'order': 67,
     'state': 68,
     'space': 69,
     'models': 70,
     'd': 71,
     'phase': 72,
     'c': 73,
     'large': 74,
     'spin': 75,
     'r': 76,
     'magnetic': 77,
     'its': 78,
     'g': 79,
     'than': 80,
     'all': 81,
     '4': 82,
     'systems': 83,
     'find': 84,
     'function': 85,
     'well': 86,
     'b': 87,
     'some': 88,
     'properties': 89,
     'where': 90,
     'first': 91,
     'density': 92,
     'their': 93,
     'dimensional': 94,
     'both': 95,
     '5': 96,
     'number': 97,
     'structure': 98,
     'case': 99,
     'states': 100,
     'method': 101,
     'h': 102,
     'low': 103,
     'been': 104,
     'when': 105,
     'type': 106,
     'ray': 107,
     'if': 108,
     'analysis': 109,
     'f': 110,
     '10': 111,
     'z': 112,
     'used': 113,
     'different': 114,
     'temperature': 115,
     'but': 116,
     'problem': 117,
     'l': 118,
     'over': 119,
     'into': 120,
     'only': 121,
     'observed': 122,
     'galaxies': 123,
     'more': 124,
     'stars': 125,
     'finite': 126,
     'star': 127,
     'obtained': 128,
     'group': 129,
     'distribution': 130,
     'found': 131,
     'three': 132,
     'equation': 133,
     'approach': 134,
     'other': 135,
     'gamma': 136,
     'may': 137,
     'effect': 138,
     'dynamics': 139,
     'ofthe': 140,
     'general': 141,
     'use': 142,
     'point': 143,
     'result': 144,
     'functions': 145,
     'q': 146,
     'emission': 147,
     'set': 148,
     'range': 149,
     'equations': 150,
     'single': 151,
     'effects': 152,
     'then': 153,
     'matter': 154,
     'there': 155,
     'due': 156,
     'linear': 157,
     'local': 158,
     'fields': 159,
     'scale': 160,
     'form': 161,
     'given': 162,
     'transition': 163,
     'small': 164,
     'evolution': 165,
     'observations': 166,
     'terms': 167,
     'potential': 168,
     'limit': 169,
     'very': 170,
     'wave': 171,
     'optical': 172,
     'parameters': 173,
     'surface': 174,
     'any': 175,
     'shown': 176,
     'gas': 177,
     'formation': 178,
     'solutions': 179,
     'light': 180,
     'within': 181,
     'particular': 182,
     'was': 183,
     'under': 184,
     'up': 185,
     'black': 186,
     'spectrum': 187,
     'rate': 188,
     'dark': 189,
     'most': 190,
     'electron': 191,
     'v': 192,
     '6': 193,
     'here': 194,
     'consider': 195,
     'out': 196,
     'prove': 197,
     'known': 198,
     'through': 199,
     'alpha': 200,
     'strong': 201,
     'possible': 202,
     'they': 203,
     'class': 204,
     'parameter': 205,
     'simple': 206,
     'will': 207,
     'power': 208,
     'lattice': 209,
     'like': 210,
     'discuss': 211,
     'while': 212,
     'galaxy': 213,
     'each': 214,
     'how': 215,
     'particle': 216,
     'give': 217,
     'process': 218,
     'conditions': 219,
     'interaction': 220,
     'however': 221,
     'free': 222,
     'line': 223,
     'work': 224,
     'u': 225,
     'random': 226,
     'symmetry': 227,
     'no': 228,
     'coupling': 229,
     'simulations': 230,
     'region': 231,
     'measurements': 232,
     'solution': 233,
     'about': 234,
     'o': 235,
     'level': 236,
     'current': 237,
     'classical': 238,
     'standard': 239,
     'recent': 240,
     'spectral': 241,
     'j': 242,
     'so': 243,
     'same': 244,
     'information': 245,
     'many': 246,
     'cluster': 247,
     'provide': 248,
     'pi': 249,
     'value': 250,
     'algorithm': 251,
     'values': 252,
     'stellar': 253,
     'scattering': 254,
     'matrix': 255,
     'gauge': 256,
     'hole': 257,
     'investigate': 258,
     'associated': 259,
     '8': 260,
     'near': 261,
     'size': 262,
     'complex': 263,
     'constant': 264,
     'times': 265,
     'several': 266,
     'numerical': 267,
     'effective': 268,
     'long': 269,
     'critical': 270,
     'behavior': 271,
     'spectra': 272,
     'studied': 273,
     'second': 274,
     'obtain': 275,
     'methods': 276,
     'higher': 277,
     'were': 278,
     'proposed': 279,
     'zero': 280,
     'groups': 281,
     'source': 282,
     'velocity': 283,
     'self': 284,
     'sources': 285,
     'even': 286,
     'experimental': 287,
     'particles': 288,
     'frequency': 289,
     'lambda': 290,
     'sample': 291,
     '7': 292,
     'decay': 293,
     'consistent': 294,
     'presented': 295,
     'interactions': 296,
     'sigma': 297,
     'gravity': 298,
     'those': 299,
     'similar': 300,
     'flow': 301,
     'algebra': 302,
     'bound': 303,
     'dependence': 304,
     'physics': 305,
     'way': 306,
     'disk': 307,
     'derived': 308,
     'flux': 309,
     'clusters': 310,
     'radio': 311,
     'delta': 312,
     'processes': 313,
     'quark': 314,
     'mean': 315,
     'main': 316,
     'network': 317,
     'theorem': 318,
     'via': 319,
     'presence': 320,
     'boundary': 321,
     'induced': 322,
     'correlation': 323,
     'w': 324,
     'including': 325,
     'mu': 326,
     'band': 327,
     'related': 328,
     'charge': 329,
     'ii': 330,
     'dynamical': 331,
     'corresponding': 332,
     'dependent': 333,
     'real': 334,
     'momentum': 335,
     'discussed': 336,
     'ratio': 337,
     'networks': 338,
     'scalar': 339,
     'weak': 340,
     'existence': 341,
     'masses': 342,
     'structures': 343,
     'inthe': 344,
     'spaces': 345,
     'production': 346,
     'experiments': 347,
     'physical': 348,
     'solar': 349,
     'vector': 350,
     'relation': 351,
     'thus': 352,
     'lower': 353,
     'thermal': 354,
     'approximation': 355,
     'measure': 356,
     'describe': 357,
     'initial': 358,
     'compared': 359,
     'derive': 360,
     'important': 361,
     'framework': 362,
     'noise': 363,
     'factor': 364,
     'cases': 365,
     'theoretical': 366,
     'motion': 367,
     'massive': 368,
     'certain': 369,
     'plane': 370,
     'law': 371,
     'four': 372,
     'finally': 373,
     'cross': 374,
     'propose': 375,
     'calculations': 376,
     'operators': 377,
     'problems': 378,
     'invariant': 379,
     'gravitational': 380,
     'allows': 381,
     'various': 382,
     'recently': 383,
     'algebras': 384,
     'modes': 385,
     'previous': 386,
     'during': 387,
     'considered': 388,
     'y': 389,
     'measured': 390,
     'component': 391,
     'part': 392,
     'photon': 393,
     'us': 394,
     'length': 395,
     'without': 396,
     'molecular': 397,
     'could': 398,
     'distance': 399,
     'generalized': 400,
     'defined': 401,
     'multi': 402,
     'applications': 403,
     'compact': 404,
     'scheme': 405,
     'total': 406,
     'regions': 407,
     'symmetric': 408,
     'applied': 409,
     'independent': 410,
     'constraints': 411,
     'objects': 412,
     'neutrino': 413,
     'transport': 414,
     'galactic': 415,
     'channel': 416,
     'around': 417,
     'points': 418,
     'coupled': 419,
     'along': 420,
     'dimension': 421,
     'qcd': 422,
     'distributions': 423,
     'detection': 424,
     'regime': 425,
     'probability': 426,
     'resolution': 427,
     'mode': 428,
     'role': 429,
     'cosmic': 430,
     'survey': 431,
     'omega': 432,
     'nonlinear': 433,
     'shows': 434,
     'evidence': 435,
     'geometry': 436,
     'beta': 437,
     'theories': 438,
     'does': 439,
     'background': 440,
     'binary': 441,
     'operator': 442,
     'infrared': 443,
     'studies': 444,
     'exact': 445,
     'universe': 446,
     'mechanism': 447,
     'report': 448,
     'central': 449,
     'loop': 450,
     'features': 451,
     'cosmological': 452,
     'luminosity': 453,
     'entropy': 454,
     'nuclear': 455,
     'al': 456,
     'them': 457,
     'demonstrate': 458,
     'dimensions': 459,
     'do': 460,
     'term': 461,
     'strongly': 462,
     'lines': 463,
     'measurement': 464,
     'graph': 465,
     'extended': 466,
     'fluctuations': 467,
     'above': 468,
     'ground': 469,
     'degree': 470,
     'relativistic': 471,
     'after': 472,
     'determine': 473,
     'provides': 474,
     'complete': 475,
     'heavy': 476,
     'radiation': 477,
     'stable': 478,
     'fermi': 479,
     'application': 480,
     'action': 481,
     'control': 482,
     'called': 483,
     'series': 484,
     'body': 485,
     'redshift': 486,
     'gev': 487,
     'expansion': 488,
     'described': 489,
     'positive': 490,
     'fixed': 491,
     'further': 492,
     'leads': 493,
     'short': 494,
     'bar': 495,
     'larger': 496,
     'differential': 497,
     'description': 498,
     'direct': 499,
     'close': 500,
     'waves': 501,
     'scaling': 502,
     'agreement': 503,
     'optimal': 504,
     'dust': 505,
     'et': 506,
     'condition': 507,
     'core': 508,
     'entanglement': 509,
     'signal': 510,
     'global': 511,
     'expected': 512,
     'phi': 513,
     'pair': 514,
     'neutron': 515,
     'search': 516,
     'construct': 517,
     'significant': 518,
     'test': 519,
     'polarization': 520,
     'equilibrium': 521,
     'collisions': 522,
     'open': 523,
     'higgs': 524,
     'respect': 525,
     'account': 526,
     'technique': 527,
     'stability': 528,
     'holes': 529,
     'scales': 530,
     '9': 531,
     'string': 532,
     'spatial': 533,
     'sets': 534,
     'determined': 535,
     'upper': 536,
     'simulation': 537,
     'sequence': 538,
     'example': 539,
     'correlations': 540,
     'much': 541,
     'nu': 542,
     'energies': 543,
     'examples': 544,
     'population': 545,
     'whose': 546,
     'addition': 547,
     'quasi': 548,
     'rates': 549,
     'tau': 550,
     'statistical': 551,
     'multiple': 552,
     'graphene': 553,
     'leading': 554,
     'proof': 555,
     'curves': 556,
     'metric': 557,
     'estimate': 558,
     'investigated': 559,
     'nature': 560,
     'let': 561,
     'accretion': 562,
     'arbitrary': 563,
     'lie': 564,
     'medium': 565,
     'gap': 566,
     'sum': 567,
     'generated': 568,
     'fe': 569,
     'normal': 570,
     'full': 571,
     'double': 572,
     'atoms': 573,
     'growth': 574,
     'graphs': 575,
     'components': 576,
     'minimal': 577,
     'jet': 578,
     'co': 579,
     'special': 580,
     'formula': 581,
     'angular': 582,
     'force': 583,
     'asymptotic': 584,
     'de': 585,
     'detected': 586,
     'pressure': 587,
     'domain': 588,
     'integral': 589,
     'few': 590,
     'moreover': 591,
     'topological': 592,
     'maximum': 593,
     'good': 594,
     'developed': 595,
     'fraction': 596,
     'respectively': 597,
     'since': 598,
     'calculate': 599,
     'discrete': 600,
     'infinite': 601,
     'representation': 602,
     'elements': 603,
     'continuous': 604,
     'resonance': 605,
     'being': 606,
     'relative': 607,
     'log': 608,
     'introduce': 609,
     'techniques': 610,
     'mixing': 611,
     'stochastic': 612,
     'bounds': 613,
     'experiment': 614,
     'fundamental': 615,
     'below': 616,
     'specific': 617,
     'maps': 618,
     'gaussian': 619,
     'decays': 620,
     'performance': 621,
     'means': 622,
     'error': 623,
     'radius': 624,
     '20': 625,
     'diffusion': 626,
     'average': 627,
     'numbers': 628,
     'should': 629,
     'comparison': 630,
     'product': 631,
     'su': 632,
     'calculated': 633,
     'corrections': 634,
     'einstein': 635,
     'either': 636,
     'context': 637,
     'surfaces': 638,
     'predictions': 639,
     'less': 640,
     'period': 641,
     'chiral': 642,
     'hamiltonian': 643,
     'algorithms': 644,
     'basis': 645,
     'functional': 646,
     'among': 647,
     'ring': 648,
     'suggest': 649,
     'introduced': 650,
     'compute': 651,
     'explicit': 652,
     'rotation': 653,
     'magnitude': 654,
     'closed': 655,
     'index': 656,
     'transitions': 657,
     'amplitude': 658,
     'driven': 659,
     'telescope': 660,
     'construction': 661,
     'metal': 662,
     'conjecture': 663,
     'atomic': 664,
     'early': 665,
     'analyze': 666,
     'compare': 667,
     'map': 668,
     'origin': 669,
     'family': 670,
     'periodic': 671,
     'electronic': 672,
     'absorption': 673,
     'curvature': 674,
     'made': 675,
     'electrons': 676,
     'bulk': 677,
     'natural': 678,
     'orbital': 679,
     'estimates': 680,
     'performed': 681,
     'change': 682,
     'manifolds': 683,
     'plasma': 684,
     '100': 685,
     'least': 686,
     'relations': 687,
     'tensor': 688,
     'variables': 689,
     'would': 690,
     'transfer': 691,
     'variable': 692,
     'types': 693,
     'classes': 694,
     'resulting': 695,
     'electric': 696,
     'wide': 697,
     'negative': 698,
     'mathbb': 699,
     'contribution': 700,
     'gives': 701,
     'polynomial': 702,
     'far': 703,
     'algebraic': 704,
     'lhc': 705,
     'hard': 706,
     'shape': 707,
     'interacting': 708,
     'furthermore': 709,
     'universal': 710,
     'vacuum': 711,
     'limits': 712,
     'future': 713,
     'because': 714,
     'manifold': 715,
     'sim': 716,
     'pm': 717,
     'coefficients': 718,
     'property': 719,
     'design': 720,
     'fluid': 721,
     'previously': 722,
     'geometric': 723,
     'temperatures': 724,
     'events': 725,
     'breaking': 726,
     'tothe': 727,
     'phases': 728,
     'ion': 729,
     'monte': 730,
     'rm': 731,
     'matrices': 732,
     'therefore': 733,
     'rho': 734,
     'volume': 735,
     'partial': 736,
     'although': 737,
     '12': 738,
     'theta': 739,
     'lead': 740,
     'spectroscopy': 741,
     'pairs': 742,
     'review': 743,
     'almost': 744,
     'analytic': 745,
     'strength': 746,
     'superconducting': 747,
     'radial': 748,
     'curve': 749,
     'characteristic': 750,
     'available': 751,
     'detector': 752,
     'liquid': 753,
     'chain': 754,
     'edge': 755,
     'agn': 756,
     'code': 757,
     'halo': 758,
     'carlo': 759,
     'angle': 760,
     'produced': 761,
     'extension': 762,
     'beam': 763,
     'charged': 764,
     'increase': 765,
     'version': 766,
     'equivalent': 767,
     'key': 768,
     'efficient': 769,
     'layer': 770,
     'apply': 771,
     'cm': 772,
     'orbit': 773,
     'significantly': 774,
     'oscillations': 775,
     'smooth': 776,
     'formalism': 777,
     'peak': 778,
     'nuclei': 779,
     'down': 780,
     'observation': 781,
     'analytical': 782,
     'center': 783,
     'images': 784,
     'changes': 785,
     'dual': 786,
     '15': 787,
     'difference': 788,
     'scenario': 789,
     'every': 790,
     'coherent': 791,
     'infty': 792,
     'loss': 793,
     'response': 794,
     'chemical': 795,
     'exchange': 796,
     'section': 797,
     'generation': 798,
     'detailed': 799,
     'external': 800,
     'active': 801,
     'direction': 802,
     'additional': 803,
     'channels': 804,
     'flat': 805,
     'laser': 806,
     'fast': 807,
     'principle': 808,
     'explain': 809,
     'forms': 810,
     'reduced': 811,
     'diagram': 812,
     'half': 813,
     'heat': 814,
     'highly': 815,
     'depends': 816,
     'statistics': 817,
     'off': 818,
     'factors': 819,
     'area': 820,
     'novel': 821,
     'fact': 822,
     'allow': 823,
     'calculation': 824,
     'semi': 825,
     'bounded': 826,
     'able': 827,
     'smaller': 828,
     'complexity': 829,
     'codes': 830,
     'transverse': 831,
     'sun': 832,
     'dynamic': 833,
     'abelian': 834,
     'eta': 835,
     'boson': 836,
     'thin': 837,
     'influence': 838,
     'predicted': 839,
     'connected': 840,
     'article': 841,
     'rather': 842,
     'relevant': 843,
     'environment': 844,
     'make': 845,
     'best': 846,
     'continuum': 847,
     'sub': 848,
     'provided': 849,
     'include': 850,
     'contrast': 851,
     'dirac': 852,
     'mathcal': 853,
     'fit': 854,
     'years': 855,
     'au': 856,
     'perturbation': 857,
     'beyond': 858,
     'soft': 859,
     'end': 860,
     'degrees': 861,
     'modified': 862,
     'imaging': 863,
     'square': 864,
     'proton': 865,
     'develop': 866,
     'dispersion': 867,
     'reduction': 868,
     'material': 869,
     'forthe': 870,
     'minimum': 871,
     'following': 872,
     'possibility': 873,
     'cp': 874,
     'regular': 875,
     'relaxation': 876,
     'cannot': 877,
     'rays': 878,
     'polynomials': 879,
     'instability': 880,
     'dwarf': 881,
     'extend': 882,
     'onthe': 883,
     'densities': 884,
     'define': 885,
     'must': 886,
     'pure': 887,
     'taken': 888,
     'ir': 889,
     'threshold': 890,
     'supersymmetric': 891,
     'variety': 892,
     'representations': 893,
     'brane': 894,
     'parallel': 895,
     'next': 896,
     'km': 897,
     'latter': 898,
     'impact': 899,
     'intermediate': 900,
     'photons': 901,
     'required': 902,
     'understanding': 903,
     'mechanics': 904,
     'path': 905,
     'identify': 906,
     'increasing': 907,
     'identified': 908,
     'potentials': 909,
     'hot': 910,
     'accuracy': 911,
     '11': 912,
     'better': 913,
     'still': 914,
     'final': 915,
     'procedure': 916,
     '30': 917,
     'towards': 918,
     'conformal': 919,
     'uniform': 920,
     'constructed': 921,
     'crystal': 922,
     'approaches': 923,
     'samples': 924,
     'width': 925,
     'increases': 926,
     'analyzed': 927,
     'atom': 928,
     'inner': 929,
     'ngc': 930,
     'mev': 931,
     'homogeneous': 932,
     'seen': 933,
     'color': 934,
     'convergence': 935,
     'electromagnetic': 936,
     'efficiency': 937,
     'inverse': 938,
     'bose': 939,
     'equal': 940,
     'image': 941,
     'probe': 942,
     'cloud': 943,
     'ads': 944,
     'necessary': 945,
     'fully': 946,
     'profile': 947,
     'step': 948,
     'what': 949,
     'static': 950,
     'exist': 951,
     'top': 952,
     'speed': 953,
     'position': 954,
     'view': 955,
     'andthe': 956,
     'shock': 957,
     'likely': 958,
     'correlated': 959,
     'epsilon': 960,
     'psi': 961,
     'exhibit': 962,
     'together': 963,
     'basic': 964,
     'modeling': 965,
     'per': 966,
     'young': 967,
     'disks': 968,
     'behaviour': 969,
     'whether': 970,
     'object': 971,
     'note': 972,
     'rich': 973,
     'tev': 974,
     'ratios': 975,
     'common': 976,
     'estimation': 977,
     'focus': 978,
     'rank': 979,
     'deep': 980,
     'connection': 981,
     'giant': 982,
     'excitation': 983,
     'assuming': 984,
     'fermions': 985,
     'forming': 986,
     'appear': 987,
     'combined': 988,
     'thatthe': 989,
     'contributions': 990,
     'measures': 991,
     'spacetime': 992,
     'exists': 993,
     'feature': 994,
     'sqrt': 995,
     'shell': 996,
     'question': 997,
     'nucleon': 998,
     'flavor': 999,
     'flows': 1000,
    x_sub_train=sequence.pad_sequences(x_sub_train, maxlen=max_len)
    x_sub_test=sequence.pad_sequences(x_sub_test, maxlen=max_len)
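What the tokenising and padding steps do to the text can be sketched in plain Python. The helpers below are simplified stand-ins, not the Keras implementation: they split on whitespace only, whereas the Keras `Tokenizer` also strips punctuation by default. They follow the same conventions as far as I know them: words are ranked by frequency starting at index 1, only indices below `num_words` are kept, and sequences are left-padded with 0 and truncated from the front.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically
    return {word: i for i, word in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


def pad_sequences(seqs, maxlen):
    """Left-pad with 0 and keep only the last maxlen items of longer sequences."""
    padded = []
    for seq in seqs:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


# Toy texts (hypothetical).
train = ["the model of the field", "the field theory"]
word_index = build_word_index(train)
train_seqs = texts_to_sequences(train, word_index, num_words=4)
test_seqs = texts_to_sequences(["a theory of the model field"], word_index, num_words=4)
train_padded = pad_sequences(train_seqs, maxlen=3)
print(word_index)
print(train_padded, test_seqs)
```

With `max_features = 500` as in the notebook, only the 499 most frequent words survive this step; everything else is silently dropped from each sequence, which limits how much topic-specific vocabulary the network can see.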


    from keras.layers import Dense,Input,LSTM,Bidirectional,Activation,Conv1D,GRU
    from keras.layers import Dropout,Embedding,GlobalMaxPooling1D, MaxPooling1D, Add, Flatten
    from keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate, SpatialDropout1D
    # Keras callback functions:
    from keras.callbacks import Callback
    from keras.callbacks import EarlyStopping,ModelCheckpoint
    from keras import initializers, regularizers, constraints, optimizers, layers, callbacks
    from keras.models import Model
    from keras.optimizers import Adam
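    # Build the model (the original comment says LSTM, but the recurrent layer used here is a bidirectional GRU)
    sequence_input = Input(shape=(max_len, ))
    x = Embedding(max_features, embed_size,trainable = False)(sequence_input)
    x = SpatialDropout1D(0.2)(x)
    # The next two lines are cut off in the source. return_sequences=True is required so that Conv1D
    # receives a 3-D tensor; "glorot_uniform" (the Keras default initializer) is an assumption.
    x = Bidirectional(GRU(128, return_sequences = True))(x)
    x = Conv1D(64, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform")(x)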
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    x = concatenate([avg_pool, max_pool]) 
    preds = Dense(19, activation="sigmoid")(x)


    (160000, 150)
    (160000, 19)
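    model = Model(sequence_input, preds)
    # Missing from the source and assumed here: compile with binary cross-entropy (sigmoid multi-label outputs) and the accuracy metric seen in the training log.
    model.compile(loss="binary_crossentropy", optimizer=Adam(), metrics=["accuracy"])
    model.fit(x_sub_train, y_train, batch_size=batch_size, epochs=epochs)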
    1250/1250 [==============================] - 2934s 2s/step - loss: 0.1984 - accuracy: 0.3651
    <tensorflow.python.keras.callbacks.History at 0x18c48aa9c70>
    prediction=model.predict(x_sub_test, batch_size=batch_size)
    array([[0.02252209, 0.21710196, 0.05228466, ..., 0.00359941, 0.04740071,
           [0.06721482, 0.07629076, 0.1790415 , ..., 0.0033555 , 0.04327598,
           [0.03353414, 0.19576049, 0.06734261, ..., 0.00558475, 0.02856755,
           [0.01758364, 0.7247476 , 0.0205844 , ..., 0.00262681, 0.12628657,
            0.0036349 ],
           [0.02018005, 0.04203317, 0.03764346, ..., 0.00153542, 0.00649127,
           [0.01772627, 0.1940268 , 0.06325474, ..., 0.00296167, 0.03129855,
            0.01045477]], dtype=float32)
    (40000, 19)
    array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])
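    def lastprocess(prediction, yuzhi):
        # yuzhi is the decision threshold (pinyin for "threshold").
        # Several lines of this function are missing in the source; the initialisation, the assignment
        # and the argmax fallback below are reconstructed and should be read as assumptions.
        import numpy as np
        myarray = np.zeros(prediction.shape)
        for i in range(0, len(prediction)):
            for j in range(0, len(prediction[0])):
                if prediction[i][j] >= yuzhi:
                    myarray[i][j] = 1
            # If no label clears the threshold, keep the single most probable one.
            if sum(myarray[i]) == 0:
                myarray[i][np.argmax(prediction[i])] = 1
        return myarray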
    array([[0., 1., 0., ..., 0., 0., 0.],
           [0., 0., 1., ..., 0., 0., 0.],
           [0., 1., 0., ..., 0., 0., 0.],
           [0., 1., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 1., 0., ..., 0., 0., 0.]])
    from sklearn.metrics import classification_report
    print(classification_report(y_test, lastprocess(prediction,0.15)))
                  precision    recall  f1-score   support
               0       0.64      0.79      0.71      7925
               1       0.39      0.77      0.52      7339
               2       0.28      0.57      0.38      2944
               3       0.00      0.00      0.00         4
               4       0.48      0.44      0.46      2123
               5       0.32      0.35      0.34       987
               6       0.33      0.12      0.18       544
               7       0.39      0.56      0.46      3649
               8       0.44      0.43      0.44      3388
               9       0.43      0.96      0.59     10745
              10       0.09      0.00      0.00      1757
              11       0.00      0.00      0.00       729
              12       0.04      0.00      0.00       507
              13       0.15      0.07      0.09      1083
              14       0.22      0.05      0.08      3441
              15       0.00      0.00      0.00       655
              16       0.00      0.00      0.00       268
              17       0.39      0.63      0.48      2484
              18       0.00      0.00      0.00       692
       micro avg       0.43      0.60      0.50     51264
       macro avg       0.24      0.30      0.25     51264
    weighted avg       0.39      0.60      0.45     51264
     samples avg       0.50      0.65      0.53     51264
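The report above uses a threshold of 0.15, and the source does not say how that value was chosen. One common option is to try several thresholds on held-out data and keep the one with the best micro F1. The sketch below shows that idea in plain Python; the helper names (`micro_f1`, `apply_threshold`, `best_threshold`), the toy probabilities and the candidate grid are all hypothetical, and nothing here reproduces the author's procedure.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all cells of two 0/1 matrices."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def apply_threshold(probs, threshold):
    """Turn probabilities into 0/1 predictions."""
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def best_threshold(y_true, probs, candidates):
    """Return (score, threshold) for the candidate with the highest micro F1."""
    scored = [(micro_f1(y_true, apply_threshold(probs, t)), t) for t in candidates]
    return max(scored)


# Toy validation labels and predicted probabilities (hypothetical).
y_val = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
probs = [
    [0.60, 0.20, 0.05],
    [0.10, 0.35, 0.02],
    [0.40, 0.30, 0.10],
    [0.20, 0.10, 0.25],
]
score, threshold = best_threshold(y_val, probs, [0.1, 0.2, 0.3, 0.4, 0.5])
print(f"best threshold = {threshold}, micro F1 = {score:.3f}")
```

A low threshold raises recall at the cost of precision, which matches the pattern in the report, where several frequent labels show high recall but precision below 0.5. The threshold should be tuned on validation data rather than on the test split used for the final report.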


  • Original article: https://www.cnblogs.com/Zfancy/p/14315831.html