  • Deep Learning for Natural Language Processing: vector space models

    Three types of matrices:

    term–document

    word–context

    pair–pattern

    semantics: the meaning of a word, a phrase, a sentence, or any text in human language, and the study of such meaning

    Characteristics:

    Information is extracted automatically from corpora, saving manual effort

    Measures the similarity of words, phrases, and documents

    The Term–Document Matrix: row vectors are terms, column vectors are documents

    bag: a collection that may contain duplicate elements, represented as a matrix X in which each column x:j corresponds to a bag, each row xi: corresponds to a unique member, and an element xij is the frequency of the i-th member in the j-th bag
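
    The bag-of-words construction above can be sketched as follows (the toy corpus and vocabulary are invented for illustration):

```python
from collections import Counter

# Toy corpus: each document is one "bag" of words (duplicates allowed).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Vocabulary: one row per unique term, in a fixed order.
terms = sorted({w for d in docs for w in d.split()})

# X[i][j] = frequency of the i-th term in the j-th document.
X = [[Counter(d.split())[t] for d in docs] for t in terms]

print(X[terms.index("the")])  # [2, 2, 0]: "the" occurs twice in the first two documents
```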

    word–context

    The distributional hypothesis in linguistics is that words that occur in similar contexts
    tend to have similar meanings
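
    A minimal sketch of a word–context matrix built from co-occurrence counts (toy sentences, window size of 1 chosen arbitrarily for illustration):

```python
from collections import defaultdict

# Toy sentences; context = tokens within a window of 1 (immediate neighbours).
sentences = [
    ["drink", "cold", "beer"],
    ["drink", "cold", "water"],
]

cooc = defaultdict(lambda: defaultdict(int))
for sent in sentences:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):           # adjacent positions only
            if 0 <= j < len(sent):
                cooc[w][sent[j]] += 1

# "beer" and "water" occur in the same context ("cold"), so their row
# vectors match -- the distributional hypothesis in miniature.
print(dict(cooc["beer"]), dict(cooc["water"]))
```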

    pair–pattern

    mason : stone
    carpenter : wood

    X cuts Y

    X works with Y

    The extended distributional hypothesis is that patterns that co-occur with similar pairs tend to have similar meanings

    latent relation hypothesis is that pairs of words that co-occur in similar patterns
    tend to have similar semantic relations

    attributional similarity: word–context, sim_a(a, b) ∈ R

    relational similarity: pair–pattern, sim_r(a : b, c : d) ∈ R
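
    Attributional similarity is commonly realized as the cosine between two row vectors (the word vectors below are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical word–context rows for two words.
beer = [1, 0, 2]
water = [1, 0, 1]
print(round(cosine(beer, water), 3))  # 0.949
```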

    A token is a single instance of a symbol, whereas a type is a general class of tokens

    Statistical semantics hypothesis

    If units of text have similar vectors in a text frequency matrix, then they tend to have similar meanings

    Bag of words hypothesis

    If documents and pseudodocuments (queries) have similar column vectors in a term–document matrix, then they tend to have similar meanings

    Distributional hypothesis

    If words have similar row vectors in a word–context matrix, then they tend to have similar meanings

    Extended distributional hypothesis

    If patterns have similar column vectors in a pair–pattern matrix, then they tend to express similar semantic relations

    Latent relation hypothesis

    If word pairs have similar row vectors in a pair–pattern matrix, then they tend to have similar semantic relations

    Linguistic Processing for Vector Space Models

    1. tokenize the raw text: decide what constitutes a term and how to extract terms from raw text

    handle punctuation (e.g., don’t, Jane’s, and/or) and hyphenation (e.g., state-of-the-art versus state of the art), and recognize multi-word terms (e.g., Barack Obama and ice hockey)

    2. normalize the raw text: convert superficially different strings of characters to the same form

    Case folding
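
    Tokenization and case folding can be sketched together; the regex below is one simple choice, and note how it silently decides the hyphenation question (state-of-the-art becomes three terms):

```python
import re

def tokenize(text):
    """Case-fold, then split on runs of characters that are neither
    word characters nor apostrophes (so "don't" survives intact)."""
    return [t for t in re.split(r"[^\w']+", text.lower()) if t]

print(tokenize("Don't mix State-of-the-art with state of the art."))
```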

    3. annotate the raw text: mark identical strings of characters as being different

    Mathematical Processing for Vector Space Models

    1. generate a matrix of frequencies

    First, scan sequentially through the corpus, recording events and their frequencies in a hash table, a database, or a search engine index. Second, use the resulting data structure to generate the frequency matrix, with a sparse matrix representation

    2. adjust the weights of the elements in the matrix

    tf-idf (term frequency × inverse document frequency) family of weighting functions
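
    One common member of the tf-idf family (raw tf times log(N/df); the counts below are invented for illustration):

```python
import math

# Toy term–document counts as a sparse {term: {doc: tf}} table.
tf = {
    "the": {"d1": 2, "d2": 2, "d3": 1},
    "cat": {"d1": 1},
}
n_docs = 3

def tfidf(term, doc):
    """tf(term, doc) * log(N / df(term)); df = documents containing the term."""
    df = len(tf[term])
    return tf[term].get(doc, 0) * math.log(n_docs / df)

# "the" appears in every document, so its idf (hence its weight) is zero,
# while the rarer "cat" keeps a positive weight.
print(tfidf("the", "d1"), round(tfidf("cat", "d1"), 3))
```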

    length normalization

    Term weighting

    Pointwise Mutual Information (PMI); known problem: it is biased toward infrequent events
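
    A minimal sketch of PMI and its infrequent-event bias (all counts invented for illustration):

```python
import math

# Joint counts of (word, context) pairs and their marginals.
pair = {("ice", "cream"): 8, ("rare", "quirk"): 1}
word = {"ice": 10, "rare": 1}
ctx = {"cream": 9, "quirk": 1}
total = 100

def pmi(w, c):
    """log P(w, c) / (P(w) P(c)): positive when w and c co-occur more than chance."""
    return math.log((pair[(w, c)] / total) / ((word[w] / total) * (ctx[c] / total)))

# The pair seen only once scores *higher* than the genuinely frequent pair:
# the infrequent-event bias noted above.
print(round(pmi("ice", "cream"), 3), round(pmi("rare", "quirk"), 3))
```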

    3. smooth the matrix to reduce the amount of random noise and to fill in some of the zero elements in a sparse matrix

    Singular Value Decomposition (SVD)

    latent meaning, noise reduction, high-order co-occurrence, and sparsity reduction
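
    Truncated SVD can be sketched as follows (toy matrix and rank k = 2 invented for illustration; numpy assumed available):

```python
import numpy as np

# Toy term–document matrix (3 terms x 4 documents).
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                   # keep only the top-k singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# X_k is the best rank-k approximation of X in the least-squares sense;
# the discarded small singular values are treated as noise, and formerly
# zero cells may become nonzero (sparsity reduction).
print(np.round(X_k, 2))
```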

    Optimizations and parallelization for similarity computing

    sparse-matrix multiplication: decompose the similarity computation into three parts: the nonzero values of X, the nonzero values of Y, and the values that are nonzero in both X and Y
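
    The key saving can be sketched with a sparse dot product that only touches indices nonzero in both vectors (the {index: value} representation and toy vectors are my own choices for illustration):

```python
def sparse_dot(x, y):
    """Dot product of sparse vectors stored as {index: value} dicts.

    Only indices nonzero in *both* vectors contribute, so we iterate
    over the shorter dict and probe the longer one."""
    if len(x) > len(y):
        x, y = y, x
    return sum(v * y[i] for i, v in x.items() if i in y)

x = {0: 2.0, 5: 1.0, 9: 3.0}
y = {5: 4.0, 9: 1.0}
print(sparse_dot(x, y))  # 1.0*4.0 + 3.0*1.0 = 7.0
```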

    distributed processing: MapReduce / Hadoop

    randomized algorithm: dimension reduction
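
    A sketch of randomized dimension reduction via a Gaussian random projection (Johnson–Lindenstrauss style; dimensions, seed, and vectors are arbitrary choices for illustration):

```python
import math
import random

random.seed(0)
d, k = 100, 10

# Random Gaussian projection matrix R (k x d), scaled by 1/sqrt(k) so that
# squared lengths are preserved in expectation.
R = [[random.gauss(0, 1) / math.sqrt(k) for _ in range(d)] for _ in range(k)]

def project(v):
    """Map a d-dimensional vector to k dimensions: one dot product per row of R."""
    return [sum(r_i * v_i for r_i, v_i in zip(row, v)) for row in R]

u = [1.0] * d
low = project(u)
print(len(low))  # 10: similarity computations now run in the smaller space
```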

    machine learning

  • Original article: https://www.cnblogs.com/learnmuch/p/5971268.html