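General-purpose single-sentence embeddings
- Average of the word vectors
- Sum of the word vectors, each multiplied by a word weight
- Jointly account for word frequency and a matrix decomposition of the word vectors
- Train sentence2vec in a way similar to word2vec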
- Distributed Representations of Sentences and Documents
- Implementation: https://rare-technologies.com/doc2vec-tutorial/
- If the sentence contains numbers
- Sequential (Denoising) Autoencoders (SDAE)
- Add noise to a sentence and use the result as the input, with the original sentence as the target, to build a seq2seq model
- Learning Distributed Representations of Sentences from Unlabelled Data
- An Overview of Sentence Embedding Methods
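The pooling items and the denoising item above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the toy `vectors` dictionary, and the frequency weight `a / (a + p(w))` are choices made for this sketch (the notes do not say which word weight is meant), and the noise function (random word deletion plus swapping adjacent words) is one simple way to corrupt a sentence for an SDAE-style setup, not necessarily the exact procedure of any paper listed here.

```python
import random


def weighted_sum(tokens, vectors, weights):
    """Sum the word vectors of `tokens`, each scaled by its word weight.

    Tokens without a vector are skipped; tokens without a weight use 1.0.
    """
    known = [t for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a word vector")
    sentence = [0.0] * len(vectors[known[0]])
    for token in known:
        w = weights.get(token, 1.0)
        for i, x in enumerate(vectors[token]):
            sentence[i] += w * x
    return sentence


def average(tokens, vectors):
    """Plain average: a weighted sum where every known token weighs 1/n."""
    known = [t for t in tokens if t in vectors]
    return weighted_sum(known, vectors, {t: 1.0 / len(known) for t in known})


def frequency_weights(counts, a=1e-3):
    """Assumed weighting: a / (a + p(w)), so frequent words count less."""
    total = sum(counts.values())
    return {word: a / (a + c / total) for word, c in counts.items()}


def corrupt(tokens, p_drop, p_swap, rng):
    """Build a noisy input: drop words, then swap non-overlapping neighbours."""
    kept = [t for t in tokens if rng.random() >= p_drop]
    i = 0
    while i + 1 < len(kept):
        if rng.random() < p_swap:
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
        i += 2
    return kept


# Toy data, purely hypothetical.
vectors = {"the": [0.1, 0.3], "cat": [0.9, 0.2], "sat": [0.4, 0.8]}
counts = {"the": 90, "cat": 6, "sat": 4}
sentence = ["the", "cat", "sat"]

mean_vec = average(sentence, vectors)
weighted_vec = weighted_sum(sentence, vectors, frequency_weights(counts))
# (noisy input, clean target) is one training pair for the seq2seq model.
training_pair = (corrupt(sentence, 0.1, 0.1, random.Random(0)), sentence)
```

A denoising autoencoder would be trained on many such `(noisy input, clean target)` pairs, and the encoder state would then serve as the sentence embedding. The seq2seq model itself is left out of this sketch.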
Scenario-specific embeddings
- seq2seq, RNN encoder-decoder: the training corpus for machine translation consists of sentence pairs (Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation)
- Clicks on search engine results can also yield sentence pairs
- self-attention + bi-lstm
- Produces a sentence embedding matrix in which each row represents one semantic aspect of the sentence
- A Structured Self-Attentive Sentence Embedding; implementations: https://github.com/kaushalshetty/Structured-Self-Attention https://github.com/jx00109/structured-self-attentive-sentence-embedding
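As a rough illustration of the sentence embedding matrix described above, the attention step can be written as A = softmax(W_s2 tanh(W_s1 Hᵀ)) and M = A H, where H holds the bi-LSTM hidden states (one row per token). The sketch below uses plain Python lists and skips the bi-LSTM entirely: `H`, `W_s1`, and `W_s2` are hand-written placeholder values rather than trained parameters, and the helper names are made up for this example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def structured_self_attention(H, W_s1, W_s2):
    """Return (A, M) for hidden states H.

    H:    n x 2u  (one row per token, as a bi-LSTM would produce)
    W_s1: d_a x 2u
    W_s2: r x d_a
    A:    r x n   (each row is an attention distribution over the tokens)
    M:    r x 2u  (each row is one weighted view of the sentence)
    """
    hidden = [[math.tanh(x) for x in row] for row in matmul(W_s1, transpose(H))]
    A = [softmax(row) for row in matmul(W_s2, hidden)]
    M = matmul(A, H)
    return A, M


# Placeholder values: 3 tokens, hidden size 2u = 2, d_a = 2, r = 2 attention rows.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_s1 = [[0.5, -0.5], [1.0, 0.2]]
W_s2 = [[1.0, -1.0], [0.3, 2.0]]
A, M = structured_self_attention(H, W_s1, W_s2)
```

Each row of `A` sums to 1, so each row of `M` is a convex combination of the token states. Using r rows instead of a single attention vector is what lets the matrix capture several aspects of the same sentence.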