Faiss向量相似性搜索

zoukankan html css js c++ java

Faiss向量相似性搜索
移步bdata-cap.com

Faiss 快速入门（1）

Faiss 更快的索引（2）

Faiss低内存占用（3）

Faiss 构建: clustering, PCA, quantization（4）

如何选择Faiss索引（5）
官网地址，你也可以访问我的Github，运行代码。

本文是基于官网整理，为了防止理解偏差或简化，有些直接用英文。另外，也加了一点自己的理解。

数据准备

Faiss可以处理固定维度d的向量集合，typically a few 10s to 100s。向量集合被保存在矩阵中。我们假设行主存储（例如，the j'th component of vector number i is stored in row i, column j of the matrix）。Faiss只能使用32位浮点矩阵。

我们需要两个矩阵：

xb 为语料, that contains all the vectors that must be indexed, and that we are going to search in. 它的大小为nb-by-d

xq 为查询的向量集合, for which we need to find the nearest neighbors. 大小为nq-by-d. 如果我们只有一个查询向量，那么nq=1.

下面例子，我们将学习在d=64维空间中向量，是0-1均匀分布，他们的值在(0,1)范围内。为了增加娱乐性，我们在第一个向量上加个小平移。

import numpy as np d = 64 # dimension nb = 100000 # database size nq = 10000 # nb of queries np.random.seed(1234) # make reproducible xb = np.random.random((nb, d)).astype('float32') xb[:, 0] += np.arange(nb) / 1000. xq = np.random.random((nq, d)).astype('float32') xq[:, 0] += np.arange(nq) / 1000. # import matplotlib.pyplot as plt # plt.hist(xb[6]) # plt.show()

创建一个索引，并向它添加向量

Faiss始终围绕着索引对象展开的. 它封装了数据向量集合, 并且可以对他们进行预处理，以提高搜索效率. 有很多类型的索引, 我们使用最简单的一个，执行暴力L2距离搜索（brute-force L2 distance search）：IndexFlatL2.

所有索引构建时都必须指定向量的维度d。而大多数索引还需要一个训练阶段，以便分析向量的分布。对于IndexFlatL2来说，可以跳过训练这步（因为是暴力搜索，不用分析向量）.

当构建和训练索引后，在索引上执行两个操作：add和search.

向索引添加数据，在xb上调用add方法. 有两个索引的状态变量：

is_trained, 布尔型，表示是否需要训练

ntotal, 被索引的向量集合的大小

一些索引也可以对每个向量存储整型ID(IndexFlatL2不用). 如果不提供ID，使用向量的序号作为id，例如，第一个向量为0，第二个为1……以此类推

import faiss # make faiss available index = faiss.IndexFlatL2(d) # build the index print(index.is_trained) index.add(xb) # add vectors to the index print(index.ntotal)

结果：

True True 100000

搜索

在索引上可以执行的最基本操作是 k-nearest-neighbor search(knn), 例如，对每个向量，在数据库中查找它的 k近邻.

结果保存在大小为 nq-by-k 的矩阵中, 其中，第i行是其向量i的近邻id, 按距离升序排序. 除了k近邻矩阵外, 还会返回一个平方距离(squared distances)的矩阵，其大小为nq-by-k的浮点矩阵。

常用距离计算方法：https://zhuanlan.zhihu.com/p/101277851

先来一个简单测试，用数据库中的小部分向量进行检索，来确保其最近邻确实是向量本身

先用训练数据进行检索，理论上，会返回自己。

k = 4 # we want to see 4 nearest neighbors D, I = index.search(xb[:5], k) # sanity check print(xb.shape) print('I', I) print('D', D)

结果：

(100000, 64) I [[ 0 393 363 78] [ 1 555 277 364] [ 2 304 101 13] [ 3 173 18 182] [ 4 288 370 531]] D [[0. 7.1751733 7.207629 7.2511625] [0. 6.3235645 6.684581 6.7999454] [0. 5.7964087 6.391736 7.2815123] [0. 7.2779055 7.5279865 7.6628466] [0. 6.7638035 7.2951202 7.3688145]]

再用查询向量搜索

D, I = index.search(xq, k) # actual search print('I[:5]', I[:5]) # neighbors of the 5 first queries print('D[:5]', D[:5]) print('-----') print('I[-5:]', I[-5:]) # neighbors of the 5 last queries print('D[-5:]', D[-5:])

结果：

I[:5] [[ 381 207 210 477] [ 526 911 142 72] [ 838 527 1290 425] [ 196 184 164 359] [ 526 377 120 425]] D[:5] [[6.8154984 6.8894653 7.3956795 7.4290257] [6.6041107 6.679695 6.7209625 6.828682 ] [6.4703865 6.8578606 7.0043793 7.036564 ] [5.573681 6.407543 7.1395226 7.3555984] [5.409401 6.232216 6.4173393 6.5743675]] ----- I[-5:] [[ 9900 10500 9309 9831] [11055 10895 10812 11321] [11353 11103 10164 9787] [10571 10664 10632 9638] [ 9628 9554 10036 9582]] D[-5:] [[6.5315704 6.97876 7.0039215 7.013794 ] [4.335266 5.2369385 5.3194275 5.7032776] [6.072693 6.5767517 6.6139526 6.7323 ] [6.637512 6.6487427 6.8578796 7.0096436] [6.2183685 6.4525146 6.548767 6.581299 ]]

结果

进行一下结果的合理性检查，如果是用训练数据搜索，得到如下结果

[[ 0 393 363 78] [ 1 555 277 364] [ 2 304 101 13] [ 3 173 18 182] [ 4 288 370 531]] [[ 0. 7.17517328 7.2076292 7.25116253] [ 0. 6.32356453 6.6845808 6.79994535] [ 0. 5.79640865 6.39173603 7.28151226] [ 0. 7.27790546 7.52798653 7.66284657] [ 0. 6.76380348 7.29512024 7.36881447]]

可以看到：

上面是knn矩阵，结果的确是它自己

下面距离矩阵，相应的距离是0，按升序排序

如果用查询向量搜索，会得到如下结果

[[ 381 207 210 477] [ 526 911 142 72] [ 838 527 1290 425] [ 196 184 164 359] [ 526 377 120 425]] [[ 9900 10500 9309 9831] [11055 10895 10812 11321] [11353 11103 10164 9787] [10571 10664 10632 9638] [ 9628 9554 10036 9582]]

Because of the value added to the first component of the vectors, the dataset is smeared along the first axis in d-dim space. So the neighbors of the first few vectors are around the beginning of the dataset, and the ones of the vectors around ~10000 are also around index 10000 in the dataset.

(END.)
查看全文

相关阅读:
ssh：connect to host xxx port 22: Connection refused---------you don't have permission to acccess the requested location--------remote host identification has changed
YOLOV3算法笔记
 jQuery元素查找方法
 jquery获取元素方式
 jQuery常用事件
 Jquery常用方法
 JavaScript数组操作
 jQuery中.bind() .live() .delegate() .on()的区别
 CSS初始化代码
 jQuery判断浏览器

原文地址：https://www.cnblogs.com/liuning8023/p/12623339.html

Faiss向量相似性搜索

移步bdata-cap.com

Faiss 快速入门（1）

Faiss 更快的索引（2）

Faiss低内存占用（3）

Faiss 构建: clustering, PCA, quantization（4）

如何选择Faiss索引（5）

官网地址 ，你也可以访问我的Github，运行代码。

本文是基于官网整理，为了防止理解偏差或简化，有些直接用英文。另外，也加了一点自己的理解。

数据准备

创建一个索引，并向它添加向量

搜索

结果

官网地址，你也可以访问我的Github，运行代码。