An algorithm for finding related articles by tag

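Suppose there are 10,000 articles, each with 0 to 10 tags, drawn from 1,000 distinct tags. What algorithm finds the other articles most related to a given article the fastest?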

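Use a 1000-bit value (rounded up to 1024 bits) per article to record which tags it has, one bit per tag. Then AND the given article's value with every other article's value and sort by the number of 1 bits in each result: the more 1 bits, the more tags the two articles share.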
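To make the idea concrete, here is a minimal sketch of the bit mask and the AND-then-count step. It is only an illustration under some assumptions: the class name `TagMask` and its methods are hypothetical, tags are assumed to be numbered from 0, the mask is stored as an array of 64-bit words rather than a `BitArray`, and `BitOperations.PopCount` requires .NET Core 3.0 or later.

```csharp
using System;
using System.Numerics;

public static class TagMask
{
    // Number of 64-bit words needed to hold one bit per tag
    // (1000 tags need 16 words, which is the 1024 bits mentioned above).
    public static int WordCount(int tagCount) => (tagCount + 63) / 64;

    // Builds a mask with the bit for each given tag id set (ids start at 0).
    public static ulong[] FromTagIds(int tagCount, params int[] tagIds)
    {
        var mask = new ulong[WordCount(tagCount)];
        foreach (int id in tagIds)
        {
            mask[id / 64] |= 1UL << (id % 64);
        }
        return mask;
    }

    // Counts the tags two articles share: AND word by word, then count the 1 bits.
    public static int SharedTagCount(ulong[] a, ulong[] b)
    {
        int shared = 0;
        for (int w = 0; w < a.Length; w++)
        {
            shared += BitOperations.PopCount(a[w] & b[w]);
        }
        return shared;
    }
}
```

For example, an article with tags 3, 64 and 999 and another with tags 64, 500 and 999 have two tags in common, so `TagMask.SharedTagCount(TagMask.FromTagIds(1000, 3, 64, 999), TagMask.FromTagIds(1000, 64, 500, 999))` returns 2.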

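The rough scale is:
Data transfer: 1024 bits = 128 bytes; 128 bytes × 10,000 articles = 1,280,000 bytes, roughly 1.2 MB (this can be cached and is not too large).
Computation: 10,000 comparisons, each over 1024 bits.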

I should write a sample program to test whether this is feasible.

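using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;

public class Test
{
    // Change these two values to try a different problem size.
    static readonly int tagsCount = 3000;
    static readonly int articleCount = 30000;
    static List<BitArray> articleTags = new List<BitArray>(articleCount);

    public static void Main()
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();

        // One bit per tag for every article. The bits are left unset here,
        // because this test only measures the cost of the comparison.
        for (int i = 0; i < articleCount; i++)
        {
            articleTags.Add(new BitArray(tagsCount));
        }

        // Compare article 0 with every article and record the shared tag count.
        List<CountAndIndex> countsAndIndex = new List<CountAndIndex>(articleCount);
        for (int i = 0; i < articleCount; i++)
        {
            // BitArray.And modifies the instance it is called on, so AND into
            // a copy to keep the tags of article 0 intact.
            BitArray common = new BitArray(articleTags[0]).And(articleTags[i]);
            countsAndIndex.Add(new CountAndIndex(Count(common), i));
        }
        countsAndIndex.Sort();

        sw.Stop();
        Console.WriteLine(sw.Elapsed);
    }

    // Counts the bits that are set in the given BitArray.
    static int Count(BitArray bits)
    {
        int result = 0;
        foreach (bool bit in bits)
        {
            if (bit)
                ++result;
        }
        return result;
    }
}

// Pairs an article index with the number of tags it shares with the query article.
struct CountAndIndex : IComparable<CountAndIndex>
{
    public int Count;
    public int Index;

    public CountAndIndex(int count, int index)
    {
        Count = count;
        Index = index;
    }

    // Sorts in descending order of Count, so the most related articles come first.
    public int CompareTo(CountAndIndex other)
    {
        return other.Count.CompareTo(this.Count);
    }
}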

Elapsed time for different sizes:
1K tags × 10K articles: 00:00:00.4684715
2K tags × 20K articles: 00:00:01.7932927
3K tags × 30K articles: 00:00:04.0203271
10K tags × 100K articles: 00:00:44.2125127
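
The timings above grow roughly in proportion to tags × articles: each row works out to about 45 ns per bit tested. A likely reason is that the test program walks every bit of every result one at a time. The sketch below is one possible variation, not a measured improvement: it assumes each article's tags are packed into an array of 64-bit words so that 64 tags are compared per operation, skips the query article itself, and returns only the top results. The names `RelatedArticles` and `Rank` are hypothetical, and `BitOperations.PopCount` requires .NET Core 3.0 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

public static class RelatedArticles
{
    // Returns the indexes of up to `limit` articles that share the most tags with
    // the article at `queryIndex`, most related first. Articles that share no tag
    // with the query are left out.
    public static List<int> Rank(IReadOnlyList<ulong[]> masks, int queryIndex, int limit)
    {
        ulong[] query = masks[queryIndex];
        var scored = new List<(int Shared, int Index)>();
        for (int i = 0; i < masks.Count; i++)
        {
            if (i == queryIndex) continue; // do not report the article itself
            int shared = 0;
            for (int w = 0; w < query.Length; w++)
            {
                shared += BitOperations.PopCount(query[w] & masks[i][w]);
            }
            if (shared > 0) scored.Add((shared, i));
        }

        // Highest shared count first; ties go to the lower index.
        scored.Sort((x, y) => x.Shared != y.Shared
            ? y.Shared.CompareTo(x.Shared)
            : x.Index.CompareTo(y.Index));

        var result = new List<int>();
        for (int i = 0; i < scored.Count && i < limit; i++)
        {
            result.Add(scored[i].Index);
        }
        return result;
    }
}
```

With five one-word masks `0b1011`, `0b0001`, `0b1010`, `0b0100` and `0b0011`, calling `RelatedArticles.Rank(masks, 0, 2)` returns the indexes 2 and 4, which each share two tags with article 0.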

Original article: https://www.cnblogs.com/zerogo/p/2209120.html