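Earlier posts covered annotating Chinese characters with pinyin and matching input against pinyin, but neither could handle polyphonic characters (characters with more than one reading).
For example:
Chinese characters: 不能说的秘密
Pinyin: bu|fou nai|neng shuo|shui|yue de|di bi|mi mi
Inputs such as bunengshuodebimi, bunengshuodemimi, bnsdmm, bnengyuedebimi and buneshdimmi should all match.
Many of these characters are polyphonic, and working out the exact reading of each one is hard. So here we simply try to match against every reading.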
Approach:
We start by matching on the first letter (initial) of each character's pinyin.
The initials of the characters are: b|f n s|y d b|m m
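The list of initials can be derived mechanically from the pinyin string. The following is only a minimal sketch: the class `InitialsSketch` and its two methods are hypothetical names, not part of the original code, and the separator between readings is passed in as a parameter because the example uses `|` while the matching code splits on `,`.

```csharp
using System;
using System.Collections.Generic;

public static class InitialsSketch
{
    // Hypothetical helper: returns the distinct first letters of one
    // character's readings, in order of first appearance.
    // Example: "shuo|shui|yue" with separator '|' gives "s|y".
    public static string InitialsOf(string readings, char separator)
    {
        var initials = new List<string>();
        foreach (string reading in readings.Split(separator))
        {
            if (reading.Length == 0) continue;      // ignore empty pieces
            string initial = reading.Substring(0, 1);
            if (!initials.Contains(initial)) initials.Add(initial);
        }
        return string.Join(separator.ToString(), initials);
    }

    // Applies InitialsOf to every space-separated character entry.
    public static string InitialsOfPhrase(string pinyin, char separator)
    {
        var parts = new List<string>();
        foreach (string entry in pinyin.Split(' '))
        {
            parts.Add(InitialsOf(entry, separator));
        }
        return string.Join(" ", parts);
    }
}
```

For the example phrase, `InitialsSketch.InitialsOfPhrase("bu|fou nai|neng shuo|shui|yue de|di bi|mi mi", '|')` gives `b|f n s|y d b|m m`.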
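Suppose the pattern is nengshuodemimi (call it target).
Find a character whose initials include target[0], then keep extending over the following characters as long as their initials match later letters of target.
Once that run of characters is found, scan their pinyin to see whether it contains the pattern. If it does, return true.
That explanation is not very clear, so let's look at the code.
First, a class for looking up the pinyin of Chinese characters. The pinyin data file is given later.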
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    public static class PinyinHelper
    {
        // Maps a character's code point to its pinyin string.
        private static readonly Dictionary<int, string> pinyindictionary = new Dictionary<int, string>();

        // Open question: how should a static class's members be initialized?
        // (A static constructor is one option.) Here Init() must be called
        // once before any lookup.
        public static void Init()
        {
            using (StreamReader reader = new StreamReader("Data/pinyindata.txt", Encoding.UTF8))
            {
                string readline;
                while ((readline = reader.ReadLine()) != null)
                {
                    // Each line: hexadecimal code point, a space, then the pinyin.
                    string[] lines = readline.Split(' ');
                    if (lines.Length < 2)
                    {
                        continue; // skip blank or malformed lines
                    }
                    pinyindictionary[Convert.ToInt32(lines[0], 16)] = lines[1];
                }
            }
        }

        // Throws KeyNotFoundException if ch is not in the data file.
        private static string GetPinyin(char ch)
        {
            return pinyindictionary[ch];
        }

        // Returns the pinyin of each character, separated by spaces.
        public static string GetPinyin(string hanzis)
        {
            StringBuilder builder = new StringBuilder();
            for (int i = 0; i < hanzis.Length; i++)
            {
                if (i > 0)
                {
                    builder.Append(' ');
                }
                builder.Append(GetPinyin(hanzis[i]));
            }
            return builder.ToString();
        }

        // Is the character a Chinese character? (Both end points included.)
        private static bool IsCharChinese(char c)
        {
            return 0x4e00 <= c && c <= 0x9fa5;
        }
    }
Next comes the matching algorithm.
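    // Returns true if any reading in pinyin starts with ch.
    // Readings are split on ',' here (the example above writes them with '|').
    private bool IsFirstPinyinContains(string pinyin, char ch)
    {
        string[] pinyins = pinyin.Split(',');
        foreach (string py in pinyins)
        {
            if (py.Length > 0 && py[0] == ch)
            {
                return true;
            }
        }
        return false;
    }

    // Scans pinyins[start..end) character by character and returns true once
    // every character of target has been found in order (a subsequence match).
    private bool IsPinyinContainTarget(string[] pinyins, string target, int start, int end)
    {
        int k = 0;
        for (int i = start; i < end; i++)
        {
            foreach (char ch in pinyins[i])
            {
                if (ch == target[k])
                {
                    k++;
                    if (k == target.Length)
                    {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    private bool PinyinMatch(string[] pinyins, string target)
    {
        if (string.IsNullOrEmpty(target))
        {
            return false;
        }
        for (int i = 0; i < pinyins.Length; i++)
        {
            // Find a character with a reading that starts with target[0].
            if (IsFirstPinyinContains(pinyins[i], target[0]))
            {
                int start = i;
                int j = i + 1;
                // Extend the window by one character each time a later letter
                // of target matches the initial of the next character.
                for (int k = 1; k < target.Length; k++)
                {
                    if (j < pinyins.Length && IsFirstPinyinContains(pinyins[j], target[k]))
                    {
                        j++;
                    }
                }
                int end = j;
                // Check whether the pinyin from start to end contains target.
                if (IsPinyinContainTarget(pinyins, target, start, end))
                {
                    return true;
                }
            }
        }
        return false;
    }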
The algorithm has some shortcomings. Does anyone have suggestions?
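One concrete shortcoming: the containment check scans every reading of a character in sequence, so letters from different readings can combine. The input bo, for instance, matches 不 by taking b from bu and o from fou. The following is a rough sketch of one possible stricter variant, not the method used above: each character must contribute a non-empty prefix of a single reading, and the pattern must be consumed by consecutive characters. The class `StrictPinyinMatcher` and its methods are hypothetical names, and readings are assumed to be comma-separated as in the matching code.

```csharp
using System;

public static class StrictPinyinMatcher
{
    // Returns true if target can be spelled by consecutive characters,
    // each contributing a non-empty prefix of ONE of its readings.
    // pinyins[i] holds the comma-separated readings of character i.
    public static bool Match(string[] pinyins, string target)
    {
        if (string.IsNullOrEmpty(target)) return false;
        for (int start = 0; start < pinyins.Length; start++)
        {
            if (MatchFrom(pinyins, start, target, 0)) return true;
        }
        return false;
    }

    // Tries to consume target from position pos onward, beginning with
    // the character at position index.
    private static bool MatchFrom(string[] pinyins, int index, string target, int pos)
    {
        if (pos == target.Length) return true;      // whole pattern consumed
        if (index == pinyins.Length) return false;  // ran out of characters
        foreach (string reading in pinyins[index].Split(','))
        {
            // Try every prefix of this reading that agrees with target.
            for (int len = 1; len <= reading.Length && pos + len <= target.Length; len++)
            {
                // A mismatch rules out all longer prefixes of this reading.
                if (reading[len - 1] != target[pos + len - 1]) break;
                if (MatchFrom(pinyins, index + 1, target, pos + len)) return true;
            }
        }
        return false;
    }
}
```

With the readings of 不能说的秘密, this sketch still accepts bnsdmm, buneshdimmi and nengshuodemimi, but rejects bo. It backtracks over readings and prefix lengths, so the worst case grows quickly with the number of polyphonic characters. That should be acceptable for short phrases, but it is a trade-off to keep in mind.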