Suppose the content is HTML that contains a number of links, and you want to extract each link's URL and title from it.
For example:
<a href='http://www.xx.cn/art/2017/12/26/art_8801_1776064.html' title='标题1' target="_blank"></a> <p>2017-12-26</p> </li> ]]></record>
<record><![CDATA[
<li> <a href='http://www.xx.gov.cn/art/2017/12/26/art_8801_1776063.html' title='标题2' target="_blank"></a> <p>2017-12-26</p> </li> ]]></record>
<record><![CDATA[
<li> <a href='http://www.xx.gov.cn/art/2017/12/26/art_8801_1776060.html' title='标题3' target="_blank"></a> <p>2017-12-26</p> </li> ]]></record>
<record><![CDATA[
<li> <a href='http://www.xx.gov.cn/art/2017/12/26/art_8801_1776059.html' title='标题4' target="_blank"></a> <p>2017-12-26</p> </li> ]]></record>
<record><![CDATA[
<li> <a href='http://www.xx.gov.cn/art/2017/12/25/art_8801_1775473.html' title='标题5' target="_blank"></a> <p>2017-12-25</p> </li> ]]></record>
<record><![CDATA[
<li> <a href='http://www.xx.gov.cn/art/2017/12/22/art_8801_1775476.html' title='标题6' target="_blank"></a> <p>2017-12-22</p> </li> ]]></record>
<record><![CDATA[
Method: regular expression
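using System.Text.RegularExpressions;

string htmlcontext = "";  // the HTML content to search

// url:   the href value (double-quoted, single-quoted, or unquoted)
// title: the text between <a ...> and </a>
// .*? is lazy so that a match does not run across several anchors on one line
Regex regex = new Regex(@"<a.*?href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^>\s]+)).*?>(?<title>[^<>]*)<[^</a>]*/a>", RegexOptions.IgnoreCase);

for (Match m = regex.Match(htmlcontext); m.Success; m = m.NextMatch())
{
    string stringurl = m.Groups["url"].Value;
    string stringtitle = m.Groups["title"].Value;
}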
Output:
http://www.xx.cn/art/2017/12/26/art_8801_1776064.html 标题1
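
In the sample as shown above, the anchors have no inner text and each title sits in the `title` attribute instead. For input of that shape, a variant that reads `href` and `title` from each opening `<a>` tag may be more suitable. The following is only a minimal sketch, assuming C# 9 or later (top-level statements) and attribute values that are quoted and contain no `>`. The names `html`, `anchorTag`, `attribute`, and `links` are illustrative and not from the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sample input in the same shape as the records above (first two records only).
string html =
    "<li> <a href='http://www.xx.cn/art/2017/12/26/art_8801_1776064.html' title='标题1' target=\"_blank\"></a> <p>2017-12-26</p> </li>\n" +
    "<li> <a href='http://www.xx.gov.cn/art/2017/12/26/art_8801_1776063.html' title='标题2' target=\"_blank\"></a> <p>2017-12-26</p> </li>\n";

// Match one opening <a ...> tag at a time; [^>]* keeps the match inside the tag.
Regex anchorTag = new Regex(@"<a\b[^>]*>", RegexOptions.IgnoreCase);

// Match name='value' or name="value"; the backreference \k<q> requires
// the closing quote to be the same character as the opening quote.
Regex attribute = new Regex(@"\b(?<name>[\w-]+)\s*=\s*(?<q>[""'])(?<value>.*?)\k<q>");

var links = new List<(string Url, string Title)>();
foreach (Match tag in anchorTag.Matches(html))
{
    string url = "", title = "";
    foreach (Match attr in attribute.Matches(tag.Value))
    {
        string name = attr.Groups["name"].Value.ToLowerInvariant();
        if (name == "href") url = attr.Groups["value"].Value;
        else if (name == "title") title = attr.Groups["value"].Value;
    }
    // Keep only anchors that actually carry an href.
    if (url != "") links.Add((url, title));
}

foreach (var link in links)
    Console.WriteLine(link.Url + " " + link.Title);
```

With the two sample records, this prints one line per link, each URL followed by its title (标题1, then 标题2). Splitting the work into two small patterns keeps the result independent of attribute order. For arbitrary real-world HTML, a dedicated HTML parser is generally more robust than regular expressions.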