Yesterday, while writing a web crawler, I ran into a website that encodes the input of a text box:
<script type="text/javascript">
function encodeInput(form) {
    // Escape characters that have a special meaning in a URL.
    // The backslash is doubled so the RegExp receives \+ (a literal plus sign).
    var cleanQuery = form.elements['query'].value.replace(new RegExp("\\+", "g"), "%2B");
    cleanQuery = cleanQuery.replace(/#/g, "%23");
    cleanQuery = cleanQuery.replace(/( )/g, " ");
    cleanQuery = cleanQuery.trim();
    var ascii = /^[ -~]+$/;
    if (!ascii.test(cleanQuery)) {
        var fixedUseQuery = "";
        for (var i = 0, len = cleanQuery.length; i < len; i++) {
            var str = "";
            if (!ascii.test(cleanQuery[i])) {
                // %26%23 is the URL-encoded form of "&#", so this emits &#<code>;
                str = "%26%23" + cleanQuery[i].charCodeAt(0) + ";";
            } else {
                str = cleanQuery[i];
            }
            fixedUseQuery = fixedUseQuery + str;
        }
        cleanQuery = fixedUseQuery;
    }
    form.elements['query'].value = cleanQuery;
}
</script>
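What exactly does it do? The first few statements use the JavaScript replace method to substitute some special characters, and the rest uses a regular expression to do a special kind of encoding. My input was: ACM task force on K–12 education and technology. After the script ran, the dash after the K had been encoded. At first I did not understand the regex /^[ -~]+$/: why would it fail to match "-"? So I wrote a small C# program: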
using System;
using System.Text.RegularExpressions;

string s = "ACM task force on K–12 education and technology";
var ascii = "^[ -~]+$";
var reg = new Regex(ascii);
// Test each character on its own against the pattern.
foreach (var item in s)
{
    if (!reg.IsMatch(item.ToString()))
    {
        Console.WriteLine("current char:" + item.ToString() + " not match");
    }
    else
    {
        Console.WriteLine("current char:" + item.ToString() + " match");
    }
}
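After running the program it suddenly came back to me: inside square brackets, - denotes a range, and to stand for a literal hyphen it has to be escaped (or placed first or last in the class). [0-9] and [a-z], for example, each describe a contiguous range. Then I thought of the ASCII table and looked it up: this regex matches the characters from space through ~. The character after the K in my text is actually an en dash (–, U+2013) rather than an ASCII hyphen, so it falls outside that range and gets encoded. Mystery solved.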
In other words, the characters with ASCII decimal codes 32 through 126.
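To double-check that reading of the character class, here is a small JavaScript sketch. It is not the site's code: `encodeNonAscii` is a hypothetical helper I made up that mirrors only the per-character branch of `encodeInput`, working on a plain string instead of a form. It assumes the input contains only characters in the Basic Multilingual Plane.

```javascript
// The character class from the site's script: every character from
// space (U+0020) through tilde (U+007E).
const ascii = /^[ -~]+$/;

// Collect every character code below 256 that the class accepts.
const matched = [];
for (let code = 0; code < 256; code++) {
  if (ascii.test(String.fromCharCode(code))) {
    matched.push(code);
  }
}
const firstMatch = matched[0];
const lastMatch = matched[matched.length - 1];

// Hypothetical helper (not from the site): keeps characters inside the
// range and rewrites everything else as %26%23<decimal code>;
function encodeNonAscii(text) {
  let out = "";
  for (const ch of text) {
    out += ascii.test(ch) ? ch : "%26%23" + ch.charCodeAt(0) + ";";
  }
  return out;
}

// \u2013 is the en dash from the original input.
const sample = "ACM task force on K\u201312 education and technology";
const encoded = encodeNonAscii(sample);

console.log(firstMatch, lastMatch, matched.length); // 32 126 95
console.log(encoded); // ACM task force on K%26%238211;12 education and technology
```

An ordinary hyphen (code 45) lies inside the range, so `encodeNonAscii("a-b")` returns the string unchanged; only the en dash, code 8211, gets rewritten.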