Yesterday, while writing a web crawler, I ran into a site that encodes whatever is typed into its text box before submitting it:
<script type="text/javascript">
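function encodeInput(form) {
    // Percent-encode "+" and "#". The backslash is doubled because "\+" inside
    // a string literal is just "+", which is not a valid pattern on its own.
    var cleanQuery = form.elements['query'].value.replace(new RegExp("\\+", "g"), "%2B");
    cleanQuery = cleanQuery.replace(/#/g, "%23");
    // Assumption: this pattern matched line breaks; it arrived split across two lines.
    cleanQuery = cleanQuery.replace(/(\n)/g, " ");
    cleanQuery = cleanQuery.trim();
    // Pattern that decides which characters get encoded below.
    var ascii = /^[ -~]+$/;
    if (!ascii.test(cleanQuery)) {
        var fixedUseQuery = "";
        for (var i = 0, len = cleanQuery.length; i < len; i++) {
            var str = "";
            if (!ascii.test(cleanQuery[i])) {
                // "%26%23" is the URL-encoded form of "&#".
                str = "%26%23" + cleanQuery[i].charCodeAt(0) + ";";
            } else {
                str = cleanQuery[i];
            }
            fixedUseQuery = fixedUseQuery + str;
        }
        cleanQuery = fixedUseQuery;
    }
    form.elements['query'].value = cleanQuery;
}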
</script>
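As for what the processing actually does: the first few statements use JavaScript's replace method to substitute some special characters, and the rest uses a regular expression to do a special kind of encoding. My input text was: ACM task force on K–12 education and technology. After the script ran, the dash after the K had been encoded. At first I did not understand what the regular expression /^[ -~]+$/ meant. Why couldn't it match the dash? So I wrote a small C# program: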
using System;
using System.Text.RegularExpressions;

string s = "ACM task force on K–12 education and technology";
var ascii = "^[ -~]+$";
var reg = new Regex(ascii);
// Test each character on its own against the pattern.
foreach (var item in s)
{
    if (!reg.IsMatch(item.ToString()))
    {
        Console.WriteLine("current char:" + item.ToString() + " not match");
    }
    else
    {
        Console.WriteLine("current char:" + item.ToString() + " match");
    }
}
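Only after running the program did it suddenly come back to me: inside square brackets, - denotes a range, and to stand for itself it has to be escaped. For example, [0-9] and [a-z] each denote a contiguous range. Then I thought of the ASCII table and looked it up: this pattern matches the characters from space through ~. The dash in K–12 is an en dash (U+2013, decimal 8211), not the ASCII hyphen, so it falls outside that range and gets encoded. The mystery was finally solved.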
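The same check can be repeated directly in JavaScript. The following is only a small sketch: the `literal` pattern and the variable names are mine, added for comparison, and are not part of the site's script.

```javascript
// The pattern the site uses: one or more characters between space and tilde.
const ascii = /^[ -~]+$/;

// For comparison: with the hyphen escaped, the class holds only three
// characters (space, hyphen, tilde) instead of a range.
const literal = /^[ \-~]+$/;

const hyphen = "-";      // U+002D HYPHEN-MINUS, code 45
const enDash = "\u2013"; // U+2013 EN DASH, code 8211 (the dash in "K–12")

console.log(ascii.test(hyphen));   // true: 45 lies between 32 and 126
console.log(ascii.test(enDash));   // false: 8211 is outside the range
console.log(ascii.test("K"));      // true
console.log(literal.test("K"));    // false: "K" is not one of the three characters

// Count how many of the first 256 code units the range accepts.
let count = 0;
for (let code = 0; code < 256; code++) {
  if (ascii.test(String.fromCharCode(code))) count++;
}
console.log(count); // 95, i.e. codes 32 through 126
```

An ordinary keyboard hyphen passes the test; it is only the en dash that gets rejected and therefore encoded.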


In other words, the ASCII characters with decimal codes 32 through 126.
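For the crawler itself, the form is not available, so the same steps have to be applied to a plain string. Below is a rough sketch of how that might look. `encodeQuery` is a hypothetical helper name of my own, and the line-break replacement assumes the site's pattern was a newline match.

```javascript
// Hypothetical helper (not from the site): the same steps as encodeInput,
// but taking and returning a string instead of touching a form.
function encodeQuery(query) {
  const printable = /^[ -~]$/; // exactly one character with code 32..126
  const cleaned = query
    .replace(/\+/g, "%2B") // "+" -> %2B
    .replace(/#/g, "%23")  // "#" -> %23
    .replace(/\n/g, " ")   // assumed: line breaks become spaces
    .trim();
  let out = "";
  for (let i = 0; i < cleaned.length; i++) {
    const ch = cleaned[i];
    // Characters outside the range become "%26%23<code>;",
    // the URL-encoded form of "&#<code>;".
    out += printable.test(ch) ? ch : "%26%23" + ch.charCodeAt(0) + ";";
  }
  return out;
}

const encoded = encodeQuery("ACM task force on K\u201312 education and technology");
console.log(encoded);
// ACM task force on K%26%238211;12 education and technology
```

The sketch always walks the string character by character rather than testing the whole string first. The result is the same, since characters inside the range are copied through unchanged.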