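I ran into an encoding error while writing a Python crawler.
The error was:
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0'
After a lot of searching, I found that \xa0 is the non-breaking space (&nbsp;) used in HTML page source, and GBK has no encoding for it.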
Solution
Replace the character with an ordinary space: replace(u'\xa0', u' ')
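A minimal sketch of the fix above, assuming Python 3 and that the scraped text is already a `str`. The helper name `clean_nbsp` and the sample string are made up for illustration; they are not from the original crawler.

```python
def clean_nbsp(text: str) -> str:
    """Replace non-breaking spaces (U+00A0) with ordinary spaces."""
    return text.replace("\xa0", " ")


# Hypothetical scraped text containing a non-breaking space.
scraped = "price:\xa0100"

# Encoding the raw text to GBK raises the error described above.
try:
    scraped.encode("gbk")
except UnicodeEncodeError as exc:
    print("encode failed:", exc)

# After the replacement the text encodes without error.
cleaned = clean_nbsp(scraped)
print(cleaned.encode("gbk"))
```

If the exact whitespace does not matter, `text.encode("gbk", errors="replace")` is another option, though it substitutes `?` for every unencodable character rather than only for `\xa0`.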
Below are some common symbols in HTML:
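chr | HexCode | Numeric | HTML entity
" | \x22 | &#34; | &quot;
& | \x26 | &#38; | &amp;
< | \x3C | &#60; | &lt;
> | \x3E | &#62; | &gt;
(non-breaking space) | \xA0 | &#160; | &nbsp;
¡ | \xA1 | &#161; | &iexcl;
¢ | \xA2 | &#162; | &cent;
£ | \xA3 | &#163; | &pound;
¤ | \xA4 | &#164; | &curren;
¥ | \xA5 | &#165; | &yen;
¦ | \xA6 | &#166; | &brvbar;
§ | \xA7 | &#167; | &sect;
¨ | \xA8 | &#168; | &uml;
© | \xA9 | &#169; | &copy;
ª | \xAA | &#170; | &ordf;
« | \xAB | &#171; | &laquo;
¬ | \xAC | &#172; | &not;
(soft hyphen) | \xAD | &#173; | &shy;
® | \xAE | &#174; | &reg;
¯ | \xAF | &#175; | &macr;
° | \xB0 | &#176; | &deg;
± | \xB1 | &#177; | &plusmn;
² | \xB2 | &#178; | &sup2;
³ | \xB3 | &#179; | &sup3;
´ | \xB4 | &#180; | &acute;
µ | \xB5 | &#181; | &micro;
¶ | \xB6 | &#182; | &para;
· | \xB7 | &#183; | &middot;
¸ | \xB8 | &#184; | &cedil;
¹ | \xB9 | &#185; | &sup1;
º | \xBA | &#186; | &ordm;
» | \xBB | &#187; | &raquo;
¼ | \xBC | &#188; | &frac14;
½ | \xBD | &#189; | &frac12;
¾ | \xBE | &#190; | &frac34;
¿ | \xBF | &#191; | &iquest;
× | \xD7 | &#215; | &times;
÷ | \xF7 | &#247; | &divide;
ƒ | \u0192 | &#402; | &fnof;
ˆ | \u02C6 | &#710; | &circ;
˜ | \u02DC | &#732; | &tilde;
(en space) | \u2002 | &#8194; | &ensp;
(em space) | \u2003 | &#8195; | &emsp;
(thin space) | \u2009 | &#8201; | &thinsp;
(zero-width non-joiner) | \u200C | &#8204; | &zwnj;
(zero-width joiner) | \u200D | &#8205; | &zwj;
(left-to-right mark) | \u200E | &#8206; | &lrm;
(right-to-left mark) | \u200F | &#8207; | &rlm;
– | \u2013 | &#8211; | &ndash;
— | \u2014 | &#8212; | &mdash;
‘ | \u2018 | &#8216; | &lsquo;
’ | \u2019 | &#8217; | &rsquo;
‚ | \u201A | &#8218; | &sbquo;
“ | \u201C | &#8220; | &ldquo;
” | \u201D | &#8221; | &rdquo;
„ | \u201E | &#8222; | &bdquo;
† | \u2020 | &#8224; | &dagger;
‡ | \u2021 | &#8225; | &Dagger;
• | \u2022 | &#8226; | &bull;
… | \u2026 | &#8230; | &hellip;
‰ | \u2030 | &#8240; | &permil;
′ | \u2032 | &#8242; | &prime;
″ | \u2033 | &#8243; | &Prime;
‹ | \u2039 | &#8249; | &lsaquo;
› | \u203A | &#8250; | &rsaquo;
‾ | \u203E | &#8254; | &oline;
⁄ | \u2044 | &#8260; | &frasl;
€ | \u20AC | &#8364; | &euro;
ℑ | \u2111 | &#8465; | &image;
ℓ | \u2113 | &#8467; |
№ | \u2116 | &#8470; |
℘ | \u2118 | &#8472; | &weierp;
ℜ | \u211C | &#8476; | &real;
™ | \u2122 | &#8482; | &trade;
ℵ | \u2135 | &#8501; | &alefsym;
← | \u2190 | &#8592; | &larr;
↑ | \u2191 | &#8593; | &uarr;
→ | \u2192 | &#8594; | &rarr;
↓ | \u2193 | &#8595; | &darr;
↔ | \u2194 | &#8596; | &harr;
↵ | \u21B5 | &#8629; | &crarr;
⇐ | \u21D0 | &#8656; | &lArr;
⇑ | \u21D1 | &#8657; | &uArr;
⇒ | \u21D2 | &#8658; | &rArr;
⇓ | \u21D3 | &#8659; | &dArr;
⇔ | \u21D4 | &#8660; | &hArr;
∀ | \u2200 | &#8704; | &forall;
∂ | \u2202 | &#8706; | &part;
∃ | \u2203 | &#8707; | &exist;
∅ | \u2205 | &#8709; | &empty;
∇ | \u2207 | &#8711; | &nabla;
∈ | \u2208 | &#8712; | &isin;
∉ | \u2209 | &#8713; | &notin;
∋ | \u220B | &#8715; | &ni;
∏ | \u220F | &#8719; | &prod;
∑ | \u2211 | &#8721; | &sum;
− | \u2212 | &#8722; | &minus;
∗ | \u2217 | &#8727; | &lowast;
√ | \u221A | &#8730; | &radic;
∝ | \u221D | &#8733; | &prop;
∞ | \u221E | &#8734; | &infin;
∠ | \u2220 | &#8736; | &ang;
∧ | \u2227 | &#8743; | &and;
∨ | \u2228 | &#8744; | &or;
∩ | \u2229 | &#8745; | &cap;
∪ | \u222A | &#8746; | &cup;
∫ | \u222B | &#8747; | &int;
∴ | \u2234 | &#8756; | &there4;
∼ | \u223C | &#8764; | &sim;
≅ | \u2245 | &#8773; | &cong;
≈ | \u2248 | &#8776; | &asymp;
≠ | \u2260 | &#8800; | &ne;
≡ | \u2261 | &#8801; | &equiv;
≤ | \u2264 | &#8804; | &le;
≥ | \u2265 | &#8805; | &ge;
⊂ | \u2282 | &#8834; | &sub;
⊃ | \u2283 | &#8835; | &sup;
⊄ | \u2284 | &#8836; | &nsub;
⊆ | \u2286 | &#8838; | &sube;
⊇ | \u2287 | &#8839; | &supe;
⊕ | \u2295 | &#8853; | &oplus;
⊗ | \u2297 | &#8855; | &otimes;
⊥ | \u22A5 | &#8869; | &perp;
⋅ | \u22C5 | &#8901; | &sdot;
⌈ | \u2308 | &#8968; | &lceil;
⌉ | \u2309 | &#8969; | &rceil;
⌊ | \u230A | &#8970; | &lfloor;
⌋ | \u230B | &#8971; | &rfloor;
⟨ | \u2329 | &#9001; | &lang;
⟩ | \u232A | &#9002; | &rang;
◊ | \u25CA | &#9674; | &loz;
♠ | \u2660 | &#9824; | &spades;
♣ | \u2663 | &#9827; | &clubs;
♥ | \u2665 | &#9829; | &hearts;
♦ | \u2666 | &#9830; | &diams;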
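The columns of the table can also be looked up from Python rather than by hand. The following is only a sketch using the standard library's `html` module; the helper name `describe` is hypothetical, and `codepoint2name` covers only the HTML 4.01 named entities, so characters without such a name (for example ℓ and №) come back with an empty entity column.

```python
import html
from html.entities import codepoint2name


def describe(ch: str) -> tuple:
    """Return (chr, hex escape, numeric reference, named entity) for one character."""
    cp = ord(ch)
    # Use the \xNN form for code points up to 0xFF, \uNNNN otherwise.
    hexcode = f"\\x{cp:02X}" if cp <= 0xFF else f"\\u{cp:04X}"
    name = codepoint2name.get(cp)  # None if HTML 4.01 defines no name
    entity = f"&{name};" if name else ""
    return ch, hexcode, f"&#{cp};", entity


print(describe("€"))
print(describe("\xa0"))

# Going the other way: html.unescape turns references back into characters.
print(repr(html.unescape("&nbsp;")))
```

When scraping, calling `html.unescape` on the extracted text and then replacing or encoding the resulting characters explicitly keeps this kind of lookup out of the crawler logic.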