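Tools: Python 3
Explanation: Ajax is a technique for building fast, dynamic web pages. It updates part of a page without reloading the whole page.
Goal: scrape a Douban page that loads its data via Ajax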
import urllib.parse
import urllib.request

# The URL is taken from the captured GET request (packet capture),
# not from the address shown in the browser
url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=80"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
}

# Form data as shown in Fiddler's WebForms tab
formdata = {
    "page_limit": "20",
    "page_start": "80",
    "sort": "recommend",
    "tag": "热门",
    "type": "movie",
}
data = urllib.parse.urlencode(formdata)
data = bytes(data, "utf8")

# Note: passing data= makes urllib send the request as a POST
request = urllib.request.Request(url, data=data, headers=headers)
response = urllib.request.urlopen(request).read()  # read() returns bytes
# response = response.decode("utf-8")

with open("douban.json", "w") as f:
    f.write(str(response))
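After running the code above, pasting the contents of douban.json into json.cn to parse it produces the following error:
This means the file is not valid JSON and could not be parsed. read() returns bytes, and calling str() on a bytes object writes its literal form (b'...') rather than the text itself. Decode the response before writing it: response = response.decode("utf-8")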
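The difference between str() and decode() can be checked offline. The following is a minimal sketch, not part of the original script: the `raw` value is a made-up stand-in for the bytes returned by the server, and `is_valid_json` is a hypothetical helper introduced only for this illustration.

```python
import json

# Made-up stand-in for the bytes returned by urlopen(...).read()
raw = '{"title": "示例"}'.encode("utf-8")

as_repr = str(raw)             # the bytes literal, starting with b'
as_text = raw.decode("utf-8")  # the actual JSON text


def is_valid_json(text):
    """Hypothetical helper: return True if text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(as_repr.startswith("b'"))  # True
print(is_valid_json(as_repr))    # False
print(is_valid_json(as_text))    # True
```

Only the decoded text parses, which matches the error seen when the undecoded output is pasted into a JSON viewer.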
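With the response decoded, douban.json contains correctly formatted JSON:
Note that the URL's query string already contains every field in formdata, so try removing formdata: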
import urllib.request

url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=80"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
}

# formdata = {
#     "page_limit": "20",
#     "page_start": "80",
#     "sort": "recommend",
#     "tag": "热门",
#     "type": "movie",
# }
# data = urllib.parse.urlencode(formdata)
# data = bytes(data, "utf8")

# No data= argument, so this is sent as a plain GET request
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request).read()
response = response.decode("utf-8")  # bytes -> str

# Write the decoded text as UTF-8 so the Chinese titles are preserved
with open("douban.json", "w", encoding="utf-8") as f:
    f.write(response)
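The claim that the URL already carries every form field can be verified without touching the network. This is a small sketch using only the standard library; the variable name `query_fields` is introduced here for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=80"
formdata = {
    "page_limit": "20",
    "page_start": "80",
    "sort": "recommend",
    "tag": "热门",
    "type": "movie",
}

# Decode the query string back into a dict of field -> value
query_fields = dict(parse_qsl(urlsplit(url).query))

print(query_fields == formdata)    # True
print(urlencode({"tag": "热门"}))  # tag=%E7%83%AD%E9%97%A8
```

The percent-encoded value of tag in the URL is just the UTF-8 encoding of 热门, so the form data duplicates what the query string already says.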
Running this version produces the same result as before!
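One detail worth knowing when comparing the two versions: urllib picks the HTTP method based on whether data is supplied. The sketch below only constructs Request objects and sends nothing. That the server returns the same content for both is the observation above, presumably because it reads the parameters from the query string.

```python
import urllib.request

url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=80"

# With a body, urllib sends a POST; without one, a GET
with_body = urllib.request.Request(url, data=b"type=movie")
without_body = urllib.request.Request(url)

print(with_body.get_method())     # POST
print(without_body.get_method())  # GET
```

So the first script was actually issuing a POST, while the version without formdata issues the same GET request that was captured.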