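0. Which tool to use for scraping the site
In the previous two games I read the puzzles by eye and typed them in by hand, which made solving inconvenient. The special daily battle puzzles in particular are large, and entering them cell by cell takes a long time. Is there a way to grab a puzzle quickly and write it straight to a file? The answer is a crawler, i.e. web scraping.
The catch is that the puzzle team club sites are well protected against scraping: the static page source contains no puzzle data, and I could not work the puzzle out from the captured network packets either (honestly, there were just too many of them), so a headless browser is the only option.
I started with PhantomJS. The Python code that fetches the page source is as follows: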
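from selenium import webdriver

def getChessByPhantomJS():
    # webdriver.PhantomJS is only available in older Selenium releases
    driver = webdriver.PhantomJS()
    driver.get('https://www.puzzle-dominosa.com/?size=8')
    source = driver.page_source
    driver.quit()
    return source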
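The result was disappointing, though: all I got back was a bare template page without the puzzle.
How does Chrome fare? (If you don't know how to set up headless Chrome, you can look it up on Baidu.)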
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def getChessByChrome():
    # Location of the chromedriver executable; adjust for your machine
    path = r'D:chromedriver.exe'
    chrome_options = Options()
    # The next two arguments are the usual boilerplate for headless mode
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(executable_path=path,chrome_options=chrome_options)
    try:
        driver.get('https://www.puzzle-dominosa.com/?size=8')
    except Exception as e:
        print(e)
    source = driver.page_source
    driver.quit()
    return source
Output (or rather, the output while it is running, because the thing never exits):
DevTools listening on ws://127.0.0.1:62344/devtools/browser/8c9f8f4a-407a-4045-b41c-b9f898d4d37b
[1203/174652.884:INFO:CONSOLE(1)] "Uncaught TypeError: window.googletag.pubads is not a function", source: https://www.puzzle-dominosa.com/build/js/public/new/dominosa-95ac3646ef.js (1)
We can give the program a page-load timeout so that it exits:
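def getChessByChrome():
    # Location of the chromedriver executable; adjust for your machine
    path = r'D:chromedriver.exe'
    chrome_options = Options()
    # The next two arguments are the usual boilerplate for headless mode
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(executable_path=path,chrome_options=chrome_options)
    try:
        # Stop waiting for the page after 30 seconds; get() then raises
        driver.set_page_load_timeout(30)
        driver.get('https://www.puzzle-dominosa.com/?size=8')
    except Exception as e:
        print(e)
    # Whatever has been rendered so far is still available here
    source = driver.page_source
    driver.quit()
    return source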
With that, the page source can be handed to a parsing function that outputs the puzzle.
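The keyword arguments executable_path and chrome_options used above belong to the older Selenium 3 style. As far as I know, Selenium 4 expects a Service object and an options argument instead, so here is a rough sketch of the same fetch in that style. The function name fetch_page_source and its parameters are my own and hypothetical; the Selenium imports are done inside the function only so that the snippet can be loaded on a machine without Selenium, and I have not run it against the site.

```python
def fetch_page_source(url, driver_path, timeout=30):
    """Fetch the rendered page source with headless Chrome (Selenium 4 style).

    fetch_page_source, url, driver_path and timeout are hypothetical names
    for this sketch; they do not appear in the original script.
    """
    # Imported lazily so that defining the function does not need Selenium.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')

    driver = webdriver.Chrome(service=Service(executable_path=driver_path),
                              options=options)
    try:
        # get() raises TimeoutException once the page takes longer than this.
        driver.set_page_load_timeout(timeout)
        try:
            driver.get(url)
        except TimeoutException as e:
            # The page never finishes loading, but the puzzle may already
            # be rendered, so keep going and read what is there.
            print(e)
        return driver.page_source
    finally:
        # Always close the browser, even if something else went wrong.
        driver.quit()
```

A call would look like fetch_page_source('https://www.puzzle-dominosa.com/?size=8', r'D:chromedriver.exe'). The try/finally is a design choice of the sketch: it guarantees that the headless browser process is closed even when an unexpected exception occurs.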
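1. How to get the dominosa puzzle
We are not done yet, though: what we want is the puzzle itself. For that we need to look at the markup:
Figure 1. Puzzle markup of the dominosa game
See? The puzzle is reflected directly in the class names: cell3 corresponds to the number 3 in the puzzle, and once the number of sibling elements exceeds the row length of the puzzle, the puzzle wraps to the next row.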
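To make the idea concrete without a browser, here is a small sketch of the class-name decoding on its own. It assumes that each cell's class attribute looks like "cell cell3", that is, a second token made of the prefix cell followed by the number; the first token, the function name cells_to_text and the sample data are hypothetical and only serve the illustration.

```python
def cells_to_text(class_attrs, columns):
    """Turn per-cell class strings into puzzle text, one board row per line."""
    rows, current = [], []
    for attr in class_attrs:
        # Second class token with the 4-character prefix 'cell' removed,
        # e.g. 'cell cell3' -> '3'.
        current.append(attr.split(' ')[1][4:])
        if len(current) == columns:
            # The row is full, so the puzzle wraps to the next line.
            rows.append(' '.join(current))
            current = []
    return '\n'.join(rows)


# Hypothetical class strings for a tiny board with 2 rows and 3 columns.
sample = ['cell cell0', 'cell cell0', 'cell cell1',
          'cell cell0', 'cell cell1', 'cell cell1']
print(cells_to_text(sample, 3))
```

For the sample this prints two lines, "0 0 1" and "0 1 1".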
The code can be written like this:
from lxml import etree

def solve():
    source = getChessByChrome()
    htree = etree.HTML(source)
    # Total number of cells on the board
    chessSize = len(htree.xpath('//div[@id="game"]/div/div/div/..'))
    # Ordinary puzzles carry their ID in a <span>; special puzzles only have a title
    puzzleId = htree.xpath('//div[@class="puzzleInfo"]/p/span/text()')
    if len(puzzleId) != 0:
        puzzleId = puzzleId[0]
    else:
        puzzleId = htree.xpath('//div[@class="puzzleInfo"]/p/text()')[0]
    # The board has x rows and x+1 columns, so chessSize = x * (x+1)
    x = (round((4 * chessSize + 1)**0.5) - 1) // 2
    print(x)
    print(x+1)
    chess = ''
    for i,className in enumerate(htree.xpath('//div[@id="game"]/div/div/div/..')):
        # 'cell3' -> '3'
        value = className.xpath('./@class')[0].split(' ')[1][4:]
        if i % (x+1) == x:
            # Last cell of a row: start a new line
            chess += value + '\n'
        else:
            chess += value + ' '
    # chess[:-1] drops the trailing newline
    with open('dominosaChess' + puzzleId + '.txt','w') as f:
        f.write(chess[:-1])
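This gives us the puzzle file required by the post "Solving the dominosa game with Dancing Links X".
Incidentally, to make puzzles easy to look up, the output filename contains the puzzle ID; for a special puzzle it contains the title of the special puzzle instead.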
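The expression used for x in solve() deserves a short justification. A board with r rows and r+1 columns has N = r(r+1) cells, so 4N + 1 = (2r+1)^2, which gives r = (sqrt(4N+1) - 1) / 2. The sketch below checks this with integer arithmetic. Note that it uses math.isqrt (Python 3.8+) rather than the round(...**0.5) of the original code, and that the name board_shape is hypothetical.

```python
import math


def board_shape(cell_count):
    """Return (rows, columns) for a board with rows * (rows + 1) cells."""
    # 4N + 1 = (2r + 1)^2, so the integer square root is exactly 2r + 1.
    rows = (math.isqrt(4 * cell_count + 1) - 1) // 2
    columns = rows + 1
    if rows * columns != cell_count:
        raise ValueError('%d cells do not form an r x (r+1) board' % cell_count)
    return rows, columns


# The puzzle fetched with size=8 has 72 cells.
print(board_shape(72))
```

For 72 cells, 4 * 72 + 1 = 289 = 17^2, so the board has (17 - 1) // 2 = 8 rows and 9 columns, which matches the sample output file.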
Here is a sample result for comparison with the puzzle (filename dominosaChess7,092,762.txt):
4 5 2 2 7 3 3 0 6
2 7 5 6 2 6 4 1 5
4 4 5 6 0 2 6 0 2
7 3 3 5 0 0 3 4 4
0 1 3 3 4 1 3 2 1
5 7 0 5 3 2 1 1 6
1 6 6 7 5 2 6 7 1
7 4 0 0 4 5 1 7 7
The corresponding puzzle screenshot:
Figure 2. The puzzle with ID 7,092,762
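2. How to get the star battle puzzle
Producing the puzzle file required by the post "Solving the star battle game with depth-first search (DFS)" takes a bit more work.
Take a look at the figure below:
Figure 3. star battle puzzle markup
The class names in this markup are meaningful: for example, bl means there is a dividing line on the left side of the cell, and br means there is one on the right.
The markup only gives us the dividing lines, whereas what we need is a layout that says which block each cell belongs to. To get there we use BFS, breadth-first search: start from a cell that has no block number yet and spread to every neighbour that is not separated from it by a dividing line.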
from copy import deepcopy
from lxml import etree

# url (the address of the puzzle page) and getChessByFile() (returns the page
# source) are not shown in this post; they are assumed to be defined elsewhere.
def solve():
    # Derive the limit written on the first line of the file from the size parameter
    if url.find('size=') == -1:
        limit = 1
    else:
        size = url.split('size=')[1]
        size = int(size)
        if size >= 1 and size <= 4:
            limit = 1
        elif size <= 6:
            limit = 2
        elif size <= 8:
            limit = 3
        else:
            limit = size - 5
    source = getChessByFile()
    htree = etree.HTML(source)
    chessSize = len(htree.xpath('//div[@id="game"]/div/div'))
    puzzleId = htree.xpath('//div[@class="puzzleInfo"]/p/span/text()')
    if len(puzzleId) != 0:
        puzzleId = puzzleId[0]
    else:
        puzzleId = htree.xpath('//div[@class="puzzleInfo"]/p/text()')[0]
    # Side length of the square board
    chessSize = round(chessSize**0.5)
    chess = [[-1 for _ in range(chessSize)] for __ in range(chessSize)]
    borderss = [['' for _ in range(chessSize)] for __ in range(chessSize)]
    chessStr = ''
    maxBlockNumber = 0
    # br: border on the right; bl: on the left; bb: at the bottom; bt: at the top
    for i,className in enumerate(htree.xpath('//div[@id="game"]/div/div[contains(@class,"cell")]')):
        x = i // chessSize
        y = i % chessSize
        value = className.xpath('./@class')[0]
        if value[:4] != 'cell':
            continue
        # Keep only the border flags of the cell
        value = value.replace('cell selectable','')
        value = value.replace('cell-off','')
        borderss[x][y] = value
    for i in range(chessSize):
        for j in range(chessSize):
            if chess[i][j] != -1:
                continue
            # (i, j) has no block yet: start a new block and flood it with BFS
            queue = [(i, j)]
            chess[i][j] = str(maxBlockNumber)
            while len(queue) > 0:
                oldQueue = deepcopy(queue)
                queue = []
                for pos in oldQueue:
                    x, y = pos[0], pos[1]
                    # up
                    if x > 0 and borderss[x][y].find('bt') == -1 and chess[x-1][y] == -1:
                        queue.append((x-1, y))
                        chess[x-1][y] = chess[i][j]
                    # down
                    if x < chessSize - 1 and borderss[x][y].find('bb') == -1 and chess[x+1][y] == -1:
                        queue.append((x+1, y))
                        chess[x+1][y] = chess[i][j]
                    # left
                    if y > 0 and borderss[x][y].find('bl') == -1 and chess[x][y-1] == -1:
                        queue.append((x, y-1))
                        chess[x][y-1] = chess[i][j]
                    # right
                    if y < chessSize - 1 and borderss[x][y].find('br') == -1 and chess[x][y+1] == -1:
                        queue.append((x, y+1))
                        chess[x][y+1] = chess[i][j]
            # The block is complete; the next one gets the next number
            maxBlockNumber += 1
    chessStr = '\n'.join(' '.join(chessRow) for chessRow in chess)
    with open('starBattleChess' + puzzleId + '.txt','w') as f:
        f.write(str(limit)+'\n'+chessStr)
Here is a sample result for comparison with the puzzle (filename starBattleChess3,876,706.txt):
1
0 0 1 1 2
0 0 3 1 2
0 0 3 4 4
0 3 3 4 4
0 3 3 4 4
The corresponding puzzle screenshot:
Figure 4. The puzzle with ID 3,876,706
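To make the BFS step easier to follow, here is a minimal sketch of it separated from the scraping. It is not the code used above: borders are stored as sets of flags instead of class strings, the queue is a collections.deque, and the helper borders_from_regions is hypothetical, added only to build border input from the 5x5 block layout of the sample output so that the result can be compared with it.

```python
from collections import deque

# (flag that blocks the move, row step, column step)
STEPS = (('bt', -1, 0), ('bb', 1, 0), ('bl', 0, -1), ('br', 0, 1))


def borders_from_regions(regions):
    """Hypothetical helper: derive each cell's border flags from a block layout."""
    n = len(regions)
    borders = [[set() for _ in range(n)] for _ in range(n)]
    for r in range(n):
        for c in range(n):
            for flag, dr, dc in STEPS:
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < n and 0 <= nc < n)
                # A border exists at the board edge or between two blocks.
                if outside or regions[nr][nc] != regions[r][c]:
                    borders[r][c].add(flag)
    return borders


def label_regions(borders):
    """Number the blocks in row-major order of discovery using BFS."""
    n = len(borders)
    labels = [[-1] * n for _ in range(n)]
    next_label = 0
    for r0 in range(n):
        for c0 in range(n):
            if labels[r0][c0] != -1:
                continue
            # Unlabelled cell: it starts a new block.
            labels[r0][c0] = next_label
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                for flag, dr, dc in STEPS:
                    if flag in borders[r][c]:
                        continue  # a dividing line blocks this direction
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < n and 0 <= nc < n and labels[nr][nc] == -1:
                        labels[nr][nc] = next_label
                        queue.append((nr, nc))
            next_label += 1
    return labels


# Block layout taken from the sample file starBattleChess3,876,706.txt.
SAMPLE = [
    [0, 0, 1, 1, 2],
    [0, 0, 3, 1, 2],
    [0, 0, 3, 4, 4],
    [0, 3, 3, 4, 4],
    [0, 3, 3, 4, 4],
]

for row in label_regions(borders_from_regions(SAMPLE)):
    print(' '.join(map(str, row)))
```

Because blocks are numbered in the order in which their first cell is met when scanning row by row, the labels recovered from the borders are the same as the layout in the sample file.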