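I'm now in the second semester of my second year of graduate school, and I've been writing code for this retrieval project for almost two years. I first got involved with it in the second semester of my senior year as an undergraduate. During my first year of grad school I had no idea how to do research, I was the only person on the project, I wrote code haphazardly, and I had too much on my mind. One word sums it up: chaos.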
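Then, in October of the first semester of my second year, a classmate joined, and you could say that was when the project really started. After our system was finished, I decided on a whim to clean up the code I had written before. We had a paper to write, which meant a lot of data processing for the experimental comparison section. I had already written similar data-processing code back in my first year, yet I ended up rewriting it, because writing it again took less time than recalling how the old code worked. That is how I found I had written a lot of the same code more than once, and that is why I started organizing it. What I really wanted was to combine a file-tree data structure with a data-processing workflow and turn our data-processing module into a pipeline, so that the code could be reused from then on. Whoever does the grunt work should always look for ways to work more efficiently. With that idea in mind I implemented the class below. It's written in Python, because Python is so convenient for data processing, string handling, and batch jobs. This class may well be useful only to me. So why write a blog post about it? Probably because when I later mentor a first-year graduate student on a paper, I'll want him to read and reuse the code we've already written. I won't have much time to mentor a newcomer, so I'll point him to this blog.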
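My data structure is simply a multi-way (n-ary) tree that represents a directory structure. Each node is a file (or folder), and the algorithms for traversing the tree and for adding nodes are implemented with a stack and a queue. Here's the code; I'll come back and add comments when I have time: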
import os
from collections import deque

# strOp and tblOp are my own helper modules (not shown in this post).
from strOp import strExt
from tblOp import tblConcat


class FileNode:
    def __init__(self, _fileName_s='', _brothers=None, _sons=None, _isDir_b=False, _parent=None):
        self.fileName_s = _fileName_s
        self.bro = _brothers    # next sibling, or None for the last son
        # Create a fresh list per node; a default of [] would be shared by every node.
        self.sons = _sons if _sons is not None else []
        self.isDir_b = _isDir_b
        self.parent = _parent


def _appendSon(parentNode, newNode):
    # Link newNode as the next brother of the current last son, then append it.
    if len(parentNode.sons) != 0:
        parentNode.sons[-1].bro = newNode
    parentNode.sons.append(newNode)


def addNodeUnderPathUnrecur(root, _path_s):
    r'''
    inputs:
        root -> The root of the directory tree (it must be the root itself).
        _path_s -> add the sons under the path of _path_s.
                   if _path_s is equal to 'D:\CS_DATA\' then all the files under it
                   are added as sons of the node named 'CS_DATA'
    outputs:
        Add all the files under _path_s as its sons. The input must give the root of directory
    '''
    # The path is concatenated with file names below, so it must end with a separator.
    if _path_s[-1] != '\\':
        _path_s += '\\'
    node = searchNodeFromGivenFilePath(root, _path_s)
    if node is None:
        print('can not find the path %s' % _path_s)
        return
    for fileName_s in os.listdir(_path_s):
        newNode = FileNode(fileName_s, None, [], os.path.isdir(_path_s + fileName_s), node)
        _appendSon(node, newNode)
        # isSameName(node, newNode) is not needed: the file system ensures that
        # no two files under one folder share a name.


def searchNodeFromGivenFilePath(root, _path_s):
    r'''
    inputs:
        root -> Must give the root of directory. Meaning the absolute path of a node.
        _path_s -> The absolute path of a node. Examples: 'D:\CS_DATA\'
    output:
        Search the directory tree from root to find the node whose fileName_s is equal to 'CS_DATA'.
        So, you must give the absolute path. Whether 'D:\CS_DATA\' or 'D:\CS_DATA' would be fine.
    '''
    if _path_s[-1] != '\\':
        _path_s += '\\'
    folderStructure = _path_s.split('\\')
    if root.bro is not None:
        print('input root is not root of file tree')
        return None
    if folderStructure[0] != root.fileName_s:
        print('the head of input path is not same as root')
        return None
    stack = [root]
    # The last element of folderStructure is '' (after the trailing separator), so skip it.
    for i in range(1, len(folderStructure) - 1):
        if len(stack) == 0:
            print('stack is empty')
            return None
        node = stack.pop()
        flag = 0
        for j in node.sons:
            if folderStructure[i] == j.fileName_s:
                stack.append(j)
                flag = 1
        if flag == 0:
            print('can not find the folder %s' % folderStructure[i])
            return None
    node = stack.pop()
    return node


def addNodeAsSonFromGivenNode(root, _sonPath_s):
    r'''
    inputs:
        root -> The root of the directory. Which directory that you want to add the node.
        _sonPath_s -> The absolute path of added node. Examples: 'D:\CS_DATA\tree\' means
                      add the node named 'tree' to its parent 'CS_DATA'
    outputs:
        The directory tree with added node.
    '''
    if _sonPath_s[-1] != '\\':
        _sonPath_s += '\\'
    fileStructure = _sonPath_s.split('\\')
    lenOfFileStructure = len(fileStructure)
    if lenOfFileStructure <= 2:
        print('There is no son in the input path %s' % _sonPath_s)
        return
    _sonFileName_s = fileStructure[-2]
    _parentPath_s = ''
    for i in range(len(fileStructure) - 2):
        _parentPath_s = _parentPath_s + fileStructure[i] + '\\'
    _addNodeAsSonFromGivenNode(root, _parentPath_s, _sonFileName_s)


def _addNodeAsSonFromGivenNode(root, _parentPath_s, _sonFileName_s):
    '''
    inputs:
        root -> The root of directory tree.
        _parentPath_s -> The absolute path of parent
        _sonFileName_s -> the filename of added node
    outputs:
        This function is an auxiliary function of addNodeAsSonFromGivenNode
    '''
    if _parentPath_s[-1] != '\\':
        _parentPath_s += '\\'
    parentNode = searchNodeFromGivenFilePath(root, _parentPath_s)
    if parentNode is None:
        print('can not find the parent folder %s' % _parentPath_s)
        return None
    newNode = FileNode(_sonFileName_s, None, [], os.path.isdir(_parentPath_s + _sonFileName_s), parentNode)
    if isSameName(parentNode, newNode):
        return
    _appendSon(parentNode, newNode)


def isSameName(parentNode, sonNode):
    '''
    inputs:
        parentNode -> The parent node.
        sonNode -> the son node.
    outputs:
        If sonNode is already in parentNode.sons then return True.
    '''
    for node in parentNode.sons:
        if node.fileName_s == sonNode.fileName_s:
            print('has same node %s\\%s -> %s' % (parentNode.fileName_s, node.fileName_s, sonNode.fileName_s))
            return True
    return False


def addNodeUnderPathRecur(root, _path_s):
    r'''
    inputs:
        root -> The root of directory.
        _path_s -> The absolute path wanted to be added. Examples: 'D:\CS_DATA\'
    outputs:
        1. Add all the file nodes under _path_s recursively.
        2. The _path_s must exist in root.
    Unsafe:
        1. Some system directory can not be added recursively. Examples: 'D:\System Volume Information'
        2. I do not make the judgment between files whether have same name when adding.
        3. So, this function must use in the premise of operation system ensuring the rule for us.
    '''
    if _path_s[-1] != '\\':
        _path_s = _path_s + '\\'
    fileStructure = _path_s.split('\\')
    if fileStructure[0] == root.fileName_s and len(fileStructure) == 2:
        print('_path_s can not be the root')
        return
    returnNode = currentNode = searchNodeFromGivenFilePath(root, _path_s)
    if currentNode is None:
        print('can not find the path')
        return
    # Breadth-first: the queue holds nodes that are created but not yet linked to their parent.
    queue = deque([])
    for fileName_s in os.listdir(_path_s):
        file_s = _path_s + fileName_s
        newNode = FileNode(fileName_s, None, [], os.path.isdir(file_s), currentNode)
        queue.append(newNode)
    while len(queue) != 0:
        newNode = queue.popleft()
        _appendSon(newNode.parent, newNode)
        if newNode.isDir_b:
            # getFullPathOfNode returns a path that already ends with a separator.
            fullPathOfNewNode = getFullPathOfNode(newNode)
            for subFileName_s in os.listdir(fullPathOfNewNode):
                subNewNode = FileNode(subFileName_s, None, [], os.path.isdir(fullPathOfNewNode + subFileName_s), newNode)
                queue.append(subNewNode)
    return returnNode


def printBrosOfGivenNode(root, _path_s):
    r'''
    inputs:
        root -> The root of the directory.
        _path_s -> Examples: 'D:\CS_DATA' , 'D:\CS_DATA\'
    outputs:
        print out the bros of 'CS_DATA' for 'D:\CS_DATA'
        print out the sons of 'CS_DATA' for 'D:\CS_DATA\'
    '''
    node = searchNodeFromGivenFilePath(root, _path_s)
    if node is None:
        print('can not find the node')
        return
    if _path_s[-1] != '\\':
        # No trailing separator: print the node together with its brothers.
        if node.parent is None:
            # The root has no brothers.
            print(node.fileName_s)
            return
        headOfSons = node.parent.sons[0]
        printStr = headOfSons.fileName_s + ','
        while headOfSons.bro is not None:
            headOfSons = headOfSons.bro
            printStr = printStr + headOfSons.fileName_s + ','
    else:
        # Trailing separator: print the sons of the node.
        if len(node.sons) == 0:
            print('its sons is empty')
            return
        printStr = ''
        for son in node.sons:
            printStr = printStr + son.fileName_s + ','
    print(printStr[:-1])


def crtFileTreeFromPath(_path_s):
    r'''
    inputs:
        _path_s -> Examples: 'D:\sketchDataset\'
    outputs:
        This function will create the root node by 'D:', and then, call addNodeUnderPathUnrecur
        to add files under 'D:\', and then, again call addNodeUnderPathUnrecur to add files
        under 'D:\sketchDataset\'
        This process is a loop until the last separator of _path_s.
    '''
    if _path_s[-1] != '\\':
        _path_s += '\\'
    fileStructure = _path_s.split('\\')
    lenOfFileStructure = len(fileStructure)
    root = FileNode(_fileName_s=fileStructure[0], _isDir_b=os.path.isdir(fileStructure[0]))
    fileStr = root.fileName_s + '\\'
    addNodeUnderPathUnrecur(root, fileStr)
    for i in range(1, lenOfFileStructure - 1):
        file_s = fileStructure[i]
        fileStr = fileStr + file_s + '\\'
        addNodeUnderPathUnrecur(root, fileStr)
    return root


def searchLeafNodeUnderGivenNode(root, _path_s):
    '''
    inputs:
        root -> For the given directory tree.
        _path_s -> The absolute path of node that wanted to search all the leafs under it.
    outputs:
        Return all the leafs under the given _path_s.
        A leaf is a file that has no sons and is not a directory.
    '''
    node = searchNodeFromGivenFilePath(root, _path_s)
    leafs = []
    if node is None:
        print('can not find the node in searchLeafNodeUnderGivenNode')
        return
    queue = deque([])
    queue.append(node)
    while len(queue) != 0:
        currentNode = queue.popleft()
        if len(currentNode.sons) == 0 and not currentNode.isDir_b:
            leafs.append(currentNode)
        else:
            for son in currentNode.sons:
                queue.append(son)
    return leafs


def getFullPathOfNode(givenNode):
    '''
    find the full(absolute) path of the input node.
    The returned path always ends with a separator.
    '''
    tmpNode = givenNode
    fullPathOfNode = tmpNode.fileName_s + '\\'
    while tmpNode.parent is not None:
        tmpNode = tmpNode.parent
        fullPathOfNode = tmpNode.fileName_s + '\\' + fullPathOfNode
    return fullPathOfNode
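The queue-based construction and the leaf search above can also be sketched in a platform-independent form. This is only a minimal illustration, not a replacement for my code: it assumes Python 3, uses os.path.join instead of hard-coded '\\' separators, and the names Node, full_path, build_tree, leaves and demo (as well as the folder and file names inside demo) are made up for this example.

```python
import os
import tempfile
from collections import deque


class Node:
    """Hypothetical stand-in for FileNode: a name, a parent and ordered children."""

    def __init__(self, name, is_dir, parent=None):
        self.name = name
        self.is_dir = is_dir
        self.parent = parent
        self.children = []  # a fresh list for every node


def full_path(node):
    """Walk up the parent links and join the names into a path."""
    parts = []
    while node is not None:
        parts.append(node.name)
        node = node.parent
    return os.path.join(*reversed(parts))


def build_tree(top):
    """Breadth-first: pop a directory from the queue, attach its entries, enqueue sub-directories."""
    root = Node(top, True)
    queue = deque([root])
    while queue:
        current = queue.popleft()
        current_path = full_path(current)
        for name in sorted(os.listdir(current_path)):
            child = Node(name, os.path.isdir(os.path.join(current_path, name)), current)
            current.children.append(child)
            if child.is_dir:
                queue.append(child)
    return root


def leaves(root):
    """Collect every node that is a file, i.e. not a directory."""
    found = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.is_dir:
            queue.extend(node.children)
        else:
            found.append(node)
    return found


def demo():
    """Build a throwaway directory (placeholder names) and group leaf files by parent folder."""
    with tempfile.TemporaryDirectory() as top:
        layout = {"chair": ["m1.off", "m2.off"], "table": ["m3.off"]}
        for folder, files in layout.items():
            os.makedirs(os.path.join(top, "category", folder))
            for file_name in files:
                open(os.path.join(top, "category", folder, file_name), "w").close()
        grouped = {}
        for leaf in leaves(build_tree(top)):
            grouped.setdefault(leaf.parent.name, []).append(leaf.name)
        return grouped
```

Calling demo() groups the placeholder model files by their parent folder, which is the same shape as the containModel_t table built in the usage example that follows.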
For example, to compute the validation set for sketch retrieval, I can append the following to the end of the code above:
if __name__ == '__main__':
    root = crtFileTreeFromPath('D:\\sketchDataset\\')

    # Table 1: category name -> model ids found under D:\sketchDataset\category\<category>\
    categroyNode = addNodeUnderPathRecur(root, 'D:\\sketchDataset\\category\\')
    leafs = searchLeafNodeUnderGivenNode(root, 'D:\\sketchDataset\\category\\')
    containModel_t = {}
    for leaf in leafs:
        cateName_s = leaf.parent.fileName_s
        if cateName_s not in containModel_t:
            containModel_t[cateName_s] = []
        containModel_t[cateName_s].append(strExt.extractModelIdWithSuffix(leaf.fileName_s, suffix_s='.off'))

    # Table 2: sketch name -> category name
    categroyNode = addNodeUnderPathRecur(root, 'D:\\sketchDataset\\all_categorized_sketches\\')
    sketchToCate_t = {}
    for son in categroyNode.sons:
        for sketchNode in son.sons:
            sketchName = strExt.extractSketchNameWithSuffix(sketchNode.fileName_s, suffix_s='.txt')
            if sketchName not in sketchToCate_t:
                sketchToCate_t[sketchName] = son.fileName_s

    # Join the two tables on the category name: sketch name -> model ids
    wanted = tblConcat.concatTableByKey_ValAndVal_Vals(sketchToCate_t, containModel_t)
    print(wanted)
The result is shown below. For example, the validation models for sketch 165 are 'm1646.off', 'm1647.off', and so on.
{'s165.txt': ['m1646.off', 'm1647.off', 'm1648.off', 'm1649.off', 'm1650.off', 'm1651.off', 'm1652.off', 'm1653.off', 'm1654.off', 'm1655.off', 'm1656.off', 'm1657.off', 'm1658.off', 'm1659.off', 'm1660.off', 'm1661.off', 'm1662.off', 'm1663.off', 'm1664.off', 'm1665.off'] ......}
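The tblConcat module is not shown in this post. Judging only from the inputs and the printed result, concatTableByKey_ValAndVal_Vals appears to join a sketch-to-category table with a category-to-models table. A minimal sketch of such a join, assuming that behavior, could look like the following; the function name join_sketch_to_models and the sample keys are hypothetical, and the real helper may handle missing categories differently.

```python
def join_sketch_to_models(sketch_to_category, category_to_models):
    """Assumed behavior: map each sketch to the models of its category.

    Sketches whose category has no entry in category_to_models are skipped here;
    that choice is an assumption, not something the original helper is known to do.
    """
    joined = {}
    for sketch, category in sketch_to_category.items():
        if category in category_to_models:
            # Copy the list so that sketches of the same category do not share one object.
            joined[sketch] = list(category_to_models[category])
    return joined


# Placeholder data, not taken from the real dataset.
sketch_to_category = {"sA.txt": "chair", "sB.txt": "chair", "sC.txt": "lamp"}
category_to_models = {"chair": ["m1.off", "m2.off"]}
validation = join_sketch_to_models(sketch_to_category, category_to_models)
```

With this placeholder input, both chair sketches map to the two chair models, and "sC.txt" is dropped because no models are listed for its category.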