zoukankan      html  css  js  c++  java
  • VBA比较两个Excel数据的异同

    代码背景:

    • 由于Excel本身无法简单的比较两个Excel数据的异同,所以用VBA编写代码的方式来实现。
    • 这里的比较条件是:数据行为单位,假设对应Sheet中没有重复数据,对应数据行的所有列的数据都相等,即为此行数据相同。
    • 这里的两个Sheet的数据行量级别大约为:50000 * 50000,数据列大约:50,对应Cell中的字符串大约100以内,中英文混合。
    • 如何在Excel中调出VBA的编写工具,请参考如下链接: https://jingyan.baidu.com/article/63f236281f17650208ab3d97.html

    整体来说,需求非常明确,若是不考虑效率的话,代码逻辑比较简单,循环比较即可。

    相关代码:

    Sub CompareData()
    
        Dim i As Long
        Dim j As Long
        
        Dim fullSheetName As String
        fullSheetName = "Sheet1"
        Set fullSheet = Sheets(fullSheetName)
        Dim fullDataRange As Variant
        fullDataRange = fullSheet.Range("A1", "AT80000").CurrentRegion.Value
        Dim fullSheetRowMax As Long
        fullSheetRowMax = fullSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count
        
        Dim partialSheetName As String
        partialSheetName = "Sheet2"
        Set partialSheet = Sheets(partialSheetName)
        Dim partialDataRange As Variant
        partialDataRange = partialSheet.Range("A1", "AT80000").CurrentRegion.Value
        Dim partialSheetRowMax As Long
        partialSheetRowMax = partialSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count
    
        Dim columnMax As Integer
        columnMax = 46
        
        Dim columnMark As Integer
        columnMark = 48
        
        Dim sameRow As Boolean
        
        For i = 1 To fullSheetRowMax
            For j = 1 To partialSheetRowMax
                sameRow = True
                For columnIndex = 1 To columnMax
                    If fullDataRange(i, columnIndex) <> partialDataRange(j, columnIndex) Then
                        sameRow = False
                        Exit For
                    End If
                Next columnIndex
                
                If sameRow Then
                    fullSheet.Cells(i, columnMark) = 1
                    Exit For
                End If
            Next j
        Next i
        
        MsgBox "Successfully!"
    End Sub
    View Code

    上述代码实际运行大约用30分钟完成此数量级的比较,大约1000亿次的比较。

    当然了我们需要更快的比较方式,那么就需要对大数据进行结构优化,即:将partial的sheet中的数据进行分组,比如每1000个row放到一个组里,然后用一个标志位标记这个组里1000个row是否都有相同的数据,如有都有的话,那么下次再比较的时候就可以跳过这个组,进行下一组的1000个row的循环遍历。相同数量级,大约2分钟比较完成。

    注:实际数据是按照时间进行抽取出来的,所以partial的sheet数据 大致都在full的sheet的前半部分相同,如果数据无规律,非常混乱,那么还要对每一个row的数据进行结构优化,即:用另外一个标记为进行标记此row是否有相同的数据,判断的时候先判断这个标记位】

    相关代码如下:

    【注:函数中的一些变量都是HardCode的,要根据具体数据进行修改】

    Public Type PartialBasedModule
        IsAllSame As Boolean
        SheetDataRange As Variant
        SameCount As Long
    End Type
    
    
    Sub CompareData()
    
        Dim i As Long
        Dim j As Long
        Dim k As Long
        
        Dim fullSheetName As String
        fullSheetName = "Sheet1"
        Set fullSheet = Sheets(fullSheetName)
        Dim fullDataRange As Variant
        fullDataRange = fullSheet.Range("A1", "AT80000").CurrentRegion.Value
        Dim fullSheetRowMax As Long
        fullSheetRowMax = fullSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count
        
        Dim partialSheetName As String
        partialSheetName = "Sheet2"
        Set partialSheet = Sheets(partialSheetName)
        Dim PartialDataRange As Variant
        PartialDataRange = partialSheet.Range("A1", "AT80000").CurrentRegion.Value
        Dim partialSheetRowMax As Long
        partialSheetRowMax = partialSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count
        
        Dim partialSheetPages() As PartialBasedModule
        partialSheetPages = SeparatePartialSheet(PartialDataRange, partialSheetRowMax)
    
        Dim columnMax As Integer
        columnMax = 46
        
        Dim columnMark As Integer
        columnMark = 48
        
        Dim sameRow As Boolean
        
        For i = 1 To fullSheetRowMax
            For j = 1 To UBound(partialSheetPages)
                If partialSheetPages(j).SameCount < 1000 Then
                    For k = 1 To UBound(partialSheetPages(j).SheetDataRange)
                        sameRow = True
                        For ColumnIndex = 1 To columnMax
                            If fullDataRange(i, ColumnIndex) <> partialSheetPages(j).SheetDataRange(k, ColumnIndex) Then
                                sameRow = False
                                Exit For
                            End If
                        Next ColumnIndex
                        
                        If sameRow Then
                            fullSheet.Cells(i, columnMark) = 1
                            partialSheetPages(j).SameCount = partialSheetPages(j).SameCount + 1
                            Exit For
                        End If
                    Next k
                Else
                    sameRow = False
                End If
                
                If sameRow Then
                    Exit For
                End If
            Next j
        Next i
        
        MsgBox "Successfully!"
    End Sub
    
    Public Function SeparatePartialSheet(ByRef PartialDataRange As Variant, ByVal rowCount As Long) As PartialBasedModule()
        Dim eachPageCount As Long
        eachPageCount = 1000
        Dim pageCount As Integer
        pageCount = Int(rowCount / eachPageCount) + 1
        Dim pageIndex As Long
        
        Dim pageArr() As PartialBasedModule
        Dim startIndex As Long
        Dim endIndex As Long
        
        For pageIndex = 1 To pageCount
            Dim seperatedDataRange(1 To 1000, 1 To 46) As Variant
            Dim seperatedIndex As Long
            seperatedIndex = 1
            Dim colIndex As Integer
            
            If pageIndex < pageCount Then
                endIndex = pageIndex * eachPageCount
            Else
                endIndex = rowCount
            End If
            
            
            For startIndex = (pageIndex - 1) * eachPageCount + 1 To endIndex
                For colIndex = 1 To 46
                    seperatedDataRange(seperatedIndex, colIndex) = PartialDataRange(startIndex, colIndex)
                Next colIndex
                seperatedIndex = seperatedIndex + 1
            Next startIndex
            
            Dim pageData As PartialBasedModule
            pageData.SheetDataRange = seperatedDataRange
            pageData.SameCount = 0
            pageData.IsAllSame = False
            
            ReDim Preserve pageArr(pageIndex)
            pageArr(pageIndex) = pageData
        Next pageIndex
        
        
        SeparatePartialSheet = pageArr
    End Function
    View Code

    给每个Row都加上标记的代码如下所示:【相同界别的数据,大约1分钟完成比较】

    Public Type RowModule
        IsSame As Boolean
        RowData As Variant
    End Type
    
    Public Type PartialBasedModule
        IsAllSame As Boolean
        SheetDataRange() As RowModule
        SameCount As Long
    End Type
    
    
    Sub CompareData()
    
        Dim i As Long
        Dim j As Long
        Dim k As Long
        
        Dim fullSheetName As String
        fullSheetName = "Sheet1"
        Set fullSheet = Sheets(fullSheetName)
        Dim fullDataRange As Variant
        fullDataRange = fullSheet.Range("A1", "AT80000").CurrentRegion.Value
        Dim fullSheetRowMax As Long
        fullSheetRowMax = fullSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count
        
        Dim partialSheetName As String
        partialSheetName = "Sheet2"
        Set partialSheet = Sheets(partialSheetName)
        Dim PartialDataRange As Variant
        PartialDataRange = partialSheet.Range("A1", "AT80000").CurrentRegion.Value
        Dim partialSheetRowMax As Long
        partialSheetRowMax = partialSheet.Range("A1", "AT80000").CurrentRegion.Rows.Count
        
        Dim partialSheetPages() As PartialBasedModule
        partialSheetPages = SeparatePartialSheet(PartialDataRange, partialSheetRowMax)
    
        Dim columnMax As Integer
        columnMax = 46
        
        Dim columnMark As Integer
        columnMark = 48
        
        Dim sameRow As Boolean
        
        For i = 1 To fullSheetRowMax
            For j = 1 To UBound(partialSheetPages)
                If partialSheetPages(j).SameCount < 1000 Then
                    For k = 1 To UBound(partialSheetPages(j).SheetDataRange)
                        sameRow = True
                        
                        If partialSheetPages(j).SheetDataRange(k).IsSame Then
                            sameRow = False
                        Else
                            For ColumnIndex = 1 To columnMax
                                If fullDataRange(i, ColumnIndex) <> partialSheetPages(j).SheetDataRange(k).RowData(ColumnIndex) Then
                                    sameRow = False
                                    Exit For
                                End If
                            Next ColumnIndex
                            
                            If sameRow Then
                                fullSheet.Cells(i, columnMark) = 1
                                partialSheetPages(j).SheetDataRange(k).IsSame = True
                                partialSheetPages(j).SameCount = partialSheetPages(j).SameCount + 1
                                Exit For
                            End If
                        End If
                    Next k
                Else
                    sameRow = False
                End If
                
                If sameRow Then
                    Exit For
                End If
            Next j
        Next i
        
        MsgBox "Successfully!"
    End Sub
    
    Public Function SeparatePartialSheet(ByRef PartialDataRange As Variant, ByVal rowCount As Long) As PartialBasedModule()
        Dim eachPageCount As Long
        eachPageCount = 1000
        Dim pageCount As Integer
        pageCount = Int(rowCount / eachPageCount) + 1
        Dim pageIndex As Long
        
        Dim pageArr() As PartialBasedModule
        Dim startIndex As Long
        Dim endIndex As Long
        
        For pageIndex = 1 To pageCount
            Dim seperatedDataRange(1 To 1000) As RowModule
            Dim dataRows(1 To 1000) As Variant
            Dim seperatedIndex As Long
            seperatedIndex = 1
            Dim colIndex As Integer
            
            If pageIndex < pageCount Then
                endIndex = pageIndex * eachPageCount
            Else
                endIndex = rowCount
            End If
            
            
            For startIndex = (pageIndex - 1) * eachPageCount + 1 To endIndex
                Dim dataRow(1 To 46) As Variant
                For colIndex = 1 To 46
                    dataRow(colIndex) = PartialDataRange(startIndex, colIndex)
                Next colIndex
                
                Dim currentRowModule As RowModule
                currentRowModule.RowData = dataRow
                currentRowModule.IsSame = False
                
                seperatedDataRange(seperatedIndex) = currentRowModule
                seperatedIndex = seperatedIndex + 1
            Next startIndex
            
            Dim pageData As PartialBasedModule
            pageData.SheetDataRange = seperatedDataRange
            pageData.SameCount = 0
            pageData.IsAllSame = False
            
            ReDim Preserve pageArr(pageIndex)
            pageArr(pageIndex) = pageData
        Next pageIndex
        
        
        SeparatePartialSheet = pageArr
    End Function
    View Code

    最终的一个简单的数据结构如下图所示:

  • 相关阅读:
    Hive实现wordCount
    java.lang.RuntimeException: HRegionServer Aborted
    kafka基础知识
    PyCharm安装及使用
    SQLite基础-2.PyCharm+Database_Navigator
    SQLite基础-1.SQL简介
    Sublime Text 3 注册激活码
    windows下安装Python虚拟环境virtualenvwrapper-win
    windows下命令行利器---Cmder(安装,中文乱码,配置右键菜单)
    安装python
  • 原文地址:https://www.cnblogs.com/mingmingruyuedlut/p/12813041.html
Copyright © 2011-2022 走看看