  • Repost: How to Identify and Delete Duplicate SQL Server Records

    Original article

    Recently, I was asked to help someone clean up their database after they had double-loaded an import file. The problem they were having in identifying and deleting the duplicate information was that a timestamp is applied to each row of data as it is inserted into the table. While the rest of the row was duplicated, the timestamp made each row unique. It was this uniqueness that caused the simple methods of identifying and deleting duplicate data to fail. They needed a way to delete rows from a table where they, not SQL Server, determined the criteria for what made the data duplicate.

    After helping them out with their problem, I decided to write a short article to show the simple solution I came up with to delete the duplicate data from a table, even if that data is considered unique by SQL Server. I know there are many ways to delete duplicate data, but bear with me as I explain my way. As always, if you have another way, great: write it up and let us know about it. If not, look over these scripts and see if you can use them to create your own method.

    Before I get into the example that actually deals with the described problem, I am going to start by showing a method to delete simple duplicate data for those who may be new to SQL Server and do not know how to clean up duplicate data from a table.

    /**********************************************
    Example of a simple duplicate data delete script.
    **********************************************/
    
    /**********************************************
    Set up test environment
    **********************************************/
    SET NOCOUNT ON
    
    --Create test table
    IF OBJECT_ID('tDupData') IS NOT NULL
    DROP TABLE tDupData
    GO
    
    CREATE TABLE tDupData
    (
    lngCompanyID INTEGER 
    ,strCompanyName VARCHAR(20)
    ,strAddress VARCHAR(10)
    ,dtmModified DATETIME
    )
    
    --Create test data
    INSERT INTO tDupData VALUES (1,'CompanyOne','Address1','01/15/2003')
    INSERT INTO tDupData VALUES (2,'CompanyTwo','Address2','01/15/2003')
    INSERT INTO tDupData VALUES (3,'CompanyThree','Address3','01/15/2003')
    INSERT INTO tDupData VALUES (2,'CompanyTwo','Address','01/16/2003') 
    INSERT INTO tDupData VALUES (3,'CompanyThree','Address','01/16/2003')
    
    -- Dup Data 
    INSERT INTO tDupData VALUES (1,'CompanyOne','Address1','01/15/2003') 
    GO
    
    /**********************************************
    Finish set up
    **********************************************/
    
    /**********************************************
    Simple duplicate data
    **********************************************/
    
    --Create temp table to hold duplicate data
    CREATE TABLE #tempduplicatedata
    (
    lngCompanyID INTEGER 
    ,strCompanyName VARCHAR(20)
    ,strAddress VARCHAR(10)
    ,dtmModified DATETIME
    )
    
    --Identify and save dup data into temp table
    INSERT INTO #tempduplicatedata
    SELECT * FROM tDupData
    GROUP BY lngCompanyID,strCompanyName,strAddress, dtmModified
    HAVING COUNT(*) > 1
    
    --Confirm number of dup rows
    SELECT @@ROWCOUNT AS 'Number of Duplicate Rows'
    
    --Delete dup from original table
    DELETE FROM tDupData 
    FROM tDupData
    INNER JOIN #tempduplicatedata
    ON  tDupData.lngCompanyID = #tempduplicatedata.lngCompanyID
    AND tDupData.strCompanyName = #tempduplicatedata.strCompanyName
    AND tDupData.strAddress = #tempduplicatedata.strAddress
    AND tDupData.dtmModified = #tempduplicatedata.dtmModified
    
    --Insert one copy of the deleted data back
    INSERT INTO tDupData
    SELECT * FROM #tempduplicatedata
    
    --Check for dup data.
    SELECT * FROM tDupData
    GROUP BY lngCompanyID,strCompanyName,strAddress,dtmModified
    HAVING COUNT(*) > 1
    
    --Check table
    SELECT * FROM tDupData
    
    --Drop temp table
    DROP TABLE #tempduplicatedata
    
    --drop test table
    IF OBJECT_ID('tDupData') IS NOT NULL
    DROP TABLE tDupData
    GO
    

    As you can see, it is not hard to delete data that is duplicated across all columns of a table. What is harder is deleting data that your business rules consider duplicate while SQL Server considers it unique. This happens when one or more columns contain different data, but because the main columns of the table are the same, your business rules have determined that the rows are duplicates. It typically occurs when a problem during a data load causes the data to be loaded multiple times, generating a new timestamp or identity value for each row. That identity value or date field makes each row unique, so the simple delete method above fails.
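    Below is a minimal sketch of how the same temp-table-and-join pattern used above can handle that case, run against the same test table. It assumes two things the text above does not spell out: the business key is lngCompanyID plus strCompanyName, and the row with the earliest dtmModified is the one to keep. Exact duplicates with identical timestamps are the case already covered by the simple script.

    --Save one surviving timestamp per business key, ignoring the timestamp when grouping
    SELECT lngCompanyID, strCompanyName, MIN(dtmModified) AS dtmModified
    INTO #tempkeepers
    FROM tDupData
    GROUP BY lngCompanyID, strCompanyName
    HAVING COUNT(*) > 1

    --Delete every row for those keys whose timestamp differs from the keeper's
    DELETE FROM tDupData
    FROM tDupData
    INNER JOIN #tempkeepers
    ON  tDupData.lngCompanyID = #tempkeepers.lngCompanyID
    AND tDupData.strCompanyName = #tempkeepers.strCompanyName
    AND tDupData.dtmModified <> #tempkeepers.dtmModified

    --Check the result and clean up
    SELECT * FROM tDupData
    DROP TABLE #tempkeepers

    Because dtmModified is left out of the GROUP BY and used only to decide which row survives, rows that SQL Server treats as unique are still caught as duplicates by the business key.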

  • Original post: https://www.cnblogs.com/philzhou/p/2015028.html