zoukankan      html  css  js  c++  java
  • 第三周周总结——kettle的简单使用以及MYSQL数据库去重

    一、kettle的简单使用

    主要是跟据德拓云端培训门户中的“大数据采集技术与应用”来进行学习,链接:http://113.31.104.47/portal/#/course/dashboard/b34d160db64624732ef152a1118af11a

    但是在安装spoon的过程中,自己的spoon不能正确打开,出现的问题以及解决方法如下:

    出现的问题:双击spoon.bat文件之后,cmd出现1s消失,spoon未打开。

    解决方法:文本文件的方式打开spoon.bat,

    if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xms1024m" "-Xmx2048m" "-XX:MaxPermSize=256m"

    修改为

    if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xms512m" "-Xmx512m" "-XX:MaxPermSize=256m"

    完成后,重新运行Spoon.bat,即可成功运行。
    来源:https://blog.csdn.net/yue31313/article/details/79898002

    另外,使用spoon连接MYSQL数据库时,还需注意将自己数据库对应的驱动包,放入软件的lib目录下!!!

    二、MYSQL数据库去重操作

    (1)sql语句一

    下面方法我在使用时数据库会报错,sql语句应用时出错:

    You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ------;

    查阅资料可能跟数据库的类型有关,我觉得下列语句更适用于Oracle数据库,因为oracle数据库的表中的每一行数据都有一个唯一的标识符,或者称为rowid,下列语句中对应的即Oracle中的rowid。

    1、查找表中多余的重复记录,重复记录是根据单个字段(peopleId)来判断

    select * from people
    where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1)

    2、删除表中多余的重复记录,重复记录是根据单个字段(peopleId)来判断,只留有rowid最小的记录

    delete from people 
    where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1)
    and rowid not in (select min(rowid) from people group by peopleId having count(peopleId )>1)

    3、查找表中多余的重复记录(多个字段) 

    select * from vitae a
    where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)

    4、删除表中多余的重复记录(多个字段),只留有rowid最小的记录

    delete from vitae a
    where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)
    and rowid not in (select min(rowid) from vitae group by peopleId,seq having count(*)>1)

    5、查找表中多余的重复记录(多个字段),不包含rowid最小的记录

    select * from vitae a
    where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)
    and rowid not in (select min(rowid) from vitae group by peopleId,seq having count(*)>1)

    ————————————————
    版权声明:本文为CSDN博主「Java高知」的原创文章。
    原文链接:https://blog.csdn.net/tjcyjd/article/details/8950621

    (2)sql语句二

    接着查询,我发现了另一种适合自己的方法,即利用下面的代码来进行去重并保留一条的操作;

    DELETE
    FROM
        dept
    WHERE
        (dname, db_source) IN (
            SELECT
                t.dname,
                t.db_source
            FROM
                (
                    SELECT
                        dname,
                        db_source
                    FROM
                        dept
                    GROUP BY
                        dname,
                        db_source
                    HAVING
                        count(1) > 1
                ) t
        )
    AND deptno NOT IN (
        SELECT
            dt.mindeptno
        FROM
            (
                SELECT
                    min(deptno) AS mindeptno
                FROM
                    dept
                GROUP BY
                    dname,
                    db_source
                HAVING
                    count(1) > 1
            ) dt
    )

    来源:

    版权声明:本文为CSDN博主「千千的大千世界」的原创文章。
    原文链接:https://blog.csdn.net/n950814abc/article/details/82284838

  • 相关阅读:
    在Ubuntu上安装Hadoop(集群模式)
    Node.js v0.10.8 发布
    设置 Sublime Text 的 Python 开发环境
    jQuery 1.10.0 和 2.0.1 发布
    openSUSE 13.1 Milestone 2 发布
    mochad 0.1.6 发布,TCP 网关守护进程
    JPPF 3.3.2 发布,Java 并行处理框架
    PyCharm 又一强大Python IDE
    AntiXSS 支持Html同时防止XSS攻击
    (原创)攻击方式学习系列(总)
  • 原文地址:https://www.cnblogs.com/Qi77/p/12514462.html
Copyright © 2011-2022 走看看