一、kettle的简单使用
主要是跟据德拓云端培训门户中的“大数据采集技术与应用”来进行学习,链接:http://113.31.104.47/portal/#/course/dashboard/b34d160db64624732ef152a1118af11a
但是在安装spoon的过程中,自己的spoon不能正确打开,出现的问题以及解决方法如下:
出现的问题:双击spoon.bat文件之后,cmd出现1s消失,spoon未打开。
解决方法:文本文件的方式打开spoon.bat,
将
if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xms1024m" "-Xmx2048m" "-XX:MaxPermSize=256m"
修改为
if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xms512m" "-Xmx512m" "-XX:MaxPermSize=256m"
完成后,重新运行Spoon.bat,即可成功运行。
来源:https://blog.csdn.net/yue31313/article/details/79898002
另外,使用spoon连接MYSQL数据库时,还需注意将自己数据库对应的驱动包,放入软件的lib目录下!!!
二、MYSQL数据库去重操作
(1)sql语句一
下面方法我在使用时数据库会报错,sql语句应用时出错:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ------;
查阅资料可能跟数据库的类型有关,我觉得下列语句更适用于Oracle数据库,因为oracle数据库的表中的每一行数据都有一个唯一的标识符,或者称为rowid,下列语句中对应的即Oracle中的rowid。
1、查找表中多余的重复记录,重复记录是根据单个字段(peopleId)来判断
select * from people where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1)
2、删除表中多余的重复记录,重复记录是根据单个字段(peopleId)来判断,只留有rowid最小的记录
delete from people where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1) and rowid not in (select min(rowid) from people group by peopleId having count(peopleId )>1)
3、查找表中多余的重复记录(多个字段)
select * from vitae a where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)
4、删除表中多余的重复记录(多个字段),只留有rowid最小的记录
delete from vitae a where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1) and rowid not in (select min(rowid) from vitae group by peopleId,seq having count(*)>1)
5、查找表中多余的重复记录(多个字段),不包含rowid最小的记录
select * from vitae a where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1) and rowid not in (select min(rowid) from vitae group by peopleId,seq having count(*)>1)
————————————————
版权声明:本文为CSDN博主「Java高知」的原创文章。
原文链接:https://blog.csdn.net/tjcyjd/article/details/8950621
(2)sql语句二
接着查询,我发现了另一种适合自己的方法,即利用下面的代码来进行去重并保留一条的操作;
DELETE FROM dept WHERE (dname, db_source) IN ( SELECT t.dname, t.db_source FROM ( SELECT dname, db_source FROM dept GROUP BY dname, db_source HAVING count(1) > 1 ) t ) AND deptno NOT IN ( SELECT dt.mindeptno FROM ( SELECT min(deptno) AS mindeptno FROM dept GROUP BY dname, db_source HAVING count(1) > 1 ) dt )
来源:
版权声明:本文为CSDN博主「千千的大千世界」的原创文章。
原文链接:https://blog.csdn.net/n950814abc/article/details/82284838