Synchronizing Data between Oracle and Elasticsearch

A Python Script for Synchronizing Data from Oracle to Elasticsearch

Tags: elasticsearch, oracle, cx_Oracle, python, data synchronization

I. Versions

Python: 2.7.12 (x64)

Oracle 12.1.0.2.0 (x64) and Elasticsearch 2.2.0

Python editor: PyCharm

When downloading and installing, choose the builds that match your own machine.

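II. Downloading the modules

Download and install the cx_Oracle and pyes modules from their official sites; they are used to talk to the Oracle database and to ES respectively. The fcntl module is used to make sure that only a single instance of the Python script runs at a time.

If you connect to the database and to ES remotely, pay close attention to the versions of the modules and packages you install. They must match the server versions, otherwise you will run into problems.

cx_Oracle: https://sourceforge.net/projects/cx-oracle/files/?source=navbar

pyes: https://github.com/aparo/pyes

fcntl: https://pypi.python.org/pypi?:action=show_md5&digest=3cea2958c97b24cf0ab12121be22b6dd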

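III. Problems you may hit during installation

Problems installing cx_Oracle locally:

1. Install the C++ compiler environment for Python.

2. Install the Oracle database (or install only the client files that the API needs, without downloading and configuring a full Oracle environment).

3. Open the database tool Oracle SQL Developer, create a connection as required, and create a new user (the user name must start with c##, otherwise an error is reported).

4. If Oracle cannot connect to the remote server, check whether the versions match.

Problems installing fcntl on Windows:

1. pip install fcntl fails with IndentationError: unexpected indent (the downloaded package is broken). Note that fcntl is part of the Python standard library on Unix-like systems and is not available on Windows.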

IV. Source code

# -*- coding: utf-8 -*-
"""
Author: Chen Long
Date: 2016-7-22
Purpose: synchronize data from an Oracle database to ES
"""
import os
import sys
import datetime, time
# import fcntl
import threading
import pyes  # pyes module: ES interface
import cx_Oracle  # cx_Oracle module: Oracle interface

os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8'  # Chinese character encoding
reload(sys)  # reload(sys) is required before the default encoding can be set to utf-8
sys.setdefaultencoding('utf-8')


# Create the ES connection and return it
def connect_ES(addr):
    try:
        global conn
        conn = pyes.ES(addr)  # connect to ES, e.g. '127.0.0.1:9200'
        print 'Connected to ES'
        return conn
    except:
        print 'ES connection error'
        pass



# Create the ES mappings; pay attention to the type of each field
def create_ESmapping():
    global spiderInfo_mapping, involveVideo_mapping, involveCeefax_mapping, keyWord_mapping, sensitiveWord_mapping
    # ES mapping for all content other than the items that mention us
    spiderInfo_mapping = {'tableName': {'index': 'not_analyzed', 'type': 'string'},
                          'tableId': {'index': 'not_analyzed', 'type': 'integer'},
                          'title': {'index': 'analyzed', 'type': 'string'},
                          'author': {'index': 'not_analyzed', 'type': 'string'},
                          'content': {'index': 'analyzed', 'type': 'string'},
                          'publishTime': {'index': 'not_analyzed', 'type': 'string'},
                          'browseNum': {'index': 'not_analyzed', 'type': 'integer'},
                          'commentNum': {'index': 'not_analyzed', 'type': 'integer'},
                          'dataType': {'index': 'not_analyzed', 'type': 'integer'}}
    # ES mapping for video and audio content that mentions us
    involveVideo_mapping = {'tableName': {'index': 'not_analyzed', 'type': 'string'},
                            'tableId': {'index': 'not_analyzed', 'type': 'integer'},
                            'title': {'index': 'analyzed', 'type': 'string'},
                            'author': {'index': 'not_analyzed', 'type': 'string'},
                            'summary': {'index': 'analyzed', 'type': 'string'},
                            'publishTime': {'index': 'not_analyzed', 'type': 'string'},
                            'url': {'index': 'not_analyzed', 'type': 'string'},
                            'imgUrl': {'index': 'not_analyzed', 'type': 'string'},
                            'ranking': {'index': 'not_analyzed', 'type': 'integer'},
                            'playNum': {'index': 'not_analyzed', 'type': 'integer'},
                            'dataType': {'index': 'not_analyzed', 'type': 'integer'}}
    # ES mapping for image-and-text news that mentions us
    involveCeefax_mapping = {'tableName': {'index': 'not_analyzed', 'type': 'string'},
                             'tableId': {'index': 'not_analyzed', 'type': 'integer'},
                             'title': {'index': 'analyzed', 'type': 'string'},
                             'author': {'index': 'not_analyzed', 'type': 'string'},
                             'content': {'index': 'analyzed', 'type': 'string'},
                             'publishTime': {'index': 'not_analyzed', 'type': 'string'},
                             'keyWords': {'index': 'not_analyzed', 'type': 'string'},
                             'popularity': {'index': 'not_analyzed', 'type': 'integer'},
                             'url': {'index': 'not_analyzed', 'type': 'string'},
                             'dataType': {'index': 'not_analyzed', 'type': 'integer'}}
    keyWord_mapping = {'id': {'index': 'not_analyzed', 'type': 'integer'},
                       'keywords': {'index': 'not_analyzed', 'type': 'string'}}
    sensitiveWord_mapping = {'id': {'index': 'not_analyzed', 'type': 'integer'},
                             'sensitiveType': {'index': 'not_analyzed', 'type': 'string'},
                             'sensitiveTopic': {'index': 'not_analyzed', 'type': 'string'},
                             'sensitiveWords': {'index': 'not_analyzed', 'type': 'string'}}



# Create the ES index and the types under it
def create_ESindex(ES_index, index_type1, index_type2, index_type3, index_type4, index_type5):
    if conn.indices.exists_index(ES_index):
        pass
    else:
        conn.indices.create_index(ES_index)  # create the index if it does not exist yet
        create_ESmapping()
        conn.indices.put_mapping(index_type1, {'properties': spiderInfo_mapping}, [ES_index])  # create the _type "spiderInfo" under index pom
        conn.indices.put_mapping(index_type2, {'properties': involveVideo_mapping}, [ES_index])  # create the _type "involveVideo" under index pom
        conn.indices.put_mapping(index_type3, {'properties': involveCeefax_mapping}, [ES_index])  # create the _type "involveCeefax" under index pom
        conn.indices.put_mapping(index_type4, {'properties': keyWord_mapping}, [ES_index])
        conn.indices.put_mapping(index_type5, {'properties': sensitiveWord_mapping}, [ES_index])
    # conn.ensure_index



# Create the database connection and return it
def connect_Oracle(name, password, address):
    try:
        global conn1
        # conn1 = cx_Oracle.connect('c##chenlong','1234567890','localhost:1521/ORCL')  # connect to a local database
        conn1 = cx_Oracle.connect(name, password, address)  # connect to a remote database, e.g. "pom","Bohui@123","172.17.7.118:1521/ORCL"
        print 'Connected to Oracle'
        return conn1
    except:
        print 'The ES sync script cannot connect to the database; check the connect() arguments and whether the module versions match'
        pass


def fetch_account(accountcode):  # return the part of the account code before the first '_'
    end = accountcode.find('_')
    return accountcode[0:end].strip()

# Create a separate object per table
# Read each table's recorded ID from the record and check whether it has changed
# Read the relevant data from each table


# Read each table's IDs and compare them with the recorded IDs (kept in a text file or in the database)
"""def read_compare_ID():
    global tuple_tableName_IdNum
    global cur
    tuple_tableName_IdNum = {}
    tablename = []
    cur = conn1.cursor()
    result1 = cur.execute("select * from tabs")  # query the database for all table names
    row = result1.fetchall()
    for x in row:
        tablename.append(x[0])  # collect the table name into the tablename list
        result2 = cur.execute('select {}_ID  from {}'.format(x[0], x[0]))
        ID_num = result2.fetchall()
        tuple_tableName_IdNum[x[0]] = ID_num"""



def readOracle_writeES(tableName, ES_index, index_type):
    global cc
    cur = conn1.cursor()
    #result_AlltableNames = cur.execute("select * from tabs")
    result_latestId = cur.execute("select max({}_Id) from {} ".format(tableName,tableName))
    num1 = result_latestId.fetchone()  # largest ID currently in the table
    print 'Largest ID currently in the table: {}'.format(num1[0])
    result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName.upper()))  # read the last synchronized ID from the tracking table; table names are upper-cased
    num2 = result_rememberId.fetchone()  # ID recorded by the previous run
    print 'Last synchronized ID on record: {}'.format(num2[0])
    if tableName.upper() == 'T_SOCIAL':
        while num2[0] < num1[0]:
            # NOTE: rownum without ORDER BY does not guarantee that rows arrive in ascending ID order
            result_readOracle = cur.execute("select {}_ID,title,author,content,publishTime,browseNum,likeNum,forwardNum,commentNum,accountCode from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # fetchmany(40) was used at first because the full result did not fit in memory; limiting the rows in SQL and then calling fetchall is more efficient
            for i in result_tuple1:  # writing row by row was too slow, so rows go through the bulk interface
                aa = i[5] + i[6]
                bb = i[7] + i[8]
                if conn.index(
                    {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[1]), 'author': unicode(i[2]),
                    'content': unicode(i[3]), 'publishTime': str(i[4]), 'browseNum': aa,
                    'commentNum':bb, 'dataType':fetch_account(i[9])}, ES_index, index_type,bulk=True):  # write the document to ES_index/index_type
                    cc += 1
                    print 'ID after bulk import: {}'.format(i[0])
            rememberId = i[0]  # assigned only after the batch has been written
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId,tableName))
            conn1.commit()
            result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the updated ID from the tracking table
            num2 = result_rememberId.fetchone()
        print "read {} and wrote {} successfully".format(tableName,index_type)

    if tableName.upper() == 'T_HOTSEARCH':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,accountCode,title,publishTime from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # rows are limited in SQL (rownum), so fetchall is safe
            for i in result_tuple1:  # rows are queued through the bulk interface
                if conn.index(
                    {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[2]),'author': '','content': '', 'publishTime': str(i[3]), 'browseNum': 0,
                    'commentNum': 0, 'dataType': fetch_account(i[1])}, ES_index, index_type,bulk=True):  # write the document to ES_index/index_type
                    cc += 1
                    print 'ID after bulk import: {}'.format(i[0])
            rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the updated ID from the tracking table
            num2 = result_rememberId.fetchone()
        print "read {} and wrote {} successfully".format(tableName, index_type)

    if tableName.upper() == 'T_VIDEO_HOT':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,accountCode,title,Author,publishTime from {} where {}_ID > {} and rownum<=40 ".format(tableName,tableName,tableName,num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # rows are limited in SQL (rownum), so fetchall is safe
            for i in result_tuple1:  # rows are queued through the bulk interface
                if conn.index(
                    {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[2]),'author': unicode(i[3]),
                    'content': '', 'publishTime': str(i[4]), 'browseNum': 0,
                    'commentNum': 0, 'dataType': fetch_account(i[1])}, ES_index, index_type, bulk=True):  # write the document to ES_index/index_type
                    cc += 1
                    print 'ID after bulk import: {}'.format(i[0])
            rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the updated ID from the tracking table
            num2 = result_rememberId.fetchone()
        print "{} read and written successfully".format(tableName)

    if tableName.upper() == 'T_PRESS':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute(
                "select {}_ID,accountCode,title,Author,PublishDate,Content from {} where {}_ID > {} and rownum<=40 ".format(
                    tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # rows are limited in SQL (rownum), so fetchall is safe
            for i in result_tuple1:  # rows are queued through the bulk interface
                if conn.index(
                    {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[2]),'author': unicode(i[3]),
                    'content': unicode(i[5]), 'publishTime': str(i[4]), 'browseNum': 0,
                    'commentNum': 0, 'dataType': fetch_account(i[1])}, ES_index, index_type,bulk=True):  # write the document to ES_index/index_type
                    cc += 1
                    print 'ID after bulk import: {}'.format(i[0])
            rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute(
                "select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the updated ID from the tracking table
            num2 = result_rememberId.fetchone()
        print "{} read and written successfully".format(tableName)

    if tableName.upper() == 'T_INDUSTRY':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute(
                "select {}_ID,accountCode,title,Author,PublishTime,Content,BrowseNum from {} where {}_ID > {} and rownum<=40 ".format(
                    tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # rows are limited in SQL (rownum), so fetchall is safe
            for i in result_tuple1:  # rows are queued through the bulk interface
                if conn.index(
                    {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[2]),'author': unicode(i[3]),
                    'content': unicode(i[5]), 'publishTime': str(i[4]), 'browseNum': i[6],
                    'commentNum': 0, 'dataType': fetch_account(i[1])}, ES_index, index_type,bulk=True):  # write the document to ES_index/index_type
                    cc += 1
                    print 'ID after bulk import: {}'.format(i[0])
            rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute(
                "select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the updated ID from the tracking table
            num2 = result_rememberId.fetchone()
        print "{} read and written successfully".format(tableName)

    if tableName.upper() == 'T_SOCIAL_SITESEARCH':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute('select {}_ID,title,author,content,publishTime,keyWords,browseNum,likeNum,forwardNum,commentNum,url,accountCode from {} where ({}_ID > {})'.format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchmany(50)  # the full result is too large to hold in memory, so only 50 rows are fetched at a time
            for i in result_tuple1:  # rows are queued through the bulk interface
                popularity = (i[6] + i[7] + i[8] * 2 + i[9] * 2)
                if conn.index(
                    {'tableName': tableName,'tableId':i[0],'title': unicode(i[1]),'author':unicode(i[2]),
                    'content':unicode(i[3]),'publishTime':str(i[4]),'keyWords':unicode(i[5]),
                    'popularity':popularity,'url': i[10],
                    'dataType':fetch_account(i[11])}, ES_index, index_type, bulk=True):  # write the document to ES_index/index_type
                    cc += 1
                    print 'ID after bulk import: {}'.format(i[0])
            rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId,tableName))
            conn1.commit()
            result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the updated ID from the tracking table
            num2 = result_rememberId.fetchone()
        print "{} read and written successfully".format(tableName)

    if tableName.upper() == 'T_REALTIME_NEWS':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,title,author,content,publishTime,browseNum,commentNum,accountCode,url from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # rows are limited in SQL (rownum), so fetchall is safe
            for i in result_tuple1:  # rows are queued through the bulk interface
                popularity = (i[5] + i[6] * 2)
                if conn.index(
                    {'tableName': tableName,'tableId':i[0],'title': unicode(i[1]),'author':unicode(i[2]),
                    'content':unicode(i[3]),'publishTime':str(i[4]),'keyWords':unicode(''),
                    'popularity':popularity,'url': i[8],'dataType':fetch_account(i[7])}, ES_index, index_type, bulk=True):  # write the document to ES_index/index_type
                    cc += 1
                    print 'ID after bulk import: {}'.format(i[0])
            rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute(
                "select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the updated ID from the tracking table
            num2 = result_rememberId.fetchone()
        print "read {} and wrote {} successfully".format(tableName, index_type)

    if tableName.upper() == 'T_KEY_NEWS':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,title,author,content,publishTime,browseNum,commentNum,accountCode,url from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # rows are limited in SQL (rownum), so fetchall is safe
            for i in result_tuple1:  # rows are queued through the bulk interface
                popularity = (i[5] + i[6] * 2)
                if conn.index(
                    {'tableName': tableName,'tableId':i[0],'title': unicode(i[1]),'author':unicode(i[2]),
                    'content':unicode(i[3]),'publishTime':str(i[4]),'keyWords':unicode(''),
                    'popularity':popularity,'url': i[8],'dataType':fetch_account(i[7])}, ES_index, index_type, bulk=True):  # write the document to ES_index/index_type
                    cc += 1
                    print 'ID after bulk import: {}'.format(i[0])
            rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute(
                "select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the updated ID from the tracking table
            num2 = result_rememberId.fetchone()
        print "read {} and wrote {} successfully".format(tableName, index_type)

    if tableName.upper() == 'T_LOCAL_NEWS':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,title,author,content,publishTime,browseNum,commentNum,accountCode,url from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # rows are limited in SQL (rownum), so fetchall is safe
            for i in result_tuple1:  # rows are queued through the bulk interface
                popularity = (i[5] + i[6] * 2)
                if conn.index(
                    {'tableName': tableName, 'tableId': i[0], 'title': unicode(i[1]), 'author': unicode(i[2]),
                    'content': unicode(i[3]), 'publishTime': str(i[4]), 'keyWords': unicode(''),
                    'popularity': popularity, 'url': i[8], 'dataType': fetch_account(i[7])}, ES_index, index_type,bulk=True):  # write the document to ES_index/index_type
                    cc += 1
                    print 'ID after bulk import: {}'.format(i[0])
            rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute(
                "select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the updated ID from the tracking table
            num2 = result_rememberId.fetchone()
        print "read {} and wrote {} successfully".format(tableName, index_type)

    if tableName.upper() == 'T_VIDEO_SITESEARCH':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute("select {}_ID,accountCode,title,Author,publishTime,url,imgUrl,playNum,keyWords from {} where {}_ID > {} and rownum<=40 ".format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # rows are limited in SQL (rownum), so fetchall is safe
            for i in result_tuple1:  # rows are queued through the bulk interface
                if conn.index(
                    {
                    'tableName': tableName, 'tableId': i[0], 'title': unicode(i[2]), 'author': unicode(i[3]),
                    'summary': unicode('0'), 'publishTime': str(i[4]), 'browseNum': i[7],'url':i[5],'imgUrl':i[6],'ranking':0,
                    'playNum': 0, 'dataType': fetch_account(i[1])}, ES_index, index_type,bulk=True):  # write the document to ES_index/index_type
                    cc += 1
                    print 'ID after bulk import: {}'.format(i[0])
            rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute(
                "select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the updated ID from the tracking table
            num2 = result_rememberId.fetchone()
        print "read {} and wrote {} successfully".format(tableName,index_type)

    if tableName.upper() == 'T_BASE_KEYWORDS':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute('select {}_ID,keywords from {} where {}_ID > {} and rownum<=50'.format(tableName, tableName, tableName, num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # the full result is too large to hold in memory, so the query returns at most 50 rows
            for i in result_tuple1:  # rows are queued through the bulk interface
                if conn.index({'id': i[0], 'keywords': i[1]}, ES_index, index_type,bulk=True):  # write the document to ES_index/index_type
                    cc += 1
                    print 'ID after bulk import: {}'.format(i[0])
            rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId,tableName))
            conn1.commit()
            result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the updated ID from the tracking table
            num2 = result_rememberId.fetchone()
        print "{} read and written successfully".format(tableName)

    if tableName.upper() == 'T_BASE_SENSITIVEWORDS':
        while num2[0] < num1[0]:
            result_readOracle = cur.execute('select {}_ID,SensitiveType,SensitiveTopic,SensitiveWords from {} where {}_ID > {} and rownum<=50'.format(tableName, tableName, tableName,num2[0]))
            result_tuple1 = result_readOracle.fetchall()  # the full result is too large to hold in memory, so the query returns at most 50 rows
            for i in result_tuple1:  # rows are queued through the bulk interface
                if conn.index({'id':i[0],
                            'sensitiveType':unicode(i[1]),
                            'sensitiveTopic': unicode(i[2]),
                            'sensitiveWords':unicode(i[3])}, ES_index, index_type, bulk=True):  # write the document to ES_index/index_type
                    cc += 1
                    print 'ID after bulk import: {}'.format(i[0])  # moved inside the if, matching the other branches
            rememberId = i[0]
            cur.execute("update T_REMEMBERID set tableId = {} where tableName = '{}'".format(rememberId, tableName))
            conn1.commit()
            result_rememberId = cur.execute("select tableId from T_REMEMBERID where tableName='{}'".format(tableName))  # re-read the updated ID from the tracking table
            num2 = result_rememberId.fetchone()
        print "{} read and written successfully".format(tableName)
    else:
        pass



def ww(a):
    while True:
        print a
        time.sleep(0.5)  # experimental function used to try out multithreading


if __name__ == "__main__":
    cc = 0
    connect_ES('172.17.5.66:9200')
    # conn.indices.delete_index('_all')  # delete all indices
    create_ESindex("pom", "spiderInfo", "involveVideo", "involveCeefax","keyWord","sensitiveWord")
    connect_Oracle("pom", "Bohui@123", "172.17.7.118:1521/ORCL")
    # thread.start_new_thread(readOracle_writeES,("T_SOCIAL","pom","spiderInfo"),)  # start a thread
    # thread.start_new_thread(readOracle_writeES,("T_SOCIAL_SITESEARCH", "pom", "spiderInfo"),)  # start a thread
    mm = time.clock()
    readOracle_writeES("T_SOCIAL", "pom", "spiderInfo")  # the script upper-cases table names, but it is still better to pass them in upper case
    readOracle_writeES("T_HOTSEARCH", "pom", "spiderInfo")
    readOracle_writeES("T_VIDEO_HOT", "pom", "spiderInfo")
    readOracle_writeES("T_PRESS", "pom", "spiderInfo")
    readOracle_writeES("T_INDUSTRY", "pom", "spiderInfo")
    readOracle_writeES("T_VIDEO_SITESEARCH", "pom", "involveVideo")
    readOracle_writeES("T_REALTIME_NEWS", "pom", "involveCeefax")
    readOracle_writeES("T_KEY_NEWS", "pom", "involveCeefax")
    readOracle_writeES("T_LOCAL_NEWS", "pom", "involveCeefax")
    readOracle_writeES("T_SOCIAL_SITESEARCH", "pom", "involveCeefax")
    readOracle_writeES("T_BASE_KEYWORDS", "pom", "keyWord")
    readOracle_writeES("T_BASE_SENSITIVEWORDS", "pom", "sensitiveWord")
    nn = time.clock()
    # conn.indices.close_index('pom')
    conn1.close()
    print 'Time spent writing: {}  rows written successfully: {}'.format(nn-mm,cc)


# multithreading experiment
    """
    while a < 100:
        conn.index(
            {'tableName': 'T_base_account', 'type': '1', 'tableId': '123', 'title': unicode('陈龙'), 'author': 'ABC',
            'content': 'ABC', 'publishTime': '12:00:00', 'browseNum': '12', 'commentNum': '12', 'dataType': '1'},
            "pom", "spiderInfo", )  # write the document to spiderInfo in index pom
        a += 1
    print time.ctime()
    """
"""
    threads = []
    t1 = threading.Thread(target=readOracle_writeES,args=("T_SOCIAL","pom","spiderInfo"))
    threads.append(t1)
    #t3 = threading.Thread(target=ww,args=(10,))
    #threads.append(t3)
    #t2 = threading.Thread(target=readOracle_writeES,args=("T_SOCIAL_SITESEARCH", "pom", "spiderInfo"))
    #threads.append(t2)
    print time.ctime()
    for t in threads:
        t.setDaemon(True)
        t.start()
    t.join()
"""

V. Problems encountered while running the script

1. Printing the return value of cur.execute() directly does not give the result we want

result2 = cur.execute('select T_SOCIAL_ID from T_SOCIAL')

print result2

Output: <__builtin__.OracleCursor on <cx_Oracle.Connection to pom@172.17.7.118:1521/ORCL>>

result2 = cur.execute('select T_SOCIAL_ID from T_SOCIAL')

print result2

num = result2.fetchall()

print num

for i in num:

    print i[0]

Output: [(55,), (56,), (57,), (58,), (59,), (60,), (61,), (62,), (63,), (64,), (65,), (66,), (67,), (68,), (69,), (70,)]

   55

Note: fetchall() returns a list of tuples such as [(55,), (56,), (57,), (58,), (59,)], not a list of numbers.

Index into each tuple, for example i[0], to get the actual value.
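The tuple shape of fetchall() is common to DB-API drivers, so it can be tried without an Oracle server. The following is a minimal sketch, assuming Python 3 and using the standard-library sqlite3 module as a stand-in for cx_Oracle; the table and column names are borrowed from the example above purely for illustration.

```python
import sqlite3

# sqlite3 stands in for cx_Oracle here (an assumption for the demo): both
# follow the DB-API, so fetchall() returns a list of row tuples.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table T_SOCIAL (T_SOCIAL_ID integer)")
cur.executemany("insert into T_SOCIAL values (?)", [(55,), (56,), (57,)])

rows = cur.execute("select T_SOCIAL_ID from T_SOCIAL order by T_SOCIAL_ID").fetchall()
ids = [row[0] for row in rows]  # index into each tuple to get the plain value

print(rows)
print(ids)
conn.close()
```

The list comprehension is the usual way to flatten single-column results before comparing or summing them.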

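2. Garbled Chinese text from cx_Oracle

Chinese text shows up garbled: ��Ǳ��

Or it shows up as raw, undecoded bytes: ('\xce\xd2\xd5\xe6\xb5\xc4\xca\xc7\xb1\xea\xcc\xe2',)

Pay attention to the following points so that the Chinese text from the database is converted to utf-8 and written to Elasticsearch correctly:

os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8' # Chinese character encoding

reload(sys) # reload(sys) is required before the default encoding can be set to utf-8

sys.setdefaultencoding('utf-8')

'title':unicode('中文')

How do you fix garbled Chinese in a list passed from Python to JS?

json.dumps(dictionary,ensure_ascii=False)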
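To make the two symptoms concrete, here is a small sketch in Python 3 (where text and bytes are separate types, unlike the Python 2 script above). It assumes the undecoded bytes shown earlier are GBK-encoded, which is a guess based on their shape rather than something the original states.

```python
import json

# The byte string from the undecoded output above, assumed to be GBK.
raw = b'\xce\xd2\xd5\xe6\xb5\xc4\xca\xc7\xb1\xea\xcc\xe2'
title = raw.decode('gbk')            # bytes -> text
utf8_bytes = title.encode('utf-8')   # text -> UTF-8 bytes for storage or transport

doc = {'title': title}
escaped = json.dumps(doc)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(doc, ensure_ascii=False)  # non-ASCII characters are kept as-is

print(escaped)
print(readable)
```

Both JSON strings decode to the same data; ensure_ascii=False only changes how the characters are written out.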

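3. Cannot connect to the remote Oracle database

First, make sure every argument passed to connect() is correct. For example:

conn1 = cx_Oracle.connect("username","password","172.17.7.118:1521/ORCL") # connect to a remote database

conn1 = cx_Oracle.connect('username','password','localhost:1521/ORCL') # connect to a local database

conn2 = pyes.ES('127.0.0.1:9200') # connect to ES

Second, make sure the installed versions meet the requirements, including the module versions.

4. TypeError: 'NoneType' object is not callable

Make sure the type of every field in the mapping is set correctly.

Check that the index and the mapping are both written correctly.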

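5. Reading several database tables at the same time

This involves multithreading in Python: start one thread per table and protect each thread's shared state with a lock.

Errors encountered along the way:

AssertionError: group argument must be None for now (check that the function call is written correctly; read/write conflict). The first positional parameter of threading.Thread is group, so pass the function as target= and its arguments as args=.

AttributeError: 'builtin_function_or_method' object has no attribute 'setDaemon' (setDaemon was called on a built-in function rather than on a threading.Thread object)

cx_Oracle.ProgrammingError: LOB variable no longer valid after subsequent fetch (fetchall returned too much data; limit the number of rows fetched per query with rownum)

TypeError: 'NoneType' object has no attribute '__getitem__' (watch the upper and lower case of the values used in the database query)

No handlers could be found for logger "pyes": possibly a connection timeout

AttributeError: 'tuple' object has no attribute 'append': a tuple has no append method

TypeError: 'tuple' object does not support item assignment: items of a tuple cannot be assigned

Reading from the database in batches

I asked an expert about the multithreading problem; the advice was that using multiple processes would be simpler to implement.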
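As an illustration of the one-thread-per-table idea, and of how to avoid the first two errors listed above, here is a minimal Python 3 sketch. The function sync_table is a hypothetical placeholder for the real per-table work; it only records the table name so the example stays self-contained.

```python
import threading

def sync_table(table_name, results, lock):
    # Hypothetical placeholder for the real per-table synchronization.
    with lock:  # the lock guards the shared results list
        results.append(table_name)

tables = ["T_SOCIAL", "T_HOTSEARCH", "T_PRESS"]
results = []
lock = threading.Lock()

# Pass target= and args= by keyword: the first positional parameter of
# threading.Thread is `group`, which must stay None.
threads = [threading.Thread(target=sync_table, args=(name, results, lock))
           for name in tables]
for t in threads:
    t.daemon = True  # set on the Thread object, before start()
    t.start()
for t in threads:    # join every thread, not only the last one
    t.join()

print(sorted(results))
```

In a real script each thread would also need its own database cursor, and probably its own connection, which is part of why separate processes can be the simpler route.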

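6. Triggering the script on a schedule

Use Linux crontab to run the task on a schedule, and make sure crontab does not start the script again when the previous run has not finished within the period.

7. Single instance: preventing the script from being triggered again before it has finished

At first I planned to handle this inside the script; later I learned that it can be configured at the system level.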
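For completeness, the in-script variant can be sketched with the standard-library fcntl module, which is available on Unix-like systems only. This is one possible approach under that assumption, not the setup the author ended up using; the helper name try_lock and the lock file name are made up for the example.

```python
import fcntl
import os
import tempfile

def try_lock(path):
    """Return an open file holding an exclusive lock on path, or None if it is already locked."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking exclusive lock
    except (IOError, OSError):
        handle.close()
        return None
    return handle

# A real script would use one fixed, well-known path; a temporary directory
# keeps this demo from colliding with anything else.
lock_path = os.path.join(tempfile.mkdtemp(), "es_sync_example.lock")

first = try_lock(lock_path)
second = try_lock(lock_path)  # simulates a second run starting while the first is active
print(first is not None, second is None)

first.close()                 # closing the file releases the lock
third = try_lock(lock_path)
print(third is not None)
third.close()
```

A script would call try_lock once at startup and exit immediately when it returns None, so an overlapping cron run does no work.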

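8. Data synchronization plugins

There are many plugins online for synchronizing relational databases. logstash-input-jdbc was not easy to install, and I did not know how to use it.

For MySQL-to-ES synchronization plugins, see for example elasticsearch-river-jdbc.

This script uses the bulk interface for batch import, which greatly increases the synchronization speed.

9. Checking whether the data was synchronized successfully

I had overlooked this question for a long time, but it is very important when transferring data.

The current method is to check how many documents ES actually holds and compare that with the counts from the source; this is also how I later discovered that part of the data had not been synchronized.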
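The count comparison described above can be sketched as a small helper. Everything here is illustrative: find_mismatches is a hypothetical name, and the numbers are made up; in practice the two dictionaries would be filled from a count query per table in Oracle and a count per tableName in ES.

```python
def find_mismatches(source_counts, es_counts):
    """Return {table: (source_count, es_count)} for every table whose counts differ."""
    mismatches = {}
    for table, expected in source_counts.items():
        actual = es_counts.get(table, 0)  # a table missing from ES counts as 0 documents
        if actual != expected:
            mismatches[table] = (expected, actual)
    return mismatches

# Made-up numbers, for illustration only.
source_counts = {"T_SOCIAL": 1200, "T_PRESS": 300}
es_counts = {"T_SOCIAL": 1160, "T_PRESS": 300}

print(find_mismatches(source_counts, es_counts))
```

Matching counts do not prove that every document is correct, but a mismatch is a cheap and reliable signal that rows were skipped or duplicated.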

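10. Counting how many rows were written

UnboundLocalError: local variable 'cc' referenced before assignment

cc is defined as a global variable but is modified inside a function. Assigning to a name inside a function makes Python treat that name as a local variable, hence the error; declare it with global cc inside the function.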
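The behaviour is easy to reproduce in isolation. A minimal sketch (Python 3 syntax, same rule as in Python 2), with function names invented for the example:

```python
cc = 0  # module-level counter

def count_without_global():
    try:
        cc += 1  # the assignment makes cc local to this function, so reading it fails
    except UnboundLocalError as exc:
        return type(exc).__name__

def count_with_global():
    global cc  # tell Python to use the module-level cc
    cc += 1
    return cc

print(count_without_global())
print(count_with_global())
```

Returning the count from the function and summing in the caller avoids the global altogether, which is usually easier to test.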

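VI. Improving the source code

Because writing was too slow (about 2 s to write 40 rows, roughly 800 KB), I changed both the strategy for reading rows not yet written from the database and the strategy for writing to ES.

(The completed source code was to be inserted here.)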
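The improved source is not included in this post. As one possible shape for the read side, and not necessarily what the author wrote, the sketch below reads ordered batches of rows whose ID is greater than the last synchronized ID. It assumes Python 3 and uses sqlite3 as a stand-in database; read_batches is a hypothetical helper. On Oracle 12c the same ordering and limit could be written as order by ... fetch first 40 rows only, or as a rownum filter around an ordered subquery.

```python
import sqlite3

def read_batches(cur, table, id_column, last_id, batch_size=40):
    """Yield lists of rows with id_column > last_id, in ascending ID order."""
    # table and id_column must come from trusted code, never from user input.
    query = "select {0}, title from {1} where {0} > ? order by {0} limit ?".format(
        id_column, table)
    while True:
        rows = cur.execute(query, (last_id, batch_size)).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]  # rows are ordered, so the last row carries the largest ID

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table T_PRESS (T_PRESS_ID integer primary key, title text)")
cur.executemany("insert into T_PRESS values (?, ?)",
                [(n, "title %d" % n) for n in range(1, 101)])

batches = list(read_batches(cur, "T_PRESS", "T_PRESS_ID", last_id=10))
print(len(batches), batches[0][0][0], batches[-1][-1][0])
conn.close()
```

Because each batch is ordered by ID, the ID of its last row is a safe value to store as the new checkpoint, and an empty batch ends the loop cleanly.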

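Debugging notes:

1. Run pip install elasticsearch, then import its helpers module and use the bulk function for batch import.

2. AttributeError: 'ES' object has no attribute 'transport': the script originally used the pyes module and now uses elasticsearch, so the connection has to be created with the corresponding module:

conn2 = Elasticsearch("127.0.0.1:9200")

Other common errors

    SerializationError: JSON serialization failed, usually because the data type of some field value is not supported

    RequestError: the submitted data is not in the correct format

    ConflictError: document ID conflict in the index

    TransportError: the connection could not be established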
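To show what the helpers.bulk route might look like, here is a hedged sketch. The names rows_to_actions and bulk_write, the row layout, and the _id scheme are assumptions made for the example; the _type key applies to ES 2.x, the version used in this post, and later ES versions removed mapping types. The actual network call is left commented out because it needs a reachable ES node and the elasticsearch package.

```python
def rows_to_actions(rows, index, doc_type, table_name):
    """Turn (id, title, content) row tuples into bulk action dictionaries."""
    for row_id, title, content in rows:
        yield {
            "_index": index,
            "_type": doc_type,  # ES 2.x mapping type
            # A stable _id means a re-run overwrites a document instead of duplicating it.
            "_id": "{}-{}".format(table_name, row_id),
            "_source": {"tableName": table_name, "tableId": row_id,
                        "title": title, "content": content},
        }

def bulk_write(host, actions):
    # Imported here so the rest of the sketch runs without the package installed.
    from elasticsearch import Elasticsearch, helpers
    client = Elasticsearch(host)
    return helpers.bulk(client, actions)  # returns (success count, list of errors)

rows = [(1, "first title", "first body"), (2, "second title", "second body")]
actions = list(rows_to_actions(rows, "pom", "spiderInfo", "T_PRESS"))
print(actions[0]["_id"], len(actions))
# bulk_write("127.0.0.1:9200", actions)  # requires a running ES node
```

Deriving _id from the table name and row ID is a design choice that makes the import idempotent, which also simplifies the count check described in section V.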

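In the end, after learning more, I found that the synchronization plugin logstash-input-jdbc can keep inserts, updates, deletes, and queries in sync, and following the tutorials online it is easy to set up. The one problem I ran into is that all fields synchronized by the plugin must be lowercase.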

------------

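Some cx_Oracle functions in Python:

commit(): commit the transaction
rollback(): roll back the transaction

Cursor methods for executing commands:
callproc(self, procname, args): calls a stored procedure; takes the procedure name and a list of arguments. In cx_Oracle it returns a copy of the argument list in which output parameters are replaced by their new values.
execute(self, query, args): executes a single SQL statement; takes the statement and its bind parameters. In cx_Oracle it returns the cursor itself for a query (which is why printing it earlier showed an OracleCursor object) and None for other statements; the number of affected rows is available as cursor.rowcount.
executemany(self, query, args): executes a single SQL statement repeatedly, once for each parameter set in the list.
nextset(self): moves to the next result set (an optional DB-API method that not every driver implements).

Cursor methods for receiving results:
fetchall(self): returns all remaining result rows.
fetchmany(self, size=None): returns up to size result rows. If fewer rows remain, only those rows are returned; if size is omitted, cursor.arraysize rows are fetched.
fetchone(self): returns one result row, or None when no rows remain.
scroll(self, value, mode='relative'): moves the cursor to a given row. With mode='relative' it moves value rows from the current row; with mode='absolute' it moves value rows from the first row of the result set.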
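The fetch methods and bind parameters can be tried out with any DB-API driver. The sketch below assumes Python 3 and uses sqlite3 as a stand-in; note that sqlite3 writes placeholders as ?, while cx_Oracle uses numbered or named binds such as :1 or :name. The table name is borrowed from the script, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table T_REMEMBERID (tableName text, tableId integer)")
cur.executemany("insert into T_REMEMBERID values (?, ?)",
                [("T_SOCIAL", 70), ("T_PRESS", 12), ("T_INDUSTRY", 5)])

# A bind parameter instead of str.format: the driver handles quoting.
cur.execute("select tableId from T_REMEMBERID where tableName = ?", ("T_SOCIAL",))
one = cur.fetchone()      # a single row tuple
nothing = cur.fetchone()  # None once the result set is exhausted

cur.execute("select tableName from T_REMEMBERID order by tableName")
first_two = cur.fetchmany(2)  # at most 2 rows
rest = cur.fetchmany(10)      # fewer rows remain, so only those are returned

print(one, nothing, first_two, rest)
conn.close()
```

Checking fetchone() for None before indexing it avoids the 'NoneType' object has no attribute '__getitem__' error listed in section V.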

Chinese encoding issues in MySQL

Add one more argument to conn = MySQLdb.Connect(host='localhost', user='root', passwd='root', db='python'):

conn = MySQLdb.Connect(host='localhost', user='root', passwd='root', db='python',charset='utf8')

The charset must match the encoding of your database; if the database uses gb2312, write charset='gb2312'.

Original article: https://www.cnblogs.com/Leo_wl/p/6075457.html
