zoukankan      html  css  js  c++  java
  • Python-从邮件中提取内容并插入数据库

    引子

    有100多封eml格式的本地邮件,需要从每封邮件中提取出特定的内容,并插入数据库中。难处在于如何使用正则提取内容。想试验正则表达式的结果,可以下载RegexBuddy,方便调试。

    代码

    import os
    from email import message_from_file
    import re
    import hashlib
    import json
    import pymssql
    
    def parseEmail(file):
        with open(file) as f:
            name = f.name
            msg = message_from_file(f)
            target_text = msg.get_payload()[0].get_payload()
        # <span>Timestamp: 2020-07-30 00:00:00 (UTC) </span>
        # <strong>Koality_AutoDQ4_OnePipeline_OfficeForms</strong>
        timestamp = re.findall("Timestamp: (d{4}-d{1,2}-d{1,2})",target_text)[0]
        dataset = re.findall("Dataset: .+_.+_.+_(.+_*w+) <http",target_text)[0]
        unique_dataset_name = re.findall("Dataset: (.+_.+_.+_.+) <http",target_text)[0]
        metrics = re.findall(" Metrics[:=]
    *(.+
    *.+)
    *<=*
    *h", target_text)[0]
        metrics = metrics.replace('
    ', '').replace('=','').replace(': ', '')
    
        # Koality_AutoDQ4_OA-OXO_Teams_DoD, whose dataset is  Teams_DoD instaed of DoD
        if 'Teams_' in unique_dataset_name:
            dataset = 'Teams_' + dataset
    
    
        incident_source_text = target_text[target_text.find('Incident List'): target_text.find('RootCause')].replace('=', '').replace('
    ', '')
        incidents_str = incident_source_text[:incident_source_text.find('<https')]
        incdient_groups = re.findall("(w+:w+),", incidents_str)
        incident_dict = {}
        for incident in incdient_groups:
            k, v = incident.split(':')
            incident_dict[k] = v
        incident_json = json.dumps(incident_dict)
        hashkey = toHash(incident_json)
        return timestamp, dataset, metrics, incident_json, hashkey, unique_dataset_name
    
    def toHash(text):
        hs = hashlib.md5()
        hs.update(text.encode())
        return hs.hexdigest().upper()
    
    
    def insertIntoDB(conn,table, results):
        cur = conn.cursor()
        count = 0
        for row in results:
            rowDict = {'table': table,
                       'timestamp':row[0], 
                       'dataset':row[1],
                       'metric':row[2], 
                       'json':row[3],
                       'hashkey':row[4],
                       'unique_dataset_name': row[5]}
            sql = """
                    insert into [{table}] 
                    ([WorkloadName], [MetricName], [DimensionCombination_JsonForDb_Hash], [Timestamp], [DimensionCombination_JsonForDb], [UniqueDatasetName], [MetricValue], [ExpectedMetricValue])
                    VALUES ('{dataset}', '{metric}','{hashkey}','{timestamp}', '{json}', '{unique_dataset_name}', 0, 0); 
            """.format(**rowDict)
            try:
                cur.execute(sql)
                print('Inserted a row......')
            except:
                print(row[1], row[5])
            count += 1
    
    
        print("Successfully inserted {:d} rows....".format(count))
        cur.close()
        conn.close()
        return True
    
    
    def main():
        print("Starting script......")
    
        emailFolder = r"C:WorkIncidentEmails"
    
        results = []
        for filePath in os.listdir(emailFolder):
            fullPath = emailFolder + "\" + filePath
            timestamp, dataset, metrics, incident_json, hashkey, unique_dataset_name = parseEmail(fullPath)
            results.append([timestamp, dataset, metrics, incident_json, hashkey, unique_dataset_name])
    
        conn = pymssql.connect('server', 'usr@server', 'pwd', 'db', autocommit=True)
        insertIntoDB(conn, 'KenshoAutoDQAlert', results)
    
    if __name__ == "__main__":
        main()
    View Code
  • 相关阅读:
    广域网(ppp协议、HDLC协议)
    0120. Triangle (M)
    0589. N-ary Tree Preorder Traversal (E)
    0377. Combination Sum IV (M)
    1074. Number of Submatrices That Sum to Target (H)
    1209. Remove All Adjacent Duplicates in String II (M)
    0509. Fibonacci Number (E)
    0086. Partition List (M)
    0667. Beautiful Arrangement II (M)
    1302. Deepest Leaves Sum (M)
  • 原文地址:https://www.cnblogs.com/yifeixu/p/13686661.html
Copyright © 2011-2022 走看看