[Repost] SqlBulkCopy: Bulk-Importing Log File Contents into SQL Server

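The previous post briefly described how to split each record of an Apache log file into fields with a regular expression. This post covers how to bulk-import the contents of the log file into a SQL Server database. The approach is as follows:

1. Use the SqlBulkCopy.WriteToServer(IDataReader reader) method to bulk-import the log records into the SQL Server database.

2. Write a custom TxtDataReader class that implements the IDataReader interface so it can be passed to SqlBulkCopy.WriteToServer.

3. Inside TxtDataReader, use named capture groups in the regular expression to extract the required fields.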

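Step 1: Implement the custom TxtDataReader class

1. The IDataReader interface members whose implementation is not listed in the code have no effect on the functionality implemented here.

2. Every named group in the pattern argument of the TxtDataReader constructor must be written as name_type, where type is matched only against the three values int|date|string.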
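The name_type convention can be tried in isolation. The following is a minimal sketch, not part of the original article: the class name PatternFields and the method Parse are hypothetical, and only the inner regular expression is taken from the TxtDataReader constructor.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not in the original article): lists the columns that a
// pattern declares through named groups written as name_type.
static class PatternFields
{
    public static List<KeyValuePair<string, string>> Parse(string pattern)
    {
        List<KeyValuePair<string, string>> fields = new List<KeyValuePair<string, string>>();
        // Same expression the TxtDataReader constructor uses to discover its columns.
        foreach (Match m in Regex.Matches(pattern, @"\(\?<([^<>]+)_(int|date|string)>"))
        {
            // Groups[1] is the column name, Groups[2] is the declared type.
            fields.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return fields;
    }
}
```

For example, PatternFields.Parse(@"(?<ip_string>[0-9.]+)\s(?<status_int>\d{3})") yields two entries: ip with type string, and status with type int. A group whose suffix is not one of the three types is simply not reported as a column.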

Main code

using System;
using System.Collections.Specialized;
using System.Globalization;
using System.IO;
using System.Text.RegularExpressions;

partial class TxtDataReader : System.Data.IDataReader
{
    string pattern;
    int fieldCount;
    TextReader reader;
    NameValueCollection FieldInfo;  // column name -> declared type (int|date|string)
    NameValueCollection items;      // column name -> value of the current row
    bool isClosed;

    /// <summary>
    /// Private constructor; called internally to create a TxtDataReader instance.
    /// </summary>
    /// <param name="file">Path of the text file to process.</param>
    /// <param name="pattern">Regular expression that matches one line of the file.</param>
    private TxtDataReader(string file, string pattern)
    {
        this.pattern = pattern;
        // Discover the columns from the named groups, each written as name_type.
        MatchCollection matches = Regex.Matches(pattern, "\\(\\?<([^<>]+)_(int|date|string)>");
        this.fieldCount = matches.Count;
        this.FieldInfo = new NameValueCollection();
        for (int i = 0; i < matches.Count; i++)
        {
            this.FieldInfo.Add(matches[i].Groups[1].Value, matches[i].Groups[2].Value);
        }
        this.reader = new StreamReader(file);
        this.isClosed = false;
    }

    /// <summary>
    /// Static factory method that returns a TxtDataReader instance.
    /// </summary>
    public static TxtDataReader ExecuteReader(string file, string pattern)
    {
        return new TxtDataReader(file, pattern);
    }

    /// <summary>
    /// Number of columns.
    /// </summary>
    public int FieldCount
    {
        get { return this.fieldCount; }
    }

    /// <summary>
    /// Reads and processes the next line of the file.
    /// </summary>
    public bool Read()
    {
        if (this.isClosed)
        {
            return false;
        }
        string line = this.reader.ReadLine();
        if (line == null)
        {
            // End of file: mark the reader as finished.
            this.isClosed = true;
            return false;
        }
        GroupCollection groups = Regex.Match(line, this.pattern).Groups;
        items = new NameValueCollection();
        for (int i = 0; i < this.FieldInfo.Count; i++)
        {
            // The group name in the pattern is name_type.
            string value = groups[FieldInfo.GetKey(i) + "_" + this.FieldInfo[i]].Value;
            if (this.FieldInfo[i] == "date")
            {
                // Apache timestamp, e.g. [10/Oct/2000:13:55:36 -0700]; month names are English.
                string format = "[dd/MMM/yyyy:HH:mm:ss zzzz]";
                value = DateTime.ParseExact(value, format, CultureInfo.InvariantCulture).ToString();
            }
            if (this.FieldInfo[i] == "int")
            {
                // Non-numeric values such as "-" are stored as 0.
                int number;
                value = int.TryParse(value, out number) ? number.ToString() : "0";
            }
            items.Add(FieldInfo.GetKey(i), value);
        }

        return true;
    }

    /// <summary>
    /// Returns the value of the column at the given index.
    /// </summary>
    public object GetValue(int i)
    {
        return items[i];
    }

    /// <summary>
    /// Returns whether the reader is closed.
    /// </summary>
    public bool IsClosed
    {
        get { return this.isClosed; }
    }

    /// <summary>
    /// Closes the reader.
    /// </summary>
    public void Close()
    {
        this.reader.Close();
        this.isClosed = true;
    }

    #region IDisposable members

    public void Dispose()
    {
        this.Close();
    }

    #endregion
}

Step 2: Build the regular expression that TxtDataReader uses to parse each log record

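Regular expression	Type	Notes	Description
^			Matches the start of each line.
(?<ip_string>[0-9.]+)\s	string	not null	Matches the IP address.
(?<identity_string>[\w.-]+)\s	string	may be "-"	Matches the identity: letters, digits, underscores, dots, or hyphens.
(?<userid_string>[\w.-]+)\s	string	may be "-"	Matches the userid: letters, digits, underscores, dots, or hyphens.
(?<date_date>\[[^\[\]]+\])\s"	datetime	not null	Matches the timestamp.
(?<request_string>(?:[^"]|\")*)"\s	string	may be ""	Matches the request line.
(?<status_int>\d{3})\s	int	three digits	Matches the status code.
(?<bytes_int>\d+|-)\s"	int	may be "-"	Matches the response size in bytes, or -.
(?<ref_string>(?:[^"]|\")*)"\s"	string	may be "-"	Matches the "Referer" request header; escaped double quotes (\") may appear inside the quotes.
(?<useragent_string>(?:[^"]|\")*)"	string	may be "-"	Matches the "User-Agent" request header; escaped double quotes (\") may appear inside the quotes.
$			Matches the end of the line.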

The complete regular expression is:

^(?<ip_string>[0-9.]+)\s(?<identity_string>[\w.-]+)\s(?<userid_string>[\w.-]+)\s(?<date_date>\[[^\[\]]+\])\s"(?<request_string>(?:[^"]|\")*)"\s(?<status_int>\d{3})\s(?<bytes_int>\d+|-)\s"(?<ref_string>(?:[^"]|\")*)"\s"(?<useragent_string>(?:[^"]|\")*)"$
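To see which groups the expression captures, it can be run against a single line. The sketch below is illustrative only: the class LogLineDemo, its members, and the sample log line are assumptions made for this example and do not come from the original article. The pattern is the complete expression above, written as C# verbatim strings (a doubled "" stands for one double quote).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class (not in the original article).
static class LogLineDemo
{
    // The complete expression from the article, split across verbatim strings.
    public const string Pattern =
        @"^(?<ip_string>[0-9.]+)\s(?<identity_string>[\w.-]+)\s(?<userid_string>[\w.-]+)\s" +
        @"(?<date_date>\[[^\[\]]+\])\s""(?<request_string>(?:[^""]|\"")*)""\s" +
        @"(?<status_int>\d{3})\s(?<bytes_int>\d+|-)\s""(?<ref_string>(?:[^""]|\"")*)""\s" +
        @"""(?<useragent_string>(?:[^""]|\"")*)""$";

    // Made-up sample line in Apache combined log format (an assumption for this sketch).
    public const string Sample =
        @"127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] ""GET /apache_pb.gif HTTP/1.0"" " +
        @"200 2326 ""http://www.example.com/start.html"" ""Mozilla/4.08 [en] (Win98; I ;Nav)""";

    // Matches one log line; check Success before reading the groups.
    public static Match Parse(string line)
    {
        return Regex.Match(line, Pattern);
    }
}
```

With the sample line, LogLineDemo.Parse(LogLineDemo.Sample).Groups["status_int"].Value is "200" and Groups["request_string"].Value is GET /apache_pb.gif HTTP/1.0. A line that does not fit the format returns a Match whose Success property is false, and all of its groups are empty.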

Step 3: Create the table to import into

Code

CREATE TABLE [dbo].[test](
    [ip] [nvarchar](64),
    [identity] [nvarchar](64),
    [userid] [nvarchar](64),
    [date] [datetime] NULL,
    [request] [nvarchar](2000),
    [status] [int] NULL,
    [bytes] [int] NULL,
    [ref] [nvarchar](2000),
    [useragent] [nvarchar](2000) NULL
)

Step 4: Import the data with SqlBulkCopy

Code

static void Main(string[] args)
{
    // Path of the log file to import; taking it from the first command-line argument is an assumption here.
    string file = args[0];
    string pattern = "^(?<ip_string>[0-9.]+)\\s(?<identity_string>[\\w.-]+)\\s(?<userid_string>[\\w.-]+)\\s(?<date_date>\\[[^\\[\\]]+\\])\\s\"(?<request_string>(?:[^\"]|\\\")*)\"\\s(?<status_int>\\d{3})\\s(?<bytes_int>\\d+|-)\\s\"(?<ref_string>(?:[^\"]|\\\")*)\"\\s\"(?<useragent_string>(?:[^\"]|\\\")*)\"$";
    string conn = "Data Source=.;Initial Catalog=seodb;Integrated Security=True";
    using (SqlBulkCopy sbc = new SqlBulkCopy(conn))
    {
        TxtDataReader reader = TxtDataReader.ExecuteReader(file, pattern);
        sbc.DestinationTableName = "dbo.test";
        try
        {
            // Columns are mapped by position, so the group order must match the table.
            sbc.WriteToServer(reader);
        }
        catch (Exception ex)
        {
            // Report the failure instead of silently swallowing it.
            Console.Error.WriteLine(ex);
        }
        finally
        {
            reader.Close();
        }
    }
}
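The import above relies on SqlBulkCopy mapping source columns to destination columns by position. If the order of the groups in the pattern and the order of the table columns ever diverge, explicit mappings make the intent visible. The following is a rough sketch under stated assumptions: the class BulkCopySetup, the method Create, and the batch size of 5000 are hypothetical choices for illustration, while DestinationTableName, BatchSize, BulkCopyTimeout, and ColumnMappings are existing SqlBulkCopy members.

```csharp
using System.Data.SqlClient;

// Hypothetical helper (not in the original article).
static class BulkCopySetup
{
    // Columns of dbo.test, in the order the groups appear in the pattern.
    static readonly string[] Columns =
    {
        "ip", "identity", "userid", "date", "request",
        "status", "bytes", "ref", "useragent"
    };

    // Creates a configured SqlBulkCopy; the caller is responsible for disposing it.
    public static SqlBulkCopy Create(string connectionString)
    {
        SqlBulkCopy sbc = new SqlBulkCopy(connectionString);
        sbc.DestinationTableName = "dbo.test";
        sbc.BatchSize = 5000;     // assumed value: rows sent to the server per batch
        sbc.BulkCopyTimeout = 0;  // 0 means no time limit
        for (int i = 0; i < Columns.Length; i++)
        {
            // Source ordinal i of the reader goes to the named destination column.
            sbc.ColumnMappings.Add(i, Columns[i]);
        }
        return sbc;
    }
}
```

Mapping by source ordinal rather than by source name is deliberate here: it only needs FieldCount and GetValue from the reader, which are members the TxtDataReader shown earlier already implements.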

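Notes:

1. If other code needs to read values from the reader as reader[int] or reader[string], the indexers must be implemented. The same applies to any other member defined by the IDataReader interface that you need to use.

Reference code: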

public object this[string name]
{
    get { return items[name]; }
}

public object this[int i]
{
    get { return items[i]; }
}
// Usage: Console.WriteLine(reader["ip"]);

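2. To keep the sample code simple and clear, the IDataReader members that have no concrete implementation were placed in the other part of the partial class.

3. The problem addressed here is parsing a log file and importing it into a database. Only the minimal code needed for that is provided; it does not even include necessary exception handling or handling of data truncation, overflow, and similar cases. For real-world use you need to extend it and add that code yourself.

4. A simple test, for reference only: Windows XP, .NET 2.0, SQL Server 2005, 2 GB of RAM. A 217 MB log file from a production site was processed ten times, with an average time of 35 seconds.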
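Note 3 leaves the handling of bad values, truncation, and overflow to the reader. One possible shape for such helpers is sketched below. Everything in it is an assumption made for illustration: the class FieldConverter and its methods are hypothetical, storing NULL for unparsable values is one option among several (the original code stores 0 for bad integers), and converting timestamps to UTC is a choice made for this sketch.

```csharp
using System;
using System.Globalization;

// Hypothetical helpers (not in the original article).
static class FieldConverter
{
    // Returns the parsed integer, or DBNull.Value when the text is not a valid
    // 32-bit integer (for example "-", an empty string, or an overflowing number).
    public static object ToInt(string text)
    {
        int n;
        return int.TryParse(text, NumberStyles.Integer, CultureInfo.InvariantCulture, out n)
            ? (object)n
            : DBNull.Value;
    }

    // Parses an Apache timestamp such as [10/Oct/2000:13:55:36 -0700] and returns
    // it as a UTC DateTime, or DBNull.Value when the text does not fit the format.
    public static object ToDate(string text)
    {
        DateTimeOffset d;
        return DateTimeOffset.TryParseExact(text, "'['dd/MMM/yyyy:HH:mm:ss zzz']'",
                   CultureInfo.InvariantCulture, DateTimeStyles.None, out d)
            ? (object)d.UtcDateTime
            : DBNull.Value;
    }

    // Cuts a string to the destination column length, e.g. 2000 for [request].
    public static string Truncate(string text, int maxLength)
    {
        if (text == null || text.Length <= maxLength)
        {
            return text;
        }
        return text.Substring(0, maxLength);
    }
}
```

Returning typed values or DBNull.Value from GetValue instead of strings would also require the destination columns to allow NULL, which the [date], [status], and [bytes] columns of dbo.test do.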

Original article: https://www.cnblogs.com/fx2008/p/2280216.html