  • Second-pass data processing workflow for the paper

    Step 1:
    Based on the program.list file in the conf directory, create a folder named after each program under raw_data.
    Based on the program_keywords file in the conf directory, create the program's filter-word file inside each program's folder.

    Step 2:

    Filter sina_weibo.data by each program's filter words, applying the program's keywords one after another.

    This yields a corresponding program.data file; the extracted fields are:

    weibo id ($2), user id ($3), creation time ($5), reposts ($11), comments ($12), likes ($13), content ($6)

    (The complete script covering these two steps is part of the full workflow script after step 5.)

    Step 3:

    Tweets filtered by the program title alone are saved in the .title file (steps 2 and 3 could in fact be merged).

    Step 4: Extract samples for annotation. Shuffle the overall .uniq file and every .title.uniq file, take 1000 lines from each, and save them as $program.sample and $program.title.sample.

    Step 5: Copy the sample files above to .annotate files, in preparation for possible manual annotation in the next step.

    The code for the above workflow:

    #!/bin/sh
    root_dir=/home/minelab/liweibo
    source_file=/home/minelab/cctv2014/data_warehouse/sina_weibo.data
    conf_dir=$root_dir/conf
    raw_dir=$root_dir/raw_data
    
    # Under raw_data, create a folder named after each program and build each program's keyword (filter-word) file
    echo "make the program dir..."
    while read line
    do
        rm -rf $raw_dir/$line
        mkdir  $raw_dir/$line
        cat $conf_dir/program_keywords | grep "$line" | awk -F'\t' '{for(i=1;i<=NF;i++) print $i}' > $raw_dir/$line/$line.filterwords
        echo "$line: mkdir and filter-word extraction done!"
    done < $conf_dir/program.list
    
    echo 'get the candidate tweet for each program filtering by the keywords...'
    program_list=`ls $raw_dir`
    for program in $program_list
    do
        rm -rf $raw_dir/$program/$program.data
        rm -rf $raw_dir/$program/$program.uniq
        while read line
        do
            cat $source_file | grep "$line" | awk -F'\t' '{print $2"\t"$3"\t"$5"\t"$11"\t"$12"\t"$13"\t"$6}' >> $raw_dir/$program/$program.data
        done < $raw_dir/$program/$program.filterwords
        echo $program "filtering is done!"
        # strip short URLs (t.cn links), then deduplicate by text
        sed -i 's|http://t\.cn/[a-zA-Z0-9]\{4,9\}||g' $raw_dir/$program/$program.data
        echo $program "remove url is done..."
        cat $raw_dir/$program/$program.data | sort -t "$(printf '\t')" -k 7 | uniq -f 6 > $raw_dir/$program/$program.uniq
        echo $program "uniq is done ..."
    done
    echo "filter tweet by all words is done..."
    
    echo 'get the candidate tweet for each program filtering by the title...'
    program_list=`ls $raw_dir`
    for program in $program_list
    do
        rm -rf $raw_dir/$program/$program.title
        rm -rf $raw_dir/$program/$program.title.uniq
        cat $source_file | grep "$program" | awk -F'\t' '{print $2"\t"$3"\t"$5"\t"$11"\t"$12"\t"$13"\t"$6}' > $raw_dir/$program/$program.title
        echo $program "filtering is done!"
        # strip short URLs (t.cn links), then deduplicate by text
        sed -i 's|http://t\.cn/[a-zA-Z0-9]\{4,9\}||g' $raw_dir/$program/$program.title
        echo $program "remove url is done..."
        cat $raw_dir/$program/$program.title | sort -t "$(printf '\t')" -k 7 | uniq -f 6 > $raw_dir/$program/$program.title.uniq
        echo $program "uniq is done ..."
    done
    echo "preData is done..."
    
    
    echo "sample is begining ..."
    program_list=`ls $raw_dir`
    for program in $program_list
    do 
        rm -rf $raw_dir/$program/$program.sample
        rm -rf $raw_dir/$program/$program.title.sample
        cat $raw_dir/$program/$program.uniq | shuf | head -n 1000 > $raw_dir/$program/$program.sample
        cat $raw_dir/$program/$program.title.uniq | shuf | head -n 1000 > $raw_dir/$program/$program.title.sample
        echo $program "sampling is done..."
    done
    
    echo "statics start..."
    program_list=`ls $raw_dir`
    for program in $program_list
    do 
        rm -rf $raw_dir/$program/$program.statistic 
        wc -l $raw_dir/$program/$program.data >> $raw_dir/$program/$program.statistic
        wc -l $raw_dir/$program/$program.uniq >> $raw_dir/$program/$program.statistic
        wc -l $raw_dir/$program/$program.title >> $raw_dir/$program/$program.statistic
        wc -l $raw_dir/$program/$program.title.uniq >> $raw_dir/$program/$program.statistic
        wc -l $raw_dir/$program/$program.sample>> $raw_dir/$program/$program.statistic
        wc -l $raw_dir/$program/$program.title.sample>> $raw_dir/$program/$program.statistic
        echo $program "statistic is done..."
    done
    
    echo "copy for annotate ..."
    program_list=`ls $raw_dir`
    for program in $program_list
    do 
        rm -rf $raw_dir/$program/$program.sample.annotate
        rm -rf $raw_dir/$program/$program.title.sample.annotate
        cp $raw_dir/$program/$program.sample $raw_dir/$program/$program.sample.annotate
        cp $raw_dir/$program/$program.title.sample $raw_dir/$program/$program.title.sample.annotate
        echo $program "copy for annotate is done ..."
    done
    Second-pass data preprocessing code
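    Each line of the .data/.uniq files produced above is tab-separated in the field order of step 2. Below is a minimal sketch of reading such a file in Python; the function and field names are illustrative only, not part of the original scripts.

    # read a program.uniq file produced by the shell pipeline above
    # tab-separated columns: weibo id, user id, creation time, reposts, comments, likes, content
    def load_program_file(path):
        records = []
        with open(path) as f:
            for line in f:
                parts = line.rstrip('\n').split('\t')
                if len(parts) != 7:
                    continue  # skip malformed rows
                records.append({
                    'weibo_id': parts[0],
                    'user_id': parts[1],
                    'created_at': parts[2],
                    'reposts': parts[3],
                    'comments': parts[4],
                    'likes': parts[5],
                    'content': parts[6],
                })
        return records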

    When the training and test sets differ, the metrics are averaged over the runs.

    The Python scripts for the baseline of the above experiments:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import os
    import os.path
    import sys
    
    root_dir='/media/新加卷__/小论文实验/data/liweibo'
    
    def init():
        reload(sys)
        sys.setdefaultencoding('utf8')
    
    
    def traverseFile(dir_name,suffix_list,result_list,recursive=True):
        init()
    #    print 'loading path: ' + dir_name
        files=os.listdir(dir_name)
        for suffix in suffix_list:
            for file_name in files:
                full_name=dir_name+'/'+file_name
                if os.path.isdir(full_name) and recursive:
                    traverseFile(full_name,suffix_list,result_list,recursive)
                else:
                    if(full_name.endswith(suffix)):
                        result_list.append(full_name)
        return result_list
    
    def printlist(l):
        for i in range(len(l)):
            print l[i]
    
    if __name__=="__main__":
        result_list=list()
        traverseFile(root_dir,['.fenci'],result_list)
        for result in result_list:
            print result
    Helper utilities for loading files with a given suffix and printing lists
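    A minimal usage sketch of these helpers (illustrative only; it assumes the module is saved as myUtil.py, which is how the experiment script below imports it, and that the directory contains .fenci files):

    import myUtil

    fenci_files = list()
    # recursively collect every file ending in .fenci under raw_data
    myUtil.traverseFile('/home/minelab/liweibo/raw_data', ['.fenci'], fenci_files)
    # print one path per line
    myUtil.printlist(fenci_files)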
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import numpy as np
    import random
    
    import myUtil
    
    from sklearn import cross_validation
    from sklearn import svm
    from sklearn import metrics
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    import pylab as pl
    
    def loadCorpusDataFromFile(in_file_name):
        label_list=list()
        corpus=list()
        with open(in_file_name) as in_file:
            for line in in_file:
                line_arr=line.strip().split('\t')
                if(len(line_arr)<3):
                    continue
                label_list.append(int(line_arr[0]))
                corpus.append(line_arr[2])
        return (corpus,label_list)
    
    def preDataByWordBag(in_file_name_list):
        v=CountVectorizer(min_df=1)
        corpus=list()
        label_list=list()
        for in_file_name in in_file_name_list:
            (cur_corpus,cur_label_list)=loadCorpusDataFromFile(in_file_name)
            corpus.extend(cur_corpus)
            label_list.extend(cur_label_list)
        data_list=v.fit_transform(corpus)
        label_list=np.array(label_list)
        return (data_list,label_list)
    
    def preDataByTfidf(in_file_name_list):
        v=TfidfVectorizer(min_df=1)
        corpus=list()
        label_list=list()
        for in_file_name in in_file_name_list:
            (cur_corpus,cur_label_list)=loadCorpusDataFromFile(in_file_name)
            corpus.extend(cur_corpus)
            label_list.extend(cur_label_list)
        data_list=v.fit_transform(corpus)
        label_list=np.array(label_list)
        return (data_list,label_list)
    '''
    Training and test sets are specified explicitly, so no cross-validation is needed.
    data_train: training set
    data_test: test set
    classifier: the classifier to use
    '''
    def trainModelAllocateTestData(data_train,data_test,classifier):
        print "start to trainModel..."
        x_train=data_train[0]
        y_train=data_train[1]
        x_test=data_test[0]
        y_test=data_test[1]
    
        n_samples,n_features=x_train.shape
        print "n_samples:"+str(n_samples)+"n_features:"+str(n_features)
    
        classifier.fit(x_train,y_train)
        y_true,y_pred=y_test,classifier.predict(x_test)
        precision=metrics.precision_score(y_true,y_pred)
        recall=metrics.recall_score(y_true,y_pred)
        accuracy=metrics.accuracy_score(y_true,y_pred)
    #    accuracy=classifier.score(x[test],y_true)
        f=metrics.fbeta_score(y_true,y_pred,beta=1)
        probas_=classifier.predict_proba(x_test)
        fpr,tpr,thresholds=metrics.roc_curve(y_test,probas_[:,1])
        roc_auc=metrics.auc(fpr,tpr)
        print("precision:%0.2f,recall:%0.2f,f:%0.2f,accuracy:%0.2f,roc_auc:%0.2f" % (precision,recall,f,accuracy,roc_auc))
        
        #plot ROC curve
        pl.clf()
        pl.plot(fpr,tpr,label='ROC curve (area = %0.2f)' % roc_auc)
        pl.plot([0,1],[0,1],'k--')
        pl.xlim([0.0,1.0])
        pl.ylim([0.0,1.0])
        pl.xlabel('False Positive Rate')
        pl.ylabel('True Positive Rate')
        pl.title('Receiver Operating Characteristic example')
        pl.legend(loc='lower right')
        pl.show()
        return (precision,recall,f,accuracy,roc_auc)
    
    def trainModel(data,classifier,n_folds=5):
        print "start to trainModel..."
        x=data[0]
        y=data[1]
    
        # shuffle samples
        n_samples,n_features=x.shape
        print "n_samples:"+str(n_samples)+"n_features:"+str(n_features)
        p=range(n_samples)
        random.seed(0)
        random.shuffle(p)
        x,y=x[p],y[p]
    
        #cross_validation
        cv=cross_validation.KFold(len(y),n_folds=n_folds)
        mean_tpr=0.0
        mean_fpr=np.linspace(0,1,100)
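        # to average ROC curves across folds, each fold's TPR is interpolated onto the common FPR grid mean_fpr and accumulated in mean_tpr (normalized after the loop)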
    
        mean_recall=0.0
        mean_accuracy=0.0
        mean_f=0.0
        mean_precision=0.0
    
        for i,(train,test) in enumerate(cv):
            print "the "+str(i)+"times validation..."
            classifier.fit(x[train],y[train])
            y_true,y_pred=y[test],classifier.predict(x[test])
            mean_precision+=metrics.precision_score(y_true,y_pred)
            mean_recall+=metrics.recall_score(y_true,y_pred)
    #        mean_accuracy+=metrics.accuracy_score(y_true,y_pred)
            mean_accuracy+=classifier.score(x[test],y_true)
            mean_f+=metrics.fbeta_score(y_true,y_pred,beta=1)
            
            probas_=classifier.predict_proba(x[test])
            fpr,tpr,thresholds=metrics.roc_curve(y[test],probas_[:,1])
            mean_tpr+=np.interp(mean_fpr,fpr,tpr)
            mean_tpr[0]=0.0
            roc_auc=metrics.auc(fpr,tpr)
            pl.plot(fpr,tpr,lw=1,label='ROC fold %d (area=%0.2f)'%(i,roc_auc))
        pl.plot([0,1],[0,1],'--',color=(0.6,0.6,0.6),label='luck')
    
        mean_precision/=len(cv)
        mean_recall/=len(cv)
        mean_f/=len(cv)
        mean_accuracy/=len(cv)
    
        mean_tpr/=len(cv)
        mean_tpr[-1]=1.0
        mean_auc=metrics.auc(mean_fpr,mean_tpr)
        print("mean_precision:%0.2f,mean_recall:%0.2f,mean_f:%0.2f,mean_accuracy:%0.2f,mean_auc:%0.2f " % (mean_precision,mean_recall,mean_f,mean_accuracy,mean_auc))
        pl.plot(mean_fpr,mean_tpr,'k--',label='Mean ROC (area=%0.2f)'% mean_auc,lw=2)
    
        pl.xlim([-0.05,1.05])
        pl.ylim([-0.05,1.05])
        pl.xlabel('False Positive Rate')
        pl.ylabel('True Positive Rate')
        pl.title('ROC')
        pl.legend(loc="lower right")
        pl.show()
    
    def removeOneFeatureThenTrain(data,clf):
        x=data[0]
        y=data[1]
        n_samples,n_features=x.shape
        for i in range(n_features):
            print 'remove ' + str(i+1) + ' feature...'
            data_one=x[:,0:i]
            data_two=x[:,(i+1):n_features]
            data_leave=np.column_stack((data_one,data_two))
            trainModel((data_leave,y),clf)
    
    def chooseSomeFeaturesThenTrain(data,clf,choose_index):
        x=data[0]
        y=data[1]
        (n_samples,n_features)=x.shape
        result_data=np.zeros(n_samples).reshape(n_samples,1)
        for i in choose_index:
            if i<1 or i > n_features:
                print 'error feature_index'
                return
            choose_column=x[:,(i-1)].reshape(n_samples,1)
            result_data=np.column_stack((result_data,choose_column))
        result_data=(result_data[:,1:],y)        
        trainModel(result_data,clf)
            
            
    def main():
        # classify with an SVM
        clf=svm.SVC(kernel='linear',C=1,probability=True,random_state=0)
    
        # weighting with my own hand-crafted features
    #    print "using my own weight strategy..."
    #    data=preData()
    #    trainModel(data,clf)

    #    # weighting with bag-of-words
    #    print "using wordBag strategy..."
    #    data=preDataByWordBag()
    #    trainModel(data,clf)

        # weighting with tf-idf
    #    print "using tfidf strategy..."
    #    data=preDataByTfidf()
    #    trainModel(data,clf)
    
        # use the library's built-in k-fold cross-validation
        #data_list=data[0]
        #label_list=data[1]
        #scores=cross_validation.cross_val_score(clf,data_list,label_list,cv=5)
        #print    scores
        #print("Accuracy:%0.2f(+/-%0.2f)"%(scores.mean(),scores.std()**2))
    
        # leave one feature out at a time and evaluate
        print "begin to remove one feature at a time..."
        #data=preData()
        #removeOneFeatureThenTrain(data,clf)
    
        # choose subsets of features and evaluate each
        print "begin to choose some features.."
        data=preData()
        n_samples,n_features=data[0].shape
        for i in range(1,n_features+1):
            chooseSomeFeaturesThenTrain(data,clf,[i])
            
    
    root_dir='/media/新加卷_/小论文实验/data/liweibo/raw_data'
    '''
    Load all word-segmented files, using the .fenci suffix as the file filter.
    '''
    def loadAllFenciFile():
        file_list=list()
        # load the segmented-word files to prepare data for the classifier
        myUtil.traverseFile(root_dir,['.fenci'],file_list)
        return file_list
    
    '''
    Use all the data as the training set.
    '''
    def testAllFile():
        file_list=loadAllFenciFile()
        clf=svm.SVC(kernel='linear',C=1,probability=True,random_state=0)
        data=preDataByWordBag(file_list)
        trainModel(data,clf)
    
        
    '''
    Train on each program individually.
    '''
    def testEachFile():
        clf=svm.SVC(kernel='linear',C=1,probability=True,random_state=0)
        file_list=loadAllFenciFile()
        for i in range(len(file_list)):
            if i==1:
                continue
            data=preDataByWordBag([file_list[i]])
            trainModel(data,clf)
    
    def trainBySomeTestByOther():
        clf=svm.SVC(kernel='linear',C=1,probability=True,random_state=0)
        ambiguity_list=loadAllFenciFile()
        mean_precision=0.0
        mean_recall=0
        mean_f=0.0
        mean_accuracy=0.0
        mean_auc=0.0
        program_num=len(ambiguity_list)
        for i in  range(program_num):
            test_file=ambiguity_list[i]
            ambiguity_list.remove(test_file)
            ambiguity_list.append(test_file)
            print 'test_file:'
            print test_file
            print 'train_file:'
            myUtil.printlist(ambiguity_list)
            test_line=len(loadCorpusDataFromFile(test_file)[1])
            data_all=preDataByWordBag(ambiguity_list)
            data_train=(data_all[0][0:-test_line],data_all[1][0:-test_line])
            data_test=(data_all[0][-test_line:],data_all[1][-test_line:])
            (precision,recall,f,accuracy,roc_auc)=trainModelAllocateTestData(data_train,data_test,clf)    
            mean_precision+=precision
            mean_recall+=recall
            mean_f+=f
            mean_accuracy+=accuracy
            mean_auc+=roc_auc
            ambiguity_list=loadAllFenciFile()
        mean_precision/=program_num
        mean_recall/=program_num
        mean_f/=program_num
        mean_accuracy/=program_num
        mean_auc/=program_num
        print("the average result of train by some test by other is:")
        print("mean_precision:%0.2f,mean_recall:%0.2f,mean_f:%0.2f,mean_accuracy:%0.2f,mean_auc:%0.2f " % (mean_precision,mean_recall,mean_f,mean_accuracy,mean_auc))
        
    #------------------------- training with my own extracted features --------------------------------------
    def loadMyDataForSingle(inFilePath):
        label_list=list()
        data_list=list()
        with open(inFilePath) as inFile:
            for line in inFile:
                lineArr=line.strip().split('\t')
                if(len(lineArr)!=8):
                    continue
                label_list.append(int(lineArr[0]))
                data_list.append([float(lineArr[1]),float(lineArr[2]),float(lineArr[3]),float(lineArr[4]),float(lineArr[5]),float(lineArr[6]),float(lineArr[7])])
        return (data_list,label_list)
    
    def loadMyDataForMany(inFilePathList):
        label_list=list()
        data_list=list()
        for inFilePath in inFilePathList:
            result=loadMyDataForSingle(inFilePath)
            label_list.extend(result[1])
            data_list.extend(result[0])
        return (np.array(data_list),np.array(label_list))
    def loadAllMyFile():
        file_list=list()
        myUtil.traverseFile(root_dir,['.result'],file_list)
        return file_list
    
    def trainAllByMine():
        file_list=loadAllMyFile()
        data=loadMyDataForMany(file_list)
        clf=svm.SVC(kernel='linear',C=1,probability=True,random_state=0)
        trainModel(data,clf)
    def trainSomeTestOtherByMine():
        clf=svm.SVC(kernel='linear',C=1,probability=True,random_state=0)
        file_list=loadAllMyFile()
        mean_precision=0.0
        mean_recall=0
        mean_f=0.0
        mean_accuracy=0.0
        mean_auc=0.0
        program_num=len(file_list)
        for i in  range(program_num):
            test_file=file_list[i]
            file_list.remove(test_file)
            print 'test_file:'
            print test_file
            print 'train_file:'
            myUtil.printlist(file_list)
            # train on the remaining programs, test on the held-out one
            data_train=loadMyDataForMany(file_list)
            data_test=loadMyDataForMany([test_file])
            (precision,recall,f,accuracy,roc_auc)=trainModelAllocateTestData(data_train,data_test,clf)    
            mean_precision+=precision
            mean_recall+=recall
            mean_f+=f
            mean_accuracy+=accuracy
            mean_auc+=roc_auc
            file_list=loadAllMyFile()
        mean_precision/=program_num
        mean_recall/=program_num
        mean_f/=program_num
        mean_accuracy/=program_num
        mean_auc/=program_num
        print("the average result of train by some test by other is:")
    
    if __name__=='__main__':
        # train on all programs together with bag-of-words
        #testAllFile()

        # train each program separately with bag-of-words
        #testEachFile()

        # train with bag-of-words on the highly ambiguous files, with the test set different from the training set
        #trainBySomeTestByOther()

        # train with my own feature extraction, training and test sets combined
        #trainAllByMine()

        # train with my own feature extraction, training and test sets kept separate
        trainSomeTestOtherByMine()
    Python script for the second-pass experiments
    Using the simplest bag-of-words model as the baseline, the results are as follows.

    When the training and test sets come from the same source, with 5-fold cross-validation:

    mean_precision:0.92,mean_recall:0.92,mean_f:0.92,mean_accuracy:0.90,mean_auc:0.96

    When the test set is 团圆饭:

    precision:0.49,recall:0.39,f:0.43,accuracy:0.62,roc_auc:0.62

    When the test set is 我就这么个人:

    precision:0.95,recall:0.81,f:0.88,accuracy:0.83,roc_auc:0.91

    When the test set is 我的要求不算高:

    precision:0.92,recall:0.31,f:0.47,accuracy:0.54,roc_auc:0.79

    When the test set is 扶不扶:

    precision:0.93,recall:0.87,f:0.90,accuracy:0.83,roc_auc:0.85

    When the test set is 时间都去哪儿:

    precision:0.84,recall:0.35,f:0.49,accuracy:0.58,roc_auc:0.69

    When the test set is 说你什么好:

    precision:0.59,recall:0.79,f:0.67,accuracy:0.66,roc_auc:0.76

    Averaging the above gives:
    
    mean_precision:0.79,mean_recall:0.59,mean_f:0.64,mean_accuracy:0.68,mean_auc:0.77
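    These averages can be recomputed directly from the six per-program rows above; a small sanity-check snippet (the numbers are copied from the results listed in this post):

    # recompute the averaged baseline metrics from the six per-program results above
    rows = [
        (0.49, 0.39, 0.43, 0.62, 0.62),  # 团圆饭
        (0.95, 0.81, 0.88, 0.83, 0.91),  # 我就这么个人
        (0.92, 0.31, 0.47, 0.54, 0.79),  # 我的要求不算高
        (0.93, 0.87, 0.90, 0.83, 0.85),  # 扶不扶
        (0.84, 0.35, 0.49, 0.58, 0.69),  # 时间都去哪儿
        (0.59, 0.79, 0.67, 0.66, 0.76),  # 说你什么好
    ]
    names = ('precision', 'recall', 'f', 'accuracy', 'auc')
    means = [sum(col) / len(rows) for col in zip(*rows)]
    print(','.join('mean_%s:%0.2f' % (n, m) for n, m in zip(names, means)))
    # -> mean_precision:0.79,mean_recall:0.59,mean_f:0.64,mean_accuracy:0.68,mean_auc:0.77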
    
    Using my own weighting method with the training and test sets mixed together, the result is:

    mean_precision:0.94,mean_recall:0.81,mean_f:0.87,mean_accuracy:0.85,mean_auc:0.88

    With the training and test sets kept separate, the results are:

    When the test set is 团圆饭:

    precision:0.98,recall:0.71,f:0.83,accuracy:0.80,roc_auc:0.85

    When the test set is 我就这么个人:

    precision:0.90,recall:0.73,f:0.81,accuracy:0.80,roc_auc:0.81

    When the test set is 我的要求不算高:

    precision:0.74,recall:0.88,f:0.80,accuracy:0.74,roc_auc:0.86

    When the test set is 扶不扶 (this one clearly drags down the average; worth trying with it removed):

    precision:0.55,recall:1.00,f:0.71,accuracy:0.55,roc_auc:0.54

    When the test set is 时间都去哪儿:

    precision:0.74,recall:0.97,f:0.84,accuracy:0.77,roc_auc:0.91

    When the test set is 说你什么好:

    precision:0.97,recall:0.75,f:0.85,accuracy:0.83,roc_auc:0.86

    The average result is:

    precision:0.97,recall:0.75,f:0.85,accuracy:0.83,roc_auc:0.86

    Second-pass experiment results


     
