  • case7: lymphoma sub-type classification, experiment notes


    Introduction

    Classification problem: 3 classes (identifying three sub-types of lymphoma: Chronic Lymphocytic Leukemia (CLL), Follicular Lymphoma (FL), and Mantle Cell Lymphoma (MCL))
    Network model: AlexNet
    Dataset: original images are 1388×1040, 374 images in total, 1.4 GB. CLL: 113, FL: 139, MCL: 122


    Experiment workflow

    Preparation
    Set up the Caffe environment and download the dataset and the tutorial code.

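    • Cut the large images into small patches. Code: step1_make_patches.m.
      The only thing that needs changing in the code is the paths. For convenience, put the dataset in the same directory as the .m file.
      Before running it, add the class name as a prefix to every image in each class folder, so that the file names match the naming used in the tutorial. To rename files in bulk on Ubuntu:

      cd into the folder of the sub-type
      Assuming the class prefix to add is CLL-:
      sudo rename 's/^/CLL-/' *tif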
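
On systems without the Perl `rename` tool, the same prefixing can be done from Python. This is only a sketch: the function name `add_class_prefix` and the `*.tif` pattern are assumptions made for illustration, not part of the tutorial code.

```python
from pathlib import Path


def add_class_prefix(folder, prefix, pattern="*.tif"):
    """Prepend `prefix` to every file in `folder` matching `pattern`.

    Files that already start with the prefix are skipped, so running the
    function twice does not produce names like CLL-CLL-xxx.tif.
    Returns the new file names in sorted order.
    """
    renamed = []
    # sorted() materialises the listing before any file is renamed
    for path in sorted(Path(folder).glob(pattern)):
        if path.name.startswith(prefix):
            continue  # already prefixed
        target = path.with_name(prefix + path.name)
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

For example, `add_class_prefix("./case7_lymphoma_classification/CLL", "CLL-")` would rename the images of the CLL class in place.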

    The corrected code, with brief explanatory comments:

        clc
        clear all
        % output directory for the patches
        outdir='./subs/'; %output directory for all of the sub files
        mkdir(outdir)

        % stride used when extracting patches
        step_size=32;
        % patch size; as the tutorial author notes, Caffe crops these to 32 x 32 at input time
        patch_size=36; %size of the patches we would like to extract, bigger since Caffe will randomly crop 32 x 32 patches from them
        % extract patches class by class
        classes={'CLL','FL','MCL'};
        class_struct={};
        for classi=1:length(classes)
            % list all images in the folder of this class
            files=dir([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'*.tif']);
            % build the list of unique patient ids. The tutorial author organises the data per patient; this dataset has no
            % patient information, but the same structure is kept for consistency, so every image acts as its own "patient"
            % arrayfun applies a function to each array element; x{1}{1} takes the part of the file name before the first '.'
            patients=unique(arrayfun(@(x) x{1}{1},arrayfun(@(x) strsplit(x.name,'.'),files,'UniformOutput',0),'UniformOutput',0)); %this creates a list of patient id numbers
            patient_struct=[];
            % parfor runs the iterations in parallel
            parfor ci=1:length(patients) % for each of the *patients* we extract patches
                % base holds the name
                patient_struct(ci).base=patients{ci}; %we keep track of the base filename so that we can split it into sets later. a "base" is for example 12750 in 12750_500_f00003_original.tif
                % sub_file holds the patch file names for this patient (large image)
                patient_struct(ci).sub_file=[]; %this will hold all of the patches we extract which will later be written
                % find the large image(s) of this patient
                files=dir(sprintf([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'%s*.tif'],patients{ci})); %get a list of all of the image files associated with this particular patient

                for fi=1:length(files) %for each of the files; as noted above there are no duplicates here, so each patient has exactly one large image
                    disp([ci,length(patients),fi,length(files)])
                    fname=files(fi).name;
                    % store the name of each large image of this patient
                    patient_struct(ci).sub_file(fi).base=fname; %each individual image name gets saved as well

                    io=imread([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),fname]); %read the image

                    [nrow,ncol,ndim]=size(io);
                    fnames_sub={};
                    i=1;
                    % extract patches, i.e. sub-blocks of the image matrix; for these images the count is
                    % (floor((1388-36)/32)+1)*(floor((1040-36)/32)+1)*2
                    for rr=1:step_size:nrow-patch_size
                        for cc=1:step_size:ncol-patch_size
                            for rot=1:2  % rotation by 0 and 90 degrees, which doubles the dataset
                                try
                                    % alternative: loop over rr=1:step_size:nrow-patch_size+1 (same for cc) and use
                                    % subio=io(rr:rr+patch_size-1,cc:cc+patch_size-1,:);

                                    subio=io(rr+1:rr+patch_size,cc+1:cc+patch_size,:);
                                    subio=imrotate(subio,(rot-1)*90);
                                    % patches are named by their running index
                                    subfname=sprintf('%s_sub_%d.tif',fname(1:end-4),i);
                                    fnames_sub{end+1}=subfname;
                                    imwrite(subio,[outdir,subfname]);
                                    i=i+1;
                                catch err
                                    disp(err);
                                    continue
                                end
                            end
                        end
                    end

                    patient_struct(ci).sub_file(fi).fnames_subs=fnames_sub;
                end

            end
            class_struct{classi}=patient_struct;

        end

        save('class_struct.mat','class_struct') %save this just in case the computer crashes before the next step finishes
    

    Each image yields 2,752 patches.
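
The figure of 2,752 can be checked by counting how many values the two MATLAB ranges produce. The sketch below mirrors the loop bounds of step 1; the helper names are made up for this illustration.

```python
def matlab_range_len(start, step, stop):
    """Number of elements in the MATLAB range start:step:stop."""
    if stop < start:
        return 0
    return (stop - start) // step + 1


def patches_per_image(nrow, ncol, patch_size, step_size, variants):
    """Count the patches written for one image by the loops in step 1.

    `variants` is how many patches are saved per grid position
    (2 here: the 0-degree and the 90-degree rotation).
    """
    rows = matlab_range_len(1, step_size, nrow - patch_size)
    cols = matlab_range_len(1, step_size, ncol - patch_size)
    return rows * cols * variants


# A 1388 x 1040 image is 1040 rows by 1388 columns in MATLAB.
print(patches_per_image(1040, 1388, patch_size=36, step_size=32, variants=2))  # 2752
```

That is 32 row positions times 43 column positions times 2 rotations. With the 227×227 setting used later in this post (stride 227, four saved variants per position), the same function returns 96.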


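    • Generate the cross-validation sets. To find the best model, 5-fold cross-validation is used. Code: step2_make_training_lists.m.
      Each fold needs four txt files. Taking the first fold as an example:
      train_w32_parent_1.txt, test_w32_parent_1.txt: lists of the patient names in this fold's training and test sets
      train_w32_1.txt, test_w32_1.txt: lists of the patch file names in this fold, each with its class label
      The code is fairly straightforward; the main thing is to understand how 5-fold cross-validation works. The code, for the record: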

        load class_struct %load the patch lists saved by step 1
        % 5-fold cross-validation
        nfolds=5; %determine how many folds we want to use during cross validation
        fidtrain=[];
        fidtest=[];
        
        
        fidtrain_parent=[];
        fidtest_parent=[];
        % open file handles for all output files
        for zz=1:nfolds %open all of the file Ids for the training and testing files
            %each fold has 4 files created (as discussed in the tutorial)
            fidtrain(zz)=fopen(sprintf('train_w32_%d.txt',zz),'w');
            fidtest(zz)=fopen(sprintf('test_w32_%d.txt',zz),'w');
            
            fidtrain_parent(zz)=fopen(sprintf('train_w32_parent_%d.txt',zz),'w');
            fidtest_parent(zz)=fopen(sprintf('test_w32_parent_%d.txt',zz),'w');
        end
        
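        % write each patient ID to the *_parent_* txt files, and each patch file name with its class label (CLL: 0, FL: 1, MCL: 2) to the other two txt files
        % 5-fold cross-validation: four folds form the training set and the remaining fold is the test set, which gives 5 different train/test datasets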
        for classi=1:length(class_struct)
            
            patient_struct=class_struct{classi};
            
            npatients=length(patient_struct); %get the number of patients that we have
            indices=crossvalind('Kfold',npatients,nfolds); %use the matlab function to generate a k-fold set
            
            for fi=1:npatients %for each patient
                disp([fi,npatients]);
                for k=1:nfolds %for each fold
                    
                    if(indices(fi)==k) %if this patient is in the test set for this fold, set the file descriptor accordingly
                        fid=fidtest(k);
                        fid_parent=fidtest_parent(k);
                    else %otherwise its in the training set
                        fid=fidtrain(k);
                        fid_parent=fidtrain_parent(k);
                    end
                    
                    fprintf(fid_parent,'%s\n',patient_struct(fi).base); %print this patient's ID to the parent file
                    
                    subfiles=patient_struct(fi).sub_file; %get the patient's images
                    
                    for subfi=1:length(subfiles) %for each of the patient images
                        try
                            subfnames=subfiles(subfi).fnames_subs; %get the patch file names of this image
                            % NOTE: the original '%s\t%d' must be changed to '%s %d' (space as separator); otherwise the
                            % later format conversion fails with "could not open or find file..."
                            cellfun(@(x) fprintf(fid,'%s %d\n',x,classi-1),subfnames); %write "name label" lines, label = classi-1
                            
                        catch err
                            disp(err)
                            disp([patient_struct(fi).base,'  ',patient_struct(fi).sub_file(subfi).base]) %if there are any errors, display them, but continue
                            continue
                        end
                    end
                    
                end
            end
            
        end
        
        for zz=1:nfolds %now that we're done, make sure that we close all of the files
            fclose(fidtrain(zz));
            fclose(fidtest(zz));
            
            fclose(fidtrain_parent(zz));
            fclose(fidtest_parent(zz));
            
        end
      

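    This gives five datasets, one per fold. Each test set has 203,648 patches (74 images × 2,752) and each training set 825,600 (300 images × 2,752), so test:training ≈ 1:4.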
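
The key point of step 2 is that folds are assigned per image ("patient"), not per patch, so patches cut from one image never end up on both sides of a fold. A minimal sketch of that idea follows. `kfold_indices` only approximates what MATLAB's `crossvalind('Kfold', ...)` does, and both function names are assumptions made for this illustration.

```python
import random


def kfold_indices(n_items, n_folds, seed=0):
    """Assign each of n_items to a fold 1..n_folds with near-equal fold sizes.

    The exact assignment differs from MATLAB's because the random
    generators differ; only the balanced-fold idea is reproduced.
    """
    folds = [(i % n_folds) + 1 for i in range(n_items)]
    random.Random(seed).shuffle(folds)
    return folds


def make_fold_lists(patches_by_image, label, n_folds=5, seed=0):
    """Build the train and test patch lists of every fold for one class.

    patches_by_image maps an image name to its patch file names. All
    patches of an image go to the same side of each fold.
    """
    images = sorted(patches_by_image)
    folds = kfold_indices(len(images), n_folds, seed)
    lists = {k: {"train": [], "test": []} for k in range(1, n_folds + 1)}
    for image, fold in zip(images, folds):
        lines = [f"{name} {label}" for name in patches_by_image[image]]
        for k in lists:
            side = "test" if fold == k else "train"
            lists[k][side].extend(lines)
    return lists
```

Each image is in the test list of exactly one fold and in the training list of the other four, which is why the test and training sizes above come out at roughly 1:4.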


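    • Build the databases. Here Caffe's command-line tools generate the data in LevelDB format together with the matching mean files. An image layer is not used directly because the mean would still have to be computed in the required format, and the image layer is not designed for reading large amounts of data, so the Caffe command-line tools are more convenient.
      Code: step3_make_dbs.sh. Run it inside the subs folder so that the paths resolve correctly. Some paths and a few minor errors in the original script still need fixing: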

        #!/bin/bash
        
        filepath=$(cd "$(dirname "$0")"; pwd)
        
        for kfoldi in {1..5}
        do
        echo "doing fold $kfoldi"
        # Note: if your experiment directory is under the Caffe path, the ~ form below works; otherwise use an absolute path. Also, {{kfoldi}} inside the for loop of the original script should be changed to kfoldi.
        # Paths are relative to the subs folder: ./ is the image root, and the list files and databases live one level up.
        #~/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb   ./ ../train_w32_${kfoldi}.txt ../DB_train_${kfoldi} &
        #~/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb   ./ ../test_w32_${kfoldi}.txt ../DB_test_${kfoldi} &
        /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb   ./ ../train_w32_$kfoldi.txt ../DB_train_$kfoldi &
        /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb   ./ ../test_w32_$kfoldi.txt ../DB_test_$kfoldi &
        done
        
        
        
        
        FAIL=0
        for job in `jobs -p`
        do
            echo $job
            wait $job || let "FAIL+=1"
        done
        
        
        
        
        echo "number failed: $FAIL"
        
        cd ../
        
        for kfoldi in {1..5}
        do
        echo "doing fold $kfoldi"
        # same path change as above
        /home/mz/py-R-FCN/caffe/build/tools/compute_image_mean DB_train_$kfoldi DB_train_w32_$kfoldi.binaryproto -backend leveldb  &
        done
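
Since a wrong separator in the list files only shows up as "could not open or find file" during conversion, it can help to check a list file before building the database. The sketch below assumes the converter splits each line at the last space, which is consistent with the error described in step 2 but is not taken from the Caffe source; `check_list_file` is a hypothetical helper.

```python
import os


def check_list_file(list_path, root_folder):
    """Return a list of problems found in an image list file.

    Each line is expected to look like 'image_path label', with a single
    space before an integer label and the image present under root_folder.
    """
    problems = []
    with open(list_path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            parts = line.rsplit(" ", 1)  # split at the last space only
            if len(parts) != 2 or not parts[1].isdigit() or "\t" in line:
                problems.append(f"line {number}: expected 'name label', got {line!r}")
                continue
            if not os.path.isfile(os.path.join(root_folder, parts[0])):
                problems.append(f"line {number}: missing file {parts[0]}")
    return problems
```

An empty result for something like `check_list_file("train_w32_1.txt", "subs")` means every line parsed and every patch file was found.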
      

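    • Train the DL classifier

    Notes: the network used is called AlexNet, but it is a trimmed-down version of the official AlexNet architecture, with only three convolution-pooling pairs and two fully connected layers. It is in fact the network used for CIFAR-10 classification. The input images here are 32×32 (the network definition also crops the input to 32), and the task has 3 classes (AlexNet has 1,000), so the model does not need to be as complex as AlexNet. The depth and some parameters therefore have to change; AlexNet cannot be copied as-is. One thing worth testing, though: do pathology images have to be cut into small patches? Could larger patches work, with fewer pooling layers and more depth, and how would that perform?

    Files needed, from the common folder at the same level as 7-lymphoma: BASE-alexnet_solver_ada.prototxt; BASE-alexnet_traing_32w_db.prototxt and BASE-alexnet_traing_32w_dropout_db.prototxt (without and with dropout); deploy_train32.prototxt and deploy_train32_dropout.prototxt (test network definitions).
    Make five copies, one per model (5-fold cross-validation), named 1-alexnet_solver_ada.prototxt and so on.
    What to change:

    1. Check every file for $(kfoldi)d and replace it with the corresponding number 1-5. Change the output of the last ip layer in the prototxt files to 3.

    2. Fix the paths. The paths in the files (data, prototxt) assume that the model definitions are under Caffe's ./model directory and that the dataset databases and mean files are in the Caffe root directory. If that is not the case, replace them with absolute paths.

    3. Set the number of test iterations, test_iter in the solver file. It is computed as the number of test samples divided by the test batch_size. Here batch_size = 128, and the number of test samples can be obtained quickly with: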

       wc -l test_w32_1.txt
      

    Alternatively, open the file, jump to the last line, and read the line count shown at the bottom of the text editor.
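
The division in item 3 rarely comes out even, so it is safest to round up. A small sketch, with helper names chosen for this illustration:

```python
import math


def count_lines(path):
    """Count the lines of a list file (same as `wc -l` when the file ends in a newline)."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def iterations_needed(num_test_samples, batch_size):
    """Smallest number of iterations that covers every test sample once."""
    return math.ceil(num_test_samples / batch_size)


print(iterations_needed(203648, 128))  # 1591
```

With a test batch size of 50 (an assumption; it is the value that reproduces the numbers used later in this post), the same function gives 183 for 9,120 samples and 179 for 8,928.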

    Run the training:

    	/home/mz/py-R-FCN/caffe/build/tools/caffe train --solver=1-alexnet_solver_ada.prototxt
    

    Results after 600,000 iterations (accuracy, loss):

    Model   Without dropout                 With dropout
    5       0.841879, loss = 0.513672       0.826787, loss = 0.576846
    4       0.86142,  loss = 0.364765       0.85352,  loss = 0.500288
    3       0.840632, loss = 0.448586       0.814813, loss = 0.546735
    2       0.817167, loss = 0.466199       0.797229, loss = 0.557098
    1       0.85496,  loss = 0.435163       0.828828, loss = 0.577961
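
To summarise the five folds, the per-fold scores can be averaged. The sketch below assumes the first number in each result is the test accuracy of that fold; the values are copied from the results above.

```python
from statistics import mean, stdev

# Per-fold scores as reported above, for models 1 to 5.
no_dropout = [0.85496, 0.817167, 0.840632, 0.86142, 0.841879]
with_dropout = [0.828828, 0.797229, 0.814813, 0.85352, 0.826787]

for name, scores in [("no dropout", no_dropout), ("with dropout", with_dropout)]:
    print(f"{name}: mean={mean(scores):.4f} std={stdev(scores):.4f}")
```

The means come out at about 0.843 without dropout and 0.824 with dropout, and the model without dropout scores higher in every one of the five folds.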


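    Trying larger patches with different network architectures (AlexNet, VGG-16, GoogLeNet, ResNet)

    1. Data preparation.
      Now try larger patches, cut at 227×227. The following experiments no longer use cross-validation: all data is pooled into one dataset and split into training, validation, and test sets at a ratio of 2:1:1.
      Method: create a new folder for the experiment data and modify the code from the original step1 and step2 files, as follows:
      step1. Cut 227×227 patches from the original images and also flip each patch horizontally to enlarge the dataset. Each original image yields 96 patches.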
    clc
    clear all
    % output directory for the patches
    outdir='./subs_227/'; %output directory for all of the sub files
    mkdir(outdir)
    
    % stride used when extracting patches
    step_size=227;
    % patch size; equal to step_size, so the 227 x 227 patches do not overlap
    patch_size=227; %size of the patches we would like to extract
    
    % whether to also save a horizontally flipped copy of each patch
    flip = true;
    
    % extract patches class by class
    classes={'CLL','FL','MCL'};
    class_struct={};
    for classi=1:length(classes)
        % list all images in the folder of this class
        files=dir([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'*.tif']);
        % build the list of unique patient ids. The tutorial author organises the data per patient; this dataset has no
        % patient information, but the same structure is kept for consistency, so every image acts as its own "patient"
        % arrayfun applies a function to each array element; x{1}{1} takes the part of the file name before the first '.'
        patients=unique(arrayfun(@(x) x{1}{1},arrayfun(@(x) strsplit(x.name,'.'),files,'UniformOutput',0),'UniformOutput',0)); %this creates a list of patient id numbers
        patient_struct=[];
        % parfor runs the iterations in parallel
        parfor ci=1:length(patients) % for each of the *patients* we extract patches
            % base holds the name
            patient_struct(ci).base=patients{ci}; %we keep track of the base filename so that we can split it into sets later. a "base" is for example 12750 in 12750_500_f00003_original.tif
            % sub_file holds the patch file names for this patient (large image)
            patient_struct(ci).sub_file=[]; %this will hold all of the patches we extract which will later be written
            % find the large image(s) of this patient
            files=dir(sprintf([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'%s*.tif'],patients{ci})); %get a list of all of the image files associated with this particular patient
            
            for fi=1:length(files) %for each of the files; as noted above there are no duplicates here, so each patient has exactly one large image
                disp([ci,length(patients),fi,length(files)])
                fname=files(fi).name;
                % store the name of each large image of this patient
                patient_struct(ci).sub_file(fi).base=fname; %each individual image name gets saved as well
                
                io=imread([sprintf('./case7_lymphoma_classification/%s/', classes{classi}), fname]); %read the image
                    
                [nrow,ncol,ndim]=size(io);
                fnames_sub={};
                i=1;
                % extract patches, i.e. sub-blocks of the image matrix:
                % 4 x 6 grid positions x 2 rotations x 2 (original + flipped) = 96 per image
                for rr=1:step_size:nrow-patch_size
                    for cc=1:step_size:ncol-patch_size
                        for rot=1:2  % rotation by 0 and 90 degrees, which doubles the dataset
                            try
                                % alternative: loop over rr=1:step_size:nrow-patch_size+1 (same for cc) and use
                                % subio=io(rr:rr+patch_size-1,cc:cc+patch_size-1,:);
                                
                                subio=io(rr+1:rr+patch_size,cc+1:cc+patch_size,:);
                                subio=imrotate(subio,(rot-1)*90);
                                % patches are named by their running index
                                subfname=sprintf('%s_sub_%d.tif',fname(1:end-4),i);
                                fnames_sub{end+1}=subfname;
                                imwrite(subio,[outdir,subfname]);
                                i=i+1;
                                if flip
                                    % mirror the patch left to right
                                    subio_flip = subio(:,end:-1:1,1:3);
                                    % the flipped patch gets the next running index
                                    subfname=sprintf('%s_sub_%d.tif',fname(1:end-4),i);
                                    fnames_sub{end+1}=subfname;
                                    imwrite(subio_flip,[outdir,subfname]);
                                    i=i+1;
                                end
                            catch err
                                disp(err);
                                continue
                            end
                        end
                    end
                end
                
                
                patient_struct(ci).sub_file(fi).fnames_subs=fnames_sub;
            end
            
        end
        class_struct{classi}=patient_struct;
    
    end
    
    save('class_struct.mat','class_struct') %save this just in case the computer crashes before the next step finishes
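
The 96 patches per image are 4 × 6 grid positions times four variants of each patch: unrotated, its horizontal flip, rotated by 90 degrees, and the flip of the rotated patch. A sketch of those four variants on a tiny nested list (standing in for an image; the helper names are illustrative only):

```python
def rot90_ccw(patch):
    """Rotate a 2-D list 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*patch)][::-1]


def flip_horizontal(patch):
    """Mirror a 2-D list left to right."""
    return [row[::-1] for row in patch]


def augment(patch):
    """Return the four variants in the order the loop above writes them."""
    variants = []
    for rotated in (patch, rot90_ccw(patch)):
        variants.append(rotated)
        variants.append(flip_horizontal(rotated))
    return variants


for variant in augment([[1, 2], [3, 4]]):
    print(variant)
```

Because the patches are square, all four variants keep the 227×227 shape.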
    

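    step2. Generate the TXT files containing the image name lists for the training, validation, and test sets. Training set: 17,856; test set: 9,120; validation set: 8,928.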

    load class_struct %load the patch lists saved by step 1
    
    % open the output file handles
    
    fidtrain=fopen(sprintf('train_w227.txt'),'w');
    fidval=fopen(sprintf('val_w227.txt'),'w');
    fidtest=fopen(sprintf('test_w227.txt'),'w');
    fidtrain_parent  = fopen(sprintf('train_w227_parent.txt'),'w');
    fidval_parent  = fopen(sprintf('val_w227_parent.txt'),'w');
    fidtest_parent  = fopen(sprintf('test_w227_parent.txt'),'w');
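    % write each patch file name and its class label (CLL: 0, FL: 1, MCL: 2) to the training, validation, and test txt files
    
    % training, validation, and test sets at a ratio of 2:1:1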
    
    
    for classi=1:length(class_struct)
        
        patient_struct=class_struct{classi};
        
        npatients=length(patient_struct); %get the number of patients that we have
        % shuffle the order
        RandIndex = randperm(npatients);
        test_index = RandIndex(1:ceil(0.25*npatients));
        val_index = RandIndex(ceil(0.25*npatients)+1:ceil(0.5*npatients));
        train_index = RandIndex(ceil(0.5*npatients)+1:end);
            
        for fi=1:npatients %for each patient
            disp([fi,npatients]);
                
            if(ismember(fi, test_index)) %if this patient is in the test set, set the file descriptors accordingly
                fid=fidtest;
                fid_parent=fidtest_parent;
            elseif(ismember(fi, train_index)) %if it is in the training set
                fid=fidtrain;
                fid_parent=fidtrain_parent;
            else %otherwise it is in the validation set
                fid=fidval;
                fid_parent=fidval_parent;
            end
                
            fprintf(fid_parent,'%s\n',patient_struct(fi).base); %print this patient's ID to the parent file
    
            subfiles=patient_struct(fi).sub_file; %get the patient's images
    
            for subfi=1:length(subfiles) %for each of the patient images
                try
                    subfnames=subfiles(subfi).fnames_subs; %get the patch file names of this image
                    cellfun(@(x) fprintf(fid,'%s %d\n',x,classi-1),subfnames); %write "name label" lines, label = classi-1
    
                catch err
                    disp(err)
                    disp([patient_struct(fi).base,'  ',patient_struct(fi).sub_file(subfi).base]) %if there are any errors, display them, but continue
                    continue
                end
            end
    
        end
    end
        
    
     %now that we're done, make sure that we close all of the files
    fclose(fidtrain);
    fclose(fidtest);
    fclose(fidval);
    fclose(fidtrain_parent);
    fclose(fidtest_parent);
    fclose(fidval_parent);
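
The set sizes quoted for step 2 follow from the `ceil`-based cuts applied per class. The sketch below reproduces them, using 113, 139 and 122 images per class (139 for FL is the count that adds up to 374 images and matches the reported patch totals) and 96 patches per image. `split_indices` is a hypothetical stand-in for the `randperm` logic above.

```python
import math
import random


def split_indices(n, seed=0):
    """Shuffle 0..n-1 and cut it into train, val and test parts like step 2 does."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    n_test = math.ceil(0.25 * n)   # first quarter goes to the test set
    n_half = math.ceil(0.5 * n)    # up to the halfway point goes to validation
    return order[n_half:], order[n_test:n_half], order[:n_test]


patches_per_image = 96
class_sizes = {"CLL": 113, "FL": 139, "MCL": 122}
totals = {"train": 0, "val": 0, "test": 0}
for n_images in class_sizes.values():
    train, val, test = split_indices(n_images)
    totals["train"] += len(train) * patches_per_image
    totals["val"] += len(val) * patches_per_image
    totals["test"] += len(test) * patches_per_image
print(totals)  # {'train': 17856, 'val': 8928, 'test': 9120}
```

Because of the rounding, the split is 186 / 93 / 95 images rather than an exact 2:1:1.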
    
    

    step3. Generate the data in LMDB format and the corresponding mean file. Put the step3 script, modified as below, into subs_227:

    #!/bin/bash
    
    echo "create lmdb data"
    /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend lmdb   ./ ../train_w227.txt ../DB_train &
    /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend lmdb   ./ ../test_w227.txt ../DB_test &
    /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend lmdb   ./ ../val_w227.txt ../DB_val &
    
    
    
    
    
    FAIL=0
    for job in `jobs -p`
    do
        echo $job
        wait $job || let "FAIL+=1"
    done
    
    
    
    
    echo "number failed: $FAIL"
    
    cd ../
    
    
    echo "create mean binary"
    /home/mz/py-R-FCN/caffe/build/tools/compute_image_mean DB_train DB_train_w227.binaryproto -backend lmdb  &
    

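    Different models

    AlexNet

    1. Copy the bvlc_alexnet folder from caffe/models to get the AlexNet model definition prototxt and solver.prototxt. Adjust the relevant parameters and train.
      Parameters: iterations: 50000; test_iter=179; test_interval=200; fc8-output=3;
    2. Results
      val-accuracy: 0.927598; train-loss = 8.55613e-05;
      Testing still uses train_val.prototxt: save a copy named train_test.prototxt and change only the validation-set path to the test-set path. Then run: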
    sudo /home/mz/py-R-FCN/caffe/build/tools/caffe test -model=train_test.prototxt -weights=../models/caffe_alexnet_train_iter_50000.caffemodel -gpu 0 -iterations=183
    
    

    The -iterations value is computed as: number of test samples / batch_size
    test-accuracy: 0.856721 loss=1.03727

    GoogLeNet

    VGG

    ResNet

  • Original article: https://www.cnblogs.com/alanma/p/6991765.html