  • Sentiment Analysis with ML.NET [Beginner's Edition]: A Follow-up

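    After I finished "Sentiment Analysis with ML.NET [Beginner's Edition]", a helpful reader asked why the example did not use Chinese, since what people really need to know is how to preprocess a Chinese dataset. That is a fair point, so I adjusted the code slightly to serve as a demonstration.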

    First, we need a good word segmentation library, so use NuGet to add a reference to the JiebaNet.Analyser package, which is a build that supports .NET Core.

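    Next, find some Chinese text for the training and validation datasets and make sure it is correctly labeled; the more data, the better. Each line holds a sentence followed by a space and a 0/1 label, like this: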

    最差的是三文鱼生鱼片。 0
    我在这里吃了一顿非常可口的早餐。 1
    这是拉斯维加斯最好的食物之一。 1
    但即使是天才也无法挽救这一点。 0
    我认为汤姆汉克斯是一位出色的演员。 1
    ...

    Add a word segmentation function:

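    // Reads "sentence label" lines from source, segments each sentence with jieba,
    // and writes "segmented sentence<TAB>label" lines to result.
    public static void Segment(string source, string result)
    {
        var segmenter = new JiebaSegmenter();
        using (var reader = new StreamReader(source))
        using (var writer = new StreamWriter(result))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Skip blank lines instead of stopping at the first one.
                if (string.IsNullOrWhiteSpace(line)) continue;
                var parts = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
                // Expect exactly two fields: the sentence and its 0/1 label.
                if (parts.Length != 2) continue;
                var segments = segmenter.Cut(parts[0]);
                // Tab-separated output matches the "tab" separator used by TextLoader.
                writer.WriteLine("{0}\t{1}", string.Join(" ", segments), parts[1]);
            }
        }
    }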
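    One caveat of splitting on every space: a sentence that itself contains a space yields more than two parts and is silently skipped. The following is a minimal sketch of one way around that, assuming the label is always the last whitespace-separated token and is either "0" or "1" as in the sample data. `LabeledLine` and `TrySplit` are hypothetical names introduced here for illustration; they are not part of the original post, jieba, or ML.NET.

```csharp
using System;

// Hypothetical helper (not from the original post): splits a "sentence label"
// line at its LAST whitespace character, so sentences that contain spaces survive.
public static class LabeledLine
{
    public static bool TrySplit(string line, out string text, out string label)
    {
        text = null;
        label = null;
        if (string.IsNullOrWhiteSpace(line)) return false;

        var trimmed = line.Trim();
        // Index of the last space or tab; everything after it is the label.
        var cut = trimmed.LastIndexOfAny(new[] { ' ', '\t' });
        if (cut <= 0) return false; // no separator found, so there is no label

        var candidate = trimmed.Substring(cut + 1);
        // Assumption: labels are exactly "0" or "1", as in the sample data.
        if (candidate != "0" && candidate != "1") return false;

        text = trimmed.Substring(0, cut).TrimEnd();
        label = candidate;
        return true;
    }
}
```

    Inside the segmentation loop, a call to `LabeledLine.TrySplit(line, out var text, out var label)` could stand in for the `Split` call and the length check, with `text` passed to `segmenter.Cut`.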

    The original file paths need to be changed to:

    const string _dataPath = @".\data\sentiment labelled sentences\imdb_labelled.txt";
    const string _testDataPath = @".\data\sentiment labelled sentences\yelp_labelled.txt";
    const string _dataTrainPath = @".\data\sentiment labelled sentences\imdb_labelled_result.txt";
    const string _testTargetPath = @".\data\sentiment labelled sentences\yelp_labelled_result.txt";
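    Hard-coded path separators tie these constants to one operating system, which sits awkwardly with a package chosen for its .NET Core support. As a sketch only, the paths could be built with `Path.Combine` instead. The `DataPaths` class and its members are hypothetical names, and the folder layout is assumed from the file names used in this post.

```csharp
using System;
using System.IO;

// Hypothetical helper (not from the original post): builds dataset paths with
// Path.Combine so the separator matches the current operating system.
public static class DataPaths
{
    // Folder layout assumed from the paths used in the post.
    private static readonly string DataDir =
        Path.Combine(".", "data", "sentiment labelled sentences");

    // Path of a raw, unsegmented dataset file.
    public static string Raw(string fileName) => Path.Combine(DataDir, fileName);

    // "imdb_labelled.txt" -> "imdb_labelled_result.txt", matching the post's naming.
    public static string Segmented(string fileName) =>
        Path.Combine(
            DataDir,
            Path.GetFileNameWithoutExtension(fileName) + "_result" + Path.GetExtension(fileName));
}
```

    Because `Path.Combine` runs at run time, fields initialized this way would have to be declared `static readonly` rather than `const`.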

    Add the calls in the Main function:

    Segment(_dataPath, _dataTrainPath);
    Segment(_testDataPath, _testTargetPath);

    Change the data used for prediction to:

    IEnumerable<SentimentData> sentiments = new[]
    {
        new SentimentData
        {
            SentimentText = "今天的任务并不轻松",
            Sentiment = 0
        },
        new SentimentData
        {
            SentimentText = "我非常想见到你",
            Sentiment = 0
        },
        new SentimentData
        {
            SentimentText = "实在是我没有看清楚",
            Sentiment = 0
        }
    };

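    With everything in place, run the program to see the results (the original post showed a screenshot of the console output here).

    Not bad, right? :)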

    Not long ago the .NET Blog also published an article about ML.NET, "Introducing ML.NET: Cross-platform, Proven and Open Source Machine Learning Framework". Here is the part about the roadmap.

    The Road Ahead

    There are many capabilities we aspire to add to ML.NET, but we would love to understand what will best fit your needs. The current areas we are exploring are:

    • Additional ML Tasks and Scenarios
    • Deep Learning with TensorFlow & CNTK
    • ONNX support
    • Scale-out on Azure
    • Better GUI to simplify ML tasks
    • Integration with VS Tools for AI
    • Language Innovation for .NET

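    As you can see, with ONNX support, models from more machine learning frameworks such as TensorFlow, CNTK, and even PyTorch can be shared. Combined with the steadily growing set of supported scenarios, ML.NET will become more and more practical, and machine learning services already built in other languages can be integrated smoothly through .NET Core. Something to look forward to!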

    As usual, I will close with the project structure and the complete code.

    using System;
    using Microsoft.ML.Models;
    using Microsoft.ML.Runtime;
    using Microsoft.ML.Runtime.Api;
    using Microsoft.ML.Trainers;
    using Microsoft.ML.Transforms;
    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.ML;
    using JiebaNet.Segmenter;
    using System.IO;
    
    namespace SentimentAnalysis
    {
        class Program
        {
            const string _dataPath = @".\data\sentiment labelled sentences\imdb_labelled.txt";
            const string _testDataPath = @".\data\sentiment labelled sentences\yelp_labelled.txt";
            const string _dataTrainPath = @".\data\sentiment labelled sentences\imdb_labelled_result.txt";
            const string _testTargetPath = @".\data\sentiment labelled sentences\yelp_labelled_result.txt";
    
            public class SentimentData
            {
                [Column(ordinal: "0")]
                public string SentimentText;
                [Column(ordinal: "1", name: "Label")]
                public float Sentiment;
            }
    
            public class SentimentPrediction
            {
                [ColumnName("PredictedLabel")]
                public bool Sentiment;
            }
    
            public static PredictionModel<SentimentData, SentimentPrediction> Train()
            {
                var pipeline = new LearningPipeline();
                pipeline.Add(new TextLoader<SentimentData>(_dataTrainPath, useHeader: false, separator: "tab"));
                pipeline.Add(new TextFeaturizer("Features", "SentimentText"));
    
                var featureSelector = new FeatureSelectorByCount() { Column = new[] { "Features" } };
                pipeline.Add(featureSelector);
    
                pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });
    
                PredictionModel<SentimentData, SentimentPrediction> model = pipeline.Train<SentimentData, SentimentPrediction>();
                return model;
            }
    
            public static void Evaluate(PredictionModel<SentimentData, SentimentPrediction> model)
            {
                var testData = new TextLoader<SentimentData>(_testTargetPath, useHeader: false, separator: "tab");
                var evaluator = new BinaryClassificationEvaluator();
                BinaryClassificationMetrics metrics = evaluator.Evaluate(model, testData);
                Console.WriteLine();
                Console.WriteLine("PredictionModel quality metrics evaluation");
                Console.WriteLine("------------------------------------------");
                Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
                Console.WriteLine($"Auc: {metrics.Auc:P2}");
                Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
            }
    
            public static void Predict(PredictionModel<SentimentData, SentimentPrediction> model)
            {
                IEnumerable<SentimentData> sentiments = new[]
                {
                    new SentimentData
                    {
                        SentimentText = "今天的任务并不轻松",
                        Sentiment = 0
                    },
                    new SentimentData
                    {
                        SentimentText = "我非常想见到你",
                        Sentiment = 0
                    },
                    new SentimentData
                    {
                        SentimentText = "实在是我没有看清楚",
                        Sentiment = 0
                    }
                };
    
                var segmenter = new JiebaSegmenter();
                foreach (var item in sentiments)
                {
                    item.SentimentText = string.Join(" ", segmenter.Cut(item.SentimentText));
                }
    
    
                IEnumerable<SentimentPrediction> predictions = model.Predict(sentiments);
                Console.WriteLine();
                Console.WriteLine("Sentiment Predictions");
                Console.WriteLine("---------------------");
    
                var sentimentsAndPredictions = sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));
                foreach (var item in sentimentsAndPredictions)
                {
                    Console.WriteLine($"Sentiment: {item.sentiment.SentimentText.Replace(" ", string.Empty)} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")}");
                }
                Console.WriteLine();
            }
    
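            // Reads "sentence label" lines from source, segments each sentence with jieba,
            // and writes "segmented sentence<TAB>label" lines to result.
            public static void Segment(string source, string result)
            {
                var segmenter = new JiebaSegmenter();
                using (var reader = new StreamReader(source))
                using (var writer = new StreamWriter(result))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        // Skip blank lines instead of stopping at the first one.
                        if (string.IsNullOrWhiteSpace(line)) continue;
                        var parts = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
                        // Expect exactly two fields: the sentence and its 0/1 label.
                        if (parts.Length != 2) continue;
                        var segments = segmenter.Cut(parts[0]);
                        // Tab-separated output matches the "tab" separator used by TextLoader.
                        writer.WriteLine("{0}\t{1}", string.Join(" ", segments), parts[1]);
                    }
                }
            }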
    
            static void Main(string[] args)
            {
                Segment(_dataPath, _dataTrainPath);
                Segment(_testDataPath, _testTargetPath);
                var model = Train();
                Evaluate(model);
                Predict(model);
            }
        }
    }

  • Original post: https://www.cnblogs.com/BeanHsiang/p/9029120.html