  • CMUSphinx Learn

    Before you start

    Before you start developing a speech application, you need to consider several important points. They will determine how you implement the application.

    Algorithms

    Speech technology places several important limits on how an application can be implemented. For example, as noted above, it is impossible to recognize every known word of the language. You need to consider ways to overcome such limitations. Such ways are known for most types of applications and are described later in the tutorial. To follow them, you sometimes need to rethink how your application will behave and interact with the user.

    Although we try to provide important examples, we obviously can't cover everything. There is no utterance verification or speaker identification example yet, though they could be created later. Most algorithms are widely covered in the scientific literature, and some of them are explained later in this section of the tutorial. Moreover, new methods for solving old problems appear every year.

    Here are several common applications and ways to approach them:

    Generic dictation is never really generic. You need to identify the domain you will recognize, which can be dialogs, readings, meetings, voicemails, or legal or medical transcriptions. If you consider voicemails, note that the language there is far more restricted than general language. It is actually a very small vocabulary with specialized sequences of terms:

    • It's Sandy. Let's meet tomorrow
    • Hi. That's Joe, I'm going to sell you that car

    There will be a lot of names, and that is a problem, but you will never find a voicemail about quantum physics, and that is a very good thing. The recognizer will use the restrictions you provide through the language model to improve the accuracy of the result.

    You'll have to build a language model for your domain, but that's not as complicated as you might think. Don't worry about vocabulary size either: if you cover the 60k most common English words, the accuracy will be the same as with 120k words. For other languages with rich morphology the situation is different, but it is also solvable with morphology-based subwords. You also have to build a post-processing system, an adaptation system and a user-identification system.
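
    One way to judge how much vocabulary a domain really needs is to measure how many running words of a sample text are covered by its N most frequent words. The following is a minimal sketch in plain Python; the function name `coverage` and the toy corpus are made up for illustration and are not part of CMUSphinx.

```python
from collections import Counter


def coverage(tokens, vocab_size):
    """Fraction of running words covered by the `vocab_size` most frequent words."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    covered = sum(n for _, n in counts.most_common(vocab_size))
    return covered / len(tokens)


# Toy voicemail-like corpus (invented for this example).
corpus = "hi it's sandy let's meet tomorrow hi that's joe let's meet today".split()

for size in (2, 4, 8):
    print(size, round(coverage(corpus, size), 2))
```

    On a real domain corpus, the point where the coverage curve flattens gives a rough hint of how large the vocabulary needs to be.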

    For recognition on an embedded processor, there are two options to consider: recognition on the server and recognition on the device. The former is more popular nowadays because it lets you use the power and flexibility of cloud computing.

    Language learning will require you to build a framework for tracking incorrect pronunciations. That includes generating incorrect pronunciations and scoring them.

    For command and control, finite state grammars were popular for a long time. Unfortunately, we can no longer recommend them. It is much better to employ a medium-vocabulary recognizer with a semantic analysis framework on top of it, which improves the user experience and lets users speak more or less natural language. There is no point in starting with a finite state grammar today. For more details on the semantic architecture, look at the Olympus project. Dialog systems will require a user feedback framework as well.
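
    The idea of a semantic layer on top of the recognizer can be sketched as follows: the recognizer returns free-form text, and a small matching step maps it to a command, so different phrasings reach the same action. The intents, keywords and the function `interpret` are hypothetical, and the plural stripping is deliberately crude; a real semantic framework is far richer than this.

```python
# Hypothetical intents: each one lists keywords that must all be present.
INTENTS = {
    "lights_on": {"light", "on"},
    "lights_off": {"light", "off"},
    "play_music": {"play", "music"},
}


def interpret(hypothesis):
    """Return the first intent whose keywords all occur in the recognized text, else None."""
    # Crude normalization: lowercase and strip a trailing "s" ("lights" -> "light").
    words = {w.rstrip("s") for w in hypothesis.lower().split()}
    for intent, keywords in INTENTS.items():
        if keywords <= words:
            return intent
    return None


print(interpret("please turn the lights on"))
print(interpret("could you switch off the light"))
```

    A finite state grammar would have to list each of these phrasings explicitly, while the keyword step accepts any wording that contains the required terms.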

    Voice search, semantic analysis and translation need to be built on top of the lattices generated by the engine. You need to take lattices with confidence scores and feed them into the upper levels, such as a translation engine.

    For open vocabulary recognition, such as recognition of names and places, you will need a subword language model.

    Text alignment, such as caption synchronization, will require you to build a specialized language model from the reference text to restrict the search.
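
    To illustrate what a language model built from the reference text means, the sketch below collects the word pairs (bigrams) of a caption text. A search restricted to these pairs can only follow word sequences that occur in the reference. This is plain Python for illustration only, not the CMUSphinx aligner; both function names are hypothetical.

```python
from collections import defaultdict


def bigram_successors(reference_text):
    """Map each word to the set of words that may follow it in the reference."""
    words = ["<s>"] + reference_text.lower().split() + ["</s>"]
    successors = defaultdict(set)
    for prev, nxt in zip(words, words[1:]):
        successors[prev].add(nxt)
    return dict(successors)


def is_allowed(successors, sentence):
    """Check whether every word transition in `sentence` occurs in the reference."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    return all(nxt in successors.get(prev, ()) for prev, nxt in zip(words, words[1:]))


model = bigram_successors("let's meet tomorrow")
print(is_allowed(model, "let's meet tomorrow"))
print(is_allowed(model, "meet let's tomorrow"))
```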

    Existing accuracy figures

    For most of the tasks above there are published accuracy results. You can find them once you identify the task. Those results may or may not be accurate enough for your users. You might expect to beat the published figures, but that is unlikely to happen quickly.

    For example, the broadcast news recognition task is done with a 20-25% error rate. If that is not good enough for your application, you probably need to consider modifying the application. You might add a hand-correction step or a preliminary adaptation step to improve accuracy. If accuracy is still not sufficient after that, it is probably better to ask whether you need speech at all. There are other, more reliable interfaces you could use.
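
    Published results for recognition tasks are commonly reported as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch for checking your own test set; the function name is illustrative and the code is not taken from CMUSphinx.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word missing in hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("i am going to sell you that car",
                      "i am going to sell that cart"))
```

    Because insertions count as errors, WER can exceed 100% on very poor output.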

    For example, though ASR-based IVR systems are fancy and handy, many people still prefer DTMF systems, web-based forms or just email to contact a company. Remember that you need an effective interface, not a fancy one.

    Resources

    The next issue you need to consider is the availability of speech material for training, testing and optimizing the system. You need to find out which resources are available to you.

    The test set is a critical issue for any speech recognition application. It should be representative enough both acoustically and in terms of language. But the test set doesn't necessarily have to be large; you can spend 10 minutes creating a good one. It might be a set of sample recordings you make yourself.

    For the training set and models, you should check the resources that already exist. The growing interest in speech technology leads people to contribute models for their native languages. In general, you'll have to collect audio material for the specific language. That is actually not so complicated. Audio books, movies and podcasts provide enough recordings to build a very good acoustic model with little effort.

    To build a phonetic dictionary you can use an existing TTS synthesizer; these now cover a lot of languages. You can also bootstrap a dictionary by hand and then extend it with machine learning tools.
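
    Bootstrapping a dictionary by hand can start from a small table of words and phone sequences. The sketch below writes such a table in the plain-text layout used by CMU-style dictionaries: one word followed by its phones per line, with alternative pronunciations marked as word(2), word(3) and so on. The seed entries and the function name are illustrative assumptions; check the phone set of your acoustic model before reusing any symbols.

```python
def format_dictionary(pronunciations):
    """Render {word: [phone lists]} as dictionary lines, one pronunciation per line."""
    lines = []
    for word in sorted(pronunciations):
        for index, phones in enumerate(pronunciations[word], start=1):
            # Alternative pronunciations get a numeric suffix: word(2), word(3), ...
            label = word if index == 1 else f"{word}({index})"
            lines.append(f"{label} {' '.join(phones)}")
    return lines


# Hand-made seed entries (phone symbols in the CMU dictionary style).
seed = {
    "hello": [["HH", "AH", "L", "OW"], ["HH", "EH", "L", "OW"]],
    "world": [["W", "ER", "L", "D"]],
}
print("\n".join(format_dictionary(seed)))
```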

    For language models you'll have to find a lot of text for your domain. It might be textbooks, already transcribed recordings or other sources such as website content crawled from the web.
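
    Collected text usually needs cleaning before it can be used for language model training. The sketch below lowercases the text, removes punctuation and puts one sentence per line between `<s>` and `</s>` markers, a common convention for language modeling toolkits. It assumes English text and a very simple sentence splitter; the function names are hypothetical, and the exact input format depends on the toolkit you use.

```python
import re


def normalize(line):
    """Lowercase, drop punctuation (keeping apostrophes), and collapse whitespace."""
    line = line.lower()
    line = re.sub(r"[^a-z0-9' ]+", " ", line)
    return " ".join(line.split())


def to_training_sentences(text):
    """Split raw text on sentence punctuation and wrap each sentence in <s> ... </s>."""
    sentences = re.split(r"[.!?]+", text)
    cleaned = (normalize(s) for s in sentences)
    return [f"<s> {s} </s>" for s in cleaned if s]


for sentence in to_training_sentences("Hi. That's Joe, I'm going to sell you that car!"):
    print(sentence)
```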

    Technologies

    The third thing to consider is the set of particular technologies you will build on. Although CMUSphinx tries to provide a more or less complete program suite for developing speech applications, you'll sometimes need to use other packages, programming languages or tools. You need to decide for yourself whether you are going to work with Java, C or any of the scripting languages CMUSphinx supports. The rule for choosing between sphinx4 and pocketsphinx is the following:

    • Need speed or portability → use pocketsphinx
    • Need flexibility and manageability → use sphinx4

    Although people often ask which is more accurate, sphinx4 or pocketsphinx, you shouldn't bother with this question at all. Accuracy is not the deciding argument here. Both sphinx4 and pocketsphinx provide acceptable accuracy, and even then it depends on many factors, not just the engine. The engine is just one part of a system that should include many more components. For a large vocabulary decoder, there must be a diarization framework, an adaptation framework and a postprocessing framework. They all need to cooperate somehow. The flexibility of sphinx4 allows you to build such a system quickly. It's easy to embed sphinx4 into a Flash server such as Red5 to provide web-based recognition, and it's easy to manage many sphinx4 instances doing large-scale decoding on a cluster.

    On the other hand, if your system needs to be efficient and reasonably accurate, if you are running on an embedded device, or if you are interested in using the recognizer from a more exotic language such as Erlang, pocketsphinx is your choice. It's very hard to integrate Java with languages that do not run on the JVM; pocketsphinx is much better here.

    The next thing to consider is the choice of development platform. If you are bound to a particular one, that's an easy question for you. If you can choose, we highly recommend GNU/Linux as a development platform. We can help you with Windows or Mac issues, but there are no guarantees; our main development platform is Linux. For many tasks you'll need to run complex scripts written in Perl or Python. On Windows that might be problematic.

    Got it? Let's start! The next section describes how to create a sample application with either sphinx4 or pocketsphinx. Choose the right one.

  • Original article: https://www.cnblogs.com/riskyer/p/3424103.html