  • MindSpore framework: an example of loading a text dataset

    Original source of the code:

    https://www.mindspore.cn/tutorial/training/zh-CN/r1.2/use/load_dataset_text.html

    =======================================================

    Full code:

    import os
    
    os.system("rm -f ./datasets/tokenizer.txt")
    
    if not os.path.exists('./datasets'):
        os.mkdir('./datasets')
    file_handle=open('./datasets/tokenizer.txt',mode='w')
    file_handle.write('Welcome to Beijing \n北京欢迎您! \n我喜欢English! \n')
    file_handle.close()
    
    
    
    
    
    import mindspore.dataset as ds
    import mindspore.dataset.text as text
    
    DATA_FILE = './datasets/tokenizer.txt'
    dataset = ds.TextFileDataset(DATA_FILE, shuffle=False)
    
    ds.config.set_seed(58)
    dataset = dataset.shuffle(buffer_size=3)
    for data in dataset.create_dict_iterator(output_numpy=True):
        print(text.to_str(data['text']))
    
    
    print('='*30)
    
    
    replace_op1 = text.RegexReplace("Beijing", "Shanghai")
    replace_op2 = text.RegexReplace("北京", "上海")
    dataset = dataset.map(operations=replace_op1)
    dataset = dataset.map(operations=replace_op2)
    for data in dataset.create_dict_iterator(output_numpy=True):###need to mark
        print(text.to_str(data['text']))
    
    
    print('='*30)
    
    
    tokenizer = text.WhitespaceTokenizer()
    
    dataset = dataset.map(operations=tokenizer)
    
    for data in dataset.create_dict_iterator(num_epochs=1,output_numpy=True):
        print(text.to_str(data['text']).tolist())
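The listing above replaces two words and then splits each line on whitespace. As a rough illustration of what those two stages do to the three sample lines, here is a minimal plain-Python sketch. It does not use MindSpore: `re.sub` and `str.split` are stand-ins chosen on the assumption that they behave like `text.RegexReplace` and `text.WhitespaceTokenizer` for this input, the helper name `replace_cities` is hypothetical, and the shuffle step is not modelled, so the row order may differ from a real run.

```python
import re

# The three lines written to tokenizer.txt in the listing above.
lines = 'Welcome to Beijing \n北京欢迎您! \n我喜欢English! \n'.splitlines()


def replace_cities(line):
    """Hypothetical stand-in for the two RegexReplace ops (assumed equivalent to re.sub)."""
    line = re.sub('Beijing', 'Shanghai', line)
    return re.sub('北京', '上海', line)


# Stage 1: regex replacement, applied to every line.
replaced = [replace_cities(line) for line in lines]

# Stage 2: whitespace tokenization (assumed equivalent to str.split with no arguments).
tokens = [line.split() for line in replaced]

for line, toks in zip(replaced, tokens):
    print(repr(line), '->', toks)
```

Lines without any whitespace inside them, such as the two lines containing Chinese text, come out as a single token.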

    Output:

    ============================================================================

    One thing to note: if

    dataset.create_dict_iterator(output_numpy=True)  is changed to

    dataset.create_dict_iterator()


    the script raises an error.

    Modified code:
    import os
    
    os.system("rm -f ./datasets/tokenizer.txt")
    
    if not os.path.exists('./datasets'):
        os.mkdir('./datasets')
    file_handle=open('./datasets/tokenizer.txt',mode='w')
    file_handle.write('Welcome to Beijing \n北京欢迎您! \n我喜欢English! \n')
    file_handle.close()
    
    
    
    
    
    import mindspore.dataset as ds
    import mindspore.dataset.text as text
    
    DATA_FILE = './datasets/tokenizer.txt'
    dataset = ds.TextFileDataset(DATA_FILE, shuffle=False)
    
    ds.config.set_seed(58)
    dataset = dataset.shuffle(buffer_size=3)
    for data in dataset.create_dict_iterator(output_numpy=True):
        print(text.to_str(data['text']))
    
    
    print('='*30)
    
    
    replace_op1 = text.RegexReplace("Beijing", "Shanghai")
    replace_op2 = text.RegexReplace("北京", "上海")
    dataset = dataset.map(operations=replace_op1)
    dataset = dataset.map(operations=replace_op2)
    for data in dataset.create_dict_iterator():###need to mark
        print(text.to_str(data['text']))
    
    
    print('='*30)
    
    
    tokenizer = text.WhitespaceTokenizer()
    
    dataset = dataset.map(operations=tokenizer)
    
    for data in dataset.create_dict_iterator(num_epochs=1,output_numpy=True):
        print(text.to_str(data['text']).tolist())
    
    
    

    Error output:

    WARNING: 'ControlDepend' is deprecated from version 1.1 and will be removed in a future version, use 'Depend' instead.
    [WARNING] ME(5047:140238748528768,MainProcess):2021-07-11-02:12:43.597.916 [mindspore/ops/operations/array_ops.py:2302] WARN_DEPRECATED: The usage of Pack is deprecated. Please use Stack.
    我喜欢English!
    Welcome to Beijing
    北京欢迎您!
    ==============================
    Traceback (most recent call last):
      File "/tmp/pycharm_project_753/x.py", line 34, in <module>
        for data in dataset.create_dict_iterator():###need to mark
      File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/dataset/engine/iterators.py", line 125, in __next__
        data = self._get_next()
      File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/dataset/engine/iterators.py", line 169, in _get_next
        return {k: self._transform_tensor(t) for k, t in self._iterator.GetNextAsMap().items()}
      File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/dataset/engine/iterators.py", line 169, in <dictcomp>
        return {k: self._transform_tensor(t) for k, t in self._iterator.GetNextAsMap().items()}
      File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/dataset/engine/iterators.py", line 84, in <lambda>
        self._transform_tensor = lambda t: Tensor(t.as_array())
      File "/usr/local/python-3.7.5/lib/python3.7/site-packages/mindspore/common/tensor.py", line 74, in __init__
        raise TypeError(f"For Tensor, the input_data is a numpy array, "
    TypeError: For Tensor, the input_data is a numpy array, but it's data type is not in supported list: ['int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'float16', 'float32', 'float64', 'bool_'].

    Process finished with exit code 1



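    The cause is as follows. With output_numpy=True the iterator returns NumPy arrays.
    The data inside the pipeline is already held as NumPy arrays, so no type conversion is performed on output.

    With output_numpy=False (the default) the iterator returns Tensor objects.
    Since the data inside the pipeline is held as NumPy arrays, each array has to be converted to a Tensor on output (the Tensor(t.as_array()) call in the traceback).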
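To make the output_numpy=True case concrete, the sketch below imitates what the working loops do with each row, using NumPy only. It assumes that a string column arrives as a NumPy array of UTF-8 bytes and that `text.to_str` amounts to decoding that array; both points are assumptions about MindSpore's behavior, and `to_str_like` is a hypothetical stand-in, not the library function.

```python
import numpy as np

# Assumption: one tokenized row as the iterator might return it with
# output_numpy=True, i.e. a NumPy array of UTF-8 encoded bytes.
row = {'text': np.array([s.encode('utf-8') for s in ['Welcome', 'to', 'Shanghai']])}


def to_str_like(array, encoding='utf8'):
    """Hypothetical stand-in for text.to_str: decode a bytes array into a str array."""
    return np.char.decode(array, encoding)


decoded = to_str_like(row['text'])
print(row['text'].dtype.kind)  # 'S' marks a bytes array
print(decoded.dtype.kind)      # 'U' marks a unicode string array
print(decoded.tolist())
```

No Tensor is constructed anywhere on this path, which is why the string data passes through without complaint.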

    The error message

    TypeError: For Tensor, the input_data is a numpy array, but it's data type is not in supported list: ['int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'float16', 'float32', 'float64', 'bool_'].

    tells us that when this conversion is needed, the NumPy array must have one of the following data types:
    ['int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'float16', 'float32', 'float64', 'bool_'].
    or a type that can be converted to one of them. The data that dataset.create_dict_iterator produces here is string (str) data, which cannot be converted, hence the error.
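The check described above can be reproduced outside MindSpore. The following is a minimal sketch using NumPy only: the list of type names is copied from the error message, while the helper `is_tensor_compatible` is hypothetical and only mirrors the idea of the check, not MindSpore's actual implementation.

```python
import numpy as np

# Type names copied from the TypeError message above.
SUPPORTED_NAMES = ['int8', 'int16', 'int32', 'int64',
                   'uint8', 'uint16', 'uint32', 'uint64',
                   'float16', 'float32', 'float64', 'bool_']

# Resolve each name to a NumPy dtype object (np.bool_ becomes dtype('bool')).
SUPPORTED_DTYPES = {np.dtype(getattr(np, name)) for name in SUPPORTED_NAMES}


def is_tensor_compatible(array):
    """Hypothetical helper: True if the array's dtype is in the supported list."""
    return array.dtype in SUPPORTED_DTYPES


numbers = np.array([1, 2, 3], dtype=np.int32)
line = np.array('北京欢迎您!')                 # unicode string array, dtype kind 'U'
raw = np.array('北京欢迎您!'.encode('utf-8'))  # bytes array, dtype kind 'S'

for name, array in [('numbers', numbers), ('line', line), ('raw', raw)]:
    print(name, array.dtype, is_tensor_compatible(array))
```

Neither the bytes form nor the decoded string form of a text row has a dtype in the list, so for the version used here the text column has to be iterated with output_numpy=True, as in the first listing.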






    ===============================================================




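    That said, I have to complain that MindSpore's error messages take a lot of guesswork to interpret. Without some prior experience, a message like this one is hard to understand.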






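    This blog is a record of the author's personal study notes. Not everything is guaranteed to be original: some posts include the source address of reposted material, others are compiled from several online sources, and some attributions have inevitably been missed. If anything infringes your rights, please contact the author.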
  • Original article: https://www.cnblogs.com/devilmaycry812839668/p/14995582.html