  • pytorch bug: for step, data in enumerate(loader) + Connection reset by peer

    The program was running on a single GPU inside a Docker container. After a few hundred iterations it suddenly crashed,

    stopping at for step, data in enumerate(loader). Part of the traceback follows:

    Traceback (most recent call last):
    ........
      File ".../torch/utils/data/dataloader.py", line 206, in __next__
        idx, batch = self.data_queue.get()
      File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
        return recv()
      File ".../torch/multiprocessing/queue.py", line 22, in recv
        return pickle.loads(buf)
      File "/usr/lib/python2.7/pickle.py", line 1388, in loads
        return Unpickler(file).load()
      File "/usr/lib/python2.7/pickle.py", line 864, in load
        dispatch[key](self)
      File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
        value = func(*args)
      File ".../torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
        fd = multiprocessing.reduction.rebuild_handle(df)
      File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
        conn = Client(address, authkey=current_process().authkey)
      File "/usr/lib/python2.7/multiprocessing/connection.py", line 175, in Client
        answer_challenge(c, authkey)
      File "/usr/lib/python2.7/multiprocessing/connection.py", line 432, in answer_challenge
        message = connection.recv_bytes(256)         # reject large message
    IOError: [Errno 104] Connection reset by peer

    My first thought was that enumerate had hit dirty data, but on reflection that made no sense: the loop had already completed a full epoch.

    Searching for the error "Connection reset by peer" led me to https://github.com/pytorch/pytorch/issues/9127,

    which says older versions had a bug and suggests replacing torch/_six.py and torch/utils/data/dataloader.py with the newer versions.

    That looked like a lot of work, and following that lead took me completely off track, so I abandoned it.

    Looking into DataLoader's parameters, some suggested reducing batch_size, so I lowered it to 1,

    and set num_workers to 1 as well, but the error persisted.

    The signature of DataLoader is:

    DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
    num_workers=0, collate_fn=default_collate, pin_memory=False,
    drop_last=False)

    1.  dataset: the dataset to load from
    2.  batch_size: number of samples per batch
    3.  shuffle: whether to shuffle the data each epoch
    4.  sampler: strategy for drawing samples from the dataset
    5.  num_workers: number of subprocesses used for data loading; 0 means load in the main process (no multiprocessing)
    6.  collate_fn: how to merge individual samples into a batch; the default collation is usually sufficient
    7.  pin_memory: whether to keep loaded data in pinned (page-locked) memory, which speeds up transfers to the GPU
    8.  drop_last: if the dataset size is not an integer multiple of batch_size, True drops the final incomplete batch
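    The parameters above can be exercised with a minimal sketch. The dataset here is a made-up TensorDataset, not the post's actual data; it only illustrates how batch_size, num_workers, and drop_last interact.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 10 samples with 3 features each.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# num_workers=0 loads data in the main process (no worker subprocesses),
# which avoids the multiprocessing pipe that raised the error above.
# drop_last=False keeps the final, smaller batch.
loader = DataLoader(dataset, batch_size=4, shuffle=False,
                    num_workers=0, drop_last=False)

batches = list(loader)
```

    With 10 samples and batch_size=4, this yields 3 batches, the last of which holds only 2 samples; drop_last=True would discard it.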

    So I changed num_workers back to its default of 0, dropping multiprocess loading, and the program ran. I was overjoyed and grateful beyond words.
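    The workaround in loop form, again with a stand-in dataset (the post's own dataset and model are not shown):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the real training data: 8 samples, 2 features.
dataset = TensorDataset(torch.randn(8, 2), torch.zeros(8))

# The fix from the post: num_workers=0 keeps all loading in the main
# process, bypassing the worker-to-main-process connection that was reset.
loader = DataLoader(dataset, batch_size=2, num_workers=0)

steps = 0
for step, data in enumerate(loader):
    inputs, targets = data  # each batch is a (inputs, targets) pair
    steps += 1
```

    Note this trades away parallel loading: with num_workers=0 the GPU may wait on data preparation, so it is a workaround rather than a root-cause fix.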

  • Original post: https://www.cnblogs.com/walktosee/p/10615315.html