Magic combination for GPUs, TF, PyTorch, and Horovod?

Good afternoon,

I’ve been trying to build an environment for NVIDIA GPUs with TensorFlow, cudatoolkit, mpi4py, PyTorch, and Horovod. I’ve tried countless combinations of Python versions and package versions, sometimes using pip, and I can’t find an Anaconda combination that works (i.e. TF and PyTorch both use the GPUs, and Horovod installs and detects mpi4py for distributed training). When I tried pip, Horovod would never build.

I’d rather use a pure Conda installation - does anyone have any recommendations?

Thanks!

Jeff

Hello,

One reason you might be having problems is if you are mixing packages installed via ‘conda’ with packages installed via ‘pip’ in the same conda environment. That combination often leads to conflicting or broken dependencies.
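
To see whether an environment already mixes the two, one option is to look at the channel that conda reports for each package. The snippet below is only a sketch: it assumes the JSON printed by "conda list --json" has a "channel" field that reads "pypi" for pip-installed packages, and the sample data is made up for illustration.

```python
import json


def pip_installed_packages(conda_list_json):
    """Return the names of packages whose channel is reported as 'pypi'."""
    packages = json.loads(conda_list_json)
    return sorted(p["name"] for p in packages if p.get("channel") == "pypi")


# Made-up sample in the shape of `conda list --json` output (not real data).
sample = json.dumps([
    {"name": "mpi4py", "version": "3.0.3", "channel": "conda-forge"},
    {"name": "horovod", "version": "0.19.5", "channel": "pypi"},
])

print(pip_installed_packages(sample))
```

In a real environment you would save the output of "conda list --json" to a file and pass its contents to the function instead of the sample.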

Did you get a chance to try the following suggestion to build a conda environment suitable for Horovod?

https://horovod.readthedocs.io/en/stable/conda_include.html
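
Once an environment is built, a quick way to check each piece is to import it and ask it what it can see. This is a rough sketch rather than an official test. It assumes TensorFlow 2.x (for tf.config.list_physical_devices) and a Horovod build with PyTorch support; the probe and report helpers are names made up for this example.

```python
import importlib


def probe(module_name, check):
    """Import module_name and run check(module); return a text report instead of raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or e.g. a missing shared library
        return f"{module_name}: not importable ({exc})"
    try:
        return f"{module_name}: {check(module)}"
    except Exception as exc:  # e.g. CUDA libraries missing at runtime
        return f"{module_name}: check failed ({exc})"


def report():
    """Collect one line per framework describing what it can see."""
    return [
        probe("tensorflow", lambda tf: f"GPUs = {tf.config.list_physical_devices('GPU')}"),
        probe("torch", lambda torch: f"CUDA available = {torch.cuda.is_available()}"),
        probe("mpi4py.MPI", lambda MPI: MPI.Get_library_version().splitlines()[0]),
        probe("horovod.torch",
              lambda hvd: f"MPI built = {hvd.mpi_built()}, NCCL built = {hvd.nccl_built()}"),
    ]


if __name__ == "__main__":
    print("\n".join(report()))
```

Run it with the environment's own python so that a missing or CPU-only package shows up as a single line rather than a traceback.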

Sorry about my slow reply. Using the instructions you mentioned, I finally got Horovod to build. Previously it would not, but I tried a different system this time.

It took over 24 hours to build the virtual environment - I have no idea why.

A problem I’m still having is that I cannot get Horovod to build for TensorFlow. After building the virtual env with the conda command, TensorFlow doesn’t show up as available in the list.

$ horovodrun --check-build
Horovod v0.19.5:

Available Frameworks:
TensorFlow
PyTorch
MXNet

Available Controllers:
MPI
Gloo

Available Tensor Operations:
NCCL
DDL
CCL
MPI
Gloo

I activated the virtual env and tried building Horovod with TensorFlow by hand, and I get an error. The command is:

HOROVOD_WITH_TENSORFLOW=1 pip install --verbose horovod==0.19

...
  error: command '/home/laytonjb/PROJECTS/HOROVOD/env/bin/x86_64-conda-linux-gnu-cc' failed with exit code 1
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /home/laytonjb/PROJECTS/HOROVOD/env/bin/python3.8 -u -c '
  exec(compile('"'"''"'"''"'"'
  # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
  #
  # - It imports setuptools before invoking setup.py, to enable projects that directly
  #   import from `distutils.core` to work with newer packaging standards.
  # - It provides a clear error message when setuptools is not installed.
  # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
  #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
  #     manifest_maker: standard file '"'"'-c'"'"' not found".
  # - It generates a shim setup.py, for handling setup.cfg-only projects.
  import os, sys, tokenize
  
  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)
  
  __file__ = %r
  sys.argv[0] = __file__
  
  if os.path.exists(__file__):
      filename = __file__
      with tokenize.open(__file__) as f:
          setup_py_code = f.read()
  else:
      filename = "<auto-generated setuptools caller>"
      setup_py_code = "from setuptools import setup; setup()"
  
  exec(compile(setup_py_code, filename, "exec"))
  '"'"''"'"''"'"' % ('"'"'/tmp/pip-install-tvo9e91u/horovod_cc068a47d33342269568cb05b9b0b15c/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' bdist_wheel -d /tmp/pip-wheel-fdiqvdk2
  cwd: /tmp/pip-install-tvo9e91u/horovod_cc068a47d33342269568cb05b9b0b15c/
error
  ERROR: Failed building wheel for horovod
  Running setup.py clean for horovod
  Running command python setup.py clean
  /home/laytonjb/PROJECTS/HOROVOD/env/lib/python3.8/site-packages/setuptools/_distutils/dist.py:265: UserWarning: Unknown distribution option: 'test_requires'
    warnings.warn(msg)
  running clean
  removing 'build/temp.linux-x86_64-cpython-38' (and everything under it)
  removing 'build/lib.linux-x86_64-cpython-38' (and everything under it)
  'build/bdist.linux-x86_64' does not exist -- can't clean it
  'build/scripts-3.8' does not exist -- can't clean it
  removing 'build'
Failed to build horovod
ERROR: Could not build wheels for horovod, which is required to install pyproject.toml-based projects
Exception information:
Traceback (most recent call last):
  File "/home/laytonjb/PROJECTS/HOROVOD/env/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 169, in exc_logging_wrapper
    status = run_func(*args)
  File "/home/laytonjb/PROJECTS/HOROVOD/env/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 248, in wrapper
    return func(self, options, args)
  File "/home/laytonjb/PROJECTS/HOROVOD/env/lib/python3.8/site-packages/pip/_internal/commands/install.py", line 426, in run
    raise InstallationError(
pip._internal.exceptions.InstallationError: Could not build wheels for horovod, which is required to install pyproject.toml-based projects
Remote version of pip: 23.1.2
Local version of pip:  23.1.2
Was pip installed by pip? False
Removed build tracker: '/tmp/pip-build-tracker-mvjlngg8'

My attempts at googling for a solution haven’t revealed anything.

Any thoughts?

Thanks!

Hello,

I can only guess here what the problem is based on having some experience with setuptools and the error returned.

Unfortunately, I can’t try to reproduce the problem or try the install myself, because I don’t have a CUDA-capable Linux system available for testing.

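Take a look at the complete python command that is being executed above and see what script or python code is being executed. The script shown there is pip’s own setuptools caller, and it already contains the line:

import setuptools

so that part is probably fine. The first error line says that the conda compiler (x86_64-conda-linux-gnu-cc) failed with exit code 1, so the actual compiler message should be in the verbose output above the part you pasted.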
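
Whatever the cause turns out to be, the compiler’s own message is usually somewhere above the final "error: command ... failed" line in the verbose output. Here is a minimal sketch for pulling such lines out of a saved log. It assumes GCC-style diagnostics of the form "file:line:col: error: message", and the sample log is made up for illustration, not taken from your build.

```python
import re

# GCC/Clang-style diagnostics look like "path/file.cc:12:10: error: message".
DIAGNOSTIC = re.compile(r"^\s*\S+:\d+(?::\d+)?:\s+(?:fatal )?error:")


def compiler_errors(log_text):
    """Return the lines of log_text that look like compiler error diagnostics."""
    return [line.strip() for line in log_text.splitlines() if DIAGNOSTIC.match(line)]


# Made-up sample log (hypothetical file names and messages).
sample_log = """\
  compiling some/file.cc
  some/file.cc:3:10: fatal error: some_header.h: No such file or directory
  error: command '/path/to/cc' failed with exit code 1
"""

for line in compiler_errors(sample_log):
    print(line)
```

To use it on a real build, redirect the output of the pip command to a file and pass the file’s contents to compiler_errors.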

Good luck with it! Please update this thread if you get further with this task.