M-CURL Server Setup Tutorial

This article uses the M-CURL reinforcement learning algorithm as an example to show how to set up a machine learning environment on a cloud server in order to accelerate training and improve experiment efficiency. It is part of the IPP research project: Knowledge Transfer for Reinforcement Learning based Robot Motion Planning.

Prerequisite: matpool server and Linux setup

Quick start: https://matpool.com/supports/doc-quick-start/

Quick start for team work: https://matpool.com/supports/doc-use-team-on-matpool/

Cloud disk guide: https://matpool.com/supports/doc-use-matbox-on-matpool/

FAQ: https://matpool.com/supports/reference/faqs/

VS Code connection: https://matpool.com/supports/doc-vscode-connect-matpool/

PyCharm connection: https://matpool.com/supports/doc-pycharm-connect-matpool/

VNC connection: https://matpool.com/supports/doc-vnc-connect-matpool/

Note: A VS Code connection only supports terminal operations and command-line tools, while VNC provides a graphical desktop and therefore supports screen display, an important feature that the gym environment will use.

Linux commands: https://matpool.com/supports/reference/common-cmds/

The Linux system on the server does not come with a text editor, so I suggest installing nano first:

sudo apt-get install nano

It will play an important role in the environment setup steps that follow.

Conda Environment

Create mcurl Environment

We first create an independent environment for our M-CURL project, where we will install all the dependencies without interfering with other environments. We can use the .yml file provided in the M-CURL repository to do that.

conda env create -f environment.yml

conda then downloads the dependencies listed in the .yml file automatically. In most cases, however, the libraries that are installed via pip will fail because of version conflicts. Packages such as PyTorch should still install successfully; take note of the PyTorch version, because you will need it later when setting up CUDA.

After that we can use

conda activate environment_name

to enter the conda environment. Also remember to change the Python interpreter setting in VS Code to the conda environment we’ve just created.

Install mujoco

You can refer to https://zhuanlan.zhihu.com/p/352304615 for more detailed instruction (in Chinese).

MuJoCo is the physics simulation platform on which M-CURL runs.

First, on your local computer, go to https://www.roboti.us/index.html to download the MuJoCo package and get the related license key from https://www.roboti.us/license.html. Gather all the MuJoCo files (the package archive and mjkey.txt) and upload them to the server's cloud disk.

We then log in to the cloud server, create a hidden folder named .mujoco, extract the package there, and copy mjkey.txt into it, as shown in the commands below:

mkdir -p ~/.mujoco  # create the hidden folder .mujoco
cp mujoco200_linux.zip ~/.mujoco
cd ~/.mujoco
unzip mujoco200_linux.zip  # extract MuJoCo
mv mujoco200_linux mujoco200
cd -  # go back to the previous directory (/mnt), where mjkey.txt is
cp mjkey.txt ~/.mujoco
cp mjkey.txt ~/.mujoco/mujoco200/bin

Note: ~/.mujoco/mujoco200_linux must be renamed to ~/.mujoco/mujoco200, otherwise import mujoco_py will raise errors.

Finally, mjkey.txt will be in both ~/.mujoco and ~/.mujoco/mujoco200/bin.
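
The layout described above can be checked with a small script before moving on. This is only a sketch under the assumptions stated in this section (folder names mujoco200 and mjkey.txt under ~/.mujoco); the function name check_mujoco_layout is hypothetical and not part of mujoco_py.

```python
from pathlib import Path


def check_mujoco_layout(root):
    """Return a list of problems found under the given .mujoco directory."""
    root = Path(root)
    problems = []
    # The extracted package must live in mujoco200, with its bin folder inside.
    if not (root / "mujoco200" / "bin").is_dir():
        problems.append("missing directory: mujoco200/bin")
    # A leftover mujoco200_linux folder means the rename step was skipped.
    if (root / "mujoco200_linux").exists():
        problems.append("mujoco200_linux was not renamed to mujoco200")
    # The tutorial copies the license key to two places; check both.
    for key in (root / "mjkey.txt", root / "mujoco200" / "bin" / "mjkey.txt"):
        if not key.is_file():
            problems.append(f"missing license key: {key.relative_to(root)}")
    return problems


if __name__ == "__main__":
    issues = check_mujoco_layout(Path.home() / ".mujoco")
    print("layout OK" if not issues else "\n".join(issues))
```

An empty list means the folder structure matches what this tutorial expects; it does not prove that the license key itself is valid.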

Then we need to add MuJoCo to the environment variables:

nano ~/.bashrc

Copy and paste the following content into the bashrc file you just opened.

export LD_LIBRARY_PATH=~/.mujoco/mujoco200/bin${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export MUJOCO_KEY_PATH=~/.mujoco

Then press Ctrl+X, Y, and Enter to save the file and exit, and finally remember to run

source ~/.bashrc
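
To confirm that the new variables are visible in a fresh shell, a quick check along the following lines may help. It is a minimal sketch: path_list_contains is a hypothetical helper, and it only inspects the LD_LIBRARY_PATH string, not what the dynamic loader actually resolves.

```python
import os


def path_list_contains(path_list, directory):
    """True if `directory` is one of the os.pathsep-separated entries."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [e for e in path_list.split(os.pathsep) if e]
    return any(os.path.normpath(os.path.expanduser(e)) == target for e in entries)


if __name__ == "__main__":
    mujoco_bin = "~/.mujoco/mujoco200/bin"
    ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    print("MuJoCo bin on LD_LIBRARY_PATH:", path_list_contains(ld_path, mujoco_bin))
    print("MUJOCO_KEY_PATH =", os.environ.get("MUJOCO_KEY_PATH"))
```

If the first line prints False, the .bashrc change has probably not been sourced in the current terminal yet.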

Install dmc2gym and dm_control

dmc2gym wraps the MuJoCo-based DeepMind Control Suite as Gym environments. However, strange issues may arise when installing it.

First, try pip install dmc2gym. If successful, great!

But if you encounter “exit status 128”:

pip install git+git://github.com/denisyarats/dmc2gym.git
Collecting git+git://github.com/denisyarats/dmc2gym.git
Cloning git://github.com/denisyarats/dmc2gym.git to /tmp/pip-req-build-ahuyx_q5
Running command git clone -q git://github.com/denisyarats/dmc2gym.git /tmp/pip-req-build-ahuyx_q5
fatal: unable to connect to github.com:
github.com[0: 140.82.121.4]: errno=Connection timed out

WARNING: Discarding git+git://github.com/denisyarats/dmc2gym.git. Command errored out with exit status 128: git clone -q git://github.com/denisyarats/dmc2gym.git /tmp/pip-req-build-ahuyx_q5 Check the logs for full command output.
ERROR: Command errored out with exit status 128: git clone -q git://github.com/denisyarats/dmc2gym.git /tmp/pip-req-build-ahuyx_q5 Check the logs for full command output.

The git:// clone is what times out in this log, so try the HTTPS URL instead:

pip install git+https://github.com/denisyarats/dmc2gym.git

If that doesn’t work, it could be a network issue on your cloud server, so try adding https://mirror.ghproxy.com/ before the clone URL:

pip install git+https://mirror.ghproxy.com/https://github.com/denisyarats/dmc2gym.git

You can find more about speeding up GitHub downloads here: https://matpool.com/supports/reference/faqs/#%E5%A6%82%E4%BD%95%E5%8A%A0%E9%80%9F-github-%E4%B8%8B%E8%BD%BD%EF%BC%9F

If none of this works, try again a few times :)

Then, install dm_control in a similar way, using the python3.6_eol branch (the stable version for Python 3.6) from GitHub: https://github.com/google-deepmind/dm_control/tree/python3.6_eol?tab=readme-ov-file

Running and Modifying Python Code

For simplicity, we’ll skip running Python via shell scripts and run train.py directly. Before running train.py, however, the editor may flag some imported libraries as missing (yellow underlines), so install them first. Common ones are the color library and the scikit-image library. Their installation commands are easy to find online.
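
Instead of relying on editor warnings, the missing imports can also be listed from Python. The sketch below uses the standard library's importlib.util.find_spec; the module names in the example list (termcolor, skimage) are my guesses at what the color and scikit-image libraries are imported as, so replace the list with the imports that train.py actually uses.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # Hypothetical list: replace it with the imports flagged in train.py.
    print(missing_modules(["torch", "termcolor", "skimage", "dmc2gym"]))
```

Run it inside the activated conda environment, since the result depends on which interpreter executes it.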

Before running the code, adjust the default values of certain arguments to tweak settings. For example, you can set save_video to false to avoid certain bugs.
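
As an illustration of what changing a default looks like, here is a minimal argparse sketch. The option names save_video and action_repeat appear in this tutorial, but the defaults and the exact add_argument calls are assumptions, so check train.py for the real definitions.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical subset of the options; check train.py for the real ones.
    # With default=False, video saving stays off unless --save_video is passed.
    parser.add_argument("--save_video", default=False, action="store_true")
    parser.add_argument("--action_repeat", default=1, type=int)
    return parser.parse_args(argv)


if __name__ == "__main__":
    print(parse_args([]))
```

Editing the default in the script and passing the flag on the command line have the same effect; the former is convenient when train.py is launched directly without a shell script.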

Next, directly run train.py in the terminal (if it runs successfully, something’s wrong 😅), and you’ll encounter small issues. Modify the corresponding library files based on the errors. Common issues include:

  1. AttributeError: 'dict' object has no attribute 'env_specs'
File "train.py", line 318, in <module>
main()
File "train.py", line 200, in main
frame_skip=args.action_repeat
File "/home/anavani/anaconda3/envs/rad/lib/python3.7/site-packages/dmc2gym/__init__.py", line 28, in make
if not env_id in gym.envs.registry.env_specs:
AttributeError: 'dict' object has no attribute 'env_specs'

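To fix it, delete the trailing .env_specs so that the check reads if not env_id in gym.envs.registry:, since the registry is already a dict in this Gym version. See: https://github.com/openai/gym/issues/3097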
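
If the same code has to run against both registry styles, a version-tolerant lookup is one option. The sketch below is not dmc2gym's code: is_registered and OldRegistry are hypothetical names, and the stand-in objects only mimic the two shapes seen in the traceback (an object with .env_specs versus a plain dict).

```python
def is_registered(registry, env_id):
    """Check membership whether the registry exposes .env_specs or is a dict."""
    # Fall back to the registry itself when it has no env_specs attribute.
    specs = getattr(registry, "env_specs", registry)
    return env_id in specs


class OldRegistry:
    """Stand-in for the older registry object that carried an env_specs dict."""

    def __init__(self, specs):
        self.env_specs = specs


print(is_registered(OldRegistry({"demo-v0": object()}), "demo-v0"))
print(is_registered({"demo-v0": object()}, "demo-v0"))
```

In dmc2gym's make function the real argument would be gym.envs.registry.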

  2. ValueError: not enough values to unpack (expected 5, got 4)

The following diagnosis is adapted from GPT’s suggestion:

The error indicates that the evaluate function in train.py expects 5 values, but env.step(action) returns only 4.

In train.py’s evaluate function:

obs, reward, terminated, truncated, info = env.step(action)

This attempts to unpack 5 variables, which is the newer Gym step API. The older Gym step API returns 4 values in the order (obs, reward, done, info), where a single done flag replaces terminated and truncated.

To fix this, adjust the unpacking:

obs, reward, done, info = env.step(action)  # older 4-value API: done already covers termination and truncation

Before making any changes, add a debug statement to check what env.step(action) actually returns:

result = env.step(action)
print("Step returned:", result)  # Debugging: confirm the number and order of values
obs, reward, done, info = result

Our modified code already includes this fix, but the steps above are useful for independent replication.
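
A more defensive alternative is to normalize the return value so that the training loop works with either step API. This is a sketch, assuming the 4-value form follows the older Gym convention (obs, reward, done, info) and the 5-value form is (obs, reward, terminated, truncated, info); unpack_step is a hypothetical helper, not part of Gym or M-CURL.

```python
def unpack_step(result):
    """Normalize env.step() output to (obs, reward, done, info)."""
    if len(result) == 5:
        # Newer API: the episode ends when it is terminated or truncated.
        obs, reward, terminated, truncated, info = result
        return obs, reward, bool(terminated or truncated), info
    if len(result) == 4:
        # Older API: a single done flag, with info as the last element.
        obs, reward, done, info = result
        return obs, reward, bool(done), info
    raise ValueError(f"unexpected step() result of length {len(result)}")
```

In the evaluate function it would be used as obs, reward, done, info = unpack_step(env.step(action)).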

  3. Modify the two feedforward layers in ctmr_sac.py that the errors point to, so that they operate on a clone of x instead of modifying x in place. The modified .py file can be found here: https://sjtu.feishu.cn/docx/HCnPdRRiEoxOzGxjzI9cggLMnsb?from=from_copylink

By now, the Python part should be bug-free. If there are still bugs, find your own solutions :)

Reconfiguring dmc2gym

If you’re using VSCode, progress may be slow at this point. Switch to a VNC virtual desktop to run train.py. You may encounter errors like:

Reading package lists... Done
Building dependency tree
Reading state information... Done
libgl1-mesa-glx is already the newest version (20.0.8-0ubuntu1~18.04.1).
libgl1-mesa-glx set to manually installed.
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
cuda-drivers-530 : Depends: nvidia-compute-utils-530 (>= 530.30.02) but it is not going to be installed
Depends: nvidia-utils-530 (>= 530.30.02) but it is not going to be installed
nvidia-driver-530 : Depends: nvidia-compute-utils-530 (= 530.30.02-0ubuntu1) but it is not going to be installed
Depends: nvidia-utils-530 (= 530.30.02-0ubuntu1) but it is not going to be installed
Recommends: libnvidia-compute-530:i386 (= 530.30.02-0ubuntu1)
Recommends: libnvidia-decode-530:i386 (= 530.30.02-0ubuntu1)
Recommends: libnvidia-encode-530:i386 (= 530.30.02-0ubuntu1)
Recommends: libnvidia-fbc1-530:i386 (= 530.30.02-0ubuntu1)
Recommends: libnvidia-gl-530:i386 (= 530.30.02-0ubuntu1)
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).

Refer to this blog: https://blog.xiunian.wang/?p=1867 and GPT for a solution:

Edit LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/root/miniconda3/envs/mcurl/lib/python3.11/site-packages/nvidia/cudnn/lib:/root/miniconda3/envs/mcurl/lib/python3.11/site-packages/torch/lib:/usr/local/cuda-11.8/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

Verify the settings:

echo $LD_LIBRARY_PATH

Ensure the output begins with the directories added above.

Re-run the ldd command and check that no dependency is reported as “not found”:

ldd /root/miniconda3/envs/mcurl/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8

If all works, add export LD_LIBRARY_PATH=... to .bashrc for persistence.
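
When ldd still reports a missing library, it can help to see which directories on LD_LIBRARY_PATH actually contain it. The sketch below is a simplified check with a hypothetical helper name, dirs_providing; it only scans LD_LIBRARY_PATH and ignores the other places the dynamic loader searches.

```python
import os


def dirs_providing(lib_name, path_list):
    """Return the directories in `path_list` that contain a file named `lib_name`."""
    hits = []
    for entry in path_list.split(os.pathsep):
        if entry and os.path.isfile(os.path.join(entry, lib_name)):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    print(dirs_providing("libcudnn_cnn_infer.so.8", ld_path))
```

An empty list suggests the directory holding the library is missing from the export line; more than one entry means the first directory listed is the one that takes precedence.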

By now, the dmc2gym setup should be complete, and the program should run.

CUDA and PyTorch

If your program runs but GPU usage is 0%, CUDA isn’t configured correctly. To configure it, first check CUDA and PyTorch version compatibility at: https://pytorch.org/get-started/previous-versions/, then install the corresponding CUDA version.

You can check with the following code:

import torch
print(torch.__version__)
print(torch.cuda.is_available())

If the last line prints False, PyTorch cannot use the GPU, which usually means the CUDA version is mismatched. Make sure to install the correct version.
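
For pip-installed builds, the PyTorch version string often carries a local tag such as +cu118 that names the CUDA version the wheel was built against. The sketch below parses that tag; cuda_from_torch_version is a hypothetical helper, and it returns None for builds without such a tag (for example many conda installs), so treat it as a convenience rather than a definitive check.

```python
import re


def cuda_from_torch_version(version):
    """Extract 'X.Y' from a version such as '2.0.1+cu118'; None if there is no cu tag."""
    match = re.search(r"\+cu(\d{2,})$", version)
    if match is None:
        return None
    digits = match.group(1)
    # The last digit is the minor version, the rest is the major version.
    return f"{digits[:-1]}.{digits[-1]}"


print(cuda_from_torch_version("2.0.1+cu118"))
print(cuda_from_torch_version("1.7.1"))
```

With PyTorch installed, torch.version.cuda reports the same information directly and also works for builds that have no tag in the version string.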

Also refer to: https://matpool.com/supports/doc-public-data/#cuda-%E5%AE%89%E8%A3%85

Results

Training on the GPU cloud server generates a folder named after the corresponding training project in the directory where train.py is located. The folder contains the stored buffer, model, video, and training logs. The format of the training logs is as follows:

{"episode_reward": 0.0, "episode": 1.0, "duration": 2.1876120567321777, "step": 500}
{"episode_reward": 105.28541183262065, "episode": 5.0, "duration": 1.9116880893707275, "step": 1000}
{"episode_reward": 141.84056755671332, "episode": 9.0, "batch_reward": 0.9721407055854797, "critic_loss": 33.874087238311766, "actor_loss": -10.513817775249482, "actor_target_entropy": -1.0, "actor_entropy": 0.47874767780303956, "alpha_loss": 0.052578425779938695, "alpha_value": 0.09913411643684729, "ctmr_loss": 7.322355401992798, "ctmr_cpc_lr": 8.395825000000003e-05, "duration": 74.88283920288086, "step": 1500}
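
Since each log line is a standalone JSON object, the logs can be read with the standard json module. The sketch below extracts the reward per training step from the first two sample lines above; load_log and reward_curve are hypothetical helper names, and the log file's name and location are not assumed here.

```python
import json


def load_log(lines):
    """Parse JSON-lines log records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def reward_curve(records):
    """Return (step, episode_reward) pairs sorted by step."""
    return sorted((r["step"], r["episode_reward"]) for r in records)


sample = [
    '{"episode_reward": 0.0, "episode": 1.0, "duration": 2.1876120567321777, "step": 500}',
    '{"episode_reward": 105.28541183262065, "episode": 5.0, "duration": 1.9116880893707275, "step": 1000}',
]
print(reward_curve(load_log(sample)))
```

To read a real log, pass an open file object to load_log in place of the sample list. Note that the loss fields (critic_loss, actor_loss, and so on) only appear in later records, so access them with dict.get rather than indexing.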