2.5.1 Health check

When a problem shows up with storage, a drive, a GPU link, or similar hardware, the first step is to run a health check on the server. The following script can help collect more information for debugging: https://github.com/longw010/gitbook_suppl/blob/master/chap2/2.5.1.sh

If you decide to contact the support team, you will also need to provide the `DGX unit serial number:` value.

Reimaging the server erases everything on it, so back up your data first.

When the system misbehaves, go to /var/log and save kern.log and the other relevant logs right away; sometimes only the most recent 7 days of logs are kept.
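
Since older logs may be rotated away, it can help to copy them somewhere safe as soon as a problem appears. Below is a minimal sketch of one way to do that in Python. The function name `archive_logs`, the destination directory, and the default file patterns (`kern.log*`, `syslog*`, `dmesg*`) are assumptions for illustration, not part of the script linked above; adjust them to whatever logs matter for your issue.

```python
import tarfile
import time
from pathlib import Path


def archive_logs(log_dir, dest_dir, patterns=("kern.log*", "syslog*", "dmesg*")):
    """Pack matching log files from log_dir into a timestamped .tar.gz in dest_dir.

    The default patterns are an assumption; pass your own tuple if needed.
    Returns the path of the archive that was written.
    """
    log_dir = Path(log_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # A timestamp in the name keeps repeated snapshots from overwriting each other.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = dest_dir / f"logs-{stamp}.tar.gz"

    with tarfile.open(archive_path, "w:gz") as tar:
        for pattern in patterns:
            # The glob also catches rotated copies such as kern.log.1 or kern.log.2.gz.
            for path in sorted(log_dir.glob(pattern)):
                if path.is_file():
                    tar.add(path, arcname=path.name)

    return archive_path
```

For example, `archive_logs("/var/log", "/path/to/backup")` would write one archive per call into the (hypothetical) backup directory. Reading files under /var/log usually requires root or suitable group membership, so the call may need to run with elevated privileges.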
