GPU hardware benchmarking test

What should we prepare for the benchmark test?
How should we set up the environment for the GPU benchmark?
Which metrics are we most interested in? How should we store the logs of the benchmark runs?
Any good practices here?
Are there any other results related to GPU benchmarking?
Any code references?

Answers:

1. A well-defined purpose, data, source code, and some basic estimation.
purpose: This is the most important part and the one that requires the most effort. Decide which aspect you want to investigate, for example:
-- fp32 vs. fp16
-- fit() vs. fit_generator()
data: Note that if the real data is company-confidential, we should use randomly generated data with the same format (see the sketch after this list).
source code: An OOP style is essential. For example, the model structure should be decoupled from the training pipeline.
some basic estimation: We need a rough understanding of the memory footprint and the expected running time for training and inference.
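A minimal sketch of these points, assuming a tf.keras image model (the input shape, layer choices, and class count are placeholders, not from the original notes): synthetic data with the same shape as the confidential data, a model builder decoupled from the data pipeline, and a rough memory estimate from the parameter count.

```python
import numpy as np
import tensorflow as tf

def make_synthetic_data(n_samples=1024, input_shape=(224, 224, 3), n_classes=10):
    """Random data with the same shape/dtype as the confidential real data."""
    x = np.random.rand(n_samples, *input_shape).astype("float32")
    y = np.random.randint(0, n_classes, size=n_samples)
    return x, y

def build_model(input_shape=(224, 224, 3), n_classes=10):
    """Model definition kept separate from the data/benchmark pipeline."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

def estimate_param_memory_mib(model, bytes_per_param=4):
    """Weights-only memory estimate: 4 bytes/param for fp32, 2 for fp16.
    Excludes activations, gradients, and optimizer state, which can dominate."""
    return model.count_params() * bytes_per_param / 1024 ** 2

model = build_model()
print("weights only: %.2f MiB (fp32)" % estimate_param_memory_mib(model, 4))
print("weights only: %.2f MiB (fp16)" % estimate_param_memory_mib(model, 2))
```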

2. Mainly two ways: (1) use a container, or (2) set up the environment directly on the server. For (2), make sure the TensorFlow version is supported by the installed CUDA version; for example, CUDA 10 does not yet support the latest TensorFlow (1.12). If you don't have root access on the server, you might want to use an nvidia-docker container instead.
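A quick sanity check that the installed TensorFlow build actually sees the GPU (TF 1.x API, matching the versions above); an empty result usually means a CUDA/TensorFlow version mismatch or missing drivers:

```python
import tensorflow as tf

print("TF version:", tf.__version__)
# Both calls are from the TF 1.x tf.test module.
print("GPU available:", tf.test.is_gpu_available())
print("GPU device name:", tf.test.gpu_device_name())
```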

3. GPU utilization.
There are a few ways to get the GPU utilization:
=> use nvidia-smi
Running time
The running time here is tricky: you need to record the data-initialization time and the training time for each epoch separately (see the sketch below).
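A hedged sketch of both measurements, assuming a single GPU and a tf.keras training loop (the class and function names are my own): an nvidia-smi query plus a callback that times each epoch. Pass it to fit() via callbacks=[EpochTimer()], and time the data initialization separately so it is not folded into the first epoch.

```python
import subprocess
import time

import tensorflow as tf

def gpu_utilization():
    """Query GPU utilization (%) and memory used (MiB) via nvidia-smi.
    Assumes a single GPU; nvidia-smi prints one line per device."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"])
    util, mem = out.decode().strip().split(", ")
    return int(util), int(mem)

class EpochTimer(tf.keras.callbacks.Callback):
    """Records wall-clock time per epoch and samples GPU utilization."""
    def on_train_begin(self, logs=None):
        self.epoch_times = []

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        elapsed = time.time() - self._start
        util, mem = gpu_utilization()
        self.epoch_times.append(elapsed)
        print("epoch %d: %.1fs, GPU util %d%%, mem %d MiB"
              % (epoch, elapsed, util, mem))
```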

4. Repeat the experiment 10 times and report the mean (ideally with the standard deviation; see the harness below).
Datasets should be similar to the real datasets.
Be aware of I/O issues: if data loading is slower than the GPU, you end up benchmarking the disk rather than the GPU.
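A minimal repetition harness; run_once is a hypothetical stand-in for one full benchmark run:

```python
import statistics
import time

def benchmark(run_once, n_repeats=10):
    """Run the benchmark n_repeats times; return (mean, stdev) in seconds."""
    times = []
    for _ in range(n_repeats):
        start = time.time()
        run_once()  # one full training or inference run
        times.append(time.time() - start)
    return statistics.mean(times), statistics.stdev(times)
```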


5. MLPerf: https://mlperf.org/


6. https://github.com/mlperf
Comments: 
note that the inference code and the training code are separated
I like that they store the machine parameters in a JSON file (a sketch of this pattern follows)
they use containers to set up the environment
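A minimal sketch of that pattern (the field names here are my own, not MLPerf's): collect the machine/GPU parameters and store them next to the benchmark logs.

```python
import json
import platform
import subprocess

def machine_params():
    """Basic machine parameters; assumes nvidia-smi is on the PATH."""
    gpu_name = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]
    ).decode().strip()
    return {
        "hostname": platform.node(),
        "python": platform.python_version(),
        "gpu": gpu_name,
    }

with open("machine_params.json", "w") as f:
    json.dump(machine_params(), f, indent=2)
```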

https://github.com/lambdal/benchmarks/tree/8459a23af411ea79968c9af645afdad77b01eeb4

Comments: classical models
