[info] GPU hardware benchmarking test

1. What should we prepare for the benchmark test?
2. How do we set up the environment for a GPU hardware benchmark?
3. Which metrics do we care about most, and how should we store the benchmark logs?
4. Are there any good practices here?
5. Are there other results related to GPU benchmarking?
6. Code references

Answers:

1. A well-defined purpose, data, source code, and some basic estimation.
purpose: This is the most important part and the one that requires the most effort. Which aspect do you want to investigate? For example,
-- fp32 vs. fp16
-- fit() vs. fit_generator()
data: Note that if the real data is company-confidential, we should use randomly generated data with the same format (see the sketch below).
source code: An OOP style is essential. For example, the model structure should be decoupled from the pipeline.
some basic estimation: Before running anything, we need a rough idea of the memory footprint and the expected running time for training and inference.
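
For the data part, a minimal sketch of generating random data that mimics the real format; the shapes, dtypes, and class count below are placeholders to be replaced with the real dataset's:

    import numpy as np

    # Placeholder format: adjust shapes/dtypes/classes to match the real dataset.
    NUM_SAMPLES, HEIGHT, WIDTH, CHANNELS, NUM_CLASSES = 1000, 224, 224, 3, 10

    # Random images with the same shape and dtype as the real inputs.
    images = np.random.randint(0, 256, size=(NUM_SAMPLES, HEIGHT, WIDTH, CHANNELS),
                               dtype=np.uint8)
    # Random labels drawn from the same label space as the real data.
    labels = np.random.randint(0, NUM_CLASSES, size=(NUM_SAMPLES,), dtype=np.int64)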

2. Mainly two ways: (1) run inside a container, or (2) set up the environment directly on the server. For (2), we need to make sure the TensorFlow version is supported by the installed CUDA version. For example, at the time of writing, CUDA 10 does not support the latest TensorFlow (1.12). If you don't have root permission to fix the environment, you might want to use an nvidia-docker container instead.
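
A quick sanity check (assuming a TF 1.x environment as discussed above) that the installed TensorFlow build actually sees the GPU, i.e., that the TF/CUDA/driver versions are compatible:

    import tensorflow as tf

    # Print the version and check GPU visibility (TF 1.x API).
    print("TensorFlow version:", tf.__version__)
    print("GPU available:", tf.test.is_gpu_available())
    print("GPU device:", tf.test.gpu_device_name())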

3. GPU utilization and running time.
There are a few ways to get GPU utilization, e.g., nvidia-smi.
The running time is tricky: record the data-initialization time and the training time for each epoch separately (see the sketch below).
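
A minimal sketch of collecting both metrics; it assumes nvidia-smi is on the PATH and leaves the dataset-building and training steps as placeholders:

    import subprocess, time

    def gpu_utilization():
        # Query instantaneous GPU utilization (%) for every visible GPU via nvidia-smi.
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"])
        return [int(x) for x in out.decode().strip().splitlines()]

    start = time.time()
    # ... dataset/data initialization goes here ...
    data_init_time = time.time() - start

    for epoch in range(5):                       # placeholder epoch count
        epoch_start = time.time()
        # ... one training epoch goes here ...
        print("epoch", epoch,
              "time(s)", round(time.time() - epoch_start, 2),
              "gpu util(%)", gpu_utilization())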

4. Repeat the experiment 10 times and report the mean (a timing harness is sketched below).
Datasets should be similar to the real datasets.
Be aware of I/O bottlenecks.
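
A sketch of a repeat-and-average harness; run_once is a placeholder for the training or inference run under test:

    import time
    import numpy as np

    def benchmark(run_once, repeats=10):
        # Run the experiment several times and report mean/std of the wall time.
        times = []
        for _ in range(repeats):
            start = time.time()
            run_once()               # the training or inference run under test
            times.append(time.time() - start)
        return np.mean(times), np.std(times)

    # Hypothetical usage:
    # mean_t, std_t = benchmark(lambda: model.fit(images, labels, epochs=1))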


5. MLPerf: https://mlperf.org/


6. https://github.com/mlperf
Comments:
Note that the inference code and the training code are separated.
I like that they store the machine parameters in a JSON file (sketched below).
They use a container to set up the environment.
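
A sketch of recording machine parameters to a JSON file next to the benchmark logs; the fields below are only examples, and nvidia-smi is assumed to be available:

    import json, platform, subprocess

    # Example fields only; extend as needed for your benchmark logs.
    gpu_info = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
         "--format=csv,noheader"]).decode().strip()

    machine = {
        "hostname": platform.node(),
        "python": platform.python_version(),
        "gpu": gpu_info,
    }

    with open("machine_params.json", "w") as f:
        json.dump(machine, f, indent=2)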

https://github.com/lambdal/benchmarks/tree/8459a23af411ea79968c9af645afdad77b01eeb4

Comments: benchmarks on classical models.