Feedback on Clara training framework

https://docs.nvidia.com/clara/tlt-mi/clara-train-sdk-v3.0/nvmidl/clara_faq.html#why-should-i-use-clara-train

The goal of this exploration was to try out the Clara framework and see how it could help with our current setup. Note that this is just a personal opinion.

Summary:

  • the most suitable scenario for the training framework is the handover of a finished prototype model to deployment. One potential TODO item is making it compatible with multiple Python and TensorFlow versions (v3.0 is pinned to Python 3.6 / TF 1.14) if we use only the training framework but deploy in our own setup.

  • the training framework is a great, generic, and standard framework. It is also a very helpful reference for building a framework around our own focus or the common issues we meet (for example, recon-focused tasks), because the framework's advantages are concentrated on classification, AutoML, and training-speed-related tasks. The Dockerfile is a good reference for the basic setup, but the rest of the source code is encrypted.

  • the model registry (NGC) could be helpful for some research purposes, for example, some segmentation models.

  • we need confirmation on the plan for applying federated learning to the product.

Details:

Clara includes a training framework and a deployment framework. The details below focus on the research-related part.

  • Clara training container

pro:

  • great documentation

  • the framework is easy to adopt and pick up

  • ships with a wide collection of useful packages and tools, for example DALI, NGC, nvmidl for data conversion, DLProf for monitoring the footprint, etc.

  • takes good care of determinism in TensorFlow (determinism is very hard to achieve in TensorFlow, although much easier in PyTorch; see the sketch after this list)
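For context on this point: Clara's internal mechanism is not visible to us (the source is encrypted), but as a rough illustration of the gap it closes, below is a minimal sketch of what reproducibility typically required in the TF 1.x generation versus PyTorch. It assumes NVIDIA's tensorflow-determinism package (`tfdeterminism`) is installed; the seed value and the choice of knobs are ours, purely for illustration.

```python
import os
import random

import numpy as np

# --- TensorFlow 1.x (e.g. TF 1.14, as pinned by Clara Train v3.0) ---
import tensorflow as tf
from tfdeterminism import patch  # NVIDIA's tensorflow-determinism package

patch()                      # patch non-deterministic GPU ops in TF 1.14
os.environ["PYTHONHASHSEED"] = "0"
random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)       # graph-level seed; op-level seeds may still be needed

# --- PyTorch equivalent: far fewer knobs to turn ---
import torch

torch.manual_seed(42)                       # seeds CPU and all GPUs
torch.backends.cudnn.deterministic = True   # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False      # disable non-deterministic autotuning
```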

con:

  • the highlighted features do not seem to align with our focus, so switching from a pure TensorFlow or PyTorch environment does not appear to save much time in the current version. Also, among the training speed-up features, everything except smart caching can easily be added to our current pipeline, and smart caching itself seems to have been tested only on classification-related tasks and requires manual tuning of some parameters (see the sketch below).
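For reference, the smart-caching idea is to keep a fraction of preprocessed samples in memory and refresh only a few of them per epoch; the cache size and the replacement rate are exactly the parameters that need manual tuning. Since Clara's own implementation is encrypted, the following is only a hypothetical pure-Python sketch of the concept; the names (`SmartCacheSketch`, `cache_rate`, `replace_rate`) are ours, not Clara's API.

```python
import random


class SmartCacheSketch:
    """Hypothetical sketch of the smart-caching idea (not Clara's API):
    cache a fraction of preprocessed samples and refresh a few per epoch."""

    def __init__(self, dataset, preprocess, cache_rate=0.5, replace_rate=0.1):
        self.dataset = dataset
        self.preprocess = preprocess
        self.cache_size = max(1, int(len(dataset) * cache_rate))          # manual tuning
        self.replace_num = max(1, int(self.cache_size * replace_rate))   # manual tuning
        self.indices = list(range(len(dataset)))
        self.cached = [preprocess(dataset[i]) for i in self.indices[: self.cache_size]]
        self.next_idx = self.cache_size

    def end_of_epoch(self):
        # Replace a small slice of the cache with freshly preprocessed samples,
        # so the model still sees the whole dataset over many epochs while
        # paying the preprocessing cost for only `replace_num` items per epoch.
        for slot in random.sample(range(self.cache_size), self.replace_num):
            self.cached[slot] = self.preprocess(
                self.dataset[self.indices[self.next_idx % len(self.dataset)]]
            )
            self.next_idx += 1

    def __len__(self):
        return self.cache_size

    def __getitem__(self, i):
        return self.cached[i]
```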

NGC: ready-made models that can be used directly

Some additional good-to-haves for promotion and user adoption:

  • Regarding the code snippets in the documentation, it would be great to publish some full scripts on GitHub, for example training/deployment examples in Python with wider variety (recon-based, complex-number support, etc.) under the Clara training container. We can help with that if needed.

  • In addition to the FAQ, building an active forum or community would boost confidence in switching, because a forum is the first and most direct place to go when there is an issue. It is also one of the most critical signals people use to decide whether a tool is mature enough to switch to.

  • We can also contribute some models to NGC if needed.
