What it takes to be an ML infra engineer

Coding:

  • Reviewing our tools and code to help us continue moving at state-of-the-art speed

  • Spark/Dataproc for experiments at scale, MLflow for experiment management

  • Substantial experience with several of the following technologies: Arrow, Bazel, Docker, Kibana, MPI, MySQL, Redis, Spark, ZooKeeper

  • Ability to build full-stack web applications/services for internal tooling

  • Experience using Cython, Numba, C, or similar to speed up analytical code

  • Experience with GPU acceleration (CUDA and cuDNN)

  • Experience with Flink and MLflow

Integration:

  • Building deep integration between NVIDIA's GPU-backed RAPIDS frameworks and all of the major cloud and on-premises machine learning platforms

Deployment:

  • Automating, delivering, monitoring, and improving machine learning solutions while ensuring data and models are secure

  • Experience with Terraform and Puppet for infrastructure management and automation

  • Experience with Kubernetes deployments and cluster management
