Being an ML engineer

This is my current understanding of the expectations for a machine learning engineer with a few years of working experience.

I joined my current team when it was still at a very early stage, so I got chances to practice solving different tasks in different situations and to build the skill sets the role needs. Later on, the team grew. When everyone on the team is familiar with everything, what makes me competitive? Why am I the person working on this project, and not someone else? Who will I be after a few years of working as a machine learning engineer?

I read through job descriptions for ML lead and senior/staff ML engineer roles, along with some of their daily work and public shares. I think the required qualities include:

  • anticipate the resources required, potential issues, and needs, and be able to set tasks and evaluate results

  • be able to resolve issues, or get the resources needed to resolve them; in short, take end-to-end responsibility

  • share takeaways in public and contribute to the community

Also, a pitfall to keep in mind: no one is irreplaceable. Without any one person, everything will still go on; that doesn't mean the person is not important. I hope that with me involved, the solution turns out great and I also enjoy the process.

General principles to practice:

  • Replicate papers

  • Don't get too comfortable in your comfort zone. When you start a new project, make sure it requires learning some new frameworks/libraries/tools.

  • Learn the boring things that a solid majority of applicants under-appreciate, like a proper Git flow, how to use Docker, how to build an app using Flask, and how to deploy models on AWS (see the sketch after this list).

  • Do the annoying things, like presenting your work at meetups, attending conferences, and networking.

  • Do things that seem crazy.
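
To make the "boring things" bullet concrete, here is a minimal sketch of wrapping a model behind a Flask endpoint. The `/predict` route, the dummy `predict` function, and the JSON payload format are all illustrative assumptions for practice, not a definitive serving setup.

```python
# Minimal Flask serving sketch (illustrative; swap in a real model).
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    # Placeholder for a real model call, e.g. model.predict(features).
    return sum(features)

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # Expects a JSON body like {"features": [1, 2, 3]} (assumed format).
    payload = request.get_json(force=True)
    features = payload.get("features", [])
    return jsonify({"score": predict(features)})

if __name__ == "__main__":
    # For local practice only; in production, serve behind uWSGI/NGINX.
    app.run(host="0.0.0.0", port=5000)
```

You can exercise it locally with `curl -X POST -H "Content-Type: application/json" -d '{"features": [1, 2, 3]}' http://localhost:5000/predict`.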
