Practice on version control and reproducing experiments

Rules:

  1. Keep the random seed unchanged (15213); see the seeding sketch after this list.

  2. Generate data with standalone Python scripts rather than Jupyter notebook cells (a script skeleton follows this list).

  3. Never delete those Python scripts, even if the generated data has been deleted.

  4. The input to the Python scripts is the raw DICOM files under /raid/Data.

  5. Try not to depend on functions from other files. If I have to, use the version from the master branch.

  6. Be cautious when running operations on, or modifying functions in, the master branch.

  7. Push the code and back up the data-generation scripts for each experiment.

  8. For each function, list its dependency functions in the comments, so that if I have to modify a function on the master branch, the very next thing I do is run the unit tests and check that the dependent functions still work properly (see the sketch after this list).
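As a concrete anchor for rule 1, here is a minimal seeding sketch to keep at the top of every experiment script (the `seed_everything` helper name is illustrative, not part of the original rules):

```python
import random

import numpy as np
import torch

SEED = 15213  # rule 1: keep the random seed unchanged


def seed_everything(seed: int = SEED) -> None:
    """Pin the common sources of randomness for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade kernel speed for determinism in cuDNN.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything()
```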

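For rules 2-4, a sketch of what such a data-generation script could look like, assuming `pydicom` as the DICOM reader; the script name, CLI flag, and `.npy` output format are illustrative:

```python
"""generate_data.py -- regenerate experiment data from the raw DICOM files.

Keep this script under version control even after the generated
data is deleted (rule 3).
"""
import argparse
from pathlib import Path

import numpy as np
import pydicom  # assumed DICOM reader; any equivalent works

RAW_DIR = Path("/raid/Data")  # rule 4: raw DICOM input location


def main(out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    # sorted() keeps the file order deterministic across runs.
    for dcm_path in sorted(RAW_DIR.rglob("*.dcm")):
        pixels = pydicom.dcmread(dcm_path).pixel_array
        np.save(out_dir / f"{dcm_path.stem}.npy", pixels)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--out-dir", type=Path, required=True)
    main(parser.parse_args().out_dir)
```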

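And for rule 8, a sketch of how the dependency comments and the follow-up unit test could look; the function and module names are made up for illustration:

```python
import numpy as np


def normalize_volume(volume: np.ndarray) -> np.ndarray:
    """Scale a CT volume to [0, 1].

    Dependency functions (master branch), per rule 8:
        io_utils.load_dicom_series
        preprocessing.clip_hounsfield
    """
    lo, hi = volume.min(), volume.max()
    return (volume - lo) / (hi - lo)


# If any dependency above changes on master, run the unit tests first, e.g.:
#   pytest test_preprocessing.py
def test_normalize_volume():
    volume = np.array([[-1000.0, 0.0], [500.0, 3000.0]])
    out = normalize_volume(volume)
    assert out.min() == 0.0 and out.max() == 1.0
```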
