5.1 Case study

The common protocol for reproducing a paper includes:

  • paper with code available

    • clone the git repo and create a container with the same environment (Ubuntu/Python/TF/PyTorch, etc.) as the README suggests

    • download the dataset they used for training and create a mini dataset for debugging

    • run the whole training pipeline to make sure it is reproducible (compare the loss/metrics against the paper; see the first sketch after this list)

    • modify the data loader/loss function/number of channels, etc., to make it compatible with the custom data we are working with (see the second sketch after this list)

    • get a baseline

    • run training and record the experiments

  • paper without code available

    • choose a paper in a field you're familiar with

    • build an MVP first (don't invest too much time before you're convinced by the MVP's performance)

    • change one module at a time (see the config sketch below)
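
Two minimal sketches for the code-available path follow. The first is the reproducibility check: it assumes the authors' pipeline writes its evaluation metrics to a JSON file and that the paper's reported numbers are copied in by hand; the file path, metric names, and tolerances are placeholders, not anything a specific repo provides.

```python
# Minimal sketch: compare a rerun of the authors' pipeline against the numbers
# reported in the paper. File path, metric keys, and tolerances are assumptions.
import json

reported = {"top1_acc": 0.762, "val_loss": 1.83}   # copied from the paper (example values)
tolerance = {"top1_acc": 0.01, "val_loss": 0.05}   # relative gap we accept per metric

with open("runs/repro/metrics.json") as f:         # hypothetical output of the training run
    reproduced = json.load(f)

for name, target in reported.items():
    got = reproduced[name]
    rel_gap = abs(got - target) / abs(target)
    status = "OK" if rel_gap <= tolerance[name] else "MISMATCH"
    print(f"{name}: reported={target:.4f} reproduced={got:.4f} "
          f"rel_gap={rel_gap:.2%} -> {status}")
```

Small gaps are normal (cuDNN nondeterminism, data shuffling, library versions), so compare against a tolerance instead of demanding an exact match.

The second sketch covers the number-of-channels adaptation, using the common case of feeding single-channel custom data into an RGB backbone. It assumes a recent torchvision ResNet; the layer name `conv1` is specific to that architecture and will differ in other repos.

```python
# Minimal sketch: replace the stem conv so a ResNet accepts 1-channel input.
# The backbone choice and input size are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights=None)
in_channels = 1                                # e.g. grayscale custom data

old = model.conv1
model.conv1 = nn.Conv2d(in_channels, old.out_channels,
                        kernel_size=old.kernel_size, stride=old.stride,
                        padding=old.padding, bias=old.bias is not None)

x = torch.randn(2, in_channels, 224, 224)      # dummy batch as a sanity check
print(model(x).shape)                          # expected: torch.Size([2, 1000])
```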

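For the no-code path, the "change one module at a time" discipline can be made mechanical by deriving every experiment from one baseline config and overriding a single entry per run. A minimal sketch, with all keys and values hypothetical:

```python
# Minimal sketch of "change one module at a time": every experiment overrides
# exactly one entry of the baseline config. All names here are placeholders.
baseline = {
    "backbone": "resnet18",
    "loss": "cross_entropy",
    "augmentation": "none",
    "lr": 1e-3,
}

experiments = [
    {**baseline},                                 # the MVP / baseline itself
    {**baseline, "loss": "focal"},                # swap only the loss
    {**baseline, "augmentation": "randaugment"},  # swap only the augmentation
]

for i, cfg in enumerate(experiments):
    changed = {k: v for k, v in cfg.items() if baseline[k] != v}
    print(f"run {i}: changes vs baseline -> {changed or 'baseline'}")
    # train(cfg)  # hypothetical training entry point
```

Keeping each run one change away from the baseline makes any gain or regression attributable to that change, which is the point of starting from an MVP.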