Chap 3. Infrastructure/platform design

Infrastructure is one of the core competencies for the company.

Working on infrastructure involves several responsibilities.

Identify the gap

Infrastructure can close several gaps in the workflow. More precisely,

  • Research: (If the DVC + MLflow + Anaconda stack works for you, that's great. Metaflow provides similar features. Cloud integration is really important at Netflix's scale; see https://news.ycombinator.com/item?id=21702831)

    • GPU management: how to connect multiple servers

    • Data management: Data management system/data warehouse

    • Server maintenance

    • The codebase for research: the pre-processing/post-processing pipeline (DAG), the training pipeline, and visualization

    • Mini app: inference pipeline for testing (MVP)

    • Model registry (model management): a platform to track models with different functionalities

  • Product:

    • Deployment

    • Dashboard

    • The codebase for the app

    • Monitoring mechanism/health check

    • User activity analysis

  • Commercial:

    • A platform to track records and transfer money

    • A platform to interact with customers

  • Others/internal tools:

    • internal wiki

    • internal StackOverflow

    • internal compiler
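
As an illustration of the research codebase item above, where the pre-processing/post-processing pipeline is organized as a DAG, here is a minimal sketch in Python. It assumes Python 3.9+ for the standard-library graphlib module. The step names, the PIPELINE mapping, and the toy step functions are all hypothetical and stand in for real pipeline stages; this is not how DVC, MLflow, or Metaflow define pipelines.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
# Every step must appear as a key, even if it has no dependencies.
PIPELINE = {
    "preprocess": set(),
    "train": {"preprocess"},
    "postprocess": {"train"},
    "visualize": {"postprocess"},
}

# Toy step functions (assumptions for illustration only). Each one receives
# a dict holding the results of the steps it depends on.
STEPS = {
    "preprocess": lambda inputs: [2 * x for x in (1, 2, 3)],
    "train": lambda inputs: sum(inputs["preprocess"]),
    "postprocess": lambda inputs: inputs["train"] + 1,
    "visualize": lambda inputs: f"score={inputs['postprocess']}",
}


def run_pipeline(dag, steps):
    """Run every step after its dependencies; return (order, results)."""
    # static_order() yields the nodes so that dependencies come first,
    # and raises graphlib.CycleError if the graph is not a DAG.
    order = list(TopologicalSorter(dag).static_order())
    results = {}
    for name in order:
        inputs = {dep: results[dep] for dep in dag[name]}
        results[name] = steps[name](inputs)
    return order, results


order, results = run_pipeline(PIPELINE, STEPS)
print(order)
print(results["visualize"])
```

Keeping the dependency graph as data, separate from the step functions, means a new stage can be added by editing the mapping rather than the runner.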

Build the platforms

  • design reviews

  • code and test

Maintain the platforms

  • collect feedback from each site periodically

  • fix bugs and release patches

  • upgrade the versions of dependencies if needed

  • add new features if requested
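
For the dependency-upgrade item above, one possible starting point is a small script that reports which installed packages fall below a required minimum version. The sketch below uses only the Python standard library. The function names (parse_version, is_older, find_outdated) and the example package names are hypothetical, and the version parsing is deliberately simplified: it reads only the leading numeric components, so a real project would likely prefer a dedicated version-parsing library.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '1.24.3' into (1, 24, 3); stop at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_older(installed, required):
    """True if the installed version is strictly below the required one."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.0' and '1.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def find_outdated(minimums, installed_version=version):
    """Return {package: (installed, required)} for packages needing an upgrade.

    installed_version can be swapped for a fake lookup in tests; by default
    it queries the current environment through importlib.metadata.
    """
    outdated = {}
    for name, required in minimums.items():
        try:
            installed = installed_version(name)
        except PackageNotFoundError:
            installed = None  # a missing package counts as outdated
        if installed is None or is_older(installed, required):
            outdated[name] = (installed, required)
    return outdated


if __name__ == "__main__":
    # Hypothetical minimum versions; replace with the platform's real pins.
    print(find_outdated({"pip": "20.0", "some-missing-package": "1.0"}))
```

Running a check like this on a schedule gives the maintainers a list of pending upgrades without changing any environment automatically.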
