Chap 3. Infrastructure/platform design
Infrastructure is one of the core competencies for the company.
Working on infrastructure involves several responsibilities.
Identify the gap
There are multiple gaps in the workflow that infrastructure can help close. More precisely:
Research: (If the DVC + MLflow + Anaconda stack works for you, that's great. Metaflow provides similar features. Cloud integration is really important at Netflix's scale. See https://news.ycombinator.com/item?id=21702831)
GPU management: how to connect multiple servers
Data management: Data management system/data warehouse
Server maintenance
The codebase for research: the pre-processing/post-processing pipeline (DAG), the training pipeline, visualization
Mini app: inference pipeline for testing (MVP)
Model registry (model management): a platform to track models with different functionalities
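The model management item above can be illustrated with a minimal sketch. It assumes an in-memory store, and every name in it (ModelRecord, ModelRegistry, the task field, the example URI) is hypothetical rather than taken from an existing tool; a real platform would persist the records in a database or use an off-the-shelf registry.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelRecord:
    """One registered version of a model (all fields are illustrative)."""
    name: str
    version: int
    artifact_uri: str  # hypothetical location of the stored weights
    task: str          # the functionality the model provides
    metrics: Dict[str, float] = field(default_factory=dict)


class ModelRegistry:
    """In-memory registry that tracks model versions by name."""

    def __init__(self) -> None:
        self._records: Dict[str, List[ModelRecord]] = {}

    def register(
        self,
        name: str,
        artifact_uri: str,
        task: str,
        metrics: Optional[Dict[str, float]] = None,
    ) -> ModelRecord:
        # Versions are assigned sequentially per model name, starting at 1.
        versions = self._records.setdefault(name, [])
        record = ModelRecord(
            name, len(versions) + 1, artifact_uri, task, dict(metrics or {})
        )
        versions.append(record)
        return record

    def latest(self, name: str) -> ModelRecord:
        # Raises KeyError if the model name was never registered.
        return self._records[name][-1]

    def find_by_task(self, task: str) -> List[ModelRecord]:
        # Return every version of every model registered for this task.
        return [
            record
            for versions in self._records.values()
            for record in versions
            if record.task == task
        ]


registry = ModelRegistry()
registry.register("detector", "s3://example-bucket/detector/v1", "detection")
print(registry.latest("detector").version)
```

Keeping the functionality (task) on each record is what lets the registry answer "which models can do X?", which is the tracking need named in the list.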
Product:
Deployment
Dashboard
The codebase for the app
Monitoring mechanism/health check
User activity analysis
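As a rough illustration of the monitoring/health-check item above, the sketch below runs a set of named checks and aggregates them into one status. The function name run_health_checks, the report layout, and the example checks are assumptions made for illustration; they are not part of any specific monitoring product.

```python
import time
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each named check and summarize the results.

    A check passes when it returns a truthy value and fails when it
    returns a falsy value or raises an exception.
    """
    results = {}
    for name, check in checks.items():
        started = time.monotonic()
        try:
            ok = bool(check())
            error = None
        except Exception as exc:  # a crashing check counts as a failure
            ok = False
            error = str(exc)
        results[name] = {
            "ok": ok,
            "error": error,
            "seconds": time.monotonic() - started,
        }
    # The service is healthy only if every individual check passed.
    status = "healthy" if all(r["ok"] for r in results.values()) else "unhealthy"
    return {"status": status, "checks": results}


def database_check() -> bool:
    # Hypothetical stand-in for a real connectivity probe.
    raise RuntimeError("db down")


report = run_health_checks({"disk": lambda: True, "database": database_check})
print(report["status"])
```

A deployment would typically expose such a report through an HTTP endpoint and poll it on a schedule, so that a failing dependency shows up on the dashboard before users notice.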
Commercial:
A platform to track records and transfer money
A platform to interact with customers
Others/internal tools:
internal wiki
internal StackOverflow
internal compiler
Build the platforms
design reviews
code and test
Maintain the platforms
collect feedback from each site periodically
fix bugs and apply patches
upgrade the versions of dependencies if needed
add new features if requested