A few good-to-haves for training data tracking

This post is a follow-up on a discussion about recommended ways to keep a record of the preprocessed data used for training.

Issues:

a. the data is too large to push to the code repository (and it is not secure to put it on GitHub either)

b. data from different batches have different fields.

Potential good practices to consider:

  • set up and modularize the preprocessing pipeline, from raw data to the preprocessed results

  • set the random seed (a seeding sketch is below)

  • record dependencies and their version numbers (a sketch for dumping the environment is below)

  • version control the code

  • stick to one sheet for the info of all cases (multiple files might lead to missing or confusing entries in the future)

  • could use dcmdump to extract header features for DICOM cases (a pydicom sketch is below)

  • the file can be unstructured, but it needs to be comprehensive

  • compute the MD5 hash of the preprocessed data directory and save it in a config file (version control it together with the code in git); a hashing sketch is below

  • compare the MD5 hash against the recorded one in each experiment

  • only save the preprocessed output when it takes a long time to generate

  • use DVC to version the data outside git (it tracks the data and commits only a small .dvc pointer file)
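
For the random-seed item, a minimal sketch of seeding everything in one place (assuming NumPy and PyTorch are the libraries in use; the function name and seed value are arbitrary):

```python
import os
import random

import numpy as np
import torch  # assumption: PyTorch is the training framework


def seed_everything(seed: int = 42) -> None:
    """Seed every common source of randomness so the preprocessing
    (shuffling, sampling, augmentation) is reproducible."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


seed_everything(42)
```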
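
For the dependencies item, a sketch that dumps every installed package and its version to a text file committed next to the code (the file name environment.txt is an example; `pip freeze > requirements.txt` does the same job):

```python
from importlib.metadata import distributions


def dump_environment(path: str = "environment.txt") -> None:
    """Write every installed package and its version to a text file,
    so the exact environment is recorded alongside the code."""
    pkgs = sorted((d.metadata["Name"], d.version) for d in distributions())
    with open(path, "w") as f:
        for name, version in pkgs:
            f.write(f"{name}=={version}\n")


dump_environment()
```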
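
For the DICOM item, dcmdump from DCMTK prints the headers directly on the command line; here is an equivalent sketch using pydicom instead, pulling a few header fields per file into the single tracking sheet (the field list and file names are assumptions, adjust them to whatever the sheet tracks):

```python
import csv
from pathlib import Path

import pydicom  # alternative to the dcmdump CLI for reading DICOM headers

# Example fields to record per case; extend as needed.
FIELDS = ["PatientID", "StudyDate", "Modality", "Rows", "Columns", "PixelSpacing"]


def dicom_features(dicom_dir: str, out_csv: str = "cases.csv") -> None:
    """Read the header of every .dcm file under dicom_dir and write one
    row per file into the tracking sheet."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file"] + FIELDS)
        for path in sorted(Path(dicom_dir).rglob("*.dcm")):
            ds = pydicom.dcmread(path, stop_before_pixels=True)
            writer.writerow([str(path)] + [str(getattr(ds, tag, "")) for tag in FIELDS])
```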
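
For the MD5 items, a sketch that hashes the preprocessed data directory, saves the hash in a small config file committed with the code, and checks it at the start of each experiment (the config file name and function names are assumptions):

```python
import hashlib
import json
from pathlib import Path


def hash_directory(data_dir: str) -> str:
    """Compute one MD5 hash over the relative paths and contents of every
    file in the preprocessed data directory."""
    md5 = hashlib.md5()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            md5.update(str(path.relative_to(data_dir)).encode())
            md5.update(path.read_bytes())  # stream in chunks for very large files
    return md5.hexdigest()


def save_hash(data_dir: str, config_path: str = "data_config.json") -> None:
    """Record the current hash in a config file that goes into git."""
    cfg = {"data_md5": hash_directory(data_dir)}
    Path(config_path).write_text(json.dumps(cfg, indent=2))


def check_hash(data_dir: str, config_path: str = "data_config.json") -> None:
    """Fail fast if the data on disk no longer matches the recorded hash."""
    expected = json.loads(Path(config_path).read_text())["data_md5"]
    actual = hash_directory(data_dir)
    if actual != expected:
        raise RuntimeError(f"Preprocessed data changed: {actual} != {expected}")
```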

More info:

Datasheets for Datasets: https://arxiv.org/pdf/1803.09010.pdf
