A few good-to-haves for training data tracking

This post is a follow-up on a discussion about recommended ways to keep a record of the preprocessed data used for training.

Issues:

a. the data is too large to push to the code repository (and it is not secure to put it on GitHub either)

b. data from different batches have different fields.

Potential good practices to consider:

  • set up and modularize the preprocessing pipeline, from raw data to the preprocessed results

  • set the random seed (a seeding sketch is below)

  • record dependencies and their version numbers (a sketch for dumping the environment is below)

  • version control the code

  • stick to one sheet for the info of all cases (multiple files might lead to missing or confusing entries in the future)

  • could use dcmdump to extract header features for DICOM cases (a pydicom sketch is below)

  • the file can be unstructured, but it needs to be comprehensive

  • compute the MD5 hash of the preprocessed data directory and save it in a config file (version control it together with the code in git); a hashing sketch is below

  • compare the MD5 hash against the recorded one in each experiment

  • only save the preprocessed output when it takes a long time to generate

  • use DVC to version the data outside git (it tracks the data and commits only a small .dvc pointer file)
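
For the random-seed item, a minimal sketch of seeding everything in one place (assuming NumPy and PyTorch are the libraries in use; the function name and seed value are arbitrary):

```python
import os
import random

import numpy as np
import torch  # assumption: PyTorch is the training framework


def seed_everything(seed: int = 42) -> None:
    """Seed every common source of randomness so the preprocessing
    (shuffling, sampling, augmentation) is reproducible."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


seed_everything(42)
```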
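
For the dependencies item, a sketch that dumps every installed package and its version to a text file committed next to the code (the file name environment.txt is an example; `pip freeze > requirements.txt` does the same job):

```python
from importlib.metadata import distributions


def dump_environment(path: str = "environment.txt") -> None:
    """Write every installed package and its version to a text file,
    so the exact environment is recorded alongside the code."""
    pkgs = sorted((d.metadata["Name"], d.version) for d in distributions())
    with open(path, "w") as f:
        for name, version in pkgs:
            f.write(f"{name}=={version}\n")


dump_environment()
```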
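
For the DICOM item, dcmdump from DCMTK prints the headers directly on the command line; here is an equivalent sketch using pydicom instead, pulling a few header fields per file into the single tracking sheet (the field list and file names are assumptions, adjust them to whatever the sheet tracks):

```python
import csv
from pathlib import Path

import pydicom  # alternative to the dcmdump CLI for reading DICOM headers

# Example fields to record per case; extend as needed.
FIELDS = ["PatientID", "StudyDate", "Modality", "Rows", "Columns", "PixelSpacing"]


def dicom_features(dicom_dir: str, out_csv: str = "cases.csv") -> None:
    """Read the header of every .dcm file under dicom_dir and write one
    row per file into the tracking sheet."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file"] + FIELDS)
        for path in sorted(Path(dicom_dir).rglob("*.dcm")):
            ds = pydicom.dcmread(path, stop_before_pixels=True)
            writer.writerow([str(path)] + [str(getattr(ds, tag, "")) for tag in FIELDS])
```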
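
For the MD5 items, a sketch that hashes the preprocessed data directory, saves the hash in a small config file committed with the code, and checks it at the start of each experiment (the config file name and function names are assumptions):

```python
import hashlib
import json
from pathlib import Path


def hash_directory(data_dir: str) -> str:
    """Compute one MD5 hash over the relative paths and contents of every
    file in the preprocessed data directory."""
    md5 = hashlib.md5()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            md5.update(str(path.relative_to(data_dir)).encode())
            md5.update(path.read_bytes())  # stream in chunks for very large files
    return md5.hexdigest()


def save_hash(data_dir: str, config_path: str = "data_config.json") -> None:
    """Record the current hash in a config file that goes into git."""
    cfg = {"data_md5": hash_directory(data_dir)}
    Path(config_path).write_text(json.dumps(cfg, indent=2))


def check_hash(data_dir: str, config_path: str = "data_config.json") -> None:
    """Fail fast if the data on disk no longer matches the recorded hash."""
    expected = json.loads(Path(config_path).read_text())["data_md5"]
    actual = hash_directory(data_dir)
    if actual != expected:
        raise RuntimeError(f"Preprocessed data changed: {actual} != {expected}")
```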

More info:

Datasheets for Datasets: https://arxiv.org/pdf/1803.09010.pdf
