Data Versioning Concepts¶
Automatic Data Versioning¶
ML Data Stored in Cloud Object Stores is automatically versioned at the file system level
- All changes are recorded and every version of data is perpetually preserved
Benefits¶
A consistent view of data is presented to the ML program
- Changes made to the data after the program starts are not visible to the program, thus preventing irreproducible results
This consistent view is preserved forever and a parameter is logged to the mlflow run
- In the future, Data Scientists can go back and view the exact data that was used in the ML run
- These versions are read-only and cannot be modified
- In the future, if the Data Scientist wants to make small changes to the data and re-run the experiment, then he/she can copy the version that the original run refers to, make the changes to the copy and re-run the experiment