Skip to content

Data Versioning Concepts

Automatic Data Versioning

ML Data Stored in Cloud Object Stores is automatically versioned at the file system level

  • All changes are recorded and every version of data is perpetually preserved

Benefits

A consistent view of data is presented to the ML program

  • Changes made to the data after the program starts are not visible to the program, thus preventing irreproducible results

This consistent view is preserved forever and a parameter is logged to the mlflow run

  • In the future, Data Scientists can go back and view the exact data that was used in the ML run
  • These versions are read-only and cannot be modified
  • In the future, if the Data Scientist wants to make small changes to the data and re-run the experiment, then he/she can copy the version that the original run refers to, make the changes to the copy and re-run the experiment