Data Versioning Usage
Import
infin_boto3 must be imported from the infinstor package:
from infinstor import infin_boto3
For this import to succeed, the infinstor Python package must first be installed from PyPI:
pip install infinstor
Code Example
Use boto3 as you normally would. The following example downloads the images under the prefix test_flower_photos/ in the bucket jaganes-testbucket-2 and displays them in a JupyterLab cell:
import infinstor
import os
import boto3
from infinstor import infin_boto3  # must be imported for data versioning
import tempfile
from IPython.display import Image, display
import mlflow

BUCKET = 'jaganes-testbucket-2'
PREFIX = 'test_flower_photos/'

# Local scratch directory for the downloaded images
tmpdir = tempfile.mkdtemp()
with mlflow.start_run() as run:
    client = boto3.client('s3')
    resp = client.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, Delimiter='/')
    # 'Contents' is absent when no object matches the prefix
    for one in resp.get('Contents', []):
        nm = one['Key']
        if nm.endswith('/'):
            continue  # skip keys that only represent a folder
        # Save under the part of the key after the last '/'
        dnm = os.path.join(tmpdir, os.path.basename(nm))
        print("Downloading " + nm + " to " + dnm)
        client.download_file(BUCKET, nm, dnm)
        display(Image(dnm))
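The key-filtering step in the loop above can be tried without any S3 access. The following is a minimal sketch only: plan_downloads is a hypothetical helper (not part of infinstor or boto3), and the sample response and the key rose.jpg are made up for illustration. It takes a dict shaped like a list_objects_v2 response and returns the (key, local path) pairs the loop would download.

```python
import os


def plan_downloads(resp, dest_dir):
    """Return (key, local_path) pairs for every non-folder key in resp.

    resp is a dict shaped like a boto3 list_objects_v2 response.
    Hypothetical helper for illustration; not part of infinstor.
    """
    plan = []
    for obj in resp.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # folder placeholder, nothing to download
        plan.append((key, os.path.join(dest_dir, os.path.basename(key))))
    return plan


# Hand-built sample response; the object key is invented for this sketch.
sample = {
    "Contents": [
        {"Key": "test_flower_photos/"},
        {"Key": "test_flower_photos/rose.jpg"},
    ]
}
for key, path in plan_downloads(sample, "photos"):
    print(key, "->", path)
```

Keeping this step separate from the download calls makes it easy to check which objects a run would read before touching the bucket.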
Auto-logged mlflow param
Observe that the MLflow run description includes a parameter infinstor_snapshot_time, as shown below. Its value is the snapshot time in milliseconds since the Unix epoch, i.e. the time of the frozen view of the data. In this example it matches the run's start_time.
{
  "info": {
    "artifact_uri": "s3://infinstor-mlflow-artifacts-ai.isstage4.com/mlflow-artifacts/azuread_jagane@infinstor.com/2/2-16430576763570000000178",
    "end_time": 1643057698877,
    "experiment_id": "2",
    "lifecycle_stage": "active",
    "run_id": "2-16430576763570000000178",
    "run_uuid": "2-16430576763570000000178",
    "start_time": 1643057676357,
    "status": "FINISHED",
    "user_id": "azuread_jagane@infinstor.com"
  },
  "data": {
    "metrics": {},
    "params": {
      "infinstor_snapshot_time": "1643057676357"
    },
    "tags": {
      "mlflow.user": "azuread_jagane@infinstor.com",
      "mlflow.source.name": "/home/jagane/anaconda3/lib/python3.9/site-packages/ipykernel_launcher.py",
      "mlflow.source.type": "LOCAL"
    }
  }
}
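Because the parameter is stored as a string of epoch milliseconds, it is not easy to read at a glance. The following sketch converts it to a UTC datetime using only the Python standard library. The function name snapshot_time_to_datetime is a hypothetical helper, not part of infinstor or MLflow.

```python
from datetime import datetime, timedelta, timezone


def snapshot_time_to_datetime(param_value):
    """Convert an epoch-milliseconds string to a timezone-aware UTC datetime.

    Hypothetical helper for illustration; not part of infinstor.
    """
    millis = int(param_value)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Integer timedelta arithmetic avoids floating-point rounding.
    return epoch + timedelta(milliseconds=millis)


# Value taken from the run description above
snapshot = snapshot_time_to_datetime("1643057676357")
print(snapshot.isoformat())  # 2022-01-24T20:54:36.357000+00:00
```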
UI for Browsing Snapshot
If you open the MLflow page for the corresponding run, you will see a new UI component that lets you browse the cloud object store as it existed at the snapshot time.
Re-running with same data
To re-run against the data as a previous run saw it, set the environment variable INFINSTOR_SNAPSHOT_TIME to run:/ followed by that run's ID, for example run:/2-16430576763570000000178 (no trailing slash). The following sets this environment variable in a JupyterLab cell:
%env INFINSTOR_SNAPSHOT_TIME run:/2-16430576763570000000178
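Outside a notebook, the same variable can be set from Python with os.environ. The sketch below is only an illustration: snapshot_env_value is a hypothetical helper that builds the run:/ value and strips any trailing slash. It also assumes the variable should be set before any other infinstor or MLflow code runs; the text above does not say when the variable is read, so setting it first is the cautious choice.

```python
import os


def snapshot_env_value(run_id):
    """Build the INFINSTOR_SNAPSHOT_TIME value for a previous run's ID.

    Hypothetical helper for illustration; not part of infinstor.
    """
    run_id = run_id.strip().rstrip("/")  # the value must not end in a slash
    if not run_id:
        raise ValueError("run_id must not be empty")
    return "run:/" + run_id


# Assumption: set this before starting the MLflow run that should reuse the snapshot.
os.environ["INFINSTOR_SNAPSHOT_TIME"] = snapshot_env_value("2-16430576763570000000178")
print(os.environ["INFINSTOR_SNAPSHOT_TIME"])  # run:/2-16430576763570000000178
```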