Data Versioning Usage

Import

Import infin_boto3 from the infinstor package:

from infinstor import infin_boto3

For the import above to succeed, the infinstor Python package must be installed from PyPI:

pip install infinstor

Code Example

Use boto3 as you normally would. The following example downloads images from the prefix test_flower_photos/ in the bucket jaganes-testbucket-2 and displays them in a jupyterlab cell:

import infinstor
import os
import boto3
from infinstor import infin_boto3
import tempfile
from IPython.display import Image, display
import mlflow

bucket = 'jaganes-testbucket-2'
prefix = 'test_flower_photos/'

# Scratch directory for the downloaded images
tmpdir = tempfile.mkdtemp()
with mlflow.start_run() as run:
    client = boto3.client('s3')
    resp = client.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter='/')
    # 'Contents' is absent when nothing matches the prefix
    for one in resp.get('Contents', []):
        nm = one['Key']
        # Skip directory placeholder keys, which end with '/'
        if not nm.endswith('/'):
            dnm = os.path.join(tmpdir, os.path.basename(nm))
            print("Downloading " + nm + " to " + dnm)
            client.download_file(bucket, nm, dnm)
            display(Image(dnm))
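
A single list_objects_v2 call returns at most 1000 keys, so the listing above covers only the first page of a larger prefix. The following is a minimal sketch of iterating over all pages with a standard boto3 paginator. The helper name iter_object_keys is hypothetical, and this page does not confirm whether infin_boto3 applies the snapshot view to paginated calls, so treat that as an assumption to verify.

```python
def iter_object_keys(client, bucket, prefix):
    """Yield every key directly under prefix, across all result pages.

    `client` is expected to be a boto3 S3 client. This helper is
    illustrative and not part of infinstor.
    """
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # 'Contents' is absent on pages that hold no objects
        for obj in page.get("Contents", []):
            # Skip directory placeholder keys, which end with '/'
            if not obj["Key"].endswith("/"):
                yield obj["Key"]
```

Inside the mlflow run, the loop would then read: for nm in iter_object_keys(client, 'jaganes-testbucket-2', 'test_flower_photos/').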

Auto-logged mlflow param

Observe that the MLflow run description includes a parameter infinstor_snapshot_time, as shown below. Its value is the snapshot time in epoch milliseconds, that is, the time of the frozen view of the data.

{
    "info": {
        "artifact_uri": "s3://infinstor-mlflow-artifacts-ai.isstage4.com/mlflow-artifacts/azuread_jagane@infinstor.com/2/2-16430576763570000000178",
        "end_time": 1643057698877,
        "experiment_id": "2",
        "lifecycle_stage": "active",
        "run_id": "2-16430576763570000000178",
        "run_uuid": "2-16430576763570000000178",
        "start_time": 1643057676357,
        "status": "FINISHED",
        "user_id": "azuread_jagane@infinstor.com"
    },
    "data": {
        "metrics": {},
        "params": {
            "infinstor_snapshot_time": "1643057676357"
        },
        "tags": {
            "mlflow.user": "azuread_jagane@infinstor.com",
            "mlflow.source.name": "/home/jagane/anaconda3/lib/python3.9/site-packages/ipykernel_launcher.py",
            "mlflow.source.type": "LOCAL"
        }
    }
}
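
Because the parameter is stored as a string of epoch milliseconds, it can help to convert it to a readable timestamp. The following is a minimal sketch that assumes the run description has already been loaded into a Python dict with the layout shown above. The helper name snapshot_time_utc is hypothetical and not part of infinstor or mlflow.

```python
from datetime import datetime, timedelta, timezone


def snapshot_time_utc(run_description):
    """Return the run's infinstor_snapshot_time param as a UTC datetime."""
    millis = int(run_description["data"]["params"]["infinstor_snapshot_time"])
    # Integer arithmetic avoids float rounding of the millisecond part
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=millis)


# Trimmed copy of the run description shown above
example_run = {"data": {"params": {"infinstor_snapshot_time": "1643057676357"}}}
print(snapshot_time_utc(example_run).isoformat())
```

In this example the snapshot time equals the run's start_time, 1643057676357.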

UI for Browsing Snapshot

The MLflow page for the corresponding run includes a new UI component that lets you browse the cloud object store as it existed at the snapshot time.

Re-running with same data

To re-run using a previous run's snapshot time, set the environment variable INFINSTOR_SNAPSHOT_TIME, for example, to run:/2-16430576763570000000178 (no trailing slash), where the part after run:/ is the earlier run's run_id. The following example sets this environment variable in a jupyterlab cell:

%env INFINSTOR_SNAPSHOT_TIME run:/2-16430576763570000000178
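
Outside a notebook, the same variable can be set from Python with os.environ. The following is a minimal sketch. The helper name snapshot_env_value is hypothetical, and setting the variable before importing infin_boto3 or creating the boto3 client is an assumption, since this page does not state when the variable is read.

```python
import os


def snapshot_env_value(run_id):
    """Build the run:/<run_id> form shown above, with no trailing slash."""
    return "run:/" + run_id.strip("/")


# Run ID copied from the example run description above
os.environ["INFINSTOR_SNAPSHOT_TIME"] = snapshot_env_value("2-16430576763570000000178")
print(os.environ["INFINSTOR_SNAPSHOT_TIME"])
```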