Restoring Dataset Provenance

restore_dataset_provenance.py

This sample illustrates how a dataset producer can create a dataset and how the dataset's provenance can be restored after it is lost during copying or other transformations.

Summary

When digital objects are copied across physical media, across cloud environments, or within a cloud environment, they typically lose provenance information such as last-modified times. This can lead to severe problems when timestamps are critical provenance metadata. For example:

  • When security logs are copied, modifications and tampering may not be detectable.

  • When financial datasets are copied, their revision timestamps are lost.

  • Point-in-time and bitemporal data lose their revision timestamps.

This sample illustrates how such timestamps can be restored for vBase datasets after the underlying digital objects are copied to a new AWS S3 bucket.
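
The loss is easy to reproduce without any cloud services. As a local stand-in for what happens to object metadata during an S3 copy, the following sketch (plain Python standard library; file names are illustrative) shows that a naive file copy discards the source's modification time:

```python
import os
import shutil
import tempfile
import time

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "original.csv")
    dst = os.path.join(tmp, "copy.csv")
    with open(src, "w", encoding="utf-8") as f:
        f.write("timestamp,value\n")
    # Backdate the original by one day to simulate an older record.
    day_ago = time.time() - 86400
    os.utime(src, (day_ago, day_ago))
    # shutil.copy copies contents and permissions, not the modification time.
    shutil.copy(src, dst)
    mtime_delta = os.path.getmtime(dst) - os.path.getmtime(src)

print(f"Copy is newer than the original by {mtime_delta:.0f} seconds")
```

The copy's modification time is the time of the copy, so the one-day-old provenance of the original is gone. Using shutil.copy2 would preserve the local timestamp, but no copy mechanism can prove to a third party that the preserved timestamp is authentic, which is the gap the vBase commitments fill.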

Detailed Description

  • A .env file defines the following environment variables for accessing the AWS S3 service and vBase services:

    # Forwarder config
    # vBase test forwarder URL
    VBASE_FORWARDER_URL="https://test.api.vbase.com/forwarder/"
    # vBaseTest API key
    VBASE_API_KEY="YOUR_VBASE_API_KEY"
    
    # Private key for making commitments
    VBASE_COMMITMENT_SERVICE_PRIVATE_KEY="YOUR_VBASE_COMMITMENT_SERVICE_PRIVATE_KEY"
    
    # AWS Configuration
    AWS_ACCESS_KEY_ID="YOUR_AWS_ACCESS_KEY_ID"
    AWS_SECRET_ACCESS_KEY="YOUR_AWS_SECRET_ACCESS_KEY"
  • Create a vBase client object using connection parameters specified in environment variables:

    vbc = VBaseClient.create_instance_from_env()
  • Create an AWS S3 client object using connection parameters specified in environment variables:

    boto_client = create_s3_client_from_env()
  • Create the test dataset. If this is a new dataset, this operation records that the user with the above VBASE_COMMITMENT_SERVICE_PRIVATE_KEY has created the named dataset. Dataset creation is idempotent: if the dataset already exists, the call is a no-op. Such commitments are used to validate that a given collection of user datasets is complete and to mitigate Sybil attacks (https://en.wikipedia.org/wiki/Sybil_attack).

    ds = VBaseDataset(vbc, SET_NAME, VBaseIntObject)
  • Add a record to the dataset. This records an object commitment for a set record. A set record commitment establishes that a dataset record with a given CID has existed at a given time for a given set.

    # In the sample, records are added in a loop; i is the integer object being committed.
    vbase_receipt = ds.add_record(i)
  • Verify that a given set commitment exists for a given user. This will typically be called by the data consumer to verify a producer's claims about dataset provenance.

    assert ds.verify_commitments()[0]
  • Copy a folder to another folder. This could also be a copy to a different bucket or to different storage.

    copy_s3_bucket(
        boto_client=boto_client,
        source_bucket_name=BUCKET_NAME,
        source_folder_name=FOLDER_NAME,
        destination_bucket_name=BUCKET_NAME,
        destination_folder_name=COPY_FOLDER_NAME,
    )
  • Create a vBase dataset using the copied objects. These objects have lost their original timestamps.

    ds_copy = VBaseDataset(vbc, SET_NAME, VBaseIntObject)
    # Load all objects into the dataset.
    ds_copy = init_vbase_dataset_from_s3_objects(
        ds_copy, boto_client, BUCKET_NAME, COPY_FOLDER_NAME
    )
  • Attempt to verify the copied objects. Since the copies have lost their timestamps and original provenance information, the checks will fail:

    success, l_log = ds_copy.verify_commitments()
    assert not success
    print("Verification log:")
    for log in l_log:
        print(log)
  • Fix the record timestamps using the vBase commitment information and verify the corrected provenance data:

    # Fix the timestamps.
    assert ds_copy.try_restore_timestamps_from_index()[0]
    
    print("Dataset fixed:")
    pprint.pprint(ds_copy.to_pd_object())
    
    # Verify the records again.
    assert ds_copy.verify_commitments()[0]
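
The restoration step above can be sketched conceptually. This is a toy illustration, not the vBase implementation: the cid helper, commitment_index, and copied_records below are hypothetical. The idea is that each record's content hash (its CID) survives the copy, so the lost timestamp can be recovered by looking the CID up in an independent index of (CID, timestamp) commitments recorded at write time:

```python
import hashlib

def cid(data: bytes) -> str:
    """Content identifier: a hash of the record's bytes (toy stand-in)."""
    return hashlib.sha256(data).hexdigest()

# Commitments recorded by the producer at write time (hypothetical data).
commitment_index = {
    cid(b"record-0"): "2024-01-01T00:00:00Z",
    cid(b"record-1"): "2024-01-02T00:00:00Z",
}

# Copied records: the content survived the copy, but the timestamps did not.
copied_records = [{"data": b"record-1"}, {"data": b"record-0"}]

# Restore each timestamp by recomputing the CID and looking it up.
for rec in copied_records:
    rec["timestamp"] = commitment_index[cid(rec["data"])]

restored = [rec["timestamp"] for rec in copied_records]
print(restored)
```

Because the index is maintained independently of the copied objects, the restored timestamps can also be verified by a consumer, which is what ds_copy.verify_commitments() does against the vBase commitment service.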