Restoring Dataset Provenance
This sample illustrates how a dataset producer can create a dataset and how the provenance of the dataset can be restored after loss during copying or other transformations.
You can find the implementation in restore_dataset_provenance.py.
Summary
When digital objects are copied across physical media, between cloud environments, or within a single cloud environment, they typically lose provenance metadata such as the last-modified time. This can cause severe problems in cases where timestamps are critical provenance metadata.
For example:
When security logs are copied, modifications and tampering may not be detectable.
When financial datasets are copied, their revision timestamps are lost.
Point-in-time and bitemporal data lose the revision timestamps.
This sample illustrates how such timestamps can be restored for vBase datasets after the underlying digital objects are copied to a new AWS S3 bucket.
Detailed Description
A .env file defines the following environment variables for accessing the AWS S3 service and vBase services:

```
# Forwarder config
# vBase test forwarder URL
VBASE_FORWARDER_URL="https://test.api.vbase.com/forwarder/"
# vBase test API key
VBASE_API_KEY="YOUR_VBASE_API_KEY"
# Private key for making commitments
VBASE_COMMITMENT_SERVICE_PRIVATE_KEY="YOUR_VBASE_COMMITMENT_SERVICE_PRIVATE_KEY"
# AWS configuration
AWS_ACCESS_KEY_ID="YOUR_AWS_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY="YOUR_AWS_SECRET_ACCESS_KEY"
```
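The sample presumably loads these variables into the process environment before the vBase and S3 clients read them (for example, with a library such as python-dotenv). To illustrate the file format only, here is a minimal stdlib-only sketch of such a loader; the function name and parsing rules are assumptions, not the sample's actual code:

```python
import os


def load_dotenv_text(text: str) -> dict:
    """Parse KEY="value" lines from .env-style text, skipping blanks and comments.

    A simplified illustration: real .env loaders handle quoting, escapes,
    and interpolation that this sketch ignores.
    """
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env


sample = '''
# vBase test forwarder URL
VBASE_FORWARDER_URL="https://test.api.vbase.com/forwarder/"
VBASE_API_KEY="YOUR_VBASE_API_KEY"
'''
# Make the parsed variables visible to code that reads os.environ.
os.environ.update(load_dotenv_text(sample))
```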
Create a vBase client object using connection parameters specified in environment variables:
vbc = VBaseClient.create_instance_from_env()
Create an AWS S3 client object using connection parameters specified in environment variables:
boto_client = create_s3_client_from_env()
Create the test dataset. If this is a new dataset, this operation records that the user with the above VBASE_COMMITMENT_SERVICE_PRIVATE_KEY has created the named dataset. Dataset creation is idempotent: if the dataset already exists, the call is ignored. Such commitments are used to validate that a given collection of user datasets is complete and to mitigate Sybil attacks (https://en.wikipedia.org/wiki/Sybil_attack).
ds = VBaseDataset(vbc, SET_NAME, VBaseIntObject)
Add a record to the dataset. This records an object commitment for a set record: a set record commitment establishes that a dataset record with a given CID existed at a given time for a given set.
vbase_receipt = ds.add_record(i)
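The `i` above suggests the record is added inside a producer loop that also stores the underlying object in S3. The following is a hypothetical sketch of such a loop; the `publish_records` helper, the S3 key layout, and the receipt handling are illustrative assumptions, not the sample's actual code:

```python
def publish_records(ds, boto_client, bucket, folder, values):
    """Hypothetical producer loop: store each object in S3 and commit it to vBase.

    Assumes ds.add_record() returns a commitment receipt and boto_client
    follows the standard S3 put_object(Bucket=, Key=, Body=) signature.
    """
    receipts = []
    for i in values:
        # Store the object itself in S3 (illustrative key layout).
        boto_client.put_object(
            Bucket=bucket, Key=f"{folder}obj_{i}.json", Body=str(i).encode()
        )
        # Record a timestamped commitment for the object in vBase.
        receipts.append(ds.add_record(i))
    return receipts
```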
Verify that a given set commitment exists for a given user. This will typically be called by the data consumer to verify a producer’s claims about dataset provenance.
assert ds.verify_commitments()[0]
Copy a folder to another folder. The copy could also target a different bucket or a different storage system entirely.
```python
copy_s3_bucket(
    boto_client=boto_client,
    source_bucket_name=BUCKET_NAME,
    source_folder_name=FOLDER_NAME,
    destination_bucket_name=BUCKET_NAME,
    destination_folder_name=COPY_FOLDER_NAME,
)
```
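`copy_s3_bucket` is a helper from the sample. For readers who want to see what an equivalent server-side copy looks like with the standard boto3 client API, here is a sketch using `list_objects_v2` pagination and `copy_object`; the function name and prefix handling are assumptions:

```python
def copy_s3_folder(client, src_bucket, src_prefix, dst_bucket, dst_prefix):
    """Server-side copy of every object under src_prefix to dst_prefix.

    Note: S3 CopyObject creates new objects with fresh LastModified
    timestamps, which is exactly how the original provenance is lost.
    """
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            client.copy_object(
                Bucket=dst_bucket,
                # Re-root the key under the destination prefix.
                Key=dst_prefix + key[len(src_prefix):],
                CopySource={"Bucket": src_bucket, "Key": key},
            )
```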
Create a vBase dataset using the copied objects. These objects have lost the original timestamps.
```python
ds_copy = VBaseDataset(vbc, SET_NAME, VBaseIntObject)
# Load all objects into the dataset.
ds_copy = init_vbase_dataset_from_s3_objects(
    ds_copy, boto_client, BUCKET_NAME, COPY_FOLDER_NAME
)
```
Attempt to verify the copied objects. Since these have lost the timestamps and the original provenance information, the checks will fail:
```python
success, l_log = ds_copy.verify_commitments()
assert not success
print("Verification log:")
for log in l_log:
    print(log)
```
Fix the record timestamps using the vBase commitment information and verify the corrected provenance data:
```python
# Fix the timestamps.
assert ds_copy.try_restore_timestamps_from_index()[0]
print("Dataset fixed:")
pprint.pprint(ds_copy.to_pd_object())
# Verify the records again.
assert ds_copy.verify_commitments()[0]
```