Restoring Dataset Provenance
restore_dataset_provenance.py
This sample illustrates how a dataset producer can create a dataset and how the provenance of the dataset can be restored after loss during copying or other transformations.
Summary
When digital objects are copied across physical media, across cloud environments, or within a cloud environment, they typically lose provenance information such as the last modified time. This can cause severe problems when the timestamps are critical provenance metadata. For example:
When security logs are copied, modifications and tampering may not be detectable.
When financial datasets are copied, their revision timestamps are lost.
Point-in-time and bitemporal data lose the revision timestamps.
This sample illustrates how such timestamps can be restored for vBase datasets after the underlying digital objects are copied to a new AWS S3 bucket.
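The timestamp loss described above is easy to reproduce on a local file system. The following is a minimal sketch using only the Python standard library; the file names and the backdated time are arbitrary choices made for illustration and are not part of the sample.

```python
import os
import shutil
import tempfile


def copy_and_compare_mtime(original_mtime: float) -> tuple[float, float]:
    """Copy a file with shutil.copyfile and return (source mtime, copy mtime)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "record.csv")
        dst = os.path.join(tmp, "record_copy.csv")
        with open(src, "w") as f:
            f.write("date,value\n2023-01-02,42\n")
        # Backdate the source to stand in for the original revision time.
        os.utime(src, (original_mtime, original_mtime))
        # copyfile copies file contents only; metadata such as mtime is not kept.
        shutil.copyfile(src, dst)
        return os.path.getmtime(src), os.path.getmtime(dst)


# 1_000_000_000 seconds after the epoch falls in September 2001.
src_mtime, dst_mtime = copy_and_compare_mtime(1_000_000_000.0)
print("source mtime:", src_mtime)
print("copy mtime:  ", dst_mtime)
```

The copy carries the time of the copy operation rather than the original revision time, which is the loss this sample sets out to repair.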
Detailed Description
A .env file defines the following environment variables for accessing the AWS S3 service and vBase services:
Create a vBase client object using connection parameters specified in environment variables:
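The vBase client constructor itself is not reproduced in this text, so the sketch below covers only the preceding step: collecting connection parameters from the environment and failing early if any are missing. VBASE_COMMITMENT_SERVICE_PRIVATE_KEY is named in this document; the two AWS names are the standard AWS credential variables and are assumed here. The helper `load_settings` is hypothetical. A package such as python-dotenv can be used to load the .env file into the process environment first.

```python
import os
from typing import Mapping

# VBASE_COMMITMENT_SERVICE_PRIVATE_KEY appears in this document.
# The AWS names are the standard credential variables (an assumption here).
REQUIRED_VARS = (
    "VBASE_COMMITMENT_SERVICE_PRIVATE_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the required settings, or raise KeyError listing what is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Checking every variable up front gives one clear error message instead of a failure partway through the sample.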
Create an AWS S3 client object using connection parameters specified in environment variables:
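One way to build such a client is with boto3, sketched below. The variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard AWS credential variables and are assumed rather than taken from the sample; the helper names are hypothetical.

```python
from typing import Mapping


def s3_client_kwargs(env: Mapping[str, str]) -> dict[str, str]:
    """Map standard AWS credential variables to boto3.client keyword arguments."""
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }


def create_s3_client(env: Mapping[str, str]):
    """Create an S3 client from explicit credentials (requires boto3)."""
    # Imported lazily so the helper above can be used without boto3 installed.
    import boto3

    return boto3.client("s3", **s3_client_kwargs(env))
```

Typical usage would be `s3 = create_s3_client(os.environ)` after the .env file has been loaded.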
Create the test dataset. If this is a new dataset, this operation records that the user with the above VBASE_COMMITMENT_SERVICE_PRIVATE_KEY has created the named dataset. Dataset creation is idempotent: if the dataset already exists, the call is ignored. Such commitments are used to validate that a given collection of user datasets is complete and to mitigate Sybil attacks (https://en.wikipedia.org/wiki/Sybil_attack).
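The idempotent behavior described above can be modeled with a small in-memory log. This is a toy stand-in, not the vBase API: the class `ToySetLog`, its methods, and the use of SHA3-256 to derive a set identifier from the name are all assumptions made for illustration.

```python
import hashlib
import time


class ToySetLog:
    """In-memory stand-in for a commitment service (hypothetical, not vBase)."""

    def __init__(self) -> None:
        # Maps (user, set identifier) to the time the set was first recorded.
        self._sets: dict[tuple[str, str], float] = {}

    @staticmethod
    def set_cid(name: str) -> str:
        """Derive a set identifier from the dataset name (assumed scheme)."""
        return hashlib.sha3_256(name.encode("utf-8")).hexdigest()

    def add_set(self, user: str, name: str) -> bool:
        """Record the set once; return False if it was already recorded."""
        key = (user, self.set_cid(name))
        if key in self._sets:
            return False  # Repeat call is ignored, so creation is idempotent.
        self._sets[key] = time.time()
        return True

    def set_count(self, user: str) -> int:
        """Number of sets recorded for a user."""
        return sum(1 for owner, _ in self._sets if owner == user)


log = ToySetLog()
first = log.add_set("alice", "test_dataset")
second = log.add_set("alice", "test_dataset")
print(first, second, log.set_count("alice"))
```

Because each set is recorded once per user, a consumer can compare the recorded sets against the collection a producer presents, which is the completeness check the text refers to.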
Add a record to the dataset. This records an object commitment for a set record. A set record commitment establishes that a dataset record with a given CID existed at a given time for a given set.
Verify that a given set commitment exists for a given user. This will typically be called by the data consumer to verify a producer's claims about dataset provenance.
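The relationship between a record, its CID, and its committed timestamp can be sketched with a toy log. Nothing here is the vBase API: `ToyRecordLog`, its method names, and the use of SHA3-256 as the content identifier are assumptions for illustration only.

```python
import hashlib
from typing import Optional


def cid(data: bytes) -> str:
    """Content identifier for a record (SHA3-256 is an assumed choice)."""
    return hashlib.sha3_256(data).hexdigest()


class ToyRecordLog:
    """In-memory stand-in for set record commitments (hypothetical)."""

    def __init__(self) -> None:
        # Maps (user, set CID, record CID) to the committed timestamp.
        self._records: dict[tuple[str, str, str], float] = {}

    def add_set_object(
        self, user: str, set_name: str, record: bytes, timestamp: float
    ) -> str:
        """Commit a record to a set; the first commitment's timestamp is kept."""
        record_cid = cid(record)
        key = (user, cid(set_name.encode("utf-8")), record_cid)
        self._records.setdefault(key, timestamp)
        return record_cid

    def committed_at(
        self, user: str, set_name: str, record: bytes
    ) -> Optional[float]:
        """Return the committed timestamp, or None if no commitment exists."""
        key = (user, cid(set_name.encode("utf-8")), cid(record))
        return self._records.get(key)

    def verify(
        self, user: str, set_name: str, record: bytes, claimed: float
    ) -> bool:
        """Check a producer's claim that the record existed at a given time."""
        return self.committed_at(user, set_name, record) == claimed


records = ToyRecordLog()
records.add_set_object("alice", "test_dataset", b"TestRecord1", 1_700_000_000.0)
ok = records.verify("alice", "test_dataset", b"TestRecord1", 1_700_000_000.0)
wrong_time = records.verify("alice", "test_dataset", b"TestRecord1", 1_700_000_999.0)
wrong_user = records.verify("bob", "test_dataset", b"TestRecord1", 1_700_000_000.0)
print(ok, wrong_time, wrong_user)
```

The consumer needs only the record content, the set name, and the producer's identity to run the check; the timestamp comes from the commitment rather than from file metadata.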
Copy a folder to another folder. The destination could also be a different bucket or a different storage system.
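A folder copy within S3 can be sketched with boto3's `list_objects_v2` paginator and `copy_object` call. The function name `copy_prefix` and its parameters are hypothetical, and the sample's own copy helper may differ. S3 assigns each copied object a new LastModified time, which is why the copy loses the original timestamps.

```python
from typing import Optional


def copy_prefix(
    s3,
    bucket: str,
    src_prefix: str,
    dst_prefix: str,
    dst_bucket: Optional[str] = None,
) -> list[str]:
    """Copy every object under src_prefix to dst_prefix; return the new keys.

    `s3` is expected to be a boto3 S3 client.
    """
    dst_bucket = dst_bucket or bucket
    copied: list[str] = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
        # Pages for an empty prefix carry no "Contents" entry.
        for obj in page.get("Contents", []):
            dst_key = dst_prefix + obj["Key"][len(src_prefix):]
            s3.copy_object(
                Bucket=dst_bucket,
                Key=dst_key,
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
            )
            copied.append(dst_key)
    return copied
```

Passing `dst_bucket` covers the cross-bucket case mentioned above; leaving it unset copies within the same bucket.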
Create a vBase dataset using the copied objects. These objects have lost the original timestamps.
Attempt to verify the copied objects. Since these have lost the timestamps and the original provenance information, the checks will fail:
Fix the record timestamps using the vBase commitment information and verify the corrected provenance data:
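The failing check and the repair can be illustrated without any service at all. In the sketch below, a plain dictionary from content hash to committed timestamp stands in for the vBase commitment information; the function names and the SHA3-256 hash are assumptions, not the vBase API.

```python
import hashlib

Record = tuple[bytes, float]  # (content, timestamp)


def cid(data: bytes) -> str:
    """Content identifier for a record (SHA3-256 is an assumed choice)."""
    return hashlib.sha3_256(data).hexdigest()


def verify(records: list[Record], commitments: dict[str, float]) -> bool:
    """True only if every record's timestamp matches its commitment."""
    return all(commitments.get(cid(data)) == ts for data, ts in records)


def restore_timestamps(
    records: list[Record], commitments: dict[str, float]
) -> list[Record]:
    """Replace each record's timestamp with the committed one."""
    restored: list[Record] = []
    for data, _ in records:
        ts = commitments.get(cid(data))
        if ts is None:
            raise LookupError("no commitment found for record")
        restored.append((data, ts))
    return restored


# Original records and the commitments made when they were produced.
original = [(b"TestRecord1", 100.0), (b"TestRecord2", 200.0)]
commitments = {cid(data): ts for data, ts in original}

# After copying, contents are intact but timestamps reflect the copy time.
copied = [(data, 999.0) for data, _ in original]

failed_before = not verify(copied, commitments)
fixed = restore_timestamps(copied, commitments)
passed_after = verify(fixed, commitments)
print(failed_before, passed_after)
```

The repair works because the commitment is keyed by content, not by location or file metadata, so any faithful copy of a record can be matched back to its original timestamp.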