> For the complete documentation index, see [llms.txt](https://docs.vbase.com/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.vbase.com/python-sdk/index/restore_dataset_provenance.md).

# Restoring Dataset Provenance

This sample illustrates how a dataset producer can create a dataset and how the provenance of the dataset can be restored after loss during copying or other transformations.

You can find the implementation in [`restore_dataset_provenance.py`](https://github.com/validityBase/vbase-py-samples/blob/main/samples/restore_dataset_provenance.py).

## Summary <a href="#summary" id="summary"></a>

When digital objects are copied across physical media, across cloud environments, and within a cloud environment, they typically lose provenance information such as last modified time. This can lead to severe problems in cases where the timestamps are critical provenance metadata.

For example:

* When security logs are copied, modifications and tampering may not be detectable.
* When financial datasets are copied, their revision timestamps are lost.
* Point-in-time and bitemporal data lose the revision timestamps.

This sample illustrates how such timestamps can be restored for vBase datasets after the underlying digital objects are copied to a new AWS S3 bucket.

## Detailed Description <a href="#detailed-description" id="detailed-description"></a>

* A `.env` file defines the following environment variables for accessing AWS S3 service and vBase services:

  ```shell
  # Forwarder config
  # vBase test forwarder URL
  VBASE_FORWARDER_URL="https://test.api.vbase.com/forwarder/"
  # vBaseTest API key
  VBASE_API_KEY="YOUR_VBASE_API_KEY"


  # Private key for making commitments
  VBASE_COMMITMENT_SERVICE_PRIVATE_KEY="YOUR_VBASE_COMMITMENT_SERVICE_PRIVATE_KEY"

  # AWS Configuration
  AWS_ACCESS_KEY_ID="YOUR_AWS_ACCESS_KEY_ID"
  AWS_SECRET_ACCESS_KEY="YOUR_AWS_SECRET_ACCESS_KEY"
  ```
* Create a vBase client object using connection parameters specified in environment variables:

  ```python
  vbc = VBaseClient.create_instance_from_env()
  ```
* Create an AWS S3 client object using connection parameters specified in environment variables:
* Create the test dataset. If this is a new dataset, this operation records that the user with the above VBASE\_COMMITMENT\_SERVICE\_PRIVATE\_KEY has created the named dataset. Dataset creation is idempotent. If this is an existing dataset, the call is ignored. Such commitments are used to validate that a given collection of user datasets is complete and mitigates Sybil attacks (<https://en.wikipedia.org/wiki/Sybil\\_attack>).

  ```python
  ds = VBaseDataset(vbc, SET_NAME, VBaseIntObject)
  ```
* Add a record to the dataset. Records an object commitment for a set record. A set record commitment establishes that a dataset record with a given CID has existed at a given time for a given set.

  ```python
  vbase_receipt = ds.add_record(i)
  ```
* Verify that a given set commitment exists for a given user. This will typically be called by the data consumer to verify a producer’s claims about dataset provenance.

  ```python
  assert ds.verify_commitments()[0]
  ```
* Copy a folder to another folder. This could also be a copy to a different bucket, or a different storage.

  ```python
  copy_s3_bucket(
      boto_client=boto_client,
      source_bucket_name=BUCKET_NAME,
      source_folder_name=FOLDER_NAME,
      destination_bucket_name=BUCKET_NAME,
      destination_folder_name=COPY_FOLDER_NAME,
  )
  ```
* Attempt to verify the copied objects. Since these have lost the timestamps and the original provenance information, the checks will fail:

  ```python
  success, l_log = ds_copy.verify_commitments()
  assert not success
  print("Verification log:")
  for log in l_log:
      print(log)
  ```
* Fix the record timestamps using the vBase commitment information and verify the corrected provenance data:

  ```python
  # Fix the timestamps.
  assert ds_copy.try_restore_timestamps_from_index()[0]
  ```
* Copy a folder to another folder. This could also be a copy to a different bucket, or a different storage.

  ```python
  copy_s3_bucket(
      boto_client=boto_client,
      source_bucket_name=BUCKET_NAME,
      source_folder_name=FOLDER_NAME,
      destination_bucket_name=BUCKET_NAME,
      destination_folder_name=COPY_FOLDER_NAME,
  )
  ```
* Create a vBase dataset using the copied objects. These objects have lost the original timestamps.

  ```python
  ds_copy = VBaseDataset(vbc, SET_NAME, VBaseIntObject)
  # Load all objects into the dataset.
  ds_copy = init_vbase_dataset_from_s3_objects(
      ds_copy, boto_client, BUCKET_NAME, COPY_FOLDER_NAME
  )
  ```
* Attempt to verify the copied objects. Since these have lost the timestamps and the original provenance information, the checks will fail:

  ```python
  success, l_log = ds_copy.verify_commitments()
  assert not success
  print("Verification log:")
  for log in l_log:
      print(log)
  ```
* Fix the record timestamps using the vBase commitment information and verify the corrected provenance data:

  ```python
  # Fix the timestamps.
  assert ds_copy.try_restore_timestamps_from_index()[0]

  print("Dataset fixed:")
  pprint.pprint(ds_copy.get_pd_data_frame())

  # Verify the records again.
  assert ds_copy.verify_commitments()[0]
  ```


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://docs.vbase.com/python-sdk/index/restore_dataset_provenance.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
