Types of Data Assets

Data Assets

An internal Data Asset is a copy of a dataset stored in your Code Ocean virtual private cloud (VPC) deployment. You create one by uploading data from a local machine or by importing it from a cloud provider such as AWS or Google Cloud. An immutable copy of the data is saved on your deployment. Authorized users can download internal Data Assets from the Data Assets page and attach and access them in a Capsule or Pipeline. These assets are stored in your deployment's S3 and cached on your deployment's EFS for quick access while they are in active use.

An External Data Asset is a link to a dataset in a remote AWS S3 bucket. To establish the link, AWS credentials must be provided during setup (see the Secret Management Guide for details). The data remains in its original location and is only linked to your Code Ocean deployment. Only users who are authorized to access the original source can access the data through Code Ocean, and they must provide the appropriate credentials to use the External Data Asset in a Capsule or Pipeline. Because the data is not saved in Code Ocean, it cannot be downloaded directly.

Because the data stays in its original location, where it can change, workflows that use External Data Assets cannot be guaranteed to be reproducible.
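One way to notice when externally linked data has changed between runs is to record a content fingerprint each time a workflow runs. The following is a minimal sketch, not a Code Ocean feature: the `fingerprint` function is a hypothetical helper, and it assumes the asset's files are readable as an ordinary directory from inside the computation.

```python
import hashlib
from pathlib import Path


def fingerprint(root: Path) -> str:
    """Return a SHA-256 digest over the relative paths and contents of all files under root.

    Hypothetical helper for illustration; not part of any Code Ocean API.
    """
    digest = hashlib.sha256()
    # Sort so the digest does not depend on directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Saving the returned digest alongside the results of each run makes it possible to compare two runs later: if the digests differ, the linked data changed in between. Reading every file in full can be slow for large datasets, so this approach suits small or moderately sized assets.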

Results

A Captured Result is a Data Asset created from the output of a Capsule or Pipeline computation. It records the origin of this result, including the Capsule code version, type of run, input Data Assets, and Lineage Graph. These assets are saved on S3 and are cached on EFS for quick access when they are actively being used.

Provenance is recorded automatically for Result Data Assets.

External Result Data Assets can be created from Capsule or Pipeline results that are stored at a user-specified location in S3. A Lineage Graph and provenance are generated automatically for External Result Data Assets.

Combined Data Assets are Data Assets created from two or more External Data Assets that are already in your account. They can be used in Pipelines and allow you to parallelize at the level of whole Data Assets rather than the items within a single Data Asset. A Combined Data Asset stores the metadata of the Data Assets it is made of.
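The difference in parallelization unit can be illustrated locally. The sketch below is only an analogy and does not show how Code Ocean Pipelines schedule work. It assumes, purely for illustration, that a Combined Data Asset appears as a directory with one subdirectory per member Data Asset; `count_files` and `process_combined` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_files(asset_dir: Path) -> int:
    """Stand-in for real per-asset work: count the files in one member Data Asset."""
    return sum(1 for p in asset_dir.rglob("*") if p.is_file())


def process_combined(combined_root: Path, max_workers: int = 4) -> dict[str, int]:
    """Run one task per member Data Asset instead of one task per file.

    Assumption: each member is a direct subdirectory of combined_root.
    """
    members = sorted(d for d in combined_root.iterdir() if d.is_dir())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of members, so zip pairs results correctly.
        counts = list(pool.map(count_files, members))
    return {member.name: count for member, count in zip(members, counts)}
```

Each task here receives an entire member Data Asset, which mirrors parallelizing at the level of Data Assets. Parallelizing over items would instead submit one task per file inside a single asset.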

To use External Data Assets and Combined Data Assets in a Pipeline, Assumable Roles must be configured. A Code Ocean admin can configure them in your deployment.