Data
Recommended Practices for Data Usage and Storage
Best practice is to store data files in Data Assets. Data Assets make it easy to share data across the organization, and internal Data Assets guarantee reproducibility. See the Data Asset Guide for more information. For small datasets, you can instead upload data files and subfolders directly to the /data folder.
The table below summarizes how data behaves depending on its location and the type of Data Asset.
Folder | Type of Dataset | Shareability | Recommended Usage |
---|---|---|---|
Data | Local (uploaded directly to the Capsule) | Current Capsule only | Small or example datasets for testing the Capsule. |
Data | Internal Dataset | Across Capsules | The Data Asset is saved in the VPC's AWS storage. Works well for immutable data that needs to be imported into Code Ocean's VPC only once. |
Data | External Dataset | Across Capsules | Accessing the Data Asset requires AWS credentials. Works well for confidential data. The Data Asset's contents can change if the source changes. |
Scratch (CW) | Local (created in the Capsule) | Current Capsule only | Available only in the Cloud Workstation. Use it to store large intermediate data or output files. These are usually converted into an internal Data Asset for sharing across Capsules and for downstream analysis. |
Scratch (RR) | Local (created in the Capsule) | Current run only | Temporary storage during a Reproducible Run for large data that might exceed the Capsule's size limit. |
For reproducibility purposes, any files written to the /data folder during a Reproducible Run are deleted once the run is completed.
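The storage rules above can be sketched in code. The following is a minimal, illustrative helper that picks a directory for large intermediate files: it prefers a scratch folder and refuses any location inside the data folder. The mount point `/scratch`, the fallback to the system temporary directory, and the function name `pick_work_dir` are assumptions made for this example and are not confirmed by this guide; adjust them to match your Capsule's layout.

```python
import os
import tempfile
from pathlib import Path

# Assumed mount points: this guide names the folders, but /scratch in
# particular is a guess at the exact path.
DATA_DIR = Path("/data")
SCRATCH_DIR = Path("/scratch")


def pick_work_dir(scratch: Path = SCRATCH_DIR, data: Path = DATA_DIR) -> Path:
    """Return a directory for intermediate files, never inside the data folder."""
    # Prefer the scratch folder when it exists and is writable.
    if scratch.is_dir() and os.access(scratch, os.W_OK):
        candidate = scratch
    else:
        # Hypothetical fallback: the system temporary directory.
        candidate = Path(tempfile.gettempdir())

    # Guard: files written under the data folder during a Reproducible Run
    # are deleted when the run completes, so reject such locations.
    resolved = candidate.resolve()
    data_resolved = data.resolve()
    if resolved == data_resolved or data_resolved in resolved.parents:
        raise ValueError(f"{candidate} is inside {data}; choose another location")
    return candidate
```

A script would call `pick_work_dir()` once at startup and write its intermediate files under the returned path. Anything that must outlive the run or be shared with other Capsules would still need to be converted into an internal Data Asset, as described in the table.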