Working with cloud files in Python is a necessary pain at any biotech organization. While packages like pandas transparently handle S3 URIs, I still found myself writing the same boto3 code to manage files far too often. Nextflow solves this for workflows, but I'm not aware of any package that makes managing S3 files easier for script and notebook use cases.
I developed s3stasher to make working with files in AWS S3 as easy as if they were local files. The key principles are:
- S3 objects should be referred to as full URIs at all times. It shouldn’t be necessary to split a URI into bucket and key strings.
- Any method that reads or writes a file should transparently work on S3 URIs or local files.
- S3 objects should be cached locally and only re-downloaded when the source object has changed.
- Reading S3 objects should work identically while offline, assuming the user has the file cached.
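To make the caching principle concrete, here is a minimal sketch of how URI-to-cache-path mapping and change detection might work. The helper names and the ETag-comparison approach are my assumptions for illustration, not s3stasher's actual internals:

```python
from pathlib import Path

def cache_path(uri: str, cache_dir: Path) -> Path:
    """Map a full S3 URI to a deterministic local cache path.

    e.g. s3://my-bucket/my_data.csv -> <cache_dir>/my-bucket/my_data.csv
    """
    assert uri.startswith("s3://"), "expects a full s3:// URI"
    return cache_dir / uri[len("s3://"):]

def is_stale(etag_file: Path, remote_etag: str) -> bool:
    """Re-download only when the stored ETag differs from the remote one."""
    if not etag_file.exists():
        return True  # never cached before
    return etag_file.read_text().strip() != remote_etag
```

In practice the remote ETag would come from a `head_object` call, so an offline read simply skips the freshness check and serves the cached copy.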
Using s3stasher, you simply wrap any file read or write in a with statement, and the S3 operations happen behind the scenes.
```python
import pandas as pd

from s3stasher import S3

# Download, cache, and read an S3 object
with S3.s3open("s3://my-bucket/my_data.csv") as f:
    my_df = pd.read_csv(f)

# Two layers of context manager are needed for traditional open operations
with S3.s3open("s3://my-bucket/unstructured.txt") as s3f:
    with open(s3f) as f:
        lines = f.readlines()

# Write a file back to S3. By default, it will be saved in the cache dir
# to avoid an unnecessary download in the future
with S3.s3write("s3://my-bucket/my_data_new.csv") as f:
    my_df.to_csv(f)

# Other convenience functions are provided
## List objects under a prefix
uri_list = S3.s3list("s3://my-bucket/prefix/")

## Check for existence of an object
uri_exists = S3.s3exists("s3://my-bucket/unknown_file.txt")

## Copy, move, and remove S3 objects
S3.s3cp("s3://my-bucket/my_file_1.txt", "s3://my-bucket/my_file_2.txt")
S3.s3mv("s3://my-bucket/my_file_2.txt", "s3://my-bucket/my_file_3.txt")
S3.s3rm("s3://my-bucket/my_file_3.txt")
```
By default, s3stasher uses your already-configured AWS credentials and caches files to ~/.s3_cache. These defaults can be changed with a config file or environment variables.
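As a sketch of how an environment-variable override like this typically resolves (the variable name `S3STASHER_CACHE_DIR` is hypothetical; check the project README for the real names):

```python
import os
from pathlib import Path

DEFAULT_CACHE_DIR = Path.home() / ".s3_cache"

def resolve_cache_dir() -> Path:
    # Hypothetical environment variable; the real name may differ
    override = os.environ.get("S3STASHER_CACHE_DIR")
    return Path(override) if override else DEFAULT_CACHE_DIR
```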
You can install s3stasher with a quick `pip install s3stasher`.
PyPI: https://pypi.org/project/s3stasher/
GitHub: https://github.com/bsiranosian/s3stasher
Feedback and PRs welcome!