What is Amazon S3? Cloud Storage Basics and S3 Files Explained

Amazon S3 is the standard for cloud storage. This comprehensive guide covers basic concepts, storage classes, cost management, and the new S3 Mountpoint feature that allows mounting as a file system.

What is Amazon S3?

Amazon S3 (Amazon Simple Storage Service) is an object storage service provided by Amazon Web Services (AWS). Released in 2006, S3 is one of the earliest cloud storage solutions and remains an industry standard widely used today.

Its use cases are diverse, ranging from static file delivery for websites, backup storage, and data lakes for big data analysis to data persistence for applications. It is trusted by enterprises and developers worldwide, with millions of active users.

This article provides a comprehensive explanation of Amazon S3, from basic concepts and storage classes to pricing, and the new feature “Mountpoint for Amazon S3 (S3 Files)” that allows it to be used as a file system.

Basic Concepts of Amazon S3

What is Object Storage?

S3 adopts an architecture called “object storage.” Unlike traditional file storage (which manages data in a folder hierarchy) or block storage (which manages data in fixed-size blocks), it stores data as individual “objects.”

Each object is assigned data (body), metadata, and a unique identifier (key). This design enables extremely high scalability.

Bucket

A bucket is the top-level container for data in S3. All objects are stored within a bucket. Bucket names must be globally unique and cannot be changed once created.

Buckets are associated with a region (geographical region). For example, a bucket created in “ap-northeast-1 (Tokyo Region)” will have its data physically stored in a data center near Tokyo. Choosing the appropriate region is crucial based on data proximity and compliance requirements.

Object and Key

Objects are identified by a “key” within a bucket. Keys resemble file paths, but notably, S3 does not have a hierarchical folder structure.

For example, even with a key like “photos/2024/vacation.jpg”, S3 treats it internally as a single string “photos/2024/vacation.jpg”. The “/” is merely a visual delimiter and does not represent an actual folder structure. For this reason, S3 is also described as “flat storage.”
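
Because the flat namespace trips up many newcomers, here is a pure-Python sketch (not S3 code) of how a prefix and the “/” delimiter emulate folders over flat keys, roughly the way the ListObjectsV2 API does; the keys are illustrative:

```python
# Illustrative only: a pure-Python sketch of S3's flat key namespace.
# S3 stores keys as plain strings; "folders" are emulated by listing
# with a prefix and a "/" delimiter, as ListObjectsV2 does.
keys = [
    "photos/2024/vacation.jpg",
    "photos/2024/family.jpg",
    "photos/2023/ski.jpg",
    "readme.txt",
]

def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Mimic ListObjectsV2: return (objects, common_prefixes)."""
    objects, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the first delimiter looks like a "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)

print(list_with_delimiter(keys))                    # top "level"
print(list_with_delimiter(keys, prefix="photos/"))  # one "level" down
```

Nothing here is hierarchical: both calls simply scan the same flat list of strings.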

Versioning and Lifecycle

Enabling S3’s versioning feature allows you to retain multiple versions of an object with the same key. This is extremely useful for protecting data from accidental overwrites or deletions.

By setting lifecycle rules, you can automatically change storage classes or delete objects based on their age or version. This is a key feature for cost optimization.
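
As a sketch, a lifecycle rule might look like the following Python dict, written in the shape that boto3's put_bucket_lifecycle_configuration expects; the rule ID, prefix, and day counts are illustrative assumptions:

```python
# A sketch of a lifecycle configuration in the shape expected by boto3's
# put_bucket_lifecycle_configuration. Rule ID, prefix, and day counts
# are illustrative, not recommendations.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",   # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                # Move to Standard-IA after 30 days, Glacier after 90.
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},        # delete after one year
        }
    ]
}

# With AWS credentials configured, this would be applied roughly as:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_configuration)
```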

Key Features of S3

High Durability and Availability

S3 is designed to provide 99.999999999% (11 nines) of annual durability. In AWS's own illustration, if you store 10 million objects, you can on average expect to lose a single object once every 10,000 years. Data is automatically replicated across multiple facilities (Availability Zones), providing strong resilience against hardware failures and disasters.

Scalability and Performance

S3 offers virtually unlimited storage capacity and automatically scales with changes in data volume. In terms of performance, it can sustain at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix, and it supports efficient transfer of large files via multipart upload.
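
To see why multipart upload matters for large files, the following sketch picks a part size that respects S3's documented limits (at most 10,000 parts per upload, parts of at least 5 MiB, objects up to 5 TiB); the 8 MiB preferred size is an assumption, not a requirement:

```python
import math

MiB = 1024 ** 2
MAX_PARTS = 10_000          # S3 multipart upload limit
MIN_PART = 5 * MiB          # minimum part size (except the last part)

def choose_part_size(object_size, preferred=8 * MiB):
    """Pick a part size that keeps the upload within 10,000 parts."""
    part_size = max(preferred, MIN_PART)
    # Grow the part size if the part count would exceed the limit.
    if math.ceil(object_size / part_size) > MAX_PARTS:
        part_size = math.ceil(object_size / MAX_PARTS)
    return part_size

# A 5 TiB object (the S3 maximum) would need 655,360 parts at 8 MiB,
# so the part size must be enlarged to fit within 10,000 parts:
size = 5 * 1024 ** 4
part = choose_part_size(size)
print(part, math.ceil(size / part))
```

SDKs such as boto3 perform this kind of calculation automatically; the sketch only makes the constraint visible.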

Security Features

S3 has multi-layered security features. Access control can be managed using bucket policies, ACLs (Access Control Lists), and IAM policies. All communication is encrypted via HTTPS (TLS), and encryption-at-rest options include SSE-S3, SSE-KMS, and SSE-C.

Furthermore, the S3 Block Public Access feature prevents accidental public exposure due to misconfigurations. Enabling this feature across an organization can significantly reduce the risk of accidental bucket exposure.

S3 Replication

Using S3’s Cross-Region Replication (CRR) or Same-Region Replication (SRR), you can automatically replicate objects between buckets. This is used for disaster recovery (DR), low-latency delivery to global users, and meeting compliance requirements.

S3 Event Notifications

You can send event notifications for object creation or deletion in a bucket to AWS Lambda, Amazon SQS, or Amazon SNS. This enables building serverless architectures that trigger automatic processing based on data uploads.
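
As a minimal sketch, an AWS Lambda handler consuming such a notification might look like this; the event below is trimmed to the fields actually used, and the bucket and key names are hypothetical:

```python
# A minimal sketch of an AWS Lambda handler for S3 event notifications.
# The event structure follows S3's documented notification format;
# bucket and key names are illustrative.
def handler(event, context):
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real code would fetch and process the object here (e.g. with boto3).
        processed.append(f"s3://{bucket}/{key}")
    return processed

# A trimmed-down sample event, as delivered for an ObjectCreated:Put:
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "uploads/report.csv"},
            },
        }
    ]
}
print(handler(sample_event, None))
```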

S3 Storage Classes: Optimizing Cost and Performance

S3 offers multiple storage classes based on access frequency and performance requirements. Selecting the appropriate class can significantly reduce costs.

S3 Standard

The most common storage class. Designed for frequently accessed data, it provides high durability (11 nines) and availability (99.99%). Data is stored redundantly across a minimum of three Availability Zones.

S3 Intelligent-Tiering

Ideal for data with unknown or changing access patterns. It continuously monitors access and automatically moves objects that have not been accessed for 30 consecutive days to an Infrequent Access tier, and after 90 days to an Archive Instant Access tier. While there is a small monthly monitoring fee per object, it automatically optimizes costs based on access frequency.

S3 Standard-IA (Infrequent Access)

For data accessed a few times per month. It maintains the same durability as S3 Standard while reducing storage costs. However, retrieval (GET) costs are higher than S3 Standard, so accurate estimation of access frequency is necessary.

S3 One Zone-IA

A further cost-reduced class by limiting durability to a single Availability Zone instead of three. Suitable for recreatable data or data where a copy is retained in another region.

S3 Glacier Instant Retrieval

For archival data, allowing millisecond retrieval. It is even more cost-effective than S3 One Zone-IA, but requires a minimum 90-day storage period and consideration of higher retrieval costs.

S3 Glacier Flexible Retrieval

An archival class with retrieval times ranging from minutes to hours. Costs are lower, but immediate data access is not possible. Bulk retrievals allow restoring large amounts of data at no retrieval charge.

S3 Glacier Deep Archive

The lowest-cost storage class, where data restoration may take 12 hours or more. Suitable for data required by regulations to be stored long-term or as a final backup for disaster recovery.

S3 Files: New Feature to Mount S3 as a File System

What is Mountpoint for Amazon S3?

AWS made “Mountpoint for Amazon S3” generally available in 2023. This is an open-source file client that allows an S3 bucket to be mounted as a local file system. This enables existing applications and tools to access data on S3 via a file system-like interface without modification.

Traditionally, S3 access was primarily through REST APIs. Migrating legacy applications that read and write data via file paths to S3 required significant code changes. Mountpoint for S3 solves this problem.

Technical Mechanism

Mountpoint for S3 uses the FUSE (Filesystem in Userspace) interface to mount an S3 bucket as a local directory. Once mounted, you can access data on S3 using common commands like ls, cat, cp, and mv.

Internally, it leverages the S3 API to translate file operations into object operations. Read operations correspond to the S3 GET API, write operations to the PUT API, and directory listings use the S3 ListObjectsV2 API.

Supported Operating Systems and Architectures

Mountpoint for S3 supports major Linux distributions such as Amazon Linux 2, Amazon Linux 2023, Ubuntu, and Red Hat Enterprise Linux. It runs on both x86_64 and ARM64 (Graviton) architectures.

It does not directly support macOS or Windows. From these environments, the main alternatives are working through an EC2 instance or calling the S3 REST API directly.

Practical Usage

Mounting is very simple. Running mount-s3 bucket-name mountpoint in a terminal mounts the specified S3 bucket at the given directory. After mounting, you can work with it much like a normal file system.

For example, if you specify /mnt/my-bucket as the mount point and use a data analysis tool to access the path /mnt/my-bucket/data/analysis-results.csv, you can directly read the CSV file on S3.
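
The following sketch illustrates the point with plain file I/O. Because an actual mount requires the mount-s3 binary and AWS credentials, a temporary directory stands in for the mount point; the path layout mirrors the example above:

```python
# Once an S3 bucket is mounted, plain file I/O is all that's needed;
# no SDK calls appear anywhere. A temporary directory stands in for the
# mount point, since a real mount needs mount-s3 and AWS credentials.
import csv
import tempfile
from pathlib import Path

mount_point = Path(tempfile.mkdtemp())        # stand-in for /mnt/my-bucket
csv_path = mount_point / "data" / "analysis-results.csv"
csv_path.parent.mkdir(parents=True)

# Pretend an upstream job already wrote this object to the bucket.
csv_path.write_text("metric,value\nlatency_ms,42\n")

# An analysis tool just reads the path; it never knows S3 is behind it.
with open(csv_path, newline="") as f:
    rows = list(csv.DictReader(f))
print(rows)
```

Swapping the temporary directory for a real mount point is the only change needed to run this against S3.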

Suitable Use Cases

Mountpoint for S3 is particularly effective in the following use cases:

It is useful for connecting existing analysis tools or scripts directly to S3 data in data analysis workloads. It minimizes code changes when reading files from S3 in tools like Python’s Pandas or R’s data frame processing.

It can also be used for accessing machine learning training data. If training scripts expect local file paths, Mountpoint for S3 allows access to datasets on S3 without rewriting code.

In big data processing, it allows jobs from Apache Spark or Hadoop to access S3 data via the file system, maintaining compatibility with existing configuration files and scripts.

Limitations and Considerations

There are several limitations to understand with Mountpoint for S3.

First, since S3 is object storage, it cannot fully emulate all file system operations. For example, partial file writes (in-place updates) are not possible. To update a file, the entire file must be re-uploaded.

Directory renaming and symbolic link creation are also not supported. Additionally, exclusive file locking (flock) is not supported, so caution is needed for concurrent writes from multiple processes.

By default, Mountpoint is tuned for read-heavy workloads. Writes are supported but limited to sequentially creating new files; overwriting an existing file requires the --allow-overwrite option, deleting one requires --allow-delete, and renaming existing files is likewise unsupported.

Differences from Existing Solutions

AWS has long had the “File Gateway” feature of “AWS Storage Gateway.” File Gateway is a gateway that allows file-based access to S3 from on-premises environments. In contrast, Mountpoint for S3 is a lightweight client that runs directly on an EC2 instance, eliminating the need for a gateway server.

Furthermore, S3 does not natively support NFS or SMB protocols, but File Gateway provides access via NFS/SMB. It is important to choose the appropriate solution based on the use case.

S3 Pricing Structure

S3 pricing consists of the following components:

Storage fees are charged based on the amount of data stored in the bucket and the selected storage class. Fees are calculated on a monthly basis, and the per-gigabyte cost decreases at higher usage tiers (tiered volume pricing).

Request fees are charged based on the number of API requests made to S3. Pricing differs between PUT, COPY, POST, LIST requests and GET, SELECT requests.

Data transfer fees are incurred when data is transferred from S3 to the internet or other AWS regions. Transfers within the same region to other AWS services are typically free. Internet transfers are free for the first 100GB per month (as of 2024), with charges per GB thereafter.

Data query fees are incurred when using S3 Select or Amazon Athena to query data within S3.
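
Putting the components above together, a rough monthly estimate can be computed as below. The rates are illustrative placeholders only; always check the current AWS pricing pages for your region:

```python
# A back-of-the-envelope S3 monthly cost estimate. The rates below are
# illustrative placeholders, not current prices; check the AWS pricing
# pages before relying on any figure.
RATES = {
    "storage_per_gb": 0.023,       # S3 Standard, per GB-month (assumed)
    "put_per_1000": 0.005,         # PUT/COPY/POST/LIST requests (assumed)
    "get_per_1000": 0.0004,        # GET/SELECT requests (assumed)
    "transfer_out_per_gb": 0.09,   # beyond the free allowance (assumed)
}

def estimate_monthly_cost(storage_gb, puts, gets, transfer_out_gb,
                          free_transfer_gb=100):
    cost = storage_gb * RATES["storage_per_gb"]
    cost += puts / 1000 * RATES["put_per_1000"]
    cost += gets / 1000 * RATES["get_per_1000"]
    billable = max(0, transfer_out_gb - free_transfer_gb)
    cost += billable * RATES["transfer_out_per_gb"]
    return round(cost, 2)

# 500 GB stored, 100k uploads, 1M downloads, 150 GB served to the internet:
print(estimate_monthly_cost(500, 100_000, 1_000_000, 150))
```

Even a crude model like this makes the cost structure visible: storage usually dominates, and the transfer allowance matters only once internet egress exceeds it.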

S3 Security Best Practices

Proper Bucket Policy Configuration

Bucket policies define access rules for a bucket in JSON format. It is crucial to create policies that grant only the minimum necessary access, following the principle of least privilege.

For example, you can implement fine-grained controls such as allowing access only from specific IP addresses or granting read-only permissions to specific IAM users.
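
As an illustration of least privilege, the policy below (expressed as a Python dict so it can be serialized to the JSON S3 expects) grants one IAM user read-only access to a single prefix from one IP range; the account ID, bucket name, and CIDR are all hypothetical:

```python
import json

# An illustrative bucket policy following least privilege: read-only
# access to one prefix, and only from a specific IP range. The account
# ID, bucket name, and CIDR block are hypothetical.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadOnlyFromOfficeIP",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:user/analyst"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-bucket/reports/*",
            "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}},
        }
    ],
}

# The JSON below is what would go into the bucket policy editor:
print(json.dumps(bucket_policy, indent=2))
```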

Enabling S3 Block Public Access

Enabling S3 Block Public Access at the organization level prevents accidental public exposure due to misconfigured bucket policies. This can be enforced across all buckets in the AWS Management Console account settings.

Enabling Access Logging

Enabling S3 server access logging records all requests made to a bucket. This is useful for security audits and detecting unauthorized access, and logs can be saved to another bucket for analysis.

Detecting Sensitive Data with Amazon Macie

Amazon Macie is a service that automatically detects sensitive data (such as personal information or credit card numbers) within S3 buckets. It reduces the risk of sensitive data exposure due to incorrect public settings.

S3 Use Cases

Hosting Websites and Applications

S3 is widely used for hosting static websites. By placing HTML, CSS, JavaScript, and image files in a bucket and combining it with S3 website hosting features or CloudFront (CDN), you can build highly available, low-cost websites.

Backup and Disaster Recovery

Many enterprises use S3 as a backup destination for critical data. Combining versioning and lifecycle rules enables automated backup policies. Cross-region replication also allows for handling regional failures.

Data Lake and Big Data Analysis

The use of S3 as a “data lake” is expanding rapidly. It allows for low-cost, large-scale storage of raw data in any format (structured or unstructured), which can be directly queried by analysis tools like Amazon Athena, Amazon Redshift Spectrum, and Apache Spark.

Archiving and Compliance

S3 is also used for archiving data that requires long-term storage. Using the Glacier classes allows long-term storage of large amounts of data at very low cost (S3 Glacier Deep Archive is on the order of one US dollar per terabyte per month). It also supports storing regulated data such as medical records and financial transaction records.

Machine Learning and AI

S3 serves as a core data store in machine learning pipelines. It is utilized at every stage of ML workloads, from storing training data and managing model artifacts to storing inference results. The introduction of Mountpoint for S3 further eases integration with existing ML workflows.

Getting Started with S3

Once you create an AWS account, you can immediately start using S3. You can create buckets and upload data using the AWS Management Console, AWS CLI, or AWS SDK.

New AWS accounts include a free tier (AWS Free Tier) of up to 5GB of S3 Standard storage for the first 12 months, allowing you to start small-scale trials or proof-of-concept (PoC) projects at no cost.

Summary

Amazon S3 has earned trust as a cloud storage foundation for nearly two decades since its 2006 launch. Its high durability, scalability, flexible storage classes, and robust security features make it an attractive option for organizations of all scales.

The newly introduced Mountpoint for S3 further expands the scope of S3 usage and facilitates integration with existing applications and tools. This feature, which allows enjoying the benefits of cloud storage while maintaining a file system-like operational feel, is expected to see even wider adoption in the future.

We hope this article serves as a reference for enterprises considering cloud migration or developers looking to optimize their existing S3 usage.

Frequently Asked Questions

What are the differences between Amazon S3, Google Cloud Storage, and Azure Blob Storage?
Their core functionalities are similar, but their ecosystems differ. S3 integrates seamlessly with AWS services and is the most mature object storage. Google Cloud Storage excels in integration with GCP, while Azure Blob Storage is strong in integration with the Microsoft environment. The choice typically depends on the existing cloud environment or tools being used.
Is S3 Mountpoint paid?
Mountpoint for S3 itself is open-source and free. However, S3 API request fees and data transfer fees incurred from accessing the mounted bucket are billed as usual. If running on an EC2 instance, instance costs are also separate.
Is there a file size limit for files stored in S3?
Individual objects can be up to 5TB. A single PUT request can upload up to 5GB; for larger files, multipart upload is used. There is no limit on the number of objects stored in a bucket.
Is data stored in S3 really secure?
S3 provides 11 nines (99.999999999%) of annual durability, achieving industry-leading data protection. However, data loss due to accidental deletion or bucket policy misconfiguration is the user's responsibility. Further security can be achieved by properly utilizing features like versioning, Cross-Region Replication, and S3 Object Lock.
