by Andrés Campos, Senior Data Engineer at Growth Acceleration Partners.
In the realm of data operations, efficiency and agility are essential. Amazon Web Services (AWS) offers a suite of cloud computing services that seamlessly integrate with each other to streamline data workflows.
In this article, we explore a practical data workflow for web applications leveraging AWS S3, Lambda, Glue, RDS, and QuickSight.
As a case study, we’ll apply these services through the implementation of a simple, cloud-powered personal finance management application. Our pipeline begins with AWS S3 serving as the central object storage for financial data files. When data is uploaded to S3, AWS Lambda functions are triggered, initiating automated Glue jobs that transform the data and feed our RDS MySQL database. This structured data then becomes the source for analysis in QuickSight, empowering users with near-real-time insights into their financial habits and trends.
This orchestrated workflow not only streamlines data processing, but also enables comprehensive analysis, guiding users toward informed financial decisions. Join us as we explore how these AWS services come together to enable the implementation of data pipelines, offering efficiency, scalability, and actionable insights for our data.
Architecture Diagram
The following is an architectural diagram describing the overall flow of our web application:
We’ll start by examining this diagram and then delve into why we’ve chosen each of those specific AWS technologies.
An initial aspect to note is that currently, files are manually uploaded into the bucket in CSV format. However, if this were an application or website, the data format might vary, and the ingestion method could be integrated into a frontend or an API solution, easing the process of getting the data into our data pipeline.
Whenever a new file is added to this bucket, an event is triggered. This event, in turn, activates a Lambda Function.
The Lambda Function initiates an AWS Glue job, which is responsible for transferring the data from the file in the S3 bucket to the RDS database.
Finally, QuickSight directly connects to the RDS to access the appropriate dataset for creating the requested chart.
Choosing AWS Technologies
S3: Secure and Scalable Storage
The first step is to migrate our spreadsheet files to a more secure and scalable environment. AWS S3 stands out as the ideal choice for storing our files.
The files are stored in a single bucket within our AWS account and arrive in one consistent CSV format. Inside the bucket, they live under a folder called raw, which contains a subfolder for each specific type of data, such as Incomes and Expenses.
The flexibility and reliability of S3 ensure our data is available when needed, while its scalability can handle any volume of information seamlessly, which becomes important if the application ever grows into a cloud service accessed by millions of users. Since S3 is an object storage service, it is well suited to storing almost anything, including photos, videos, and files in any format.
Additionally, it allows us to retrieve whatever we have stored whenever we want, either through a graphical interface or through code.
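To illustrate, here is a minimal sketch of uploading and retrieving a CSV file with the AWS SDK for Python (boto3). The bucket name, folder layout, and file names are assumptions for this example, not values from the actual project.

```python
import boto3

# Hypothetical bucket and key names that follow the raw/<type>/ folder layout.
BUCKET = "personal-finance-data"
KEY = "raw/Expenses/expenses_2024_05.csv"

s3 = boto3.client("s3")

# Upload a local CSV file into the raw/Expenses folder of the bucket.
s3.upload_file("expenses_2024_05.csv", BUCKET, KEY)

# Retrieve the same object later, for inspection or reprocessing.
s3.download_file(BUCKET, KEY, "expenses_copy.csv")
```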
Speaking a bit about the technical aspect, we chose Amazon S3 for several reasons.
- Flexibility: S3 is highly flexible and can handle a variety of file types, including the CSV files we use for our web application. This allows us to store our data in different file formats.
- Reliability: S3 is known for its high durability and availability. Data stored in S3 is designed to be durable, with built-in redundancy to protect against data loss. This reliability is crucial to ensure the integrity of our data.
- Scalability: One of the main attractions of S3 is its ability to easily scale to handle any volume of data. This is especially important to us as we want a solution that could grow with our financial needs seamlessly.
- Ease of Use: S3 is easy to use and offers an intuitive interface for uploading, downloading, and managing files. This makes migrating our spreadsheet files to S3 quick and uncomplicated.
- Integration with other AWS Services: By using other AWS services in our solution, such as Lambda and AWS Glue, integration with S3 is natural and seamless. This simplifies the process of automation and data processing.
RDS: Centralizing Information in a MySQL Database
Once the files are in S3, we need a centralized place to store and manage the information more dynamically.
This is where Amazon RDS comes into play. We use RDS to set up a MySQL database, which lets us load, query, and manipulate our data more efficiently.
We also use views, each built from a combination of two or more tables in our database, in some cases with extra logic, to create datasets that are useful for reporting later on.
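As a rough illustration of the kind of reporting view described above, the following sketch creates a monthly balance view over hypothetical incomes and expenses tables. The connection details, table names, and columns are assumptions, not the project’s actual schema.

```python
import pymysql

# Hypothetical connection details; replace with your RDS endpoint and credentials.
conn = pymysql.connect(
    host="finance-db.xxxxxxxx.us-east-1.rds.amazonaws.com",
    user="admin",
    password="***",
    database="finance",
)

# A view combining two tables (incomes and expenses) with a bit of extra
# logic: monthly totals and the resulting net balance.
CREATE_VIEW_SQL = """
CREATE OR REPLACE VIEW monthly_balance AS
SELECT i.month,
       i.total_income,
       e.total_expenses,
       i.total_income - e.total_expenses AS net_balance
FROM (SELECT month, SUM(amount) AS total_income   FROM incomes  GROUP BY month) i
JOIN (SELECT month, SUM(amount) AS total_expenses FROM expenses GROUP BY month) e
  ON e.month = i.month
"""

with conn.cursor() as cur:
    cur.execute(CREATE_VIEW_SQL)
conn.commit()
conn.close()
```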
There are several reasons why we chose Amazon RDS to store and manage the information:
- Ease of Configuration: RDS offers a simple and quick way to set up a relational database, allowing us to launch our solution efficiently and without unnecessary complications.
- Scalability: RDS offers scalability options that allow us to adjust the database resources according to our needs. This means we could start with a modest configuration and scale vertically or horizontally if needed in the future.
- Automated Management: Amazon RDS handles administrative tasks such as software patching, backups, and management of the underlying infrastructure. This frees up time and resources we could dedicate to other areas of the project.
- High Availability and Durability: RDS ensures high availability and durability of data by synchronously replicating data across multiple availability zones within an AWS region. This gives us the confidence that our data will be secure and accessible at all times.
AWS Glue and Lambda: Automating the Process
To ensure our data is transferred smoothly and without manual intervention, we leveraged AWS Glue and AWS Lambda.
A Lambda function is a serverless piece of code that runs on demand in response to events from other AWS services. When a Lambda function is created, it is necessary to specify the runtime (programming language) and add the specific code we want it to run.
Additionally, we attach the event we want to react to. In this case, an S3 event notification is configured so that the Lambda function is invoked each time a new CSV file is uploaded. This function, in turn, starts an AWS Glue job that processes the file and automatically loads the data into our RDS database.
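A minimal sketch of such a Lambda handler is shown below, using boto3 to start the Glue job and forwarding the bucket and key of the uploaded file as job arguments. The job and argument names are assumptions for illustration.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical Glue job name; configure this to match the job defined in Glue.
GLUE_JOB_NAME = "csv-to-rds-load"

def lambda_handler(event, context):
    """Invoked by the S3 ObjectCreated event notification."""
    # The S3 event carries the bucket and object key of the new CSV file.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Start the Glue job and pass the file location as job arguments.
    response = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={
            "--source_bucket": bucket,
            "--source_key": key,
        },
    )
    return {"glue_job_run_id": response["JobRunId"]}
```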
To prepare our data, we use AWS Glue, which is an ETL service. AWS Glue is designed to extract, transform, and load data from different sources, and send it to different destinations.
Glue also provides a centralized metadata repository called the Data Catalog. This stores metadata information about various data sources, including databases, tables, and schemas. It helps to discover, organize, and manage data assets.
There is an automatic way to populate the Data Catalog: AWS Glue provides crawlers, a feature that automatically discovers and catalogs metadata from various data sources. A crawler’s primary purpose is to analyze the data structure, infer schema information, and populate the AWS Glue Data Catalog with metadata about the data.
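For reference, a crawler can be created and run from code as well as from the console. The sketch below uses boto3; the crawler name, IAM role, catalog database, and S3 path are assumptions.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names; the role needs read access to the S3 path being crawled.
glue.create_crawler(
    Name="raw-finance-crawler",
    Role="AWSGlueServiceRole-finance",
    DatabaseName="finance_catalog",
    Targets={"S3Targets": [{"Path": "s3://personal-finance-data/raw/"}]},
)

# Run the crawler once; it infers the CSV schemas and creates one Data Catalog
# table per subfolder (for example, Incomes and Expenses).
glue.start_crawler(Name="raw-finance-crawler")
```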
Jobs
A job is the component in AWS Glue that allows you to define and execute data transformation tasks.
As of this writing, there are three different kinds of jobs:
- Visual ETL: Allows users to visually design and build data transformation workflows without writing code. This feature is designed to simplify the process of creating ETL jobs by providing a visual representation of the data flow and transformations.
- Notebook: It is an interactive development environment that allows users to write, run, and test Apache Spark code or SQL code using different cells of the same notebook. These notebooks are integrated into the AWS Glue console and provide a collaborative platform for data exploration, data analysis, and ETL job development.
- Script: Similar to a notebook, except the job is defined as a single standalone script, typically written in Python (PySpark); a minimal sketch of such a script follows this list.
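Below is a minimal sketch of a script-type job that reads the uploaded CSV from S3 and appends it to a MySQL table on RDS over JDBC. The argument names match the hypothetical Lambda example above, and the JDBC endpoint, table, and credentials are likewise assumptions.

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Job arguments passed in by the Lambda function (without the leading "--").
args = getResolvedOptions(sys.argv, ["source_bucket", "source_key"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the newly uploaded CSV file straight from S3.
df = spark.read.option("header", "true").csv(
    f"s3://{args['source_bucket']}/{args['source_key']}"
)

# Append the rows to the target MySQL table on RDS through JDBC.
# Endpoint, database, table, and credentials below are placeholders.
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://finance-db.xxxxxxxx.us-east-1.rds.amazonaws.com:3306/finance")
   .option("dbtable", "expenses")
   .option("user", "admin")
   .option("password", "***")
   .mode("append")
   .save())
```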
Automating an AWS Glue Job
There are two different ways to automate a job:
- Schedule: You can schedule job runs to occur at specific intervals (e.g., daily, hourly, weekly) based on a cron expression or a predefined schedule.
- Event: You can trigger job runs based on events, such as file arrival in an Amazon S3 bucket, AWS CloudWatch events, AWS Lambda invocations, or Amazon EventBridge events.
This automation eliminates the need for human intervention, saving us considerable time and effort.
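As an example of the schedule option, a Glue trigger can be created from code; the sketch below schedules a daily run of the hypothetical load job at 06:00 UTC.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical trigger and job names; the cron expression uses AWS's
# six-field format (minutes, hours, day-of-month, month, day-of-week, year).
glue.create_trigger(
    Name="daily-finance-load",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "csv-to-rds-load"}],
    StartOnCreation=True,
)
```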
QuickSight: Visualizing Data Intuitively
With the data now stored and updated automatically, the next step is to visualize it clearly and comprehensively. We opted for Amazon QuickSight for its ease of use and powerful visualization capabilities. With QuickSight, we are able to create a custom report that allows us to analyze and better understand our financial habits.
We chose Amazon QuickSight to visualize our data for the following reasons:
- Ease of Use: It offers an intuitive and easy-to-use interface that allows us to create visualizations quickly without the need for previous experience in data analysis or dashboard design.
- Powerful Visualization Capabilities: QuickSight provides a wide range of visualization options, including interactive charts, pivot tables, and geospatial maps.
- Integration with AWS: By using other AWS services in our solution, integration with QuickSight is straightforward. We are able to connect directly to our RDS and S3 to access data and create real-time visualizations.
Lessons Learned
Automation is key
One of the main learnings from this project is the importance of automation in data management. By leveraging the capabilities of services that enable scheduling and custom triggers — such as Lambda and Glue — we are able to eliminate repetitive manual tasks, and ensure the consistency and accuracy of our data.
Scalability is important
Another crucial aspect is scalability. By opting for services like S3 and RDS, we could be confident that our solution could grow and adapt as data needs evolve.
Visualization improves understanding
Lastly, data visualization proves to be an invaluable tool for understanding the behavior of our data and making informed decisions. BI tools like QuickSight allow us to appreciate patterns and trends that might otherwise have gone unnoticed.
Next Steps
Before concluding, here’s a list of potential next steps to take this project further.
Implementing CI/CD for Automated Deployment:
Continuous Integration (CI) and Continuous Deployment (CD) are crucial practices in modern software development that can further enhance the efficiency and reliability of our data workflow. Here are the next steps to consider for implementing CI/CD:
- Set up a CI pipeline: Integrate a code repository with a CI service such as AWS CodePipeline.
- Continuous Deployment: Utilize CD tools such as AWS CodeDeploy to automate the deployment process.
- Define deployment stages like development, staging, and production with automated approvals and rollback mechanisms.
Infrastructure as Code (IaC):
Leverage tools like AWS CloudFormation or Terraform to manage the infrastructure as code, ensuring consistency and reproducibility.
Define infrastructure templates for S3 buckets, RDS databases, and Lambda functions.
Monitoring and Logging:
Implement monitoring and logging solutions such as AWS CloudWatch to track the performance and health of our data workflows.
Set up alerts for key metrics such as data processing times, database latency, and application errors.
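As a starting point, an alarm on errors from the ingestion Lambda could look like the following sketch; the function name, threshold, and period are illustrative assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm whenever the hypothetical ingestion Lambda reports one or more errors
# within a five-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="finance-ingestion-lambda-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "start-glue-job"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
)
```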
Version Control:
Maintain version control for our infrastructure code and data transformation scripts.
Conclusion
In conclusion, the adoption of AWS technologies completely transforms the way we can manage data for web applications. From the initial migration of the CSV files to the automation of the process and data visualization, each step leads us toward a more efficient and scalable solution. This experience therefore teaches us valuable lessons about the power and versatility of the cloud.