Traditional AWS Data Lake Solutions

Fred Gu
4 min read · Nov 1, 2021


Target audience: Solution Architects & Senior AWS Data Engineers

This post reviews traditional AWS data lake solutions at a high level. It serves as a preface to another post, AWS Data Lake Solution based on Apache Hudi without requiring Database CDC.

Athena + CTAS Solution

Many data teams start their data lake journey by loading flat files into an S3 bucket, using a Glue Crawler to populate tables in the Data Catalog, and then building their 'complicated' logic in Athena views. Sometimes they also use CREATE TABLE AS SELECT (CTAS) ([1] Extract, Transform and Load data into S3 data lake using CTAS and INSERT INTO statements in Amazon Athena) to persist calculated results into standalone tables in order to speed up reporting queries. Eventually, they build some Quicksight dashboards powered by the Athena engine to present insights and value to end users. This solution is only recommended for ad-hoc reporting.
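
As a minimal sketch of the CTAS step (the database, table and bucket names below are hypothetical), a CTAS statement can be submitted to Athena with boto3:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database, table and bucket names for illustration only.
ctas_query = """
CREATE TABLE reporting.daily_sales_summary
WITH (
    format = 'PARQUET',
    external_location = 's3://my-curated-bucket/daily_sales_summary/'
) AS
SELECT sale_date, store_id, SUM(amount) AS total_amount
FROM raw.sales
GROUP BY sale_date, store_id
"""

response = athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={"Database": "reporting"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print(response["QueryExecutionId"])
```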

DMS + Glue Solution (recommended)

Some teams will consider ingesting data from relational source databases continuously, and AWS DMS is usually recommended as the first option. The beauty of DMS is that a single replication task can perform both the full load and the incremental load from almost any supported database to an S3 bucket in a surprisingly efficient and minimally disruptive manner. I personally don't think it is easy to achieve similar efficiency and robustness with hand-written code.
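
As a rough sketch of how such a task could be created with boto3 (the endpoint and instance ARNs and the schema name below are placeholders), a "full-load-and-cdc" task covers both the initial full load and the ongoing incremental load:

```python
import json
import boto3

dms = boto3.client("dms")

# Only replicate the tables we care about; schema/table names are hypothetical.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "sales", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

response = dms.create_replication_task(
    ReplicationTaskIdentifier="sales-full-load-and-cdc",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",    # placeholder ARNs
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET-S3",
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # full load first, then ongoing replication
    TableMappings=json.dumps(table_mappings),
)
print(response["ReplicationTask"]["Status"])
```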

It is always recommended that DMS replication tasks write raw data in Parquet format to the S3 bucket, since Parquet is more friendly to the Glue Spark engine. We can then expect DMS to keep writing new Parquet files to the S3 bucket whenever there are changes on the source database side.
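
The Parquet output is configured on the S3 target endpoint. Here is a hedged boto3 sketch (the bucket name, folder and role ARN are assumptions):

```python
import boto3

dms = boto3.client("dms")

# Bucket name, folder and service role ARN below are hypothetical.
response = dms.create_endpoint(
    EndpointIdentifier="raw-zone-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access-role",
        "BucketName": "my-raw-bucket",
        "BucketFolder": "dms",
        "DataFormat": "parquet",                 # write Parquet instead of the default CSV
        "ParquetVersion": "parquet-2-0",
        "TimestampColumnName": "cdc_timestamp",  # helps downstream jobs order changes
    },
)
print(response["Endpoint"]["EndpointArn"])
```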

Once raw data has been extracted and written to the S3 bucket by DMS, Glue jobs can be either triggered or scheduled to perform data cleaning and data transformation. Eventually, the reporting data is written to a curated S3 bucket, which is consumed by both reporting solutions and machine learning/AI solutions.
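
A minimal Glue (PySpark) job sketch for this step might look like the following, assuming hypothetical raw and curated bucket paths and a simple de-duplication transformation:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw Parquet files written by DMS (bucket and paths are hypothetical).
raw_df = spark.read.parquet("s3://my-raw-bucket/dms/sales/orders/")

# Example cleaning/transformation: drop duplicates and derive a partition column.
curated_df = (
    raw_df.dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_ts"))
)

# Write curated data for reporting and ML consumers.
curated_df.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://my-curated-bucket/sales/orders/"
)

job.commit()
```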

Please see the following architecture diagram for reference.

Traditional Data Lake Solutions

There are 3 prerequisites for this DMS + Glue solution:

  • DMS must be able to access the source databases. Some setup of the VPC, VPC peering, route tables and database security groups is required. Get the AWS admin's help if needed.
  • Change Data Capture (CDC) needs to be enabled on the source database, as it is required for DMS incremental load. Unfortunately, CDC is not switched on by default most of the time, so get the database admin involved to switch it on, especially for production database servers.
  • Both DMS and Glue require certain roles and permissions (see the sketch after this list). Get the AWS admin's help if needed.
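
As one example for the last prerequisite, here is a minimal boto3 sketch that creates a Glue job role and attaches the AWSGlueServiceRole managed policy (the role name and bucket scoping are assumptions; DMS needs its own roles, such as dms-vpc-role, set up in a similar way):

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting the Glue service assume this role.
assume_glue = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# The role name is arbitrary; AWSGlueServiceRole is the usual baseline managed policy.
iam.create_role(
    RoleName="my-glue-etl-role",
    AssumeRolePolicyDocument=json.dumps(assume_glue),
)
iam.attach_role_policy(
    RoleName="my-glue-etl-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
# S3 access to the raw/curated buckets still needs to be granted separately,
# e.g. via an inline policy scoped to those buckets.
```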

A Quick Start

There is one good AWS blog that discusses the above solution in detail, perhaps the only one. See [2] Load ongoing data lake changes with AWS DMS and AWS Glue. That blog describes a generic approach that uses a DynamoDB controller table to control what to load and how, and it provides two CloudFormation stacks that help launch the solution quickly. You are highly recommended to follow that post for a quick start.
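
Purely as an illustration of the controller-table idea (the actual schema used in [2] may differ), a controller item could look something like this:

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Hypothetical controller table; one item per source table describes what to load and how.
table = dynamodb.Table("data_lake_controller")

table.put_item(
    Item={
        "table_name": "sales.orders",        # which source table this entry controls
        "load_type": "incremental",          # e.g. 'full' or 'incremental'
        "raw_path": "s3://my-raw-bucket/dms/sales/orders/",
        "curated_path": "s3://my-curated-bucket/sales/orders/",
        "primary_key": "order_id",
        "active": True,                      # allows a table to be paused
    }
)
```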

Architecture Image from [2]

Other Solutions

There are other AWS blogs discussing similar data lake pipeline architectures, such as [3] AWS serverless data analytics pipeline reference architecture. That blog is a great summary of AWS data lake solutions covering most data scenarios, including data streaming. It could be a good starting point for an AWS data lake project.

Architecture Image from [3]

Appendix

[1] Extract, Transform and Load data into S3 data lake using CTAS and INSERT INTO statements in Amazon Athena

[2] Load ongoing data lake changes with AWS DMS and AWS Glue

[3] AWS serverless data analytics pipeline reference architecture
