Name*E-mail*Phone*

Details about your project

Thanks! We'll be in touch in the next 12 hours

Oops! Something went wrong while submitting the form.

Your Quintessential Guide to AWS Athena

Sagar Khangan

Data Engineering

Tags:

serverless

AWS

athena

Introduction

Serverless has become a new trend today and is here to stay for sure! Now when you think of wireless internet, you know that it still has some wires but you don’t need to worry about them as you don’t have to maintain them. Similarly, serverless has servers but you don’t have to keep worrying about handling or maintaining them. All you need to do is focus on your code and you’re good to go.

It has some more benefits, such as:

Zero administration: You can deploy code without provisioning anything beforehand, or managing anything later. There is no concept of a fleet, an instance, or even an operating system.
Auto-scaling: It lets your service providers manage the scaling challenges. You don’t need to fire alerts or write scripts to scale up and down. It handles quick bursts of traffic and weekend lulls the same way.
Pay-per-use: The function-as-a-service compute and managed services are charged based on usage rather than pre-provisioned capacity. You can have complete resource utilization without paying a cent for idle time. The results? 90% cost-savings over a cloud VM, and the satisfaction of knowing that you never pay for resources you don’t use.

What is AWS Athena?

AWS Athena is a similar serverless service. It is more of an interactive query service than a code deployment service.

Using Athena one can directly query the data stored in S3 buckets and using standard ANSI SQL.

As mentioned earlier, it works on the principle of serverless, that is, there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is easy to use. You can simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

It is based on Facebook’s PrestoDB and can be used to query structured and semi-structured data.

Some Exciting Features of Athena are:

Serverless. No ETL - Not having to set up and manage any servers or data warehouses.
Only pay for the data that is scanned.
You can ensure better performance by compressing, partitioning, and converting your data into columnar formats.
Can also handle complex analysis, including large joins, window functions, and arrays.
Athena automatically executes queries in parallel.
Need to provide a path to the S3 folder and when new files added automatically reflects in the table.
Supports -
Support CSV, Json, Parquet, ORC, Avro data formats
Complex Joins and datatypes
View creation
Does not Support -
User-defined functions and stored procedures
Hive or Presto transactions
LZO (Snappy is supported)

Pricing of Athena

AWS Athena is priced $5 for each TB of data scanned.
Queries are rounded up to the nearest MB, with a 10 MB minimum.
Users pay for stored data at regular S3 rates.
Amazon advises users to use compressed data files, have data in columnar formats, and routinely delete old results sets to keep charges low. Partitioning data in tables can speed up queries and reduce query bills.

Athena vs. Redshift Spectrum

AWS also has Redshift as data warehouse service, and we can use redshift spectrum to query S3 data, so then why should you use Athena?

Advantages of Redshift Spectrum:

Allows creation of Redshift tables. You’re able to join Redshift tables with Redshift spectrum tables efficiently.

If you do not need those things then you should consider Athena as well Athena differences from Redshift spectrum:

Billing. This is a major difference and depending on your use case you may find one much cheaper than the other Performance.
Athena slightly faster. SQL syntax and features.
Athena is derived from presto and is a bit different to Redshift which has its roots in Postgres.
It's easy enough to connect to Athena using API, JDBC or ODBC but many more products offer "standard out of the box" connection to Redshift.
Athena has GIS functions and lambdas.

So in nutshell, if you have existing instances of redshift you would probably go for Redshift Spectrum, if not then you can opt for Athena for querying the data. In some cases, you can use both in tandem.

Example

Here is a sample query to create a sample database having 3 tables basic_details, contact_details and bill_details, Uploaded csv file to s3:
‍
Basic_details:


	const outside = {weather: FRIGHTFUL}
	const inside = {fire: DELIGHTFUL}
	const go = places => places.some(p=>p>outside.weather)))

	const snow = () => (outside.weather < inside.fire && !go(places)) {
	let it = snow()
	}

	let it = snow()

	const FRIGHTFUL = 1
	const DELIGHTFUL = 1337

view raw letitsnow.js hosted with ❤ by GitHub

Bill_details:

	CREATE EXTERNAL TABLE `bil_details`(
	`id` int COMMENT '',
	`amount_paid` string COMMENT '',
	`amount_due` string COMMENT '')
	ROW FORMAT DELIMITED
	FIELDS TERMINATED BY ','
	STORED AS INPUTFORMAT
	'org.apache.hadoop.mapred.TextInputFormat'
	OUTPUTFORMAT
	'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
	LOCATION
	's3://athena-blog/bill-details'
	TBLPROPERTIES (
	'has_encrypted_data'='false',
	'skip.header.line.count'='1')

view raw gistfile1.txt hosted with ❤ by GitHub

‍Contact_details:

	CREATE EXTERNAL TABLE `contact_details`(
	`id` int COMMENT '',
	`street` string COMMENT '',
	`city` string COMMENT '',
	`state` string COMMENT '',
	`country` string COMMENT '',
	`zip` string COMMENT '')
	ROW FORMAT DELIMITED
	FIELDS TERMINATED BY ','
	STORED AS INPUTFORMAT
	'org.apache.hadoop.mapred.TextInputFormat'
	OUTPUTFORMAT
	'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
	LOCATION
	's3://athena-blog/contact-details'
	TBLPROPERTIES (
	'has_encrypted_data'='false',
	'skip.header.line.count'='1')

view raw gistfile1.txt hosted with ❤ by GitHub

Sample Query for - FirstNames of People from Minnesota with amount_due > $100

	WITH basic AS
	(SELECT id,
	first_name
	FROM basic_details
	WHERE lower(gender) = 'male' ), bill AS
	(SELECT id
	FROM bil_details
	WHERE CAST(amount_due AS INTEGER) > 100 ), contact AS
	(SELECT contact_details.id
	FROM contact_details
	JOIN bill
	ON contact_details.id = bill.id
	WHERE state= 'Minnesota' )
	SELECT basic.first_name
	FROM basic
	JOIN contact
	ON basic.id = contact.id

view raw query hosted with ❤ by GitHub

Output:

Some Other Sample Queries:

1. Searching for Values in JSON

	WITH dataset AS (
	SELECT * FROM (VALUES
	(JSON '{"name": "Bob Smith", "org": "legal", "projects": ["project1"]}'),
	(JSON '{"name": "Susan Smith", "org": "engineering", "projects": ["project1", "project2", "project3"]}'),
	(JSON '{"name": "Jane Smith", "org": "finance", "projects": ["project1", "project2"]}')
	) AS t (users)
	)
	SELECT json_extract_scalar(users, '$.name') AS user
	FROM dataset
	WHERE json_array_contains(json_extract(users, '$.projects'), 'project2')

view raw Searching for Values in JSON hosted with ❤ by GitHub

Output:

2. Extracting properties

	WITH dataset AS (
	SELECT '{"name": "Susan Smith",
	"org": "engineering",
	"projects": [{"name":"project1", "completed":false},
	{"name":"project2", "completed":true}]}'
	AS blob
	)
	SELECT
	json_extract(blob, '$.name') AS name,
	json_extract(blob, '$.projects') AS projects
	FROM dataset

view raw extracting properties hosted with ❤ by GitHub

Output:

3. Converting JSON to Athena Data Types

	WITH dataset AS (
	SELECT
	CAST(JSON '"HELLO ATHENA"' AS VARCHAR) AS hello_msg,
	CAST(JSON '12345' AS INTEGER) AS some_int,
	CAST(JSON '{"a":1,"b":2}' AS MAP(VARCHAR, INTEGER)) AS some_map
	)
	SELECT * FROM dataset

view raw gistfile1.txt hosted with ❤ by GitHub

Output:

Conclusion

Hence, we can easily say that AWS Athena gives us an efficient way to query our raw data present in different formats in S3 object storage, without spawning a dedicated infrastructure and at minimal cost.

Need help with setting up AWS Athena for your organization? Connect with the experts at Velotio!

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

Your Quintessential Guide to AWS Athena

Introduction

It has some more benefits, such as:

Zero administration: You can deploy code without provisioning anything beforehand, or managing anything later. There is no concept of a fleet, an instance, or even an operating system.
Auto-scaling: It lets your service providers manage the scaling challenges. You don’t need to fire alerts or write scripts to scale up and down. It handles quick bursts of traffic and weekend lulls the same way.
Pay-per-use: The function-as-a-service compute and managed services are charged based on usage rather than pre-provisioned capacity. You can have complete resource utilization without paying a cent for idle time. The results? 90% cost-savings over a cloud VM, and the satisfaction of knowing that you never pay for resources you don’t use.

What is AWS Athena?

AWS Athena is a similar serverless service. It is more of an interactive query service than a code deployment service.

Using Athena one can directly query the data stored in S3 buckets and using standard ANSI SQL.

As mentioned earlier, it works on the principle of serverless, that is, there is no infrastructure to manage, and you pay only for the queries that you run.

It is based on Facebook’s PrestoDB and can be used to query structured and semi-structured data.

Some Exciting Features of Athena are:

Serverless. No ETL - Not having to set up and manage any servers or data warehouses.
Only pay for the data that is scanned.
You can ensure better performance by compressing, partitioning, and converting your data into columnar formats.
Can also handle complex analysis, including large joins, window functions, and arrays.
Athena automatically executes queries in parallel.
Need to provide a path to the S3 folder and when new files added automatically reflects in the table.
Supports -
Support CSV, Json, Parquet, ORC, Avro data formats
Complex Joins and datatypes
View creation
Does not Support -
User-defined functions and stored procedures
Hive or Presto transactions
LZO (Snappy is supported)

Pricing of Athena

AWS Athena is priced $5 for each TB of data scanned.
Queries are rounded up to the nearest MB, with a 10 MB minimum.
Users pay for stored data at regular S3 rates.
Amazon advises users to use compressed data files, have data in columnar formats, and routinely delete old results sets to keep charges low. Partitioning data in tables can speed up queries and reduce query bills.

Athena vs. Redshift Spectrum

AWS also has Redshift as data warehouse service, and we can use redshift spectrum to query S3 data, so then why should you use Athena?

Advantages of Redshift Spectrum:

Allows creation of Redshift tables. You’re able to join Redshift tables with Redshift spectrum tables efficiently.

If you do not need those things then you should consider Athena as well Athena differences from Redshift spectrum:

Billing. This is a major difference and depending on your use case you may find one much cheaper than the other Performance.
Athena slightly faster. SQL syntax and features.
Athena is derived from presto and is a bit different to Redshift which has its roots in Postgres.
It's easy enough to connect to Athena using API, JDBC or ODBC but many more products offer "standard out of the box" connection to Redshift.
Athena has GIS functions and lambdas.

Example

Here is a sample query to create a sample database having 3 tables basic_details, contact_details and bill_details, Uploaded csv file to s3:
‍
Basic_details:


	const outside = {weather: FRIGHTFUL}
	const inside = {fire: DELIGHTFUL}
	const go = places => places.some(p=>p>outside.weather)))

	const snow = () => (outside.weather < inside.fire && !go(places)) {
	let it = snow()
	}

	let it = snow()

	const FRIGHTFUL = 1
	const DELIGHTFUL = 1337

view raw letitsnow.js hosted with ❤ by GitHub

Bill_details:

	CREATE EXTERNAL TABLE `bil_details`(
	`id` int COMMENT '',
	`amount_paid` string COMMENT '',
	`amount_due` string COMMENT '')
	ROW FORMAT DELIMITED
	FIELDS TERMINATED BY ','
	STORED AS INPUTFORMAT
	'org.apache.hadoop.mapred.TextInputFormat'
	OUTPUTFORMAT
	'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
	LOCATION
	's3://athena-blog/bill-details'
	TBLPROPERTIES (
	'has_encrypted_data'='false',
	'skip.header.line.count'='1')

view raw gistfile1.txt hosted with ❤ by GitHub

‍Contact_details:

	CREATE EXTERNAL TABLE `contact_details`(
	`id` int COMMENT '',
	`street` string COMMENT '',
	`city` string COMMENT '',
	`state` string COMMENT '',
	`country` string COMMENT '',
	`zip` string COMMENT '')
	ROW FORMAT DELIMITED
	FIELDS TERMINATED BY ','
	STORED AS INPUTFORMAT
	'org.apache.hadoop.mapred.TextInputFormat'
	OUTPUTFORMAT
	'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
	LOCATION
	's3://athena-blog/contact-details'
	TBLPROPERTIES (
	'has_encrypted_data'='false',
	'skip.header.line.count'='1')

view raw gistfile1.txt hosted with ❤ by GitHub

Sample Query for - FirstNames of People from Minnesota with amount_due > $100

	WITH basic AS
	(SELECT id,
	first_name
	FROM basic_details
	WHERE lower(gender) = 'male' ), bill AS
	(SELECT id
	FROM bil_details
	WHERE CAST(amount_due AS INTEGER) > 100 ), contact AS
	(SELECT contact_details.id
	FROM contact_details
	JOIN bill
	ON contact_details.id = bill.id
	WHERE state= 'Minnesota' )
	SELECT basic.first_name
	FROM basic
	JOIN contact
	ON basic.id = contact.id

view raw query hosted with ❤ by GitHub

Output:

Some Other Sample Queries:

1. Searching for Values in JSON

	WITH dataset AS (
	SELECT * FROM (VALUES
	(JSON '{"name": "Bob Smith", "org": "legal", "projects": ["project1"]}'),
	(JSON '{"name": "Susan Smith", "org": "engineering", "projects": ["project1", "project2", "project3"]}'),
	(JSON '{"name": "Jane Smith", "org": "finance", "projects": ["project1", "project2"]}')
	) AS t (users)
	)
	SELECT json_extract_scalar(users, '$.name') AS user
	FROM dataset
	WHERE json_array_contains(json_extract(users, '$.projects'), 'project2')

view raw Searching for Values in JSON hosted with ❤ by GitHub

Output:

2. Extracting properties

	WITH dataset AS (
	SELECT '{"name": "Susan Smith",
	"org": "engineering",
	"projects": [{"name":"project1", "completed":false},
	{"name":"project2", "completed":true}]}'
	AS blob
	)
	SELECT
	json_extract(blob, '$.name') AS name,
	json_extract(blob, '$.projects') AS projects
	FROM dataset

view raw extracting properties hosted with ❤ by GitHub

Output:

3. Converting JSON to Athena Data Types

	WITH dataset AS (
	SELECT
	CAST(JSON '"HELLO ATHENA"' AS VARCHAR) AS hello_msg,
	CAST(JSON '12345' AS INTEGER) AS some_int,
	CAST(JSON '{"a":1,"b":2}' AS MAP(VARCHAR, INTEGER)) AS some_map
	)
	SELECT * FROM dataset

view raw gistfile1.txt hosted with ❤ by GitHub

Output:

Conclusion

Need help with setting up AWS Athena for your organization? Connect with the experts at Velotio!

serverless

AWS

athena

About the Author

Did you like the blog? If yes, we're sure you'll also like to work with the people who write them - our best-in-class engineering team.

We're looking for talented developers who are passionate about new emerging technologies. If that's you, get in touch with us.

Explore current openings

Your Quintessential Guide to AWS Athena

Sagar Khangan

Introduction

What is AWS Athena?

Some Exciting Features of Athena are:

Pricing of Athena

Athena vs. Redshift Spectrum

Some Other Sample Queries:

Conclusion

MORE POSTS BY THIS AUTHOR

Sagar Khangan

You may also like

Centralized Governance of Data Lake, Data Fabric with adopted Data Mesh Setup

Sagar Jaswani

Data Engineering: Beyond Big Data

Pratyush Pranav

Iceberg: Features and Hands-on (Part 2)

Abhishek Sharma

Your Quintessential Guide to AWS Athena

Introduction

What is AWS Athena?

Some Exciting Features of Athena are:

Pricing of Athena

Athena vs. Redshift Spectrum

Some Other Sample Queries:

Conclusion

About the Author

Did you like the blog? If yes, we're sure you'll also like to work with the people who write them - our best-in-class engineering team.

We're looking for talented developers who are passionate about new emerging technologies. If that's you, get in touch with us.

About Velotio

Subscribe to get the latest technology updates

Related Posts

Services

By Company Stage

By Engagement Model

Expertise

Product Engineering

Data and AI

Cloud & DevOps

Strategy and Consulting

Subscribe to get the latest technology updates

Your Quintessential Guide to AWS Athena

Sagar Khangan

Introduction

What is AWS Athena?

Some Exciting Features of Athena are:

Pricing of Athena

Athena vs. Redshift Spectrum

Some Other Sample Queries:

Conclusion

MORE POSTS BY THIS AUTHOR

Sagar Khangan

You may also like

Centralized Governance of Data Lake, Data Fabric with adopted Data Mesh Setup

Sagar Jaswani

Data Engineering: Beyond Big Data

Pratyush Pranav

Iceberg: Features and Hands-on (Part 2)

Abhishek Sharma

Your Quintessential Guide to AWS Athena

Introduction

What is AWS Athena?

Some Exciting Features of Athena are:

Pricing of Athena

Athena vs. Redshift Spectrum

Some Other Sample Queries:

Conclusion

About the Author

Did you like the blog? If yes, we're sure you'll also like to work with the people who write them - our best-in-class engineering team.

We're looking for talented developers who are passionate about new emerging technologies. If that's you, get in touch with us.

About Velotio

Subscribe to get the latest technology updates

Related Posts

Centralized Governance of Data Lake, Data Fabric with adopted Data Mesh Setup

Data Engineering: Beyond Big Data

Iceberg: Features and Hands-on (Part 2)

Data QA: The Need of the Hour

Iceberg - Introduction and Setup (Part - 1)

Confluent Kafka vs. Amazon Managed Streaming for Apache Kafka (AWS MSK) vs. on-premise Kafka

Mage: Your New Go-To Tool for Data Orchestration

The Data Lake Revolution: Unleashing the Power of Delta Lake

Unlocking the Potential of Knowledge Graphs: Exploring Graph Databases

Spatial Data Analytics : The What, Why, and How?

Apache Flink - A Solution for Real-Time Analytics

An Introduction to Stream Processing & Analytics

Modern Data Stack: The What, Why and How?

Best Practices for Kafka Security

Parallelizing Heavy Read and Write Queries to SQL Datastores using Spark and more!

ClickHouse - The Newest Data Store in Your Big Data Arsenal

How to Load Unstructured Data into Apache Hive

Building an ETL Workflow Using Apache NiFi and Hive

Unit Testing Data at Scale using Deequ and Apache Spark

Elasticsearch - Basic and Advanced Concepts

BigQuery 101: All the Basics You Need to Know

Real Time Analytics for IoT Data using Mosquitto, AWS Kinesis and InfluxDB

Lessons Learnt While Building an ETL Pipeline for MongoDB & Amazon Redshift Using Apache Airflow

The Ultimate Beginner’s Guide to Jupyter Notebooks

Product Engineering

Data and AI

Cloud & DevOps

Strategy and Consulting