Google BigQuery is an enterprise data warehouse built on Bigtable and Google Cloud Platform. It’s serverless and fully managed. BigQuery works well with data of all sizes, from a 100-row Excel spreadsheet to several petabytes, and, most importantly, it can execute complex queries on that data within a few seconds.
We need to note before we proceed: BigQuery is not a transactional database. It takes around 2 seconds to run even a simple query like `SELECT * FROM bigquery-public-data.object LIMIT 10` on a 100 KB table with 500 rows. Hence, it shouldn’t be thought of as an OLTP (Online Transaction Processing) database. BigQuery is for Big Data!
BigQuery supports SQL-like queries, which makes it user-friendly and beginner-friendly. It’s accessible via its web UI, command-line tool, or client libraries (available in C#, Go, Java, Node.js, PHP, Python, and Ruby). You can also take advantage of its REST API and get your job done by sending a JSON request.
Now, let’s dive deeper to understand it better. Suppose you are a data scientist (or a startup that analyzes data) and you need to analyze terabytes of data. If you choose a tool like MySQL, the first step before even thinking about any query is to put an infrastructure in place that can store data of this magnitude.
Designing this setup is itself a difficult task, because you have to figure out the RAM size, whether to use DC/OS or Kubernetes, and other factors. And if you have streaming data coming in, you will need to set up and maintain a Kafka cluster. With BigQuery, all you have to do is bulk upload your CSV/JSON file, and you are done. BigQuery handles all the backend for you. If you need streaming data ingestion, you can use Fluentd. Another advantage is that you can connect Google Analytics with BigQuery seamlessly.
BigQuery is a serverless, highly available, petabyte-scale service that allows you to execute complex SQL queries quickly. It lets you focus on analysis rather than handling infrastructure. Hardware is completely abstracted away and not visible to you, not even as virtual machines.
Architecture of Google BigQuery
You don’t need to know much about the underlying architecture of BigQuery. That’s actually the whole point - you don’t need to worry about architecture and operations.
However, understanding BigQuery’s architecture helps in controlling costs, optimizing query performance, and optimizing storage. BigQuery is built on the design described in Google’s Dremel paper.
Quoting the abstract from the Google Dremel paper -
“Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.”
Dremel has been in production at Google since 2006. Google has used it for the following tasks -
Analysis of crawled web documents.
Tracking install data for applications on Android Market.
Crash reporting for Google products.
OCR results from Google Books.
Spam analysis.
Debugging of map tiles on Google Maps.
Tablet migrations in managed Bigtable instances.
Results of tests run on Google’s distributed build system.
Disk I/O statistics for hundreds of thousands of disks.
Resource monitoring for jobs run in Google’s data centers.
Symbols and dependencies in Google’s codebase.
BigQuery is much more than Dremel. Dremel is just a query execution engine, whereas BigQuery also relies on interesting technologies like Borg (the predecessor of Kubernetes) and Colossus. Colossus is the successor to the Google File System (GFS), as mentioned in the Google Spanner paper.
How Does BigQuery Store Data?
BigQuery stores data in a columnar format called Capacitor (a successor of ColumnarIO), which achieves a very high compression ratio and scan throughput. Unlike with ColumnarIO, BigQuery can operate directly on compressed data without decompressing it.
Columnar storage has the following advantages:
Traffic minimization - When you submit a query, only the column values required by that query are scanned and transferred during execution. E.g., the query `SELECT title FROM Collection` would access values of the title column only.
Higher compression ratio - Columnar storage can achieve a compression ratio of 1:10, whereas ordinary row-based storage can compress at roughly 1:3.
(Image source: Google Dremel Paper)
Columnar storage has the disadvantage of not working efficiently when updating existing records. That is why Dremel doesn’t support any update queries.
How Does a Query Get Executed?
BigQuery depends on Borg for data processing. Borg simultaneously instantiates hundreds of Dremel jobs across clusters made up of thousands of machines. In addition to assigning compute capacity for Dremel jobs, Borg handles fault tolerance as well.
Now, how do you design and execute a query that can run on thousands of nodes and fetch the results? This challenge was overcome by using the tree architecture: a massively parallel distributed tree that pushes a query down to the leaves and aggregates the results back up at blazing speed.
(Image source: Google Dremel Paper)
BigQuery vs. MapReduce
The key differences between BigQuery and MapReduce are -
Dremel is designed as an interactive ad-hoc data analysis tool for large datasets.
MapReduce is designed as a programming framework for batch processing large datasets.
Moreover, Dremel finishes most queries within seconds or tens of seconds and can even be used by non-programmers, whereas a MapReduce job takes much longer (sometimes even hours or days) to do the same work.
Following is a comparison of running MapReduce on row-based versus columnar storage:
(Image source: Google Dremel Paper)
Another important thing to note is that BigQuery is meant for analyzing structured data (using SQL), whereas with MapReduce you can write logic for unstructured data as well.
Comparing BigQuery and Redshift
In Redshift, you need to allocate different instance types and create your own clusters. The benefit of this is that it lets you tune the compute/storage to meet your needs. However, you have to be aware of (virtualized) hardware limits and scale up/out based on that. Note that you are charged by the hour for each instance you spin up.
In BigQuery, you just upload the data and query it. It is a truly managed service. You are charged for storage, streaming inserts, and queries.
There are more similarities between the two data warehouses than differences.
A smart user will definitely take advantage of the hybrid cloud (GCP+AWS) and leverage the different services offered by both ecosystems. Check out your quintessential guide to AWS Athena here.
Getting Started With Google BigQuery
Following is a quick example to show how you can get started with BigQuery:
1. There are many public datasets available on BigQuery; you are going to play with the `bigquery-public-data:stackoverflow` dataset. You can click on the “Add Data” button on the left panel and select datasets.
2. Next, find the language that has the best community, based on the response time. You can write a query like the following to do that.
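A minimal sketch of such a query is shown below. It measures the average time to a question’s first answer per language tag, using the `posts_questions` and `posts_answers` tables of the public Stack Overflow dataset; the specific tags listed are illustrative:
```sql
-- Average minutes from a question's creation to its first answer, per tag.
-- A sketch against the public Stack Overflow dataset; the tags are examples.
SELECT
  tag,
  ROUND(AVG(TIMESTAMP_DIFF(first_answer, asked, MINUTE)), 1) AS avg_minutes_to_first_answer
FROM (
  SELECT
    q.id,
    q.creation_date AS asked,
    MIN(a.creation_date) AS first_answer,  -- earliest answer to the question
    tag
  FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
  JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
    ON a.parent_id = q.id
  CROSS JOIN UNNEST(SPLIT(q.tags, '|')) AS tag  -- one row per tag on the question
  WHERE tag IN ('c', 'javascript', 'python', 'java')
  GROUP BY q.id, asked, tag
)
GROUP BY tag
ORDER BY avg_minutes_to_first_answer
```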
3. Now you can execute the query and get results -
You can see that C has the best community, followed by JavaScript!
How to Do Machine Learning on BigQuery
Now that you have a sound understanding of BigQuery, it’s time for some real action.
As discussed above, you can connect Google Analytics with BigQuery: go to the Google Analytics Admin panel, select the property in the PROPERTY column, click All Products, and then click Link BigQuery. After that, enter your BigQuery project ID (or project number), and BigQuery will be linked to Google Analytics. Note - right now, BigQuery integration is only available to Google Analytics 360.
Assuming you have already uploaded your Google Analytics data, here is how you can create a logistic regression model. Here, you are predicting whether or not a website visitor will make a transaction.
Create a model named `velotio_tutorial.sample_model` and set `model_type` to `logistic_reg`, because you want to train a logistic regression model. A logistic regression model splits input data into two classes and gives the probability that a sample falls in one of the classes. Logistic regression is usually used for “spam or not spam” types of problems, and this problem is similar - a transaction will be made or not.
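Here is a minimal sketch of that `CREATE MODEL` statement. It assumes the public Google Analytics sample dataset (`bigquery-public-data.google_analytics_sample`); if you have linked your own Analytics export, substitute your table names and date range:
```sql
-- A sketch of training the model on the public Google Analytics sample data.
CREATE OR REPLACE MODEL `velotio_tutorial.sample_model`
OPTIONS(model_type = 'logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,  -- did the session end in a transaction?
  IFNULL(device.operatingSystem, '') AS os,        -- operating system of the visitor's device
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, '') AS country,       -- country the session originated from
  IFNULL(totals.pageviews, 0) AS pageviews         -- total page views in the session
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
```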
The above query gets the total number of page views, the country from which the session originated, the operating system of the visitor’s device, the total number of e-commerce transactions within the session, etc.
Now you just press Run Query to execute it and train the model.
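Once training finishes, you can use the model for predictions with `ML.PREDICT`. The sketch below scores sessions from a later date range of the same assumed sample dataset; the feature columns mirror the training query above:
```sql
-- A sketch of predicting transactions with the trained model.
SELECT
  predicted_label,        -- 1 if a transaction is predicted, 0 otherwise
  predicted_label_probs   -- per-class probabilities
FROM
  ML.PREDICT(MODEL `velotio_tutorial.sample_model`, (
    SELECT
      IFNULL(device.operatingSystem, '') AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, '') AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
LIMIT 10
```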
Conclusion
BigQuery is a query service that allows you to run SQL-like queries against multiple terabytes of data in a matter of seconds. If you have structured data, BigQuery is the best option to go for. It can help even a non-programmer get the analytics right!