
Iceberg: Features and Hands-on (Part 2)

Abhishek Sharma

Data Engineering

In the previous blog, we discussed Apache Iceberg’s basic concepts, the setup process, and how to load data. In this post, we will delve into some of Iceberg’s advanced features, including upsert functionality, schema evolution, time travel, and partitioning.

Upsert Functionality

One of Iceberg’s key features is its support for upserts. Upsert, which stands for update and insert, allows you to efficiently manage changes to your data. With Iceberg, you can perform these operations seamlessly, ensuring that your data remains accurate and up-to-date without the need for complex and time-consuming processes.

Schema Evolution

Schema evolution is another of Iceberg’s powerful features. Over time, the schema of your data may need to change due to new requirements or updates. Iceberg handles schema changes gracefully, allowing you to add, remove, or modify columns without having to rewrite your entire dataset. This flexibility ensures that your data architecture can evolve in tandem with your business needs.

Time Travel

Iceberg also provides time travel capabilities, enabling you to query historical data as it existed at any given point in time. This feature is particularly useful for debugging, auditing, and compliance purposes. By leveraging snapshots, you can easily access previous states of your data and perform analyses on how it has changed over time.
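
As a quick illustration, time-travel queries from Spark look roughly like the sketch below; the table name, timestamp, and snapshot ID are illustrative:

# A sketch of Iceberg time travel from Spark; the table name, timestamp,
# and snapshot ID are illustrative.

# List the snapshots available for time travel.
spark.sql("SELECT snapshot_id, committed_at FROM local.db.sample.snapshots").show()

# Query the table as it existed at a point in time (Spark 3.3+ SQL syntax).
spark.sql("SELECT * FROM local.db.sample TIMESTAMP AS OF '2024-01-01 00:00:00'").show()

# Or pin a specific snapshot by its ID.
spark.sql("SELECT * FROM local.db.sample VERSION AS OF 4348291547689111617").show()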

Set Up Iceberg on the Local Machine Using the Local Catalog Option or Hive

You can also configure Iceberg in your Spark session like this:

CODE: https://gist.github.com/velotiotech/77b4679d2a7141d21b3dd31c243954fa.js

A few configurations must be passed while setting up Iceberg.
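
A minimal sketch of such a configuration, assuming a filesystem-based (“hadoop”) catalog named local and an illustrative warehouse path, might look like this:

from pyspark.sql import SparkSession

# A minimal sketch of an Iceberg-enabled Spark session with a local Hadoop catalog.
# The catalog name ("local") and warehouse path are illustrative.
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    # Pull in the Iceberg runtime; match the version to your Spark/Scala build.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1")
    # Enable Iceberg's SQL extensions (needed for MERGE INTO, partition evolution DDL, etc.).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a filesystem-based catalog named "local"; use type "hive" for a Hive Metastore.
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)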

Create Tables in Iceberg and Insert Data

CODE: https://gist.github.com/velotiotech/a11b6169caf70b506a5d4be2c421b11e.js

CODE: https://gist.github.com/velotiotech/6c89444c2c4b2c07c909e4cd3310c077.js

We can either create the sample table using Spark SQL or write the data directly by specifying the database and table names, which will create the Iceberg table for us.

You can see the data we have inserted. Apart from appending, you can use the overwrite mode as well, just as you would with Delta Lake tables. You can also see an example of how to read the data from an Iceberg table.
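
Putting those steps together, a sketch of both approaches might look like this (catalog, database, and table names are illustrative):

# Create an Iceberg table with Spark SQL and insert a few rows.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.sample (
        id INT,
        name STRING,
        category STRING
    ) USING iceberg
""")
spark.sql("INSERT INTO local.db.sample VALUES (1, 'alpha', 'a'), (2, 'beta', 'b')")

# Or write a DataFrame directly: createOrReplace() creates the Iceberg table for us,
# while append() adds rows to an existing one.
df = spark.createDataFrame([(3, "gamma", "a")], ["id", "name", "category"])
df.writeTo("local.db.sample").append()

# Read the data back from the Iceberg table.
spark.table("local.db.sample").show()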

Handling Upserts

This Iceberg feature is similar to Delta Lake’s. You can update records in existing Iceberg tables without rewriting the complete dataset, which is also useful for handling CDC (change data capture) operations. We can take input from any incoming CSV and merge the data into the existing table without any duplication; the table will always hold a single record for each primary key. This is how Iceberg maintains ACID properties.
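
In general, such an upsert is expressed with MERGE INTO; a minimal sketch is shown below, where the CSV path, view name, and columns are illustrative and id acts as the primary key:

# A sketch of an Iceberg upsert via MERGE INTO (requires Iceberg's SQL extensions).
incoming = spark.read.option("header", "true").csv("/path/to/incoming.csv")
incoming.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO local.db.sample AS target
    USING updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *       -- replace the existing record for this key
    WHEN NOT MATCHED THEN INSERT *       -- insert records with new keys
""")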

Incoming Data 

CODE: https://gist.github.com/velotiotech/b3434b6eff713b576bbe5d027e15133d.js

We will merge this data into our existing Iceberg table using Spark SQL.

CODE: https://gist.github.com/velotiotech/1714db016057bd41bac33df09131930b.js

Here, we can see the data once the merge operation has taken place.

Schema Evolution

Iceberg supports the following schema evolution changes:

  • Add – Add a new column to the Iceberg table
  • Drop – Remove an existing column from the table
  • Rename – Change the name of an existing column
  • Update – Widen the data type of an existing column or field
  • Reorder – Change the order of columns in the Iceberg table

After updating the schema, there is no need to overwrite or rewrite the data. For example, suppose your table has four columns, all of them holding data. If you add two more columns, you don’t need to rewrite the data for the new six-column schema; you can still access it easily. This feature was lacking in Delta Lake but is present here. These are some characteristics of Iceberg schema evolution:

  1. If we add any columns, they won’t impact the existing columns.
  2. If we delete or drop any columns, they won’t impact other columns.
  3. Updating a column or field does not change values in any other column.

Iceberg uses unique IDs to track each column added to a table.

Let’s run some queries to update the schema and try deleting some columns.

CODE: https://gist.github.com/velotiotech/3b9c200c8462cc6aa1665f906ef4b4b6.js

After adding another column, if we try to access the data from the table again, we can do so without seeing any kind of error. This is how Iceberg solves schema-related problems.
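
For reference, the schema changes exercised above take roughly this shape in Spark SQL (table and column names are illustrative):

# Sketches of Iceberg schema evolution DDL; none of these rewrite existing data files.
spark.sql("ALTER TABLE local.db.sample ADD COLUMN email STRING")             # Add
spark.sql("ALTER TABLE local.db.sample DROP COLUMN category")                # Drop
spark.sql("ALTER TABLE local.db.sample RENAME COLUMN name TO full_name")     # Rename
spark.sql("ALTER TABLE local.db.sample ALTER COLUMN id TYPE BIGINT")         # Update (widen a type)
spark.sql("ALTER TABLE local.db.sample ALTER COLUMN email FIRST")            # Reorder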

Partition Evolution and Sort Order Evolution

Iceberg introduced partition evolution, an option that was missing in Delta Lake. When you evolve a partition spec, the old data written with the earlier spec remains unchanged, and new data is written using the new spec in a new layout. Metadata for each partition version is kept separately. Because of this, queries use split planning: each partition layout plans its files separately, using the filter it derives for that specific layout.

Similar to the partition spec, the sort order of an existing Iceberg table can also be updated. When you evolve a sort order, the old data written with the earlier order remains unchanged.

CODE: https://gist.github.com/velotiotech/1e5d5e4e9e33e467b91192207d1a4405.js
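For reference, partition and sort order evolution statements look roughly like this (illustrative names; these require Iceberg’s Spark SQL extensions):

# Sketches of partition and sort order evolution. Existing data files stay in
# their old layout; only newly written data uses the new spec or order.
spark.sql("ALTER TABLE local.db.sample ADD PARTITION FIELD category")
spark.sql("ALTER TABLE local.db.sample DROP PARTITION FIELD category")
spark.sql("ALTER TABLE local.db.sample WRITE ORDERED BY id")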

Copy-on-Write (COW) and Merge-on-Read (MOR)

Iceberg supports both COW and MOR when loading data into an Iceberg table. We can configure this either while creating the Iceberg table or by altering it afterward, as sketched below.
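
A minimal sketch of both options, using Iceberg’s write-mode table properties (the table name is illustrative; merge-on-read requires table format version 2):

# Choose merge-on-read at creation time via write-mode table properties.
spark.sql("""
    CREATE TABLE local.db.events (id INT, payload STRING) USING iceberg
    TBLPROPERTIES (
        'format-version'    = '2',             -- MOR delete files need the v2 format
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")

# Or switch an existing table's update mode back to copy-on-write.
spark.sql("""
    ALTER TABLE local.db.events SET TBLPROPERTIES (
        'write.update.mode' = 'copy-on-write'
    )
""")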

Copy-On-Write (COW) – Best for tables with frequent reads, infrequent writes/updates, or large batch updates:

When your requirement is to read frequently but write and update less often, you can configure this property on an Iceberg table. In COW, when we update or delete any rows, a new version of the data file is created, and the latest version holds the updated data. Because the data is rewritten whenever updates or deletions occur, writes are slower and can become a bottleneck for large updates. As the name specifies, it creates another copy of the data on write.

Reads, on the other hand, are ideal: since we are not updating or deleting anything at read time, only reading, the data can be read faster.

Merge-On-Read (MOR) – Best for tables with frequent writes/updates:

This is just the opposite of COW: data files are not rewritten when rows are updated or deleted. Instead, Iceberg writes change (delete) files containing the updated records, and these are merged with the original data files at read time to produce the current state of the table.

Supported Query Engines and Integrations

Iceberg tables can be queried from engines such as Apache Spark, Apache Flink, Trino, Presto, Apache Hive, and Impala, and integrate with catalogs like the Hive Metastore and AWS Glue.

Conclusion

Through this exercise, we learned about Iceberg’s features and its compatibility with various metastores for integration. We got a basic idea of configuring Iceberg on different cloud platforms as well as locally, and we covered the basics of upserts, schema evolution, and partition evolution.


Did you like the blog? If yes, we're sure you'll also like to work with the people who write them - our best-in-class engineering team.

We're looking for talented developers who are passionate about new emerging technologies. If that's you, get in touch with us.

Explore current openings
