Apache Iceberg Community News: Presidential Edition

Pay gap opens cybersecurity trap, the US Presidential race begins, and Netflix Moments are the new TikTok.

Nov 01, 2024

It’s a tough paradox: the UK government is tasked with securing its national infrastructure from sophisticated cyber threats, yet the gap in pay between public and private sectors can make recruiting top talent a challenge.

While working for GCHQ or similar agencies offers a unique prestige and an undeniable appeal to those passionate about national security, the reality is that with the ongoing demand for cybersecurity talent in the private sector, skilled individuals are likely to be drawn by the combination of competitive pay and often less stressful corporate roles.

Cybersecurity has evolved to a point where private companies are competing head-on with governments in terms of both the resources they can offer for cybersecurity efforts, and the stakes of the digital assets they protect. This demand for cybersecurity expertise is intense, particularly for roles that require constant upskilling to keep pace with evolving threats, making market-competitive salaries and employee benefits a top priority for specialists.

The longer-term impact of this trend on national security is concerning: if the government can't bridge this pay gap, it could make the UK more vulnerable to cyber threats. A passion for King and country and the possibility of adding some 007-style finesse to their CV may attract some candidates, but a competitive offer is far more compelling.

Meanwhile, in a similar predicament across the pond, the US has just concluded its Service for America campaign, a two-month-long drive to recruit candidates into the 500K cybersecurity roles that remain vacant across both public and private sectors.

Biden and Harris make strides to fill half a million cybersecurity roles

By appealing to a candidate's sense of patriotism and commitment to national defense, the campaign underscores that cybersecurity isn't just a job but a public service. This approach mirrors past efforts to recruit for government and military roles by calling on individuals' allegiance and devotion to their country.

By the time you’re reading this, the outcome of the US Presidential election might already be sealed. Last time around, Trump famously skipped Biden’s inauguration after a contentious transfer of power. If the tables turn and Trump wins this race, will Biden return the favor with a quiet “day off”? We’ll have to wait and see!

In the meantime, if you’re looking for a distraction from all things electoral, Netflix has just launched its new Moments feature—ideal for passing the hours in meme-worthy indulgence and curing your TikTok addiction.

Now iOS users (with Android soon to follow) can clip and share scenes from their favorite shows. Whether it’s replaying iconic moments with friends or curating a private stash of Netflix gems, this feature promises endless scrollable content.

Early industry buzz suggests Moments might just give Netflix the extra lift it’s been looking for in the streaming wars, while the rest of us ponder: what did we do with our time before the internet came along?

🗞️ Industry News

The latest news and updates from the data industry.

MotherDuck Unveils Beta pg_duckdb Extension, Bringing DuckDB Analytics Directly to PostgreSQL

MotherDuck, in partnership with Hydra and DuckDB Labs, has released the beta version of pg_duckdb, a PostgreSQL extension that integrates DuckDB’s powerful analytics engine into PostgreSQL. This extension enables users to execute analytical queries up to 1500x faster on certain workloads without altering their existing PostgreSQL setup, providing a practical 10x improvement for many other queries.

With pg_duckdb, organizations can now run complex analytical queries, access data in lakes and lakehouses, and read formats such as Parquet and Apache Iceberg, scaling out with MotherDuck’s cloud resources. Jordan Tigani, MotherDuck’s CEO, explains that this extension brings “DuckDB’s analytical prowess directly to PostgreSQL users,” enhancing PostgreSQL's transactional capabilities with fast, high-powered analytics.

Dremio Introduces First Hybrid Data Catalog for Apache Iceberg, Delivering Superior Flexibility and Governance

Dremio, the unified lakehouse platform for self-service analytics and AI, now offers full flexibility with its Apache Iceberg Data Catalog, supporting on-premises, cloud, and hybrid deployments—making it the only lakehouse provider with this level of architectural choice.

In addition, Dremio has introduced integrations with both Snowflake’s managed Apache Polaris (incubating) service and Databricks’ Unity Catalog. These integrations let customers select the catalog that best meets their needs, while leveraging Dremio for seamless, governed analytics across all data. By unifying data infrastructure, businesses can optimize performance and reduce total cost of ownership by eliminating unnecessary infrastructure costs.

📑 Articles

Discover the hottest topics and discussions in the data engineering world.

Iceberg is an Implementation Detail

In this article, Amy Chen, Product Manager at dbt Labs, shares why the technical specifics of table formats like Apache Iceberg don't concern her as much as their utility in data workflows. While Apache Iceberg, Delta Lake, and other formats revolutionize data storage and access, Chen argues that for analytics professionals, the real priority is driving insights, not managing storage details.

With dbt's recent support for the Iceberg table format, the complexities of working with Iceberg are further abstracted, enabling users to focus on modeling data rather than the intricacies of table formats. This evolution will make it easier for dbt users to adopt Iceberg and other open table formats.

Simplifying Iceberg Table Partitioning Using Adaptive Clustering

Since the early days of Hive, data engineers have grappled with various strategies to lay out and partition data effectively in the data lake. Finding the right balance between fast queries and efficient, cost-effective data ingestion is a challenge that every data professional faces. But even when you get it right, it’s only the beginning—you’ll need to continuously adjust and refine your data layout to match ever-evolving query patterns.

While Apache Iceberg offers a simplified approach to data layout over time, it still requires making tough decisions early on—decisions that can lead to costly adjustments down the road. In this article, Upsolver’s Senior Solutions Architect, Jason Hall, explores the key challenges of partitioning data in Iceberg and shares how these issues were addressed at Upsolver using their Adaptive Clustering feature.

Govern an Open Lakehouse with Snowflake Open Catalog, a Managed Service for Apache Polaris

To improve security and streamline operations, many organizations need flexible, secure integration across their data tools on a single dataset. While open storage standards help, catalog standardization for consistent security controls remains a challenge. Snowflake has collaborated with the community to address these needs through Apache Polaris (incubating).

Snowflake recently announced that Snowflake Open Catalog is now generally available. It offers a managed service for Apache Polaris that unifies access controls across data lakes, enabling seamless collaboration and scalable governance for all data engines.

How to integrate Databricks with Snowflake-managed Iceberg Tables

There is a lot of discussion about table formats within the big data and analytics ecosystems. Instead of focusing on the "why" of Open vs. Closed, or Delta vs. Iceberg, on which there will be continued debate, this post focuses on the "how" of using Snowflake-managed Iceberg tables within Databricks.

Check out this guide from Paul Needleman, Principal Solution Architect at Snowflake, which includes code examples and an accompanying walk-through video.

🎓 Learn Apache Iceberg

Courses to level up your data engineering skills and make you an Apache Iceberg pro.

Hands-On with Apache Iceberg

Apache Iceberg combines the robustness of SQL tables with the flexibility to seamlessly integrate with engines like Apache Spark, providing a powerful approach for big data management. This course, created by Alex Merced and Dremio, guides you through using Apache Spark and Dremio to gain practical skills in setting up, querying, and managing Iceberg tables:

You’ll build a strong foundation in Apache Iceberg, exploring its role in modern data lakehouse architectures.
Learn how to configure catalogs, create and update tables, perform maintenance, and much more.
By the end, you’ll also be able to query Iceberg metadata tables using both Spark and Dremio.

This course is aimed at intermediate-level students, takes about an hour and a half to complete, and offers a certificate of completion from LinkedIn Learning.

Manage and Optimize Big Data with Apache Iceberg

As data continues to grow exponentially, Apache Iceberg stands out as a leading format for managing massive datasets. In this course, Deepak Goyal dives into Iceberg’s architecture, schema management, and table operations, providing you with essential skills to handle scalable data lakes with precision.

Learn how to maintain data integrity and enhance query performance through real-world examples and interactive demo sessions designed to elevate your data expertise. This course is aimed at intermediate-level students, takes about an hour and a half to complete, and offers a certificate of completion from LinkedIn Learning.

🎙️ Podcasts & Videos

Discover the hottest topics and discussions in the data engineering world.

What Types Of Projects Should Data Teams Work On?

Joe Reis recently wrote a piece titled Playing Not To Lose, in which he observed that many data teams are simply going through the motions and doing "data stuff", as he put it.

Ben Rogojan, AKA the Seattle Data Guy, invited Joe for a conversation to dig deeper into what he meant and how data teams can go on the offensive and take on projects that are actually worth investing in.

Hero Talk: Freelancing as a Data Engineer With the Seattle Data Guy - Plumbers of Data Science #28

In this Hero Talk episode, Andreas Kretz invites Ben Rogojan to be in the hero’s seat. Ben is a data engineer, YouTuber, and freelancer with a background at Facebook. He's become a go-to expert on freelancing for engineers, particularly in the data space.

In this podcast, Andreas dives into Ben's journey, from being a full-time engineer to making the switch to freelancing, how he built his own business, and the unique challenges freelancers face in this space. They also explore how to break into freelancing, the value of specializing in a specific skill, and practical tips on landing your first freelance clients.

Analytics Engineering Boot Camp Launch Q&A

Watch the replay of Zach Wilson’s livestream, in which he answers audience questions on data engineering and careers, with all-round advice for both newbies and seasoned professionals.

🗓️ Upcoming Events

Don’t miss out on these events taking place in the next few months, for the opportunity to expand your knowledge and network.

Apache Iceberg Bay Area Community Meetup

San Francisco | November 4, 2024 | 5-8 PM PST

Join this in-person event for networking and three awesome talks. Jack Ye, Senior Software Engineer at AWS Open Data Analytics, and Roni Burd, Director of Product Engineering at AWS, will open the evening with their presentation on how to accelerate your Iceberg workloads on S3.

Then, Yingjun Wu, founder of RisingWave Labs, will discuss how they implemented the Iceberg connector in Rust to address performance issues. Finally, Snehal Chennuru, Bryan Keller, and Tim Jiang, all engineers at Netflix, will talk about their journey from Hive to Iceberg, and the improvements they are currently making.

✨ Chill Data Summit ✨

Tel Aviv | November 6, 2024

The Chill Data Summit is an exciting opportunity to learn Apache Iceberg from project committers and industry users. It features talks, a hands-on workshop, and networking, all in a non-commercial, collaborative environment.

In addition to keynotes from experts, get hands-on training from distinguished engineers to build an optimized, production-grade Iceberg Lakehouse to support your business use cases. Tel Aviv is the last stop on our 2024 tour, so if you want to be there, be sure to register now.

Iceberg Community Sync

Online | November 13, 2024 | 11 AM EST / 4 PM GMT

Hosted every three weeks on a Wednesday, this Iceberg meeting is for anyone who wants to get involved in Iceberg development or documentation, or to learn about the roadmap.

DATA Festival Online

Online | November 13, 2024

DATA Festival Online is where theory meets practice to create real, actionable knowledge. This event brings together data people eager to grasp the realistic applications of AI and utilize them in their fields. The intention is to shift the conversation from hype to how, and collaborate in thought-provoking discussions, practical insights, and industry-leading expertise.

Snowflake Build 2024

Online | November 12-15, 2024

Join developers, data scientists, engineers, and data pros for three days of exclusive product announcements, in-depth technical sessions, and hands-on labs centered on Snowflake’s newest innovations. Discover how to build data pipelines, models, and applications for the era of generative AI and LLMs. Registration is now open for the Americas, EMEA, India and ASEAN, Australia and New Zealand, and Korea. In-person events to follow!

Data & AI Meetup #11

London | November 27, 2024 | 6-8 PM GMT

Here come the girls! The exciting November event features Lydia Monnington of Stuart, sharing tips and tricks for advanced SQL, with expert advice for dealing with real-world data. Lydia is an experienced data leader and has held data and analytics roles at Ocado, GHGSat, and Meta.

Also speaking is Reyam Enad of EDF, on Data Done Right: Essential Best Practice in Data Science. Reyam is currently a Lead Data Scientist in the nuclear energy sector and, as a leader in partnerships for Women in Data, she is passionate about championing women in technology.

Join this in-person meetup to build your network, make new friends, and enjoy engaging talks.

Data Day Texas, 2025

Austin, Texas | January 25, 2025 | 8 AM - 6 PM CST

This amazing data event is returning to Texas for 2025 and you’d be mad to miss out on the incredible line-up of speakers and topics on offer. Ole Olesen-Bagneux, O'Reilly author, creator of Meta Grid, and podcast host, will kick off proceedings with a Metadata Keynote, and Bethany Lyons will host a deep dive into automating financial reconciliation with linear programming and optimization.

Also part of the speaker line-up are Chill Data Summit speakers Joe Reis, Lisa Cau, Ryan Dolley, Chris Tabb, and Matthew Housley, among others! Go check out the rest, and be sure to register for your ticket!

Winter Data Conference 2025

Austria | March 7, 2025

The Winter Data Conference combines learning with Alpine fun! Experience the ultimate data immersion in the breath-taking Austrian Alps at the Zell am See Winter Data Conference!

This special event will include a full day of data-driven insights, cutting-edge discussions, and unparalleled networking opportunities, followed by cocktails and another day devoted to skiing (if you don’t have too many cocktails the night before 🍸). Enjoy talks from industry-leading data experts, including Joe Reis, co-author of Fundamentals of Data Engineering.

🎥 Event Replays

This year has already been packed with lakehouse events and presentations, so here are a few you might have missed. Grab a ☕ and enjoy some free learning.

Apache Iceberg and the Deconstructed Database

Watch this keynote from Julien Le Dem, a leading voice in the open-source community, for an insightful view of Apache Iceberg and the concept of the deconstructed database.

Julien, known for his contributions to projects like Apache Parquet and his pioneering work in data architecture, explores how Iceberg is transforming data lakes by offering schema evolution, partitioning, and efficient data management capabilities. This talk was recorded at the Chill Data Summit on Tour event in San Francisco, August 2024.

Open Source Data Summit

OSDS is a peer-to-peer gathering of data industry professionals, experts, and enthusiasts to explore the dynamic landscape of open source data tools and storage. The central theme of OSDS revolves around the advantages of open source data products and their pivotal role in modern data ecosystems.

The annual summit is a hub for knowledge exchange that fosters a deeper understanding of open source options and their role in shaping the data-driven future. Catch up on the talks from the Open Source Data Summit 2024, including the keynote speech from Vinoth Chandar, Founder and CEO at Onehouse, on unbundling your data platform with an open data lakehouse.

👋 Share Your News

Want your Apache Iceberg news or event to be featured in our next newsletter? Then we’d love to hear from you. DM us with the details and we’ll be in touch.

For this bi-weekly newsletter and other Chill Data Summit community posts, subscribe!