SQL Server Big Data Clusters

SQL Server 2019 is out, and one of its most interesting additions is Big Data Clusters. In this episode, Kevin Feasel covers the marriage of SQL Server and Apache Spark. We discuss what Big Data Clusters are (and how they are not a SQL Server feature or an edition of SQL Server, but a thing in themselves), cover some of the architecture behind the solution, and explain how we can use them.

A scalable compute and storage architecture in SQL Server 2019 Big Data Clusters – James Serra

Episode Quotes

“I would probably start by pointing out one of the bigger use cases for big data clusters that kind of makes all the pieces fit, and that is, I need a data lake, but I don’t want to put my data in Azure or in AWS.”

“When Microsoft ported SQL Server to Linux, it made a lot of these Linux-based applications now, all of a sudden, available, or more available.”

“We have one extra pool called the compute pool. These things are our helper nodes. An end user is never going to connect to them directly. An end user isn’t going to care that they exist. They only work to make queries faster.”

Listen to Learn

00:38     Intro to the team and topic
01:41     Compañero Shout-Outs
02:02     SQL Server in the News
02:47     What do big data clusters have to do with data virtualization?
04:20     Circumstances under which you would want to not move your data
07:40     The time- and effort-saving benefit to big data clusters
10:03     Why big data clusters isn’t just Microsoft jumping the shark
16:00     Architecture concepts – don’t look at the diagram if you’re driving
21:57     When you’re spreading the fact out, you’re still paying for some data movement
23:46     There was a reason for Java Interrupt
25:22     Will the Fonz jump over a shark next time?
27:58     Ben Weissman and Enrico van de Laar are working on SQL Server Big Data Clusters Revealed
29:35     Last thoughts on big data clusters
32:14     Closing Thoughts

Credits

Music for SQL Server in the News by Mansardian

Meet the Hosts

Carlos Chacon

With more than 10 years of working with SQL Server, Carlos helps businesses ensure their SQL Server environments meet their users’ expectations. He can provide insights on performance, migrations, and disaster recovery. He is also active in the SQL Server community and regularly speaks at user group meetings and conferences. He helps support the free database monitoring tool found at databasehealth.com and provides training through SQL Trail events.

Eugene Meidinger

Eugene works as an independent BI consultant and Pluralsight author, specializing in Power BI and the Azure Data Platform. He has been working with data for over 8 years and speaks regularly at user groups and conferences. He also helps run the GroupBy online conference.

Kevin Feasel

Kevin is a Microsoft Data Platform MVP and proprietor of Catallaxy Services, LLC, where he specializes in T-SQL development, machine learning, and pulling rabbits out of hats on demand. He is the lead contributor to Curated SQL, president of the Triangle Area SQL Server Users Group, and author of the books PolyBase Revealed (Apress, 2020) and Finding Ghosts in Your Data: Anomaly Detection Techniques with Examples in Python (Apress, 2022). A resident of Durham, North Carolina, he can be found cycling the trails along the Triangle whenever the weather’s nice enough.

Want to Submit Some Feedback?

Did we miss something or not quite get it right? Want to be a guest or suggest a guest/topic for the podcast?

Let Us Know