
Introduction
Delta Sharing is an open data sharing protocol introduced by Databricks. It allows organizations to share large tables quickly and efficiently from cloud sources: an organization can make its tables in Azure or AWS available and provide an access token to external consumers.
Delta shares are a cost-effective way to move data between organizations without resorting to older methods such as SFTP transfers.
We will discuss using Python to connect to Delta Sharing through Pandas and PySpark.
It is also possible to connect directly in some reporting tools such as PowerBI; we will briefly discuss this process and some of its current limitations.
Apache Spark
Apache Spark is a data engine that runs in memory across clusters and is extremely fast and scalable. This makes it ideal for data science work on large datasets.
Spark is installed along with PySpark and is necessary to perform query operations on delta share tables.
GitHub Resources
We will first download an access token from the delta.io GitHub page, which also contains documentation, source code, and a helpful quickstart guide.
Package Installation
We will need to install three packages:
- delta-sharing for accessing delta share tables.
- PySpark for accessing Spark DataFrames (not required if using Pandas).
- delta-spark, required to use the Delta Sharing connector with PySpark.
Note that delta-sharing is not available on Conda at the time of this writing, so we will use pip.
We will also let delta-spark install PySpark so there is less need to worry about version compatibility issues between Spark and Delta Sharing.
pip install delta-sharing
pip install delta-spark
# delta-spark installs PySpark
import delta_sharing
import pandas as pd
from pyspark.sql import SparkSession
import findspark
We will load delta_sharing, pandas, and SparkSession. The findspark package is unnecessary if we set up our Spark environment variables, but to keep things simple we’ll let this package find Spark for us.
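If you would rather not rely on findspark, a minimal alternative is to point SPARK_HOME at your Spark installation yourself before building a session. The path below is a placeholder, not a real location:
import os
# Alternative to findspark: tell Python where Spark lives.
# Replace the placeholder path with your local Spark installation.
os.environ["SPARK_HOME"] = "/path/to/spark"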
Loading the Share File
To connect to a delta share, we need to point the package to our share file, which contains the connection URL and our access token.
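For reference, the .share profile is a small JSON file along these lines (the endpoint and token shown here are placeholders, not real credentials):
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<your access token>"
}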
profile_file = "<Your file Path>/open-datasets.share"
client = delta_sharing.SharingClient(profile_file)
client.list_all_tables()
[Table(name='COVID_19_NYT', share='delta_sharing', schema='default'),
Table(name='boston-housing', share='delta_sharing', schema='default'),
Table(name='flight-asa_2008', share='delta_sharing', schema='default'),
Table(name='lending_club', share='delta_sharing', schema='default'),
...
Listing all tables lets us know what is available and how to request our data. We will explore the lending_club dataset by passing a string in the form profile_path#share.schema.tablename
table_url = profile_file + "#delta_sharing.default.lending_club"
Loading with Pandas
The simplest way to load data is with a pandas dataframe by specifying the table_url with or without a limit.
The problem with this approach, if you are actually working with “big data”, is that it will try to load the entire table into RAM, which is a problem for local machines. Even if you have the RAM, it is extremely inefficient over a network connection.
The limit parameter starts reading at the beginning of the table, so it is not helpful for anything other than sampling some rows.
df = delta_sharing.load_as_pandas(table_url, limit = 100)
df
Loading Table Changes
A better approach with pandas is to use table versioning. This requires that the organization sharing data with you has Change Data Feed (CDF) turned on; otherwise it will result in an error.
The demo data does not have CDF enabled, but the process is simply to obtain the required table versions based on their dates and then pass the version information to load_table_changes_as_pandas().
from datetime import datetime, timedelta, date
# Start loading from three days ago, at midnight
LoadDate = date.today() - timedelta(days = 3)
LoadDate = datetime.combine(LoadDate, datetime.min.time())
# Get the current table version and the version as of LoadDate
CurrentVer = delta_sharing.get_table_version(table_url)
PastVer = delta_sharing.get_table_version(table_url, timestamp = LoadDate.strftime("%Y-%m-%d 00:00:00.0"))
The table versions are integer values (e.g. 2922 and 2924) which can then be passed to the pandas function. It is also possible to pass dates directly (see the GitHub repo for details).
df = delta_sharing.load_table_changes_as_pandas(table_url, starting_version = PastVer, ending_version = CurrentVer)
Note that either the starting or ending version can be omitted if it is not required, but for incremental loading we prefer to be precise with times to avoid gaps.
Loading table changes is ideal for incrementally loading large chunks of data. As a caution, though: if the table versioning breaks and cannot be restored, and you cannot fit all the data in RAM, you could find yourself in a pickle, with gaps in your dataset.
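If you prefer to pass dates directly instead of resolving version numbers first, the connector also accepts timestamp arguments. A minimal sketch, assuming the starting_timestamp and ending_timestamp parameters of your installed delta-sharing version (check the GitHub documentation for the exact names in your release):
# Sketch: request changes between two timestamps instead of versions.
# Parameter names are an assumption based on the connector's CDF interface.
df = delta_sharing.load_table_changes_as_pandas(
    table_url,
    starting_timestamp = LoadDate.strftime("%Y-%m-%d 00:00:00.0"),
    ending_timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.0")
)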
Delta Sharing with PySpark
Another option for working with delta shares is to use PySpark. This is a good option because it is more versatile than loading with Pandas: it is possible to filter, select columns, and otherwise transform the data in Spark.
Spark Session
The first thing we need to do to access our delta share in Spark is set our Spark home directory, assuming we do not already have $SPARK_HOME set. Following this, we need to create a SparkSession using the Delta Sharing connectors.
import findspark
location = findspark.find()
findspark.init(location, edit_rc=True)
spark = (SparkSession
    .builder
    .config('spark.jars.packages', 'io.delta:delta-sharing-spark_2.12:3.1.0')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate()
)
This code tells Spark where to find the required .jar files and which version to use; 2.12:3.1.0 (Scala 2.12, connector 3.1.0) was current at the time of writing. It is also possible to download the connectors and save them in your Spark folder, but that is outside the scope of this article.
If there is no instance of Spark running, this will spin one up that can be accessed in any web browser at http://localhost:4040.
Assuming Spark is running, you can verify the correct jars are loading by clicking on the “Environment” tab and looking next to spark.app.initial.jar.urls. You should see delta-sharing listed.
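If you prefer to check from code rather than the UI, a quick sanity check like the one below should echo the connector coordinates configured above. This inspects the session built earlier and returns None if the key was never set:
# Confirm the delta-sharing package was registered with this session
print(spark.sparkContext.getConf().get("spark.jars.packages"))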
Loading Data into Spark
We will load the lending club data into a Spark DataFrame.
df = delta_sharing.load_as_spark(table_url)
df.printSchema()
root
|-- id: string (nullable = true)
|-- member_id: string (nullable = true)
|-- loan_amnt: float (nullable = true)
|-- funded_amnt: integer (nullable = true)
|-- funded_amnt_inv: double (nullable = true)
|-- term: string (nullable = true)
|-- int_rate: string (nullable = true)
...
df.select(df.columns[:3]).take(5)
[Row(id=None, member_id=None, loan_amnt=15000.0),
Row(id=None, member_id=None, loan_amnt=14000.0),
Row(id=None, member_id=None, loan_amnt=20000.0),
Row(id=None, member_id=None, loan_amnt=8925.0),
Row(id=None, member_id=None, loan_amnt=8000.0)]
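Because the data is now in a Spark DataFrame, we can push filtering and column selection to Spark instead of pulling everything into local memory. A small sketch using columns from the schema above (the loan amount threshold is arbitrary):
# Keep only larger loans and a few columns; only 5 rows are collected locally
loans = (df
    .filter(df.loan_amnt > 10000)
    .select("loan_amnt", "funded_amnt", "term")
)
loans.show(5)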
Similar to Pandas, there is also a function to load table changes, which can be found in the documentation on GitHub.
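Again, CDF must be enabled on the shared table for this to work. A minimal sketch, reusing the version numbers obtained earlier and assuming the load_table_changes_as_spark function in your installed connector:
# Sketch: read table changes into a Spark DataFrame rather than pandas
changes = delta_sharing.load_table_changes_as_spark(
    table_url,
    starting_version = PastVer,
    ending_version = CurrentVer
)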
Reporting Tools
We can access delta sharing directly in some reporting tools. In practice this is less convenient than advertised for large datasets: loading times are fast, but significantly longer than over direct hardware connections.
PowerBI has a built-in delta-sharing connector, which is great for smaller datasets that can be loaded all at once. For very large datasets, PowerBI Desktop may run out of RAM, and the connector does not support CDF versioning.
This means setting up some type of filtering or aggregation in Power Query, possibly in conjunction with incremental loading so that older data isn’t deleted when new data is fetched. This can be clunky, the connector occasionally runs into other problems reading data, and the combination can cause headaches with large datasets in production.
Tableau requires a connector running over Node.js, which introduces an extra layer to worry about in the reporting chain.
We hope these reporting solutions progress, but currently our preferred approach is to use Python / PySpark to load data.
Conclusion
Delta Sharing is a fast and convenient way to share data across organizations. Depending on your use case and constraints, you may find it easiest to use Pandas, Spark, or a direct connection from a reporting tool.
Happy sharing!