Pandas#
PyStarburst can export any DataFrame to a
pandas DataFrame
via to_pandas().
df = session.table("tpch.sf1.orders")
pandas_df = df.filter(df.orderstatus == "O").to_pandas()
Arrow-accelerated conversion#
When the server has the Arrow spooling protocol enabled, to_pandas() fetches data as
Apache Arrow Columnar Format segments and
decodes them in parallel, which can be up to 7× faster than the direct protocol.
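The parallel-decode step can be illustrated locally with a small sketch. This is not PyStarburst's actual implementation: the segments and the decode function are stand-ins (json is used as the codec so the sketch runs without pyarrow), but the thread-pool pattern mirrors how independently fetched segments can be decoded concurrently and reassembled in order.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in: each "segment" is an encoded chunk of rows,
# analogous to one Arrow IPC segment fetched from the server.
segments = [
    json.dumps([{"id": i, "part": p} for i in range(3)]).encode()
    for p in range(4)
]

def decode(segment: bytes):
    # In PyStarburst this step decodes Arrow columnar data;
    # json is used here only so the sketch has no dependencies.
    return json.loads(segment)

# Decode segments in parallel; the pool size plays the role of
# arrow_max_workers. pool.map preserves segment order.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(decode, segments))

rows = [row for chunk in chunks for row in chunk]
print(len(rows))  # 12 rows, reassembled in original segment order
```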
Prerequisites#
Install the pyarrow extra:
pip install "pystarburst[pyarrow]"
The Starburst Enterprise server must have Arrow spooling enabled:
Configure support for the spooling protocol on a Starburst cluster - Configuration steps
Add a property to enable Arrow, or Arrow with zstd compression:
protocol.spooling.encoding.arrow.enabled=true
# or: protocol.spooling.encoding.arrow+zstd.enabled=true
Add the following to the JVM configuration:
--add-opens=java.base/java.nio=ALL-UNNAMED
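Taken together, the server-side changes above might look as follows. The file names assume the standard Trino/Starburst configuration layout (etc/config.properties and etc/jvm.config); adjust to your deployment.

```text
# etc/config.properties — enable Arrow encoding for the spooling protocol
protocol.spooling.encoding.arrow.enabled=true
# or, for zstd-compressed Arrow:
# protocol.spooling.encoding.arrow+zstd.enabled=true

# etc/jvm.config — required for Arrow's direct memory access
--add-opens=java.base/java.nio=ALL-UNNAMED
```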
Configuration#
Pass encoding when creating the session. arrow_max_workers controls the
thread pool size used for parallel IPC decoding.
import trino
from pystarburst import Session
session = Session.builder.configs({
    "host": "<host>",
    "port": "<port>",
    "http_scheme": "https",
    "auth": trino.auth.BasicAuthentication("<user>", "<password>"),
    "encoding": "arrow-preview+zstd",  # or "arrow-preview"
    "arrow_max_workers": 8,            # optional
}).create()
pandas_df = session.sql("SELECT * FROM tpch.sf1.lineitem LIMIT 2000000").to_pandas()
You can also override arrow_max_workers for a single call:
pandas_df = df.to_pandas(arrow_max_workers=4)
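One way to pick a per-call value is to size it from the machine's CPU count. The helper below is illustrative only, not part of the PyStarburst API; the cap is an arbitrary assumption to avoid oversubscribing the interpreter with decoding threads.

```python
import os

def pick_arrow_workers(cap: int = 8) -> int:
    # One worker per CPU core, capped; os.cpu_count() can return None,
    # in which case fall back to a single worker.
    return max(1, min(cap, os.cpu_count() or 1))

workers = pick_arrow_workers()
# pandas_df = df.to_pandas(arrow_max_workers=workers)
```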