Pandas#

PyStarburst can export any DataFrame to a pandas DataFrame via to_pandas().

df = session.table("tpch.sf1.orders")
# The full filtered result is fetched and materialized in local memory
pandas_df = df.filter(df.orderstatus == "O").to_pandas()
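The returned object is a regular pandas.DataFrame, so the usual pandas API applies downstream. A minimal local sketch — the frame below is constructed by hand to stand in for the fetched orders data, and the column values are illustrative:

```python
import pandas as pd

# Stand-in for the frame returned by to_pandas(); built locally for illustration.
pandas_df = pd.DataFrame({
    "orderkey": [1, 2, 3],
    "orderstatus": ["O", "O", "O"],
    "totalprice": [172799.49, 38426.09, 205654.30],
})

# Any pandas operation works on the converted result.
open_total = pandas_df["totalprice"].sum()
print(pandas_df.dtypes)
print(f"total price of open orders: {open_total:.2f}")
```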

Arrow-accelerated conversion#

When the Arrow spooling protocol is enabled on the server, to_pandas() fetches results as Apache Arrow Columnar Format segments and decodes them in parallel, which can be up to 7× faster than the direct protocol.

Prerequisites#

  1. Install the pyarrow extra:

    pip install "pystarburst[pyarrow]"
    
  2. Starburst Enterprise server must have Arrow spooling enabled:

     See the Starburst Enterprise documentation for the spooling protocol configuration steps.

     Add one of the following properties to enable Arrow encoding, optionally with zstd compression:

    protocol.spooling.encoding.arrow.enabled=true
    # or
    protocol.spooling.encoding.arrow+zstd.enabled=true
    

    Add the following to the JVM configuration:

    --add-opens=java.base/java.nio=ALL-UNNAMED
    

Configuration#

Pass the encoding option when creating the session. The optional arrow_max_workers setting controls the size of the thread pool used for parallel IPC decoding.

import trino
from pystarburst import Session

session = Session.builder.configs({
    "host": "<host>",
    "port": "<port>",
    "http_scheme": "https",
    "auth": trino.auth.BasicAuthentication("<user>", "<password>"),
    "encoding": "arrow-preview+zstd",   # or "arrow-preview"
    "arrow_max_workers": 8,             # optional
}).create()

pandas_df = session.sql("SELECT * FROM tpch.sf1.lineitem LIMIT 2_000_000").to_pandas()

You can also override arrow_max_workers for a single call:

pandas_df = df.to_pandas(arrow_max_workers=4)
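When tuning the override, one starting point is to derive the pool size from the available cores. The heuristic below is an assumption for illustration, not official guidance, and df is a DataFrame from an active session:

```python
import os

# Illustrative heuristic: one decode worker per core, leaving one core free.
workers = max(1, (os.cpu_count() or 2) - 1)

# pandas_df = df.to_pandas(arrow_max_workers=workers)  # requires a live session
print("arrow_max_workers:", workers)
```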