Pandas#

PyStarburst can export any DataFrame to a pandas DataFrame via to_pandas().

df = session.table("tpch.sf1.orders")
# The full filtered result is fetched and materialized in local memory
pandas_df = df.filter(df.orderstatus == "O").to_pandas()
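The returned object is a regular pandas.DataFrame, so the usual pandas API applies downstream. A minimal local sketch — the frame below is constructed by hand to stand in for the fetched orders data, and the column values are illustrative:

```python
import pandas as pd

# Stand-in for the frame returned by to_pandas(); built locally for illustration.
pandas_df = pd.DataFrame({
    "orderkey": [1, 2, 3],
    "orderstatus": ["O", "O", "O"],
    "totalprice": [172799.49, 38426.09, 205654.30],
})

# Any pandas operation works on the converted result.
open_total = pandas_df["totalprice"].sum()
print(pandas_df.dtypes)
print(f"total price of open orders: {open_total:.2f}")
```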

Arrow-accelerated conversion#

When the Arrow spooling protocol is enabled on the server, to_pandas() fetches results as Apache Arrow Columnar Format segments and decodes them in parallel, which can be up to 7× faster than the direct protocol.

Prerequisites#

  1. Install the pyarrow extra:

    pip install "pystarburst[pyarrow]"
    
  2. Starburst Enterprise server must have Arrow spooling enabled:

     See the Starburst Enterprise documentation for the spooling protocol configuration steps.

     Add one of the following properties to enable Arrow encoding, optionally with zstd compression:

    protocol.spooling.encoding.arrow.enabled=true
    # or
    protocol.spooling.encoding.arrow+zstd.enabled=true
    

    Add the following to the JVM configuration:

    --add-opens=java.base/java.nio=ALL-UNNAMED
    

Configuration#

Pass the encoding option when creating the session. The optional arrow_max_workers setting controls the size of the thread pool used for parallel IPC decoding.

import trino
from pystarburst import Session

session = Session.builder.configs({
    "host": "<host>",
    "port": "<port>",
    "http_scheme": "https",
    "auth": trino.auth.BasicAuthentication("<user>", "<password>"),
    "encoding": "arrow-preview+zstd",   # or "arrow-preview"
    "arrow_max_workers": 8,             # optional
}).create()

pandas_df = session.sql("SELECT * FROM tpch.sf1.lineitem LIMIT 2_000_000").to_pandas()

You can also override arrow_max_workers for a single call:

pandas_df = df.to_pandas(arrow_max_workers=4)
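When tuning the override, one starting point is to derive the pool size from the available cores. The heuristic below is an assumption for illustration, not official guidance, and df is a DataFrame from an active session:

```python
import os

# Illustrative heuristic: one decode worker per core, leaving one core free.
workers = max(1, (os.cpu_count() or 2) - 1)

# pandas_df = df.to_pandas(arrow_max_workers=workers)  # requires a live session
print("arrow_max_workers:", workers)
```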