DataFrame write functions#

DataFrame write methods#

class pystarburst.dataframe_writer.DataFrameWriter(dataframe: DataFrame)#

Bases: object

Provides methods for writing data from a DataFrame to supported output destinations.

To use this object:

  1. Create an instance of a DataFrameWriter by accessing the DataFrame.write property.

  2. (Optional) Specify the save mode by calling mode(), which returns the same DataFrameWriter that is configured to save data using the specified mode. The default mode is “errorifexists”.

  3. Call save_as_table() or copy_into_location() to save the data to the specified destination, as in the sketch below.
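
A minimal end-to-end sketch of that workflow, assuming an active pystarburst session and a hypothetical table name:

>>> df = session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"])
>>> writer = df.write.mode("overwrite")  # steps 1 and 2: get the writer, set the save mode
>>> writer.save_as_table("my_table")     # step 3: save to the destination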

copy_into_location(location: str, format: str, *, catalog: str | None = None, compression: str | None = None, separator: str | None = None, header: bool | None = None) → None#

Executes the UNLOAD command to copy the data to the specified location.

Parameters:
  • location – The object storage location where the output is written.

  • format – The output file format. Supported values: ORC, PARQUET, AVRO, RCBINARY, RCTEXT, SEQUENCEFILE, JSON, OPENX_JSON, TEXTFILE, CSV.

  • catalog – A Hive catalog where the UNLOAD function is registered. If no catalog is provided, the session’s catalog is used.

  • compression – The compression codec for the output. Supported values: NONE (default), SNAPPY, LZ4, ZSTD, GZIP.

  • separator – Custom separator for the output file. Default is ‘,’ for CSV and ‘’ for TEXTFILE.

  • header – Whether the output file includes a header row (True) or not (False). Default is False.

The separator and header parameters apply only when the format argument is set to CSV or TEXTFILE.
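
For illustration, a hedged sketch of a CSV unload with a custom separator and a header row (the bucket path is hypothetical):

>>> df.write.copy_into_location(
...     location="s3://mybucket/csv/out",
...     format="CSV",
...     separator="|",
...     header=True,
... )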

Each format has its own constraints: the CSV format supports only VARCHAR columns, and AVRO files do not permit special characters in column names.
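
Because CSV supports only VARCHAR columns, non-string columns can be cast before unloading. A sketch, assuming pystarburst’s col() and Column.cast() behave like their Snowpark counterparts:

>>> from pystarburst.functions import col  # assumption: col() is available here
>>> str_df = df.select(col("a").cast("varchar").alias("a"),
...                    col("b").cast("varchar").alias("b"))
>>> str_df.write.copy_into_location(location="s3://mybucket/csv/strings", format="CSV")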

Examples

>>> df = session.create_dataframe([[1,2],[3,4]], schema=["a", "b"])
>>> df.write.copy_into_location(location="s3://mybucket/my/location", format="CSV")
>>> df = session.create_dataframe([[1,2],[3,4]], schema=["a", "b"])
>>> df.write.copy_into_location(location="s3://mybucket/my/location2", format="CSV", catalog="hive", compression="GZIP")

mode(save_mode: str) → DataFrameWriter#

Set the save mode of this DataFrameWriter.

Parameters:

save_mode – One of the following strings:

  • “append”: Append data of this DataFrame to existing data.

  • “overwrite”: Overwrite existing data.

  • “errorifexists”: Throw an exception if data already exists.

  • “ignore”: Ignore this operation if data already exists.

Default value is “errorifexists”.

Returns:

The same DataFrameWriter, enabling method chaining.
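
Because mode() returns the writer itself, the call chains directly into a save; a minimal sketch with a hypothetical table name:

>>> df.write.mode("ignore").save_as_table("maybe_existing_table")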

saveAsTable(table_name: str | Iterable[str], *, mode: str | None = None, column_order: str = 'index', table_properties: Dict[str, str | bool | int | float] = None, statement_properties: Dict[str, str] | None = None) → None#

Writes the data to the specified table in a Trino cluster.

saveAsTable() is an alias of save_as_table(); see save_as_table() below for the full parameter descriptions and examples.

save_as_table(table_name: str | Iterable[str], *, mode: str | None = None, column_order: str = 'index', table_properties: Dict[str, str | bool | int | float] = None, statement_properties: Dict[str, str] | None = None) → None#

Writes the data to the specified table in a Trino cluster.

saveAsTable() is an alias of save_as_table().

Parameters:
  • table_name – A string or list of strings that specify the table name or fully-qualified object identifier (database name, schema name, and table name).

  • mode

    One of the following values. When it’s None or not provided, the save mode set by mode() is used.

    • “append”: Append data of this DataFrame to existing data.

    • “overwrite”: Overwrite existing data.

    • “errorifexists”: Throw an exception if data already exists.

    • “ignore”: Ignore this operation if data already exists.

  • column_order

    When mode is “append”, data is inserted into the target table by matching column sequence or column name; see the sketch after this parameter list. Default is “index”. When mode is not “append”, column_order has no effect.

    • “index”: Data is inserted into the target table by column sequence.

    • “name”: Data is inserted into the target table by matching column names. If the target table has more columns than the source DataFrame, use this option.

  • table_properties – Any custom table properties used to create the table.
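
A sketch of a name-based append, assuming my_table already exists with columns a and b while the DataFrame lists them in the opposite order:

>>> df2 = session.create_dataframe([[6, 5]], schema=["b", "a"])
>>> df2.write.save_as_table("my_table", mode="append", column_order="name")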

Examples

>>> df = session.create_dataframe([[1,2],[3,4]], schema=["a", "b"])
>>> df.write.mode("overwrite").save_as_table("my_table")
>>> session.table("my_table").collect()
[Row(A=1, B=2), Row(A=3, B=4)]
>>> df.write.save_as_table("my_table", mode="append")
>>> session.table("my_table").collect()
[Row(A=1, B=2), Row(A=3, B=4), Row(A=1, B=2), Row(A=3, B=4)]
>>> df.write.mode("overwrite").save_as_table("my_table", table_properties={"format": "parquet"})
>>> session.table("my_table").collect()
[Row(A=1, B=2), Row(A=3, B=4)]
>>> df.write.mode("overwrite").save_as_table("my_table", table_properties={"partitioning": ["a"]})
>>> session.table("my_table").collect()
[Row(A=1, B=2), Row(A=3, B=4)]
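
table_name also accepts a fully-qualified identifier as a list of name parts; a hedged sketch with hypothetical catalog and schema names:

>>> df.write.mode("overwrite").save_as_table(["hive", "myschema", "my_table"])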