Dataframe write functions#
Dataframe write methods#
- class pystarburst.dataframe_writer.DataFrameWriter(dataframe: DataFrame)#
Bases: object
Provides methods for writing data from a DataFrame to supported output destinations. To use this object:
1. Create an instance of a DataFrameWriter by accessing the DataFrame.write property.
2. (Optional) Specify the save mode by calling mode(), which returns the same DataFrameWriter configured to save data using the specified mode. The default mode is “errorifexists”.
3. Call save_as_table() or copy_into_location() to save the data to the specified destination, as sketched below.
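For example, a minimal end-to-end sketch (assuming an existing pystarburst session bound to the name session; the table name is illustrative):
>>> df = session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"])
>>> df.write.mode("overwrite").save_as_table("my_table")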
- copy_into_location(location: str, format: str, *, catalog: str | None = None, compression: str | None = None, separator: str | None = None, header: bool | None = None) → None #
Executes the UNLOAD command to copy the data to the specified location.
- Parameters:
location – An object storage location where the output is written.
format – Supported format parameters: ORC, PARQUET, AVRO, RCBINARY, RCTEXT, SEQUENCEFILE, JSON, OPENX_JSON, TEXTFILE, CSV.
catalog – A Hive catalog where the UNLOAD function is registered. If no catalog is provided, the session’s catalog is used.
compression – Supported compression parameters: NONE (default), SNAPPY, LZ4, ZSTD, GZIP.
separator – Custom separator for the output file. Default is ‘,’ for CSV and ‘’ for TEXTFILE.
header – Whether the output file should include a header (True) or not (False). Default is False.
The separator and header parameters are applicable only when the format argument is set to CSV or TEXTFILE.
Each format has its own set of constraints: the CSV format supports only VARCHAR columns, and AVRO files do not permit special characters in column names.
Examples
>>> df = session.create_dataframe([[1,2],[3,4]], schema=["a", "b"])
>>> df.write.copy_into_location(location="s3://mybucket/my/location", format="CSV")

>>> df = session.create_dataframe([[1,2],[3,4]], schema=["a", "b"])
>>> df.write.copy_into_location(location="s3://mybucket/my/location2", format="CSV", catalog="hive", compression="GZIP")
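The CSV-only options can be passed the same way; a sketch (the separator value and output location are illustrative):
>>> df.write.copy_into_location(location="s3://mybucket/my/location3", format="CSV", separator="|", header=True)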
- mode(save_mode: str) → DataFrameWriter #
Set the save mode of this DataFrameWriter.
- Parameters:
save_mode – One of the following strings:
“append”: Append data of this DataFrame to existing data.
“overwrite”: Overwrite existing data.
“errorifexists”: Throw an exception if data already exists.
“ignore”: Ignore this operation if data already exists.
Default value is “errorifexists”.
- Returns:
The DataFrameWriter itself.
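Because mode() returns the writer itself, it is typically chained with a save call; a minimal sketch (the table name is illustrative):
>>> df.write.mode("ignore").save_as_table("my_table")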
- saveAsTable(table_name: str | Iterable[str], *, mode: str | None = None, column_order: str = 'index', table_properties: Dict[str, str | bool | int | float] = None, statement_properties: Dict[str, str] | None = None) → None #
Writes the data to the specified table in a Trino cluster. saveAsTable() is an alias of save_as_table().
- Parameters:
table_name – A string or list of strings that specify the table name or fully-qualified object identifier (database name, schema name, and table name).
mode – One of the following values. When it is None or not provided, the save mode set by mode() is used.
“append”: Append data of this DataFrame to existing data.
“overwrite”: Overwrite existing data.
“errorifexists”: Throw an exception if data already exists.
“ignore”: Ignore this operation if data already exists.
column_order – When mode is “append”, data will be inserted into the target table by matching column sequence or column name. Default is “index”. When mode is not “append”, column_order makes no difference.
“index”: Data will be inserted into the target table by column sequence.
“name”: Data will be inserted into the target table by matching column names. If the target table has more columns than the source DataFrame, use this option.
table_properties – Any custom table properties used to create the table.
Examples
>>> df = session.create_dataframe([[1,2],[3,4]], schema=["a", "b"])
>>> df.write.mode("overwrite").save_as_table("my_table")
>>> session.table("my_table").collect()
[Row(A=1, B=2), Row(A=3, B=4)]
>>> df.write.save_as_table("my_table", mode="append")
>>> session.table("my_table").collect()
[Row(A=1, B=2), Row(A=3, B=4), Row(A=1, B=2), Row(A=3, B=4)]
>>> df.write.mode("overwrite").save_as_table("my_table", table_properties={"format": "parquet"})
>>> session.table("my_table").collect()
[Row(A=1, B=2), Row(A=3, B=4)]
>>> df.write.mode("overwrite").save_as_table("my_table", table_properties={"partitioning": ["a"]})
>>> session.table("my_table").collect()
[Row(A=1, B=2), Row(A=3, B=4)]
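A sketch of appending by column name rather than by position (assumes my_table already exists with columns a and b):
>>> df2 = session.create_dataframe([[5, 6]], schema=["b", "a"])
>>> df2.write.save_as_table("my_table", mode="append", column_order="name")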
- save_as_table(table_name: str | Iterable[str], *, mode: str | None = None, column_order: str = 'index', table_properties: Dict[str, str | bool | int | float] = None, statement_properties: Dict[str, str] | None = None) → None #
Writes the data to the specified table in a Trino cluster. saveAsTable() is an alias of save_as_table().
- Parameters:
table_name – A string or list of strings that specify the table name or fully-qualified object identifier (database name, schema name, and table name).
mode – One of the following values. When it is None or not provided, the save mode set by mode() is used.
“append”: Append data of this DataFrame to existing data.
“overwrite”: Overwrite existing data.
“errorifexists”: Throw an exception if data already exists.
“ignore”: Ignore this operation if data already exists.
column_order – When mode is “append”, data will be inserted into the target table by matching column sequence or column name. Default is “index”. When mode is not “append”, column_order makes no difference.
“index”: Data will be inserted into the target table by column sequence.
“name”: Data will be inserted into the target table by matching column names. If the target table has more columns than the source DataFrame, use this option.
table_properties – Any custom table properties used to create the table.
Examples
>>> df = session.create_dataframe([[1,2],[3,4]], schema=["a", "b"])
>>> df.write.mode("overwrite").save_as_table("my_table")
>>> session.table("my_table").collect()
[Row(A=1, B=2), Row(A=3, B=4)]
>>> df.write.save_as_table("my_table", mode="append")
>>> session.table("my_table").collect()
[Row(A=1, B=2), Row(A=3, B=4), Row(A=1, B=2), Row(A=3, B=4)]
>>> df.write.mode("overwrite").save_as_table("my_table", table_properties={"format": "parquet"})
>>> session.table("my_table").collect()
[Row(A=1, B=2), Row(A=3, B=4)]
>>> df.write.mode("overwrite").save_as_table("my_table", table_properties={"partitioning": ["a"]})
>>> session.table("my_table").collect()
[Row(A=1, B=2), Row(A=3, B=4)]
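A sketch of writing to a fully-qualified object identifier by passing the name parts as a list (the database and schema names are illustrative):
>>> df.write.mode("overwrite").save_as_table(["my_database", "my_schema", "my_table"])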