Column#
Column methods#
- class pystarburst.column.CaseExpr(expr: CaseWhen)#
Bases:
Column
Represents a CASE expression.
To construct this object for a CASE expression, call the
functions.when()
specifying a condition and the corresponding result for that condition. Then, callwhen()
andotherwise()
methods to specify additional conditions and results.Examples:
>>> from pystarburst.functions import when, col, lit >>> df = session.create_dataframe([[None], [1], [2]], schema=["a"]) >>> df.select(when(col("a").is_null(), lit(1)) \ ... .when(col("a") == 1, lit(2)) \ ... .otherwise(lit(3)).alias("case_when_column")).collect() [Row(CASE_WHEN_COLUMN=1), Row(CASE_WHEN_COLUMN=2), Row(CASE_WHEN_COLUMN=3)]
- else_(value: ColumnOrLiteral) CaseExpr #
Sets the default result for this CASE expression.
else_()
is an alias ofotherwise()
.
- otherwise(value: ColumnOrLiteral) CaseExpr #
Sets the default result for this CASE expression.
else_()
is an alias ofotherwise()
.
- class pystarburst.column.Column(expr: str | Expression, type: DataType = None)#
Bases:
object
Represents a column or an expression in a
DataFrame
.To access a Column object that refers a column in a
DataFrame
, you can:Use the column name.
Use the
functions.col()
function.Use the
DataFrame.col()
method.Use the index operator
[]
on a dataframe object with a column name.Use the dot operator
.
on a dataframe object with a column name.>>> from pystarburst.functions import col >>> df = session.create_dataframe([["John", 1], ["Mike", 11]], schema=["name", "age"]) >>> df.select("name").collect() [Row(NAME='John'), Row(NAME='Mike')] >>> df.select(col("name")).collect() [Row(NAME='John'), Row(NAME='Mike')] >>> df.select(df.col("name")).collect() [Row(NAME='John'), Row(NAME='Mike')] >>> df.select(df["name"]).collect() [Row(NAME='John'), Row(NAME='Mike')] >>> df.select(df.name).collect() [Row(NAME='John'), Row(NAME='Mike')]
Trino object identifiers are case-insensitive.
The returned column names after a DataFrame is evaluated follow the case-sensitivity rules too. The above
df
was created with column name “name” while the returned column name aftercollect()
was called became “NAME”. It’s because the column is regarded as ignore-case so the Trino cluster returns the upper case.
To create a Column object that represents a constant value, use
functions.lit()
:>>> from pystarburst.functions import lit >>> df.select(col("name"), lit("const value").alias("literal_column")).collect() [Row(NAME='John', LITERAL_COLUMN='const value'), Row(NAME='Mike', LITERAL_COLUMN='const value')]
This class also defines utility functions for constructing expressions with Columns. Column objects can be built with the operators, summarized by operator precedence, in the following table:
Operator
Description
x[index]
Index operator to get an item out of a Trino ARRAY or OBJECT
**
Power
-x
,~x
Unary minus, unary not
*
,/
,%
Multiply, divide, remainder
+
,-
Plus, minus
&
And
|
Or
==
,!=
,<
,<=
,>
,>=
Equal to, not equal to, less than, less than or equal to, greater than, greater than or equal to
The following examples demonstrate how to use Column objects in expressions:
>>> df = session.create_dataframe([[20, 5], [1, 2]], schema=["a", "b"]) >>> df.filter((col("a") == 20) | (col("b") <= 10)).collect() # use parentheses before and after the | operator. [Row(A=20, B=5), Row(A=1, B=2)] >>> df.filter((df["a"] + df.b) < 10).collect() [Row(A=1, B=2)] >>> df.select((col("b") * 10).alias("c")).collect() [Row(C=50), Row(C=20)]
When you use
|
,&
, and~
as logical operators on columns, you must always enclose column expressions with parentheses as illustrated in the above example, because their order precedence is higher than==
,<
, etc.Do not use
and
,or
, andnot
logical operators on column objects, for instance,(df.col1 > 1) and (df.col2 > 2)
is wrong. The reason is Python doesn’t have a magic method, or dunder method for them. It will raise an error and tell you to use|
,&
or~
, for which Python has magic methods. A side effect isif column:
will raise an error because it has a hidden call tobool(a_column)
, like using theand
operator. Useif a_column is None:
instead.To access elements of a semi-structured Object and Array, use
[]
on a Column object:>>> from pystarburst.types import StringType, IntegerType >>> df_with_semi_data = session.create_dataframe([[{"k1": "v1", "k2": "v2"}, ["a0", 1, "a2"]]], schema=["object_column", "array_column"]) >>> df_with_semi_data.select(df_with_semi_data["object_column"]["k1"].alias("k1_value"), df_with_semi_data["array_column"][0].alias("a0_value"), df_with_semi_data["array_column"][1].alias("a1_value")).collect() [Row(K1_VALUE='"v1"', A0_VALUE='"a0"', A1_VALUE='1')] >>> # The above two returned string columns have JSON literal values because children of semi-structured data are semi-structured. >>> # The next line converts JSON literal to a string >>> df_with_semi_data.select(df_with_semi_data["object_column"]["k1"].cast(StringType()).alias("k1_value"), df_with_semi_data["array_column"][0].cast(StringType()).alias("a0_value"), df_with_semi_data["array_column"][1].cast(IntegerType()).alias("a1_value")).collect() [Row(K1_VALUE='v1', A0_VALUE='a0', A1_VALUE=1)]
This class has methods for the most frequently used column transformations and operators. Module
functions
defines many functions to transform columns.- asc_nulls_first() Column #
Returns a Column expression with values sorted in ascending order (null values sorted before non-null values).
- asc_nulls_last() Column #
Returns a Column expression with values sorted in ascending order (null values sorted after non-null values).
- astype(to: str | DataType) Column #
Casts the value of the Column to the specified data type. It raises an error when the conversion can not be performed.
- between(lower_bound: ColumnOrLiteral | Expression, upper_bound: ColumnOrLiteral | Expression) Column #
Between lower bound and upper bound.
- cast(to: str | DataType) Column #
Casts the value of the Column to the specified data type. It raises an error when the conversion can not be performed.
- contains(other: ColumnOrLiteralStr) Column #
Returns true if this Column contains another string.
- Parameters:
other – A
Column
or astr
that is used to check if this column contains it. Astr
will be interpreted as a literal value instead of a column name.
- desc_nulls_first() Column #
Returns a Column expression with values sorted in descending order (null values sorted before non-null values).
- desc_nulls_last() Column #
Returns a Column expression with values sorted in descending order (null values sorted after non-null values).
- ends_with(other: ColumnOrLiteralStr) Column #
Returns true if this Column ends with another string.
- Parameters:
other – A
Column
or astr
that is used to check if this column ends with it. Astr
will be interpreted as a literal value instead of a column name.
- endswith(other: ColumnOrLiteralStr) Column #
Returns true if this Column ends with another string.
- Parameters:
other – A
Column
or astr
that is used to check if this column ends with it. Astr
will be interpreted as a literal value instead of a column name.
- eqNullSafe(other: ColumnOrLiteral | Expression)#
Null safe equals.
- equal_null(other: ColumnOrLiteral | Expression)#
Null safe equals.
- ilike(pattern: ColumnOrLiteralStr) Column #
Allows case-insensitive matching of strings based on comparison with a pattern.
- Parameters:
pattern – A
Column
or astr
that indicates the pattern. Astr
will be interpreted as a literal value instead of a column name.
For details, see the Trino documentation on LIKE.
- in_(*vals: LiteralType | Iterable[LiteralType] | DataFrame) Column #
Returns a conditional expression that you can pass to the
DataFrame.filter()
or whereDataFrame.where()
to perform the equivalent of a WHERE … IN query with a specified list of values. You can also pass this to aDataFrame.select()
call.The expression evaluates to true if the value in the column is one of the values in a specified sequence.
For example, the following code returns a DataFrame that contains the rows where the column “a” contains the value 1, 2, or 3. This is equivalent to
SELECT * FROM table WHERE a IN (1, 2, 3)
.- Parameters:
vals – The values, or a
DataFrame
instance to use to check for membership against this column.
Examples
>>> from pystarburst.functions import lit >>> df = session.create_dataframe([[1, "x"], [2, "y"] ,[4, "z"]], schema=["a", "b"]) >>> # Basic example >>> df.filter(df["a"].in_(lit(1), lit(2), lit(3))).collect() [Row(A=1, B='x'), Row(A=2, B='y')] >>> # Check in membership for a DataFrame that has a single column >>> df_for_in = session.create_dataframe([[1], [2] ,[3]], schema=["col1"]) >>> df.filter(df["a"].in_(df_for_in)).sort(df["a"].asc()).collect() [Row(A=1, B='x'), Row(A=2, B='y')] >>> # Use in with a select method call >>> df.select(df["a"].in_(lit(1), lit(2), lit(3)).alias("is_in_list")).collect() [Row(IS_IN_LIST=True), Row(IS_IN_LIST=True), Row(IS_IN_LIST=False)]
- isin(*vals: LiteralType | Iterable[LiteralType] | DataFrame) Column #
Returns a conditional expression that you can pass to the
DataFrame.filter()
or whereDataFrame.where()
to perform the equivalent of a WHERE … IN query with a specified list of values. You can also pass this to aDataFrame.select()
call.The expression evaluates to true if the value in the column is one of the values in a specified sequence.
For example, the following code returns a DataFrame that contains the rows where the column “a” contains the value 1, 2, or 3. This is equivalent to
SELECT * FROM table WHERE a IN (1, 2, 3)
.- Parameters:
vals – The values, or a
DataFrame
instance to use to check for membership against this column.
Examples
>>> from pystarburst.functions import lit >>> df = session.create_dataframe([[1, "x"], [2, "y"] ,[4, "z"]], schema=["a", "b"]) >>> # Basic example >>> df.filter(df["a"].in_(lit(1), lit(2), lit(3))).collect() [Row(A=1, B='x'), Row(A=2, B='y')] >>> # Check in membership for a DataFrame that has a single column >>> df_for_in = session.create_dataframe([[1], [2] ,[3]], schema=["col1"]) >>> df.filter(df["a"].in_(df_for_in)).sort(df["a"].asc()).collect() [Row(A=1, B='x'), Row(A=2, B='y')] >>> # Use in with a select method call >>> df.select(df["a"].in_(lit(1), lit(2), lit(3)).alias("is_in_list")).collect() [Row(IS_IN_LIST=True), Row(IS_IN_LIST=True), Row(IS_IN_LIST=False)]
- like(pattern: ColumnOrLiteralStr) Column #
Allows case-sensitive matching of strings based on comparison with a pattern.
- Parameters:
pattern – A
Column
or astr
that indicates the pattern. Astr
will be interpreted as a literal value instead of a column name.
For details, see the Trino documentation on LIKE.
- over(window: WindowSpec | None = None) Column #
Returns a window frame, based on the specified
WindowSpec
.
- regexp(pattern: ColumnOrLiteralStr) Column #
Returns true if this Column contains the specified regular expression.
- Parameters:
pattern – A
Column
or astr
that indicates the pattern. Astr
will be interpreted as a literal value instead of a column name.
For details, see the Trino documentation on regular expressions.
rlike()
is an alias ofregexp()
.regexp_like()
is an alias ofregexp()
.
- regexp_like(pattern: ColumnOrLiteralStr) Column #
Returns true if this Column contains the specified regular expression.
- Parameters:
pattern – A
Column
or astr
that indicates the pattern. Astr
will be interpreted as a literal value instead of a column name.
For details, see the Trino documentation on regular expressions.
rlike()
is an alias ofregexp()
.regexp_like()
is an alias ofregexp()
.
- rlike(pattern: ColumnOrLiteralStr) Column #
Returns true if this Column contains the specified regular expression.
- Parameters:
pattern – A
Column
or astr
that indicates the pattern. Astr
will be interpreted as a literal value instead of a column name.
For details, see the Trino documentation on regular expressions.
rlike()
is an alias ofregexp()
.regexp_like()
is an alias ofregexp()
.
- starts_with(other: ColumnOrLiteralStr) Column #
Returns true if this Column starts with another string.
- Parameters:
other – A
Column
or astr
that is used to check if this column starts with it. Astr
will be interpreted as a literal value instead of a column name.
- startswith(other: ColumnOrLiteralStr) Column #
Returns true if this Column starts with another string.
- Parameters:
other – A
Column
or astr
that is used to check if this column starts with it. Astr
will be interpreted as a literal value instead of a column name.
- substr(start_pos: Column | int, length: Column | int) Column #
Returns a substring of this string column.
- Parameters:
start_pos – The starting position of the substring. Please note that the first character has position 1 instead of 0 in Trino cluster.
length – The length of the substring.
substring()
is an alias ofsubstr()
.
- substring(start_pos: Column | int, length: Column | int) Column #
Returns a substring of this string column.
- Parameters:
start_pos – The starting position of the substring. Please note that the first character has position 1 instead of 0 in Trino cluster.
length – The length of the substring.
substring()
is an alias ofsubstr()
.