Column#

Column methods#

class pystarburst.column.CaseExpr(expr: CaseWhen)#

Bases: Column

Represents a CASE expression.

To construct this object for a CASE expression, call the functions.when() specifying a condition and the corresponding result for that condition. Then, call when() and otherwise() methods to specify additional conditions and results.

Examples:

>>> from pystarburst.functions import when, col, lit

>>> df = session.create_dataframe([[None], [1], [2]], schema=["a"])
>>> df.select(when(col("a").is_null(), lit(1)) \
...     .when(col("a") == 1, lit(2)) \
...     .otherwise(lit(3)).alias("case_when_column")).collect()
[Row(CASE_WHEN_COLUMN=1), Row(CASE_WHEN_COLUMN=2), Row(CASE_WHEN_COLUMN=3)]
else_(value: ColumnOrLiteral) CaseExpr#

Sets the default result for this CASE expression.

else_() is an alias of otherwise().

otherwise(value: ColumnOrLiteral) CaseExpr#

Sets the default result for this CASE expression.

else_() is an alias of otherwise().

when(condition: ColumnOrSqlExpr, value: ColumnOrLiteral) CaseExpr#

Appends one more WHEN condition to the CASE expression.

Parameters:
  • condition – A Column expression or SQL text representing the specified condition.

  • value – A Column expression or a literal value, which will be returned if condition is true.

class pystarburst.column.Column(expr: str | Expression, type: DataType = None)#

Bases: object

Represents a column or an expression in a DataFrame.

To access a Column object that refers a column in a DataFrame, you can:

  • Use the column name.

  • Use the functions.col() function.

  • Use the DataFrame.col() method.

  • Use the index operator [] on a dataframe object with a column name.

  • Use the dot operator . on a dataframe object with a column name.

    >>> from pystarburst.functions import col
    >>> df = session.create_dataframe([["John", 1], ["Mike", 11]], schema=["name", "age"])
    >>> df.select("name").collect()
    [Row(NAME='John'), Row(NAME='Mike')]
    
    >>> df.select(col("name")).collect()
    [Row(NAME='John'), Row(NAME='Mike')]
    
    >>> df.select(df.col("name")).collect()
    [Row(NAME='John'), Row(NAME='Mike')]
    
    >>> df.select(df["name"]).collect()
    [Row(NAME='John'), Row(NAME='Mike')]
    
    >>> df.select(df.name).collect()
    [Row(NAME='John'), Row(NAME='Mike')]
    

    Trino object identifiers are case-insensitive.

    The returned column names after a DataFrame is evaluated follow the case-sensitivity rules too. The above df was created with column name “name” while the returned column name after collect() was called became “NAME”. It’s because the column is regarded as ignore-case so the Trino cluster returns the upper case.

To create a Column object that represents a constant value, use functions.lit():

>>> from pystarburst.functions import lit
>>> df.select(col("name"), lit("const value").alias("literal_column")).collect()
[Row(NAME='John', LITERAL_COLUMN='const value'), Row(NAME='Mike', LITERAL_COLUMN='const value')]

This class also defines utility functions for constructing expressions with Columns. Column objects can be built with the operators, summarized by operator precedence, in the following table:

Operator

Description

x[index]

Index operator to get an item out of a Trino ARRAY or OBJECT

**

Power

-x, ~x

Unary minus, unary not

*, /, %

Multiply, divide, remainder

+, -

Plus, minus

&

And

|

Or

==, !=, <, <=, >, >=

Equal to, not equal to, less than, less than or equal to, greater than, greater than or equal to

The following examples demonstrate how to use Column objects in expressions:

>>> df = session.create_dataframe([[20, 5], [1, 2]], schema=["a", "b"])
>>> df.filter((col("a") == 20) | (col("b") <= 10)).collect()  # use parentheses before and after the | operator.
[Row(A=20, B=5), Row(A=1, B=2)]
>>> df.filter((df["a"] + df.b) < 10).collect()
[Row(A=1, B=2)]
>>> df.select((col("b") * 10).alias("c")).collect()
[Row(C=50), Row(C=20)]

When you use |, &, and ~ as logical operators on columns, you must always enclose column expressions with parentheses as illustrated in the above example, because their order precedence is higher than ==, <, etc.

Do not use and, or, and not logical operators on column objects, for instance, (df.col1 > 1) and (df.col2 > 2) is wrong. The reason is Python doesn’t have a magic method, or dunder method for them. It will raise an error and tell you to use |, & or ~, for which Python has magic methods. A side effect is if column: will raise an error because it has a hidden call to bool(a_column), like using the and operator. Use if a_column is None: instead.

To access elements of a semi-structured Object and Array, use [] on a Column object:

>>> from pystarburst.types import StringType, IntegerType
>>> df_with_semi_data = session.create_dataframe([[{"k1": "v1", "k2": "v2"}, ["a0", 1, "a2"]]], schema=["object_column", "array_column"])
>>> df_with_semi_data.select(df_with_semi_data["object_column"]["k1"].alias("k1_value"), df_with_semi_data["array_column"][0].alias("a0_value"), df_with_semi_data["array_column"][1].alias("a1_value")).collect()
[Row(K1_VALUE='"v1"', A0_VALUE='"a0"', A1_VALUE='1')]
>>> # The above two returned string columns have JSON literal values because children of semi-structured data are semi-structured.
>>> # The next line converts JSON literal to a string
>>> df_with_semi_data.select(df_with_semi_data["object_column"]["k1"].cast(StringType()).alias("k1_value"), df_with_semi_data["array_column"][0].cast(StringType()).alias("a0_value"), df_with_semi_data["array_column"][1].cast(IntegerType()).alias("a1_value")).collect()
[Row(K1_VALUE='v1', A0_VALUE='a0', A1_VALUE=1)]

This class has methods for the most frequently used column transformations and operators. Module functions defines many functions to transform columns.

alias(alias: str) Column#

Returns a new renamed Column. Alias of name().

as_(alias: str) Column#

Returns a new renamed Column. Alias of name().

asc() Column#

Returns a Column expression with values sorted in ascending order.

asc_nulls_first() Column#

Returns a Column expression with values sorted in ascending order (null values sorted before non-null values).

asc_nulls_last() Column#

Returns a Column expression with values sorted in ascending order (null values sorted after non-null values).

astype(to: str | DataType) Column#

Casts the value of the Column to the specified data type. It raises an error when the conversion can not be performed.

between(lower_bound: ColumnOrLiteral | Expression, upper_bound: ColumnOrLiteral | Expression) Column#

Between lower bound and upper bound.

bitand(other: ColumnOrLiteral | Expression) Column#

Bitwise and.

bitor(other: ColumnOrLiteral | Expression) Column#

Bitwise or.

bitwiseAND(other: ColumnOrLiteral | Expression) Column#

Bitwise and.

bitwiseOR(other: ColumnOrLiteral | Expression) Column#

Bitwise or.

bitwiseXOR(other: ColumnOrLiteral | Expression) Column#

Bitwise xor.

bitxor(other: ColumnOrLiteral | Expression) Column#

Bitwise xor.

cast(to: str | DataType) Column#

Casts the value of the Column to the specified data type. It raises an error when the conversion can not be performed.

contains(other: ColumnOrLiteralStr) Column#

Returns true if this Column contains another string.

Parameters:

other – A Column or a str that is used to check if this column contains it. A str will be interpreted as a literal value instead of a column name.

desc() Column#

Returns a Column expression with values sorted in descending order.

desc_nulls_first() Column#

Returns a Column expression with values sorted in descending order (null values sorted before non-null values).

desc_nulls_last() Column#

Returns a Column expression with values sorted in descending order (null values sorted after non-null values).

ends_with(other: ColumnOrLiteralStr) Column#

Returns true if this Column ends with another string.

Parameters:

other – A Column or a str that is used to check if this column ends with it. A str will be interpreted as a literal value instead of a column name.

endswith(other: ColumnOrLiteralStr) Column#

Returns true if this Column ends with another string.

Parameters:

other – A Column or a str that is used to check if this column ends with it. A str will be interpreted as a literal value instead of a column name.

eqNullSafe(other: ColumnOrLiteral | Expression)#

Null safe equals.

equal_nan() Column#

Is NaN.

equal_null(other: ColumnOrLiteral | Expression)#

Null safe equals.

ilike(pattern: ColumnOrLiteralStr) Column#

Allows case-insensitive matching of strings based on comparison with a pattern.

Parameters:

pattern – A Column or a str that indicates the pattern. A str will be interpreted as a literal value instead of a column name.

For details, see the Trino documentation on LIKE.

in_(*vals: LiteralType | Iterable[LiteralType] | DataFrame) Column#

Returns a conditional expression that you can pass to the DataFrame.filter() or where DataFrame.where() to perform the equivalent of a WHERE … IN query with a specified list of values. You can also pass this to a DataFrame.select() call.

The expression evaluates to true if the value in the column is one of the values in a specified sequence.

For example, the following code returns a DataFrame that contains the rows where the column “a” contains the value 1, 2, or 3. This is equivalent to SELECT * FROM table WHERE a IN (1, 2, 3).

isin() is an alias for in_().

Parameters:

vals – The values, or a DataFrame instance to use to check for membership against this column.

Examples

>>> from pystarburst.functions import lit
>>> df = session.create_dataframe([[1, "x"], [2, "y"] ,[4, "z"]], schema=["a", "b"])
>>> # Basic example
>>> df.filter(df["a"].in_(lit(1), lit(2), lit(3))).collect()
[Row(A=1, B='x'), Row(A=2, B='y')]

>>> # Check in membership for a DataFrame that has a single column
>>> df_for_in = session.create_dataframe([[1], [2] ,[3]], schema=["col1"])
>>> df.filter(df["a"].in_(df_for_in)).sort(df["a"].asc()).collect()
[Row(A=1, B='x'), Row(A=2, B='y')]

>>> # Use in with a select method call
>>> df.select(df["a"].in_(lit(1), lit(2), lit(3)).alias("is_in_list")).collect()
[Row(IS_IN_LIST=True), Row(IS_IN_LIST=True), Row(IS_IN_LIST=False)]
isNotNull() Column#

Is not null.

isNull() Column#

Is null.

is_not_null() Column#

Is not null.

is_null() Column#

Is null.

isin(*vals: LiteralType | Iterable[LiteralType] | DataFrame) Column#

Returns a conditional expression that you can pass to the DataFrame.filter() or where DataFrame.where() to perform the equivalent of a WHERE … IN query with a specified list of values. You can also pass this to a DataFrame.select() call.

The expression evaluates to true if the value in the column is one of the values in a specified sequence.

For example, the following code returns a DataFrame that contains the rows where the column “a” contains the value 1, 2, or 3. This is equivalent to SELECT * FROM table WHERE a IN (1, 2, 3).

isin() is an alias for in_().

Parameters:

vals – The values, or a DataFrame instance to use to check for membership against this column.

Examples

>>> from pystarburst.functions import lit
>>> df = session.create_dataframe([[1, "x"], [2, "y"] ,[4, "z"]], schema=["a", "b"])
>>> # Basic example
>>> df.filter(df["a"].in_(lit(1), lit(2), lit(3))).collect()
[Row(A=1, B='x'), Row(A=2, B='y')]

>>> # Check in membership for a DataFrame that has a single column
>>> df_for_in = session.create_dataframe([[1], [2] ,[3]], schema=["col1"])
>>> df.filter(df["a"].in_(df_for_in)).sort(df["a"].asc()).collect()
[Row(A=1, B='x'), Row(A=2, B='y')]

>>> # Use in with a select method call
>>> df.select(df["a"].in_(lit(1), lit(2), lit(3)).alias("is_in_list")).collect()
[Row(IS_IN_LIST=True), Row(IS_IN_LIST=True), Row(IS_IN_LIST=False)]
like(pattern: ColumnOrLiteralStr) Column#

Allows case-sensitive matching of strings based on comparison with a pattern.

Parameters:

pattern – A Column or a str that indicates the pattern. A str will be interpreted as a literal value instead of a column name.

For details, see the Trino documentation on LIKE.

name(alias: str) Column#

Returns a new renamed Column.

over(window: WindowSpec | None = None) Column#

Returns a window frame, based on the specified WindowSpec.

regexp(pattern: ColumnOrLiteralStr) Column#

Returns true if this Column contains the specified regular expression.

Parameters:

pattern – A Column or a str that indicates the pattern. A str will be interpreted as a literal value instead of a column name.

For details, see the Trino documentation on regular expressions.

rlike() is an alias of regexp(). regexp_like() is an alias of regexp().

regexp_like(pattern: ColumnOrLiteralStr) Column#

Returns true if this Column contains the specified regular expression.

Parameters:

pattern – A Column or a str that indicates the pattern. A str will be interpreted as a literal value instead of a column name.

For details, see the Trino documentation on regular expressions.

rlike() is an alias of regexp(). regexp_like() is an alias of regexp().

rlike(pattern: ColumnOrLiteralStr) Column#

Returns true if this Column contains the specified regular expression.

Parameters:

pattern – A Column or a str that indicates the pattern. A str will be interpreted as a literal value instead of a column name.

For details, see the Trino documentation on regular expressions.

rlike() is an alias of regexp(). regexp_like() is an alias of regexp().

starts_with(other: ColumnOrLiteralStr) Column#

Returns true if this Column starts with another string.

Parameters:

other – A Column or a str that is used to check if this column starts with it. A str will be interpreted as a literal value instead of a column name.

startswith(other: ColumnOrLiteralStr) Column#

Returns true if this Column starts with another string.

Parameters:

other – A Column or a str that is used to check if this column starts with it. A str will be interpreted as a literal value instead of a column name.

substr(start_pos: Column | int, length: Column | int) Column#

Returns a substring of this string column.

Parameters:
  • start_pos – The starting position of the substring. Please note that the first character has position 1 instead of 0 in Trino cluster.

  • length – The length of the substring.

substring() is an alias of substr().

substring(start_pos: Column | int, length: Column | int) Column#

Returns a substring of this string column.

Parameters:
  • start_pos – The starting position of the substring. Please note that the first character has position 1 instead of 0 in Trino cluster.

  • length – The length of the substring.

substring() is an alias of substr().

try_cast(to: str | DataType) Column#

Tries to cast the value of the Column to the specified data type. It returns a NULL value instead of raising an error when the conversion can not be performed.