Dataframe functions#

Provides utility and SQL functions that generate Column expressions that you can pass to DataFrame transformation methods.

These utility functions generate references to columns, literals, and SQL expressions (e.g. “c + 1”).

  • Use col() to convert a column name to a Column object. Refer to the API docs of Column for more ways of referencing a column.

  • Use lit() to convert a Python value to a Column object that represents a constant value in Trino SQL.

  • Use sql_expr() to convert a Trino SQL expression to a Column.

    >>> df = session.create_dataframe([[1, 'a', True, '2022-03-16'], [3, 'b', False, '2023-04-17']], schema=["a", "b", "c", "d"])
    >>> res1 = df.filter(col("a") == 1).collect()
    >>> res2 = df.filter(lit(1) == col("a")).collect()
    >>> res3 = df.filter(sql_expr("a = 1")).collect()
    >>> assert res1 == res2 == res3
    >>> res1
    [Row(A=1, B='a', C=True, D='2022-03-16')]
    

Some DataFrame methods accept column names or SQL expression text in addition to a Column object, for convenience. For instance:

>>> df.filter("a = 1").collect()  # use the SQL expression directly in filter
[Row(A=1, B='a', C=True, D='2022-03-16')]
>>> df.select("a").collect()
[Row(A=1), Row(A=3)]

whereas Column objects let you chain column operators and transformations fluently in Python code:

>>> # Use columns and literals in expressions.
>>> df.select(((col("a") + 1).cast("string")).alias("add_one")).show()
-------------
|"ADD_ONE"  |
-------------
|2          |
|4          |
-------------

Trino has hundreds of SQL functions. This module provides Python functions that correspond to the Trino SQL functions. They typically accept Column objects or column names as input parameters and return a new Column object. The following examples demonstrate the use of some of these functions:

>>> # This example calls the functions that correspond to the DATEADD() and TO_DATE() SQL functions.
>>> df.select(dateadd('day', lit(1), to_date(col("d")))).show()
---------------------------------------
|"DATEADD('DAY', 1, TO_DATE(""D""))"  |
---------------------------------------
|2022-03-17                           |
|2023-04-18                           |
---------------------------------------

If you want to use a SQL function in Trino but can’t find the corresponding Python function here, you can create your own Python function with function():

>>> my_radians = function("radians")  # "radians" is the SQL function name.
>>> df.select(my_radians(col("a")).alias("my_radians")).show()
------------------------
|"MY_RADIANS"          |
------------------------
|0.017453292519943295  |
|0.05235987755982988   |
------------------------

or call the SQL function directly:

>>> df.select(call_function("radians", col("a")).as_("call_function_radians")).show()
---------------------------
|"CALL_FUNCTION_RADIANS"  |
---------------------------
|0.017453292519943295     |
|0.05235987755982988      |
---------------------------

How to find help on input parameters of the Python functions for SQL functions: the Python functions have the same name as the corresponding SQL functions.

By reading the API docs or the source code of a Python function defined in this module, you’ll see the type hints of the input parameters and return type. The return type is always Column. The input types tell you the acceptable values:

  • ColumnOrName accepts a Column object, or a column name in str. Most functions accept this type. If you still want to pass a literal to it, use lit(value), which returns a Column object that represents a literal value.

    >>> df.select(avg("a")).show()
    ----------------
    |"AVG(""A"")"  |
    ----------------
    |2.000000      |
    ----------------
    
    >>> df.select(avg(col("a"))).show()
    ----------------
    |"AVG(""A"")"  |
    ----------------
    |2.000000      |
    ----------------
    
  • LiteralType accepts a value of type bool, int, float, str, bytearray, decimal.Decimal, datetime.date, datetime.datetime, datetime.time, or bytes. An example is the third parameter of lead().

    >>> import datetime
    >>> from pystarburst.window import Window
    >>> df.select(col("d"), lead("d", 1, datetime.date(2024, 5, 18), False).over(Window.order_by("d")).alias("lead_day")).show()
    ---------------------------
    |"D"         |"LEAD_DAY"  |
    ---------------------------
    |2022-03-16  |2023-04-17  |
    |2023-04-17  |2024-05-18  |
    ---------------------------
    
  • ColumnOrLiteral accepts a Column object, or a value of LiteralType mentioned above. The difference from ColumnOrName is that ColumnOrLiteral regards a str value as a SQL string value instead of a column name. When a function is much more likely to accept a SQL constant value than a column expression, ColumnOrLiteral is used. Yet you can still pass in a Column object if you need to. An example is the second parameter of when().

    >>> df.select(when(df["a"] > 2, "Greater than 2").else_("Less than 2").alias("compare_with_2")).show()
    --------------------
    |"COMPARE_WITH_2"  |
    --------------------
    |Less than 2       |
    |Greater than 2    |
    --------------------
    
  • int, bool, str, or another specific type accepts a value of that type. An example is to_decimal().

    >>> df.with_column("e", lit("1.2")).select(to_decimal("e", 5, 2)).show()
    -----------------------------
    |"TO_DECIMAL(""E"", 5, 2)"  |
    -----------------------------
    |1.20                       |
    |1.20                       |
    -----------------------------
    
  • ColumnOrSqlExpr accepts a Column object, or a SQL expression. For instance, the first parameter of when().

    >>> df.select(when("a > 2", "Greater than 2").else_("Less than 2").alias("compare_with_2")).show()
    --------------------
    |"COMPARE_WITH_2"  |
    --------------------
    |Less than 2       |
    |Greater than 2    |
    --------------------
    
pystarburst.functions.abs(e: ColumnOrName) Column#

Returns the absolute value of a numeric expression.

pystarburst.functions.acos(e: ColumnOrName) Column#

Computes the inverse cosine (arc cosine) of its input; the result is a number in the interval [0, pi].

pystarburst.functions.acosh(e: ColumnOrName) Column#

Computes the inverse hyperbolic cosine of its input; the result is a non-negative number.

pystarburst.functions.add_months(date_or_timestamp: ColumnOrName, number_of_months: Column | int) Column#

Adds or subtracts a specified number of months to a date or timestamp, preserving the end-of-month information.

Example

>>> import datetime
>>> df = session.create_dataframe([datetime.date(2022, 4, 6)], schema=["d"])
>>> df.select(add_months("d", 4)).collect()[0][0]
datetime.date(2022, 8, 6)
pystarburst.functions.aggregate(array: ColumnOrName, initialState: ColumnOrName, input_func: Callable, output_func: Callable | None = None) Column#

Returns a single value reduced from array. input_func will be invoked for each element in array, in order. In addition to taking the element, input_func takes the current state, initially initialState, and returns the new state. output_func will be invoked to turn the final state into the result value. It may be the identity function (i -> i), which is the default if not specified.

Example::
>>> df = session.createDataFrame([(1, [20.0, 4.0, 2.0, 6.0, 10.0])], ["id", "values"])
>>> df.select(aggregate("values", lit(0.0), lambda acc, x: acc + x).alias("sum")).show()
---------
|"sum"  |
---------
|42.0   |
---------
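
Outside a Trino session, the same fold can be sketched in plain Python with functools.reduce. This is an illustrative analogue (the name aggregate_analogue is invented here), not the pystarburst implementation:

```python
from functools import reduce

def aggregate_analogue(array, initial_state, input_func, output_func=None):
    """Illustrative plain-Python fold mirroring aggregate():
    input_func combines the running state with each element in order,
    starting from initial_state; output_func (identity by default)
    maps the final state to the result."""
    state = reduce(input_func, array, initial_state)
    return state if output_func is None else output_func(state)

# Mirrors the DataFrame example above: 0.0 + 20.0 + 4.0 + 2.0 + 6.0 + 10.0
total = aggregate_analogue([20.0, 4.0, 2.0, 6.0, 10.0], 0.0, lambda acc, x: acc + x)
```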
pystarburst.functions.all_match(array: ColumnOrName, func: Callable) Column#

Returns whether all elements of an array match the given predicate. Returns true if all the elements match the predicate (a special case is when the array is empty); false if one or more elements don’t match; NULL if the predicate function returns NULL for one or more elements and true for all other elements.

Example::
>>> df = session.createDataFrame(
...     [(1, ["bar"]), (2, ["foo", "bar"]), (3, ["foobar", "foo"])],
...     ["key", "values"]
... )
>>> df.select(all_match("values", lambda x: x.rlike("foo")).alias("all_foo")).show()
-------------
|"all_foo"  |
-------------
|False      |
|False      |
|True       |
-------------
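
The three-valued logic described above can be sketched in plain Python, with None standing in for SQL NULL (all_match_analogue is an illustrative name, not part of the API):

```python
def all_match_analogue(array, pred):
    """Illustrative three-valued all_match: False as soon as any
    element fails the predicate; None (standing in for SQL NULL) if
    the predicate returned None for some element and True for the
    rest; True otherwise, including for an empty array."""
    saw_null = False
    for element in array:
        result = pred(element)
        if result is False:
            return False
        if result is None:
            saw_null = True
    return None if saw_null else True
```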
pystarburst.functions.any_match(array: ColumnOrName, func: Callable) Column#

Returns whether any elements of an array match the given predicate. Returns true if one or more elements match the predicate; false if none of the elements matches (a special case is when the array is empty); NULL if the predicate function returns NULL for one or more elements and false for all other elements.

Example::
>>> df = session.createDataFrame([(1, [1, 2, 3, 4]), (2, [3, -1, 0])],["key", "values"])
>>> df.select(any_match("values", lambda x: x < 0).alias("any_negative")).show()
------------------
|"any_negative"  |
------------------
|False           |
|True            |
------------------
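
The mirror-image three-valued logic can likewise be sketched in plain Python (any_match_analogue is an illustrative name):

```python
def any_match_analogue(array, pred):
    """Illustrative three-valued any_match: True as soon as any
    element matches; None (SQL NULL) if the predicate returned None
    for some element and no element matched; False otherwise,
    including for an empty array."""
    saw_null = False
    for element in array:
        result = pred(element)
        if result is True:
            return True
        if result is None:
            saw_null = True
    return None if saw_null else False
```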
pystarburst.functions.any_value(e: ColumnOrName) Column#

Returns a non-deterministic value from the specified column. This is an aggregate and window function.

Example

>>> df = session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"])
>>> result = df.select(any_value("a")).collect()
>>> assert len(result) == 1  # non-deterministic value in result.
pystarburst.functions.approx_count_distinct(e: ColumnOrName) Column#

Uses HyperLogLog to return an approximation of the distinct cardinality of the input (i.e. HLL(col1, col2, … ) returns an approximation of COUNT(DISTINCT col1, col2, … )).

pystarburst.functions.approx_distinct(e: ColumnOrName) Column#

Uses HyperLogLog to return an approximation of the distinct cardinality of the input (i.e. HLL(col1, col2, … ) returns an approximation of COUNT(DISTINCT col1, col2, … )).

pystarburst.functions.approx_percentile(col: ColumnOrName, percentile: float) Column#

Returns an approximated value for the desired percentile. This function uses the t-Digest algorithm.

pystarburst.functions.approx_percentile_accumulate(col: ColumnOrName) Column#

Returns the internal representation of the t-Digest state (as a JSON object) at the end of aggregation. This function uses the t-Digest algorithm.

pystarburst.functions.approx_percentile_combine(state: ColumnOrName) Column#

Combines (merges) percentile input states into a single output state. This allows scenarios where APPROX_PERCENTILE_ACCUMULATE is run over horizontal partitions of the same table, producing an algorithm state for each table partition. These states can later be combined using APPROX_PERCENTILE_COMBINE, producing the same output state as a single run of APPROX_PERCENTILE_ACCUMULATE over the entire table.

pystarburst.functions.approx_percentile_estimate(state: ColumnOrName, percentile: float) Column#

Returns the desired approximated percentile value for the specified t-Digest state. APPROX_PERCENTILE_ESTIMATE(APPROX_PERCENTILE_ACCUMULATE(.)) is equivalent to APPROX_PERCENTILE(.).

pystarburst.functions.array_agg(col: ColumnOrName, is_distinct: bool = False) Column#

Returns the input values, pivoted into an ARRAY. If the input is empty, an empty ARRAY is returned.

pystarburst.functions.array_append(array: ColumnOrName, element: ColumnOrName) Column#

Returns an ARRAY containing all elements from the source ARRAY as well as the new element. The new element is located at the end of the ARRAY.

Parameters:
  • array – The column containing the source ARRAY.

  • element – The column containing the element to be appended. The data type does need to match the data type of the existing elements in the ARRAY.

pystarburst.functions.array_cat(*cols: ColumnOrName) Column#

Concatenates one or more strings, binary values, or arrays. If any of the values is null, the result is also null.

Parameters:

cols – A list of the columns to concatenate.

Example::
>>> df = session.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df = df.select(concat(df.s, df.d).alias('s'))
>>> df.collect()
[Row(s='abcd123')]
>>> df = session.createDataFrame([([1, 2], [3, 4], [5]), ([1, 2], None, [3])], ['a', 'b', 'c'])
>>> df = df.select(concat(df.a, df.b, df.c).alias("arr"))
>>> df.collect()
[Row(arr=[1, 2, 3, 4, 5]), Row(arr=None)]
pystarburst.functions.array_compact(array: ColumnOrName) Column#

Returns a compacted ARRAY with missing and null values removed, effectively converting sparse arrays into dense arrays.

Parameters:

array – Column containing the source ARRAY to be compacted

Example::
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row(a=[1, None, 3])])
>>> df.select("a", array_compact("a").alias("compacted")).show()
------------------------------
|"a"           |"compacted"  |
------------------------------
|[1, None, 3]  |[1, 3]       |
------------------------------
pystarburst.functions.array_construct(*args: ColumnOrLiteral) Column#

Creates a new array column.

Parameters:

args – Columns containing the values (or expressions that evaluate to values).

Example::
>>> df = session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"])
>>> df.select(array_construct("a", "b").alias("result")).show()
------------
|"result"  |
------------
|[1, 2]    |
|[3, 4]    |
------------
pystarburst.functions.array_contains(array: ColumnOrName, element: ColumnOrName) Column#

Returns true if the array contains the element.

pystarburst.functions.array_distinct(col: ColumnOrName)#

Removes duplicate values from the array.

Parameters:

col – The array column

Returns:

Returns a new ARRAY that contains only the distinct elements from the input ARRAY.

Example:

>>> from pystarburst.functions import array_construct,array_distinct,lit
>>> df = session.createDataFrame([["1"]], ["A"])
>>> df = df.withColumn("array", array_construct(lit(1), lit(1), lit(1), lit(2), lit(3), lit(2), lit(2)))
>>> df.withColumn("array_d", array_distinct("ARRAY")).show()
-------------------------------------------
|"a"  |"array"                |"array_d"  |
-------------------------------------------
|1    |[1, 1, 1, 2, 3, 2, 2]  |[1, 2, 3]  |
-------------------------------------------
pystarburst.functions.array_except(array1: ColumnOrName, array2: ColumnOrName) Column#

Returns an array of elements in array1 but not in array2, without duplicates.

Example:

>>> from pystarburst import Row
>>> df = session.createDataFrame([Row(c1=["b", "a", "c"], c2=["c", "d", "a", "f"])])
>>> df.select(array_except(df.c1, df.c2)).collect()
[Row(array_except(c1, c2)=['b'])]
pystarburst.functions.array_flatten(array: ColumnOrName) Column#

Returns a single array from an array of arrays. If the array is nested more than two levels deep, then only a single level of nesting is removed.

Parameters:

array – the input array

Example::
>>> df = session.createDataFrame([([[1, 2, 3], [4, 5], [6]],), ([None, [4, 5]],)], ['data'])
>>> df.select(array_flatten(df.data)).show()
----------------------------
|"array_flatten(data)"     |
----------------------------
|[1, 2, 3, 4, 5, 6]        |
|[4, 5]                    |
----------------------------
pystarburst.functions.array_insert(array: ColumnOrName, pos: ColumnOrName | int, element: ColumnOrName) Column#

Returns an ARRAY containing all elements from the source ARRAY as well as the new element.

Parameters:
  • array – Column containing the source ARRAY.

  • pos – Column containing a (one-based) position in the source ARRAY. The new element is inserted at this position. The original element from this position (if any) and all subsequent elements (if any) are shifted by one position to the right in the resulting array (i.e. inserting at position 1 has the same effect as using array_prepend). A negative position is interpreted as an index from the back of the array (e.g. -1 results in insertion before the last element in the array).

  • element – Column containing the element to be inserted. The new element is located at position pos. The relative order of the other elements from the source array is preserved.

Example::
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row([1, 2]), Row([1, 3])], schema=["a"])
>>> df.select(array_insert("a", lit(1), lit(10)).alias("result")).show()
--------------
|"result"    |
--------------
|[10, 1, 2]  |
|[10, 1, 3]  |
--------------
pystarburst.functions.array_intersect(array1: ColumnOrName, array2: ColumnOrName) Column#

Returns an array that contains the matching elements in the two input arrays.

The function is NULL-safe, meaning it treats NULLs as known values for comparing equality.

Parameters:
  • array1 – An ARRAY that contains elements to be compared.

  • array2 – An ARRAY that contains elements to be compared.

pystarburst.functions.array_intersection(array1: ColumnOrName, array2: ColumnOrName) Column#

Returns an array that contains the matching elements in the two input arrays.

The function is NULL-safe, meaning it treats NULLs as known values for comparing equality.

Parameters:
  • array1 – An ARRAY that contains elements to be compared.

  • array2 – An ARRAY that contains elements to be compared.

pystarburst.functions.array_join(array: ColumnOrName, delimiter: ColumnOrName, null_replacement: ColumnOrName | None = None) Column#

Concatenates the elements of the given array using the delimiter and an optional string to replace nulls.

Parameters:
  • array – Column containing the ARRAY of elements to convert to a string.

  • delimiter – Column containing the string to put between each element (e.g. a space, comma, or other human-readable delimiter).

  • null_replacement – Optional value to replace nulls.

Example::
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row(a=[1, 45, None])])
>>> df.select(array_join("a", lit(",")).alias("result")).show()
------------
|"result"  |
------------
|1,45      |
------------
pystarburst.functions.array_max(array: ColumnOrName) Column#

Returns the maximum value of input array. Null values are omitted.

Example::
>>> df = session.createDataFrame([([2, 1, 3],), ([None, 10, -1],)], ['data'])
>>> df.select(array_max(df.data).alias('max')).collect()
[Row(max=3), Row(max=10)]
pystarburst.functions.array_min(array: ColumnOrName) Column#

Returns the minimum value of input array. Null values are omitted.

Example::
>>> df = session.createDataFrame([([2, 1, 3],), ([None, 10, -1],)], ['data'])
>>> df.select(array_min(df.data).alias('min')).collect()
[Row(min=1), Row(min=-1)]
pystarburst.functions.array_position(array: ColumnOrName, element: ColumnOrName) Column#

Returns the position of the first occurrence of the element in array (or 0 if not found). Note that Trino ARRAY indexing starts from 1.
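
The 1-based contract can be sketched in plain Python (array_position_analogue is an illustrative name, not part of the API):

```python
def array_position_analogue(array, element):
    """Illustrative sketch of the 1-based contract: position of the
    first occurrence of element, or 0 when it is absent (Trino ARRAY
    indexing starts at 1, so 0 unambiguously means "not found")."""
    try:
        return array.index(element) + 1
    except ValueError:
        return 0
```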

pystarburst.functions.array_prepend(array: ColumnOrName, element: ColumnOrName) Column#

Returns an ARRAY containing the new element as well as all elements from the source ARRAY. The new element is positioned at the beginning of the ARRAY.

Parameters:
  • array – Column containing the source ARRAY.

  • element – Column containing the element to be prepended. The data type does need to match the data type of the existing elements in the ARRAY.

Example::
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row(a=[1, 2, 3])])
>>> df.select(array_prepend("a", lit(4)).alias("result")).show()
----------------
|"result"      |
----------------
|[4, 1, 2, 3]  |
----------------
pystarburst.functions.array_remove(array: ColumnOrName, element: ColumnOrName | int | float)#

Removes all elements that equal element from array.

pystarburst.functions.array_repeat(element: ColumnOrName, count: ColumnOrName | int) Column#

Creates an array containing an element repeated count times.

pystarburst.functions.array_size(array: ColumnOrName) Column#

Returns the cardinality (size) of the array or map.

Example::
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row(a=[1, 2, 3])])
>>> df.select(array_size("a").alias("result")).show()
------------
|"result"  |
------------
|3         |
------------
pystarburst.functions.array_slice(array: ColumnOrName, start: ColumnOrName | int, length: ColumnOrName | int) Column#

Subsets array starting from index start (or from the end if start is negative), returning at most length elements. The position of the first element is 1.
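
A plain-Python sketch of the 1-based, negative-start semantics (array_slice_analogue is an illustrative name):

```python
def array_slice_analogue(array, start, length):
    """Illustrative sketch of the 1-based slice contract: a positive
    start counts from the front, a negative start counts from the
    back, and at most length elements are returned."""
    begin = start - 1 if start > 0 else len(array) + start
    return array[begin:begin + length]
```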

pystarburst.functions.array_sort(array: ColumnOrName, func: Callable = None) Column#

Sorts and returns the array based on the given comparator function. The comparator will take two nullable arguments representing two nullable elements of the array. It returns -1, 0, or 1 as the first nullable element is less than, equal to, or greater than the second nullable element. If the comparator function returns other values (including NULL), the query will fail and raise an error.

Example

>>> df = session.createDataFrame([([2, 1, None, 3],),([1],),([],)], ['data'])
>>> df.select(array_sort(df.data).alias('r')).collect()
[Row(r=[1, 2, 3, None]), Row(r=[1]), Row(r=[])]
>>> df = session.create_dataframe([([3, 2, 5, 1, 2],)], ["data"])
>>> df.select(array_sort("data", lambda x, y: iff(x < y, 1, iff(x == y, 0, -1)))).collect()
[Row([5, 3, 2, 2, 1])]
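
The comparator contract maps directly onto Python's functools.cmp_to_key; a plain-Python sketch of the descending sort in the second example above:

```python
from functools import cmp_to_key

def descending(x, y):
    # Mirror of the comparator above: 1 when x < y, 0 when equal,
    # -1 otherwise, which sorts the array in descending order.
    if x < y:
        return 1
    if x == y:
        return 0
    return -1

result = sorted([3, 2, 5, 1, 2], key=cmp_to_key(descending))
```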
pystarburst.functions.array_to_string(array: ColumnOrName, delimiter: ColumnOrName, null_replacement: ColumnOrName | None = None) Column#

Concatenates the elements of the given array using the delimiter and an optional string to replace nulls.

Parameters:
  • array – Column containing the ARRAY of elements to convert to a string.

  • delimiter – Column containing the string to put between each element (e.g. a space, comma, or other human-readable delimiter).

  • null_replacement – Optional value to replace nulls.

Example::
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row(a=[1, 45, None])])
>>> df.select(array_to_string("a", lit(",")).alias("result")).show()
------------
|"result"  |
------------
|1,45      |
------------
pystarburst.functions.array_union(array1: ColumnOrName, array2: ColumnOrName) Column#

Returns an array of the elements in the union of array1 and array2, without duplicates.

Example:

>>> from pystarburst import Row
>>> df = session.createDataFrame([Row(c1=["b", "a", "c"], c2=["c", "d", "a", "f"])])
>>> df.select(array_union(df.c1, df.c2)).collect()
[Row(array_union(c1, c2)=['b', 'a', 'c', 'd', 'f'])]
pystarburst.functions.arrays_overlap(array1: ColumnOrName, array2: ColumnOrName) Column#

Compares whether two ARRAYs have at least one element in common. Returns TRUE if there is at least one element in common; otherwise returns FALSE. The function is NULL-safe, meaning it treats NULLs as known values for comparing equality.
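
The NULL-safe comparison described above can be sketched in plain Python, with None standing in for NULL (arrays_overlap_analogue is an illustrative name):

```python
def arrays_overlap_analogue(array1, array2):
    """Illustrative NULL-safe overlap check: None is treated as an
    ordinary comparable value, so two NULLs count as a match."""
    return bool(set(array1) & set(array2))
```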

pystarburst.functions.arrays_zip(*columns: ColumnOrName) Column#

Merges the given arrays, element-wise, into a single array of rows. The M-th element of the N-th argument will be the N-th field of the M-th output element. If the arguments have uneven lengths, missing values are filled with NULL. Currently, this function accepts at most 5 arrays (a limitation of the Trino zip function).
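
The padding behavior matches itertools.zip_longest; a plain-Python sketch (arrays_zip_analogue is an illustrative name):

```python
from itertools import zip_longest

def arrays_zip_analogue(*arrays):
    """Illustrative element-wise merge: rows become tuples, and
    shorter arrays are padded with None (standing in for NULL)."""
    return [tuple(row) for row in zip_longest(*arrays, fillvalue=None)]
```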

pystarburst.functions.asc(c: ColumnOrName) Column#

Returns a Column expression with values sorted in ascending order.

pystarburst.functions.asc_nulls_first(c: ColumnOrName) Column#

Returns a Column expression with values sorted in ascending order (null values sorted before non-null values).

pystarburst.functions.asc_nulls_last(c: ColumnOrName) Column#

Returns a Column expression with values sorted in ascending order (null values sorted after non-null values).

pystarburst.functions.ascii(e: ColumnOrName) Column#

Returns the unicode code for the first character of a string. If the string is empty, a value of 0 is returned.

pystarburst.functions.asin(e: ColumnOrName) Column#

Computes the inverse sine (arc sine) of its input; the result is a number in the interval [-pi/2, pi/2].

pystarburst.functions.assert_true(col: ColumnOrName, error_description: ColumnOrLiteralStr = None) Column#

Returns null if the input column is true; throws an exception with the provided error message otherwise.

Examples

>>> df = session.createDataFrame([(0,1)], ['a', 'b'])
>>> df.select(assert_true(df.a < df.b).alias('r')).collect()
[Row(r=None)]
>>> df = session.createDataFrame([(0,1)], ['a', 'b'])
>>> df.select(assert_true(df.a < df.b, df.a).alias('r')).collect()
[Row(r=None)]
>>> df = session.createDataFrame([(0,1)], ['a', 'b'])
>>> df.select(assert_true(df.a < df.b, 'error').alias('r')).collect()
[Row(r=None)]
pystarburst.functions.atan(e: ColumnOrName) Column#

Computes the inverse tangent (arc tangent) of its input; the result is a number in the interval (-pi/2, pi/2).

pystarburst.functions.atan2(y: ColumnOrName, x: ColumnOrName) Column#

Computes the inverse tangent (arc tangent) of y/x, using the signs of both arguments to determine the quadrant; the result is a number in the interval (-pi, pi].

pystarburst.functions.avg(e: ColumnOrName) Column#

Returns the average of non-NULL records. If all records inside a group are NULL, the function returns NULL.

pystarburst.functions.base64(e: ColumnOrName) Column#

Encodes binary into a base64 string representation.

pystarburst.functions.bit_length(e: ColumnOrName) Column#

Calculates the bit length for the specified string column.

pystarburst.functions.bitand(column1: ColumnOrName, column2: ColumnOrName) Column#

Returns the bitwise AND of two numeric expressions.

Example

>>> df = session.create_dataframe([[19, 25]], schema=["a", "b"])
>>> df.select(bitand("a", "b")).collect()[0][0]
17
pystarburst.functions.bitnot(e: ColumnOrName) Column#

Returns the bitwise negation of a numeric expression.

Example

>>> df = session.create_dataframe([1], schema=["a"])
>>> df.select(bitnot("a")).collect()[0][0]
-2
pystarburst.functions.bitor(column1: ColumnOrName, column2: ColumnOrName) Column#

Returns the bitwise OR of two numeric expressions.

Example

>>> df = session.create_dataframe([[19, 25]], schema=["a", "b"])
>>> df.select(bitor("a", "b")).collect()[0][0]
27
pystarburst.functions.bitshiftleft(to_shift_column: ColumnOrName, n: Column | int) Column#

Returns the input value shifted left by n bits.

Example

>>> df = session.create_dataframe([2], schema=["a"])
>>> df.select(bitshiftleft("a", 1)).collect()[0][0]
4
pystarburst.functions.bitshiftright(to_shift_column: ColumnOrName, n: Column | int) Column#

Returns the input value shifted right by n bits.

Example

>>> df = session.create_dataframe([2], schema=["a"])
>>> df.select(bitshiftright("a", 1)).collect()[0][0]
1
pystarburst.functions.bitwiseNOT(e: ColumnOrName) Column#

Returns the bitwise negation of a numeric expression.

Example

>>> df = session.create_dataframe([1], schema=["a"])
>>> df.select(bitnot("a")).collect()[0][0]
-2
pystarburst.functions.bitwise_not(e: ColumnOrName) Column#

Returns the bitwise negation of a numeric expression.

Example

>>> df = session.create_dataframe([1], schema=["a"])
>>> df.select(bitnot("a")).collect()[0][0]
-2
pystarburst.functions.bitxor(column1: ColumnOrName, column2: ColumnOrName) Column#

Returns the bitwise XOR of x and y in 2’s complement representation.

Example

>>> df = session.create_dataframe([[19, 25]], schema=["a", "b"])
>>> df.select(bitxor("a", "b")).collect()[0][0]
10
pystarburst.functions.builtin(function_name: str) Callable#

Creates a function object to invoke a Trino system-defined function (built-in function). Use this to invoke any built-in functions not explicitly listed in this module.

Parameters:

function_name – The name of built-in function in Trino.

Returns:

A Callable object for calling a Trino system-defined function.

Example::
>>> df = session.create_dataframe([1, 2, 3, 4], schema=["a"])  # a single column with 4 rows
>>> df.select(call_function("avg", col("a"))).show()
----------------
|"avg(""a"")"  |
----------------
|2.500000      |
----------------

>>> my_avg = function('avg')
>>> df.select(my_avg(col("a"))).show()
----------------
|"avg(""a"")"  |
----------------
|2.500000      |
----------------
pystarburst.functions.call_builtin(function_name: str, *args: ColumnOrLiteral) Column#

Invokes a Trino system-defined function (built-in function) with the specified name and arguments.

Parameters:
  • function_name – The name of built-in function in Trino

  • args

    Arguments can be in two types:

    • Column, or

    • Basic Python types, which are converted to pystarburst literals.

Example::
>>> df = session.create_dataframe([1, 2, 3, 4], schema=["a"])  # a single column with 4 rows
>>> df.select(call_function("avg", col("a"))).show()
----------------
|"avg(""a"")"  |
----------------
|2.500000      |
----------------
pystarburst.functions.call_function(function_name: str, *args: ColumnOrLiteral) Column#

Invokes a Trino system-defined function (built-in function) with the specified name and arguments.

Parameters:
  • function_name – The name of built-in function in Trino

  • args

    Arguments can be in two types:

    • Column, or

    • Basic Python types, which are converted to pystarburst literals.

Example::
>>> df = session.create_dataframe([1, 2, 3, 4], schema=["a"])  # a single column with 4 rows
>>> df.select(call_function("avg", col("a"))).show()
----------------
|"avg(""a"")"  |
----------------
|2.500000      |
----------------
pystarburst.functions.call_table_function(function_name: str, *args: ColumnOrLiteral, **kwargs: ColumnOrLiteral) TableFunctionCall#

Invokes a Trino table function, including system-defined table functions and user-defined table functions.

It returns a TableFunctionCall() so you can specify the partition clause.

Parameters:
  • function_name – The name of the table function.

  • args – The positional arguments of the table function.

  • **kwargs – The named arguments of the table function. Some table functions (e.g., flatten) have named arguments instead of positional ones.

Example

>>> from pystarburst.functions import lit
>>> session.table_function(call_table_function("sequence", lit(0), lit(4)).over()).collect()
[Row(sequential_number=0), Row(sequential_number=1), Row(sequential_number=2), Row(sequential_number=3), Row(sequential_number=4)]
pystarburst.functions.cast(column: ColumnOrName, to: str | DataType) Column#

Converts a value of one data type into another data type. The semantics of CAST are the same as the semantics of the corresponding data type conversion functions. If the cast is not possible, an error is raised.

pystarburst.functions.cbrt(e: ColumnOrName) Column#

Computes the cube root of its argument.

pystarburst.functions.ceil(e: ColumnOrName) Column#

Returns values from the specified column rounded to the nearest equal or larger integer.

pystarburst.functions.coalesce(*e: ColumnOrName) Column#

Returns the first non-NULL expression among its arguments, or NULL if all of its arguments are NULL.
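
The same first-non-null rule can be sketched in plain Python, with None standing in for NULL (coalesce_analogue is an illustrative name):

```python
def coalesce_analogue(*values):
    """Illustrative sketch: first non-None argument, or None when
    every argument is None (mirroring COALESCE over NULLs)."""
    return next((v for v in values if v is not None), None)
```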

pystarburst.functions.codepoint(e: ColumnOrName) Column#

Returns the unicode code for the first character of a string. If the string is empty, a value of 0 is returned.

pystarburst.functions.col(col_name: str) Column#

Returns the Column with the specified name.

pystarburst.functions.collectList(col: ColumnOrName, is_distinct: bool = False) Column#

Returns the input values, pivoted into an ARRAY. If the input is empty, an empty ARRAY is returned.

pystarburst.functions.collectSet(*args, **kwargs)#
pystarburst.functions.collect_list(col: ColumnOrName, is_distinct: bool = False) Column#

Returns the input values, pivoted into an ARRAY. If the input is empty, an empty ARRAY is returned.

pystarburst.functions.collect_set(*args, **kwargs)#
pystarburst.functions.column(col_name: str) Column#

Returns a Column with the specified name. Alias for col.

pystarburst.functions.concat(*cols: ColumnOrName) Column#

Concatenates one or more strings, binary values, or arrays. If any of the values is null, the result is also null.

Parameters:

cols – A list of the columns to concatenate.

Example::
>>> df = session.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df = df.select(concat(df.s, df.d).alias('s'))
>>> df.collect()
[Row(s='abcd123')]
>>> df = session.createDataFrame([([1, 2], [3, 4], [5]), ([1, 2], None, [3])], ['a', 'b', 'c'])
>>> df = df.select(concat(df.a, df.b, df.c).alias("arr"))
>>> df.collect()
[Row(arr=[1, 2, 3, 4, 5]), Row(arr=None)]
pystarburst.functions.concat_ws(*cols: ColumnOrName) Column#

Concatenates two or more strings, or concatenates two or more binary values. If any of the values is null, the result is also null. The CONCAT_WS operator requires at least two arguments, and uses the first argument to separate all following arguments.

pystarburst.functions.contains(col: ColumnOrName, string: ColumnOrName) Column#

Returns true if col contains string.

pystarburst.functions.conv(e: ColumnOrName, from_base: int, to_base: int) Column#

Converts a number given in string form from one base to another, returning the result as a string.
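
The base-conversion semantics can be sketched in plain Python (illustrative only, not the pystarburst API):

```python
# Illustrative analogue of conv: parse digits in from_base, render in to_base.
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def conv_py(s: str, from_base: int, to_base: int) -> str:
    n = int(s, from_base)          # parse in the source base
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, to_base)  # peel off digits in the target base
        out.append(DIGITS[r])
    return "".join(reversed(out))

print(conv_py("ff", 16, 10))  # -> 255
```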

pystarburst.functions.convert_timezone(target_timezone: ColumnOrName, source_time: ColumnOrName, source_timezone: ColumnOrName | None = None) Column#

Converts the given source_time to the target timezone.

For timezone information, refer to the Trino time conversion notes

Parameters:
  • target_timezone – The time zone to which the input timestamp should be converted.

  • source_time – The timestamp to convert. When it is a TIMESTAMP_LTZ, use None for source_timezone.

  • source_timezone – The time zone for the source_time. Required for timestamps with no time zone (i.e. TIMESTAMP_NTZ). Use None if the timestamps have a time zone (i.e. TIMESTAMP_LTZ). Default is None.

Note:

The order of the 3 parameters differs from the SQL function, which has two overloads:

  • CONVERT_TIMEZONE( <source_tz> , <target_tz> , <source_timestamp_ntz> )

  • CONVERT_TIMEZONE( <target_tz> , <source_timestamp> )

The first parameter, source_tz, is optional, but in Python an optional argument cannot be placed first, so source_timezone comes after source_time.

Example:
>>> import datetime
>>> from dateutil import tz
>>> datetime_with_tz = datetime.datetime(2022, 4, 6, 9, 0, 0, tzinfo=tz.tzoffset("myzone", -3600*7))
>>> datetime_with_no_tz = datetime.datetime(2022, 4, 6, 9, 0, 0)
>>> df = session.create_dataframe([[datetime_with_tz, datetime_with_no_tz]], schema=["a", "b"])
>>> result = df.select(convert_timezone(lit("UTC"), col("a")), convert_timezone(lit("UTC"), col("b"), lit("Asia/Shanghai"))).collect()
>>> result[0][0]
datetime.datetime(2022, 4, 6, 16, 0, tzinfo=<UTC>)
>>> result[0][1]
datetime.datetime(2022, 4, 6, 1, 0)
pystarburst.functions.corr(column1: ColumnOrName, column2: ColumnOrName) Column#

Returns the correlation coefficient for non-null pairs in a group.

pystarburst.functions.cos(e: ColumnOrName) Column#

Computes the cosine of its argument; the argument should be expressed in radians.

pystarburst.functions.cosh(e: ColumnOrName) Column#

Computes the hyperbolic cosine of its argument.

pystarburst.functions.cot(e: ColumnOrName) Column#

Computes the cotangent of its argument.

pystarburst.functions.count(e: ColumnOrName) Column#

Returns either the number of non-NULL records for the specified columns, or the total number of records.

pystarburst.functions.countDistinct(*cols: ColumnOrName) Column#

Returns either the number of non-NULL distinct records for the specified columns, or the total number of the distinct records.

pystarburst.functions.count_distinct(*cols: ColumnOrName) Column#

Returns either the number of non-NULL distinct records for the specified columns, or the total number of the distinct records.

pystarburst.functions.covar_pop(column1: ColumnOrName, column2: ColumnOrName) Column#

Returns the population covariance for non-null pairs in a group.

pystarburst.functions.covar_samp(column1: ColumnOrName, column2: ColumnOrName) Column#

Returns the sample covariance for non-null pairs in a group.

pystarburst.functions.crc32(e: ColumnOrName) Column#

Computes the CRC-32 of binary.
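
For reference, Python's standard library computes the same CRC-32 checksum over bytes via zlib:

```python
import zlib

# CRC-32 of a binary value, as an unsigned 32-bit integer.
checksum = zlib.crc32(b"hello")
print(checksum)
```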

pystarburst.functions.create_map(*cols: ColumnOrName) Column#

Creates a new map out of a series of rows.

Example

>>> df = session.createDataFrame(
...     [
...         ("Alice", 2),
...         ("Bob", 5),
...         ("Charlie", 6),
...     ],
...     schema=["a", "b"])
>>> df.select(create_map("a", "b")).show()
---------------------------------------------
|"map_from_entries(array_agg(row (a, b)))"  |
---------------------------------------------
|{'Bob': 5, 'Alice': 2, 'Charlie': 6}       |
---------------------------------------------
Parameters:

cols – the key and value columns; each row contributes one key-value pair to the map

pystarburst.functions.csc(e: ColumnOrName) Column#

Computes the cosecant of its argument.

pystarburst.functions.cume_dist() Column#

Finds the cumulative distribution of a value with regard to other values within the same window partition.

pystarburst.functions.current_catalog() Column#

Returns the name of the catalog in use for the current session.

Example

>>> # Return result is tied to session, so we only test if the result exists
>>> result = session.create_dataframe([1]).select(current_catalog()).collect()
>>> assert result is not None
pystarburst.functions.current_date() Column#

Returns the current date for the system.

pystarburst.functions.current_groups() Column#

Returns the list of groups for the current user running the query.

Example

>>> # Return result is tied to session, so we only test if the result exists
>>> result = session.create_dataframe([1]).select(current_groups()).collect()
>>> assert result is not None
pystarburst.functions.current_schema() Column#

Returns the name of the schema in use for the current session.

Example

>>> # Return result is tied to session, so we only test if the result exists
>>> result = session.create_dataframe([1]).select(current_schema()).collect()
>>> assert result is not None
pystarburst.functions.current_time() Column#

Returns the current time for the system.

pystarburst.functions.current_timestamp(precision: int | None = None) Column#

Returns the current timestamp for the system.

pystarburst.functions.current_timezone() Column#

Returns the current time zone for the system.

pystarburst.functions.current_user() Column#

Returns the name of the user currently logged into the system.

Example

>>> # Return result is tied to session, so we only test if the result exists
>>> result = session.create_dataframe([1]).select(current_user()).collect()
>>> assert result is not None
pystarburst.functions.date_add(col: ColumnOrName, days: ColumnOrName | int) Column#

Adds the specified value in days to the specified date.

Supported date and time parts

Example:

>>> # add 24 days to dates
>>> import datetime
>>> date_df = session.create_dataframe([[datetime.date(2020, 10, 20)]], schema=["date_col"])
>>> date_df.select(date_add(col("date_col"), lit(24)).alias("date")).show()
----------------
|"DATE"        |
----------------
|2020-11-13    |
----------------
Parameters:
  • col – The timestamp column

  • days – The number of days to add

pystarburst.functions.date_format(col: ColumnOrName, date_time_format: str) Column#

Formats a date or timestamp according to a format string that is compatible with JodaTime’s DateTimeFormat pattern format.

Supported date and time parts

Example:

>>> # format a date column
>>> import datetime
>>> date_df = session.create_dataframe([[datetime.date(2020, 1, 1)]], schema=["date_col"])
>>> date_df.select(date_format(col("date_col"), "YYYY/MM/dd").alias("date")).show()
----------------
|"DATE"        |
----------------
|"2021/01/01"  |
----------------
Parameters:
  • col – The timestamp column

  • date_time_format – The format string

pystarburst.functions.date_sub(col: ColumnOrName, days: ColumnOrName | int) Column#

Subtracts the specified value in days from the specified date.

Supported date and time parts

Example:

>>> # subtract 24 days from dates
>>> import datetime
>>> date_df = session.create_dataframe([[datetime.date(2020, 10, 20)]], schema=["date_col"])
>>> date_df.select(date_sub(col("date_col"), lit(24)).alias("date")).show()
----------------
|"DATE"        |
----------------
|2020-09-26    |
----------------
Parameters:
  • col – The timestamp column

  • days – The number of days to subtract

pystarburst.functions.date_trunc(part: str, expr: ColumnOrName) Column#

Truncates a DATE, TIME, or TIMESTAMP to the specified precision.

Note that truncation is not the same as extraction. For example:

  • Truncating a timestamp down to the quarter returns the timestamp corresponding to midnight of the first day of the quarter for the input timestamp.

  • Extracting the quarter date part from a timestamp returns the quarter number of the year in the timestamp.

Example::
>>> import datetime
>>> df = session.create_dataframe(
...     [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]],
...     schema=["a"],
... )
>>> df.select(date_trunc("YEAR", "a"), date_trunc("MONTH", "a"), date_trunc("DAY", "a")).collect()
[Row(DATE_TRUNC("YEAR", "A")=datetime.datetime(2020, 1, 1, 0, 0), DATE_TRUNC("MONTH", "A")=datetime.datetime(2020, 5, 1, 0, 0), DATE_TRUNC("DAY", "A")=datetime.datetime(2020, 5, 1, 0, 0))]
>>> df.select(date_trunc("HOUR", "a"), date_trunc("MINUTE", "a"), date_trunc("SECOND", "a")).collect()
[Row(DATE_TRUNC("HOUR", "A")=datetime.datetime(2020, 5, 1, 13, 0), DATE_TRUNC("MINUTE", "A")=datetime.datetime(2020, 5, 1, 13, 11), DATE_TRUNC("SECOND", "A")=datetime.datetime(2020, 5, 1, 13, 11, 20))]
>>> df.select(date_trunc("QUARTER", "a")).collect()
[Row(DATE_TRUNC("QUARTER", "A")=datetime.datetime(2020, 4, 1, 0, 0))]
pystarburst.functions.dateadd(col: ColumnOrName, days: ColumnOrName | int) Column#

Adds the specified value in days to the specified date.

Supported date and time parts

Example:

>>> # add 24 days to dates
>>> import datetime
>>> date_df = session.create_dataframe([[datetime.date(2020, 10, 20)]], schema=["date_col"])
>>> date_df.select(dateadd(col("date_col"), lit(24)).alias("date")).show()
----------------
|"DATE"        |
----------------
|2020-11-13    |
----------------
Parameters:
  • col – The timestamp column

  • days – The number of days to add

pystarburst.functions.datediff(part: str, col1: ColumnOrName, col2: ColumnOrName) Column#

Calculates the difference between two date, time, or timestamp columns based on the date or time part requested.

Supported date and time parts

Example:

>>> # year difference between two date columns
>>> import datetime
>>> date_df = session.create_dataframe([[datetime.date(2020, 1, 1), datetime.date(2021, 1, 1)]], schema=["date_col1", "date_col2"])
>>> date_df.select(datediff("year", col("date_col1"), col("date_col2")).alias("year_diff")).show()
---------------
|"YEAR_DIFF"  |
---------------
|1            |
---------------
Parameters:
  • part – The time part to use for calculating the difference

  • col1 – The first timestamp column or minuend in the datediff

  • col2 – The second timestamp column or the subtrahend in the datediff

pystarburst.functions.day(e: ColumnOrName) Column#

Extracts the day from a date or timestamp.

Example:

>>> import datetime
>>> df = session.create_dataframe([
...     datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"),
...     datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f")
... ], schema=["a"])
>>> df.select(day("a")).collect()
[Row(day("a")=1), Row(day("a")=21)]
pystarburst.functions.dayofmonth(e: ColumnOrName) Column#

Extracts the corresponding day (number) of the month from a date or timestamp.

Example::
>>> import datetime
>>> df = session.create_dataframe(
...     [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]],
...     schema=["a"],
... )
>>> df.select(dayofmonth("a")).collect()
[Row(day_of_month("A")=1)]
pystarburst.functions.dayofweek(e: ColumnOrName) Column#

Extracts the corresponding day (number) of the week from a date or timestamp.

Example::
>>> import datetime
>>> df = session.create_dataframe(
...     [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]],
...     schema=["a"],
... )
>>> df.select(dayofweek("a")).collect()
[Row(day_of_week("A")=5)]
pystarburst.functions.dayofyear(e: ColumnOrName) Column#

Extracts the corresponding day (number) of the year from a date or timestamp.

Example::
>>> import datetime
>>> df = session.create_dataframe(
...     [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]],
...     schema=["a"],
... )
>>> df.select(dayofyear("a")).collect()
[Row(day_of_year("A")=122)]
pystarburst.functions.degrees(e: ColumnOrName) Column#

Converts radians to degrees.

pystarburst.functions.dense_rank() Column#

Returns the rank of a value within a group of values, without gaps in the ranks. The rank value starts at 1 and continues up sequentially. If two values are the same, they will have the same rank.

pystarburst.functions.desc(c: ColumnOrName) Column#

Returns a Column expression with values sorted in descending order.

pystarburst.functions.desc_nulls_first(c: ColumnOrName) Column#

Returns a Column expression with values sorted in descending order (null values sorted before non-null values).

pystarburst.functions.desc_nulls_last(c: ColumnOrName) Column#

Returns a Column expression with values sorted in descending order (null values sorted after non-null values).

pystarburst.functions.div0(dividend: ColumnOrName | int | float, divisor: ColumnOrName | int | float) Column#

Performs division like the division operator (/), but returns 0 when the divisor is 0 (rather than reporting an error).
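
The zero-divisor behavior can be sketched in plain Python (illustrative only, not the pystarburst API):

```python
# Illustrative analogue of div0: division that returns 0 on a zero divisor.
def div0_py(dividend: float, divisor: float) -> float:
    return 0.0 if divisor == 0 else dividend / divisor

print(div0_py(10, 0))  # -> 0.0
```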

pystarburst.functions.element_at(col1: ColumnOrName | int, col2: ColumnOrName | int) Column#

Returns element of array at given index or value for given key in map. Notice that Trino ARRAY indexing starts from 1.

Example:

>>> from pystarburst.functions import lit
>>> df = session.createDataFrame([({"a": 1.0, "b": 2.0}, [1, 2, 3],), ({}, [],)], ["map", "list"])
>>> df.select(element_at(df.list, 1).as_("idx1")).sort(col("idx1")).show()
----------
|"idx1"  |
----------
|NULL    |
|1       |
----------


>>> df.select(element_at(df.map, lit("a")).as_("get_a")).sort(col("get_a")).show()
-----------
|"get_a"  |
-----------
|NULL     |
|1.0      |
-----------
pystarburst.functions.ends_with(col: ColumnOrName, str: ColumnOrName) Column#

Returns true if col ends with str.

pystarburst.functions.endswith(col: ColumnOrName, str: ColumnOrName) Column#

Returns true if col ends with str.

pystarburst.functions.equal_nan(e: ColumnOrName) Column#

Return true if the value in the column is not a number (NaN).

pystarburst.functions.exists(array: ColumnOrName, func: Callable) Column#

Returns whether any elements of an array match the given predicate. Returns true if one or more elements match the predicate; false if none of the elements matches (a special case is when the array is empty); NULL if the predicate function returns NULL for one or more elements and false for all other elements.

Example::
>>> df = session.createDataFrame([(1, [1, 2, 3, 4]), (2, [3, -1, 0])],["key", "values"])
>>> df.select(exists("values", lambda x: x < 0).alias("any_negative")).show()
------------------
|"any_negative"  |
------------------
|False           |
|True            |
------------------
pystarburst.functions.exp(e: ColumnOrName) Column#

Computes Euler’s number e raised to a floating-point value.

pystarburst.functions.expm1(e: ColumnOrName) Column#

Computes Euler’s number e raised to a floating-point value minus one.

pystarburst.functions.expr(sql: str) Column#

Creates a Column expression from raw SQL text. Note that the function does not interpret or check the SQL text.

pystarburst.functions.fail(error_description: ColumnOrLiteralStr) Column#

Throws an exception with the provided error message.

Parameters:

error_description – A Column object or column name that determines the error description

Examples:

df.select(fail("unsupported operation"))
pystarburst.functions.filter(array: ColumnOrName, func: Callable) Column#

Constructs an array from those elements of array for which func returns true.

Example::
>>> df = session.createDataFrame(
...     [(1, ["2018-09-20",  "2019-02-03", "2019-07-01", "2020-06-01"])],
...     ["key", "values"]
... )
>>> def after_second_quarter(x):
...     return month(to_date(x)) > 6
>>> df.select(
...     filter("values", after_second_quarter).alias("after_second_quarter")
... ).show()
--------------------------------
|"after_second_quarter"        |
--------------------------------
|['2018-09-20', '2019-07-01']  |
--------------------------------
pystarburst.functions.first_value(e: ColumnOrName, ignore_nulls: bool = False) Column#

Returns the first value within an ordered group of values.

pystarburst.functions.flatten(array: ColumnOrName) Column#

Returns a single array from an array of arrays. If the array is nested more than two levels deep, then only a single level of nesting is removed.

Parameters:

array – the input array

Example::
>>> df = session.createDataFrame([([[1, 2, 3], [4, 5], [6]],), ([None, [4, 5]],)], ['data'])
>>> df.select(flatten(df.data)).show()
----------------------
|"flatten(data)"     |
----------------------
|[1, 2, 3, 4, 5, 6]  |
|[4, 5]              |
----------------------
pystarburst.functions.floor(e: ColumnOrName) Column#

Returns values from the specified column rounded to the nearest equal or smaller integer.

pystarburst.functions.forall(array: ColumnOrName, func: Callable) Column#

Returns whether all elements of an array match the given predicate. Returns true if all the elements match the predicate (a special case is when the array is empty); false if one or more elements don’t match; NULL if the predicate function returns NULL for one or more elements and true for all other elements.

Example::
>>> df = session.createDataFrame(
...     [(1, ["bar"]), (2, ["foo", "bar"]), (3, ["foobar", "foo"])],
...     ["key", "values"]
... )
>>> df.select(forall("values", lambda x: x.rlike("foo")).alias("all_foo")).show()
-------------
|"all_foo"  |
-------------
|False      |
|False      |
|True       |
-------------
pystarburst.functions.format_string(format: str, *e: ColumnOrName) Column#

Formats the arguments in printf-style and returns the result as a string column.
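
For simple format specifiers, Python's own printf-style formatting mirrors the behavior (an analogue of the semantics, not the pystarburst API):

```python
# printf-style formatting: %s for strings, %d for integers.
formatted = "%s scored %d points" % ("Alice", 42)
print(formatted)  # -> Alice scored 42 points
```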

pystarburst.functions.from_json(col: ColumnOrName, to: str | DataType) Column#

Casting to BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE or VARCHAR is supported. Casting to ARRAY and MAP is supported when the element type of the array is one of the supported types, or when the key type of the map is VARCHAR and value type of the map is one of the supported types.

Example::
>>> df = session.sql("""SELECT JSON '{"v1":123,"v2":"abc","v3":true}' as col1""")
>>> schema = StructType([StructField('v1', IntegerType(), nullable=True), StructField('v2', StringType(), nullable=True), StructField('v3', BooleanType(), nullable=True)])
>>> df.select(from_json("col1", schema)).collect()
[Row(col1=(v1: 123, v2: 'abc', v3: True))]
>>> df = session.sql("select JSON '[1,null,456]' as col1")
>>> df.select(from_json("col1", ArrayType(IntegerType()))).collect()
[Row(col1=[1, None, 456])]
pystarburst.functions.from_unixtime(col: ColumnOrName, date_time_format: str = 'yyyy-MM-dd HH:mm:ss') Column#

Converts a Unix timestamp into a string with the given pattern (‘yyyy-MM-dd HH:mm:ss’ by default).

Supported date and time parts

Example:

>>> # Example run in the EST timezone
>>> import datetime
>>> date_df = session.create_dataframe([[1428476356]], schema=["unix_time"])
>>> date_df.select(from_unixtime(col("unix_time"), "YYYY/MM/dd hh:mm:ss").alias("datetime")).show()
------------------------
|"DATETIME"            |
------------------------
|"2015/04/08 02:59:16" |
------------------------
Parameters:
  • col – The Unix timestamp column

  • date_time_format – The format string

pystarburst.functions.from_utc_timestamp(col: ColumnOrName, col_tz: ColumnOrLiteralStr) Column#

Takes a timezone-agnostic timestamp, interprets it as a timestamp in UTC, and renders it as a timestamp in the given time zone.

Supported date and time parts

Example::
>>> df = session.create_dataframe(
...     [["1997-02-28 1:30:00", "Japan"]],
...     schema=["timestamp", "tz"],
... )
>>> df.select(from_utc_timestamp(col("timestamp"), col("tz")).alias("datetime")).show()
---------------------------
|"DATETIME"               |
---------------------------
|1997-02-28 10:30:00 Japan|
---------------------------
Parameters:
  • col – The timestamp column

  • col_tz – the timezone column

pystarburst.functions.function(function_name: str) Callable#

Returns a function object that invokes a Trino system-defined function (built-in function). Use this to invoke any built-in function not explicitly listed in this module.

Parameters:

function_name – The name of built-in function in Trino.

Returns:

A Callable object for calling a Trino system-defined function.

Example::
>>> df = session.create_dataframe([1, 2, 3, 4], schema=["a"])  # a single column with 4 rows
>>> df.select(call_function("avg", col("a"))).show()
----------------
|"avg(""a"")"  |
----------------
|2.500000      |
----------------

>>> my_avg = function('avg')
>>> df.select(my_avg(col("a"))).show()
----------------
|"avg(""a"")"  |
----------------
|2.500000      |
----------------
pystarburst.functions.get(col1: ColumnOrName | int, col2: ColumnOrName | int) Column#

Returns element of array at given (0-based) index. If the index points outside of the array boundaries, then this function returns NULL.

Example:

>>> df = session.createDataFrame([({"a": 1.0, "b": 2.0}, [1, 2, 3],), ({}, [],)], ["map", "list"])
>>> df.select(get(df.list, 1).as_("idx1")).sort(col("idx1")).show()
----------
|"idx1"  |
----------
|NULL    |
|2       |
----------
pystarburst.functions.get_json_object(col: ColumnOrName, path: str, json_path_mode: str = 'lax') Column#

Extracts a JSON object from a JSON string based on the specified JSON path, and returns the JSON string of the extracted object. Returns null if the input JSON string is invalid.

Parameters:
  • col – string column in json format

  • path – path to the json object to extract

  • json_path_mode – The JSON path expression can be evaluated in two modes: strict and lax. In the strict mode, it is required that the input JSON data strictly fits the schema required by the path expression. In the lax mode, the input JSON data can diverge from the expected schema. Details and examples: https://trino.io/docs/current/functions/json.html#json-path-modes

Example::
>>> data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]
>>> df = session.createDataFrame(data, ["key", "jstring"])
>>> df.select(df.key, get_json_object(df.jstring, '$.f1').alias("c0"), get_json_object(df.jstring, '$.f2').alias("c1")).collect()
[Row(key='1', c0='"value1"', c1='"value2"'), Row(key='2', c0='"value12"', c1=None)]
pystarburst.functions.greatest(*columns: ColumnOrName) Column#

Returns the largest value from a list of expressions. If any of the argument values is NULL, the result is NULL. GREATEST supports all orderable data types.

pystarburst.functions.grouping(*cols: ColumnOrName) Column#

Describes which of a list of expressions are grouped in a row produced by a GROUP BY query.

grouping_id() is an alias of grouping().

Example::
>>> from pystarburst import GroupingSets
>>> df = session.create_dataframe([[1, 2, 3], [4, 5, 6]],schema=["a", "b", "c"])
>>> grouping_sets = GroupingSets([col("a")], [col("b")], [col("a"), col("b")])
>>> df.group_by_grouping_sets(grouping_sets).agg([count("c"), grouping("a"), grouping("b"), grouping("a", "b")]).collect()
[Row(A=1, B=2, COUNT(C)=1, GROUPING(A)=0, GROUPING(B)=0, GROUPING(A, B)=0), Row(A=4, B=5, COUNT(C)=1, GROUPING(A)=0, GROUPING(B)=0, GROUPING(A, B)=0), Row(A=1, B=None, COUNT(C)=1, GROUPING(A)=0, GROUPING(B)=1, GROUPING(A, B)=1), Row(A=4, B=None, COUNT(C)=1, GROUPING(A)=0, GROUPING(B)=1, GROUPING(A, B)=1), Row(A=None, B=2, COUNT(C)=1, GROUPING(A)=1, GROUPING(B)=0, GROUPING(A, B)=2), Row(A=None, B=5, COUNT(C)=1, GROUPING(A)=1, GROUPING(B)=0, GROUPING(A, B)=2)]
pystarburst.functions.hash(e: ColumnOrName) Column#

Computes the 128-bit MurmurHash3 hash of a binary value.

Parameters:

e – The column or value to be hashed

pystarburst.functions.hex(e: ColumnOrName) Column#

Computes the hex value of the given column.

pystarburst.functions.hour(e: ColumnOrName) Column#

Extracts the hour from a date or timestamp.

Example:

>>> import datetime
>>> df = session.create_dataframe([
...     datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"),
...     datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f")
... ], schema=["a"])
>>> df.select(hour("a")).collect()
[Row(HOUR("A")=13), Row(HOUR("A")=1)]
pystarburst.functions.hypot(e: ColumnOrName, f: ColumnOrName) Column#

Returns sqrt(a^2 + b^2).
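
Python's math module provides the same function, which is handy for sanity-checking results:

```python
import math

# hypot(a, b) == sqrt(a^2 + b^2)
print(math.hypot(3.0, 4.0))  # -> 5.0
```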

pystarburst.functions.iff(condition: ColumnOrSqlExpr, expr1: ColumnOrLiteral, expr2: ColumnOrLiteral) Column#

Returns one of two specified expressions, depending on a condition. This is equivalent to an if-then-else expression.

Parameters:
  • condition – A Column expression or SQL text representing the specified condition.

  • expr1 – A Column expression or a literal value, which will be returned if condition is true.

  • expr2 – A Column expression or a literal value, which will be returned if condition is false.

pystarburst.functions.in_(cols: List[ColumnOrName], *vals: DataFrame | LiteralType | Iterable[LiteralType]) Column#

Returns a conditional expression that you can pass to the filter or where methods to perform the equivalent of a WHERE … IN query that matches rows containing a sequence of values.

The expression evaluates to true if the values in a row matches the values in one of the specified sequences.

The following code returns a DataFrame that contains the rows in which the columns c1 and c2 contain the values:

  • 1 and “a”, or

  • 2 and “b”

This is equivalent to SELECT * FROM table WHERE (c1, c2) IN ((1, 'a'), (2, 'b')).

Example:

>>> df = session.create_dataframe([[1, "a"], [2, "b"], [3, "c"]], schema=["col1", "col2"])
>>> df.filter(in_([col("col1"), col("col2")], [[1, "a"], [2, "b"]])).show()
-------------------
|"COL1"  |"COL2"  |
-------------------
|1       |a       |
|2       |b       |
-------------------

The following code returns a DataFrame that contains the rows where the values of the columns c1 and c2 in df2 match the values of the columns a and b in df1. This is equivalent to SELECT * FROM table2 WHERE (c1, c2) IN (SELECT a, b FROM table1).

Example:

>>> df1 = session.sql("select 1, 'a'")
>>> df.filter(in_([col("col1"), col("col2")], df1)).show()
-------------------
|"COL1"  |"COL2"  |
-------------------
|1       |a       |
-------------------
Parameters:
  • cols – A list of the columns to compare for the IN operation.

  • vals – A list containing the values to compare for the IN operation.

pystarburst.functions.initcap(e: ColumnOrName) Column#

Translates the first letter of each word in the string to uppercase.
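
A simplified pure-Python sketch of the behavior (this analogue only splits on spaces; the SQL function may also treat other non-letter characters as word boundaries):

```python
# Simplified analogue of initcap: first letter of each space-separated word
# uppercased, the rest lowercased.
def initcap_py(s: str) -> str:
    return " ".join(w[:1].upper() + w[1:].lower() for w in s.split(" "))

print(initcap_py("hello WORLD"))  # -> Hello World
```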

pystarburst.functions.instr(e: ColumnOrName, substring: ColumnOrLiteral) Column#

Locates the position of the first occurrence of substring in the given string. Returns null if either of the arguments is null.

pystarburst.functions.is_nan(x: ColumnOrName) Column#

Determine if x is not-a-number.

Example

>>> df = session.create_dataframe([2], schema=["a"])
>>> df.select(is_nan("a")).collect()[0][0]
False
pystarburst.functions.is_not_null(e: ColumnOrName) Column#

Return true if the value in the column is not null.

pystarburst.functions.is_null(e: ColumnOrName) Column#

Return true if the value in the column is null.

pystarburst.functions.isnan(x: ColumnOrName) Column#

Determine if x is not-a-number.

Example

>>> df = session.create_dataframe([2], schema=["a"])
>>> df.select(isnan("a")).collect()[0][0]
False
pystarburst.functions.isnull(e: ColumnOrName) Column#

Return true if the value in the column is null.

pystarburst.functions.json_array_length(col: ColumnOrName) Column#

Returns the array length of json (a string containing a JSON array).

Example::
>>> df = session.createDataFrame([(None,), ('[1, 2, 3]',), ('[]',)], ['data'])
>>> df.select(json_array_length(df.data)).show()
-----------------------------
|"json_array_length(data)"  |
-----------------------------
|NULL                       |
|3                          |
|0                          |
-----------------------------
pystarburst.functions.json_extract(col: ColumnOrName, path: ColumnOrName) Column#

Parses a JSON string and returns the value of an element at a specified path in the resulting JSON document.

pystarburst.functions.json_parse(e: ColumnOrName) Column#

Parses the value of the specified column as a JSON string and returns the resulting JSON document.

pystarburst.functions.json_tuple(col: ColumnOrName, *fields: str, json_path_mode: str = 'lax') List[Column]#

Creates a new row for a json column according to the given field names.

Parameters:
  • col – string column in json format

  • fields – a field or fields to extract

  • json_path_mode – The JSON path expression can be evaluated in two modes: strict and lax. In the strict mode, it is required that the input JSON data strictly fits the schema required by the path expression. In the lax mode, the input JSON data can diverge from the expected schema. Details and examples: https://trino.io/docs/current/functions/json.html#json-path-modes

Example::
>>> data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]
>>> df = session.createDataFrame(data, ["key", "jstring"])
>>> json_tuple_list = json_tuple(df.jstring, 'f1', 'f2')
>>> df.select(df.key, *json_tuple_list).collect()
[Row('1', '"value1"', '"value2"'), Row('2', '"value12"', None)]
pystarburst.functions.kurtosis(e: ColumnOrName) Column#

Returns the population excess kurtosis of non-NULL records. If all records inside a group are NULL, the function returns NULL.

pystarburst.functions.lag(e: ColumnOrName, offset: int = 1, default_value: ColumnOrLiteral | None = None, ignore_nulls: bool = False) Column#

Accesses data in a previous row in the same result set without having to join the table to itself.

pystarburst.functions.last_day(col: ColumnOrName) Column#

Returns the last day of the month to which the given date belongs.

Supported date and time parts

Example::
>>> import datetime
>>> df = session.create_dataframe(
...     [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]],
...     schema=["a"],
... )
>>> df.select(last_day(col("a")).alias("date")).show()
----------------
|"DATE"        |
----------------
|2020-05-31    |
----------------
Parameters:

col – The timestamp column

pystarburst.functions.last_value(e: ColumnOrName, ignore_nulls: bool = False) Column#

Returns the last value within an ordered group of values.

pystarburst.functions.lead(e: ColumnOrName, offset: int = 1, default_value: Column | LiteralType | None = None, ignore_nulls: bool = False) Column#

Accesses data in a subsequent row in the same result set without having to join the table to itself.

pystarburst.functions.least(*columns: ColumnOrName) Column#

Returns the smallest value from a list of expressions. LEAST supports all orderable data types.

pystarburst.functions.length(e: ColumnOrName) Column#

Returns the length of an input string or binary value. For strings, the length is the number of characters, and UTF-8 characters are counted as a single character. For binary, the length is the number of bytes.

pystarburst.functions.levenshtein(l: ColumnOrName, r: ColumnOrLiteral) Column#

Computes the Levenshtein distance of the two given strings.

pystarburst.functions.levenshtein_distance(l: ColumnOrName, r: ColumnOrLiteral) Column#

Computes the Levenshtein distance of the two given strings.
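
The Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other; the classic dynamic-programming computation looks like this (illustrative only, not the pystarburst implementation):

```python
# Classic DP edit distance (insert / delete / substitute).
def levenshtein_py(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[len(b)]

print(levenshtein_py("kitten", "sitting"))  # -> 3
```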

pystarburst.functions.listagg(e: ColumnOrName, delimiter: str = '', *within_group: ColumnOrName | Iterable[ColumnOrName], is_distinct: bool = False) Column#

Returns the concatenated input values, separated by delimiter string. See LISTAGG for details.

Parameters:
  • e – A Column object or column name that determines the values to be put into the list.

  • delimiter – A string delimiter.

  • is_distinct – Whether the input expression is distinct.

Examples:

df.group_by(df.col1).agg(listagg(df.col2, ",", df.col2.asc()))
df.select(listagg(df["col2"], ",", df["col2"].asc(), is_distinct=False))
pystarburst.functions.lit(literal: LiteralType) Column#

Creates a Column expression for a literal value. It supports basic Python data types, including int, float, str, bool, bytes, bytearray, datetime.time, datetime.date, datetime.datetime and decimal.Decimal. Also, it supports Python structured data types, including list, tuple and dict, but this container must be JSON serializable.

pystarburst.functions.locate(substring: ColumnOrLiteral, e: ColumnOrName, pos: ColumnOrName | int = 1) Column#

Locate the position of the first occurrence of substr in a string column, after position pos.
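A plain-Python sketch of the 1-based position semantics (illustration only, not pystarburst code):

```python
def locate(substring: str, s: str, pos: int = 1) -> int:
    # str.find is 0-based; locate is 1-based and returns 0 when not found
    found = s.find(substring, pos - 1)
    return found + 1

print(locate("bar", "foobarbar"))     # 4
print(locate("bar", "foobarbar", 5))  # 7
print(locate("qux", "foobarbar"))     # 0
```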

pystarburst.functions.log(base: ColumnOrName | int | float, x: ColumnOrName | int | float) Column#

Returns the logarithm of x in the given base.

pystarburst.functions.log10(x: ColumnOrName | int | float) Column#

Returns the logarithm in base 10 of a numeric expression.

pystarburst.functions.log1p(x: ColumnOrName | int | float) Column#

Returns the natural logarithm of the numeric expression plus one, that is, ln(1 + x).

pystarburst.functions.log2(x: ColumnOrName | int | float) Column#

Returns the base 2 logarithm of a numeric expression.

pystarburst.functions.lower(e: ColumnOrName) Column#

Returns the input string with all characters converted to lowercase.

pystarburst.functions.lpad(e: ColumnOrName, len: Column | int, pad: ColumnOrName) Column#

Left-pads a string with characters from another string, or left-pads a binary value with bytes from another binary value.
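The padding rule can be sketched in plain Python (illustration only, not pystarburst code; this assumes, as in Trino's LPAD, that inputs longer than len are truncated to len):

```python
def lpad(s: str, size: int, pad: str) -> str:
    if len(s) >= size:
        return s[:size]  # assumed: longer inputs are truncated to `size`
    fill = (pad * size)[: size - len(s)]  # repeat pad, then cut to fit
    return fill + s

print(lpad("17", 5, "0"))      # '00017'
print(lpad("abcdef", 3, "*"))  # 'abc'
```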

pystarburst.functions.ltrim(e: ColumnOrName, trim_string: ColumnOrName | None = None) Column#

Removes leading characters, including whitespace, from a string.

pystarburst.functions.make_date(col_year: ColumnOrName, col_month: ColumnOrName, col_day: ColumnOrName) Column#

Generates a date from year, month, and day columns.

Example::
>>> import datetime
>>> df = session.create_dataframe(
...     [[2020, 1, 30]],
...     schema=["a", "b", "c"],
... )
>>> df.select(make_date("a", "b", "c")).collect()
[Row(MAKE_DATE("A", "B", "C")=datetime.date(2020, 1, 30))]
Parameters:
  • col_year – The year column

  • col_month – The month column

  • col_day – The day column

pystarburst.functions.map_concat(*cols: ColumnOrName) Column#

Returns the union of all the given maps.

Example

>>> df = session.sql("SELECT MAP_FROM_ENTRIES(ARRAY[(1, 'a'), (2, 'b')]) as a, MAP_FROM_ENTRIES(ARRAY[(3, 'c')]) as b")
>>> df.select(map_concat('a', 'b')).show()

--------------------------------
|"map_concat(a, b)"            |
--------------------------------
|{'1': 'a', '2': 'b', '3': 'c'}|
--------------------------------
Parameters:

cols – The maps to concatenate

pystarburst.functions.map_entries(col: ColumnOrName) Column#

Returns an unordered array of all entries in the given map.

Example

>>> df = session.sql("SELECT MAP_FROM_ENTRIES(ARRAY[(1, 'a'), (2, 'b')]) as a")
>>> df.select(map_entries('a')).show()

-----------------------
|"map_entries(a)"     |
-----------------------
|[(1, 'a'), (2, 'b')] |
-----------------------
Parameters:

col – The input map

pystarburst.functions.map_filter(col: ColumnOrName, func: Callable) Column#

Constructs a map from those entries of map for which function returns true.

Example::
>>> df = session.createDataFrame([(1, {"foo": 42.0, "bar": 1.0, "baz": 32.0})], ["id", "data"])
>>> row = df.select(map_filter(
...     "data", lambda _, v: v > 30.0).alias("data_filtered")
...                 ).head()
>>> sorted(row["data_filtered"].items())
[('baz', 32.0), ('foo', 42.0)]
pystarburst.functions.map_from_arrays(col1: ColumnOrName, col2: ColumnOrName) Column#

Creates a new map from two arrays.

Example

>>> df = session.createDataFrame([(["Alice", "Bob", "Charlie"], [2, 5, 8])], schema=["a", "b"])
>>> df.select(map_from_arrays("a", "b")).show()
----------------------------------------
|"map_from_entries(zip(a, b))"         |
----------------------------------------
|{'Bob': 5, 'Alice': 2, 'Charlie': 8}  |
----------------------------------------
Parameters:
  • col1 – the array containing all the keys

  • col2 – the array containing all the values

pystarburst.functions.map_from_entries(col: ColumnOrName) Column#

Converts an array of entries (key-value struct types) to a map of values.

Example

>>> df = session.sql("SELECT ARRAY[(1, 'a'), (2, 'b')] as a")
>>> df.select(map_from_entries('a')).show()

------------------------
|"map_from_entries(a)" |
------------------------
|{'1': 'a', '2': 'b'}  |
------------------------
Parameters:

col – The input array of entries

pystarburst.functions.map_keys(col: ColumnOrName) Column#

Returns an unordered array containing the keys of the map.

Example

>>> df = session.sql("SELECT MAP_FROM_ENTRIES(ARRAY[(1, 'a'), (2, 'b')]) as a")
>>> df.select(map_keys('a')).show()

-----------------
|"map_keys(a)"  |
-----------------
|[1, 2]         |
-----------------
Parameters:

col – The input map

pystarburst.functions.map_values(col: ColumnOrName) Column#

Returns an unordered array containing the values of the map.

Example

>>> df = session.sql("SELECT MAP_FROM_ENTRIES(ARRAY[(1, 'a'), (2, 'b')]) as a")
>>> df.select(map_values('a')).show()

------------------
|"map_values(a)" |
------------------
|['a', 'b']      |
------------------
Parameters:

col – The input map

pystarburst.functions.map_zip_with(col1: ColumnOrName, col2: ColumnOrName, func: Callable) Column#

Merges the two given maps into a single map by applying function to the pair of values with the same key. For keys present in only one map, NULL is passed as the value for the missing key.

Example::
>>> df = session.createDataFrame([
...     (1, {"IT": 24.0, "SALES": 12.00}, {"IT": 2.0, "SALES": 1.4})],
...     ["id", "base", "ratio"]
... )
>>> row = df.select(map_zip_with(
...     "base", "ratio", lambda k, v1, v2: round(v1 * v2, 2)).alias("updated_data")
...                 ).head()
>>> sorted(row["updated_data"].items())
[('IT', 48.0), ('SALES', 16.8)]
pystarburst.functions.max(e: ColumnOrName) Column#

Returns the maximum value for the records in a group. NULL values are ignored unless all the records are NULL, in which case a NULL value is returned.

pystarburst.functions.max_by(e: ColumnOrName, f: ColumnOrName) Column#

Returns the value of e associated with the maximum value of f for the records in a group. NULL values are ignored unless all the records are NULL, in which case a NULL value is returned.

pystarburst.functions.md5(e: ColumnOrName) Column#

Returns a 32-character hex-encoded string containing the 128-bit MD5 message digest.

pystarburst.functions.mean(e: ColumnOrName) Column#

Returns the average for the specified numeric columns. Alias of avg().

pystarburst.functions.min(e: ColumnOrName) Column#

Returns the minimum value for the records in a group. NULL values are ignored unless all the records are NULL, in which case a NULL value is returned.

pystarburst.functions.min_by(e: ColumnOrName, f: ColumnOrName) Column#

Returns the value of e associated with the minimum value of f for the records in a group. NULL values are ignored unless all the records are NULL, in which case a NULL value is returned.

pystarburst.functions.minute(e: ColumnOrName) Column#

Extracts the minute from a date or timestamp.

Example:

>>> import datetime
>>> df = session.create_dataframe([
...     datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"),
...     datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f")
... ], schema=["a"])
>>> df.select(minute("a")).collect()
[Row(MINUTE("A")=11), Row(MINUTE("A")=30)]
pystarburst.functions.mode(e: ColumnOrName) Column#

Returns the most frequent value for the records in a group. NULL values are ignored. If all the values are NULL, or there are 0 rows, then the function returns NULL.

pystarburst.functions.month(e: ColumnOrName) Column#

Extracts the month from a date or timestamp.

Example:

>>> import datetime
>>> df = session.create_dataframe([
...     datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"),
...     datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f")
... ], schema=["a"])
>>> df.select(month("a")).collect()
[Row(MONTH("A")=5), Row(MONTH("A")=8)]
pystarburst.functions.negate(e: ColumnOrName) Column#

Returns the negation of the value in the column (equivalent to a unary minus).

pystarburst.functions.not_(e: ColumnOrName) Column#

Returns the inverse of a boolean expression.

pystarburst.functions.nth_value(e: ColumnOrName, offset: int, ignore_nulls: bool = False) Column#

Returns the value at the specified offset from the beginning of the window. Offsets start at 1. The offset can be any scalar expression. If the offset is null or greater than the number of values in the window, null is returned. It is an error for the offset to be zero or negative.

pystarburst.functions.ntile(e: int | ColumnOrName) Column#

Divides an ordered data set equally into the number of buckets specified by n. Buckets are sequentially numbered 1 through n.

Parameters:

e – The desired number of buckets; must be a positive integer value.
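The bucket numbering can be sketched in plain Python (illustration only, not pystarburst code; this assumes, as in Trino's NTILE, that when the rows do not divide evenly the earlier buckets receive the remainder rows):

```python
def ntile(row_count: int, n: int):
    """1-based bucket number for each of `row_count` ordered rows."""
    base, extra = divmod(row_count, n)
    buckets = []
    for b in range(1, n + 1):
        # assumed: the first `extra` buckets each hold one extra row
        size = base + (1 if b <= extra else 0)
        buckets.extend([b] * size)
    return buckets

print(ntile(7, 3))  # [1, 1, 1, 2, 2, 3, 3]
```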

pystarburst.functions.octet_length(e: ColumnOrName) Column#

Calculates the byte length for the specified string column.

pystarburst.functions.overlay(e: ColumnOrName, replace: ColumnOrLiteral, pos: Column | int, len: Column | int = -1) Column#

Overlays the specified portion of e with replace, starting from the character/byte position pos of e and proceeding for len characters/bytes.
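A plain-Python sketch of the overlay semantics (illustration only, not pystarburst code; this assumes a negative len means "use the length of replace", as in similar engines):

```python
def overlay(s: str, replace: str, pos: int, length: int = -1) -> str:
    if length < 0:
        length = len(replace)  # assumed default: replace as many chars as `replace` has
    # pos is 1-based: keep everything before pos, insert `replace`,
    # then skip `length` characters of the original
    return s[: pos - 1] + replace + s[pos - 1 + length:]

print(overlay("SparkSQL", "_", 6))        # 'Spark_QL'
print(overlay("SparkSQL", "CORE", 6, 3))  # 'SparkCORE'
```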

pystarburst.functions.parse_json(e: ColumnOrName) Column#

Parse the value of the specified column as a JSON string and returns the resulting JSON document.

pystarburst.functions.percent_rank() Column#

Returns the relative rank of a value within a group of values, specified as a percentage ranging from 0.0 to 1.0.

pystarburst.functions.percentile_approx(col: ColumnOrName, percentile: float) Column#

Returns an approximated value for the desired percentile. This function uses the t-Digest algorithm.

pystarburst.functions.pmod(e: ColumnOrName, f: ColumnOrName) Column#

Returns the positive modulus of e with respect to f.

pystarburst.functions.pow(left: ColumnOrName | int | float, right: ColumnOrName | int | float) Column#

Returns a number (left) raised to the specified power (right).

pystarburst.functions.power(x: ColumnOrName, p: Column | int) Column#

Returns x raised to the power of p.

Example

>>> df = session.create_dataframe([2], schema=["a"])
>>> df.select(power("a", 2)).collect()[0][0]
4
pystarburst.functions.product(e: ColumnOrName) Column#

Returns the product of all values.

pystarburst.functions.quarter(e: ColumnOrName) Column#

Extracts the quarter from a date or timestamp.

Example:

>>> import datetime
>>> df = session.create_dataframe([
...     datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"),
...     datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f")
... ], schema=["a"])
>>> df.select(quarter("a")).collect()
[Row(QUARTER("A")=2), Row(QUARTER("A")=3)]
pystarburst.functions.radians(e: ColumnOrName) Column#

Converts degrees to radians.

pystarburst.functions.raise_error(error_description: ColumnOrLiteralStr) Column#

Throws an exception with the provided error message.

Parameters:

error_description – A Column object or column name that determines the error description

Examples:

df.select(fail("unsupported operation"))
pystarburst.functions.random(m: ColumnOrName | int | None = None, n: ColumnOrName | int | None = None) Column#

Each call returns a pseudo-random value.

pystarburst.functions.rank() Column#

Returns the rank of a value within an ordered group of values. The rank value starts at 1 and continues up.

pystarburst.functions.reduce(array: ColumnOrName, initialState: ColumnOrName, input_func: Callable, output_func: Callable | None = None) Column#

Returns a single value reduced from array. input_func is invoked for each element in array in order. In addition to taking the element, input_func takes the current state, initially initialState, and returns the new state. output_func is invoked to turn the final state into the result value; it may be the identity function (i -> i), which is the default if not specified.

Example::
>>> df = session.createDataFrame([(1, [20.0, 4.0, 2.0, 6.0, 10.0])], ["id", "values"])
>>> df.select(reduce("values", lit(0.0), lambda acc, x: acc + x).alias("sum")).show()
---------
|"sum"  |
---------
|42.0   |
---------
pystarburst.functions.regexp_count(subject: ColumnOrName, pattern: ColumnOrLiteralStr) Column#

Returns the number of times that a pattern occurs in the subject.

pystarburst.functions.regexp_extract(subject: ColumnOrName, pattern: ColumnOrLiteralStr, group: Column | int | None = 0) Column#

Finds the first occurrence of the regular expression pattern in the string and returns the capturing group identified by group.
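The counting and group-extraction semantics of regexp_count() and regexp_extract() can be sketched with Python's re module (illustration only, not pystarburst code; Trino's regular-expression dialect may differ from Python's re in edge cases):

```python
import re

subject = "1a 2b 14m"

# regexp_count: number of times the pattern matches in the subject
print(len(re.findall(r"\d+", subject)))  # 3

# regexp_extract: first match, returning the requested capturing group
m = re.search(r"(\d+)([a-z]+)", subject)
print(m.group(2))  # 'a'
```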

pystarburst.functions.regexp_replace(subject: ColumnOrName, pattern: ColumnOrLiteralStr, replacement: ColumnOrLiteralStr = '') Column#

Returns the subject with the specified pattern (or all occurrences of the pattern) either removed or replaced by a replacement string. If no matches are found, returns the original subject.

pystarburst.functions.repeat(s: ColumnOrName, n: Column | int) Column#

Builds a string by repeating the input for the specified number of times.

pystarburst.functions.replace(subject: ColumnOrName, pattern: ColumnOrLiteralStr, replacement: ColumnOrLiteralStr = '') Column#

Removes all occurrences of the specified pattern from the subject and optionally replaces them with replacement.

pystarburst.functions.reverse(col: ColumnOrName) Column#

Returns a reversed string or an array with reverse order of elements.

Example:

>>> df = session.create_dataframe([["Hello"], ["abc"]], schema=["col1"])
>>> df.select(reverse(col("col1"))).show()
-----------------------
|"reverse(col1)"      |
-----------------------
|olleH                |
|cba                  |
-----------------------


>>> df = session.createDataFrame([([2, 1, 3],), ([1],), ([],)], ['data'])
>>> res = df.select(reverse(df.data).alias('r')).collect()
>>> res
[Row(r=[3, 1, 2]), Row(r=[1]), Row(r=[])]
pystarburst.functions.round(e: ColumnOrName, scale: ColumnOrName | int | float = 0) Column#

Returns values from the specified column rounded to scale decimal places. The default scale of 0 rounds to the nearest integer.

pystarburst.functions.row_number() Column#

Returns a unique row number for each row within a window partition. The row number starts at 1 and continues up sequentially.

pystarburst.functions.rpad(e: ColumnOrName, len: Column | int, pad: ColumnOrName) Column#

Right-pads a string with characters from another string, or right-pads a binary value with bytes from another binary value.

pystarburst.functions.rtrim(e: ColumnOrName, trim_string: ColumnOrName | None = None) Column#

Removes trailing characters, including whitespace, from a string.

pystarburst.functions.sec(e: ColumnOrName) Column#

Computes the secant of its argument.

pystarburst.functions.second(e: ColumnOrName) Column#

Extracts the second from a date or timestamp.

Example:

>>> import datetime
>>> df = session.create_dataframe([
...     datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"),
...     datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f")
... ], schema=["a"])
>>> df.select(second("a")).collect()
[Row(SECOND("A")=20), Row(SECOND("A")=5)]
pystarburst.functions.sequence(start: ColumnOrName, stop: ColumnOrName, step: ColumnOrName | None = None) Column#

Generates a sequence of integers from start to stop, incrementing by step. If step is not set, it defaults to 1 if start is less than or equal to stop, and to -1 otherwise.

Parameters:
  • start – the column that contains the integer to start with (inclusive).

  • stop – the column that contains the integer to stop (inclusive).

  • step – the column that contains the integer to increment.

Example::
>>> df1 = session.create_dataframe([(-2, 2)], ["a", "b"])
>>> df1.select(sequence("a", "b").alias("result")).show()
---------------------
|"result"           |
---------------------
|[-2, -1, 0, 1, 2]  |
---------------------

>>> df2 = session.create_dataframe([(4, -4, -2)], ["a", "b", "c"])
>>> df2.select(sequence("a", "b", "c").alias("result")).show()
---------------------
|"result"           |
---------------------
|[4, 2, 0, -2, -4]  |
---------------------
pystarburst.functions.sha1(e: ColumnOrName) Column#

Returns a 40-character hex-encoded string containing the 160-bit SHA-1 message digest.

pystarburst.functions.sha2(e: ColumnOrName, num_bits: int) Column#

Returns a hex-encoded string containing the SHA-256 or SHA-512 message digest, depending on num_bits.

pystarburst.functions.shuffle(array: ColumnOrName) Column#

Generates a random permutation of the given array.

Parameters:

array – The column containing the source ARRAY.

Examples::
>>> df = session.createDataFrame([([1, 20, 3, 5],), ([1, 20, None, 3],)], ['data'])
>>> df.select(shuffle(df.data).alias('s')).collect()
[Row(s=[3, 1, 5, 20]), Row(s=[20, None, 3, 1])]
pystarburst.functions.signum(e: ColumnOrName) Column#

Computes the signum of its argument.

pystarburst.functions.sin(e: ColumnOrName) Column#

Computes the sine of its argument; the argument should be expressed in radians.

pystarburst.functions.sinh(e: ColumnOrName) Column#

Computes the hyperbolic sine of its argument.

pystarburst.functions.size(array: ColumnOrName) Column#

Returns the cardinality (size) of the array or map.

Example::
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row(a=[1, 2, 3])])
>>> df.select(size("a").alias("result")).show()
------------
|"result"  |
------------
|3         |
------------
pystarburst.functions.skew(e: ColumnOrName) Column#

Returns the sample skewness of non-NULL records. If all records inside a group are NULL, the function returns NULL.

pystarburst.functions.skewness(e: ColumnOrName) Column#

Returns the sample skewness of non-NULL records. If all records inside a group are NULL, the function returns NULL.

pystarburst.functions.sort_array(array: ColumnOrName, sort_ascending: bool | None = True) Column#

Returns rows of array column in sorted order. Users can choose the sort order.

Parameters:
  • array – name of the column or column element which describes the column

  • sort_ascending – Boolean that decides if array elements are sorted in ascending order. Defaults to True.

Examples::
>>> df = session.sql("select array[20, 0, null, 10] as a")
>>> df.select(sort_array(df.a).as_("sorted_a")).show()
---------------------
|"sorted_a"         |
---------------------
|[0, 10, 20, None]  |
---------------------

>>> df.select(sort_array(df.a, False).as_("sorted_a")).show()
---------------------
|"sorted_a"         |
---------------------
|[None, 20, 10, 0]  |
---------------------
pystarburst.functions.soundex(e: ColumnOrName) Column#

Returns a string that contains a phonetic representation of the input string.

pystarburst.functions.split(str: ColumnOrName, pattern: ColumnOrLiteralStr) Column#

Splits a given string with a given separator and returns the result in an array of strings. To specify a string separator, use the lit() function.

pystarburst.functions.sql_expr(sql: str) Column#

Creates a Column expression from raw SQL text. Note that the function does not interpret or check the SQL text.

pystarburst.functions.sqrt(e: ColumnOrName) Column#

Returns the square-root of a non-negative numeric expression.

pystarburst.functions.starts_with(col: ColumnOrName, str: ColumnOrName) Column#

Returns true if col starts with str.

pystarburst.functions.startswith(col: ColumnOrName, str: ColumnOrName) Column#

Returns true if col starts with str.

pystarburst.functions.stddev(e: ColumnOrName) Column#

Returns the sample standard deviation (square root of sample variance) of non-NULL values. If all records inside a group are NULL, returns NULL.

pystarburst.functions.stddev_pop(e: ColumnOrName) Column#

Returns the population standard deviation (square root of variance) of non-NULL values. If all records inside a group are NULL, returns NULL.

pystarburst.functions.stddev_samp(e: ColumnOrName) Column#

Returns the sample standard deviation (square root of sample variance) of non-NULL values. If all records inside a group are NULL, returns NULL. Alias of stddev().

pystarburst.functions.strpos(e: ColumnOrName, substring: ColumnOrLiteral, instance: int | ColumnOrName)#

Returns the position of the N-th instance of substring in string. When instance is a negative number the search will start from the end of string. Positions start with 1. If not found, 0 is returned.
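A plain-Python sketch of the N-th-instance lookup (illustration only, not pystarburst code; whether overlapping occurrences are counted is an assumption here and may differ in Trino):

```python
def strpos(s: str, sub: str, instance: int = 1) -> int:
    # collect 0-based positions of all (possibly overlapping) occurrences
    positions, i = [], s.find(sub)
    while i != -1:
        positions.append(i)
        i = s.find(sub, i + 1)
    if instance > 0 and instance <= len(positions):
        return positions[instance - 1] + 1   # 1-based result
    if instance < 0 and -instance <= len(positions):
        return positions[instance] + 1       # count back from the end
    return 0                                 # not found

print(strpos("high", "ig"))            # 2
print(strpos("abcabcabc", "abc", 2))   # 4
print(strpos("abcabcabc", "abc", -1))  # 7
```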

pystarburst.functions.substr(str: ColumnOrName, pos: Column | int, len: Column | int | None = None) Column#

Returns the portion of the string or binary value str, starting from the character/byte specified by pos, with limited length. The length should be greater than or equal to zero. If the length is a negative number, the function returns an empty string.

Note

For pos, 1 is the first character of the string in Trino

substr() is an alias of substring().

pystarburst.functions.substring(str: ColumnOrName, pos: Column | int, len: Column | int | None = None) Column#

Returns the portion of the string or binary value str, starting from the character/byte specified by pos, with limited length. The length should be greater than or equal to zero. If the length is a negative number, the function returns an empty string.

Note

For pos, 1 is the first character of the string in Trino

substr() is an alias of substring().

pystarburst.functions.substring_index(str: ColumnOrName, delim: ColumnOrLiteral, count: Column | int) Column#

Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned.

If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.
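The delimiter-counting rule can be sketched in plain Python (illustration only, not pystarburst code):

```python
def substring_index(s: str, delim: str, count: int) -> str:
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # left of the count-th delimiter
    return delim.join(parts[count:])       # right of the count-th delimiter from the end

print(substring_index("www.apache.org", ".", 2))   # 'www.apache'
print(substring_index("www.apache.org", ".", -2))  # 'apache.org'
```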

pystarburst.functions.sum(e: ColumnOrName) Column#

Returns the sum of non-NULL records in a group. You can use the DISTINCT keyword to compute the sum of unique non-null values. If all records inside a group are NULL, the function returns NULL.

pystarburst.functions.sumDistinct(e: ColumnOrName) Column#

Returns the sum of non-NULL distinct records in a group. You can use the DISTINCT keyword to compute the sum of unique non-null values. If all records inside a group are NULL, the function returns NULL.

pystarburst.functions.sum_distinct(e: ColumnOrName) Column#

Returns the sum of non-NULL distinct records in a group. You can use the DISTINCT keyword to compute the sum of unique non-null values. If all records inside a group are NULL, the function returns NULL.

pystarburst.functions.table_function(function_name: str) Callable#

Create a function object to invoke a Trino table function.

Parameters:

function_name – The name of the table function.

Example

>>> from pystarburst.functions import lit
>>> sequence = table_function("sequence")
>>> session.table_function(sequence(lit(0), lit(4)).over()).collect()
[Row(sequential_number=0), Row(sequential_number=1), Row(sequential_number=2), Row(sequential_number=3), Row(sequential_number=4)]
pystarburst.functions.tan(e: ColumnOrName) Column#

Computes the tangent of its argument; the argument should be expressed in radians.

pystarburst.functions.tanh(e: ColumnOrName) Column#

Computes the hyperbolic tangent of its argument.

pystarburst.functions.timestamp_seconds(col: ColumnOrName) Column#

Converts a Unix timestamp into a local datetime.


Example:

>>> # Example run in the EST timezone
>>> import datetime
>>> date_df = session.create_dataframe([[1428476356]], schema=["unix_time"])
>>> date_df.select(timestamp_seconds(col("unix_time")).alias("datetime")).show()
--------------------------
|"DATETIME"              |
--------------------------
|2015-04-08 02:59:16.000 |
--------------------------
Parameters:

col – The Unix timestamp column

pystarburst.functions.to_date(e: ColumnOrName, fmt: Column | None = None) Column#

Converts an input expression into a date.

pystarburst.functions.to_hex(e: ColumnOrName) Column#

Encodes binary into a hex string representation.

pystarburst.functions.to_json(col: ColumnOrName) Column#

Cast to a JSON string.

Example::
>>> data = [(1, {"name": "Alice"})]
>>> df = session.createDataFrame(data, ["key", "value"])
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json='{"name":"Alice"}')]
>>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
>>> df = session.createDataFrame(data, ["key", "value"])
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json='[{"name":"Alice"},{"name":"Bob"}]')]
>>> data = [(1, ["Alice", "Bob"])]
>>> df = session.createDataFrame(data, ["key", "value"])
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json='["Alice","Bob"]')]
pystarburst.functions.to_time(e: ColumnOrName, fmt: Column | None = None) Column#

Converts an input expression into the corresponding time.

pystarburst.functions.to_timestamp(e: ColumnOrName, fmt: Column | None = None) Column#

Converts an input expression into the corresponding timestamp.

pystarburst.functions.to_utc_timestamp(col: ColumnOrName, col_tz: ColumnOrLiteralStr) Column#

Takes a timezone-agnostic timestamp, interprets it as a timestamp in the given timezone, and renders that timestamp as a timestamp in UTC.


Example::
>>> df = session.create_dataframe(
...     [["1997-02-28 10:30:00", "Japan"]],
...     schema=["timestamp", "tz"],
... )
>>> df.select(to_utc_timestamp(col("timestamp"), col("tz")).alias("datetime")).show()
------------------------
|"DATETIME"            |
------------------------
|1997-02-28 1:30:00 UTC|
------------------------
Parameters:
  • col – The timestamp column

  • col_tz – the timezone column

pystarburst.functions.transform(array: ColumnOrName, func: Callable) Column#

Returns an array that is the result of applying function to each element of array.

pystarburst.functions.transform_keys(col: ColumnOrName, func: Callable) Column#

Returns a map that applies function to each entry of map and transforms the keys.

Example::
>>> df = session.createDataFrame([(1, {"foo": -2.0, "bar": 2.0})], ["id", "data"])
>>> row = df.select(transform_keys(
...     "data", lambda k, _: upper(k)).alias("data_upper")
... ).head()
>>> sorted(row["data_upper"].items())
[('BAR', 2.0), ('FOO', -2.0)]
pystarburst.functions.transform_values(col: ColumnOrName, func: Callable) Column#

Returns a map that applies function to each entry of map and transforms the values.

Example::
>>> df = session.createDataFrame([(1, {"IT": 10.0, "SALES": 2.0, "OPS": 24.0})], ["id", "data"])
>>> row = df.select(transform_values(
...     "data", lambda k, v: v + 10.0
... ).alias("new_data")).head()
>>> sorted(row["new_data"].items())
[('IT', 20.0), ('OPS', 34.0), ('SALES', 12.0)]
pystarburst.functions.translate(src: ColumnOrName, matching_string: ColumnOrName, replace_string: ColumnOrName) Column#

Translates src from the characters in matchingString to the characters in replaceString.
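The character-mapping rule can be sketched in plain Python (illustration only, not pystarburst code; this assumes, as in the SQL TRANSLATE function, that characters in matching_string with no counterpart in replace_string are removed):

```python
def translate(src: str, matching: str, replace: str) -> str:
    table = {}
    for i, ch in enumerate(matching):
        # assumed: characters beyond the length of `replace` are deleted
        table[ord(ch)] = replace[i] if i < len(replace) else None
    return src.translate(table)

print(translate("abcd", "abc", "12"))  # '12d'  ('c' has no counterpart, so it is removed)
```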

pystarburst.functions.trim(e: ColumnOrName, trim_string: ColumnOrName | None = None) Column#

Removes leading and trailing characters from a string.

pystarburst.functions.trunc(col: ColumnOrName, trunc_format: ColumnOrLiteralStr) Column#

Truncates a DATE, TIME, or TIMESTAMP to the specified precision.

Note that truncation is not the same as extraction. For example:

  • Truncating a timestamp down to the quarter returns the timestamp corresponding to midnight of the first day of the quarter for the input timestamp.

  • Extracting the quarter date part from a timestamp returns the quarter number of the year in the timestamp.

Example::
>>> import datetime
>>> df = session.create_dataframe(
...     [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]],
...     schema=["a"],
... )
>>> df.select(trunc("a", "YEAR"), trunc("a", "MONTH")).collect()
[Row(DATE_TRUNC("YEAR", "A")=datetime.date(2020, 1, 1), DATE_TRUNC("MONTH", "A")=datetime.datetime(2020, 5, 1))]
Parameters:
  • col – the date/time/timestamp column

  • trunc_format – the truncation format

pystarburst.functions.try_cast(column: ColumnOrName, to: str | DataType) Column#

A special version of CAST for a subset of data type conversions. It performs the same operation (i.e. converts a value of one data type into another data type), but returns a NULL value instead of raising an error when the conversion cannot be performed.

The column argument must be a column in Trino.

pystarburst.functions.typeof(col: ColumnOrName) Column#

Reports the type of a value stored in a VARIANT column. The type is returned as a string.

pystarburst.functions.unbase64(e: ColumnOrName) Column#

Decodes a BASE64 encoded string column and returns it as a binary column.

pystarburst.functions.unhex(e: ColumnOrName) Column#

Interprets each pair of characters as a hexadecimal number and converts them to the byte representation of the number.
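Python's bytes.fromhex performs the same pairwise hex-to-byte conversion, which illustrates the semantics (not pystarburst code):

```python
# each two-character hex pair becomes one byte: 4d='M', 79='y', 53='S', 51='Q', 4c='L'
print(bytes.fromhex("4d7953514c"))  # b'MySQL'
```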

pystarburst.functions.unix_timestamp(col: ColumnOrName, date_time_format: str = 'yyyy-MM-dd HH:mm:ss') Column#

Converts a time string with a given pattern ('yyyy-MM-dd HH:mm:ss' by default) to a Unix timestamp (in seconds).


Example:

>>> # Example run in the EST timezone
>>> import datetime
>>> date_df = session.create_dataframe([["04/08/2015 02:59:16"]], schema=["date_col"])
>>> date_df.select(unix_timestamp(col("date_col"), "MM/dd/yyyy hh:mm:ss").alias("unix_time")).show()
----------------
|"UNIX_TIME"   |
----------------
|1428476356    |
----------------
Parameters:
  • col – The timestamp or string column to convert

  • date_time_format – The format string

pystarburst.functions.upper(e: ColumnOrName) Column#

Returns the input string with all characters converted to uppercase.

pystarburst.functions.var_pop(e: ColumnOrName) Column#

Returns the population variance of non-NULL records in a group. If all records inside a group are NULL, a NULL is returned.

pystarburst.functions.var_samp(e: ColumnOrName) Column#

Returns the sample variance of non-NULL records in a group. If all records inside a group are NULL, a NULL is returned. Alias of variance()

pystarburst.functions.variance(e: ColumnOrName) Column#

Returns the sample variance of non-NULL records in a group. If all records inside a group are NULL, a NULL is returned.

pystarburst.functions.week_of_year(e: ColumnOrName) Column#

Extracts the corresponding week (number) of the year from a date or timestamp.

Example:

>>> import datetime
>>> df = session.create_dataframe(
...     [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]],
...     schema=["a"],
... )
>>> df.select(weekofyear("a")).collect()
[Row(WEEKOFYEAR("A")=18)]
pystarburst.functions.weekofyear(e: ColumnOrName) Column#

Extracts the corresponding week (number) of the year from a date or timestamp.

Example:

>>> import datetime
>>> df = session.create_dataframe(
...     [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]],
...     schema=["a"],
... )
>>> df.select(weekofyear("a")).collect()
[Row(WEEKOFYEAR("A")=18)]
pystarburst.functions.when(condition: ColumnOrSqlExpr, value: ColumnOrLiteral) CaseExpr#

Works like a cascading if-then-else statement. A series of conditions are evaluated in sequence. When a condition evaluates to TRUE, the evaluation stops and the associated result (after THEN) is returned. If none of the conditions evaluate to TRUE, then the result after the optional OTHERWISE is returned, if present; otherwise NULL is returned.

Parameters:
  • condition – A Column expression or SQL text representing the specified condition.

  • value – A Column expression or a literal value, which will be returned if condition is true.

pystarburst.functions.when_matched(condition: Column | None = None) WhenMatchedClause#

Specifies a matched clause for the Table.merge action. See WhenMatchedClause for details.

pystarburst.functions.when_not_matched(condition: Column | None = None) WhenNotMatchedClause#

Specifies a not-matched clause for the Table.merge action. See WhenNotMatchedClause for details.

pystarburst.functions.xxhash64(e: ColumnOrName) Column#

Calculates the hash code of given columns using the 64-bit variant of the xxHash algorithm, and returns the result as a varbinary column.

Example:

>>> df = session.create_dataframe(['a'], schema=["a"])
>>> df.select(xxhash64("a").alias("xxhash64")).collect()
[Row(xxhash64=-3292477735350538661)]
Parameters:

e – The column or value to be hashed

pystarburst.functions.year(e: ColumnOrName) Column#

Extracts the year from a date or timestamp.

Example:

>>> import datetime
>>> df = session.create_dataframe([
...     datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"),
...     datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f")
... ], schema=["a"])
>>> df.select(year("a")).collect()
[Row(YEAR("A")=2020), Row(YEAR("A")=2020)]
pystarburst.functions.zip_with(array1: ColumnOrName, array2: ColumnOrName, func: Callable) Column#

Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function.
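The padding-and-merge rule can be sketched in plain Python (illustration only, not pystarburst code):

```python
from itertools import zip_longest

def zip_with(a, b, func):
    # the shorter array is padded with None before `func` is applied element-wise
    return [func(x, y) for x, y in zip_longest(a, b)]

print(zip_with([1, 2, 3], [10, 20], lambda x, y: (x, y)))
# [(1, 10), (2, 20), (3, None)]
```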