Dataframe functions#
Provides utility and SQL functions that generate Column expressions that you can pass to DataFrame transformation methods.
These utility functions generate references to columns, literals, and SQL expressions (e.g. “c + 1”).
Use col() to convert a column name to a Column object. Refer to the API docs of Column for more ways of referencing a column.
Use lit() to convert a Python value to a Column object that represents a constant value in Trino SQL.
Use sql_expr() to convert a Trino SQL expression to a Column.
>>> df = session.create_dataframe([[1, 'a', True, '2022-03-16'], [3, 'b', False, '2023-04-17']], schema=["a", "b", "c", "d"])
>>> res1 = df.filter(col("a") == 1).collect()
>>> res2 = df.filter(lit(1) == col("a")).collect()
>>> res3 = df.filter(sql_expr("a = 1")).collect()
>>> assert res1 == res2 == res3
>>> res1
[Row(A=1, B='a', C=True, D='2022-03-16')]
Some DataFrame methods accept column names or SQL expression text in addition to Column objects, for convenience.
For instance:
>>> df.filter("a = 1").collect() # use the SQL expression directly in filter
[Row(A=1, B='a', C=True, D='2022-03-16')]
>>> df.select("a").collect()
[Row(A=1), Row(A=3)]
whereas Column objects enable you to chain column operators and transformations fluently in Python code:
>>> # Use columns and literals in expressions.
>>> df.select(((col("a") + 1).cast("string")).alias("add_one")).show()
-------------
|"ADD_ONE" |
-------------
|2 |
|4 |
-------------
Trino has hundreds of SQL functions. This module provides Python functions that correspond to the Trino SQL functions. They typically accept Column objects or column names as input parameters and return a new Column object.
The following examples demonstrate the use of some of these functions:
>>> # This example calls the function that corresponds to the TO_DATE() SQL function.
>>> df.select(dateadd('day', lit(1), to_date(col("d")))).show()
---------------------------------------
|"DATEADD('DAY', 1, TO_DATE(""D""))" |
---------------------------------------
|2022-03-17 |
|2023-04-18 |
---------------------------------------
If you want to use a SQL function in Trino but can’t find the corresponding Python function here,
you can create your own Python function with function():
>>> my_radians = function("radians") # "radians" is the SQL function name.
>>> df.select(my_radians(col("a")).alias("my_radians")).show()
------------------------
|"MY_RADIANS" |
------------------------
|0.017453292519943295 |
|0.05235987755982988 |
------------------------
or call the SQL function directly:
>>> df.select(call_function("radians", col("a")).as_("call_function_radians")).show()
---------------------------
|"CALL_FUNCTION_RADIANS" |
---------------------------
|0.017453292519943295 |
|0.05235987755982988 |
---------------------------
How to find help on input parameters of the Python functions for SQL functions
The Python functions have the same name as the corresponding SQL functions.
By reading the API docs or the source code of a Python function defined in this module, you’ll see the type hints of the input parameters and return type.
The return type is always Column. The input types tell you the acceptable values:
ColumnOrName accepts a Column object, or a column name as a str. Most functions accept this type. If you still want to pass a literal to it, use lit(value), which returns a Column object that represents a literal value.
>>> df.select(avg("a")).show()
----------------
|"AVG(""A"")"  |
----------------
|2.000000      |
----------------
>>> df.select(avg(col("a"))).show()
----------------
|"AVG(""A"")"  |
----------------
|2.000000      |
----------------
LiteralType accepts a value of type bool, int, float, str, bytearray, decimal.Decimal, datetime.date, datetime.datetime, datetime.time, or bytes. An example is the third parameter of lead().
>>> import datetime
>>> from pystarburst.window import Window
>>> df.select(col("d"), lead("d", 1, datetime.date(2024, 5, 18), False).over(Window.order_by("d")).alias("lead_day")).show()
---------------------------
|"D"         |"LEAD_DAY"  |
---------------------------
|2022-03-16  |2023-04-17  |
|2023-04-17  |2024-05-18  |
---------------------------
ColumnOrLiteral accepts a Column object, or a value of the LiteralType mentioned above. The difference from ColumnOrName is that ColumnOrLiteral regards a str value as a SQL string value instead of a column name. When a function is much more likely to accept a SQL constant value than a column expression, ColumnOrLiteral is used. You can still pass in a Column object if you need to. An example is the second parameter of when().
>>> df.select(when(df["a"] > 2, "Greater than 2").else_("Less than 2").alias("compare_with_2")).show()
--------------------
|"COMPARE_WITH_2"  |
--------------------
|Less than 2       |
|Greater than 2    |
--------------------
int, bool, str, or another specific type accepts a value of that type. An example is to_decimal().
>>> df.with_column("e", lit("1.2")).select(to_decimal("e", 5, 2)).show()
-----------------------------
|"TO_DECIMAL(""E"", 5, 2)"  |
-----------------------------
|1.20                       |
|1.20                       |
-----------------------------
ColumnOrSqlExpr accepts a Column object, or a SQL expression as a str. For instance, the first parameter of when().
>>> df.select(when("a > 2", "Greater than 2").else_("Less than 2").alias("compare_with_2")).show()
--------------------
|"COMPARE_WITH_2"  |
--------------------
|Less than 2       |
|Greater than 2    |
--------------------
- pystarburst.functions.abs(e: ColumnOrName) Column #
Returns the absolute value of a numeric expression.
- pystarburst.functions.acos(e: ColumnOrName) Column #
Computes the inverse cosine (arc cosine) of its input; the result is a number in the interval [0, pi].
- pystarburst.functions.acosh(e: ColumnOrName) Column #
Computes the inverse hyperbolic cosine of its input; the input must be at least 1, and the result is a non-negative number.
- pystarburst.functions.add_months(date_or_timestamp: ColumnOrName, number_of_months: Column | int) Column #
Adds or subtracts a specified number of months to a date or timestamp, preserving the end-of-month information.
Examples
>>> import datetime
>>> df = session.create_dataframe([datetime.date(2022, 4, 6)], schema=["d"])
>>> df.select(add_months("d", 4)).collect()[0][0]
datetime.date(2022, 8, 6)
- pystarburst.functions.aggregate(array: ColumnOrName, initialState: ColumnOrName, input_func: Callable, output_func: Callable | None = None) Column #
Returns a single value reduced from array. input_func is invoked for each element in array, in order. In addition to taking the element, input_func takes the current state (initially initialState) and returns the new state. output_func is invoked to turn the final state into the result value; it defaults to the identity function (i -> i) if not specified.
Examples
>>> df = session.createDataFrame([(1, [20.0, 4.0, 2.0, 6.0, 10.0])], ["id", "values"])
>>> df.select(aggregate("values", lit(0.0), lambda acc, x: acc + x).alias("sum")).show()
---------
|"sum"  |
---------
|42.0   |
---------
- pystarburst.functions.all_match(array: ColumnOrName, func: Callable) Column #
Returns whether all elements of an array match the given predicate. Returns true if all the elements match the predicate (a special case is when the array is empty); false if one or more elements don’t match; NULL if the predicate function returns NULL for one or more elements and true for all other elements.
Examples
>>> df = session.createDataFrame(
...     [(1, ["bar"]), (2, ["foo", "bar"]), (3, ["foobar", "foo"])],
...     ["key", "values"]
... )
>>> df.select(forall("values", lambda x: x.rlike("foo")).alias("all_foo")).show()
-------------
|"all_foo"  |
-------------
|False      |
|False      |
|True       |
-------------
- pystarburst.functions.any_match(array: ColumnOrName, func: Callable) Column #
Returns whether any elements of an array match the given predicate. Returns true if one or more elements match the predicate; false if none of the elements matches (a special case is when the array is empty); NULL if the predicate function returns NULL for one or more elements and false for all other elements.
Examples
>>> df = session.createDataFrame([(1, [1, 2, 3, 4]), (2, [3, -1, 0])], ["key", "values"])
>>> df.select(exists("values", lambda x: x < 0).alias("any_negative")).show()
------------------
|"any_negative"  |
------------------
|False           |
|True            |
------------------
- pystarburst.functions.any_value(e: ColumnOrName) Column #
Returns a non-deterministic value from the specified column. This is an aggregate and window function.
Examples
>>> df = session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"])
>>> result = df.select(any_value("a")).collect()
>>> assert len(result) == 1  # non-deterministic value in result.
- pystarburst.functions.approx_count_distinct(e: ColumnOrName) Column #
Uses HyperLogLog to return an approximation of the distinct cardinality of the input (i.e. HLL(col1, col2, … ) returns an approximation of COUNT(DISTINCT col1, col2, … )).
- pystarburst.functions.approx_distinct(e: ColumnOrName) Column #
Uses HyperLogLog to return an approximation of the distinct cardinality of the input (i.e. HLL(col1, col2, … ) returns an approximation of COUNT(DISTINCT col1, col2, … )).
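The HyperLogLog machinery runs inside Trino, but the flavor of probabilistic cardinality estimation can be sketched in pure Python with linear counting, a simpler member of the same family of techniques. This is an illustration of the idea only, not Trino's implementation; `approx_distinct_py` is a hypothetical helper.

```python
import hashlib
import math

def approx_distinct_py(values, m=4096):
    # Linear counting: hash each value into one of m buckets, then estimate
    # the distinct count from the fraction of buckets left empty.
    # Illustrative sketch only -- Trino's approx_distinct uses HyperLogLog.
    bitmap = [False] * m
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        bitmap[h % m] = True
    empty = bitmap.count(False)
    if empty == 0:
        return float("inf")  # sketch saturated; a larger m is needed
    return round(m * math.log(m / empty))

# Duplicates do not change the estimate; only distinct values do.
print(approx_distinct_py(range(1000)))
print(approx_distinct_py(list(range(1000)) * 5))
```

The key property, shared with HyperLogLog, is that the sketch uses fixed memory regardless of input size, trading exactness for a small, bounded relative error.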
- pystarburst.functions.approx_percentile(col: ColumnOrName, percentile: float) Column #
Returns an approximated value for the desired percentile. This function uses the t-Digest algorithm.
- pystarburst.functions.approx_percentile_accumulate(col: ColumnOrName) Column #
Returns the internal representation of the t-Digest state (as a JSON object) at the end of aggregation. This function uses the t-Digest algorithm.
- pystarburst.functions.approx_percentile_combine(state: ColumnOrName) Column #
Combines (merges) percentile input states into a single output state. This allows scenarios where APPROX_PERCENTILE_ACCUMULATE is run over horizontal partitions of the same table, producing an algorithm state for each table partition. These states can later be combined using APPROX_PERCENTILE_COMBINE, producing the same output state as a single run of APPROX_PERCENTILE_ACCUMULATE over the entire table.
- pystarburst.functions.approx_percentile_estimate(state: ColumnOrName, percentile: float) Column #
Returns the desired approximated percentile value for the specified t-Digest state. APPROX_PERCENTILE_ESTIMATE(APPROX_PERCENTILE_ACCUMULATE(.)) is equivalent to APPROX_PERCENTILE(.).
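The accumulate/combine/estimate workflow described for the `approx_percentile_*` family can be illustrated with an exactly mergeable state (a sorted list) in place of a t-Digest. All helper names below are hypothetical, and the real t-Digest uses a compact bounded-size sketch rather than storing every value:

```python
def accumulate(values):
    # Build a per-partition state. A real t-Digest is a compact sketch;
    # here we use the exact (but unbounded) state: a sorted list.
    return sorted(values)

def combine(*states):
    # Merge the per-partition states into one, as APPROX_PERCENTILE_COMBINE
    # does for t-Digest states.
    merged = []
    for s in states:
        merged.extend(s)
    return sorted(merged)

def estimate(state, percentile):
    # Nearest-rank percentile over the combined state.
    idx = round(percentile * (len(state) - 1))
    return state[idx]

# Two "partitions" accumulated separately, then combined:
s1 = accumulate([1, 3, 5])
s2 = accumulate([2, 4, 6])
print(estimate(combine(s1, s2), 1.0))  # 6, the maximum of the combined data
```

Because `combine(accumulate(p1), accumulate(p2))` equals `accumulate(p1 + p2)` here, estimating over the combined state matches a single pass over the whole table, which is exactly the equivalence the SQL functions promise.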
- pystarburst.functions.array_agg(col: ColumnOrName, is_distinct: bool = False) Column #
Returns the input values, pivoted into an ARRAY. If the input is empty, an empty ARRAY is returned.
- pystarburst.functions.array_append(array: ColumnOrName, element: ColumnOrName) Column #
Returns an ARRAY containing all elements from the source ARRAY as well as the new element. The new element is located at the end of the ARRAY.
- Parameters:
array – The column containing the source ARRAY.
element – The column containing the element to be appended. The data type must match the data type of the existing elements in the ARRAY.
- pystarburst.functions.array_cat(*cols: ColumnOrName) Column #
Concatenates one or more strings, binary values, or arrays. If any of the values is null, the result is also null.
- Parameters:
cols – A list of the columns to concatenate.
Examples
>>> df = session.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df = df.select(concat(df.s, df.d).alias('s'))
>>> df.collect()
[Row(s='abcd123')]
>>> df = session.createDataFrame([([1, 2], [3, 4], [5]), ([1, 2], None, [3])], ['a', 'b', 'c'])
>>> df = df.select(concat(df.a, df.b, df.c).alias("arr"))
>>> df.collect()
[Row(arr=[1, 2, 3, 4, 5]), Row(arr=None)]
- pystarburst.functions.array_compact(array: ColumnOrName) Column #
Returns a compacted ARRAY with missing and null values removed, effectively converting sparse arrays into dense arrays.
- Parameters:
array – Column containing the source ARRAY to be compacted
Examples
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row(a=[1, None, 3])])
>>> df.select("a", array_compact("a").alias("compacted")).show()
------------------------------
|"a"           |"compacted"  |
------------------------------
|[1, None, 3]  |[1, 3]       |
------------------------------
- pystarburst.functions.array_construct(*args: ColumnOrLiteral) Column #
Creates a new array column.
- Parameters:
args – Columns containing the values (or expressions that evaluate to values).
Examples
>>> df = session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"])
>>> df.select(array_construct("a", "b").alias("result")).show()
------------
|"result"  |
------------
|[1, 2]    |
|[3, 4]    |
------------
- pystarburst.functions.array_contains(array: ColumnOrName, element: ColumnOrName) Column #
Returns true if the array contains the element.
- pystarburst.functions.array_distinct(col: ColumnOrName)#
Remove duplicate values from the array.
- Parameters:
col – The array column
- Returns:
Returns a new ARRAY that contains only the distinct elements from the input ARRAY.
Examples
>>> from pystarburst.functions import array_construct, array_distinct, lit
>>> df = session.createDataFrame([["1"]], ["A"])
>>> df = df.withColumn("array", array_construct(lit(1), lit(1), lit(1), lit(2), lit(3), lit(2), lit(2)))
>>> df.withColumn("array_d", array_distinct("ARRAY")).show()
-------------------------------------------
|"a"  |"array"                |"array_d"  |
-------------------------------------------
|1    |[1, 1, 1, 2, 3, 2, 2]  |[1, 2, 3]  |
-------------------------------------------
- pystarburst.functions.array_except(array1: ColumnOrName, array2: ColumnOrName) Column #
Returns an array of elements in array1 but not in array2, without duplicates.
Examples
>>> from pystarburst import Row
>>> df = session.createDataFrame([Row(c1=["b", "a", "c"], c2=["c", "d", "a", "f"])])
>>> df.select(array_except(df.c1, df.c2)).collect()
[Row(array_except(c1, c2)=['b'])]
- pystarburst.functions.array_flatten(array: ColumnOrName) Column #
Returns a single array from an array of arrays. If the array is nested more than two levels deep, then only a single level of nesting is removed.
- Parameters:
array – the input array
Examples
>>> df = session.createDataFrame([([[1, 2, 3], [4, 5], [6]],), ([None, [4, 5]],)], ['data'])
>>> df.select(flatten(df.data)).show()
----------------------
|"flatten(data)"     |
----------------------
|[1, 2, 3, 4, 5, 6]  |
|[4, 5]              |
----------------------
- pystarburst.functions.array_insert(array: ColumnOrName, pos: ColumnOrName | int, element: ColumnOrName) Column #
Returns an ARRAY containing all elements from the source ARRAY as well as the new element.
- Parameters:
array – Column containing the source ARRAY.
pos – Column containing a (one-based) position in the source ARRAY. The new element is inserted at this position. The original element from this position (if any) and all subsequent elements (if any) are shifted by one position to the right in the resulting array (i.e. inserting at position 1 has the same effect as using array_prepend). A negative position is interpreted as an index from the back of the array (e.g. -1 results in insertion before the last element in the array).
element – Column containing the element to be inserted. The new element is located at position pos. The relative order of the other elements from the source array is preserved.
Examples
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row([1, 2]), Row([1, 3])], schema=["a"])
>>> df.select(array_insert("a", lit(1), lit(10)).alias("result")).show()
--------------
|"result"    |
--------------
|[10, 1, 2]  |
|[10, 1, 3]  |
--------------
- pystarburst.functions.array_intersect(array1: ColumnOrName, array2: ColumnOrName) Column #
Returns an array that contains the matching elements in the two input arrays.
The function is NULL-safe, meaning it treats NULLs as known values for comparing equality.
- Parameters:
array1 – An ARRAY that contains elements to be compared.
array2 – An ARRAY that contains elements to be compared.
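The NULL-safe behavior can be sketched in plain Python by treating None as an ordinary comparable value. This is an illustration of the documented semantics, not the engine; `array_intersect_py` is a hypothetical helper:

```python
def array_intersect_py(array1, array2):
    # NULL-safe intersection: None matches None, and duplicates are dropped.
    out = []
    for v in array1:
        if v in array2 and v not in out:
            out.append(v)
    return out

# None in both inputs appears in the result, unlike a NULL = NULL comparison.
print(array_intersect_py([1, None, 3], [None, 3, 4]))
```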
- pystarburst.functions.array_intersection(array1: ColumnOrName, array2: ColumnOrName) Column #
Returns an array that contains the matching elements in the two input arrays.
The function is NULL-safe, meaning it treats NULLs as known values for comparing equality.
- Parameters:
array1 – An ARRAY that contains elements to be compared.
array2 – An ARRAY that contains elements to be compared.
- pystarburst.functions.array_join(array: ColumnOrName, delimiter: ColumnOrName, null_replacement: ColumnOrName | None = None) Column #
Concatenates the elements of the given array using the delimiter and an optional string to replace nulls.
- Parameters:
array – Column containing the ARRAY of elements to convert to a string.
delimiter – Column containing the string to put between each element (e.g. a space, comma, or other human-readable delimiter).
null_replacement – Optional value to replace nulls.
Examples
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row(a=[1, 45, None])])
>>> df.select(array_join("a", lit(",")).alias("result")).show()
------------
|"result"  |
------------
|1,45      |
------------
- pystarburst.functions.array_max(array: ColumnOrName) Column #
Returns the maximum value of input array. Null values are omitted.
Examples
>>> df = session.createDataFrame([([2, 1, 3],), ([None, 10, -1],)], ['data'])
>>> df.select(array_max(df.data).alias('max')).collect()
[Row(max=3), Row(max=10)]
- pystarburst.functions.array_min(array: ColumnOrName) Column #
Returns the minimum value of input array. Null values are omitted.
Examples
>>> df = session.createDataFrame([([2, 1, 3],), ([None, 10, -1],)], ['data'])
>>> df.select(array_min(df.data).alias('min')).collect()
[Row(min=1), Row(min=-1)]
- pystarburst.functions.array_position(array: ColumnOrName, element: ColumnOrName) Column #
Returns the position of the first occurrence of the element in array (or 0 if not found).
Note
Trino ARRAY indexing starts from 1
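The 1-based convention noted above can be mimicked with a small Python helper (hypothetical, for illustration only):

```python
def array_position_py(arr, element):
    # Trino-style semantics: positions start at 1; 0 means "not found".
    for i, v in enumerate(arr, start=1):
        if v == element:
            return i
    return 0

print(array_position_py([10, 20, 30], 20))  # 2: first occurrence, 1-based
print(array_position_py([10, 20, 30], 99))  # 0: not found
```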
- pystarburst.functions.array_prepend(array: ColumnOrName, element: ColumnOrName) Column #
Returns an ARRAY containing the new element as well as all elements from the source ARRAY. The new element is positioned at the beginning of the ARRAY.
- Parameters:
array – Column containing the source ARRAY.
element – Column containing the element to be prepended. The data type does need to match the data type of the existing elements in the ARRAY.
Examples
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row(a=[1, 2, 3])])
>>> df.select(array_prepend("a", lit(4)).alias("result")).show()
----------------
|"result"      |
----------------
|[4, 1, 2, 3]  |
----------------
- pystarburst.functions.array_remove(array: ColumnOrName, element: ColumnOrName | int | float)#
Remove all elements that equal element from array.
- pystarburst.functions.array_repeat(element: ColumnOrName, count: ColumnOrName | int) Column #
Creates an array containing an element repeated count times.
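In Python terms, the behavior corresponds to list repetition (a sketch of the semantics, not the Trino call; `array_repeat_py` is a hypothetical helper):

```python
def array_repeat_py(element, count):
    # Build an array of `count` copies of `element`.
    return [element] * count

print(array_repeat_py("x", 3))  # ['x', 'x', 'x']
```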
- pystarburst.functions.array_size(array: ColumnOrName) Column #
Returns the cardinality (size) of the array or map.
Examples
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row(a=[1, 2, 3])])
>>> df.select(size("a").alias("result")).show()
------------
|"result"  |
------------
|3         |
------------
- pystarburst.functions.array_slice(array: ColumnOrName, start: ColumnOrName | int, length: ColumnOrName | int) Column #
Subsets array starting from index start (or starting from the end if start is negative) with a length of length.
Note
The position of the first element is 1
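The documented semantics, with a 1-based start and a negative start counting from the end, can be sketched with a hypothetical Python helper:

```python
def array_slice_py(arr, start, length):
    # Mimics the documented semantics: start is 1-based; a negative start
    # counts from the end of the array. Illustrative helper only.
    if start == 0:
        raise ValueError("start must be non-zero (positions are 1-based)")
    idx = start - 1 if start > 0 else len(arr) + start
    return arr[idx:idx + length]

print(array_slice_py([1, 2, 3, 4, 5], 2, 3))   # [2, 3, 4]: 3 elements from position 2
print(array_slice_py([1, 2, 3, 4, 5], -2, 2))  # [4, 5]: from the 2nd-to-last element
```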
- pystarburst.functions.array_sort(array: ColumnOrName, func: Callable = None) Column #
Sorts and returns the array based on the given comparator function. The comparator will take two nullable arguments representing two nullable elements of the array. It returns -1, 0, or 1 as the first nullable element is less than, equal to, or greater than the second nullable element. If the comparator function returns other values (including NULL), the query will fail and raise an error.
Examples
>>> df = session.createDataFrame([([2, 1, None, 3],), ([1],), ([],)], ['data'])
>>> df.select(array_sort(df.data).alias('r')).collect()
[Row(r=[1, 2, 3, None]), Row(r=[1]), Row(r=[])]
>>> df = session.create_dataframe([([3, 2, 5, 1, 2],)], ["data"])
>>> df.select(array_sort("data", lambda x, y: iff(x < y, 1, iff(x == y, 0, -1)))).collect()
[Row([5, 3, 2, 2, 1])]
- pystarburst.functions.array_to_string(array: ColumnOrName, delimiter: ColumnOrName, null_replacement: ColumnOrName | None = None) Column #
Concatenates the elements of the given array using the delimiter and an optional string to replace nulls.
- Parameters:
array – Column containing the ARRAY of elements to convert to a string.
delimiter – Column containing the string to put between each element (e.g. a space, comma, or other human-readable delimiter).
null_replacement – Optional value to replace nulls.
Examples
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row(a=[1, 45, None])])
>>> df.select(array_join("a", lit(",")).alias("result")).show()
------------
|"result"  |
------------
|1,45      |
------------
- pystarburst.functions.array_union(array1: ColumnOrName, array2: ColumnOrName) Column #
Returns an array of the elements in the union of array1 and array2, without duplicates.
Examples
>>> from pystarburst import Row
>>> df = session.createDataFrame([Row(c1=["b", "a", "c"], c2=["c", "d", "a", "f"])])
>>> df.select(array_union(df.c1, df.c2)).collect()
[Row(array_union(c1, c2)=['b', 'a', 'c', 'd', 'f'])]
- pystarburst.functions.arrays_overlap(array1: ColumnOrName, array2: ColumnOrName) Column #
Compares whether two ARRAYs have at least one element in common. Returns TRUE if there is at least one element in common; otherwise returns FALSE. The function is NULL-safe, meaning it treats NULLs as known values for comparing equality.
- pystarburst.functions.arrays_zip(*columns: ColumnOrName) Column #
Merges the given arrays, element-wise, into a single array of rows. The M-th element of the N-th argument will be the N-th field of the M-th output element. If the arguments have an uneven length, missing values are filled with NULL. Currently, this function accepts at most 5 arrays (limitation of Trino zip function).
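The element-wise merge with NULL fill described above corresponds to Python's `zip_longest` (an illustration of the semantics, not the Trino zip; `arrays_zip_py` is a hypothetical helper):

```python
from itertools import zip_longest

def arrays_zip_py(*arrays):
    # Element-wise merge; shorter inputs are padded with None, mirroring
    # the NULL fill for uneven lengths.
    return [list(row) for row in zip_longest(*arrays)]

print(arrays_zip_py([1, 2, 3], ["a", "b"]))  # [[1, 'a'], [2, 'b'], [3, None]]
```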
- pystarburst.functions.asc(c: ColumnOrName) Column #
Returns a Column expression with values sorted in ascending order.
- pystarburst.functions.asc_nulls_first(c: ColumnOrName) Column #
Returns a Column expression with values sorted in ascending order (null values sorted before non-null values).
- pystarburst.functions.asc_nulls_last(c: ColumnOrName) Column #
Returns a Column expression with values sorted in ascending order (null values sorted after non-null values).
- pystarburst.functions.ascii(e: ColumnOrName) Column #
Returns the Unicode code point for the first character of a string. If the string is empty, a value of 0 is returned.
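The described behavior matches Python's built-in `ord` with an empty-string guard (`ascii_py` below is a hypothetical helper for illustration):

```python
def ascii_py(s):
    # Code point of the first character; 0 for the empty string,
    # matching the behavior described above.
    return ord(s[0]) if s else 0

print(ascii_py("Abc"))  # 65
print(ascii_py(""))     # 0
```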
- pystarburst.functions.asin(e: ColumnOrName) Column #
Computes the inverse sine (arc sine) of its input; the result is a number in the interval [-pi/2, pi/2].
- pystarburst.functions.assert_true(col: ColumnOrName, error_description: ColumnOrLiteralStr = None) Column #
Returns null if the input column is true; throws an exception with the provided error message otherwise.
Examples
>>> df = session.createDataFrame([(0,1)], ['a', 'b'])
>>> df.select(assert_true(df.a < df.b).alias('r')).collect()
[Row(r=None)]
>>> df = session.createDataFrame([(0,1)], ['a', 'b'])
>>> df.select(assert_true(df.a < df.b, df.a).alias('r')).collect()
[Row(r=None)]
>>> df = session.createDataFrame([(0,1)], ['a', 'b'])
>>> df.select(assert_true(df.a < df.b, 'error').alias('r')).collect()
[Row(r=None)]
- pystarburst.functions.atan(e: ColumnOrName) Column #
Computes the inverse tangent (arc tangent) of its input; the result is a number in the interval (-pi/2, pi/2).
- pystarburst.functions.atan2(y: ColumnOrName, x: ColumnOrName) Column #
Computes the inverse tangent (arc tangent) of y/x, using the signs of both arguments to determine the quadrant; the result is a number in the interval (-pi, pi].
- pystarburst.functions.avg(e: ColumnOrName) Column #
Returns the average of non-NULL records. If all records inside a group are NULL, the function returns NULL.
- pystarburst.functions.base64(e: ColumnOrName) Column #
Encodes binary into a base64 string representation.
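Base64 is a standardized encoding, so results can be sanity-checked against Python's standard `base64` module (the function itself runs in Trino; this is only a local cross-check):

```python
import base64

# Standard base64 text for a binary value.
encoded = base64.b64encode(b"hello").decode("ascii")
print(encoded)  # 'aGVsbG8='

# Round-trip back to the original bytes.
assert base64.b64decode(encoded) == b"hello"
```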
- pystarburst.functions.bit_length(e: ColumnOrName) Column #
Calculates the bit length for the specified string column.
- pystarburst.functions.bitand(column1: ColumnOrName, column2: ColumnOrName) Column #
Returns the bitwise AND of two numeric expressions.
Examples
>>> df = session.create_dataframe([[19, 25]], schema=["a", "b"])
>>> df.select(bitand("a", "b")).collect()[0][0]
17
- pystarburst.functions.bitnot(e: ColumnOrName) Column #
Returns the bitwise negation of a numeric expression.
Examples
>>> df = session.create_dataframe([1], schema=["a"])
>>> df.select(bitnot("a")).collect()[0][0]
-2
- pystarburst.functions.bitor(column1: ColumnOrName, column2: ColumnOrName) Column #
Returns the bitwise OR of two numeric expressions.
Examples
>>> df = session.create_dataframe([[19, 25]], schema=["a", "b"])
>>> df.select(bitor("a", "b")).collect()[0][0]
27
- pystarburst.functions.bitshiftleft(to_shift_column: ColumnOrName, n: Column | int) Column #
Returns the value left-shifted by n bits.
Examples
>>> df = session.create_dataframe([2], schema=["a"])
>>> df.select(bitshiftleft("a", 1)).collect()[0][0]
4
- pystarburst.functions.bitshiftright(to_shift_column: ColumnOrName, n: Column | int) Column #
Returns the value right-shifted by n bits.
Examples
>>> df = session.create_dataframe([2], schema=["a"])
>>> df.select(bitshiftright("a", 1)).collect()[0][0]
1
- pystarburst.functions.bitwiseNOT(e: ColumnOrName) Column #
Returns the bitwise negation of a numeric expression.
Examples
>>> df = session.create_dataframe([1], schema=["a"])
>>> df.select(bitnot("a")).collect()[0][0]
-2
- pystarburst.functions.bitwise_not(e: ColumnOrName) Column #
Returns the bitwise negation of a numeric expression.
Examples
>>> df = session.create_dataframe([1], schema=["a"])
>>> df.select(bitnot("a")).collect()[0][0]
-2
- pystarburst.functions.bitxor(column1: ColumnOrName, column2: ColumnOrName) Column #
Returns the bitwise XOR of the two numeric expressions in 2’s complement representation.
Examples
>>> df = session.create_dataframe([[19, 25]], schema=["a", "b"])
>>> df.select(bitxor("a", "b")).collect()[0][0]
10
- pystarburst.functions.builtin(function_name: str) Callable #
Returns a function object that invokes a Trino system-defined function (built-in function). Use this to invoke any built-in function not explicitly listed in this module.
- Parameters:
function_name – The name of built-in function in Trino.
- Returns:
A Callable object for calling a Trino system-defined function.
Examples
>>> df = session.create_dataframe([1, 2, 3, 4], schema=["a"])  # a single column with 4 rows
>>> df.select(call_function("avg", col("a"))).show()
----------------
|"avg(""a"")"  |
----------------
|2.500000      |
----------------
>>> my_avg = function('avg')
>>> df.select(my_avg(col("a"))).show()
----------------
|"avg(""a"")"  |
----------------
|2.500000      |
----------------
- pystarburst.functions.call_builtin(function_name: str, *args: ColumnOrLiteral) Column #
Invokes a Trino system-defined function (built-in function) with the specified name and arguments.
- Parameters:
function_name – The name of built-in function in Trino
args – Arguments can be of two types: Column objects, or basic Python types, which are converted to pystarburst literals.
Examples
>>> df = session.create_dataframe([1, 2, 3, 4], schema=["a"])  # a single column with 4 rows
>>> df.select(call_function("avg", col("a"))).show()
----------------
|"avg(""a"")"  |
----------------
|2.500000      |
----------------
- pystarburst.functions.call_function(function_name: str, *args: ColumnOrLiteral) Column #
Invokes a Trino system-defined function (built-in function) with the specified name and arguments.
- Parameters:
function_name – The name of built-in function in Trino
args – Arguments can be of two types: Column objects, or basic Python types, which are converted to pystarburst literals.
Examples
>>> df = session.create_dataframe([1, 2, 3, 4], schema=["a"])  # a single column with 4 rows
>>> df.select(call_function("avg", col("a"))).show()
----------------
|"avg(""a"")"  |
----------------
|2.500000      |
----------------
- pystarburst.functions.call_table_function(function_name: str, *args: ColumnOrLiteral, **kwargs: ColumnOrLiteral) TableFunctionCall #
Invokes a Trino table function, including system-defined table functions and user-defined table functions.
It returns a TableFunctionCall() so you can specify the partition clause.
- Parameters:
function_name – The name of the table function.
args – The positional arguments of the table function.
kwargs – The named arguments of the table function. Some table functions (e.g., flatten) have named arguments instead of positional ones.
Examples
>>> from pystarburst.functions import lit
>>> session.table_function(call_table_function("sequence", lit(0), lit(4)).over()).collect()
[Row(sequential_number=0), Row(sequential_number=1), Row(sequential_number=2), Row(sequential_number=3), Row(sequential_number=4)]
- pystarburst.functions.cast(column: ColumnOrName, to: str | DataType) Column #
Converts a value of one data type into another data type. The semantics of CAST are the same as the semantics of the corresponding datatype conversion functions. If the cast is not possible, an error is raised.
- pystarburst.functions.ceil(e: ColumnOrName) Column #
Returns values from the specified column rounded to the nearest equal or larger integer.
- pystarburst.functions.coalesce(*e: ColumnOrName) Column #
Returns the first non-NULL expression among its arguments, or NULL if all its arguments are NULL.
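The first-non-NULL rule maps naturally onto Python's None, which can be sketched with a hypothetical helper:

```python
def coalesce_py(*values):
    # First non-None value, or None if every argument is None,
    # mirroring SQL COALESCE semantics.
    return next((v for v in values if v is not None), None)

print(coalesce_py(None, None, 3, 7))  # 3
print(coalesce_py(None, None))        # None
```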
- pystarburst.functions.codepoint(e: ColumnOrName) Column #
Returns the Unicode code point for the first character of a string. If the string is empty, a value of 0 is returned.
- pystarburst.functions.collectList(col: ColumnOrName, is_distinct: bool = False) Column #
Returns the input values, pivoted into an ARRAY. If the input is empty, an empty ARRAY is returned.
- pystarburst.functions.collectSet(*args, **kwargs)#
- pystarburst.functions.collect_list(col: ColumnOrName, is_distinct: bool = False) Column #
Returns the input values, pivoted into an ARRAY. If the input is empty, an empty ARRAY is returned.
- pystarburst.functions.collect_set(*args, **kwargs)#
- pystarburst.functions.column(col_name: str) Column #
Returns a Column with the specified name. Alias for col().
- pystarburst.functions.concat(*cols: ColumnOrName) Column #
Concatenates one or more strings, binary values, or arrays. If any of the values is null, the result is also null.
- Parameters:
cols – A list of the columns to concatenate.
Examples
>>> df = session.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df = df.select(concat(df.s, df.d).alias('s'))
>>> df.collect()
[Row(s='abcd123')]
>>> df = session.createDataFrame([([1, 2], [3, 4], [5]), ([1, 2], None, [3])], ['a', 'b', 'c'])
>>> df = df.select(concat(df.a, df.b, df.c).alias("arr"))
>>> df.collect()
[Row(arr=[1, 2, 3, 4, 5]), Row(arr=None)]
- pystarburst.functions.concat_ws(*cols: ColumnOrName) Column #
Concatenates two or more strings or binary values, using the first argument to separate all following arguments. If any of the values is null, the result is also null. CONCAT_WS requires at least two arguments.
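Following the separator and null-propagation rules stated above, the behavior can be sketched in pure Python (illustrative only; real execution happens in Trino, and `concat_ws_py` is a hypothetical helper):

```python
def concat_ws_py(separator, *parts):
    # The first argument separates all following arguments; per the rule
    # above, any NULL (None) input makes the whole result NULL (None).
    if separator is None or any(p is None for p in parts):
        return None
    return separator.join(parts)

print(concat_ws_py("-", "2022", "03", "16"))  # '2022-03-16'
print(concat_ws_py("-", "a", None))           # None
```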
- pystarburst.functions.contains(col: ColumnOrName, string: ColumnOrName) Column #
Returns true if col contains string.
- pystarburst.functions.conv(e: ColumnOrName, from_base: int, to_base: int) Column #
Converts a number represented as a string from one base to another.
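A pure-Python analogue of base conversion, parsing in one base and rendering the digits in another, looks like this (`conv_py` is a hypothetical helper for illustration, not the Trino function):

```python
def conv_py(value, from_base, to_base):
    # Parse the string in from_base, then render the digits in to_base
    # (lowercase a-z for digit values above 9).
    n = int(value, from_base)
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, to_base)
        out.append(digits[r])
    return "".join(reversed(out))

print(conv_py("ff", 16, 10))  # '255'
print(conv_py("10", 10, 2))   # '1010'
```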
- pystarburst.functions.convert_timezone(target_timezone: ColumnOrName, source_time: ColumnOrName, source_timezone: ColumnOrName | None = None) Column #
Converts the given source_time to the target timezone.
For timezone information, refer to the Trino time conversion notes
- Parameters:
target_timezone – The time zone to which the input timestamp should be converted.
source_time – The timestamp to convert. When it’s a TIMESTAMP_LTZ, use
None
forsource_timezone
.source_timezone – The time zone for the
source_time
. Required for timestamps with no time zone (i.e. TIMESTAMP_NTZ). UseNone
if the timestamps have a time zone (i.e. TIMESTAMP_LTZ). Default isNone
.
Note
The sequence of the 3 params is different from the SQL function, which has two overloads:
CONVERT_TIMEZONE( <source_tz> , <target_tz> , <source_timestamp_ntz> )
CONVERT_TIMEZONE( <target_tz> , <source_timestamp> )
The first parameter source_tz is optional, but in Python an optional argument cannot be placed first, so source_timezone comes after source_time.
Examples
>>> import datetime >>> from dateutil import tz >>> datetime_with_tz = datetime.datetime(2022, 4, 6, 9, 0, 0, tzinfo=tz.tzoffset("myzone", -3600*7)) >>> datetime_with_no_tz = datetime.datetime(2022, 4, 6, 9, 0, 0) >>> df = session.create_dataframe([[datetime_with_tz, datetime_with_no_tz]], schema=["a", "b"]) >>> result = df.select(convert_timezone(lit("UTC"), col("a")), convert_timezone(lit("UTC"), col("b"), lit("Asia/Shanghai"))).collect() >>> result[0][0] datetime.datetime(2022, 4, 6, 16, 0, tzinfo=<UTC>) >>> result[0][1] datetime.datetime(2022, 4, 6, 1, 0)
- pystarburst.functions.corr(column1: ColumnOrName, column2: ColumnOrName) Column #
Returns the correlation coefficient for non-null pairs in a group.
- pystarburst.functions.cos(e: ColumnOrName) Column #
Computes the cosine of its argument; the argument should be expressed in radians.
- pystarburst.functions.cosh(e: ColumnOrName) Column #
Computes the hyperbolic cosine of its argument.
- pystarburst.functions.count(e: ColumnOrName) Column #
Returns either the number of non-NULL records for the specified columns, or the total number of records.
- pystarburst.functions.countDistinct(*cols: ColumnOrName) Column #
Returns either the number of non-NULL distinct records for the specified columns, or the total number of the distinct records.
- pystarburst.functions.count_distinct(*cols: ColumnOrName) Column #
Returns either the number of non-NULL distinct records for the specified columns, or the total number of the distinct records.
- pystarburst.functions.covar_pop(column1: ColumnOrName, column2: ColumnOrName) Column #
Returns the population covariance for non-null pairs in a group.
- pystarburst.functions.covar_samp(column1: ColumnOrName, column2: ColumnOrName) Column #
Returns the sample covariance for non-null pairs in a group.
- pystarburst.functions.create_map(*cols: ColumnOrName) Column #
Creates a new map out of a series of rows.
- Parameters:
cols – the column containing two-element rows, each one containing the key-value pair
Examples
>>> df = session.createDataFrame( ... [ ... ("Alice", 2), ... ("Bob", 5), ... ("Charlie", 6), ... ], ... schema=["a", "b"]) >>> df.select(create_map("a", "b")).show() --------------------------------------------- |"map_from_entries(array_agg(row (a, b)))" | --------------------------------------------- |{'Bob': 5, 'Alice': 2, 'Charlie': 6} | ---------------------------------------------
- pystarburst.functions.cume_dist() Column #
Finds the cumulative distribution of a value with regard to other values within the same window partition.
- pystarburst.functions.current_catalog() Column #
Returns the name of the catalog in use for the current session.
Examples
>>> # Return result is tied to session, so we only test if the result exists >>> result = session.create_dataframe([1]).select(current_catalog()).collect() >>> assert result is not None
- pystarburst.functions.current_groups() Column #
Returns the list of groups for the current user running the query.
Examples
>>> # Return result is tied to session, so we only test if the result exists >>> result = session.create_dataframe([1]).select(current_groups()).collect() >>> assert result is not None
- pystarburst.functions.current_schema() Column #
Returns the name of the schema in use for the current session.
Examples
>>> # Return result is tied to session, so we only test if the result exists >>> result = session.create_dataframe([1]).select(current_schema()).collect() >>> assert result is not None
- pystarburst.functions.current_timestamp(precision: int | None = None) Column #
Returns the current timestamp for the system.
- pystarburst.functions.current_user() Column #
Returns the name of the user currently logged into the system.
Examples
>>> # Return result is tied to session, so we only test if the result exists >>> result = session.create_dataframe([1]).select(current_user()).collect() >>> assert result is not None
- pystarburst.functions.date_add(col: ColumnOrName, days: ColumnOrName | int) Column #
Adds the specified number of days to the specified date.
- Parameters:
col – The timestamp column
days – The number of days to add
Examples
>>> # add 24 days on dates >>> import datetime >>> date_df = session.create_dataframe([[datetime.date(2020, 10, 20)]], schema=["date_col"]) >>> date_df.select(date_add(col("date_col"), lit(24)).alias("date")).show() ---------------- |"DATE" | ---------------- |2020-11-13 | ----------------
- pystarburst.functions.date_format(col: ColumnOrName, date_time_format: str) Column #
Formats a date or timestamp column as a string, using a format string compatible with JodaTime’s DateTimeFormat pattern format.
- Parameters:
col – The timestamp column
date_time_format – The format string
Examples
>>> # format a date as a string >>> import datetime >>> date_df = session.create_dataframe([[datetime.date(2020, 1, 1)]], schema=["date_col"]) >>> date_df.select(date_format(col("date_col"), "YYYY/MM/dd").alias("date")).show() ---------------- |"DATE" | ---------------- |"2020/01/01" | ----------------
- pystarburst.functions.date_sub(col: ColumnOrName, days: ColumnOrName | int) Column #
Subtracts the specified number of days from the specified date.
- Parameters:
col – The timestamp column
days – The number of days to subtract
Examples
>>> # subtracts 24 days on dates >>> import datetime >>> date_df = session.create_dataframe([[datetime.date(2020, 10, 20)]], schema=["date_col"]) >>> date_df.select(date_sub(col("date_col"), lit(24)).alias("date")).show() ---------------- |"DATE" | ---------------- |2020-09-26 | ----------------
- pystarburst.functions.date_trunc(part: str, expr: ColumnOrName) Column #
Truncates a DATE, TIME, or TIMESTAMP to the specified precision.
Note that truncation is not the same as extraction. For example: - Truncating a timestamp down to the quarter returns the timestamp corresponding to midnight of the first day of the quarter for the input timestamp. - Extracting the quarter date part from a timestamp returns the quarter number of the year in the timestamp.
Examples
>>> import datetime >>> df = session.create_dataframe( ... [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]], ... schema=["a"], ... ) >>> df.select(date_trunc("YEAR", "a"), date_trunc("MONTH", "a"), date_trunc("DAY", "a")).collect() [Row(DATE_TRUNC("YEAR", "A")=datetime.datetime(2020, 1, 1, 0, 0), DATE_TRUNC("MONTH", "A")=datetime.datetime(2020, 5, 1, 0, 0), DATE_TRUNC("DAY", "A")=datetime.datetime(2020, 5, 1, 0, 0))] >>> df.select(date_trunc("HOUR", "a"), date_trunc("MINUTE", "a"), date_trunc("SECOND", "a")).collect() [Row(DATE_TRUNC("HOUR", "A")=datetime.datetime(2020, 5, 1, 13, 0), DATE_TRUNC("MINUTE", "A")=datetime.datetime(2020, 5, 1, 13, 11), DATE_TRUNC("SECOND", "A")=datetime.datetime(2020, 5, 1, 13, 11, 20))] >>> df.select(date_trunc("QUARTER", "a")).collect() [Row(DATE_TRUNC("QUARTER", "A")=datetime.datetime(2020, 4, 1, 0, 0))]
- pystarburst.functions.dateadd(col: ColumnOrName, days: ColumnOrName | int) Column #
Adds the specified number of days to the specified date.
- Parameters:
col – The timestamp column
days – The number of days to add
Examples
>>> # add 24 days on dates >>> import datetime >>> date_df = session.create_dataframe([[datetime.date(2020, 10, 20)]], schema=["date_col"]) >>> date_df.select(dateadd(col("date_col"), lit(24)).alias("date")).show() ---------------- |"DATE" | ---------------- |2020-11-13 | ----------------
- pystarburst.functions.datediff(part: str, col1: ColumnOrName, col2: ColumnOrName) Column #
Calculates the difference between two date, time, or timestamp columns based on the date or time part requested.
- Parameters:
part – The time part to use for calculating the difference
col1 – The first timestamp column or minuend in the datediff
col2 – The second timestamp column or the subtrahend in the datediff
Examples
>>> # year difference between two date columns >>> import datetime >>> date_df = session.create_dataframe([[datetime.date(2020, 1, 1), datetime.date(2021, 1, 1)]], schema=["date_col1", "date_col2"]) >>> date_df.select(datediff("year", col("date_col1"), col("date_col2")).alias("year_diff")).show() --------------- |"YEAR_DIFF" | --------------- |1 | ---------------
- pystarburst.functions.day(e: ColumnOrName) Column #
Extracts the day from a date or timestamp.
Examples
>>> import datetime >>> df = session.create_dataframe([ ... datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"), ... datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f") ... ], schema=["a"]) >>> df.select(day("a")).collect() [Row(day("a")=1), Row(day("a")=21)]
- pystarburst.functions.dayofmonth(e: ColumnOrName) Column #
Extracts the corresponding day (number) of the month from a date or timestamp.
Examples
>>> import datetime >>> df = session.create_dataframe( ... [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]], ... schema=["a"], ... ) >>> df.select(dayofmonth("a")).collect() [Row(day_of_month("A")=1)]
- pystarburst.functions.dayofweek(e: ColumnOrName) Column #
Extracts the corresponding day (number) of the week from a date or timestamp.
Examples
>>> import datetime >>> df = session.create_dataframe( ... [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]], ... schema=["a"], ... ) >>> df.select(dayofweek("a")).collect() [Row(day_of_week("A")=5)]
- pystarburst.functions.dayofyear(e: ColumnOrName) Column #
Extracts the corresponding day (number) of the year from a date or timestamp.
Examples
>>> import datetime >>> df = session.create_dataframe( ... [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]], ... schema=["a"], ... ) >>> df.select(dayofyear("a")).collect() [Row(day_of_year("A")=122)]
- pystarburst.functions.dense_rank() Column #
Returns the rank of a value within a group of values, without gaps in the ranks. The rank value starts at 1 and continues up sequentially. If two values are the same, they will have the same rank.
- pystarburst.functions.desc(c: ColumnOrName) Column #
Returns a Column expression with values sorted in descending order.
- pystarburst.functions.desc_nulls_first(c: ColumnOrName) Column #
Returns a Column expression with values sorted in descending order (null values sorted before non-null values).
- pystarburst.functions.desc_nulls_last(c: ColumnOrName) Column #
Returns a Column expression with values sorted in descending order (null values sorted after non-null values).
- pystarburst.functions.div0(dividend: ColumnOrName | int | float, divisor: ColumnOrName | int | float) Column #
Performs division like the division operator (/), but returns 0 when the divisor is 0 (rather than reporting an error).
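The behaviour described above can be sketched with a small plain-Python helper (hypothetical, for illustration only): normal division, except that a zero divisor yields 0 instead of raising an error.

```python
# Sketch of div0 semantics: /, but a divisor of 0 returns 0 rather
# than failing.
def div0_sketch(dividend, divisor):
    return 0 if divisor == 0 else dividend / divisor

print(div0_sketch(10, 4))  # 2.5
print(div0_sketch(10, 0))  # 0
```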
- pystarburst.functions.element_at(col1: ColumnOrName | int, col2: ColumnOrName | int) Column #
Returns the element of an array at the given index, or the value for the given key in a map.
Note
Trino ARRAY indexing starts from 1.
Examples
>>> from pystarburst.functions import lit >>> df = session.createDataFrame([({"a": 1.0, "b": 2.0}, [1, 2, 3],), ({}, [],)], ["map", "list"]) >>> df.select(element_at(df.list, 1).as_("idx1")).sort(col("idx1")).show() ---------- |"idx1" | ---------- |NULL | |1 | ---------- >>> df.select(element_at(df.map, lit("a")).as_("get_a")).sort(col("get_a")).show() ----------- |"get_a" | ----------- |NULL | |1.0 | -----------
- pystarburst.functions.ends_with(col: ColumnOrName, str: ColumnOrName) Column #
Returns true if col ends with str.
- pystarburst.functions.endswith(col: ColumnOrName, str: ColumnOrName) Column #
Returns true if col ends with str.
- pystarburst.functions.equal_nan(e: ColumnOrName) Column #
Return true if the value in the column is not a number (NaN).
- pystarburst.functions.exists(array: ColumnOrName, func: Callable) Column #
Returns whether any elements of an array match the given predicate. Returns true if one or more elements match the predicate; false if none of the elements matches (a special case is when the array is empty); NULL if the predicate function returns NULL for one or more elements and false for all other elements.
Examples
>>> df = session.createDataFrame([(1, [1, 2, 3, 4]), (2, [3, -1, 0])],["key", "values"]) >>> df.select(exists("values", lambda x: x < 0).alias("any_negative")).show() ------------------ |"any_negative" | ------------------ |False | |True | ------------------
- pystarburst.functions.exp(e: ColumnOrName) Column #
Computes Euler’s number e raised to a floating-point value.
- pystarburst.functions.expm1(e: ColumnOrName) Column #
Computes Euler’s number e raised to a floating-point value, minus one (i.e. exp(x) - 1).
- pystarburst.functions.expr(sql: str) Column #
Creates a Column expression from raw SQL text. Note that the function does not interpret or check the SQL text.
- pystarburst.functions.fail(error_description: ColumnOrLiteralStr) Column #
Throws an exception with the provided error message.
- Parameters:
error_description – A Column object or column name that determines the error description
Examples
>>> df.select(fail("unsupported operation"))
- pystarburst.functions.filter(array: ColumnOrName, func: Callable) Column #
Constructs an array from those elements of array for which func returns true.
Examples
>>> df = session.createDataFrame( ... [(1, ["2018-09-20", "2019-02-03", "2019-07-01", "2020-06-01"])], ... ["key", "values"] ... ) >>> def after_second_quarter(x): ... return month(to_date(x)) > 6 >>> df.select( ... filter("values", after_second_quarter).alias("after_second_quarter") ... ).show() -------------------------------- |"after_second_quarter" | -------------------------------- |['2018-09-20', '2019-07-01'] | --------------------------------
- pystarburst.functions.first_value(e: ColumnOrName, ignore_nulls: bool = False) Column #
Returns the first value within an ordered group of values.
- pystarburst.functions.flatten(array: ColumnOrName) Column #
Returns a single array from an array of arrays. If the array is nested more than two levels deep, then only a single level of nesting is removed.
- Parameters:
array – the input array
Examples
>>> df = session.createDataFrame([([[1, 2, 3], [4, 5], [6]],), ([None, [4, 5]],)], ['data']) >>> df.select(flatten(df.data)).show() ---------------------- |"flatten(data)" | ---------------------- |[1, 2, 3, 4, 5, 6] | |[4, 5] | ----------------------
- pystarburst.functions.floor(e: ColumnOrName) Column #
Returns values from the specified column rounded to the nearest equal or smaller integer.
- pystarburst.functions.forall(array: ColumnOrName, func: Callable) Column #
Returns whether all elements of an array match the given predicate. Returns true if all the elements match the predicate (a special case is when the array is empty); false if one or more elements don’t match; NULL if the predicate function returns NULL for one or more elements and true for all other elements.
Examples
>>> df = session.createDataFrame( ... [(1, ["bar"]), (2, ["foo", "bar"]), (3, ["foobar", "foo"])], ... ["key", "values"] ... ) >>> df.select(forall("values", lambda x: x.rlike("foo")).alias("all_foo")).show() ------------- |"all_foo" | ------------- |False | |False | |True | -------------
- pystarburst.functions.format_string(format: str, *e: ColumnOrName) Column #
Formats the arguments in printf-style and returns the result as a string column.
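Python's % operator follows the same printf-style conventions, so the expected output of format_string can be sketched locally (plain Python, not the pystarburst call):

```python
# printf-style placeholders: %s string, %d integer, %.1f one-decimal
# float; %% is a literal percent sign.
template = "%s scored %d points (%.1f%%)"
print(template % ("Alice", 42, 87.5))  # Alice scored 42 points (87.5%)
```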
- pystarburst.functions.from_json(col: ColumnOrName, to: str | DataType) Column #
Parses a column containing a JSON string into the specified type. Casting to BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, REAL, DOUBLE or VARCHAR is supported. Casting to ARRAY and MAP is supported when the element type of the array is one of the supported types, or when the key type of the map is VARCHAR and the value type of the map is one of the supported types.
Examples
>>> df = session.sql("""SELECT JSON '{"v1":123,"v2":"abc","v3":true}' as col1""") >>> schema = StructType([StructField('v1', IntegerType(), nullable=True), StructField('v2', StringType(), nullable=True), StructField('v3', BooleanType(), nullable=True)]) >>> df.select(from_json("col1", schema)).collect() [Row(col1=(v1: 123, v2: 'abc', v3: True))] >>> df = session.sql("select JSON '[1,null,456]' as col1") >>> df.select(from_json("col1", ArrayType(IntegerType()))).collect() [Row(col1=[1, None, 456])]
- pystarburst.functions.from_unixtime(col: ColumnOrName, date_time_format: str = 'yyyy-MM-dd HH:mm:ss') Column #
Converts a Unix timestamp into a string with the given pattern ('yyyy-MM-dd HH:mm:ss' by default).
- Parameters:
col – The Unix timestamp column
date_time_format – The format string
Examples
>>> # Example run in the EST timezone >>> import datetime >>> date_df = session.create_dataframe([[1428476356]], schema=["unix_time"]) >>> date_df.select(from_unixtime(col("unix_time"), "YYYY/MM/dd hh:mm:ss").alias("datetime")).show() ------------------------ |"DATETIME" | ------------------------ |"2015/04/08 02:59:16" | ------------------------
- pystarburst.functions.from_utc_timestamp(col: ColumnOrName, col_tz: ColumnOrLiteralStr) Column #
Takes a timezone-agnostic timestamp, interprets it as a timestamp in UTC, and renders it as a timestamp in the given time zone.
- Parameters:
col – The timestamp column
col_tz – the timezone column
Examples
>>> df = session.create_dataframe( ... [["1997-02-28 1:30:00", "Japan"]], ... schema=["timestamp", "tz"], ... ) >>> df.select(from_utc_timestamp(col("timestamp"), col("tz")).alias("datetime")).show() --------------------------- |"DATETIME" | --------------------------- |1997-02-28 10:30:00 Japan | ---------------------------
- pystarburst.functions.function(function_name: str) Callable #
Returns a function object that invokes a Trino system-defined function (built-in function). Use this to invoke any built-in function not explicitly listed in this module.
- Parameters:
function_name – The name of built-in function in Trino.
- Returns:
A Callable object for calling a Trino system-defined function.
Examples
>>> df = session.create_dataframe([1, 2, 3, 4], schema=["a"]) # a single column with 4 rows >>> df.select(call_function("avg", col("a"))).show() ---------------- |"avg(""a"")" | ---------------- |2.500000 | ---------------- >>> my_avg = function('avg') >>> df.select(my_avg(col("a"))).show() ---------------- |"avg(""a"")" | ---------------- |2.500000 | ----------------
- pystarburst.functions.get(col1: ColumnOrName | int, col2: ColumnOrName | int) Column #
Returns element of array at given (0-based) index. If the index points outside of the array boundaries, then this function returns NULL.
Examples
>>> df = session.createDataFrame([({"a": 1.0, "b": 2.0}, [1, 2, 3],), ({}, [],)], ["map", "list"]) >>> df.select(get(df.list, 1).as_("idx1")).sort(col("idx1")).show() ---------- |"idx1" | ---------- |NULL | |2 | ----------
- pystarburst.functions.get_json_object(col: ColumnOrName, path: str, json_path_mode: str = 'lax') Column #
Extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It will return null if the input json string is invalid.
- Parameters:
col – string column in json format
path – path to the json object to extract
json_path_mode – The JSON path expression can be evaluated in two modes: strict and lax. In the strict mode, it is required that the input JSON data strictly fits the schema required by the path expression. In the lax mode, the input JSON data can diverge from the expected schema. Details and examples: https://trino.io/docs/current/functions/json.html#json-path-modes
Examples
>>> data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')] >>> df = session.createDataFrame(data, ["key", "jstring"]) >>> df.select(df.key, get_json_object(df.jstring, '$.f1').alias("c0"), get_json_object(df.jstring, '$.f2').alias("c1")).collect() [Row(key='1', c0='"value1"', c1='"value2"'), Row(key='2', c0='"value12"', c1=None)]
- pystarburst.functions.greatest(*columns: ColumnOrName) Column #
Returns the largest value from a list of expressions. If any of the argument values is NULL, the result is NULL. GREATEST supports all data types, including VARIANT.
- pystarburst.functions.grouping(*cols: ColumnOrName) Column #
Describes which of a list of expressions are grouped in a row produced by a GROUP BY query.
grouping_id() is an alias of grouping().
Examples
>>> from pystarburst import GroupingSets >>> df = session.create_dataframe([[1, 2, 3], [4, 5, 6]],schema=["a", "b", "c"]) >>> grouping_sets = GroupingSets([col("a")], [col("b")], [col("a"), col("b")]) >>> df.group_by_grouping_sets(grouping_sets).agg([count("c"), grouping("a"), grouping("b"), grouping("a", "b")]).collect() [Row(A=1, B=2, COUNT(C)=1, GROUPING(A)=0, GROUPING(B)=0, GROUPING(A, B)=0), Row(A=4, B=5, COUNT(C)=1, GROUPING(A)=0, GROUPING(B)=0, GROUPING(A, B)=0), Row(A=1, B=None, COUNT(C)=1, GROUPING(A)=0, GROUPING(B)=1, GROUPING(A, B)=1), Row(A=4, B=None, COUNT(C)=1, GROUPING(A)=0, GROUPING(B)=1, GROUPING(A, B)=1), Row(A=None, B=2, COUNT(C)=1, GROUPING(A)=1, GROUPING(B)=0, GROUPING(A, B)=2), Row(A=None, B=5, COUNT(C)=1, GROUPING(A)=1, GROUPING(B)=0, GROUPING(A, B)=2)]
- pystarburst.functions.hash(col: ColumnOrName) Column #
Computes the 128-bit MurmurHash3 hash of a binary value.
- Parameters:
col – The column or value to be hashed
Examples
>>> df = session.create_dataframe(['a'], schema=["a"]) >>> df.select(xxhash64("a").alias("hash")).collect() [Row(hash=-3292477735350538661)]
- pystarburst.functions.hour(e: ColumnOrName) Column #
Extracts the hour from a date or timestamp.
Examples
>>> import datetime >>> df = session.create_dataframe([ ... datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"), ... datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f") ... ], schema=["a"]) >>> df.select(hour("a")).collect() [Row(HOUR("A")=13), Row(HOUR("A")=1)]
- pystarburst.functions.iff(condition: ColumnOrSqlExpr, expr1: ColumnOrLiteral, expr2: ColumnOrLiteral) Column #
Returns one of two specified expressions, depending on a condition. This is equivalent to an if-then-else expression.
- Parameters:
condition – A Column expression or SQL text representing the specified condition.
expr1 – A Column expression or a literal value, which will be returned if condition is true.
expr2 – A Column expression or a literal value, which will be returned if condition is false.
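The if-then-else behaviour maps directly onto Python's conditional expression; the helper below is a hypothetical plain-Python sketch of iff's semantics, not the library call:

```python
# iff(condition, expr1, expr2): expr1 when the condition holds,
# otherwise expr2.
def iff_sketch(condition, expr1, expr2):
    return expr1 if condition else expr2

print(iff_sketch(3 > 1, "yes", "no"))  # yes
```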
- pystarburst.functions.in_(cols: List[ColumnOrName], *vals: DataFrame | LiteralType | Iterable[LiteralType]) Column #
Returns a conditional expression that you can pass to the filter or where methods to perform the equivalent of a WHERE … IN query that matches rows containing a sequence of values.
The expression evaluates to true if the values in a row match the values in one of the specified sequences.
- Parameters:
cols – A list of the columns to compare for the IN operation.
vals – A list containing the values to compare for the IN operation.
Examples
>>> # The following code returns a DataFrame that contains the rows in which >>> # the columns `c1` and `c2` contain the values: >>> # - `1` and `"a"`, or >>> # - `2` and `"b"` >>> # This is equivalent to ``SELECT * FROM table WHERE (c1, c2) IN ((1, 'a'), (2, 'b'))``. >>> df = session.create_dataframe([[1, "a"], [2, "b"], [3, "c"]], schema=["col1", "col2"]) >>> df.filter(in_([col("col1"), col("col2")], [[1, "a"], [2, "b"]])).show() ------------------- |"COL1" |"COL2" | ------------------- |1 |a | |2 |b | ------------------- >>> # The following code returns a DataFrame that contains the rows where >>> # the values of the columns `c1` and `c2` in `df2` match the values of the columns >>> # `a` and `b` in `df1`. This is equivalent to >>> # ``SELECT * FROM table2 WHERE (c1, c2) IN (SELECT a, b FROM table1)``. >>> df1 = session.sql("select 1, 'a'") >>> df.filter(in_([col("col1"), col("col2")], df1)).show() ------------------- |"COL1" |"COL2" | ------------------- |1 |a | -------------------
- pystarburst.functions.initcap(e: ColumnOrName) Column #
Translates the first letter of each word in the string to uppercase.
- pystarburst.functions.instr(e: ColumnOrName, substring: ColumnOrLiteral) Column #
Locates the position of the first occurrence of the substr column in the given string. Returns null if either of the arguments is null.
- pystarburst.functions.is_nan(x: ColumnOrName) Column #
Determine if x is not-a-number.
Examples
>>> df = session.create_dataframe([2], schema=["a"]) >>> df.select(is_nan("a")).collect()[0][0] False
- pystarburst.functions.is_not_null(e: ColumnOrName) Column #
Return true if the value in the column is not null.
- pystarburst.functions.is_null(e: ColumnOrName) Column #
Return true if the value in the column is null.
- pystarburst.functions.isnan(x: ColumnOrName) Column #
Determine if x is not-a-number.
Examples
>>> df = session.create_dataframe([2], schema=["a"]) >>> df.select(is_nan("a")).collect()[0][0] False
- pystarburst.functions.isnull(e: ColumnOrName) Column #
Return true if the value in the column is null.
- pystarburst.functions.json_array_length(col: ColumnOrName) Column #
Returns the array length of json (a string containing a JSON array).
Examples
>>> df = session.createDataFrame([(None,), ('[1, 2, 3]',), ('[]',)], ['data']) >>> df.select(json_array_length(df.data)).show() ----------------------------- |"json_array_length(data)" | ----------------------------- |NULL | |3 | |0 | -----------------------------
- pystarburst.functions.json_extract(col: ColumnOrName, path: ColumnOrName) Column #
Parses a JSON string and returns the value of an element at a specified path in the resulting JSON document.
- pystarburst.functions.json_parse(e: ColumnOrName) Column #
Parse the value of the specified column as a JSON string and returns the resulting JSON document.
- pystarburst.functions.json_tuple(col: ColumnOrName, *fields: str, json_path_mode: str = 'lax') List[Column] #
Creates a new row for a json column according to the given field names.
- Parameters:
col – string column in json format
fields – a field or fields to extract
json_path_mode – The JSON path expression can be evaluated in two modes: strict and lax. In the strict mode, it is required that the input JSON data strictly fits the schema required by the path expression. In the lax mode, the input JSON data can diverge from the expected schema. Details and examples: https://trino.io/docs/current/functions/json.html#json-path-modes
Examples
>>> data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')] >>> df = session.createDataFrame(data, ["key", "jstring"]) >>> json_tuple_list = json_tuple(df.jstring, 'f1', 'f2') >>> df.select(df.key, *json_tuple_list).collect() [Row('1', '"value1"', '"value2"'), Row('2', '"value12"', None)]
- pystarburst.functions.kurtosis(e: ColumnOrName) Column #
Returns the population excess kurtosis of non-NULL records. If all records inside a group are NULL, the function returns NULL.
- pystarburst.functions.lag(e: ColumnOrName, offset: int = 1, default_value: ColumnOrLiteral | None = None, ignore_nulls: bool = False) Column #
Accesses data in a previous row in the same result set without having to join the table to itself.
- pystarburst.functions.last_day(col: ColumnOrName) Column #
Returns the last day of the month to which the given date belongs.
- Parameters:
col – The timestamp column
Examples
>>> import datetime >>> df = session.create_dataframe( ... [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]], ... schema=["a"], ... ) >>> df.select(last_day(col("a")).alias("date")).show() ---------------- |"DATE" | ---------------- |2020-05-31 | ----------------
- pystarburst.functions.last_value(e: ColumnOrName, ignore_nulls: bool = False) Column #
Returns the last value within an ordered group of values.
- pystarburst.functions.lead(e: ColumnOrName, offset: int = 1, default_value: Column | LiteralType | None = None, ignore_nulls: bool = False) Column #
Accesses data in a subsequent row in the same result set without having to join the table to itself.
- pystarburst.functions.least(*columns: ColumnOrName) Column #
Returns the smallest value from a list of expressions. LEAST supports all data types, including VARIANT.
- pystarburst.functions.length(e: ColumnOrName) Column #
Returns the length of an input string or binary value. For strings, the length is the number of characters, and UTF-8 characters are counted as a single character. For binary, the length is the number of bytes.
- pystarburst.functions.levenshtein(l: ColumnOrName, r: ColumnOrLiteral) Column #
Computes the Levenshtein distance of the two given strings.
- pystarburst.functions.levenshtein_distance(l: ColumnOrName, r: ColumnOrLiteral) Column #
Computes the Levenshtein distance of the two given strings.
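For intuition about what both functions above compute, here is a reference pure-Python implementation of the Levenshtein (edit) distance, a sketch rather than the engine's implementation:

```python
# Levenshtein distance: minimum number of single-character insertions,
# deletions, and substitutions turning one string into the other.
def levenshtein_sketch(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein_sketch("kitten", "sitting"))  # 3
```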
- pystarburst.functions.listagg(col: ColumnOrName, delimiter: str = '', *within_group: ColumnOrName | Iterable[ColumnOrName], is_distinct: bool = False) Column #
Returns the concatenated input values, separated by delimiter string. See LISTAGG for details.
- Parameters:
col – a Column object or column name that determines the values to be put into the list.
delimiter – a string delimiter.
is_distinct – whether the input expression is distinct.
Examples
>>> df.group_by(df.col1).agg(listagg(df.col2, ",", f.col2.asc())) >>> df.select(listagg(df["col2"], ",", f.col2.asc(), is_distinct=False))
- pystarburst.functions.lit(literal: LiteralType) Column #
Creates a Column expression for a literal value. It supports basic Python data types, including int, float, str, bool, bytes, bytearray, datetime.time, datetime.date, datetime.datetime and decimal.Decimal. Also, it supports Python structured data types, including list, tuple and dict, but these containers must be JSON serializable.
- pystarburst.functions.locate(substring: ColumnOrLiteral, e: ColumnOrName, pos: ColumnOrName | int = 1) Column #
Locates the position (1-based) of the first occurrence of substring in a string column, searching from position pos.
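The 1-based position convention can be sketched with str.find. This hypothetical helper assumes the usual SQL convention of returning 0 when the substring is not found; it is an illustration, not the pystarburst implementation:

```python
# locate(substring, s, pos): 1-based position of the first occurrence
# at or after pos, or 0 if absent. str.find is 0-based and returns -1
# when absent, so shifting by one gives the SQL-style result.
def locate_sketch(substring: str, s: str, pos: int = 1) -> int:
    return s.find(substring, pos - 1) + 1

print(locate_sketch("bar", "foobarbar"))     # 4
print(locate_sketch("bar", "foobarbar", 5))  # 7
print(locate_sketch("zzz", "foobarbar"))     # 0
```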
- pystarburst.functions.log(base: ColumnOrName | int | float, x: ColumnOrName | int | float) Column #
Returns the logarithm of x in the given base.
- pystarburst.functions.log10(x: ColumnOrName | int | float) Column #
Returns the logarithm in base 10 of a numeric expression.
- pystarburst.functions.log1p(x: ColumnOrName | int | float) Column #
Returns the natural logarithm of a numeric expression plus one (i.e. log(1 + x)).
- pystarburst.functions.log2(x: ColumnOrName | int | float) Column #
Returns the base 2 logarithm of a numeric expression.
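The log variants above correspond to standard math-library calls, sketched below in plain Python. Note one assumed difference in argument order: the signature above is log(base, x), while Python's math.log takes (x, base).

```python
import math

print(math.log(8, 2))    # log of 8 in base 2, approximately 3.0
print(math.log10(1000))  # base-10 logarithm
print(math.log1p(0.0))   # log(1 + x), exact for small x; here 0.0
print(math.log2(8))      # base-2 logarithm, 3.0
```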
- pystarburst.functions.lower(e: ColumnOrName) Column #
Returns the input string with all characters converted to lowercase.
- pystarburst.functions.lpad(e: ColumnOrName, len: Column | int, pad: ColumnOrName) Column #
Left-pads a string with characters from another string, or left-pads a binary value with bytes from another binary value.
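The padding behaviour can be sketched in plain Python. This hypothetical helper assumes the pad string is repeated as needed and the result is truncated to the requested length when the input is already longer; it illustrates the idea, not the exact pystarburst behaviour:

```python
# Left-pad s to the given length using a (possibly multi-character)
# pad string; str.rjust only pads with one character, so we tile the
# pad string and trim the fill to fit.
def lpad_sketch(s: str, length: int, pad: str) -> str:
    if len(s) >= length:
        return s[:length]
    fill = (pad * length)[: length - len(s)]
    return fill + s

print(lpad_sketch("7", 3, "0"))    # 007
print(lpad_sketch("ab", 5, "xy"))  # xyxab
```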
- pystarburst.functions.ltrim(e: ColumnOrName, trim_string: ColumnOrName | None = None) Column #
Removes leading characters, including whitespace, from a string.
- pystarburst.functions.make_date(col_year: ColumnOrName, col_month: ColumnOrName, col_day: ColumnOrName) Column #
Generates a date from year, month, and day columns.
- Parameters:
col_year – The year column
col_month – The month column
col_day – The day column
Examples
>>> import datetime >>> df = session.create_dataframe( ... [[2020, 1, 30]], ... schema=["a", "b", "c"], ... ) >>> df.select(make_date("a", "b", "c")).collect() [Row(MAKE_DATE("A", "B", "C")=datetime.date(2020, 1, 30))]
- pystarburst.functions.map_concat(*cols: ColumnOrName) Column #
Returns the union of all the given maps
- Parameters:
cols – the maps to concatenate
Examples
>>> df = session.sql("SELECT MAP_FROM_ENTRIES(ARRAY[(1, 'a'), (2, 'b')]) as a, MAP_FROM_ENTRIES(ARRAY[(3, 'c')]) as b")
>>> df.select(map_concat('a', 'b')).show()
----------------------------
|"map_concat(a, b)"        |
----------------------------
|{1: 'a', 2: 'b', 3: 'c'}  |
----------------------------
- pystarburst.functions.map_entries(col: ColumnOrName) Column #
Returns an unordered array of all entries in the given map
- Parameters:
col – the input map
Examples
>>> df = session.sql("SELECT MAP_FROM_ENTRIES(ARRAY[(1, 'a'), (2, 'b')]) as a")
>>> df.select(map_entries('a')).show()
------------------------
|"map_entries(a)"      |
------------------------
|[(1, 'a'), (2, 'b')]  |
------------------------
- pystarburst.functions.map_filter(col: ColumnOrName, func: Callable) Column #
Constructs a map from those entries of map for which function returns true.
Examples
>>> df = session.createDataFrame([(1, {"foo": 42.0, "bar": 1.0, "baz": 32.0})], ["id", "data"])
>>> row = df.select(map_filter(
...     "data", lambda _, v: v > 30.0).alias("data_filtered")
... ).head()
>>> sorted(row["data_filtered"].items())
[('baz', 32.0), ('foo', 42.0)]
- pystarburst.functions.map_from_arrays(col1: ColumnOrName, col2: ColumnOrName) Column #
Creates a new map from two arrays
- Parameters:
col1 – the array containing all the keys
col2 – the array containing all the values
Examples
>>> df = session.createDataFrame([(["Alice", "Bob", "Charlie"], [2, 5, 8])], schema=["a", "b"])
>>> df.select(map_from_arrays("a", "b")).show()
----------------------------------------
|"map_from_entries(zip(a, b))"         |
----------------------------------------
|{'Bob': 5, 'Alice': 2, 'Charlie': 8}  |
----------------------------------------
- pystarburst.functions.map_from_entries(col: ColumnOrName) Column #
Converts an array of entries (key-value structs) to a map.
- Parameters:
col – the input array of entries
Examples
>>> df = session.sql("SELECT ARRAY[(1, 'a'), (2, 'b')] as a")
>>> df.select(map_from_entries('a')).show()
-------------------------
|"map_from_entries(a)"  |
-------------------------
|{1: 'a', 2: 'b'}       |
-------------------------
- pystarburst.functions.map_keys(col: ColumnOrName) Column #
Returns an unordered array containing the keys of the map
- Parameters:
col – the input map
Examples
>>> df = session.sql("SELECT MAP_FROM_ENTRIES(ARRAY[(1, 'a'), (2, 'b')]) as a")
>>> df.select(map_keys('a')).show()
-----------------
|"map_keys(a)"  |
-----------------
|[1, 2]         |
-----------------
- pystarburst.functions.map_values(col: ColumnOrName) Column #
Returns an unordered array containing the values of the map
- Parameters:
col – the input map
Examples
>>> df = session.sql("SELECT MAP_FROM_ENTRIES(ARRAY[(1, 'a'), (2, 'b')]) as a")
>>> df.select(map_values('a')).show()
-------------------
|"map_values(a)"  |
-------------------
|['a', 'b']       |
-------------------
- pystarburst.functions.map_zip_with(col1: ColumnOrName, col2: ColumnOrName, func: Callable) Column #
Merges the two given maps into a single map by applying function to the pair of values with the same key. For keys present in only one map, NULL is passed as the value for the missing key.
Examples
>>> df = session.createDataFrame([
...     (1, {"IT": 24.0, "SALES": 12.00}, {"IT": 2.0, "SALES": 1.4})],
...     ["id", "base", "ratio"]
... )
>>> row = df.select(map_zip_with(
...     "base", "ratio", lambda k, v1, v2: round(v1 * v2, 2)).alias("updated_data")
... ).head()
>>> sorted(row["updated_data"].items())
[('IT', 48.0), ('SALES', 16.8)]
- pystarburst.functions.max(e: ColumnOrName) Column #
Returns the maximum value for the records in a group. NULL values are ignored unless all the records are NULL, in which case a NULL value is returned.
- pystarburst.functions.max_by(e: ColumnOrName, f: ColumnOrName) Column #
Returns the value of e associated with the maximum value of f for the records in a group. NULL values are ignored unless all the records are NULL, in which case a NULL value is returned.
- pystarburst.functions.md5(e: ColumnOrName) Column #
Returns a 32-character hex-encoded string containing the 128-bit MD5 message digest.
- pystarburst.functions.mean(e: ColumnOrName) Column #
Returns the average for the specified numeric column. Alias of
avg()
.
- pystarburst.functions.min(e: ColumnOrName) Column #
Returns the minimum value for the records in a group. NULL values are ignored unless all the records are NULL, in which case a NULL value is returned.
- pystarburst.functions.min_by(e: ColumnOrName, f: ColumnOrName) Column #
Returns the value of e associated with the minimum value of f for the records in a group. NULL values are ignored unless all the records are NULL, in which case a NULL value is returned.
- pystarburst.functions.minute(e: ColumnOrName) Column #
Extracts the minute from a date or timestamp.
Examples
>>> import datetime
>>> df = session.create_dataframe([
...     datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"),
...     datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f")
... ], schema=["a"])
>>> df.select(minute("a")).collect()
[Row(MINUTE("A")=11), Row(MINUTE("A")=30)]
- pystarburst.functions.mode(e: ColumnOrName) Column #
Returns the most frequent value for the records in a group. NULL values are ignored. If all the values are NULL, or there are 0 rows, then the function returns NULL.
- pystarburst.functions.month(e: ColumnOrName) Column #
Extracts the month from a date or timestamp.
Examples
>>> import datetime
>>> df = session.create_dataframe([
...     datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"),
...     datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f")
... ], schema=["a"])
>>> df.select(month("a")).collect()
[Row(MONTH("A")=5), Row(MONTH("A")=8)]
- pystarburst.functions.negate(e: ColumnOrName) Column #
Returns the negation of the value in the column (equivalent to a unary minus).
- pystarburst.functions.nth_value(e: ColumnOrName, offset: int, ignore_nulls: bool = False) Column #
Returns the value at the specified offset from the beginning of the window. Offsets start at 1. The offset can be any scalar expression. If the offset is null or greater than the number of values in the window, null is returned. It is an error for the offset to be zero or negative.
- pystarburst.functions.ntile(e: int | ColumnOrName) Column #
Divides an ordered data set equally into the number of buckets specified by n. Buckets are sequentially numbered 1 through n.
- Parameters:
e – The desired number of buckets; must be a positive integer value.
- pystarburst.functions.octet_length(e: ColumnOrName) Column #
Calculates the byte length for the specified string column.
- pystarburst.functions.overlay(e: ColumnOrName, replace: ColumnOrLiteral, pos: Column | int, len: Column | int = -1) Column #
Overlay the specified portion of src with replace, starting from byte position pos of src and proceeding for len bytes.
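The 1-based replacement semantics, with a negative len defaulting to the length of the replacement, can be sketched in pure Python; `overlay_py` is a hypothetical illustration, not the library code:

```python
def overlay_py(src: str, replace: str, pos: int, length: int = -1) -> str:
    """Replace length characters of src starting at 1-based position pos
    with replace; a negative length defaults to len(replace)."""
    if length < 0:
        length = len(replace)
    return src[: pos - 1] + replace + src[pos - 1 + length :]

print(overlay_py("SPARK_SQL", "CORE", 7))   # 'SPARK_CORE'
print(overlay_py("abcdef", "ZZ", 2, 3))     # 'aZZef'
```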
- pystarburst.functions.parse_json(e: ColumnOrName) Column #
Parse the value of the specified column as a JSON string and returns the resulting JSON document.
- pystarburst.functions.percent_rank() Column #
Returns the relative rank of a value within a group of values, specified as a percentage ranging from 0.0 to 1.0.
- pystarburst.functions.percentile_approx(col: ColumnOrName, percentile: float) Column #
Returns an approximated value for the desired percentile. This function uses the t-Digest algorithm.
- pystarburst.functions.pmod(e: ColumnOrName, f: ColumnOrName) Column #
Returns the positive remainder of e divided by f.
- pystarburst.functions.pow(left: ColumnOrName | int | float, right: ColumnOrName | int | float) Column #
Returns a number (left) raised to the specified power (right).
- pystarburst.functions.power(x: ColumnOrName, p: Column | int) Column #
Returns x raised to the power of p.
Examples
>>> df = session.create_dataframe([2], schema=["a"])
>>> df.select(power("a", 2)).collect()[0][0]
4
- pystarburst.functions.quarter(e: ColumnOrName) Column #
Extracts the quarter from a date or timestamp.
Examples
>>> import datetime
>>> df = session.create_dataframe([
...     datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"),
...     datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f")
... ], schema=["a"])
>>> df.select(quarter("a")).collect()
[Row(QUARTER("A")=2), Row(QUARTER("A")=3)]
- pystarburst.functions.raise_error(error_description: ColumnOrLiteralStr) Column #
Throws an exception with the provided error message.
- Parameters:
error_description – A
Column
object or column name that determines the error description
Examples
>>> df.select(raise_error("unsupported operation"))
- pystarburst.functions.random(m: ColumnOrName | int | None = None, n: ColumnOrName | int | None = None) Column #
Each call returns a pseudo-random value
- pystarburst.functions.rank() Column #
Returns the rank of a value within an ordered group of values. The rank value starts at 1 and continues up.
- pystarburst.functions.reduce(array: ColumnOrName, initialState: ColumnOrName, input_func: Callable, output_func: Callable | None = None) Column #
Returns a single value reduced from array. input_func is invoked for each element in array in order. In addition to taking the element, input_func takes the current state, initially initialState, and returns the new state. output_func is invoked to turn the final state into the result value; it defaults to the identity function (i -> i) when not specified.
Examples
>>> df = session.createDataFrame([(1, [20.0, 4.0, 2.0, 6.0, 10.0])], ["id", "values"])
>>> df.select(reduce("values", lit(0.0), lambda acc, x: acc + x).alias("sum")).show()
---------
|"sum"  |
---------
|42.0   |
---------
- pystarburst.functions.regexp_count(subject: ColumnOrName, pattern: ColumnOrLiteralStr) Column #
Returns the number of times that a pattern occurs in the subject.
- pystarburst.functions.regexp_extract(subject: ColumnOrName, pattern: ColumnOrLiteralStr, group: Column | int | None = 0) Column #
Finds the first occurrence of the regular expression pattern in subject and returns the capturing group number group.
- pystarburst.functions.regexp_replace(subject: ColumnOrName, pattern: ColumnOrLiteralStr, replacement: ColumnOrLiteralStr = '') Column #
Returns the subject with the specified pattern (or all occurrences of the pattern) either removed or replaced by a replacement string. If no matches are found, returns the original subject.
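The remove-or-replace behavior resembles Python's re.sub with an empty default replacement; `regexp_replace_py` below is a hypothetical sketch, and note that Trino uses Java-style regex syntax, which differs from Python's re module for advanced patterns:

```python
import re

def regexp_replace_py(subject: str, pattern: str, replacement: str = "") -> str:
    """Remove (default) or replace every match of pattern in subject."""
    return re.sub(pattern, replacement, subject)

print(regexp_replace_py("1a 2b 14m", r"\d+"))       # 'a b m'
print(regexp_replace_py("1a 2b 14m", r"\d+", "#"))  # '#a #b #m'
```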
- pystarburst.functions.repeat(s: ColumnOrName, n: Column | int) Column #
Builds a string by repeating the input for the specified number of times.
- pystarburst.functions.replace(subject: ColumnOrName, pattern: ColumnOrLiteralStr, replacement: ColumnOrLiteralStr = '') Column #
Removes all occurrences of the specified pattern from subject, or replaces them with replacement if one is provided.
- pystarburst.functions.reverse(col: ColumnOrName) Column #
Returns a reversed string or an array with reverse order of elements.
Examples
>>> df = session.create_dataframe([["Hello"], ["abc"]], schema=["col1"])
>>> df.select(reverse(col("col1"))).show()
--------------------
|"reverse(col1)"   |
--------------------
|olleH             |
|cba               |
--------------------
>>> df = session.createDataFrame([([2, 1, 3],), ([1],), ([],)], ['data'])
>>> df.select(reverse(df.data).alias('r')).collect()
[Row(r=[3, 1, 2]), Row(r=[1]), Row(r=[])]
- pystarburst.functions.round(e: ColumnOrName, scale: ColumnOrName | int | float = 0) Column #
Returns values from the specified column rounded to scale decimal places (to the nearest integer when scale is 0).
- pystarburst.functions.row_number() Column #
Returns a unique row number for each row within a window partition. The row number starts at 1 and continues up sequentially.
- pystarburst.functions.rpad(e: ColumnOrName, len: Column | int, pad: ColumnOrName) Column #
Right-pads a string with characters from another string, or right-pads a binary value with bytes from another binary value.
- pystarburst.functions.rtrim(e: ColumnOrName, trim_string: ColumnOrName | None = None) Column #
Removes trailing characters, including whitespace, from a string.
- pystarburst.functions.second(e: ColumnOrName) Column #
Extracts the second from a date or timestamp.
Examples
>>> import datetime
>>> df = session.create_dataframe([
...     datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"),
...     datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f")
... ], schema=["a"])
>>> df.select(second("a")).collect()
[Row(SECOND("A")=20), Row(SECOND("A")=5)]
- pystarburst.functions.sequence(start: ColumnOrName, stop: ColumnOrName, step: ColumnOrName | None = None) Column #
Generates a sequence of integers from start to stop, incrementing by step. If step is not set, it defaults to 1 if start is less than or equal to stop, and -1 otherwise.
- Parameters:
start – the column that contains the integer to start with (inclusive).
stop – the column that contains the integer to stop (inclusive).
step – the column that contains the integer to increment.
Examples
>>> df1 = session.create_dataframe([(-2, 2)], ["a", "b"])
>>> df1.select(sequence("a", "b").alias("result")).show()
---------------------
|"result"           |
---------------------
|[-2, -1, 0, 1, 2]  |
---------------------
>>> df2 = session.create_dataframe([(4, -4, -2)], ["a", "b", "c"])
>>> df2.select(sequence("a", "b", "c").alias("result")).show()
---------------------
|"result"           |
---------------------
|[4, 2, 0, -2, -4]  |
---------------------
- pystarburst.functions.sha1(e: ColumnOrName) Column #
Returns a 40-character hex-encoded string containing the 160-bit SHA-1 message digest.
- pystarburst.functions.sha2(e: ColumnOrName, num_bits: int) Column #
Returns a hex-encoded string containing the SHA-256 or SHA-512 message digest
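The hex-digest lengths described for md5, sha1, and sha2 match those of Python's standard hashlib; this sketch only illustrates the digest sizes, not how pystarburst invokes the Trino functions:

```python
import hashlib

# md5 -> 32 hex characters (128 bits), sha1 -> 40 (160 bits),
# sha256 -> 64 (256 bits), sha512 -> 128 (512 bits).
print(hashlib.md5(b"abc").hexdigest())
print(len(hashlib.sha1(b"abc").hexdigest()))
print(len(hashlib.sha256(b"abc").hexdigest()))
print(len(hashlib.sha512(b"abc").hexdigest()))
```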
- pystarburst.functions.shuffle(array: ColumnOrName) Column #
Generates a random permutation of the given array.
- Parameters:
array – The column containing the source ARRAY.
Examples
>>> df = session.createDataFrame([([1, 20, 3, 5],), ([1, 20, None, 3],)], ['data'])
>>> df.select(shuffle(df.data).alias('s')).collect()
[Row(s=[3, 1, 5, 20]), Row(s=[20, None, 3, 1])]
- pystarburst.functions.sin(e: ColumnOrName) Column #
Computes the sine of its argument; the argument should be expressed in radians.
- pystarburst.functions.size(array: ColumnOrName) Column #
Returns the cardinality (size) of the array or map.
Examples
>>> from pystarburst import Row
>>> df = session.create_dataframe([Row(a=[1, 2, 3])])
>>> df.select(size("a").alias("result")).show()
------------
|"result"  |
------------
|3         |
------------
- pystarburst.functions.skew(e: ColumnOrName) Column #
Returns the sample skewness of non-NULL records. If all records inside a group are NULL, the function returns NULL.
- pystarburst.functions.skewness(e: ColumnOrName) Column #
Returns the sample skewness of non-NULL records. If all records inside a group are NULL, the function returns NULL.
- pystarburst.functions.sort_array(array: ColumnOrName, sort_ascending: bool | None = True) Column #
Returns the elements of the array column in sorted order. Users can choose the sort order.
- Parameters:
array – the name of the array column, or a Column referencing it
sort_ascending – Boolean that decides if array elements are sorted in ascending order. Defaults to True.
Examples
>>> df = session.sql("select array[20, 0, null, 10] as a")
>>> df.select(sort_array(df.a).as_("sorted_a")).show()
---------------------
|"sorted_a"         |
---------------------
|[0, 10, 20, None]  |
---------------------
>>> df.select(sort_array(df.a, False).as_("sorted_a")).show()
---------------------
|"sorted_a"         |
---------------------
|[None, 20, 10, 0]  |
---------------------
- pystarburst.functions.soundex(e: ColumnOrName) Column #
Returns a string that contains a phonetic representation of the input string.
- pystarburst.functions.split(str: ColumnOrName, pattern: ColumnOrLiteralStr) Column #
Splits a given string with a given separator and returns the result in an array of strings. To specify a string separator, use the
lit()
function.
- pystarburst.functions.sql_expr(sql: str) Column #
Creates a
Column
expression from raw SQL text. Note that the function does not interpret or check the SQL text.
- pystarburst.functions.sqrt(e: ColumnOrName) Column #
Returns the square-root of a non-negative numeric expression.
- pystarburst.functions.starts_with(col: ColumnOrName, str: ColumnOrName) Column #
Returns true if col starts with str.
- pystarburst.functions.startswith(col: ColumnOrName, str: ColumnOrName) Column #
Returns true if col starts with str.
- pystarburst.functions.stddev(e: ColumnOrName) Column #
Returns the sample standard deviation (square root of sample variance) of non-NULL values. If all records inside a group are NULL, returns NULL.
- pystarburst.functions.stddev_pop(e: ColumnOrName) Column #
Returns the population standard deviation (square root of variance) of non-NULL values. If all records inside a group are NULL, returns NULL.
- pystarburst.functions.stddev_samp(e: ColumnOrName) Column #
Returns the sample standard deviation (square root of sample variance) of non-NULL values. If all records inside a group are NULL, returns NULL. Alias of
stddev()
.
- pystarburst.functions.strpos(e: ColumnOrName, substring: ColumnOrLiteral, instance: int | ColumnOrName) Column #
Returns the position of the N-th instance of substring in string. When instance is a negative number the search will start from the end of string. Positions start with 1. If not found, 0 is returned.
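The N-th-instance lookup, including the negative-instance search from the end, can be sketched in pure Python; `strpos_py` is a hypothetical illustration of the described semantics:

```python
def strpos_py(s: str, substring: str, instance: int = 1) -> int:
    """1-based position of the instance-th occurrence of substring in s;
    a negative instance counts occurrences from the end; 0 if not found."""
    if instance == 0 or not substring:
        return 0
    positions, start = [], s.find(substring)
    while start != -1:
        positions.append(start + 1)
        start = s.find(substring, start + 1)
    if abs(instance) > len(positions):
        return 0
    return positions[instance - 1] if instance > 0 else positions[instance]

print(strpos_py("barbarbar", "ar", 2))   # 5
print(strpos_py("barbarbar", "ar", -1))  # 8
print(strpos_py("barbarbar", "zz"))      # 0
```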
- pystarburst.functions.struct(*cols: ColumnOrLiteral) Column #
Creates a new struct column.
- Parameters:
cols – column names or Columns to contain in the output struct.
Examples
>>> df = session.createDataFrame([("Alice", 2), ("Bob", 5)], ["name", "age"])
>>> df.select(struct('age', 'name').alias("struct")).collect()
[Row(struct=(age: 2, name: 'Alice')), Row(struct=(age: 5, name: 'Bob'))]
>>> df.select(struct([df.age, df.name]).alias("struct")).collect()
[Row(struct=(age: 2, name: 'Alice')), Row(struct=(age: 5, name: 'Bob'))]
- pystarburst.functions.substr(str: ColumnOrName, pos: Column | int, len: Column | int | None = None) Column #
Returns the portion of the string or binary value str, starting from the character/byte specified by pos, with limited length. The length should be greater than or equal to zero. If the length is a negative number, the function returns an empty string.
Note
For
pos
, 1 is the first character of the string in Trinosubstr()
is an alias ofsubstring()
.
- pystarburst.functions.substring(str: ColumnOrName, pos: Column | int, len: Column | int | None = None) Column #
Returns the portion of the string or binary value str, starting from the character/byte specified by pos, with limited length. The length should be greater than or equal to zero. If the length is a negative number, the function returns an empty string.
Note
For
pos
, 1 is the first character of the string in Trinosubstr()
is an alias ofsubstring()
.
- pystarburst.functions.substring_index(str: ColumnOrName, delim: ColumnOrLiteral, count: Column | int) Column #
Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned.
If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.
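These left/right counting rules can be sketched in pure Python via str.split; `substring_index_py` is a hypothetical illustration, not the library code:

```python
def substring_index_py(s: str, delim: str, count: int) -> str:
    """Everything before (count > 0) or after (count < 0) the count-th
    occurrence of delim, counting from the left or right respectively."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

print(substring_index_py("a.b.c", ".", 2))   # 'a.b'
print(substring_index_py("a.b.c", ".", -2))  # 'b.c'
```

When count exceeds the number of delimiters, the whole string comes back unchanged, which matches the "final delimiter" phrasing above.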
- pystarburst.functions.sum(e: ColumnOrName) Column #
Returns the sum of non-NULL records in a group. You can use the DISTINCT keyword to compute the sum of unique non-null values. If all records inside a group are NULL, the function returns NULL.
- pystarburst.functions.sumDistinct(e: ColumnOrName) Column #
Returns the sum of non-NULL distinct records in a group. You can use the DISTINCT keyword to compute the sum of unique non-null values. If all records inside a group are NULL, the function returns NULL.
- pystarburst.functions.sum_distinct(e: ColumnOrName) Column #
Returns the sum of non-NULL distinct records in a group. You can use the DISTINCT keyword to compute the sum of unique non-null values. If all records inside a group are NULL, the function returns NULL.
- pystarburst.functions.table_function(function_name: str) Callable #
Create a function object to invoke a Trino table function.
- Parameters:
function_name – The name of the table function.
Examples
>>> from pystarburst.functions import lit
>>> sequence = table_function("sequence")
>>> session.table_function(sequence(lit(0), lit(4)).over()).collect()
[Row(sequential_number=0), Row(sequential_number=1), Row(sequential_number=2), Row(sequential_number=3), Row(sequential_number=4)]
- pystarburst.functions.tan(e: ColumnOrName) Column #
Computes the tangent of its argument; the argument should be expressed in radians.
- pystarburst.functions.tanh(e: ColumnOrName) Column #
Computes the hyperbolic tangent of its argument.
- pystarburst.functions.timestamp_seconds(col: ColumnOrName) Column #
Convert a Unix timestamp into a local datetime.
- Parameters:
col – The Unix timestamp column
Examples
>>> # Example run in the EST timezone
>>> import datetime
>>> date_df = session.create_dataframe([[1428476356]], schema=["unix_time"])
>>> date_df.select(timestamp_seconds(col("unix_time")).alias("datetime")).show()
---------------------------
|"DATETIME"               |
---------------------------
|2015-04-08 02:59:16.000  |
---------------------------
- pystarburst.functions.to_date(col: ColumnOrName, fmt: str = 'yyyy-MM-dd') Column #
Converts an input expression into a date using the optionally specified format. The format must be compatible with JodaTime's DateTimeFormat pattern syntax.
- Parameters:
col – column values to convert.
fmt – format to use to convert date values. Default is
yyyy-MM-dd
.
Examples
>>> df = session.createDataFrame([('1997-02-28',)], ['t'])
>>> df.select(to_date(df.t).alias('date')).collect()
[Row(date=datetime.date(1997, 2, 28))]
>>> df.select(to_date(df.t, 'yyyy-MM-dd').alias('date')).collect()
[Row(date=datetime.date(1997, 2, 28))]
- pystarburst.functions.to_hex(e: ColumnOrName) Column #
Encodes binary into a hex string representation.
- pystarburst.functions.to_json(col: ColumnOrName) Column #
Cast to a JSON string.
Examples
>>> data = [(1, {"name": "Alice"})]
>>> df = session.createDataFrame(data, ["key", "value"])
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json='{"name":"Alice"}')]
>>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])]
>>> df = session.createDataFrame(data, ["key", "value"])
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json='[{"name":"Alice"},{"name":"Bob"}]')]
>>> data = [(1, ["Alice", "Bob"])]
>>> df = session.createDataFrame(data, ["key", "value"])
>>> df.select(to_json(df.value).alias("json")).collect()
[Row(json='["Alice","Bob"]')]
- pystarburst.functions.to_time(e: ColumnOrName, fmt: Column | None = None) Column #
Converts an input expression into the corresponding time.
- pystarburst.functions.to_timestamp(col: ColumnOrName, fmt: str = 'yyyy-MM-dd HH:mm:ss') Column #
Converts an input expression into a timestamp using the optionally specified format. The format must be compatible with JodaTime's DateTimeFormat pattern syntax.
- Parameters:
col – column values to convert.
fmt – format to use to convert timestamp values. Default is
yyyy-MM-dd HH:mm:ss
.
Examples
>>> df = session.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(to_timestamp(df.t).alias('dt')).collect()
[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]
>>> df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()
[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]
- pystarburst.functions.to_utc_timestamp(col: ColumnOrName, col_tz: ColumnOrLiteralStr) Column #
Takes a timezone-agnostic timestamp, interprets it as a timestamp in the given timezone, and renders that timestamp as a timestamp in UTC.
- Parameters:
col – The timestamp column
col_tz – the timezone column
Examples
>>> df = session.create_dataframe(
...     [["1997-02-28 10:30:00", "Japan"]],
...     schema=["timestamp", "tz"],
... )
>>> df.select(to_utc_timestamp(col("timestamp"), col("tz")).alias("datetime")).show()
--------------------------
|"DATETIME"              |
--------------------------
|1997-02-28 1:30:00 UTC  |
--------------------------
- pystarburst.functions.transform(array: ColumnOrName, func: Callable) Column #
Returns an array that is the result of applying function to each element of array
- pystarburst.functions.transform_keys(col: ColumnOrName, func: Callable) Column #
Returns a map that applies function to each entry of map and transforms the keys.
Examples
>>> df = session.createDataFrame([(1, {"foo": -2.0, "bar": 2.0})], ["id", "data"])
>>> row = df.select(transform_keys(
...     "data", lambda k, _: upper(k)).alias("data_upper")
... ).head()
>>> sorted(row["data_upper"].items())
[('BAR', 2.0), ('FOO', -2.0)]
- pystarburst.functions.transform_values(col: ColumnOrName, func: Callable) Column #
Returns a map that applies function to each entry of map and transforms the values.
Examples
>>> df = session.createDataFrame([(1, {"IT": 10.0, "SALES": 2.0, "OPS": 24.0})], ["id", "data"])
>>> row = df.select(transform_values(
...     "data", lambda k, v: v + 10.0
... ).alias("new_data")).head()
>>> sorted(row["new_data"].items())
[('IT', 20.0), ('OPS', 34.0), ('SALES', 12.0)]
- pystarburst.functions.translate(src: ColumnOrName, matching_string: ColumnOrName, replace_string: ColumnOrName) Column #
Translates src by replacing each character found in matching_string with the corresponding character in replace_string.
- pystarburst.functions.trim(e: ColumnOrName, trim_string: ColumnOrName | None = None) Column #
Removes leading and trailing characters from a string.
- pystarburst.functions.trunc(col: ColumnOrName, trunc_format: ColumnOrLiteralStr) Column #
Truncates a DATE, TIME, or TIMESTAMP to the specified precision.
Note that truncation is not the same as extraction. For example:
- Truncating a timestamp down to the quarter returns the timestamp corresponding to midnight of the first day of the quarter for the input timestamp.
- Extracting the quarter date part from a timestamp returns the quarter number of the year in the timestamp.
- Parameters:
col – the date/time/timestamp column
trunc_format – the truncation format
Examples
>>> import datetime
>>> df = session.create_dataframe(
...     [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]],
...     schema=["a"],
... )
>>> df.select(trunc("a", "YEAR"), trunc("a", "MONTH")).collect()
[Row(DATE_TRUNC("YEAR", "A")=datetime.date(2020, 1, 1), DATE_TRUNC("MONTH", "A")=datetime.datetime(2020, 5, 1))]
- pystarburst.functions.try_cast(column: ColumnOrName, to: str | DataType) Column #
A special version of CAST for a subset of data type conversions. It performs the same operation (i.e. converts a value of one data type into another data type), but returns a NULL value instead of raising an error when the conversion can not be performed.
The
column
argument must be a column in Trino.
- pystarburst.functions.typeof(col: ColumnOrName) Column #
Reports the type of a value stored in a VARIANT column. The type is returned as a string.
- pystarburst.functions.unbase64(e: ColumnOrName) Column #
Decodes a BASE64 encoded string column and returns it as a binary column.
- pystarburst.functions.unhex(col: ColumnOrName) Column #
Computes each pair of characters as a hexadecimal number and converts to the byte representation of number.
- pystarburst.functions.unix_timestamp(col: ColumnOrName, date_time_format: str = 'yyyy-MM-dd HH:mm:ss') Column #
Convert a time string with a given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix timestamp (in seconds).
- Parameters:
col – The timestamp column, as a string in the date_time_format pattern
date_time_format – The format string
Examples
>>> # Example run in the EST timezone
>>> import datetime
>>> date_df = session.create_dataframe([["04/08/2015 02:59:16"]], schema=["date_col"])
>>> date_df.select(unix_timestamp(col("date_col"), "MM/dd/yyyy hh:mm:ss").alias("unix_time")).show()
----------------
|"UNIX_TIME"   |
----------------
|1428476356    |
----------------
- pystarburst.functions.upper(e: ColumnOrName) Column #
Returns the input string with all characters converted to uppercase.
- pystarburst.functions.var_pop(e: ColumnOrName) Column #
Returns the population variance of non-NULL records in a group. If all records inside a group are NULL, a NULL is returned.
- pystarburst.functions.var_samp(e: ColumnOrName) Column #
Returns the sample variance of non-NULL records in a group. If all records inside a group are NULL, a NULL is returned. Alias of
variance()
- pystarburst.functions.variance(e: ColumnOrName) Column #
Returns the sample variance of non-NULL records in a group. If all records inside a group are NULL, a NULL is returned.
- pystarburst.functions.week_of_year(e: ColumnOrName) Column #
Extracts the corresponding week (number) of the year from a date or timestamp.
Examples
>>> import datetime
>>> df = session.create_dataframe(
...     [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]],
...     schema=["a"],
... )
>>> df.select(week_of_year("a")).collect()
[Row(WEEKOFYEAR("A")=18)]
- pystarburst.functions.weekofyear(e: ColumnOrName) Column #
Extracts the corresponding week (number) of the year from a date or timestamp.
Examples
>>> import datetime
>>> df = session.create_dataframe(
...     [[datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f")]],
...     schema=["a"],
... )
>>> df.select(weekofyear("a")).collect()
[Row(WEEKOFYEAR("A")=18)]
- pystarburst.functions.when(condition: ColumnOrSqlExpr, value: ColumnOrLiteral) CaseExpr #
Works like a cascading if-then-else statement. A series of conditions are evaluated in sequence. When a condition evaluates to TRUE, the evaluation stops and the associated result (after THEN) is returned. If none of the conditions evaluate to TRUE, then the result after the optional OTHERWISE is returned, if present; otherwise NULL is returned.
- Parameters:
condition – A
Column
expression or SQL text representing the specified condition.value – A
Column
expression or a literal value, which will be returned ifcondition
is true.
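The cascading evaluation order described above can be sketched in pure Python; `case_when` is a hypothetical helper showing the semantics, not the CaseExpr object the library actually builds:

```python
def case_when(branches, otherwise=None):
    """Evaluate (condition, value) pairs in order and return the value of
    the first true condition, else the otherwise value (None models NULL)."""
    for condition, value in branches:
        if condition:
            return value
    return otherwise

print(case_when([(False, "a"), (True, "b"), (True, "c")]))  # 'b'
print(case_when([(False, "a")], otherwise="z"))             # 'z'
```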
- pystarburst.functions.when_matched(condition: Column | None = None) WhenMatchedClause #
Specifies a matched clause for the
Table.merge
action. SeeWhenMatchedClause
for details.
- pystarburst.functions.when_not_matched(condition: Column | None = None) WhenNotMatchedClause #
Specifies a not-matched clause for the
Table.merge
action. SeeWhenNotMatchedClause
for details.
- pystarburst.functions.xxhash64(col: ColumnOrName) Column #
Calculates the hash code of given columns using the 64-bit variant of the xxHash algorithm, and returns the result as a varbinary column.
- Parameters:
col – The column or value to be hashed
Examples
>>> df = session.create_dataframe(['a'], schema=["a"]) >>> df.select(xxhash64("a").alias("xxhash64")).collect() [Row(xxhash64=-3292477735350538661)]
- pystarburst.functions.year(e: ColumnOrName) Column #
Extracts the year from a date or timestamp.
Examples
>>> import datetime >>> df = session.create_dataframe([ ... datetime.datetime.strptime("2020-05-01 13:11:20.000", "%Y-%m-%d %H:%M:%S.%f"), ... datetime.datetime.strptime("2020-08-21 01:30:05.000", "%Y-%m-%d %H:%M:%S.%f") ... ], schema=["a"]) >>> df.select(year("a")).collect() [Row(YEAR("A")=2020), Row(YEAR("A")=2020)]
- pystarburst.functions.zip_with(array1: ColumnOrName, array2: ColumnOrName, func: Callable) Column #
Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function.
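The pad-with-nulls-then-combine behavior maps directly onto itertools.zip_longest; `zip_with_py` is a hypothetical pure-Python sketch of the semantics:

```python
from itertools import zip_longest

def zip_with_py(a, b, func):
    """Pad the shorter list with None (modeling NULL), then combine
    the lists element-wise with func."""
    return [func(x, y) for x, y in zip_longest(a, b)]

print(zip_with_py([1, 2, 3], [10, 20], lambda x, y: (x or 0) + (y or 0)))  # [11, 22, 3]
```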