DataFrame stat functions

DataFrame stat methods

class pystarburst.dataframe_stat_functions.DataFrameStatFunctions(df: DataFrame)

Bases: object

Provides computed statistical functions for DataFrames. To access an object of this class, use DataFrame.stat.

approxQuantile(col: ColumnOrName | Iterable[ColumnOrName], percentile: Iterable[float], *, statement_properties: Dict[str, str] | None = None) → List[float] | List[List[float]]

For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles. This function uses the t-Digest algorithm.

approxQuantile() is an alias of approx_quantile().

  • col – The name of the numeric column, or a list of numeric column names.

  • percentile – A list of float values greater than or equal to 0.0 and less than 1.0.


Returns a list of approximate percentile values if col is a single column name, or a matrix of dimensions len(col) × len(percentile) containing the approximate percentile values if col is a list of column names.


>>> df = session.create_dataframe([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], schema=["a"])
>>> df.stat.approx_quantile("a", [0, 0.1, 0.4, 0.6, 1])
[-0.5, 0.5, 3.5, 5.5, 9.5]

>>> df2 = session.create_dataframe([[0.1, 0.5], [0.2, 0.6], [0.3, 0.7]], schema=["a", "b"])
>>> df2.stat.approx_quantile(["a", "b"], [0, 0.1, 0.6])
[[0.05, 0.15000000000000002, 0.25], [0.45, 0.55, 0.6499999999999999]]
approx_quantile(col: ColumnOrName | Iterable[ColumnOrName], percentile: Iterable[float], *, statement_properties: Dict[str, str] | None = None) → List[float] | List[List[float]]

For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles. This function uses the t-Digest algorithm.

approxQuantile() is an alias of approx_quantile().

  • col – The name of the numeric column, or a list of numeric column names.

  • percentile – A list of float values greater than or equal to 0.0 and less than 1.0.


Returns a list of approximate percentile values if col is a single column name, or a matrix of dimensions len(col) × len(percentile) containing the approximate percentile values if col is a list of column names.


>>> df = session.create_dataframe([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], schema=["a"])
>>> df.stat.approx_quantile("a", [0, 0.1, 0.4, 0.6, 1])
[-0.5, 0.5, 3.5, 5.5, 9.5]

>>> df2 = session.create_dataframe([[0.1, 0.5], [0.2, 0.6], [0.3, 0.7]], schema=["a", "b"])
>>> df2.stat.approx_quantile(["a", "b"], [0, 0.1, 0.6])
[[0.05, 0.15000000000000002, 0.25], [0.45, 0.55, 0.6499999999999999]]
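
The two return shapes described above can be illustrated locally with a small pure-Python sketch. The helper names nearest_rank_quantile and local_quantiles are hypothetical and not part of pystarburst. The sketch computes exact nearest-rank quantiles, whereas approx_quantile() returns approximations computed by the engine, so the values themselves may differ. Only the shape of the result is meant to match.

```python
import math
from typing import Dict, List, Sequence, Union


def nearest_rank_quantile(values: Sequence[float], q: float) -> float:
    """Exact nearest-rank quantile of a non-empty sequence (illustration only)."""
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least q * n values at or below it.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def local_quantiles(
    data: Dict[str, Sequence[float]],
    col: Union[str, List[str]],
    percentile: Sequence[float],
) -> Union[List[float], List[List[float]]]:
    """Hypothetical helper that mimics the return shape only:
    a flat list for one column name, one row per column for a list of names."""
    if isinstance(col, str):
        return [nearest_rank_quantile(data[col], q) for q in percentile]
    return [[nearest_rank_quantile(data[c], q) for q in percentile] for c in col]


data = {"a": [0.1, 0.2, 0.3], "b": [0.5, 0.6, 0.7]}
single = local_quantiles(data, "a", [0.0, 0.5, 1.0])         # flat list, 3 values
matrix = local_quantiles(data, ["a", "b"], [0.0, 0.5, 1.0])  # 2 rows x 3 columns
```
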
corr(col1: ColumnOrName, col2: ColumnOrName, *, statement_properties: Dict[str, str] | None = None) → float | None

Calculates the correlation coefficient for non-null pairs in two numeric columns.

  • col1 – The name of the first numeric column to use.

  • col2 – The name of the second numeric column to use.


Returns the correlation coefficient of the two numeric columns, or None if there is not enough data to compute it.


>>> df = session.create_dataframe([[0.1, 0.5], [0.2, 0.6], [0.3, 0.7]], schema=["a", "b"])
>>> df.stat.corr("a", "b")
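
As a rough local reference for the semantics described above (non-null pairs only, None when there is too little data), here is a minimal pure-Python sketch of the Pearson correlation coefficient. The function pearson_corr is a hypothetical helper, not a pystarburst API. Returning None for a constant column is an assumption of this sketch, and the engine may behave differently in that case.

```python
import math
from typing import Optional, Sequence


def pearson_corr(
    xs: Sequence[Optional[float]], ys: Sequence[Optional[float]]
) -> Optional[float]:
    """Pearson correlation over the pairs where both values are non-null."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None  # not enough data to define a correlation
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None  # assumption of this sketch: constant column gives None
    return sxy / math.sqrt(sxx * syy)


# The pair containing None is skipped; the remaining points lie on a line.
r = pearson_corr([0.1, 0.2, None, 0.3], [0.5, 0.6, 0.9, 0.7])
```
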
cov(col1: ColumnOrName, col2: ColumnOrName, *, statement_properties: Dict[str, str] | None = None) → float | None

Calculates the sample covariance for non-null pairs in two numeric columns.

  • col1 – The name of the first numeric column to use.

  • col2 – The name of the second numeric column to use.


Returns the sample covariance of the two numeric columns, or None if there is not enough data to compute it.


>>> df = session.create_dataframe([[0.1, 0.5], [0.2, 0.6], [0.3, 0.7]], schema=["a", "b"])
>>> df.stat.cov("a", "b")
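
For comparison, the sample covariance over non-null pairs can be sketched in pure Python as follows. The function sample_cov is a hypothetical helper, not part of pystarburst. It divides by n - 1, the usual definition of sample covariance, and returns None when fewer than two complete pairs are available.

```python
from typing import Optional, Sequence


def sample_cov(
    xs: Sequence[Optional[float]], ys: Sequence[Optional[float]]
) -> Optional[float]:
    """Sample covariance (n - 1 denominator) over pairs with no null value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None  # a single pair has no sample covariance
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)


# Same data as the DataFrame example above, held in plain lists.
c = sample_cov([0.1, 0.2, 0.3], [0.5, 0.6, 0.7])
```
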
sampleBy(col: ColumnOrName, fractions: Dict[LiteralType, float]) → DataFrame

Returns a DataFrame containing a stratified sample without replacement, based on a dict that specifies the fraction for each stratum.

sampleBy() is an alias of sample_by().

  • col – The name of the column that defines the strata.

  • fractions – A dict that specifies the fraction to use for the sample for each stratum. If a stratum is not specified in the dict, the method uses 0 as the fraction.


>>> df = session.create_dataframe([("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 12)], schema=["name", "age"])
>>> fractions = {"Bob": 0.5, "Nico": 1.0}
>>> sample_df = df.stat.sample_by("name", fractions)  # non-deterministic result
sample_by(col: ColumnOrName, fractions: Dict[LiteralType, float]) → DataFrame

Returns a DataFrame containing a stratified sample without replacement, based on a dict that specifies the fraction for each stratum.

sampleBy() is an alias of sample_by().

  • col – The name of the column that defines the strata.

  • fractions – A dict that specifies the fraction to use for the sample for each stratum. If a stratum is not specified in the dict, the method uses 0 as the fraction.


>>> df = session.create_dataframe([("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 12)], schema=["name", "age"])
>>> fractions = {"Bob": 0.5, "Nico": 1.0}
>>> sample_df = df.stat.sample_by("name", fractions)  # non-deterministic result
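
One way to picture stratified sampling without replacement is an independent keep-or-drop draw per row, using the fraction of that row's stratum and 0 for strata missing from the dict. The sketch below does this on plain tuples. The function stratified_sample and its seed parameter are hypothetical, and the per-row draw is an assumption made for illustration, since the text above does not say how pystarburst performs the sampling.

```python
import random
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def stratified_sample(
    rows: Sequence[Tuple],
    key_index: int,
    fractions: Dict[Hashable, float],
    seed: Optional[int] = None,
) -> List[Tuple]:
    """Keep each row with the probability given for its stratum.

    Each row is considered once, so no row can appear twice in the result
    (sampling without replacement). Strata missing from `fractions` use 0.
    """
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        fraction = fractions.get(row[key_index], 0.0)
        # rng.random() is in [0, 1): fraction 1.0 always keeps, 0.0 never keeps.
        if rng.random() < fraction:
            sampled.append(row)
    return sampled


rows = [("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 12)]
result = stratified_sample(rows, 0, {"Bob": 0.5, "Nico": 1.0}, seed=0)
```

With these fractions, the "Nico" row is always kept, "Alice" rows are never kept, and each "Bob" row is kept with probability 0.5.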