Dataframe stat functions

Dataframe stat methods

class pystarburst.dataframe_stat_functions.DataFrameStatFunctions(df: DataFrame)

Bases: object

Provides computed statistical functions for DataFrames. To access an object of this class, use DataFrame.stat.

approxQuantile(col: ColumnOrName | Iterable[ColumnOrName], percentile: Iterable[float], *, statement_properties: Dict[str, str] | None = None) → List[float] | List[List[float]]

For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles. This function uses the t-Digest algorithm.

approxQuantile() is an alias of approx_quantile().

Parameters:

col – The name of a numeric column, or a list of numeric column names.
percentile – A list of float values between 0.0 and 1.0, inclusive (the example below passes both 0 and 1).

Returns:

A list of approximate percentile values if col is a single column name, or a matrix with dimensions len(col) × len(percentile) (one row per column) containing the approximate percentile values if col is a list of column names.

Examples

>>> df = session.create_dataframe([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], schema=["a"])
>>> df.stat.approx_quantile("a", [0, 0.1, 0.4, 0.6, 1])
[-0.5, 0.5, 3.5, 5.5, 9.5]

>>> df2 = session.create_dataframe([[0.1, 0.5], [0.2, 0.6], [0.3, 0.7]], schema=["a", "b"])
>>> df2.stat.approx_quantile(["a", "b"], [0, 0.1, 0.6])
[[0.05, 0.15000000000000002, 0.25], [0.45, 0.55, 0.6499999999999999]]

approx_quantile(col: ColumnOrName | Iterable[ColumnOrName], percentile: Iterable[float], *, statement_properties: Dict[str, str] | None = None) → List[float] | List[List[float]]

For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles. This function uses the t-Digest algorithm.

approxQuantile() is an alias of approx_quantile().

Parameters:

col – The name of a numeric column, or a list of numeric column names.
percentile – A list of float values between 0.0 and 1.0, inclusive (the example below passes both 0 and 1).

Returns:

A list of approximate percentile values if col is a single column name, or a matrix with dimensions len(col) × len(percentile) (one row per column) containing the approximate percentile values if col is a list of column names.

Examples

>>> df = session.create_dataframe([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], schema=["a"])
>>> df.stat.approx_quantile("a", [0, 0.1, 0.4, 0.6, 1])
[-0.5, 0.5, 3.5, 5.5, 9.5]

>>> df2 = session.create_dataframe([[0.1, 0.5], [0.2, 0.6], [0.3, 0.7]], schema=["a", "b"])
>>> df2.stat.approx_quantile(["a", "b"], [0, 0.1, 0.6])
[[0.05, 0.15000000000000002, 0.25], [0.45, 0.55, 0.6499999999999999]]
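
The two return shapes described above (a flat list for one column, one row per column for several) can be awkward to consume directly. The following is a minimal sketch in plain Python, not part of pystarburst, that pairs each returned value with its column name and percentile. The helper name label_quantiles is hypothetical, and the input values are copied from the documented example output rather than computed.

```python
from typing import Dict, List, Sequence, Union

QuantileResult = Union[List[float], List[List[float]]]


def label_quantiles(
    cols: Union[str, Sequence[str]],
    percentiles: Sequence[float],
    result: QuantileResult,
) -> Dict[str, Dict[float, float]]:
    """Map each column name to {percentile: approximate value}.

    Hypothetical helper for illustration; it only reshapes a result
    that approx_quantile() has already returned.
    """
    if isinstance(cols, str):
        # Single column: the result is a flat list, one value per percentile.
        names, rows = [cols], [result]
    else:
        # Several columns: the result holds one row per column.
        names, rows = list(cols), result
    return {name: dict(zip(percentiles, row)) for name, row in zip(names, rows)}


# Values copied from the documented example output for df2.
matrix = [[0.05, 0.15000000000000002, 0.25], [0.45, 0.55, 0.6499999999999999]]
labeled = label_quantiles(["a", "b"], [0, 0.1, 0.6], matrix)
print(labeled["b"][0.6])  # 0.6499999999999999
```

Keying by column name and percentile avoids relying on positional indexes when the list of columns or percentiles changes.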

corr(col1: ColumnOrName, col2: ColumnOrName, *, statement_properties: Dict[str, str] | None = None) → float | None

Calculates the correlation coefficient for non-null pairs in two numeric columns.

Parameters:

col1 – The name of the first numeric column to use.
col2 – The name of the second numeric column to use.

Returns:

The correlation of the two numeric columns. If there is not enough data to generate the correlation, the method returns None.

Examples

>>> df = session.create_dataframe([[0.1, 0.5], [0.2, 0.6], [0.3, 0.7]], schema=["a", "b"])
>>> df.stat.corr("a", "b")
0.9999999999999991
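
To make "correlation coefficient for non-null pairs" concrete, here is a plain-Python sketch of the Pearson correlation that skips any pair with a missing value. It is not how pystarburst computes the result (that work is done by the query engine); the function name pearson_corr is hypothetical, and treating fewer than two pairs or a constant column as "not enough data" is an assumption of this sketch.

```python
import math
from typing import Optional, Sequence


def pearson_corr(
    xs: Sequence[Optional[float]], ys: Sequence[Optional[float]]
) -> Optional[float]:
    """Pearson correlation over pairs where both values are non-null."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        # Assumption: fewer than two complete pairs is "not enough data".
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        # Assumption: the coefficient is undefined for a constant column.
        return None
    return sxy / math.sqrt(sxx * syy)


# Same data as the documented example; the result is approximately 1.0.
r = pearson_corr([0.1, 0.2, 0.3], [0.5, 0.6, 0.7])
```

Because of floating-point rounding, the value may differ from the documented output in the last few digits.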

cov(col1: ColumnOrName, col2: ColumnOrName, *, statement_properties: Dict[str, str] | None = None) → float | None

Calculates the sample covariance for non-null pairs in two numeric columns.

Parameters:

col1 – The name of the first numeric column to use.
col2 – The name of the second numeric column to use.

Returns:

The sample covariance of the two numeric columns. If there is not enough data to generate the covariance, the method returns None.

Examples

>>> df = session.create_dataframe([[0.1, 0.5], [0.2, 0.6], [0.3, 0.7]], schema=["a", "b"])
>>> df.stat.cov("a", "b")
0.010000000000000037
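
As a cross-check on the example above, the sample covariance can be worked out by hand: sum the products of the deviations from each column's mean and divide by n - 1. The sketch below does this in plain Python over non-null pairs only. It is an illustration, not pystarburst's implementation, and the name sample_cov is hypothetical.

```python
from typing import Optional, Sequence


def sample_cov(
    xs: Sequence[Optional[float]], ys: Sequence[Optional[float]]
) -> Optional[float]:
    """Sample covariance (n - 1 denominator) over non-null pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        # The n - 1 denominator is zero or negative: not enough data.
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)


# Same data as the documented example; the result is approximately 0.01.
c = sample_cov([0.1, 0.2, 0.3], [0.5, 0.6, 0.7])
```

For the example data the deviations are roughly -0.1, 0, and 0.1 in both columns, so the products sum to about 0.02 and dividing by n - 1 = 2 gives about 0.01.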

sampleBy(col: ColumnOrName, fractions: Dict[LiteralType, float]) → DataFrame

Returns a DataFrame containing a stratified sample without replacement, based on a dict that specifies the fraction for each stratum.

sampleBy() is an alias of sample_by().

Parameters:

col – The name of the column that defines the strata.
fractions – A dict that specifies the fraction to use for the sample for each stratum. If a stratum is not specified in the dict, the method uses 0 as the fraction.

Examples

>>> df = session.create_dataframe([("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 12)], schema=["name", "age"])
>>> fractions = {"Bob": 0.5, "Nico": 1.0}
>>> sample_df = df.stat.sample_by("name", fractions)  # non-deterministic result

sample_by(col: ColumnOrName, fractions: Dict[LiteralType, float]) → DataFrame

Returns a DataFrame containing a stratified sample without replacement, based on a dict that specifies the fraction for each stratum.

sampleBy() is an alias of sample_by().

Parameters:

col – The name of the column that defines the strata.
fractions – A dict that specifies the fraction to use for the sample for each stratum. If a stratum is not specified in the dict, the method uses 0 as the fraction.

Examples

>>> df = session.create_dataframe([("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 12)], schema=["name", "age"])
>>> fractions = {"Bob": 0.5, "Nico": 1.0}
>>> sample_df = df.stat.sample_by("name", fractions)  # non-deterministic result
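
The semantics of the fractions dict can be sketched in plain Python. This is a rough illustration under stated assumptions, not pystarburst's implementation: it assumes each row is kept independently with the probability given for its stratum (Bernoulli sampling), which may differ from how the engine selects rows. The function name sample_by_sketch and its seed parameter are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, int]


def sample_by_sketch(
    rows: List[Row],
    key_index: int,
    fractions: Dict[str, float],
    seed: Optional[int] = None,
) -> List[Row]:
    """Keep each row with the probability assigned to its stratum."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        # A stratum missing from the dict gets fraction 0, as documented.
        fraction = fractions.get(row[key_index], 0.0)
        # random() is in [0, 1), so 1.0 always keeps and 0.0 never keeps.
        if rng.random() < fraction:
            sampled.append(row)  # each row is considered once: no replacement
    return sampled


rows = [("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 12)]
sample = sample_by_sketch(rows, 0, {"Bob": 0.5, "Nico": 1.0}, seed=0)
```

Under this assumption the "Nico" row is always present, "Alice" rows never are, and each "Bob" row appears about half the time, so the result size varies between runs unless a seed is fixed.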