Dataframe NA functions
Dataframe NA methods
- class pystarburst.dataframe_na_functions.DataFrameNaFunctions(df: DataFrame)
Bases: object
Provides functions for handling missing values in a DataFrame.
- drop(how: str = 'any', thresh: int | None = None, subset: Iterable[str] | None = None) -> DataFrame
Returns a new DataFrame that excludes all rows containing fewer than a specified number of non-null and non-NaN values in the specified columns.
- Parameters:
how – A str with value either 'any' or 'all'. If 'any', drop a row if it contains any null or NaN value. If 'all', drop a row only if all its values are null or NaN. The default value is 'any'. If thresh is provided, how is ignored.
thresh – The minimum number of non-null and non-NaN values that the specified columns must contain for the row to be kept. It overrides how. In each case:
  - If thresh is not provided or None, the length of subset is used when how is 'any' and 1 is used when how is 'all'.
  - If thresh is greater than the number of specified columns, the method returns an empty DataFrame.
  - If thresh is less than 1, the method returns the original DataFrame.
subset – A list of the names of columns to check for null and NaN values. In each case:
  - If subset is not provided or None, all columns are included.
  - If subset is empty, the method returns the original DataFrame.
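The interaction between how, thresh, and subset can be modelled in plain Python. The sketch below is not pystarburst's implementation: `is_missing`, `drop_rows`, and the list-of-tuples row representation are hypothetical names chosen for illustration, and column names are matched exactly as written.

```python
import math


def is_missing(value):
    """True for None (null) and for a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_rows(rows, columns, how="any", thresh=None, subset=None):
    """Hypothetical model of the documented drop() rules over row tuples."""
    checked = list(columns) if subset is None else list(subset)
    if not checked:
        # Empty subset: nothing to check, so every row is kept.
        return list(rows)
    if thresh is None:
        # how only matters when thresh is not given.
        thresh = len(checked) if how == "any" else 1
    if thresh < 1:
        return list(rows)
    if thresh > len(checked):
        return []
    positions = [columns.index(name) for name in checked]
    return [
        row for row in rows
        if sum(not is_missing(row[i]) for i in positions) >= thresh
    ]


columns = ["a", "b"]
rows = [(1.0, 1), (float("nan"), 2), (None, 3), (4.0, None), (float("nan"), None)]

print(drop_rows(rows, columns))                  # [(1.0, 1)]
print(drop_rows(rows, columns, subset=["a"]))    # [(1.0, 1), (4.0, None)]
print(len(drop_rows(rows, columns, how="all")))  # 4
```

In this model, how='all' and thresh=1 are the same filter, which matches the identical outputs shown for those two calls in the examples.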
Examples
>>> df = session.create_dataframe([[1.0, 1], [float('nan'), 2], [None, 3], [4.0, None], [float('nan'), None]]).to_df("a", "b")
>>> # drop a row if it contains any null or NaN value, checking all columns
>>> df.na.drop().show()
-------------
|"A"  |"B"  |
-------------
|1.0  |1    |
-------------

>>> # drop a row only if all its values are null or NaN, checking all columns
>>> df.na.drop(how='all').show()
---------------
|"A"   |"B"   |
---------------
|1.0   |1     |
|nan   |2     |
|NULL  |3     |
|4.0   |NULL  |
---------------

>>> # keep a row only if it contains at least one non-null and non-NaN value, checking all columns
>>> df.na.drop(thresh=1).show()
---------------
|"A"   |"B"   |
---------------
|1.0   |1     |
|nan   |2     |
|NULL  |3     |
|4.0   |NULL  |
---------------

>>> # drop a row if it contains a null or NaN value in column "a"
>>> df.na.drop(subset=["a"]).show()
--------------
|"A"  |"B"   |
--------------
|1.0  |1     |
|4.0  |NULL  |
--------------
See also
DataFrame.dropna()
- fill(value: LiteralType | Dict[str, LiteralType], subset: Iterable[str] | None = None) -> DataFrame
Returns a new DataFrame that replaces all null and NaN values in the specified columns with the values provided.
- Parameters:
value – A scalar value or a dict that associates the names of columns with the values that should be used to replace null and NaN values in those columns. If value is a dict, subset is ignored. If value is an empty dict, the method returns the original DataFrame.
subset – A list of the names of columns to check for null and NaN values. In each case:
  - If subset is not provided or None, all columns are included.
  - If subset is empty, the method returns the original DataFrame.
Examples
>>> df = session.create_dataframe([[1.0, 1], [float('nan'), 2], [None, 3], [4.0, None], [float('nan'), None]]).to_df("a", "b")
>>> # fill null and NaN values in all columns;
>>> # column "b" is left unchanged because a float cannot fill an integer column
>>> df.na.fill(3.14).show()
---------------
|"A"   |"B"   |
---------------
|1.0   |1     |
|3.14  |2     |
|3.14  |3     |
|4.0   |NULL  |
|3.14  |NULL  |
---------------

>>> # fill null and NaN values in column "a"
>>> df.na.fill({"a": 3.14}).show()
---------------
|"A"   |"B"   |
---------------
|1.0   |1     |
|3.14  |2     |
|3.14  |3     |
|4.0   |NULL  |
|3.14  |NULL  |
---------------

>>> # fill null and NaN values in columns "a" and "b"
>>> df.na.fill({"a": 3.14, "b": 15}).show()
--------------
|"A"   |"B"  |
--------------
|1.0   |1    |
|3.14  |2    |
|3.14  |3    |
|4.0   |15   |
|3.14  |15   |
--------------
Note
If the type of a given value in value doesn't match the column data type (e.g. a float for a StringType column), the replacement is skipped for that column. In particular, an int can fill a column of FloatType or DoubleType, but a float cannot fill a column of IntegerType or LongType.
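The type rule in the note, together with the scalar-versus-dict handling of value, can be sketched in plain Python. This is a model under stated assumptions rather than the library's code: `compatible`, `fill_rows`, and the `schema` mapping from column name to a Python type (standing in for the column data types) are hypothetical names.

```python
import math


def is_missing(value):
    """True for None (null) and for a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def compatible(value, column_type):
    """An int may fill a float column, but a float may not fill an int column."""
    if column_type is float:
        return type(value) in (int, float)
    return type(value) is column_type


def fill_rows(rows, schema, value, subset=None):
    """Hypothetical model of fill(); schema maps column name -> Python type, in column order."""
    names = list(schema)
    if isinstance(value, dict):
        plan = dict(value)  # subset is ignored when value is a dict
    else:
        targets = names if subset is None else list(subset)
        plan = {name: value for name in targets}
    # Skip any replacement whose type does not fit the column.
    plan = {n: v for n, v in plan.items() if n in schema and compatible(v, schema[n])}
    return [
        tuple(plan[n] if n in plan and is_missing(cell) else cell
              for n, cell in zip(names, row))
        for row in rows
    ]


schema = {"a": float, "b": int}
rows = [(1.0, 1), (float("nan"), 2), (None, 3), (4.0, None), (float("nan"), None)]

print(fill_rows(rows, schema, 3.14))
# [(1.0, 1), (3.14, 2), (3.14, 3), (4.0, None), (3.14, None)]
```

An empty dict or an empty subset produces an empty plan here, so the rows come back unchanged, mirroring the documented "returns the original DataFrame" cases.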
See also
DataFrame.fillna()
- replace(to_replace: LiteralType | Iterable[LiteralType] | Dict[LiteralType, LiteralType], value: Iterable[LiteralType] | None = None, subset: Iterable[str] | None = None) -> DataFrame
Returns a new DataFrame that replaces values in the specified columns.
- Parameters:
to_replace – A scalar value, a list of values, or a dict that associates the original values with the replacement values. If to_replace is a dict, value and subset are ignored. To replace a null value, use None in to_replace. To replace a NaN value, use float("nan") in to_replace. If to_replace is empty, the method returns the original DataFrame.
value – A scalar value, or a list of values for the replacement. If value is a list, value should be of the same length as to_replace. If value is a scalar and to_replace is a list, then value is used as a replacement for each item in to_replace.
subset – A list of the names of columns in which the values should be replaced. If subset is not provided or None, the replacement is applied to all columns. If subset is empty, the method returns the original DataFrame.
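The accepted shapes of to_replace and value all reduce to a list of (original, replacement) pairs. The helper below is a hypothetical illustration of that reduction, not part of pystarburst; in particular, raising ValueError on a length mismatch is an assumption, since the documentation only says the lengths should match.

```python
def normalize_replacements(to_replace, value=None):
    """Reduce the documented argument shapes to a list of (old, new) pairs."""
    if isinstance(to_replace, dict):
        # A dict already carries the pairs; value is ignored.
        return list(to_replace.items())
    if isinstance(to_replace, (list, tuple)):
        olds = list(to_replace)
    else:
        olds = [to_replace]  # a single scalar
    if isinstance(value, (list, tuple)):
        if len(value) != len(olds):
            # Assumed behaviour for this sketch only.
            raise ValueError("value and to_replace must have the same length")
        return list(zip(olds, value))
    # A scalar value is reused for every item in to_replace.
    return [(old, value) for old in olds]


print(normalize_replacements(1, 3))            # [(1, 3)]
print(normalize_replacements([1, 2], [3, 4]))  # [(1, 3), (2, 4)]
print(normalize_replacements([1, 2], 3))       # [(1, 3), (2, 3)]
print(normalize_replacements({1: 3, 2: 4}))    # [(1, 3), (2, 4)]
```

Pairs are used instead of a dict because a NaN key cannot be looked up by equality in Python (NaN is not equal to itself), so a real matcher would need an explicit NaN check.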
Examples
>>> df = session.create_dataframe([[1, 1.0, "1.0"], [2, 2.0, "2.0"]], schema=["a", "b", "c"])
>>> # replace 1 with 3 in all columns
>>> df.na.replace(1, 3).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|3    |3.0  |1.0  |
|2    |2.0  |2.0  |
-------------------

>>> # replace 1 with 3 and 2 with 4 in all columns
>>> df.na.replace([1, 2], [3, 4]).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|3    |3.0  |1.0  |
|4    |4.0  |2.0  |
-------------------

>>> # replace 1 with 3 and 2 with 3 in all columns
>>> df.na.replace([1, 2], 3).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|3    |3.0  |1.0  |
|3    |3.0  |2.0  |
-------------------

>>> # the following line replaces 1 with 3 and 2 with 4 in all columns
>>> # and gives [Row(3, 3.0, "1.0"), Row(4, 4.0, "2.0")]
>>> df.na.replace({1: 3, 2: 4}).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|3    |3.0  |1.0  |
|4    |4.0  |2.0  |
-------------------

>>> # the following line intends to replace 1 with "3" in column "a",
>>> # but the replacement is skipped because "3" (str) doesn't match the column data type
>>> df.na.replace({1: "3"}, ["a"]).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |1.0  |1.0  |
|2    |2.0  |2.0  |
-------------------
Note
If the type of a given value in to_replace or value doesn't match the column data type (e.g. a float for a StringType column), the replacement is skipped for that column. In particular, an int can replace or be replaced in a column of FloatType or DoubleType, but a float cannot replace or be replaced in a column of IntegerType or LongType. None can replace or be replaced in a column of any data type.
See also
DataFrame.replace()