Dataframe NA functions
Dataframe NA methods
- class pystarburst.dataframe_na_functions.DataFrameNaFunctions(df: DataFrame)
Bases: object
Provides functions for handling missing values in a DataFrame.
- drop(how: str = 'any', thresh: int | None = None, subset: Iterable[str] | None = None) -> DataFrame
Returns a new DataFrame that excludes all rows containing fewer than a specified number of non-null and non-NaN values in the specified columns.
- Parameters:
how – A str with value either ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null. The default value is ‘any’. If thresh is provided, how is ignored.
thresh – The minimum number of non-null and non-NaN values that the specified columns must contain for the row to be included. It overrides how. In each case:
  - If thresh is not provided or None, the length of subset is used when how is ‘any’, and 1 is used when how is ‘all’.
  - If thresh is greater than the number of specified columns, the method returns an empty DataFrame.
  - If thresh is less than 1, the method returns the original DataFrame.
subset – A list of the names of columns to check for null and NaN values. In each case:
  - If subset is not provided or None, all columns are included.
  - If subset is empty, the method returns the original DataFrame.
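The interaction between how, thresh, and subset described above can be modeled in plain Python. This is a minimal sketch of the documented rules only, not the library's implementation: rows are ordinary dicts, and the helper names (is_missing, effective_thresh, drop_rows) are hypothetical.

```python
import math
from typing import Any, Dict, Iterable, List, Optional

Row = Dict[str, Any]


def is_missing(value: Any) -> bool:
    """Treat None (null) and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def effective_thresh(how: str, thresh: Optional[int], n_cols: int) -> int:
    """Minimum number of non-missing values a row needs in order to be kept."""
    if thresh is not None:
        return thresh  # an explicit thresh overrides how
    if how == "any":
        return n_cols  # every checked column must hold a value
    if how == "all":
        return 1  # one checked column holding a value is enough
    raise ValueError("how must be 'any' or 'all'")


def drop_rows(
    rows: List[Row],
    how: str = "any",
    thresh: Optional[int] = None,
    subset: Optional[Iterable[str]] = None,
) -> List[Row]:
    """Hypothetical pure-Python model of the documented drop() rules."""
    if not rows:
        return []
    cols = list(rows[0]) if subset is None else list(subset)
    if not cols:
        return list(rows)  # empty subset: nothing is checked
    needed = effective_thresh(how, thresh, len(cols))
    return [
        row for row in rows
        if sum(not is_missing(row[col]) for col in cols) >= needed
    ]


# Same data as the documented examples, as plain dicts.
ROWS = [
    {"a": 1.0, "b": 1},
    {"a": float("nan"), "b": 2},
    {"a": None, "b": 3},
    {"a": 4.0, "b": None},
    {"a": float("nan"), "b": None},
]
```

With this model, thresh greater than the number of checked columns keeps no rows and thresh below 1 keeps every row, which matches the two edge cases listed for thresh.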
Examples
>>> df = session.create_dataframe([[1.0, 1], [float('nan'), 2], [None, 3], [4.0, None], [float('nan'), None]]).to_df("a", "b")
>>> # drop a row if it contains any nulls, checking all columns
>>> df.na.drop().show()
-------------
|"A"  |"B"  |
-------------
|1.0  |1    |
-------------
>>> # drop a row only if all its values are null, checking all columns
>>> df.na.drop(how='all').show()
---------------
|"A"   |"B"   |
---------------
|1.0   |1     |
|nan   |2     |
|NULL  |3     |
|4.0   |NULL  |
---------------
>>> # keep a row only if it contains at least one non-null and non-NaN value, checking all columns
>>> df.na.drop(thresh=1).show()
---------------
|"A"   |"B"   |
---------------
|1.0   |1     |
|nan   |2     |
|NULL  |3     |
|4.0   |NULL  |
---------------
>>> # drop a row if it contains any nulls, checking only column "a"
>>> df.na.drop(subset=["a"]).show()
--------------
|"A"  |"B"   |
--------------
|1.0  |1     |
|4.0  |NULL  |
--------------
See also
DataFrame.dropna()
- fill(value: LiteralType | Dict[str, LiteralType], subset: Iterable[str] | None = None) -> DataFrame
Returns a new DataFrame that replaces all null and NaN values in the specified columns with the values provided.
- Parameters:
value – A scalar value or a dict that associates the names of columns with the values that should be used to replace null and NaN values in those columns. If value is a dict, subset is ignored. If value is an empty dict, the method returns the original DataFrame.
subset – A list of the names of columns to check for null and NaN values. In each case:
  - If subset is not provided or None, all columns are included.
  - If subset is empty, the method returns the original DataFrame.
Examples
>>> df = session.create_dataframe([[1.0, 1], [float('nan'), 2], [None, 3], [4.0, None], [float('nan'), None]]).to_df("a", "b")
>>> # fill null and NaN values in all columns (the float is skipped for integer column "b")
>>> df.na.fill(3.14).show()
---------------
|"A"   |"B"   |
---------------
|1.0   |1     |
|3.14  |2     |
|3.14  |3     |
|4.0   |NULL  |
|3.14  |NULL  |
---------------
>>> # fill null and NaN values in column "a"
>>> df.na.fill({"a": 3.14}).show()
---------------
|"A"   |"B"   |
---------------
|1.0   |1     |
|3.14  |2     |
|3.14  |3     |
|4.0   |NULL  |
|3.14  |NULL  |
---------------
>>> # fill null and NaN values in columns "a" and "b"
>>> df.na.fill({"a": 3.14, "b": 15}).show()
--------------
|"A"   |"B"  |
--------------
|1.0   |1    |
|3.14  |2    |
|3.14  |3    |
|4.0   |15   |
|3.14  |15   |
--------------
Note
If the type of a given value in value doesn’t match the column data type (e.g. a float for a StringType column), the replacement is skipped for that column. In particular, an int can be filled in a column of FloatType or DoubleType, but a float cannot be filled in a column of IntegerType or LongType.
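The type-matching rule in this note can be illustrated with a small pure-Python model. It is a sketch under stated assumptions rather than the library's code: Python's float, int, and str stand in for the column data types, and the names can_fill and fill_rows are hypothetical.

```python
import math
from typing import Any, Dict, Iterable, List, Optional

Row = Dict[str, Any]


def is_missing(value: Any) -> bool:
    """Treat None (null) and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def can_fill(column_type: type, value: Any) -> bool:
    """A value fills a column of its own type; an int may also fill a float column."""
    if type(value) is column_type:
        return True
    return column_type is float and type(value) is int


def fill_rows(
    rows: List[Row],
    schema: Dict[str, type],
    value: Any,
    subset: Optional[Iterable[str]] = None,
) -> List[Row]:
    """Hypothetical model of fill(): schema maps column name to a Python type."""
    if isinstance(value, dict):
        plan = dict(value)  # per-column values; subset is ignored
    else:
        cols = list(schema) if subset is None else list(subset)
        plan = {col: value for col in cols}
    filled = []
    for row in rows:
        new_row = dict(row)
        for col, fill_value in plan.items():
            column_type = schema[col]
            # Mismatched types are skipped silently, as the note describes.
            if is_missing(new_row[col]) and can_fill(column_type, fill_value):
                new_row[col] = column_type(fill_value)
        filled.append(new_row)
    return filled


# Same data as the documented examples: "a" is a float column, "b" an integer column.
SCHEMA = {"a": float, "b": int}
ROWS = [
    {"a": 1.0, "b": 1},
    {"a": float("nan"), "b": 2},
    {"a": None, "b": 3},
    {"a": 4.0, "b": None},
    {"a": float("nan"), "b": None},
]
```

In this model, fill_rows(ROWS, SCHEMA, 3.14) leaves the missing values in column "b" untouched because a float is not accepted for an integer column, which matches the first documented example.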
See also
DataFrame.fillna()
- replace(to_replace: LiteralType | Iterable[LiteralType] | Dict[LiteralType, LiteralType], value: Iterable[LiteralType] | None = None, subset: Iterable[str] | None = None) -> DataFrame
Returns a new DataFrame that replaces values in the specified columns.
- Parameters:
to_replace – A scalar value, a list of values, or a dict that associates the original values with the replacement values. If to_replace is a dict, value and subset are ignored. To replace a null value, use None in to_replace. To replace a NaN value, use float("nan") in to_replace. If to_replace is empty, the method returns the original DataFrame.
value – A scalar value, or a list of values for the replacement. If value is a list, value should be of the same length as to_replace. If value is a scalar and to_replace is a list, then value is used as a replacement for each item in to_replace.
subset – A list of the names of columns in which the values should be replaced. If subset is not provided or None, the replacement is applied to all columns. If subset is empty, the method returns the original DataFrame.
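The accepted combinations of to_replace and value can be seen as different spellings of one list of (original, replacement) pairs. The sketch below models only that normalization step and the matching of None and NaN; it leaves out the column type rules, and the function names are hypothetical rather than part of the library. Raising ValueError on a length mismatch is this sketch's choice, since the documentation only says the lengths should match.

```python
import math
from typing import Any, List, Tuple

Pair = Tuple[Any, Any]


def build_replacements(to_replace: Any, value: Any = None) -> List[Pair]:
    """Normalize the documented argument forms into (original, replacement) pairs."""
    if isinstance(to_replace, dict):
        return list(to_replace.items())  # value is ignored for a dict
    if isinstance(to_replace, (list, tuple)):
        originals = list(to_replace)
    else:
        originals = [to_replace]  # a single scalar
    if isinstance(value, (list, tuple)):
        if len(value) != len(originals):
            # Assumption of this sketch: reject mismatched lengths outright.
            raise ValueError("value must have the same length as to_replace")
        return list(zip(originals, value))
    # A scalar value is reused for every original.
    return [(original, value) for original in originals]


def matches(cell: Any, original: Any) -> bool:
    """Compare a cell with an original value, handling None and NaN explicitly."""
    if original is None:
        return cell is None
    if isinstance(original, float) and math.isnan(original):
        return isinstance(cell, float) and math.isnan(cell)
    return cell is not None and cell == original


def replace_values(cells: List[Any], pairs: List[Pair]) -> List[Any]:
    """Apply the first matching pair to each cell; other cells are unchanged."""
    result = []
    for cell in cells:
        for original, replacement in pairs:
            if matches(cell, original):
                cell = replacement
                break
        result.append(cell)
    return result
```

NaN needs its own branch in matches because float("nan") never compares equal to itself, so a plain equality check would not find it.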
Examples
>>> df = session.create_dataframe([[1, 1.0, "1.0"], [2, 2.0, "2.0"]], schema=["a", "b", "c"])
>>> # replace 1 with 3 in all columns
>>> df.na.replace(1, 3).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|3    |3.0  |1.0  |
|2    |2.0  |2.0  |
-------------------
>>> # replace 1 with 3 and 2 with 4 in all columns
>>> df.na.replace([1, 2], [3, 4]).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|3    |3.0  |1.0  |
|4    |4.0  |2.0  |
-------------------
>>> # replace 1 with 3 and 2 with 3 in all columns
>>> df.na.replace([1, 2], 3).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|3    |3.0  |1.0  |
|3    |3.0  |2.0  |
-------------------
>>> # the following line replaces 1 with 3 and 2 with 4 in all columns
>>> # and gives [Row(3, 3.0, "1.0"), Row(4, 4.0, "2.0")]
>>> df.na.replace({1: 3, 2: 4}).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|3    |3.0  |1.0  |
|4    |4.0  |2.0  |
-------------------
>>> # the following line intends to replace 1 with "3" in column "a",
>>> # but is ignored since "3" (str) doesn't match the original data type
>>> df.na.replace({1: "3"}, ["a"]).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |1.0  |1.0  |
|2    |2.0  |2.0  |
-------------------
Note
If the type of a given value in to_replace or value doesn’t match the column data type (e.g. a float for a StringType column), the replacement is skipped for that column. In particular, an int can replace or be replaced in a column of FloatType or DoubleType, but a float cannot replace or be replaced in a column of IntegerType or LongType. None can replace or be replaced in a column of any data type.
See also
DataFrame.replace()