Dataframe NA functions

Dataframe NA methods

class pystarburst.dataframe_na_functions.DataFrameNaFunctions(df: DataFrame)

Bases: object

Provides functions for handling missing values in a DataFrame.

drop(how: str = 'any', thresh: int | None = None, subset: Iterable[str] | None = None) → DataFrame

Returns a new DataFrame that excludes all rows containing fewer than a specified number of non-null and non-NaN values in the specified columns.

Parameters:
  • how – A str whose value is either ‘any’ or ‘all’. If ‘any’, a row is dropped if any of the checked columns contains a null or NaN value. If ‘all’, a row is dropped only if all of the checked columns contain null or NaN values. The default value is ‘any’. If thresh is provided, how is ignored.

  • thresh

    The minimum number of non-null and non-NaN values that a row must have in the specified columns in order to be included. It overrides how. In each case:

    • If thresh is not provided or None, the length of subset will be used when how is ‘any’ and 1 will be used when how is ‘all’.

    • If thresh is greater than the number of the specified columns, the method returns an empty DataFrame.

    • If thresh is less than 1, the method returns the original DataFrame.

  • subset

    A list of the names of columns to check for null and NaN values. In each case:

    • If subset is not provided or None, all columns will be included.

    • If subset is empty, the method returns the original DataFrame.
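The rules above for how, thresh, and subset can be modeled in plain Python. The following is a minimal sketch that operates on a list of dicts rather than on a DataFrame; the helper names is_missing and drop_rows are hypothetical and are not part of pystarburst, and the sketch makes no claim about how the library implements drop internally.

```python
import math
from typing import Any, Dict, Iterable, List, Optional, Sequence


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_rows(
    rows: Sequence[Dict[str, Any]],
    how: str = "any",
    thresh: Optional[int] = None,
    subset: Optional[Iterable[str]] = None,
) -> List[Dict[str, Any]]:
    """Pure-Python model of the documented rules (hypothetical helper)."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    if not rows:
        return []
    # subset=None means every column is checked.
    columns = list(rows[0].keys()) if subset is None else list(subset)
    if not columns:
        # An empty subset leaves the data unchanged.
        return list(rows)
    if thresh is None:
        # 'any' needs every checked column present; 'all' needs at least one.
        thresh = len(columns) if how == "any" else 1
    # Keep a row only if it has at least `thresh` non-missing values.
    return [
        row
        for row in rows
        if sum(not is_missing(row[c]) for c in columns) >= thresh
    ]
```

In this model, a thresh larger than the number of checked columns can never be met, so every row is dropped, and a thresh below 1 is always met, so every row is kept. Both outcomes agree with the cases listed above.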

Examples:

>>> df = session.create_dataframe([[1.0, 1], [float('nan'), 2], [None, 3], [4.0, None], [float('nan'), None]]).to_df("a", "b")
>>> # drop a row if any of its columns contains a null or NaN value
>>> df.na.drop().show()
-------------
|"A"  |"B"  |
-------------
|1.0  |1    |
-------------

>>> # drop a row only if all of its columns contain null or NaN values
>>> df.na.drop(how='all').show()
---------------
|"A"   |"B"   |
---------------
|1.0   |1     |
|nan   |2     |
|NULL  |3     |
|4.0   |NULL  |
---------------

>>> # keep only rows that contain at least one non-null and non-NaN value, checking all columns
>>> df.na.drop(thresh=1).show()
---------------
|"A"   |"B"   |
---------------
|1.0   |1     |
|nan   |2     |
|NULL  |3     |
|4.0   |NULL  |
---------------

>>> # drop a row if column "a" contains a null or NaN value
>>> df.na.drop(subset=["a"]).show()
--------------
|"A"  |"B"   |
--------------
|1.0  |1     |
|4.0  |NULL  |
--------------

See also

DataFrame.dropna()

fill(value: LiteralType | Dict[str, LiteralType], subset: Iterable[str] | None = None) → DataFrame

Returns a new DataFrame that replaces all null and NaN values in the specified columns with the values provided.

Parameters:
  • value – A scalar value or a dict that associates the names of columns with the values that should be used to replace null and NaN values in those columns. If value is a dict, subset is ignored. If value is an empty dict, the method returns the original DataFrame.

  • subset

    A list of the names of columns to check for null and NaN values. In each case:

    • If subset is not provided or None, all columns will be included.

    • If subset is empty, the method returns the original DataFrame.

Examples:

>>> df = session.create_dataframe([[1.0, 1], [float('nan'), 2], [None, 3], [4.0, None], [float('nan'), None]]).to_df("a", "b")
>>> # fill null and NaN values in all columns; column "b" holds integers,
>>> # so the float 3.14 is not applied there
>>> df.na.fill(3.14).show()
---------------
|"A"   |"B"   |
---------------
|1.0   |1     |
|3.14  |2     |
|3.14  |3     |
|4.0   |NULL  |
|3.14  |NULL  |
---------------

>>> # fill null and NaN values in column "a"
>>> df.na.fill({"a": 3.14}).show()
---------------
|"A"   |"B"   |
---------------
|1.0   |1     |
|3.14  |2     |
|3.14  |3     |
|4.0   |NULL  |
|3.14  |NULL  |
---------------

>>> # fill null and NaN values in columns "a" and "b"
>>> df.na.fill({"a": 3.14, "b": 15}).show()
--------------
|"A"   |"B"  |
--------------
|1.0   |1    |
|3.14  |2    |
|3.14  |3    |
|4.0   |15   |
|3.14  |15   |
--------------

Note

If the type of a given value in value doesn’t match the column data type (e.g. a float for a StringType column), the replacement is skipped for that column. In particular,

  • an int can be filled into a column with FloatType or DoubleType, but a float cannot be filled into a column with IntegerType or LongType.
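The type rule in this note can be sketched in plain Python. The sketch below models column types as the strings "int", "float", "str", and "bool" instead of pystarburst data types, and the helpers value_fits and fill_rows are hypothetical names used only for illustration. Treating bool as its own type and storing an int as a float in a float column are assumptions of the model, not documented behavior.

```python
import math
from typing import Any, Dict, Iterable, List, Optional, Sequence, Union

Scalar = Union[int, float, str, bool]


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def value_fits(value: Any, column_type: str) -> bool:
    """Return True if `value` may be filled into a column of `column_type`."""
    if isinstance(value, bool):
        # Assumption: bool is handled separately (it subclasses int in Python).
        return column_type == "bool"
    if isinstance(value, int):
        return column_type in ("int", "float")  # int widens to float
    if isinstance(value, float):
        return column_type == "float"  # float never narrows to int
    if isinstance(value, str):
        return column_type == "str"
    return False


def fill_rows(
    rows: Sequence[Dict[str, Any]],
    schema: Dict[str, str],
    value: Union[Scalar, Dict[str, Scalar]],
    subset: Optional[Iterable[str]] = None,
) -> List[Dict[str, Any]]:
    """Pure-Python model of the documented fill rules (hypothetical helper)."""
    if isinstance(value, dict):
        replacements = dict(value)  # subset is ignored for a dict
    else:
        columns = list(schema) if subset is None else list(subset)
        replacements = {c: value for c in columns}
    result = []
    for row in rows:
        new_row = dict(row)
        for col, repl in replacements.items():
            if is_missing(new_row[col]) and value_fits(repl, schema[col]):
                # Assumption: an int placed in a float column is stored as a float.
                new_row[col] = float(repl) if schema[col] == "float" else repl
        result.append(new_row)
    return result
```

With a schema of {"a": "float", "b": "int"}, filling with 3.14 changes only column "a" in this model, while filling with an int changes both columns.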

See also

DataFrame.fillna()

replace(to_replace: LiteralType | Iterable[LiteralType] | Dict[LiteralType, LiteralType], value: Iterable[LiteralType] | None = None, subset: Iterable[str] | None = None) → DataFrame

Returns a new DataFrame that replaces values in the specified columns.

Parameters:
  • to_replace – A scalar value, a list of values, or a dict that maps original values to their replacement values. If to_replace is a dict, value and subset are ignored. To replace a null value, use None in to_replace. To replace a NaN value, use float("nan") in to_replace. If to_replace is empty, the method returns the original DataFrame.

  • value – A scalar value, or a list of values for the replacement. If value is a list, value should be of the same length as to_replace. If value is a scalar and to_replace is a list, then value is used as a replacement for each item in to_replace.

  • subset – A list of the names of columns in which the values should be replaced. If subset is not provided or None, the replacement is applied to all columns. If subset is empty, the method returns the original DataFrame.
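The accepted combinations of to_replace and value can be reduced to a single list of (original, replacement) pairs. The following is a minimal sketch of that normalization in plain Python; normalize_replacements is a hypothetical helper, not a pystarburst function, and raising ValueError on a length mismatch is an assumption, since the text above only says the lengths should match. Pairs are returned as a list rather than a dict because float("nan") does not compare equal to itself and so behaves poorly as a lookup key.

```python
from typing import Any, List, Tuple


def normalize_replacements(to_replace: Any, value: Any = None) -> List[Tuple[Any, Any]]:
    """Turn the documented argument shapes into (original, replacement) pairs."""
    if isinstance(to_replace, dict):
        # A dict already pairs originals with replacements; value is ignored.
        return list(to_replace.items())
    if isinstance(to_replace, (list, tuple)):
        originals = list(to_replace)
    else:
        originals = [to_replace]  # a single scalar
    if isinstance(value, (list, tuple)):
        if len(value) != len(originals):
            # Assumption: a mismatch is reported as an error.
            raise ValueError("value must have the same length as to_replace")
        return list(zip(originals, value))
    # A scalar value is reused for every original.
    return [(original, value) for original in originals]
```

For example, normalize_replacements([1, 2], 3) pairs both 1 and 2 with 3, and an empty to_replace yields no pairs at all, which corresponds to returning the original DataFrame.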

Examples:

>>> df = session.create_dataframe([[1, 1.0, "1.0"], [2, 2.0, "2.0"]], schema=["a", "b", "c"])
>>> # replace 1 with 3 in all columns
>>> df.na.replace(1, 3).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|3    |3.0  |1.0  |
|2    |2.0  |2.0  |
-------------------

>>> # replace 1 with 3 and 2 with 4 in all columns
>>> df.na.replace([1, 2], [3, 4]).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|3    |3.0  |1.0  |
|4    |4.0  |2.0  |
-------------------

>>> # replace 1 with 3 and 2 with 3 in all columns
>>> df.na.replace([1, 2], 3).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|3    |3.0  |1.0  |
|3    |3.0  |2.0  |
-------------------

>>> # the following line replaces 1 with 3 and 2 with 4 in all columns
>>> # and gives [Row(3, 3.0, "1.0"), Row(4, 4.0, "2.0")]
>>> df.na.replace({1: 3, 2: 4}).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|3    |3.0  |1.0  |
|4    |4.0  |2.0  |
-------------------

>>> # the following line intends to replace 1 with "3" in column "a",
>>> # but will be ignored since "3" (str) doesn't match the original data type
>>> df.na.replace({1: "3"}, ["a"]).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |1.0  |1.0  |
|2    |2.0  |2.0  |
-------------------

Note

If the type of a given value in to_replace or value doesn’t match the column data type (e.g. a float for a StringType column), the replacement is skipped for that column. In particular,

  • an int can replace or be replaced in a column with FloatType or DoubleType, but a float cannot replace or be replaced in a column with IntegerType or LongType.

  • None can replace or be replaced in a column with any data type.
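The per-column check described in this note can be sketched as a small predicate. The sketch models column types as the strings "int", "float", "str", and "bool" rather than pystarburst data types, and fits_column and replacement_applies are hypothetical names. Handling bool separately is an assumption of the model.

```python
from typing import Any


def fits_column(value: Any, column_type: str) -> bool:
    """Return True if `value` is compatible with a column of `column_type`."""
    if value is None:
        return True  # None is compatible with any column type
    if isinstance(value, bool):
        # Assumption: bool is handled separately (it subclasses int in Python).
        return column_type == "bool"
    if isinstance(value, int):
        return column_type in ("int", "float")
    if isinstance(value, float):
        return column_type == "float"
    if isinstance(value, str):
        return column_type == "str"
    return False


def replacement_applies(original: Any, replacement: Any, column_type: str) -> bool:
    """A replacement is applied only if both sides fit the column type."""
    return fits_column(original, column_type) and fits_column(replacement, column_type)
```

Under this model, replacing 1 with 3 applies to "int" and "float" columns but not to a "str" column, and replacing 1 with "3" is skipped for an "int" column.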

See also

DataFrame.replace()