
[SPARK-43239][PS] Remove null_counts from info() #40913

Closed

Conversation

bjornjorgensen (Contributor)

What changes were proposed in this pull request?

Remove null_counts from info()

Why are the changes needed?

Pandas 2.0 changelog: "Removed deprecated null_counts argument in DataFrame.info(). Use show_counts instead (GH37999)."
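For context, a minimal sketch of the pandas >= 2.0 call on a toy DataFrame (illustrative data, not from the PR); `show_counts` is the replacement for the removed `null_counts` keyword:

```python
import io

import pandas as pd

# Toy frame with some nulls (hypothetical example data)
df = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", None]})

buf = io.StringIO()
# pandas 2.0 removed null_counts; show_counts is the supported keyword
df.info(buf=buf, show_counts=True)
print(buf.getvalue())
```
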

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested locally.

Before this PR

F05.info()

TypeError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 F05.info()

File /opt/spark/python/pyspark/pandas/frame.py:12167, in DataFrame.info(self, verbose, buf, max_cols, null_counts)
  12163     count_func = self.count
  12164     self.count = (  # type: ignore[assignment]
  12165         lambda: count_func()._to_pandas()  # type: ignore[assignment, misc, union-attr]
  12166     )
> 12167     return pd.DataFrame.info(
  12168         self,  # type: ignore[arg-type]
  12169         verbose=verbose,
  12170         buf=buf,
  12171         max_cols=max_cols,
  12172         memory_usage=False,
  12173         null_counts=null_counts,
  12174     )
  12175 finally:
  12176     del self._data

TypeError: DataFrame.info() got an unexpected keyword argument 'null_counts'

With this PR

F05.info()

<class 'pyspark.pandas.frame.DataFrame'>
Int64Index: 5257 entries, 0 to 5256
Data columns (total 203 columns):
 #    Column                                                               Non-Null Count  Dtype  
---   ------                                                               --------------  -----  
 0    DOFFIN_APPENDIX:EXPRESSION_OF_INTEREST_URL                           471 non-null    object
(...)

@bjornjorgensen (Contributor Author) commented Apr 23, 2023

To me it seems like we can just add show_counts to this function. We already have the max-rows limit to calculate on.

Or we can implement something like this:

from collections import Counter
from pyspark.sql.functions import col, count, when

def spark_info(df):
    # Print basic DataFrame information
    print(f"<class '{df.__class__.__module__}.{df.__class__.__name__}'>")
    print(f"Number of rows: {df.count()}")
    print(f"Number of columns: {len(df.columns)}")

    # Print column header for the detailed DataFrame information
    print("\nColumn" + " " * 110 + "Non-Null Count" + " " + "Dtype")
    print("-" * 6, " " * 108, "-" * 14, "-" * 5)

    # Calculate non-null counts for each column
    non_null_counts = df.agg(*[count(when(col(f"`{c}`").isNotNull(), f"`{c}`")).alias(c) for c in df.columns]).collect()[0]

    # Initialize a counter to store data type counts
    dtype_counter = Counter()

    # Iterate through the schema fields and print detailed column information
    for i, field in enumerate(df.schema.fields):
        non_null_count = non_null_counts[field.name]
        dtype = field.dataType.simpleString()
        print(f"{field.name:<90} {non_null_count:>30} non-null {dtype}")

        # Update the data type counter
        dtype_counter[dtype] += 1

    # Print data type summary
    dtypes_summary = ", ".join([f"{dtype}({count})" for dtype, count in dtype_counter.items()])
    print(f"\ndtypes: {dtypes_summary}")
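The single-pass aggregation above, `count(when(col(...).isNotNull(), ...))`, counts the non-null values of every column in one job. The same idea in plain Python, for intuition only (hypothetical toy rows, not Spark code):

```python
# Toy rows standing in for a DataFrame (hypothetical data)
rows = [
    {"a": 1, "b": None},
    {"a": None, "b": "x"},
    {"a": 3, "b": "y"},
]
columns = ["a", "b"]

# One pass per column: count the values that are not null,
# mirroring count(when(col.isNotNull(), col)) in Spark SQL
non_null_counts = {
    c: sum(1 for r in rows if r[c] is not None) for c in columns
}
print(non_null_counts)
```
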

(screenshots of the example output omitted)
@bjornjorgensen (Contributor Author)

add Counter to imports

from collections import defaultdict, namedtuple, Counter

def info(
        self,
        verbose: Optional[bool] = None,
        buf: Optional[IO[str]] = None,
        max_cols: Optional[int] = None,
    ) -> None:
        # To avoid pandas' existing config affects pandas-on-Spark.
        # TODO: should we have corresponding pandas-on-Spark configs?
        #with pd.option_context(
        #    "display.max_info_columns", sys.maxsize, "display.max_info_rows", sys.maxsize
        #):
        if verbose is None or verbose:
            index_type: Type = type(self.index).__name__
            print(f"<class '{self.__class__.__module__}.{self.__class__.__name__}'>")
            print(f"{index_type}: {len(self)} entries, {self.index.min()} to {self.index.max()}")

            # Print column header for the detailed DataFrame information
            print(f"Data columns (total {len(self.columns)} columns):")
            print(f" #   Column{' ' * 106}Non-Null Count  Dtype")
            print(f"---  ------{' ' * 106}--------------  -----")

        # Calculate non-null counts for each column
        non_null_counts: Dict[str, int] = self.count().to_dict()

        # Initialize a counter to store data type counts
        dtype_counter: Counter = Counter()

        # Iterate through the schema fields and print detailed column information
        for idx, column in enumerate(self.columns):
            dtype: str = str(self[column].dtype)
            non_null_count: int = non_null_counts[column]
            if verbose is None or verbose:
                print(f"{idx:<3} {column:<90} {non_null_count:>30} non-null {dtype}")

            # Update the data type counter
            dtype_counter[dtype] += 1

        if verbose is None or verbose:
            # Print data type summary
            dtypes_summary: str = ", ".join([f"{dtype}({count})" for dtype, count in dtype_counter.items()])
            print(f"\ndtypes: {dtypes_summary}")
        elif not verbose:
            print(f"<class '{self.__class__.__module__}.{self.__class__.__name__}'>")
            print(f"Index: {len(self)} entries, {self.index.min()} to {self.index.max()}")
            print(f"Columns: {len(self.columns)} entries, {self.columns[0]} to {self.columns[-1]}")
            dtypes_summary: str = ", ".join([f"{dtype}({count})" for dtype, count in dtype_counter.items()])
            print(f"dtypes: {dtypes_summary}")
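The dtype summary line relies on `collections.Counter` preserving first-seen insertion order (guaranteed on Python 3.7+). A small standalone sketch with a toy dtype list (illustrative strings, not taken from a real schema):

```python
from collections import Counter

# Hypothetical per-column dtype strings, in column order
dtypes = ["object", "double", "object", "bigint"]

dtype_counter = Counter(dtypes)
# Items come back in first-seen order, so the summary is stable
dtypes_summary = ", ".join(f"{d}({c})" for d, c in dtype_counter.items())
print(f"dtypes: {dtypes_summary}")
```
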

@HyukjinKwon (Member)

Merged to master.

@bjornjorgensen bjornjorgensen deleted the remove-null_counts branch June 2, 2023 10:24