Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor format_array in pandas\io\formats\format.py #26837

Open
simonjayhawkins opened this issue Jun 13, 2019 · 22 comments
Open

Refactor format_array in pandas\io\formats\format.py #26837

simonjayhawkins opened this issue Jun 13, 2019 · 22 comments
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Output-Formatting __repr__ of pandas objects, to_string Refactor Internal refactoring of code

Comments

@simonjayhawkins
Copy link
Member

see #26833 (comment)

to make format_array more generic we need to push more functionality to the objects to be formatted themselves (and PandasArray) rather than encoding the conversions and special casing in format_array.

@simonjayhawkins simonjayhawkins added Output-Formatting __repr__ of pandas objects, to_string Refactor Internal refactoring of code labels Jun 13, 2019
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Jun 13, 2019
@simonjayhawkins
Copy link
Member Author

as per #26833 (comment)

IIUC correctly, before EAs, format_array formatted an numpy array. and GenericArrayFormatter expects an numpy array, used mainly for numpy arrays with dtype=object

but we now have a ExtensionArrayFormatter that includes logic to convert to a numpy array

        values = self.values
        if isinstance(values, (ABCIndexClass, ABCSeries)):
            values = values._values

        formatter = values._formatter(boxed=True)

        if is_categorical_dtype(values.dtype):
            # Categorical is special for now, so that we can preserve tzinfo
            array = values.get_values()
        else:
            array = np.asarray(values)

        fmt_values = format_array(array,

whereas before, _formatting_values was used and conversion to a numpy array looked like

    def _format_col(self, i):
        frame = self.tr_frame
        formatter = self._get_formatter(i)
        values_to_format = frame.iloc[:, i]._formatting_values()
        return format_array(values_to_format, formatter,
                            float_format=self.float_format, na_rep=self.na_rep,
                            space=self.col_space, decimal=self.decimal)

even though _formatting_values is depecated:

    def _formatting_values(self) -> np.ndarray:
        # At the moment, this has to be an array since we use result.dtype
        """
        An array of values to be printed in, e.g. the Series repr

        .. deprecated:: 0.24.0

           Use :meth:`ExtensionArray._formatter` instead.
        """

it is still used. but does not necessarily return an numpy array

>>> import pandas as pd
>>> df = pd.DataFrame({"c": [2,3,float('nan')]})
>>> df = df.astype("Int64")
>>> df.c._formatting_values()
<IntegerArray>
[2, 3, NaN]
Length: 3, dtype: Int64

so when creating a repr_html of a DataFrame which uses _format_col, we pass a <IntegerArray> to format_array, which then uses ExtensionArrayFormatter to get the values as an numpy array and then call
format_array again which then formats the numpy array using GenericArrayFormatter.

It seems a bit convoluted.

i'm not sure why a numpy array is not returned from _formatting_values.

I think the issues are related since the discussions revolve around numpy arrays from extension arrays and formatting probably works best if the formatting is done on a numpy array with am object dtype. performance is probably not an issue for formatting.

@jorisvandenbossche
Copy link
Member

whereas before, _formatting_values was used and conversion to a numpy array looked like
even though _formatting_values is depecated:
it is still used. but does not necessarily return an numpy array

I think you are confusing two different _formatting_values: the one defined on Series, and the one defined on an EA.
It is the one on EA that is deprecated, in favor of the EA._formatter.

And it is the Series._formatting_values that can potentially return an EA, as that is what ExtensionBlock.formatting_values decides to return (this is where the EA._formatting_values deprecation is handled)

so when creating a repr_html of a DataFrame which uses _format_col, we pass a to format_array, which then uses ExtensionArrayFormatter to get the values as an numpy array and then call
format_array again which then formats the numpy array using GenericArrayFormatter.

It's a bit of indirection, but that is how the EAs are currently displayed: the values itself are converted to a numpy array (the __array__ that the EA defines) and the display is determined by the EA._formatter. And then it is using our format_array machinery to format those values with the custom formatter.

@jorisvandenbossche
Copy link
Member

I think the general discussion items are:

  • Who should be responsible of the actual formatting?

    • Currently, the EA has partial responsibility: it gives you an array to format and a custom formatter, but is not responsible for the full formatting pipeline, it is still format_array that combines both elements (eg it does not return an array of strings to be composed in a repr). We could also decide to give it full control over the pipeline.
      For the rest, the formatting is centralized in format.py.
    • Do we want our internal EAs to follow the EA interface for formatting?
      Currently, that is not yet possible I think. Eg the datetime formatting functionality takes the resolution of the full array into account. So it is not possible to format this scalar by scalar (such as the current EA interface does)
  • How do we handle all the formatting options, and how do we ensure the different objects (eg EAs) are following those options?

    • What are all the options we are talking about here?
    • Do we want to pass some of those to EA._formatter so that EA backed Series/DataFrame follows our general options?

@simonjayhawkins
Copy link
Member Author

I think you are confusing two different _formatting_values: the one defined on Series, and the one defined on an EA.

i guess. is #24858 related?

@simonjayhawkins
Copy link
Member Author

It's a bit of indirection, but that is how the EAs are currently displayed: the values itself are converted to a numpy array (the __array__ that the EA defines) and the display is determined by the EA._formatter. And then it is using our format_array machinery to format those values with the custom formatter.

but also pass through ExtensionArrayFormatter. Is this generic enough for EA developers?

@jorisvandenbossche
Copy link
Member

Is this generic enough for EA developers?

What do you mean exactly?

I think you are confusing two different _formatting_values: the one defined on Series, and the one defined on an EA.

i guess. is #24858 related?

Well, that was an issue prompted by the deprecated _formatting_values, but it is not really related to this discussion I think. The issue there is that the new way (to replace the deprecated one) requires np.array(EA) to work (which is probably generally not a problem).

@simonjayhawkins
Copy link
Member Author

  • Do we want our internal EAs to follow the EA interface for formatting?

that would presumably be less performant. on the other hand, shouldn't EA developers have a mechanism to be able to achieve the same performance?

@jorisvandenbossche
Copy link
Member

Yes, and additionally, EAs might also want to use the full array context to format their values.

A possible conclusion from this discussion might be that it was maybe the wrong direction we took when deprecating _formatting_values (returning a to be printed array) for _formatter (a formatter for scalar objects).
We might want to go back to an array-based formatting, so the end-goal can be that all formatting lives on the Array itself (of course the helper functions for that does not need to live there, only the responsible caller).

@TomAugspurger do you remember what the motivation was to move from the formatting_values to the scalar formatter?

@simonjayhawkins
Copy link
Member Author

Is this generic enough for EA developers?

What do you mean exactly?

ExtensionArrayFormatter has special casing for internal EAs. That's why this issue was opened. The title of this issue maybe misleading now since the PR where the contents of ExtensionArrayFormatter was moved to format_array has been reverted.

A possible conclusion from this discussion might be that it was maybe the wrong direction we took when deprecating _formatting_values (returning a to be printed array) for _formatter (a formatter for scalar objects).

That maybe a separate issue. but could solve the special casing?

@jorisvandenbossche
Copy link
Member

That maybe a separate issue

Not sure, as I think any refactor of format_array should at least involve thinking about how to handle EAs / how this integrates with our own formatting. From your initial post: " we need to push more functionality to the objects to be formatted themselves "
What is the issue otherwise about ? ;)

@simonjayhawkins
Copy link
Member Author

What is the issue otherwise about ? ;)

the issue is raised directly from #26833 (comment), where the PR was a precursor to allow a custom formatter to override the defaults #26000

The EA formatters are 'incomplete' as custom formatters as they don't account for other formatting options. I have now updated #26000, to allow EA formatters and custom formatters specified in to_string, to_latex and to_html to share the same format_array machinery.

@TomAugspurger
Copy link
Contributor

@TomAugspurger do you remember what the motivation was to move from the formatting_values to the scalar formatter?

Not especially, though if I had to guess it was so that ExtensionBlock.formatting_values would return an ExtensionArray, which would then be handled by ExtensionArrayFormatter.

@ghost
Copy link

ghost commented Jun 19, 2019

As requested, moving this comment here from #26833, discussing EA formatters and particularly the deprecation of _formatting_values

thinking about whether _formatting_values should have been deprecated.

  1. _formatting_values was deprecated in 1573340c89b, but is still called by SeriesFormatter (which calls it on the ExtensionArray) to produce the repr for the series, while the new _formatter machinery is ignored.
    def _get_formatted_values(self):
    values_to_format = self.tr_series._formatting_values()
    return format_array(values_to_format, None,
    float_format=self.float_format, na_rep=self.na_rep)

    I needed control over the formatting of the EA inside the series repr, so I had to implement this deprecated method.
  2. It would be nice to be able to control EA repr seperately from EA values formatted into Series repr. _formatter sort of anticipated this with boxed, but that is value by value.
  3. My interest in EA is currently about using it with pint-pandas the units package. Right now, the formatting of an EA repr, its values when shown within its surround Series' repr, and the formatting of individual values in the dataframe display cannot be controlled individually. Currently It looks like this
      weight
0  10.0 gram
1   2.1 gram
2   3.8 gram

Its easy enough to eliminate the units from the display, but that information is kind of helpful, it's just repeated unnecessarily.

as a suggestion, I'd like to see some control over the column headers handed to the EA,
to we could generate something like this:

    weight 
    [gram]
0  10.0 
1   2.1 
2   3.8 

or the like, which is more readable and compact. With some config option or other, we could include some df.info() stuff in the repr.

@jorisvandenbossche
Copy link
Member

formatting_values was deprecated in 1573340, but is still called by SeriesFormatter (which calls it on the ExtensionArray) to produce the repr for the series, while the new _formatter machinery is ignored.

Again, as also explained to @simonjayhawkins above, this is mixing up to different _formatting_values methods. The snippet you show calls it on the Series, not the EA (I fully acknowledge this is confusing!)

I needed control over the formatting of the EA inside the series repr, so I had to implement this deprecated method.

So as far as I know, also the Series repr fully follows the EA._formatter if defined (that is what the boxed argument is for). So you should have control over it without needing to implement the deprecated method. If not, that's a bug (but I am quite sure it works, I did it last weekend in a demo). The only tricky part to know is that it works in two parts: the EA is converted to a numpy array, and it is on the scalars of this array that the EA._formatter function is called. So both need to match each other (which is certainly a limitation, and one I am only realizing from the discussion above).

Its easy enough to eliminate the units from the display, but that information is kind of helpful, it's just repeated unnecessarily.

as a suggestion, I'd like to see some control over the column headers handed to the EA,
to we could generate something like this:

That are certainly interesting ideas, but a whole bigger discussion I think, outside the scope of this issue.
Long term that might be a nice idea to discuss if we want to allow tweaking other parts of the repr such as a custom header, but let's first focus on actually formatting the values inside the table. There are still enough questions for that part to figure out first :)

@ghost
Copy link

ghost commented Jun 19, 2019

 So as far as I know, also the Series repr fully follows the EA._formatter 

Now I see what happened. When I was playing with this a few days ago, I was using a commit between
#26833 and #26845, where ExtensionArrayFormatter was removed. it was gone for less than a day but I hit. Without it, _formatter isn't called, and the only way to control the output was to implement _formatting_values, which is still called, if it's implemented. And when it is called, it is from the line I pointed out Series._formatting_values->mgr->block->EA._formatting_values.

@ghost
Copy link

ghost commented Jun 19, 2019

That are certainly interesting ideas

So In other words, please go away :)

I've got a small patch to the frame repr code (with a config option) which adds a row of dtypes to the column header. It's pretty small. I implemented using a small concept which might be relevent here.

In Qt there's a concept called a "Role". When you query some object for its value, you pass it a "role" argument (a constant from a list) indicating in what context the value is being asked. So you have DisplayRole or BackgroundColorRole. Based on that role, the object knows the context in which the value is being asked, while rendering itself.

You've already started something like that with boxed, but rather than a boolean (indicating only 2 possibles roles), you could use a role keyword, with some constants, or strings that follow some convention.

For example, In the patch I just mentioned, which includes the dtype in the frame repr's column header , I modified the repr code to call a (new) to_string(role=None) method on the ExtensionDtype.

if isinstance(col.dtype, ExtensionDtype):
    try:
        dtype_name = col.dtype.to_string(role="frame.repr.column_header")
    except:
        dtype_name=str(col.dtype)

With this approach, you can have very granular control over rendering.

@jorisvandenbossche
Copy link
Member

When I was playing with this a few days ago, I was using a commit between
#26833 and #26845, where ExtensionArrayFormatter was removed. it was gone for less than a day but I hit. Without it, _formatter isn't called

Ah, that was unfortunate :-) But that is also the reason it was reverted.

So In other words, please go away :)

I've got a small patch to the frame repr code (with a config option) which adds a row of dtypes to the column header. It's pretty small. I implemented using a small concept which might be relevent here.

OK, but that's something more limited than providing the ability to also customize that from the EA. Adding dtype information is a general thing, so if we want to look at this, we should start looking at the option to show the dtype in the dataframe repr (independent of EA or not). Personally, If find that something worth looking into. Eg also R tibbles show their dtypes. If you want to do that, let's open a new issue for that.

@ghost
Copy link

ghost commented Jun 19, 2019

OK, but that's something more limited than providing the ability to also customize that from the EA.

yes it is. But I mentioned it mostly to introduce the possibility of adding a "role" argument to _formatter,
instead of the less vesratile boxed.

@jorisvandenbossche
Copy link
Member

yes it is. But I mentioned it mostly to introduce the possibility of adding a "role" argument to _formatter,
instead of the less vesratile boxed.

But the _formatter we currently have is for formatting scalar values, here it would be about formatting the dtype?

@jorisvandenbossche
Copy link
Member

But in general, I think the role suggestion is certainly more robust than the current box True/False, in case we would rework this formatting machinery again.

@ghost
Copy link

ghost commented Jun 20, 2019

But the _formatter we currently have is for formatting scalar values, here it would be about formatting
the dtype?

I used the same concept to add "to_string(role)" method to dtype, for my own use, it's not a PR.
Whether roles are useful abstraction doesn't depend really on what you're representation, but on the number of different contexts in which you'd potentially want to generate a different representation.
You could have a value which can format itself differently in a series repr vs. a framr repr, and whether those are rendering into an html or latex or plain text display context, for example.

In most cases you'd return the same thing, but you would have the control to have each one be different.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 12, 2021

I ran into this today. Im trying to write an ExtensionArray for a somewhat complicated object (each item is an xarary.DataArray that's a satellite image). xarray loads data lazily, but the call to np.asarray at

array = np.asarray(values)
recursively goes through the Series -> ExtensionArray -> DataArray to trigger loading data. It's not great having just printing something trigger I/O :)

So for this use case, I think we want the ExtensionArray to handle as much as possible of what format_array handles. So something like a

class ExtensionArray:
    def _format_array(self, formatter, float_format, na_rep, ...) -> List[str]?

That would take some / all the kwargs from format_array. I think the default implementation would be roughly what's on ExtensionArrayFormatter._format_strings today (which does the cast and uses pandas' machinery).

Finally, we would update io.formats.format.format_array to call ExtensionArray._format_array(...) for EAs.

@jreback jreback modified the milestones: Contributions Welcome, 1.4 Nov 20, 2021
@jreback jreback modified the milestones: 1.4, Contributions Welcome Jan 8, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Output-Formatting __repr__ of pandas objects, to_string Refactor Internal refactoring of code
Projects
None yet
5 participants