-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add pandas __setitem__ support for DocumentTermDF #158
Add pandas __setitem__ support for DocumentTermDF #158
Conversation
suport MultiIndex as function parameter returns MultiIndex, where Representation was returned * missing: correct test Co-authored-by: Henri Froese <hf2000510@gmail.com>
*missing: test adopting for new types Co-authored-by: Henri Froese <hf2000510@gmail.com>
*missing: unitTest Co-authored-by: Henri Froese <henri.froese@yahoo.com>
@jbesomi we now have also adapted the set_item method to handle the changes in PR #156 when the set item method receives a data frame, which has more than one column it will convert the columns into a pandas multi-index with the original column names on the lower level and the >>> df1 = pd.DataFrame(["Text 1", "Text 2"], columns=["Test"])
>>> df2 = pd.DataFrame([[3, 5], [8, 4]], columns=["term 1", "term 2"],)
>>> df1["count"] = df2
>>> df1
Test count
term 1 term 2
0 Text 1 3 5
1 Text 2 8 4 when just a single column DataFrame is inserted, it will behave, as it used to: >>> df1 = pd.DataFrame(["Text 1", "Text 2"], columns=["Test"])
>>> df2 = pd.DataFrame([3, 5], columns=["count"])
>>> df1["here"] = df2
>>> df1
Test here
0 Text 1 3
1 Text 2 5 |
Magic. 🥇 I will review this once #156 is merged so that we can test a bit around with the output DataFrame(s). Does your solution handle correctly the insertion of sparse Pandas DF? For you to know, my only concern is that I don't fully understand why the Pandas API does not allow for such insertion. |
There is no reason, that the pandas API does not support it. This issue we opened at pandas about it. Now our usecase is much easier, with a single column level.
It works so far. As soon, as the other two are merged, I will move this PR to ready to review |
Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just went through everything again and added additional tests for sparseness. We believe this is ready to be reviewed/merged now 🐙
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, less complex than expected 🧐
Looks great!
What if we add this for now as an extension that users can test? We can insert this into a separate file texthero.beta.helper
or similar and explain in a blog article how to "activate that" (from texthero.beta import helper ...) and what it allows us to do.
- we don't destroy any pandas functionality as currently calling | ||
`__setitem__` with a Multiindexed value is just not possible, so | ||
our changes to Pandas do not break any Pandas functionality for | ||
the users. We're only _expanding_ the functinoality |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
functinoality spelling mistake
2 0 0 1 1 1 1 | ||
``` | ||
|
||
That's a DataFrame. Great! Of course, users can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can be a bit more succint here (That's a DataFrame. Great! Of course,)
I also discussed this issue with @henrifroese. we decided that we will not merge it, as discussed in the meeting. In our view when hiding this feature in a beta version, most users will probably ignore this. As this is performance-wise not optimized for huge DataFrames we decided that the tradeoff is not worth it and we will close this PR |
Ok. I'm sorry that we couldn't merge it, as it would have been a great feature for users, yet, it would have been probably worse to introduce something that would have made the user experience less pleasant. The fact that assigning a sparse DataFrame takes so long because of the large number of columns is what makes me think twice before making it part of the master branch. Let's hope that with Pandas v.2 we will be able to do something similar in a more efficient way 🙏🏻 |
As discussed, this PR makes small changes to
pd.DataFrame.__setitem__
to allow users to do e.g.df["tfidf"] = hero.tfidf(df["text"])
. See the extensive documentation we added in_helper.py
. We also add tests for this 🕵️ 🕵️♂️ 🕵️♀️NOTE: only so many commits/lines as this builds on #156