get start and end of regex match in dataframe #8747
Comments
Can you provide a short but specific example of what exactly is needed/wanted here (make it runnable as much as possible and indicate where syntax is needed)?
I've added the code to the question as requested. Sorry it's not short, but it contains some real data to help explain why it is necessary to obtain the regex start and end. The code should work in both Python 2.7 and 3.4, and with the latest pandas release (0.15.0). In my case, I will apply the above workaround to ~5000 dataframes, each containing ~5000 rows, with significantly longer sequences (~500 characters in each string).
So it actually sounds like you want a function like extract, but one that returns the matched indices. A subtle issue is whether the match can return just (start, end) or a list of matches.
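The difference between a single (start, end) pair and a list of matches can be illustrated with the standard `re` module alone. This is only a sketch: the string and pattern are hypothetical, and it says nothing about how pandas would expose either form.

```python
import re

text = "abcXYZdefXYZ"  # hypothetical data, for illustration only
pattern = re.compile(r"XYZ")

# Option 1: only the first match, as one (start, end) tuple (or None).
first = pattern.search(text)
first_span = first.span() if first else None

# Option 2: every non-overlapping match, as a list of (start, end) tuples.
all_spans = [m.span() for m in pattern.finditer(text)]

print(first_span)
print(all_spans)
```

The first form fits naturally into two integer columns; the second needs either a list per cell or one row per match, much like the difference between `str.extract` and `str.extractall`.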
Create Beginners learn to apply regex to single strings using the following syntax
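For reference, a minimal sketch of that single-string syntax using the standard `re` module. The sequence and pattern here are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical sequence and pattern, for illustration only.
sequence = "MKTAYIAKQRQISFVKSHFSRQ"
match = re.search(r"(I[A-Z]K)", sequence)

if match:
    group_text = match.group(1)  # the captured text
    start = match.start(1)       # index of the first captured character
    end = match.end(1)           # index one past the last captured character
    # Slicing with the indices reproduces the captured text.
    print(group_text, start, end, sequence[start:end] == group_text)
```

A single search gives access to the group, the start and the end at once, which is the behaviour the request would like to mirror on a DataFrame column.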
As a beginner, I am happiest when the syntax in pandas matches the original syntax as closely as possible. Currently, someone may want to get three things: the groups, the start index and the end index. The only way this can be done without repeating the regex search is to get the indices first and then apply some lambda functions to slice out the group. This is a very different process from what people are accustomed to when using the original re module. So in summary, in order of my preferences:
What are your thoughts concerning the first two options? Regarding your second comment as to whether the match can return just (start, end) or a list of matches, I still have to sit down and think about that one :)
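The "search once, then derive everything from the result" idea discussed in this comment might look roughly like the following. It is a sketch only: the DataFrame, column names and pattern are hypothetical, and plain list comprehensions are used instead of any pandas string method.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical data; the column names are illustrative only.
df = pd.DataFrame({"seq": ["MKTAYIAKQR", "GGGGG", "AAIQKAA"]})
regex = re.compile(r"I[A-Z]K")

# Run the search exactly once per row and keep the match object (or None).
matches = [regex.search(s) for s in df["seq"]]

# Derive start, end and matched text from the stored match objects.
# Rows without a match get NaN, as str.extract does for non-matches.
df["start"] = [m.start() if m else np.nan for m in matches]
df["end"] = [m.end() if m else np.nan for m in matches]
df["match"] = [m.group(0) if m else np.nan for m in matches]

print(df)
```

Keeping the match objects in a plain Python list avoids running the regex a second time, at the cost of stepping outside the `.str` accessor.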
This is my workaround for named groups:
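One possible shape for such a workaround is sketched below. It is not the commenter's code: the helper `named_group_spans`, the pattern and the data are all hypothetical, and only the standard `re` match methods and basic pandas constructors are assumed.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical data and pattern with two named groups.
df = pd.DataFrame({"seq": ["xxABC123", "no match", "DEF45yy"]})
regex = re.compile(r"(?P<letters>[A-Z]+)(?P<digits>\d+)")


def named_group_spans(text):
    """Return text, start and end for each named group (NaN if no match)."""
    m = regex.search(text)  # one search per string
    out = {}
    for name in regex.groupindex:
        out[name] = m.group(name) if m else np.nan
        out[name + "_start"] = m.start(name) if m else np.nan
        out[name + "_end"] = m.end(name) if m else np.nan
    return out


# One dict per row, turned into columns and joined back on the index.
spans = pd.DataFrame([named_group_spans(s) for s in df["seq"]], index=df.index)
df = df.join(spans)

print(df)
```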
Hi, I'd love to see a solution to this and it seems like fairly expected functionality, given the way that the It looks like an edit to the

    if not expand:
        def g(x):
            m = regex.search(x)
            return m.groups()[0] if m else na_value
        return self._str_map(g, convert=False)

I'd be willing to take a stab at it if someone can provide me with some more direction (unless there are plans to implement this in a future release that I missed)?
Hi, following up on this! @mroeschke, wondering why the "Contributions Welcome" milestone was taken off and if this is still up for contributions, thanks!
I frequently wish I had access to regex match object methods when using str.extract/str.extractall. Is this still under consideration for a new release?
+1 for this
What about including a method to get the start and stop after a regex search of items in a DataFrame? Perhaps using .str.extract?
Returning the start as a new column would perhaps be as follows:
An alternative suggestion from jkitchen on StackOverflow was to use start_index = True, or end_index = True.
For multiple parameters (e.g. start and end) as outputs, there needs to be a way to avoid running the search twice. One solution would be to give the output as a tuple:
I don't use regex very often, so I don't know if there are other parameters that people want after a regex search. If there really is just the text in the groups, the start and the end, perhaps there's a way to put the output directly into new columns?
I think it makes sense that non-matches return a NaN, just as in the regular extract function. This would mix integer and float datatypes in the df['start'] column, but I guess we all know about that situation :)
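The dtype effect mentioned here can be seen in a small sketch with hypothetical start indices. The nullable `"Int64"` dtype shown at the end is an assumption about the reader's environment: it exists only in pandas releases much newer than the 0.15.0 discussed in this thread.

```python
import numpy as np
import pandas as pd

# Hypothetical start indices; NaN marks a row where the regex did not match.
start = pd.Series([5, np.nan, 2])
print(start.dtype)  # the integers are stored as floats because NaN is a float

# In newer pandas releases, the nullable integer dtype keeps whole numbers
# and represents the missing value as <NA> instead of NaN.
start_nullable = start.astype("Int64")
print(start_nullable.dtype)
print(start_nullable)
```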
I'm not an experienced programmer, so sorry if I misunderstood some basic concepts.
Please see the question in StackOverflow for example code and comments:
http://stackoverflow.com/questions/26658213/how-can-i-find-the-start-and-end-of-a-regex-match-using-a-python-pandas-datafram
A block of example data and code is below, as requested by jreback.