get start and end of regex match in dataframe #8747

teese · 2014-11-06T17:05:22Z

What about adding a method that returns the start and end positions of a regex match for each item in a DataFrame column, perhaps via .str.extract?

Returning the start as a new column would perhaps be as follows:

df['start'] = df['string'].str.extract(pattern, output = 'start')

An alternative suggestion from jkitchen on StackOverflow was to use start_index = True or end_index = True:

df['start'] = df['string'].str.extract(pattern, start_index = True)

For multiple parameters (e.g. start and end) as outputs, there needs to be a way to avoid running the search twice. One solution would be to give the output as a tuple:

df['regex_output_tuple'] = df['string'].str.extract(pattern, output = ('start','end'))

I don't use regex very often, so I don't know if there are other parameters that people want after a regex search. If there really is just the text in the groups, the start and the end, perhaps there's a way to put the output directly into new columns?

df['groups'], df['start'], df['end']  = df['string'].str.extract(pattern, output = ('groups','start','end'))

I think it makes sense that non-matches return a NaN, just as in the regular extract function. This would mix integer and float datatypes in the df['start'] column, but I guess we all know about that situation :)
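For readers on later pandas releases, the integer/float mix can possibly be avoided with the nullable "Int64" extension dtype, which did not exist when this was posted. A minimal sketch, using made-up start positions rather than the output of any existing str method:

```python
import pandas as pd

# Hypothetical start positions; None marks a row where the regex did not match.
starts = [1, None, 2]

# A plain Series falls back to float64 so that it can hold NaN.
as_float = pd.Series(starts)

# The nullable "Int64" dtype keeps the integers and stores the gap as pd.NA.
as_int = pd.Series(starts, dtype="Int64")

print(as_float.dtype)
print(as_int.dtype)
```

Whether a future extract-like method should return this dtype is a separate design question; the sketch only shows that the column does not have to become float.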

I'm not an experienced programmer, so sorry if I misunderstood some basic concepts.

Please see the question in StackOverflow for example code and comments:
http://stackoverflow.com/questions/26658213/how-can-i-find-the-start-and-end-of-a-regex-match-using-a-python-pandas-datafram

A block of example data and code is below, as requested by jreback.

import pandas as pd
import re
#some example query sequences, markup strings, hit sequences.
q1,q2,q3 = 'MPIMGSSVYITVELAIAVLAILG','MPIMGSSVYITVELAIAVLAILG','MPI-MGSSVYITVELAIAVLAIL'
m1,m2,m3 = '|| ||  ||||||||||||||||','||   | ||| :|| || |:: |','||:    ::|: :||||| |:: '
h1,h2,h3 = 'MPTMGFWVYITVELAIAVLAILG','MP-NSSLVYIGLELVIACLSVAG','MPLETQDALYVALELAIAALSVA' 
#create a pandas dataframe to hold the aligned sequences
df = pd.DataFrame({'query':[q1,q2,q3],'markup':[m1,m2,m3],'hit':[h1,h2,h3]})

#create a regex search string to find the appropriate subset in the query sequence
desired_region_from_query = 'PIMGSS'
regex_desired_region_from_query = '(P-*I-*M-*G-*S-*S-*)'

#Pandas has a nice extract function to slice out the matched sequence from the query:
df['extracted'] = df['query'].str.extract(regex_desired_region_from_query)

#However I need the start and end of the match in order to extract the equivalent regions 
#from the markup and hit columns. For a single string, this is done as follows:
match = re.search(regex_desired_region_from_query, df.loc[2,'query'])
sliced_hit = df.loc[2,'hit'][match.start():match.end()]
print('sliced_hit, non-vectorized example: ', sliced_hit)

#HERE the new syntax is necessary
#e.g. df['start'], df['end']  = df['string'].str.extract(pattern, output = ('start','end'))

#My current workaround in pandas is as follows.
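#define function to obtain regex output (start, end) as a tuple;
#non-matches give (NaN, NaN) instead of raising an AttributeError
def get_regex_output(x):
    m = re.search(regex_desired_region_from_query, x)
    return (m.start(), m.end()) if m else (float('nan'), float('nan'))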
#apply function
df['regex_output_tuple'] = df['query'].apply(get_regex_output)
#convert the tuple into two separate columns
columns_from_regex_output = ['start','end']      
for n, col in enumerate(columns_from_regex_output):
    df[col] = df['regex_output_tuple'].apply(lambda x: x[n])
#delete the unnecessary column
df = df.drop('regex_output_tuple', axis=1)
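The workaround above can be condensed so that the regex runs once per row and both columns are built together. The following is only a sketch under some assumptions: the helper name match_span, the extra non-matching row ("AAAA"/"CCCC") and the use of the nullable "Int64" dtype are illustrative choices, not part of pandas' str API or of the original post.

```python
import re

import pandas as pd

# Two of the aligned sequences from the example above, plus one made-up
# row that does not match, to exercise the non-match path.
df = pd.DataFrame({
    "query": ["MPIMGSSVYITVELAIAVLAILG", "MPI-MGSSVYITVELAIAVLAIL", "AAAA"],
    "hit": ["MPTMGFWVYITVELAIAVLAILG", "MPLETQDALYVALELAIAALSVA", "CCCC"],
})
pattern = re.compile(r"(P-*I-*M-*G-*S-*S-*)")


def match_span(text):
    """Return (start, end) of the first match, or (None, None) if there is none."""
    m = pattern.search(text)
    return m.span() if m else (None, None)


# One regex search per row; the tuples are unpacked into two columns at once.
spans = pd.DataFrame(
    [match_span(s) for s in df["query"]],
    columns=["start", "end"],
    index=df.index,
).astype("Int64")
df = df.join(spans)

# Slice the aligned 'hit' string with the per-row indices (None if no match).
df["sliced_hit"] = [
    hit[start:end] if pd.notna(start) else None
    for hit, start, end in zip(df["hit"], df["start"], df["end"])
]
print(df[["start", "end", "sliced_hit"]])
```

This is still a Python-level loop over the rows, so it is not expected to be faster than the apply-based version; it mainly avoids the intermediate tuple column.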


jreback · 2014-11-07T17:15:32Z

can you provide a short but specific example of what exactly is needed/wanted here (make runnable as much as possible and indicate where syntax is needed)

teese · 2014-11-08T15:10:42Z

I've added the code to the question as requested. Sorry it's not short, but it contains some real data to help explain why the regex start and end are needed. The code should work in both python 2.7 and 3.4, and the latest pandas release (0.15.0). In my case, I will apply the above workaround to ~5000 dataframes, each containing ~5000 rows, with significantly longer sequences (~500 characters in each string).

jreback · 2014-11-08T15:26:03Z

So it actually sounds like you want a function like extract, but one that returns the matched indices,

e.g. df['query'].indices(regex_desired_region_from_query, outtype='list|frame')

A subtle issue is whether the match should return just (start, end) or a list of matches (not sure what that would look like).
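To make the "list of matches" case concrete, one possible shape is a long-format frame with one row per match, indexed by (row label, match number), similar in spirit to what str.extractall returns. This is a sketch only: the input strings are made up, and the column names "start", "end" and "text" are assumptions rather than a proposed API.

```python
import re

import pandas as pd

# Made-up strings: two matches in row 0, none in row 1, one in row 2.
texts = pd.Series(["PIMGSSAAPIMGSS", "AAAA", "MPI-MGSSV"])
pattern = re.compile(r"(P-*I-*M-*G-*S-*S-*)")

# One record per match, keyed by (row label, match number).
records = [
    (label, n, m.start(), m.end(), m.group(0))
    for label, text in texts.items()
    for n, m in enumerate(pattern.finditer(text))
]
all_spans = pd.DataFrame(
    records, columns=["row", "match", "start", "end", "text"]
).set_index(["row", "match"])
print(all_spans)
```

Rows without any match simply do not appear in the result, so the single (start, end) case and the multi-match case would probably need different return shapes.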

teese · 2014-11-08T20:06:50Z

Create .indices as another function?
It's an interesting idea, but I'd have to admit I'm already confused with the .match, .extract, .contains functions that already exist.

Beginners learn to apply regex to single strings using the following syntax
(from https://docs.python.org/3.4/library/re.html):

text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly

As a beginner, I am happiest when the syntax in pandas matches the original syntax as closely as possible. The .extract function works great, but after looking at the discussion in #5075, I would probably have voted to keep the name .match, replace the legacy code with the new extract function, and change the output (group, bool, index, or a combination) based on various arguments.

Currently, when someone wants three things (the groups, the start index and the end index), the only way to get them without repeating the regex search is to get the indices first and then apply some lambda functions to slice out the group. This is a very different process from what people are accustomed to when using the original re module.
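For comparison, a single search per string can return all three pieces if the match object is unpacked in plain Python before the results go into a frame. This is a hedged sketch, not an existing pandas feature: the helper group_start_end, the shortened input strings and the column names are all made up for illustration.

```python
import re

import pandas as pd

# Made-up, shortened sequences; the last one does not match.
strings = pd.Series(["MPIMGSSVYIT", "MPI-MGSSVYI", "AAAA"])
pattern = re.compile(r"(P-*I-*M-*G-*S-*S-*)")


def group_start_end(text):
    """Search once and return the first group with its start and end."""
    m = pattern.search(text)
    if m is None:
        return {"group": None, "start": None, "end": None}
    return {"group": m.group(1), "start": m.start(1), "end": m.end(1)}


result = pd.DataFrame([group_start_end(s) for s in strings], index=strings.index)
# Keep the indices as integers even though the last row has no match.
result = result.astype({"start": "Int64", "end": "Int64"})
print(result)
```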

So in summary, in order of my preference:

1. Incorporate extract and the proposed get-indices into str.match (to me the simplest for new users, but involves reopening an old discussion and worrying about backwards compatibility).
2. Incorporate the get-indices function into str.match, but leave the current default output as 'bool' (as planned).
3. Create a new str.indices function.

What're your thoughts concerning the first two options?

Regarding your second comment as to whether the match can return just (start,end) or a list of matches, I still have to sit down and think about that one :)

edumotya · 2021-02-03T12:13:28Z

This is my workaround for named groups:

import re
import pandas as pd

class SpanExtractor:
    def __init__(self, pattern):
        self._pattern = re.compile(pattern)
        self._groups = list(self._pattern.groupindex.keys())

    def __call__(self, x):
        """
        Utility function to extract the start and end indices.
        """
        m = self._pattern.search(x)
        if m:
            span_groups = {g: m.span(g) for g in self._groups}
        else:
            span_groups = {g: (float("nan"), float("nan")) for g in self._groups}
        return pd.Series(span_groups)


def _extract_spans(ds: pd.Series, pattern: str) -> pd.DataFrame:
    span_extractor = SpanExtractor(pattern)
    spans = ds.apply(span_extractor)
    spans = pd.concat(
        [
            pd.DataFrame(
                spans[col].to_list(),
                columns=["start_index_" + col, "end_index_" + col],
                index=spans.index,
            )
            for col in spans
        ],
        axis="columns",
    )
    return spans

# Usage: df["text"], pattern and the group names in braces are placeholders
spans = _extract_spans(df["text"], pattern)
spans["start_index_{your_named_group_1}"]
spans["end_index_{your_named_group_1}"]
spans["start_index_{your_named_group_2}"]

vsocrates · 2023-01-12T00:59:18Z

Hi, I'd love to see a solution to this and it seems like fairly expected functionality, given the way that the re module works.

It looks like an edit to the _str_extract function here may be the start to a fix, but it seems like there would be issues with backwards compatibility or impacts on other functions.

if not expand:

    def g(x):
        m = regex.search(x)
        return m.groups()[0] if m else na_value

    return self._str_map(g, convert=False)

I'd be willing to take a stab at it if someone can provide me with some more direction (unless there are plans to implement this in a future release that I missed)?

vsocrates · 2023-01-23T00:29:02Z

Hi, following up on this! @mroeschke, wondering why the "Contributions Welcome" milestone was taken off and if this is still up for contributions, thanks!

GolAGitHub · 2023-06-26T17:29:14Z

I frequently wish I had access to regex match object methods when using str.extract/str.extractall. Is this still under consideration for a new release?

delucca · 2023-06-27T15:19:47Z

+1 for this

teese mentioned this issue Nov 6, 2014

Suggestion: method to slice strings using index columns (start and end) in dataframe #8748

Closed

jreback added the Indexing, Strings and API Design labels Nov 7, 2014

jreback added this to the 0.15.2 milestone Nov 8, 2014

jreback modified the milestones: 0.16.0, 0.15.2 Nov 30, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

mroeschke added the Enhancement label May 3, 2020

mroeschke removed the API Design label Apr 11, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
