Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add to_date and to_time to StringValue #8908

Closed
2 of 3 tasks
saschahofmann opened this issue Apr 6, 2024 · 15 comments · Fixed by #10278
Closed
2 of 3 tasks

feat: Add to_date and to_time to StringValue #8908

saschahofmann opened this issue Apr 6, 2024 · 15 comments · Fixed by #10278
Assignees
Labels
feature Features or general enhancements
Milestone

Comments

@saschahofmann
Copy link
Contributor

saschahofmann commented Apr 6, 2024

Is your feature request related to a problem?

No response

Describe the solution you'd like

I think it would make sense to have date and time parsing functions similar to the already existing to_timestamp function.

A workaround right now could be to use to_timestamp and extract the time or date part but e.g. BigQuery supports extra parsing functions PARSE_TIME and PARSE_DATE

What version of ibis are you running?

8.0.0

What backend(s) are you using, if any?

BigQuery, Postgres

Code of Conduct

@deepyaman
Copy link
Contributor

@saschahofmann Your request makes sense! I'll look to implement this, with to_date and to_time falling back to to_timestamp followed by extracting the appropriate component for backends that don't support it directly.

@saschahofmann
Copy link
Contributor Author

Cool. Also to happy to look into it but I would need some pointers on where these things are implemented.

@deepyaman
Copy link
Contributor

Cool. Also to happy to look into it but I would need some pointers on where these things are implemented.

Amazing!

So, to_timestamp is defined here:

def to_timestamp(self, format_str: str) -> ir.TimestampValue:
"""Parse a string and return a timestamp.
Parameters
----------
format_str
Format string in `strptime` format
Returns
-------
TimestampValue
Parsed timestamp value
Examples
--------
>>> import ibis
>>> ibis.options.interactive = True
>>> t = ibis.memtable({"ts": ["20170206"]})
>>> t.ts.to_timestamp("%Y%m%d")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ StringToTimestamp(ts, '%Y%m%d') ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ timestamp('UTC') │
├─────────────────────────────────┤
│ 2017-02-06 00:00:00+00:00 │
└─────────────────────────────────┘
"""
return ops.StringToTimestamp(self, format_str).to_expr()

You would need to add functions for to_date and to_time to the API. I'd highly recommend just starting with one; probably to_date will run into less issues (e.g. with timezones).

You'll also need to add a corresponding operator:

class StringToTimestamp(Value):

Finally, you should add tests, like https://github.com/ibis-project/ibis/blob/main/ibis/backends/tests/test_temporal.py#L1431

Once you've done this, you implement translations. For example:

Since you're defining a new operator, it would require implementation across a lot of places in the codebase; let me know if this feels like something you want to tackle, first time diving into the codebase. Regardless, start small! You can fail the test for everything except BigQuery to start with, if you'd like. Also, happy to structure this in a way to set up some of the bases and you can focus on adding implementations for the backends you prefer; just let me know.

@deepyaman
Copy link
Contributor

Oh, also; I noticed that .to_timestamp().time() will not work across the board, because DuckDB doesn't support casting timestamp with timezone to time. So may need to experiment and find a nice approach. Part of why I think it's better to start with to_date. :)

@saschahofmann
Copy link
Contributor Author

I guess in that case, it is acceptable to first drop the timezone?

@deepyaman
Copy link
Contributor

I guess in that case, it is acceptable to first drop the timezone?

In general, I think we defer to the behavior of individual backends to some extent on these things. In the case of DuckDB, my understanding is they don't believe this to be a safe operation, and I'm somewhat inclined to agree. I'll also defer to @cpcloud, given he has a much clearer view on how Ibis usually treats these kinds of things.

@saschahofmann
Copy link
Contributor Author

Alright. I finally started on adding to_date and already got a question on the SQLGlot implementation

I assume str_to_time is a SQLGlot function? Maybe this https://sqlglot.com/sqlglot/dialects/dialect.html#str_to_time_sql?

I don't see an equivalent for str_to_date. As we discussed the fallback could be to do to_timestamp().date(). How/where would I put that?

@saschahofmann
Copy link
Contributor Author

I also took a look at bigquery/compiler.py but I have no idea whether something like this would just work.

def visit_StringToDate(self, op, *, arg, format_str):
    return self.f.parse_date(format_str, arg)

or whether I can just add parse_date to the SIMPLE_OPS?

@gforsyth
Copy link
Member

@saschahofmann -- if arg and format_str are the args defined on the StringToDate op, then yes, your first example will work.

and in this case, since all the kwargs defined on the op are used in parse_date, you can just add it to SIMPLE_OPS

@saschahofmann
Copy link
Contributor Author

But do I need to both? Or is one of them enough?

@gforsyth
Copy link
Member

Just the one is sufficient

gforsyth added a commit that referenced this issue Apr 24, 2024
## Description of changes

* Adds `to_date` to string types that accepts a format string and parses
a string to Date type.

* Uses a cast to timestamp and extract time as a fallback like this
```python
def visit_StringToDate(self, op, *, arg, format_str):
        return self.f.date(self.f.str_to_time(arg, format_str))
```
* Implements native functions for bigquery, clickhouse, MySQL, oracle,
postgres, and snowflake

## Issues closed
Implements half of #8908

---------

Co-authored-by: Gil Forsyth <gforsyth@users.noreply.github.com>
@saschahofmann
Copy link
Contributor Author

Alright I'll finally have time to tackle the second part of this and add a to_time function if wanted. @deepyaman mentioned that DuckDB doesn't support this on purpose and that @cpcloud would have an opinion on that. Do you want to support to_time and default back to just to_timestamp().time? Or should backends like DuckDB throw an error?

@jcrist
Copy link
Member

jcrist commented Aug 7, 2024

Side note - if we're adding new conversion API methods, I think we might want to use an as_* prefix (or something else), keeping to_* methods for things that execute like to_pandas. See #9788 for more context.

@deepyaman
Copy link
Contributor

Alright I'll finally have time to tackle the second part of this and add a to_time function if wanted. @deepyaman mentioned that DuckDB doesn't support this on purpose and that @cpcloud would have an opinion on that. Do you want to support to_time and default back to just to_timestamp().time? Or should backends like DuckDB throw an error?

@saschahofmann Very sorry for not getting back to you sooner on this. Where possible, it seems it would make sense to use the time-specific parsing function (e.g. PARSE_TIME on BigQuery, TO_TIME on Snowflake). For DuckDB, you can just cast (e.g. CAST('1992-09-20 11:30:00.123456' AS TIME)).

@saschahofmann
Copy link
Contributor Author

Alright! I will try to get to this soon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants