-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
location fallback system #262
Comments
not urgent or anything, but just thought Id ping you @karlicoss since seems like notifications often dont work on here |
Ah sorry -- in this case saw the notification -- just was a bit busy with other things!
Yeah seems good -- at least to start with!
Hmm, not sure I understood -- what would the user pick here? Seems that it's possible to pick the source automatically, just by using accuracy?
Similarly, not sure I understand -- makes sense to have
Yeah, seems like a good idea! Maybe there is even something existing in more_itertools we could reuse |
all good, this will probably take a while to build anyways
Ah, sorry; didn't totally describe this here. The sources/fallback for a day would all be picked automatically, based on what data is available on the day, which has the highest accuracy etc. The prompting I'm referring to is to confirm a location based on an IP address, as some of the IP addresses I have from 2012-2013 when geolocated now, return incorrect locations, like in random parts of europe/islands I've never been to (I'm assuming that these IPs were bought out by another company, and the location now is different?) Could also probably check if the location is within some distance of a known/home location and then auto-confirm it, so you're not prompting for everything, just for what look like outlier addresses If you confirm one of those as correct addresses, then that would be saved to the The same thing may (or might not, havent ever gone through the latitude/longitutde tags on my images, so will see it is a problem that needs fixing) be a problem for images in my photos directory which I didn't take (so the location should not be used), but happen to be geotagged could also let the user define a custom function in their config like:
so they can remove swaths of IP addresses before/within some date etc, but will see if thats even necessary when I get to that point. I have a good amount of IPs personally so should be able to find most of the annoying edge cases:
|
Ah gotcha! So it's basically some tooling/overlay/filter to 'review' the data, good idea. The json would end up quite huge though, right? Perhaps it could be backed by sqlite, maybe even cachew somehow.
I think at this point it could just be a personal overlay? At least to start with/expreiment cause it's less intrusive than defining a config etc.
I guess the whole thing could be quite generic in principle (at the very list you could drop datapoints). But kinda requires stable ids for each piece of data karlicoss/promnesia#173 |
Yeah, Im not really set on any storage/strategy yet, will experiment when I get there and see what makes sense
Ah true, but would then need to have some way to find 'unlikely/bad' entries, so you can at least warn user? As otherwise things get auto approved, and you might have incorrect data -- but maybe it should be up to the user to flag bad entries. Will keep these all open as options while Im bootstrapping this
Right, I'll probably add some more documentation for that as well, as another option
yeah, that is something Ive had the thought about a few times (same with IPs above), would be more convenient for the user. but also, these could all be solved with an hpi overlay, but thats a lot more involved than just adding something to your could experiment with adding something to def _custom_filter_function(ip: IP) -> bool:
if ...:
return False
return True
class query:
filters: Dict[str, Callable[[T], bool]] = {
"my.ip.all": _custom_filter_function
} I dont really want to wrap every module function in a decorator that checks the also this isnt typesafe but I feel thats sort of the tradeoff you have to make for user convenience perhaps in the future if thats deemed to be so useful that its worth the tradeoff (doubt it, but perhaps), can add a decorator to every module |
Oh, and on the stable IDs, perhaps we can at least move a bit closer to that/define some standard behaviour? Like, each namedtuple/dataclass can define a unique key on the data, sort of like in the google_takeout_parser/models.py. Could be a may not be able to do that for every source, but at least when its useful for promneia/other consumers, theres some sort of standard |
If you don’t mind the interjection, I think I can highlight something that might be getting missed: I suspect that @seanbreckenridge and I have a similar desire for absolute truth even if it means putting in the extra work manually. From that perspective someone can see that in-file metadata can be (and has been) wrong, and that it’s important for an “absolute truth” system to recognize when we’ve marked a correction and to even favour that correction. For example:
Now, we could alter the EXIF tag but that corrupts the ideal of this file being the original copy (yes I’m going a bit far with this, but date is still tricky to modify and other metadata types are not as forgiving) What would be better for archival purposes is a record of the correction, maybe a note about why it was wrong, and the system moving forward with the corrected data. This keeps the original untouched, leaves an audit trail, and still lets us view reports without having to think about it again. And the dream of our system being a source of absolute truth stays intact. |
true, I would personally be fine with putting in the extra work, but I don't want to put that burden on everyone else using the module In my opinion at least, HPI is meant to be able to take messy data as input and deal with it, and return (not throw) errors/warnings when it cant. (which is also sort of stated in This also slightly feels outside of the realm of HPI (at least for fixing tags on images), (which is also why I left the comment at the top): "since this is a bit more state-ful than HPI typically is (this would include a CLI which saves data to disk based on prompting the user)" Could give the user more than one choice:
but it is getting pretty complicated to maintain all of this in this repo at that point, and it doesn't need to be here For images in particular, what might make sense is to create a separate repo/CLI tool which manages your images in the custom way we may want, probably built on top of an existing project like elodie (or something else, havent scoured for related projects yet). That repo would let you import/manage your photos incrementally, and could be used for people who want the absolute truth/to confirm all of their image data, and then we'd just have a file here like
I do think this is a good idea, and will try to implement something like this at some point |
Yeah, I've been reluctant to do it so far unless necessary (and in particular, advertise as a stable uuid). I guess my hope was that it could be mostly agnostic, e.g. derived from the timestamp or something. But that could be tricky too, even timestamps might change for various reasons.
Yeah agreed -- plus it's not always feasible to modify the metadata directly, e.g. you could have something coming from google takeout, so you'd have to patch up files every time you do a takeout (as opposed to a programmatic overlay).
Yep, totally agree!
I guess it depends on how generic we can make it -- if it 's possible to make it work in a somewhat agnostic manner, just on top of namedtuples/dataclasses I don't see why not. But specific UIs/logic to analyse and review data don't have to be in core HPI yeah. |
Hmm -- the more I try to implement any sort of confirmation mechanism its either:
I think instead would be easier to add a filter function just to the config like: def filter_ips(ip: IP) -> bool:
# user defined behaviour here...
class location:
class fallback:
class via_ip:
filter = filter_ips
accuracy: float = 10_000 Sitenote: This Edit: this does go down the path of adding more boilerplate to every module though, but most likely filtering 'bad' data isn't that common of a usecase? I can only think of IP and photos that might use it, but it does increase the size of the And then have some documentation on how to use it - Could even have 'recipes' or something using some shared functions in |
Oh also, I tried implementing the diff --git a/my/location/all.py b/my/location/all.py
index eec4bcc..627fe5d 100644
--- a/my/location/all.py
+++ b/my/location/all.py
@@ -7,7 +7,7 @@ from typing import Iterator
from my.core import Stats, LazyLogger
from my.core.source import import_source
-from .common import Location
+from my.location.common import Location, SortedLocations
logger = LazyLogger(__name__, level="warning")
@@ -15,9 +15,11 @@ logger = LazyLogger(__name__, level="warning")
def locations() -> Iterator[Location]:
# can add/comment out sources here to disable them, or use core.disabled_modules
- yield from _takeout_locations()
- yield from _gpslogger_locations()
- yield from _ip_locations()
+ yield from SortedLocations.map_sort(
+ _takeout_locations(),
+ _gpslogger_locations(),
+ _ip_locations(),
+ )
@import_source(module_name="my.location.google_takeout")
diff --git a/my/location/common.py b/my/location/common.py
index 5b5c33f..b1391fe 100644
--- a/my/location/common.py
+++ b/my/location/common.py
@@ -1,9 +1,14 @@
+from __future__ import annotations
from datetime import date, datetime
-from typing import Union, Tuple, Optional
+from typing import Union, Tuple, Optional, Iterable, TypeVar
from dataclasses import dataclass
+
from my.core import __NOT_HPI_MODULE__
from my.core.compat import Protocol
+from my.core.common import LazyLogger
+
+logger = LazyLogger(__name__, level='warning')
DateIsh = Union[datetime, date, str]
@@ -32,3 +37,44 @@ class Location(LocationProtocol):
accuracy: Optional[float]
elevation: Optional[float]
datasource: Optional[str] = None # which module provided this, useful for debugging
+
+
+L = TypeVar("L", bound=LocationProtocol)
+
+# functionality related to merging sorting iterators of locations
+# using heapq.merge https://docs.python.org/3/library/heapq.html
+
+class SortedLocations(Iterable[L]):
+ def __init__(self, locations: Iterable[L], sort_input: bool = True):
+ self.sort_input = sort_input
+ self._locations: Iterable[L] = locations
+
+ # if this is already a SortedLocations, don't sort again
+ if not isinstance(locations, SortedLocations):
+ if sort_input:
+ self._locations = list(locations)
+ self._locations.sort(key=lambda x: x.dt)
+ self._iter = iter(self._locations)
+
+ def __iter__(self) -> SortedLocations:
+ return self
+
+ def __next__(self) -> L:
+ return next(self._iter)
+
+ @classmethod
+ def map_sort(cls, *locations: Iterable[L]) -> SortedLocations:
+ '''
+ Sorts locations from each iterable, then merge them
+ '''
+ logger.debug(f"Sorting locations from {len(locations)} sources")
+ # if already sorted, should be close to O(1) anyways, so shouldnt matter that much
+ # its safer to sort again, but user can opt out by wrapping their input in
+ # SortedLocations(..., sort_input=False), which will skip the sort in the constructor
+ locs = cls.sort(*(cls(loc) for loc in locations))
+ return cls(locs, sort_input=False)
+
+ @staticmethod
+ def sort(*locations: Iterable[L]) -> Iterable[L]:
+ import heapq
+ return heapq.merge(*locations, key=lambda x: x.dt)
diff --git a/my/location/fallback/via_ip.py b/my/location/fallback/via_ip.py
index 0e8fb05..3bd4775 100644
--- a/my/location/fallback/via_ip.py
+++ b/my/location/fallback/via_ip.py
@@ -4,10 +4,12 @@ Converts IP addresses provided by my.location.ip to estimated locations
REQUIRES = ["git+https://github.com/seanbreckenridge/ipgeocache"]
+from datetime import datetime
+
from my.core import dataclass, Stats
from my.config import location
from my.core.warnings import medium
-from datetime import datetime
+from my.location.common import SortedLocations
@dataclass
@@ -42,7 +44,7 @@ def fallback_locations() -> Iterator[FallbackLocation]:
# for compatibility with my.location.via_ip, this shouldnt be used by other modules
def locations() -> Iterator[Location]:
medium("via_ip.locations is deprecated, use via_ip.fallback_locations instead")
- yield from map(FallbackLocation.to_location, fallback_locations())
+ yield from SortedLocations(iter(map(FallbackLocation.to_location, fallback_locations())))
def estimate_location(dt: datetime) -> Location:
diff --git a/my/location/google_takeout.py b/my/location/google_takeout.py
index 80b31cb..2aed070 100644
--- a/my/location/google_takeout.py
+++ b/my/location/google_takeout.py
@@ -10,7 +10,7 @@ from my.google.takeout.parser import events, _cachew_depends_on
from google_takeout_parser.models import Location as GoogleLocation
from my.core.common import mcachew, LazyLogger, Stats
-from .common import Location
+from my.location.common import Location, SortedLocations
logger = LazyLogger(__name__)
@@ -20,11 +20,17 @@ logger = LazyLogger(__name__)
logger=logger,
)
def locations() -> Iterator[Location]:
- for g in events():
- if isinstance(g, GoogleLocation):
- yield Location(
- lon=g.lng, lat=g.lat, dt=g.dt, accuracy=g.accuracy, elevation=None
- )
+ def _locations() -> Iterator[Location]:
+ for g in events():
+ if isinstance(g, GoogleLocation):
+ yield Location(
+ lon=g.lng,
+ lat=g.lat,
+ dt=g.dt, accuracy=g.accuracy,
+ elevation=None,
+ datasource="google_takeout",
+ )
+ return SortedLocations(_locations())
def stats() -> Stats:
diff --git a/my/location/gpslogger.py b/my/location/gpslogger.py
index 95f4474..79e3628 100644
--- a/my/location/gpslogger.py
+++ b/my/location/gpslogger.py
@@ -17,7 +17,6 @@ class config(location.gpslogger):
accuracy: float = 50.0
-from itertools import chain
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator, Sequence, List
@@ -27,14 +26,17 @@ from more_itertools import unique_everseen
from my.core import Stats, LazyLogger
from my.core.common import get_files, mcachew
-from .common import Location
+from my.location.common import Location, SortedLocations
logger = LazyLogger(__name__, level="warning")
def inputs() -> Sequence[Path]:
- return get_files(config.export_path, glob="*.gpx")
+ # sort by timestamp so that we can merge them in order
+ # gpslogger files can optionally be prefixed by a device id,
+ # so we can't always just sort by name
+ return sorted(get_files(config.export_path, glob="*.gpx", sort=False), key=lambda p: p.stat().st_mtime)
def _cachew_depends_on() -> List[float]:
@@ -45,7 +47,11 @@ def _cachew_depends_on() -> List[float]:
@mcachew(depends_on=_cachew_depends_on, logger=logger)
def locations() -> Iterator[Location]:
yield from unique_everseen(
- chain(*map(_extract_locations, inputs())), key=lambda loc: loc.dt
+ # sort each file, then merge them all together
+ SortedLocations.map_sort(
+ *map(_extract_locations, inputs())
+ ),
+ key=lambda loc: loc.dt,
)
@@ -65,6 +71,7 @@ def _extract_locations(path: Path) -> Iterator[Location]:
accuracy=config.accuracy,
elevation=point.elevation,
dt=datetime.replace(point.time, tzinfo=timezone.utc),
+ datasource="gpslogger",
)
diff --git a/my/time/tz/via_location.py b/my/time/tz/via_location.py
index 6b8e835..c56c8af 100644
--- a/my/time/tz/via_location.py
+++ b/my/time/tz/via_location.py
@@ -75,7 +75,7 @@ class DayWithZone(NamedTuple):
zone: Zone
-from my.location.common import LatLon
+from my.location.common import LatLon, SortedLocations
# for backwards compatibility
def _locations() -> Iterator[Tuple[LatLon, datetime]]:
@@ -99,7 +99,18 @@ def _locations() -> Iterator[Tuple[LatLon, datetime]]:
# TODO: could use heapmerge or sort the underlying iterators somehow?
# see https://github.com/karlicoss/HPI/pull/237#discussion_r858372934
def _sorted_locations() -> List[Tuple[LatLon, datetime]]:
- return list(sorted(_locations(), key=lambda x: x[1]))
+ is_already_sorted = False
+ try:
+ import my.location.all
+ is_already_sorted = isinstance(my.location.all.locations(), SortedLocations)
+ except Exception:
+ pass
+
+ if is_already_sorted:
+ logger.debug("my.location.all.locations() is already sorted, skipping sort")
+ return list(_locations())
+ else:
+ return list(sorted(_locations(), key=lambda x: x[1]))
# Note: this takes a while, as the upstream since _locations isn't sorted, so this
@@ -192,7 +203,7 @@ def _get_day_tz(d: date) -> Optional[pytz.BaseTzInfo]:
# ok to cache, there are only a few home locations?
@lru_cache(maxsize=None)
-def _get_home_tz(loc) -> Optional[pytz.BaseTzInfo]:
+def _get_home_tz(loc: LatLon) -> Optional[pytz.BaseTzInfo]:
(lat, lng) = loc
finder = _timezone_finder(fast=False) # ok to use slow here for better precision
zone = finder.timezone_at(lat=lat, lng=lng)
@@ -203,7 +214,7 @@ def _get_home_tz(loc) -> Optional[pytz.BaseTzInfo]:
return pytz.timezone(zone)
-def _get_tz(dt: datetime) -> Optional[pytz.BaseTzInfo]:
+def get_tz(dt: datetime) -> Optional[pytz.BaseTzInfo]:
'''
Given a datetime, returns the timezone for that date.
'''
@@ -211,16 +222,13 @@ def _get_tz(dt: datetime) -> Optional[pytz.BaseTzInfo]:
if res is not None:
return res
# fallback to home tz
- from ...location import home
- loc = home.get_location(dt)
+ from my.location.fallback.via_home import get_location as get_home_location
+ loc = get_home_location(dt)
return _get_home_tz(loc=loc)
-# expose as 'public' function
-get_tz = _get_tz
-
def localize(dt: datetime) -> tzdatetime:
- tz = _get_tz(dt)
+ tz = get_tz(dt)
if tz is None:
# TODO -- this shouldn't really happen.. think about it carefully later
return dt You could always just sort the inputs anyways (which is what I tried), since TimSort (default python sort) should have a fast case when big chunks (or the whole list) are already sorted, but then youre always doing a Could add some version of this that the user can opt-in to, but hard to make it automatic |
see #262 * move home to fallback/via_home.py * move via_ip to fallback * add fallback model * add stub via_ip file * add fallback_locations for via_ip * use protocol for locations * estimate_from helper, via_home estimator, all.py * via_home: add accuracy, cache history * add datasources to gpslogger/google_takeout * tz/via_location.py: update import to fallback * denylist docs/installation instructions * tz.via_location: let user customize cachew refresh time * add via_ip.estimate_location using binary search * use estimate_location in via_home.get_location * tests: add gpslogger to location config stub * tests: install tz related libs in test env * tz: add regression test for broken windows dates * vendorize bisect_left from python src doesnt have a 'key' parameter till python3.10
Am just posting/ideating this here to see if its something you're interested in adding, since this is a bit more state-ful than HPI typically is (this would include a CLI which saves data to disk based on prompting the user). I thought about making this a separate project but I think dealing with all this messy data is sort of HPI is good for, so it would be nice to have an interface thats easily extendible in here
Ive been creating something similar for my where_db CLI and we already have home locations as fallback, so I thought it would be nice to formalize it into an API thats easier to extend, like:
via_photos
because I have lots of photos on disk but not all of them I actually want to use for location, so this would let me confirm which ones to do that for (could also maybe improve onphotos.py
somehow)via_prompt
being a manual entry of the namedtuple described belowNamedtuple would be something like:
Either
duration
(how long this location is valid for) orend_dt
(when this location is no longer valid) has to be set.And then all would call some function that prioritizes/picks which source to use, based on the order theyre returned in, like:
It would call each fallback function, and then use the one with the closest
accuracy
(which typically would be entered or estimated through a config block in each module)This would require the user to prompt the user, so could probably just have a main function and run this like
python3 -m my.location.fallback.via_prompt
? Or could hook it into the CLI, either way.Strategy in general Ive thought of:
The choices the user picks would cached (so they arent re-prompted), so this would need to save to a JSON file or something, e.g. to ~/data/locations/via_prompt.json, ~/data/locations/via_ip.json
Im a bit split between actually saving data, or saving a 'sourcemap/transform' of sorts -- I recently did something similar for my scramble_history project, which lets me define keys to look for on any object, and then it defines a mapping file which converts one model into another -- that way were not actually duplicating data, its a bit more programatic, and you can sort of define 'behavior' to transform one piece of data (e.g. IPs) into locations
Not heavily attached to any of these ideas
--
As a sidenote, I had another idea for sorted locations as we discussed in #237 -- I had an idea to create a custom class which just yields the iterator, but then in the merge in
all.py
, we can just do an isinstance check on the iterator itself to check if its the class, likeThat way its still all backwards compatible and only if all sources are
SortedLocationIterator
do we do a mergesort of sorts.This could all be added behind a flag as well, but would really speed up sorting locations
Then in
tz.via_location
for example, ifall.py
return type is aSortedLocationIterator
, we know its been sorted by the user and we dont have to sort it ourselfThe text was updated successfully, but these errors were encountered: