[Feature Request] Local Copy of the Open Food Facts (OFF) DB #604

Open
simonj222 opened this issue Jul 26, 2022 · 22 comments

@simonj222

I suspect one feature that is holding back adoption of Waistline is the search functionality. The OFF Search API has a few limitations:

  • No typeahead (no partial matching, plus throughput limitations)
  • Unsorted results (i.e., no TF-IDF; results appear random)
  • Slow performance, since every search makes an API call

We could solve these concerns through an offline version of the DB.

My main concern with doing this is the increase in APK size. I've written a simple Python script that takes the OFF CSV and populates a smaller CSV with only the necessary fields (product name, serving size, calories: only what's displayed in the search results view). It does this only for items that contain calories + serving size, resulting in ~500k items. Compressed, this is ~10 MB, but it will grow over time as the DB grows.
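A minimal sketch of what such a filtering script could look like (not the actual script). The column names `product_name`, `serving_size`, and `energy-kcal_100g`, the tab delimiter, and the helper name `slim_rows` are assumptions about the OFF export, so check them against the real dump before relying on this:

```python
import csv
import io

# Assumed column names in the OFF export; verify against the real file.
KEEP = ("product_name", "serving_size", "energy-kcal_100g")


def slim_rows(src, dst):
    """Copy only the KEEP columns, for rows that have calories and a serving size.

    `src` and `dst` are text streams (open files or StringIO objects).
    Returns the number of rows written.
    """
    # QUOTE_NONE on the reader: stray quote characters in product names
    # should not be treated as field quoting.
    reader = csv.DictReader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.DictWriter(dst, fieldnames=KEEP, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        values = {name: (row.get(name) or "").strip() for name in KEEP}
        if values["energy-kcal_100g"] and values["serving_size"]:
            writer.writerow(values)
            kept += 1
    return kept


# Tiny in-memory example; a real run would pass opened files instead.
sample = io.StringIO(
    "code\tproduct_name\tserving_size\tenergy-kcal_100g\n"
    "1\tOat Bar\t40 g\t410\n"
    "2\tMystery\t\t100\n"
    "3\tWater\t250 ml\t\n"
)
output = io.StringIO()
rows_kept = slim_rows(sample, output)
```

The output stream can then be compressed (for example with the standard `gzip` module) before shipping.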

Detailed nutritional information could be populated through a separate call if the item detail view is displayed, or the item is added to the diary.

There are some downsides:

  • The local OFF DB would need to be periodically refreshed. It can probably just be an APK build step, but this means it will be frozen if Waistline stops releasing new versions.
  • Item detail screen may be slower
  • APK growth

However, the improved search functionality seems worth those downsides.

If there's sufficient interest here, I can try putting together a PR when I find some spare time (but I can't commit to it right now).

The work here is non-trivial and would appear to involve:

  • Script to get OFF data dump, pull relevant fields, compress (done)
  • Integration of that data with the app
  • Typeahead
  • Ranking (just TF-IDF is likely adequate for a decent improvement)
  • Refactor to call OFF on item detail view + diary add
@simonj222 simonj222 changed the title [Feature Request] Local Copy of the OFF DB [Feature Request] Local Copy of the Open Food Facts (OFF) DB Jul 26, 2022
@davidhealey
Owner

What's TF-IDF?

@simonj222
Author

What's TF-IDF?

Sorry, Term Frequency - Inverse Document Frequency, it's probably a good starting point for ranking in a situation like this. It'll give us a score for each item matching a search query, and incorporate a notion of how valuable each search term is (if multiple search terms are provided).

This may not play so nicely with typeahead, but is probably a good initial direction to investigate.
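For illustration, here is a rough sketch of TF-IDF ranking over product names. The function names, the whitespace tokenizer, and the `+ 1.0` smoothing term are my own choices for the example, not anything from Waistline or OFF:

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def build_idf(docs):
    """Inverse document frequency: terms that appear in fewer docs weigh more."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    # The +1.0 keeps terms that appear in every doc from scoring zero.
    return {term: math.log(n / count) + 1.0 for term, count in doc_freq.items()}


def rank(query, docs, idf):
    """Return (score, doc) pairs for docs matching any query term, best first."""
    query_terms = tokenize(query)
    scored = []
    for doc in docs:
        tokens = tokenize(doc)
        if not tokens:
            continue
        term_freq = Counter(tokens)
        score = sum(
            (term_freq[term] / len(tokens)) * idf.get(term, 0.0)
            for term in query_terms
        )
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


products = ["chocolate milk", "whole milk", "oat milk", "dark chocolate bar"]
idf = build_idf(products)
results = rank("chocolate milk", products, idf)
```

Here "chocolate" is rarer than "milk" across the four names, so it carries more weight, and the item matching both terms ranks first.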

@davidhealey
Owner

I don't like the idea of bloating the APK. I'd rather any offline DB was part of a separate download that the user can opt into. Apart from that it all sounds like a good idea to me.

@simonj222
Author

Got it - I'd avoided that because then we'd need to host it.

However, perhaps the OFF data manipulation + output could live in a separate GitHub project. That way GitHub hosts the output, and the app could download it from that hosted URL.

I'll try to play around with this when I get a chance.

@EmilJunker
Contributor

To me, the biggest reason the OFF search in Waistline is so bad, and why I rarely use it, is its data quality and quantity issues. You only find what you're looking for if (a) the item exists in the OFF database at all, and (b) the product name or brand in OFF actually matches the one printed on the item (i.e. the search term you enter). All too often, these are not a given. That's why I always prefer to search for products via the barcode, and only resort to text-based search if it's absolutely necessary.

I really like your idea, but I don't think it would fix the underlying problem of the OFF database. Sorry for being so pessimistic, but I'm afraid this whole thing will just turn out to be a huge amount of work, but in the end lead to no substantial improvement in the user experience.

@EmilJunker
Contributor

By the way, it looks like the OFF search API has a sort_by parameter that allows sorting the results by popularity (most frequently scanned items first). Maybe this would be worth a look.

@davidhealey
Owner

I use the search all the time, I don't think it's that bad. Sometimes the data isn't quite right but that's the same when you scan a barcode.

I agree that an offline database isn't going to make a huge difference overall, but if someone else wants to do it and it doesn't negatively impact my use of the app then I'm happy to include it.

We're already using sort_by https://github.com/davidhealey/waistline/blob/master/www/activities/foodlist/js/open-food-facts.js#L29

@EmilJunker
Contributor

if someone else wants to do it and it doesn't negatively impact my use of the app then I'm happy to include it

Sure, if someone else wants to implement it and it's an optional download, then it's fine. I'm not advocating against this being added. But I do think that it would require a lot of effort, and the result might not actually be too different from what we have now, so you have been warned ;)

Also, there are a few problems with the approach outlined in the original comment:

  • If I understand correctly, the csv only includes the product name, serving size, and calories. But we also need the product brand (not just to display it, but also for searching), and the product image (for the thumbnail). I assume including these would blow up the size of the csv file even more.
  • There are some food items that have no calories, but should still be included in the search results, e.g. dietary supplements (see issue #482, "Lots of food items showing up with 0kcal"). So these would also need to be included in the csv file, increasing the size even more.
  • The typeahead feature could interfere with searching for local food items. When I'm just typing in the search field to look for an item from the local foodslist, it could be annoying to be presented with search suggestions from OFF.

We're already using sort_by

Oh, that's interesting. But it's currently set to last_modified_t. I think it would be worth experimenting with other values such as unique_scans_n and see if that improves the search output.
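To make the experiment concrete, here is a small sketch that only builds the request URL with a different sort_by value (it does not call the API). The endpoint path and every parameter other than sort_by are assumptions on my part about the classic OFF search endpoint, and `build_search_url` is a made-up helper, not what the app's JavaScript does:

```python
from urllib.parse import urlencode

# Assumed endpoint of the classic OFF search; verify against the OFF API docs.
SEARCH_URL = "https://world.openfoodfacts.org/cgi/search.pl"


def build_search_url(terms, sort_by="unique_scans_n", page_size=20):
    """Build a search URL; `sort_by` is the value being experimented with."""
    params = {
        "search_terms": terms,
        "search_simple": 1,   # assumed parameter
        "action": "process",  # assumed parameter
        "json": 1,            # assumed parameter
        "sort_by": sort_by,
        "page_size": page_size,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


by_popularity = build_search_url("oat milk")
by_last_modified = build_search_url("oat milk", sort_by="last_modified_t")
```

Fetching both URLs for the same handful of queries and comparing the first page of results would be a cheap way to judge whether unique_scans_n helps.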

@davidhealey
Owner

  • The typeahead feature could interfere with searching for local food items. When I'm just typing in the search field to look for an item from the local foodslist, it could be annoying to be presented with search suggestions from OFF.

Yes I think I would want an option to disable this feature, although it could also be used to typeahead in your local DB as well as the local OFF DB.
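Roughly what I have in mind, as a sketch only: one sorted prefix index over both sources, with a flag to leave the OFF entries out. The class name, the `include_off` flag, and the "local"/"off" labels are all made up for the example:

```python
import bisect


class Typeahead:
    """Prefix suggestions over local food names and (optionally) offline OFF names."""

    def __init__(self, local_names, off_names, include_off=True):
        entries = [(name.lower(), name, "local") for name in local_names]
        if include_off:
            entries += [(name.lower(), name, "off") for name in off_names]
        # Sorting by lowercase key makes all names sharing a prefix contiguous.
        self._entries = sorted(entries)
        self._keys = [entry[0] for entry in self._entries]

    def suggest(self, prefix, limit=5):
        """Return up to `limit` (name, source) pairs whose name starts with `prefix`."""
        prefix = prefix.lower()
        if not prefix:
            return []
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key, name, source in self._entries[start:]:
            if not key.startswith(prefix):
                break
            matches.append((name, source))
            if len(matches) == limit:
                break
        return matches


local = ["Oat milk (homemade)", "Rice"]
off = ["Oatmeal cookies", "Oat bran", "Orange juice"]
with_off = Typeahead(local, off).suggest("oat")
local_only = Typeahead(local, off, include_off=False).suggest("oat")
```

This only matches the start of the full name, not the start of each word, so a real version would probably need to index individual words as well.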

I think it would be worth experimenting with other values such as unique_scans_n and see if that improves the search output.

Yeah we can play around with it.

@simonj222
Author

simonj222 commented Jul 27, 2022

I'm afraid this whole thing will just turn out to be a huge amount of work, but in the end lead to no substantial improvement in the user experience.

That's a very valid concern, and I agree with the risk. I built an offline search functionality, and was able to see a real improvement for my queries. However, the technical complexity may not be worth it. I'm treating this as an experiment with a high chance of failure :)

we also need the product brand (not just to display it, but also for searching), and the product image (for the thumbnail)
...
There are some food items that have no calories, but should still be included in the search results, e.g. dietary supplements

Great points - given this would be an optional download, size becomes less of a concern. I'm investigating an alternative approach: a much larger file that contains everything needed for both searching and the item detail view (i.e., all nutritional information). This avoids the complexity of a second fetch.

The data grows to ~160MB (after stripping out unneeded fields + using parquet + gzip'ing). Large, but probably acceptable for people who want this functionality. I'll continue playing around with this and see how the integration would look.

Oh, that's interesting. But it's currently set to last_modified_t. I think it would be worth experimenting with other values such as unique_scans_n and see if that improves the search output.

+1, I think that's a great idea.

@simonj222
Author

Quick update here - I've been playing around with getting a local index, but Cordova doesn't make it easy. Since we want to do something smart, IndexedDB isn't really sufficient for our needs.

I'm instead looking into creating a new OFF API that would support this use case. It would hopefully help other developers too. I'll circle back here if I have any luck. For the moment, however, I'll close this issue.

@teolemon

ahaha :-)
@simonj222 I'm now connecting the dots :-)
I was retesting the latest version of Waistline, given the current MyFitnessPal apocalypse, and I found this issue :-P

@simonj2

simonj2 commented Aug 27, 2022

@teolemon - ahaha, small world! :)

For anyone else following along - the new API is a WIP at: https://github.com/openfoodfacts/openfoodfacts-search

@davidhealey
Owner

MyFitnessPal apocalypse

@teolemon Tell me more

@teolemon

https://www.theverge.com/2022/8/25/23321408/myfitnesspal-weight-loss-app-barcode-scanning-premium-paywall

@teolemon

People are pissed, cf Twitter

@davidhealey
Owner

That's a crazy move, I expect they'll make a u-turn on it. Hopefully more users will move to free alternatives like Waistline.

@jncosideout

I will certainly tell my friends about the My Fitness Pal Apocalypse and use that to pitch Waistline to them 😄
@teolemon

@Kallinteris-Andreas

Hey, this doesn't appear to be actively developed, but I would like to add (as a user):

  • offline databases are great for privacy (no need to tell a third party what you are eating/thinking of eating)
  • also for privacy, this allows you to use airplane mode and avoid location tracking (from cellular towers)

@teolemon

teolemon commented Mar 3, 2024

FYI, the new search API is now live at https://search.openfoodfacts.org/docs

@davidhealey
Owner

FYI, the new search API is now live at https://search.openfoodfacts.org/docs

Is this a breaking change for existing apps?

@teolemon

teolemon commented Mar 3, 2024

The two APIs will coexist, but updating to the new one is recommended.
