Full-text search engine - Golem Network Beta.2 bounty #1457

Closed
mat7ias opened this issue Jun 28, 2021 · 16 comments

Comments

@mat7ias
Contributor

mat7ias commented Jun 28, 2021

Golem Network is a cloud computing power service where everyone can develop, manage and execute workloads in an unstoppable, inexpensive and censorship-free environment.

Since Beta.2 Golem supports a new model of computation – services. In contrast with batch tasks, services are expected to be long-running processes that don't have any natural completion point but rather are started and stopped on explicit command. The goal of this project is to build a full-text search service on Golem. The service would allow its users to perform search queries over a corpus of documents submitted by the requestor during deployment.

Requirements

  • The service should be compatible with yagna v0.7.1.
  • Requestor is expected to provide a set of plain-text documents to be indexed during service startup.
  • Once the indexing is done, the service is expected to respond to search queries.
    • Query input could be any arbitrary string to be found in the corpus.
    • Query output should include a list of all occurrences of the searched string in the corpus, i.e.: filename, line, and position.
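As an illustration of the expected query output, here is a minimal sketch of a plain substring search that reports filename, line, and position. It is not taken from any submission; the function name `find_occurrences`, the dict-based corpus, and the choice of 1-based lines with 0-based positions are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Occurrence = Tuple[str, int, int]  # (filename, 1-based line, 0-based position)


def find_occurrences(corpus: Dict[str, str], query: str) -> List[Occurrence]:
    """Return every occurrence of `query` as (filename, line, position)."""
    results: List[Occurrence] = []
    if not query:
        return results
    for filename in sorted(corpus):
        for line_no, line in enumerate(corpus[filename].splitlines(), start=1):
            start = line.find(query)
            while start != -1:
                results.append((filename, line_no, start))
                # Advance by one character so overlapping matches are reported too.
                start = line.find(query, start + 1)
    return results


# Hypothetical two-file corpus, keyed by filename.
corpus = {
    "a.txt": "golem network\nrun on golem",
    "b.txt": "no match here",
}
print(find_occurrences(corpus, "golem"))
# [('a.txt', 1, 0), ('a.txt', 2, 7)]
```

This sketch rescans the whole corpus on every query and does not find matches that span a line break; a real service would more likely rely on an index built once at startup.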

Non-requirements

  • The implementer doesn't have to build the search engine from scratch. Using existing full-text search libraries is allowed.
  • The service is not expected to apply any advanced search techniques such as boolean queries, wildcards, fuzzy search, etc.
  • No scalability is required; the implementer may put reasonable constraints on corpus size and query size.
  • No graphical user interface is required (although it is welcome).
  • No direct communication between the user and the provider is required. It is assumed that all communication with the service will be mediated by the requestor application.

Deliverables

  • A GitHub repository with the following:
    • The service code,
    • A test script that checks whether the service works as intended,
    • Basic usage instructions,
    • GPL license.
  • A video recording demonstrating the usage of the service.

Resources

Estimated time to allocate: 24 hours

Useful Links:
Bounties Blogpost (including things you need to know!): https://blog.golemproject.net/golem-network-beta-2-bounties/
Beta.2 Blogpost: https://blog.golemproject.net/beta-ii-release/
Docs: https://handbook.golem.network
Install video: https://www.youtube.com/watch?v=Wqm7j7CtQwM
In case you need support, we’re here for you, join our Discord: https://chat.golem.network
Golem Twitter - https://twitter.com/golemproject

@gitcoinbot

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


This issue now has a funding of 6000.0 GLM (1527.65 USD @ $0.25/GLM) attached to it.

@gitcoinbot

gitcoinbot commented Jun 28, 2021

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


Work has been started.

These users each claimed they can complete the work by 265 years, 4 months from now.
Please review their action plans below:

1) skotre has been approved to start work.

I will:

  1. Get all text files from the requestor.
  2. Get search query from the requestor.
  3. On provider(s), loop through all text files, see which words are the most common and where.
  4. Return text files in a specific order.

I will use Python to accomplish this.

Learn more on the Gitcoin Issue Details page.

@gitcoinbot

@skotre Hello from Gitcoin Core - are you still working on this issue? Please submit a WIP PR or comment back within the next 3 days or you will be removed from this ticket and it will be returned to an ‘Open’ status. Please let us know if you have questions!

  • reminder (3 days)
  • escalation to mods (6 days)


@skotre

skotre commented Jul 2, 2021

I am still working on the issue. I have a plan laid out and have started work on the Docker image, but I haven't tried turning it into a Golemized Docker image yet. I have also run service and task examples and digested a lot from the documentation.

@gitcoinbot

@skotre Hello from Gitcoin Core - are you still working on this issue? Please submit a WIP PR or comment back within the next 3 days or you will be removed from this ticket and it will be returned to an ‘Open’ status. Please let us know if you have questions!

  • reminder (3 days)
  • escalation to mods (6 days)


@mat7ias
Contributor Author

mat7ias commented Jul 8, 2021

@skotre You can ignore the "Warned for Abandonment of Bounty" notifications. I'm working with Gitcoin to snooze these warnings (I don't have access to do it on my own). For now, you can simply ignore them.

@mat7ias
Contributor Author

mat7ias commented Jul 21, 2021

@skotre How is your bounty coming along? Do you need any assistance?
We recently released a workshop related to services that might be helpful: https://blog.golemproject.net/developing-utilizing-the-golem-service-model/

@mat7ias
Contributor Author

mat7ias commented Jul 26, 2021

@skotre We haven't had a response from you, so for now we'll have to assume you're no longer actively working on this bounty.

@niklr

niklr commented Jul 27, 2021

Just submitted my work on Gitcoin. You can find it here: https://github.com/niklr/golem-fulltext-search

Let me know what you think @mat7ias

@mat7ias
Contributor Author

mat7ias commented Aug 11, 2021

Hi @niklr
Thanks for your patience. A colleague I needed to review the application with was on vacation and returned this week, so I now have some feedback. Are you able to address the points below?

  1. In the example code, Ctrl+C doesn't perform a graceful shutdown, so the requestor never pays for the job. This is visible even in the demo video (there is a "Terminating agreement" log and then the script ends).
  2. As far as I understand, the index is read again for every search. This is extremely inefficient. What we wanted is code that:
  • creates the index (this is done correctly)
  • holds the index in memory
  • performs lookups in this index without rereading it
    We understand this second point isn't explicitly stated in the requirements, but efficiency is a standard expectation for any search engine.
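To make the "build once, keep in memory, look up without rereading" pattern in point 2 concrete, here is a minimal sketch using only the Python standard library. It is not the submission's code: the class `InMemoryIndex`, its `build_count` counter, and the whole-word lookup are assumptions made for illustration, and a whole-word index is narrower than the arbitrary-string search the bounty asks for.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[str, int, int]  # (filename, 1-based line, 0-based position)


class InMemoryIndex:
    """Toy word index: built once, then queried without rescanning the corpus."""

    def __init__(self) -> None:
        self._postings: Dict[str, List[Posting]] = defaultdict(list)
        self.build_count = 0  # how many times the corpus has been scanned

    def build(self, corpus: Dict[str, str]) -> None:
        self._postings.clear()
        self.build_count += 1
        for filename, text in corpus.items():
            for line_no, line in enumerate(text.splitlines(), start=1):
                pos = 0
                for word in line.split():
                    # Locate this word at or after the end of the previous one.
                    pos = line.index(word, pos)
                    self._postings[word.lower()].append((filename, line_no, pos))
                    pos += len(word)

    def search(self, word: str) -> List[Posting]:
        # A dictionary lookup only; the documents are not read again.
        return list(self._postings.get(word.lower(), []))


index = InMemoryIndex()
index.build({"a.txt": "Golem network\nrun on golem"})
print(index.search("golem"), index.build_count)
# [('a.txt', 1, 0), ('a.txt', 2, 7)] 1
```

Repeated calls to `search` leave `build_count` at 1, which is the property being asked for. It only holds while the object lives in a single long-running process.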

Let me know your thoughts and if you're able to address those points. I have some more detailed remarks (below) but those above are the most important.

  3. requirements.txt: whoosh does not appear to be needed, as far as we can tell. Is it?
  4. requirements.txt: gvmkit-build is only used if the user wants to modify the image, but why would they want to do this? Instructions for building the image might not be needed in the README.
  5. ENTRYPOINT is currently ignored, so it could be removed from the Dockerfile.
  6. What is the purpose of the FtseService.shutdown method?
  7. Currently input is not async, so the whole of yapapi hangs on it (and e.g. the time limit doesn't work while waiting for input). An example implementation of an async input can be found in https://github.com/golemfactory/yapapi-service-manager/blob/master/examples/python_shell.py
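One common way to keep an asyncio event loop responsive while waiting for user input is to run the blocking read in a worker thread. The sketch below shows that general technique. It is not necessarily how the linked python_shell.py does it, and the names `async_input`, `demo`, `heartbeat`, and `slow_reader` are hypothetical.

```python
import asyncio
import time
from typing import Callable, Tuple


async def async_input(prompt: str = "", reader: Callable[[str], str] = input) -> str:
    """Run a blocking reader in a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, reader, prompt)


async def demo(reader: Callable[[str], str]) -> Tuple[str, int]:
    ticks = 0

    async def heartbeat() -> None:
        # Stands in for background work such as timeouts or status polling.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(heartbeat())
    answer = await async_input("query> ", reader)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return answer, ticks


def slow_reader(prompt: str) -> str:
    # Hypothetical stand-in for a user who takes 0.2 s to type a query.
    time.sleep(0.2)
    return "golem"


answer, ticks = asyncio.run(demo(slow_reader))
print(answer, ticks > 0)
```

With the default `reader=input`, the same coroutine reads from the terminal. A nonzero `ticks` shows that other coroutines kept running during the wait. One caveat: a read that has already started in the worker thread cannot be interrupted, so the program may still wait for one more line of input on exit.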

@niklr

niklr commented Aug 12, 2021

Hi @mat7ias

Thank you for the feedback. Let me try to address the mentioned points:

  1. Indeed, does this mean it is currently possible to leverage the Golem cloud computing power without paying for the job? This use case might not be representative, but the requestor was able to accomplish what he wanted. Can I resolve this by implementing the async approach you mentioned in 7?
  2. I tried to keep a reference to the index instance in a class variable (see niklr/golem-fulltext-search@7598900).
    This works fine when tested with test.py, but once deployed as an image the variable is no longer initialized when search is called. Do you have something else in mind?
yapapi.rest.activity.CommandExecutionError: Command '{'run': {'entry_point': '/golem/run/ftse.py', 'args': ('--search', 'golem'), 'capture': {'stdout': {'stream': {}}, 'stderr': {'stream': {}}}}}' failed on provider; message: 'ExeScript command exited with code 1'; stderr: 'Traceback (most recent call last):
  File "/golem/run/ftse.py", line 165, in <module>
    search(args.search)
  File "/golem/run/ftse.py", line 137, in search
    result = ftse.search(term)
  File "/golem/run/ftse.py", line 102, in search
    with self.ix.searcher() as searcher:
AttributeError: 'NoneType' object has no attribute 'searcher'
  3. It is only needed to run test.py for testing purposes without building/deploying an image.
  4. That depends on how you want the README to be structured. I was mainly addressing developers (it also makes it easier for me to get up to speed quickly after a 16-day break from this project ;)
  5. I will remove the ENTRYPOINT from the Dockerfile.
  6. If you mean the following lines, they are probably left over from a sample implementation, so I will remove them as well:
async def shutdown(self):
    # handler responsible for executing operations on shutdown
    yield self._ctx.commit()   
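On the AttributeError in point 2: one plausible reading of the traceback is that each `run` command starts /golem/run/ftse.py as a fresh Python process, so anything stored in a class variable by an earlier invocation is gone by the time `--search` runs. The sketch below reproduces that effect locally with two separate interpreter processes. The embedded script, the `Engine` class, and the `--index` flag are hypothetical stand-ins, not the project's code.

```python
import subprocess
import sys

# Hypothetical stand-in for a script that caches its index in a class variable.
SCRIPT = """
import sys

class Engine:
    ix = None  # class variable meant to hold the index

if sys.argv[1] == "--index":
    Engine.ix = {"golem": ["a.txt"]}
    print("indexed")
else:
    print("index present" if Engine.ix is not None else "index missing")
"""


def run(flag: str) -> str:
    """Execute the script in a brand-new interpreter, as a separate process."""
    done = subprocess.run(
        [sys.executable, "-c", SCRIPT, flag],
        capture_output=True, text=True, check=True,
    )
    return done.stdout.strip()


first = run("--index")
second = run("--search")
print(first, "/", second)
# indexed / index missing
```

If this reading is right, the index has to live either on disk (reloaded on every call) or in a process that stays alive between commands.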

@niklr

niklr commented Aug 22, 2021

Hi @mat7ias

All mentioned points should now be covered except for the second where I need your input. Thanks a lot.

@niklr

niklr commented Aug 22, 2021

With the help of Nebula from Discord, I have integrated rpyc, which makes it possible to start a server in a separate thread. This multi-threaded approach keeps the index in memory as requested.

The implementation can be found in the dev_rpyc branch. Let me know if this should be merged into main.
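The general idea of a server thread that holds the index in memory and answers lookups over RPC can be sketched with the Python standard library alone. The example below uses `xmlrpc` as a stand-in for rpyc, so it does not reflect the actual code in the dev_rpyc branch. The `INDEX` contents and the `search` function are hypothetical.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical index, built once when the server starts and then kept in memory.
INDEX = {"golem": [["a.txt", 1, 0], ["a.txt", 2, 7]]}


def search(term: str) -> list:
    """Look the term up in the index held in the server's memory."""
    return INDEX.get(term.lower(), [])


# Port 0 asks the operating system for any free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(search)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Each query is a short-lived client call; the index itself is never rebuilt.
client = ServerProxy(f"http://127.0.0.1:{port}")
hits = client.search("golem")
missing = client.search("yagna")

server.shutdown()
server.server_close()
print(hits, missing)
# [['a.txt', 1, 0], ['a.txt', 2, 7]] []
```

In a deployed service the server would presumably keep running inside the provider's container, with each later search command acting as a client. That split is an assumption here, not something confirmed by the thread.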

@cryptobench
Member

cryptobench commented Aug 26, 2021

@niklr Your work looks splendid. If you'd like, you can submit your work over on Gitcoin and we will sort out the payment for you.

(Mattias is on vacation at the moment, so I'm covering for him)

@niklr

niklr commented Aug 26, 2021

@cryptobench Awesome. I think I have already submitted on Gitcoin; see https://gitcoin.co/issue/golemfactory/yagna/1457/100026045

Let me know if something is missing.

@cryptobench
Member

Hi! Indeed you have - thanks a lot! Payout will happen as soon as possible.

@mat7ias mat7ias closed this as completed Oct 7, 2021