The way issues and PRs are scraped determines the quality of the search. Here are the improvements I noticed:
Use the url and anchor fields instead of just the url field, so that the same issue/PR is not repeated in the search bar results. I asked for this change in the main PR because, from my POV, it is essential to fix it before any integration.
hierarchy_radio_lvlX is never set. In the main documentation issue, I linked to a detailed explanation of how to imitate docs-scraper, and it covers hierarchy_radio_lvlX. This field is also important for the search. It should be fixed ASAP 🙂
If I'm not wrong, every single document has its content field filled (it is never set to null). It's not a big deal, but we don't want the content to be displayed every time: it depends on the search query.
In the current main docs, if I type "add documents", the search bar returns only a title because a title matches the search:
Note that the Add Documents page does contain "add documents" in its content (not only in its titles), but the search bar does not need to display it.
Again with the current main documentation, if I type "our solution", no title or subtitle matches, so the search bar returns the content:
But with the current search bar for GitHub issues, we always return the content: it's not necessary and "spoils" the results:
Here, if I type "new token", I want the search bar to return only the issue "Tracking issue: New tokenizer", without any content.
If I type "i agree", no title matches, so I do want this result:
How to fix that? When scraping a new issue, another document should be added with the same information but:
with content set to null
(with anchor set to null according to the first point)
(and with hierarchy_radio_lvlX set according to the 2nd point)
Nothing has to be removed. Only one additional document is required.
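To make the idea concrete, here is a minimal sketch of the extra document, not the scraper's actual code. It assumes a scraped document is a plain dict. The function names, the sample field values, and the `radio_field` / `radio_value` parameters are hypothetical. Which hierarchy_radio_lvlX level to set, and to what value, should follow the docs-scraper explanation.

```python
from typing import Any, Dict, List

Document = Dict[str, Any]


def title_only_document(doc: Document, radio_field: str, radio_value: str) -> Document:
    """Return a copy of `doc` intended to match on titles only."""
    extra = dict(doc)                 # same information as the scraped document
    extra["content"] = None           # 3rd point: nothing to display as content
    extra["anchor"] = None            # 1st point: points at the issue itself
    extra[radio_field] = radio_value  # 2nd point: assumed field name and value
    return extra


def documents_for_issue(scraped: Document, radio_field: str, radio_value: str) -> List[Document]:
    """Keep the scraped document untouched and add exactly one more."""
    return [scraped, title_only_document(scraped, radio_field, radio_value)]


if __name__ == "__main__":
    # Hypothetical sample; the real field set depends on the scraper.
    scraped = {
        "url": "https://example.com/issues/1",
        "anchor": "comment-1",
        "content": "I agree",
        "hierarchy_lvl1": "Tracking issue: New tokenizer",
    }
    docs = documents_for_issue(
        scraped, "hierarchy_radio_lvl1", "Tracking issue: New tokenizer"
    )
    for d in docs:
        print(d["anchor"], d["content"])
```

If documents carry a unique identifier, the copy would need its own. That detail is left out here because the identifier field is not described in this issue.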
I see a document with its content set to "" (not null). Maybe there are others.
We should investigate this and either set the content to null or fill it with the right content. I noticed that the issue has no description. In that case, only a document with content set to null should be added (related to the 3rd point).
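One possible way to avoid "" ending up in the index is to normalize the content before building the document. This is only a sketch: `normalize_content` is a hypothetical helper, and treating whitespace-only text as empty is my assumption (the case observed so far is only "").

```python
import json
from typing import Optional


def normalize_content(raw: Optional[str]) -> Optional[str]:
    """Map missing, empty, or whitespace-only content to None."""
    if raw is None:
        return None
    stripped = raw.strip()
    return stripped if stripped else None


if __name__ == "__main__":
    # None is serialized as JSON null, which is what the index should receive.
    print(json.dumps({"content": normalize_content("")}))  # {"content": null}
```

With this in place, an issue without a description would naturally produce only the document whose content is null.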
NB
The 2nd and 3rd points are related and both improve the user experience, so they should be done in the same PR.