Find out why our linen archive of Slack discussions isn't indexed (or isn't ranking) #2802

Closed
stichbury opened this issue Jul 17, 2023 · 13 comments
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation

Comments

@stichbury
Contributor

We put in some time to kedro-org/kedro-devrel#84 and have a fair number of links to the archive now...but we are not seeing it in search results. Is there something up with linen?

@stichbury stichbury added the Component: Documentation 📄 Issue/PR for markdown and API documentation label Jul 17, 2023
@stichbury stichbury moved this to To Do in Kedro Framework Jul 17, 2023
@stichbury
Contributor Author

I emailed help@linen.dev today (18/07/2023) and will give them a while to get back to me. I've looked into other options but it's really hard to find any (SearchUnify allows you to index your Slack org but not for Google). Even if we were to convert a basic JSON export of Slack data into something we could upload ourselves for indexing, we'd be stretched. The best lead I got was from the Future of Coding community, who have had a couple of contributors build them indexing tools https://futureofcoding.org/community.html -- I have reached out to Kartik Agaram to find out more about his archives project. https://akkartik.name/

This is the email I sent to Linen:

I work in the Kedro team and our Slack community has a Linen archive at https://www.linen.dev/s/kedro

I’m concerned that it's not properly indexed by Google and wondered if you have any advice.

For example, I took a question from July 10th (which is archived here https://www.linen.dev/s/kedro/t/13149556/hi-folks-is-it-possible-to-define-multiple-types-of-base-dat#a6d24a4d-a198-48d1-b9bd-1699d9b55fd6)

"Is it possible to define multiple types of base datasets for PartitionedDataSet?"

Searching Google for this doesn't return any results from the Linen archive. If I add the word "Kedro" there's still nothing.

If I search "site:linen.dev/s/kedro Is it possible to define multiple types of base datasets for PartitionedDataSet?" I see the result returned, as in the screen grab. However, the link Google gives me doesn't take me to the question itself, but to the latest point in the archive for that channel (https://www.linen.dev/s/kedro/c/questions)

[Image: Screenshot 2023-07-18 at 13 29 44]

We have a number of links to the archive on our website, our blog, our documentation and on GitHub (repos, wiki). So I don't think it's that there are no backlinks. We've had these for some weeks now.

Please could you help me understand what is going on and how to get better indexing of the content, as it's not useful to our users right now. My ideal result is that the query (without any site qualification) would return the archive in the SERP, and the link would correctly link to the question and not the tip of the archive.
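To repeat the two checks described above (the plain query and the site-qualified one) as the archive changes, the search URLs can be built rather than typed by hand. This is only a small sketch: `google_search_url` is a hypothetical helper name, and it relies on nothing more than Google's public `q` query parameter and the `site:` operator already used in the email.

```python
from urllib.parse import urlencode


def google_search_url(query, site=None):
    """Build a Google search URL, optionally restricted with the site: operator."""
    if site:
        query = f"site:{site} {query}"
    return "https://www.google.com/search?" + urlencode({"q": query})


question = (
    "Is it possible to define multiple types of base datasets "
    "for PartitionedDataSet?"
)

# The unqualified query is the one that should ideally surface the archive.
print(google_search_url(question))
# The site-qualified query only confirms whether the page is indexed at all.
print(google_search_url(question, site="linen.dev/s/kedro"))
```

Opening both URLs side by side makes it easy to see whether a thread is indexed but not ranking, or not indexed at all.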

@astrojuanlu
Member

I think one very likely scenario is that, if we want certain things to be indexed, we'll have to transform them into Stack Overflow questions and answer them ourselves. This is not considered a bad practice (as long as we don't use GenAI) and will probably increase the visibility of Kedro in SO.

@stichbury
Contributor Author

True, although that is a big effort and we have a wealth of content on Slack, if only people had access to it from Google! Let's see if we get any leads from my efforts today.

@stichbury
Contributor Author

stichbury commented Jul 19, 2023

So far, nobody from Linen has got back to me with any help or information about why Google isn't indexing our Slack archive correctly.

I'm wondering if we should consider a different approach. I've done quite a bit of hunting around for alternatives to Linen but the options are limited. Kartik (who I mentioned above) shared a link to his repo with me. This is code that reads a Slack workspace and builds a set of static HTML pages. It's here: https://github.com/akkartik/foc-archive

It's not ideal but I'm wondering if there's scope for us to do this and publish on the Kedro website (with no links to it or particular formatting). We don't advertise that it's there, but somewhere on the kedro.org domain we have an equivalent to this (it's the future of coding Slack workspace archive).

The reasoning is that we have this online and indexed, so we flood Google with all the Q&A discussions about Kedro and people searching for answers get to see them. (THE WHOLE POINT OF LINEN). We'd add links through to slack.kedro.org onto the archive to push users browsing it to sign up and get the content from Slack, but at least we'd initially get the answers in front of them. If it's hidden in Slack, they can't see it's there at all.

The problem with it being static is that we would need to schedule regular rebuilds and republication of the archive. It's non-trivial. But I think it's worth considering from a discoverability point of view.
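To give a feel for the size of the static option, here is a minimal sketch in Python. It is not the foc-archive code. It assumes a standard Slack export (one folder per channel, one JSON file per day, each holding a list of messages with `ts`, `text`, `user` and, for threaded messages, `thread_ts`). The function names and the output layout are hypothetical, and user IDs are written out as-is rather than resolved to display names.

```python
import html
import json
from collections import defaultdict
from pathlib import Path

# Link back to the Slack sign-up page mentioned in this thread.
SIGNUP_URL = "https://slack.kedro.org"


def load_threads(channel_dir: Path) -> dict:
    """Group every message in one channel's export files by thread."""
    threads = defaultdict(list)
    for day_file in sorted(channel_dir.glob("*.json")):
        for message in json.loads(day_file.read_text(encoding="utf-8")):
            # Replies carry the parent's timestamp in "thread_ts";
            # unthreaded messages fall back to their own "ts".
            key = message.get("thread_ts", message["ts"])
            threads[key].append(message)
    for messages in threads.values():
        messages.sort(key=lambda m: float(m["ts"]))
    return dict(threads)


def render_thread(messages: list) -> str:
    """Render one thread as a minimal, crawlable HTML page."""
    title = html.escape(messages[0].get("text", "")[:80])
    items = "\n".join(
        f"<li><b>{html.escape(m.get('user', 'unknown'))}</b>: "
        f"{html.escape(m.get('text', ''))}</li>"
        for m in messages
    )
    return (
        "<!doctype html>\n"
        f'<html><head><meta charset="utf-8"><title>{title}</title></head>\n'
        f"<body><h1>{title}</h1>\n<ul>\n{items}\n</ul>\n"
        f'<p><a href="{SIGNUP_URL}">Join the Kedro Slack</a></p>\n'
        "</body></html>\n"
    )


def build_archive(export_dir: Path, out_dir: Path) -> list:
    """Write one HTML page per thread, in one folder per channel."""
    written = []
    for channel_dir in sorted(p for p in export_dir.iterdir() if p.is_dir()):
        target = out_dir / channel_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for thread_ts, messages in load_threads(channel_dir).items():
            page = target / f"{thread_ts.replace('.', '-')}.html"
            page.write_text(render_thread(messages), encoding="utf-8")
            written.append(page)
    return written
```

Calling `build_archive(Path("slack-export"), Path("site"))` from a scheduled job would cover the regular rebuild; publishing the output and handling Slack markup (mentions, links, emoji) are left out of this sketch.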

WDYT @astrojuanlu @tynandebold ?

Edit: Linen response

I think the best bet is to host it under your domain. This will give you the best bet at giving you these results. If you set up a custom domain Linen automatically redirects existing traffic to it. Google recently changed up their indexing algorithm to not index everything. We are working on changing up how Linen works in response. We're prioritizing longer threads and potentially not indexing more noisier chat messages (Hasn't shipped yet).

The main problem is filtering through all the noise and one thing we probably will do is index only longer threads first since that is where you have most of the information.
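If we built our own archive, the "longer threads first" idea could be approximated with a simple filter. The sketch below is not Linen's implementation, which the response doesn't describe. It assumes Slack-export-style message dictionaries with `ts` and optional `thread_ts` fields, and both the `longer_threads` name and the `min_messages` threshold are hypothetical.

```python
from collections import Counter


def longer_threads(messages, min_messages=3):
    """Return (thread_ts, size) pairs for threads with enough messages, longest first."""
    # A reply points at its parent through "thread_ts"; an unthreaded
    # message counts as a thread of one, keyed by its own "ts".
    sizes = Counter(m.get("thread_ts", m["ts"]) for m in messages)
    kept = [(ts, n) for ts, n in sizes.items() if n >= min_messages]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))
```

Only the threads returned here would be rendered or listed in a sitemap, which keeps one-line chat messages out of the index.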

@astrojuanlu
Member

Linen got back to us. I think the limitations are in two areas:

  • Having our own domain. That's easy, can be sorted out either on Linen or the solution you propose @stichbury.
  • The thread and content model. I think this might be the difficult part. Since Slack is a stream of text, I think Google might be deciding to rank it low or index it in a suboptimal way. In that sense I'm not sure having the text in our own system would help things.

@stichbury
Contributor Author

stichbury commented Jul 19, 2023

I don't see how the domain makes much difference to the indexing TBH, particularly since we don't have huge authority or strong backlinks as a project. I'm unconvinced but willing to try that option if it doesn't cost us anything, and I think we did previously discuss having a questions.kedro.org or community.kedro.org subdomain, so let's see if that helps before we pursue anything more complex.

I'll keep this issue open but raise a separate ticket over on the kedro-website repo.

@tynandebold
Member

So which of those two subdomains do we want to point to Linen, questions.kedro.org or community.kedro.org?

@astrojuanlu
Member

https://linen-slack.kedro.org/? (to mirror https://linen-discord.kedro.org/)

@tynandebold
Member

tynandebold commented Jul 31, 2023

Don't we already have that? I think I set this up a while ago, now that I think back to it.

Visit this: linen-slack.kedro.org, it actually works.

@stichbury
Contributor Author

Great. Now I think of it, there were no redirects to the custom domain before, but now it's working, so I guess we can update the docs/GitHub/blog posts etc. to point to the new location, give it a couple of weeks and see if it helps... I'll keep this open in the interim.
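A quick way to find the links that need updating is to scan each repo checkout for the old archive URL. This is a sketch under assumptions: `find_old_links` and the list of file suffixes are hypothetical, and it only reports matches rather than rewriting them, because the exact path mapping from linen.dev/s/kedro to linen-slack.kedro.org isn't spelled out in this thread.

```python
import re
from pathlib import Path

# Matches links to the old archive location, with or without "www.".
OLD_LINK = re.compile(r"https?://(?:www\.)?linen\.dev/s/kedro[^\s)>\"']*")


def find_old_links(root, suffixes=(".md", ".rst", ".html")):
    """Return (path, line number, url) for every old Linen link under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in OLD_LINK.finditer(line):
                hits.append((path, lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    for path, lineno, url in find_old_links("."):
        print(f"{path}:{lineno}: {url}")
```

Run from the root of a checkout, it prints one line per old link so each can be checked against the new domain before editing.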

@stichbury
Contributor Author

#2877 now ticketed for the work needed

@astrojuanlu
Member

Our community is now visible on the Linen main page https://www.linen.dev/ and at https://www.linen.dev/communities

@stichbury
Contributor Author

And the indexing does seem to be working! I tried "Is it possible to define multiple types of base datasets for PartitionedDataSet?" and our Linen archive was the first result. Closing this for now.


3 participants