Possible to duplicate/copy corpus #427

Open
adamveng opened this issue Oct 29, 2021 · 1 comment
Comments

@adamveng

This is for the purpose of experimenting with taking a "turn" in a corpus curation:
Is it possible to make a copy of an existing corpus and then continue working from that copy?
This would allow researchers to follow tangents and add new features to the curation. For example, I can currently trace how different actors link to each other, but I would ALSO like to incorporate the news articles that they link to. Making a copy would let me experiment without fear of making my existing corpus too messy by including a lot of new entities (e.g. the news articles). I don't know whether this is possible at all without exporting the CSV and then re-crawling all the imported URLs?

Hope it makes some kind of sense?

@boogheta
Member

boogheta commented Oct 29, 2021

Hi Adam,

It makes complete sense, and is something we really would want to be able to do... but...

Currently there is no easy way to do this and it can only be done manually.

If you control the server where Hyphe is running, the simplest solution is probably to make an identical copy by messing a bit with the databases:

  • stop hyphe
  • duplicate the corpus entry within the corpus collection of MongoDB's hyphe database, and change only the id in the copy from projectid to some cloneprojectid
  • duplicate the whole MongoDB database of the corpus (named hyphe-projectid) into another one named after the new id, such as hyphe-cloneprojectid
  • within the traph directory, copy the projectid directory to a new cloneprojectid directory
  • restart hyphe

This should do the trick.
A script was written a long time ago to do this here, but it hasn't been maintained or used in a while, so I'd recommend running the steps manually, one by one.
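
For illustration, the manual steps listed above could look roughly like this in Python. This is only a sketch under assumptions, not the old script mentioned above: it assumes the corpus id is stored in the `_id` field of the corpus entry, that `client` is a `pymongo.MongoClient` connected to Hyphe's MongoDB, and that `traph_root` is the path of the traph directory on the server. The function names are made up for the example. Hyphe itself should be stopped first, while MongoDB stays up.

```python
import copy
import shutil
from pathlib import Path


def clone_corpus_entry(entry, clone_id):
    """Return a deep copy of a corpus document with only its id changed."""
    clone = copy.deepcopy(entry)
    clone["_id"] = clone_id  # assumption: the corpus id lives in "_id"
    return clone


def copy_traph_dir(traph_root, project_id, clone_id):
    """Copy <traph_root>/<project_id> to <traph_root>/<clone_id>."""
    src = Path(traph_root) / project_id
    dst = Path(traph_root) / clone_id
    shutil.copytree(src, dst)  # fails if the destination already exists
    return dst


def clone_corpus(client, traph_root, project_id, clone_id):
    """Run the three copy steps; `client` is expected to be a pymongo.MongoClient."""
    # 1. Duplicate the corpus entry in the "corpus" collection of the "hyphe" database.
    corpora = client["hyphe"]["corpus"]
    entry = corpora.find_one({"_id": project_id})
    if entry is None:
        raise KeyError(f"no corpus entry with id {project_id!r}")
    corpora.insert_one(clone_corpus_entry(entry, clone_id))

    # 2. Duplicate the corpus database hyphe-<project_id> into hyphe-<clone_id>.
    #    Documents are copied collection by collection, each loaded in memory;
    #    indexes are NOT copied by this loop.
    src_db = client[f"hyphe-{project_id}"]
    dst_db = client[f"hyphe-{clone_id}"]
    for name in src_db.list_collection_names():
        if name.startswith("system."):
            continue
        docs = list(src_db[name].find())
        if docs:
            dst_db[name].insert_many(docs)

    # 3. Copy the traph directory of the corpus.
    copy_traph_dir(traph_root, project_id, clone_id)
```

For a large corpus, dumping and restoring the database with MongoDB's own tools is likely a better fit for step 2 than this in-memory loop, since it also carries the indexes over.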

Another solution, if this is not your own server, would be to do it programmatically using the exports and the API: first collect exports of all webentities of the corpus as well as all crawls (EXPORT & CRAWL/All crawl jobs pages), then create a new corpus with the same settings, then write a small program that calls Hyphe's API to declare within the new corpus the definitions of all webentities from the original corpus, and finally run all the same crawls. A simple shell client to the API is available in hyphe_backend/test_client.py and the full documentation of the API is there.
Of course this won't ensure perfect reproducibility since it will recrawl webpages at a different time.
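
As a rough starting point for such a program, here is a sketch that turns an exported webentities CSV into JSON-RPC request bodies, one per webentity. Everything specific in it is an assumption to check against the API documentation and the actual export: the column names, the separator between prefixes, the method name passed in by the caller, and above all the order of the arguments in `params`, which is purely hypothetical. It only builds the payloads and does not send anything.

```python
import csv
import io
import json


def read_webentities(csv_text, name_col, prefixes_col, sep=" "):
    """Yield (name, [prefixes]) pairs from the text of an exported CSV.

    The column names and the prefix separator are supplied by the caller
    because they depend on the export format (assumptions, not verified).
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        prefixes = [p for p in row[prefixes_col].split(sep) if p]
        yield row[name_col], prefixes


def build_declarations(csv_text, method, corpus_id, name_col, prefixes_col):
    """Build one JSON-RPC request body per webentity of the original corpus.

    `method` must be the declaration method named in the API documentation.
    The argument order [prefixes, name, corpus_id] is hypothetical and has
    to be adapted to the documented signature.
    """
    payloads = []
    rows = read_webentities(csv_text, name_col, prefixes_col)
    for request_id, (name, prefixes) in enumerate(rows):
        body = {
            "method": method,
            "params": [prefixes, name, corpus_id],
            "id": request_id,
        }
        payloads.append(json.dumps(body))
    return payloads
```

Each resulting string would then be POSTed to the API endpoint of the Hyphe instance, and the same pattern repeated for the crawl jobs once all webentities are declared.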

This is a feature we want to add to the interface, but it is quite complex and we have not taken the time to do it yet.
