Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EggStorage doesn't pick the latest version of the project with GIT #34

Closed
jnv opened this issue Jan 19, 2014 · 23 comments
Closed

EggStorage doesn't pick the latest version of the project with GIT #34

jnv opened this issue Jan 19, 2014 · 23 comments

Comments

@jnv
Copy link

jnv commented Jan 19, 2014

When I deploy a project with version = GIT and later schedule the job, scrapyd sometimes picks some older version of the project.

If I am correct, this is due to the fact, that FilesystemEggStorage sorts versions by eggs' filenames and then picks the last one. This is fine for Mercurial and timestamp versioning, since both are sequential. But Git hashes are not ordered, so sometimes the older project takes precedence.

Since scrapyd-deploy picks Git version with git describe --always, the revisions are apparently supposed to be sequentially tagged (like v1.0 and so), eggs will be then ordered correctly. I think this fact is worth mentioning in the documentation, don't you think?

@pablohoffman
Copy link
Member

Agreed. Maybe we should make the GIT version sortable including the number of commits for example?

@jnv
Copy link
Author

jnv commented Apr 25, 2014

Yeah, number of commits should work unless you retroactively squash some commits or reset the branch. Apparently git-describe's output includes number of commits since the last tag, but you need to have one. I think this should not be a problem for most projects, there just needs to be a big fat red warning in the documentation

@pablohoffman
Copy link
Member

Yeah, I was thinking something along the lines of git rev-list --count HEAD, and we definitely need to add a note to the documentation, warning about squashes.

@Blender3D
Copy link

Just ran into this problem today. Is there any reason to maintain multiple eggs in the first place? Scrapyd doesn't offer you any way to schedule a spider to run using a specific version of an egg, so would it be easier to just have a single egg per project (or a symlink to the last-uploaded one)?

@dangra
Copy link
Member

dangra commented May 7, 2014

I really simple workaround to this issue is to add an annotated tag to your git project, I tried this on https://github.com/scrapinghub/testspiders project:

$ git describe --always 
bd11f43
$ git tag -a v1.0 01f3d898cb99500bd933d729ca100396a27dd160
$ git describe --always 
v1.0-3-gbd11f43

the "3" above is the commit count since the last annotated tag, so it should work with scrapyd sorting issue.

@dangra
Copy link
Member

dangra commented May 7, 2014

hmm ta.. it won't work for a commit count greater than 9, sorry about that :)

@dangra
Copy link
Member

dangra commented May 7, 2014

what if instead of relying on version sorting we add a symlink that points to the latest uploaded egg?

@umrashrf
Copy link

umrashrf commented May 8, 2014

will sorting by modified time of egg files work? umrashrf@a5ab226

@dangra
Copy link
Member

dangra commented May 8, 2014

@umrashrf yes, mtime sorting and symlinking are basically the same, both ignore version sorting and rely on latest uploaded egg. I prefer symlinking because it is O(1) and you can easily see in the filesystem to what version it is pointing.

@jnv
Copy link
Author

jnv commented May 8, 2014

I am afraid mtime can be very brittle to rely on, as it may be easily modified by an outside process. Furthermore there is already a timestamp versioning strategy, which is simpler and more robust – the only thing it's missing is a deduplication, which is guaranteed by commit ID (IMO no big deal).

I agree with @dangra, symlinking may be the way to go, it could also be used by all strategies.

@umrashrf
Copy link

umrashrf commented May 8, 2014

Right! +1 for symlinking. I am sure it is not as easy as changing def list(...) :)

@pablohoffman
Copy link
Member

In practice, scrapyd support a single version. Should we simplify it and keep only one version? (with an atomic rename to prevent race conditions, ofc)

@jnv
Copy link
Author

jnv commented May 8, 2014

👍 for single version. Keep things simple.

@dangra
Copy link
Member

dangra commented May 8, 2014

hmm ta.. it won't work for a commit count greater than 9, sorry about that :)

I am re-regretting on myself here, using annotated tags works fine because scrapyd sorts the versions using distutils.version.LooseVersion

>>> LooseVersion('v1.0-2-b') < LooseVersion('v1.0-10-a')
True

@dangra
Copy link
Member

dangra commented May 8, 2014

I think this fact is worth mentioning in the documentation, don't you think?

👍 in summary I prefer to fix the issue in docs instead of changing the behavior for everyone else not using GIT.

@dangra
Copy link
Member

dangra commented May 9, 2014

Symlink idea implemented by #46. Feedback welcome!

@dangra
Copy link
Member

dangra commented May 9, 2014

👍 for single version. Keep things simple.

@jnv: The problem with single version is that you can't consider it a bugfix, it changes all scrapyd apis including public webservice api endpoints: listversions.json and deleteversion.json

I think versions were added to scrapyd so people can schedule jobs with specific versions, think about stable/testing/feature versions, but so far schedule.json api doesn't support passing a version so removing "versions" won't affect users (in theory).

@dangra
Copy link
Member

dangra commented May 9, 2014

I think this fact is worth mentioning in the documentation, don't you think?
in summary I prefer to fix the issue in docs instead of changing the behavior for everyone else not using GIT.

Implemented by #47

@umrashrf
Copy link

I noticed that if version = is set in scrapy.cfg which is empty string then time.time() is sent to scrapyd to sort egg files which will work like #46. Right?

@dangra
Copy link
Member

dangra commented May 12, 2014

I noticed that if version = is set in scrapy.cfg which is empty string then time.time() is sent to scrapyd to sort egg files which will work like #46. Right?

yes, it works like that because timetstamp versioning is LooseVersion sorteable.

@umrashrf
Copy link

It will work even if "LooseVersion sorteable" was not there because time is sent in string of int.

Do you think this will fix #45 until we have #46 and/or #47 merged?

@dangra
Copy link
Member

dangra commented May 12, 2014

to fix #45 and still use GIT friendly versioning you just do #34 (comment)

@dangra
Copy link
Member

dangra commented May 26, 2014

fixed by #47

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants