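Handle the case where the download page does not return the page content directly, but first returns a page that computes acw_sc__v2, adds it to the cookies, and then reloads automatically to fetch the normal page #55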
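Symptom
Today I found that under certain conditions (my current guess is that a link has been downloaded too many times), visiting a file page does not return the file information directly, so get_file_info_by_url cannot parse it.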
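Root cause analysis
Debugging showed that in this case the server returns an obfuscated page. That page computes an acw_sc__v2 value from a variable near the top of the page, adds it to the cookies, and then reloads automatically. Only after that does the server return the normal page.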
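To make that flow concrete, here is a minimal sketch of what a client has to do to mimic the browser: spot the challenge page, read the variable, set the cookie, and request the page again. It is not the code from this PR. The names extract_arg1 and get_page are hypothetical, the quoted-string format of arg1 is an assumption, and the actual cookie computation is passed in as compute_cookie rather than reproduced here. The session argument is assumed to behave like requests.Session (a get method and a cookies.set method).

```python
import re
from typing import Callable, Optional

# Assumption: the challenge page declares the variable as a quoted string,
# e.g.  var arg1='...';  Only the "var arg1=" prefix is confirmed by the PR.
ARG1_PATTERN = re.compile(r"var arg1=['\"]([^'\"]+)['\"]")


def extract_arg1(html: str) -> Optional[str]:
    """Return the arg1 value if html looks like the challenge page, else None."""
    match = ARG1_PATTERN.search(html)
    return match.group(1) if match else None


def get_page(session, url: str, compute_cookie: Callable[[str], str]) -> str:
    """Fetch url, answering the acw_sc__v2 challenge once if it appears.

    session: any object with .get(url) -> response with .text, and
             .cookies.set(name, value), such as requests.Session.
    compute_cookie: hypothetical callable that maps arg1 to the cookie value.
    """
    html = session.get(url).text
    arg1 = extract_arg1(html)
    if arg1 is None:
        # Normal page: nothing to do.
        return html
    # Challenge page: store the computed cookie, then request the page again.
    session.cookies.set("acw_sc__v2", compute_cookie(arg1))
    return session.get(url).text
```

Retrying only once keeps the sketch from looping if the computed cookie is rejected; a caller can check the returned HTML with extract_arg1 again to detect that case.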
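Solution
Searching for a distinctive keyword from the obfuscated page led to a reference: https://zhuanlan.zhihu.com/p/228507547 . Based on that article, I implemented the logic for computing this cookie. When the page contains the keyword
var arg1=
it is treated as this case: the cookie is computed and added to the session, and requesting the page again then returns the normal information.
Example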
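Open DevTools in Chrome and visit https://fzls.lanzous.com/iDBM7nbkzti (this may not reproduce; it seems to trigger only when the request is judged to come from a crawler).
The first request returns a page like the one below. The page is then requested again with an additional acw_sc__v2 field in the cookies, and the normal page is returned.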