
Does anybody else have this problem when using lncrawl in 69xinshu #2269

Closed
itsyahma opened this issue Feb 13, 2024 · 12 comments

Comments

@itsyahma

Let us know

Novel URL: https://www.69xinshu.com/book/9969673.htm
App Location: PIP | EXE | Discord | Telegram
App Version: x.y.z

Describe this issue

@itsyahma
Author

(screenshot attached: image_2024-02-13_201534450)

@itsyahma
Author

this is what appears

@camp00000

I fixed 69xinshu in #2256, which is currently not released yet. That branch fixes the first error; about the 403 you got, I'm not sure.
You can try running lncrawl with the --auto-proxy arg to see if that helps with the 403, but downloading won't work unless you use the dev branch or wait for the next release.
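
For reference, a minimal sketch of that invocation, using the URL from this issue (the -s and --auto-proxy flags also appear in the logs further down this thread):

$ lncrawl -s https://www.69xinshu.com/book/9969673.htm --auto-proxy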

@ncuxie

ncuxie commented Feb 13, 2024


I used the dev branch to download from this website and found a problem when downloading more than 650 chapters at once.
(screenshots attached)
The website denies access after that. It seems the anti-crawler protection has been upgraded. (:з」∠)

@camp00000

camp00000 commented Feb 13, 2024

Darn, that's not too good. I'll see if there's anything that can be done there.
I suppose for now you can try downloading in batches: you can select by chapter range, so enter 1-500, then 501-1000, and so on, which will probably bypass this if it's just a simple check (a sketch below).

And maybe combine that with --auto-proxy to get new source IPs for each download batch.

Let me know if that works. @ncuxie
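
A sketch of that batch approach, assuming chapter ranges can be selected non-interactively via a --range flag (check lncrawl --help for the exact option name in your version; otherwise just enter the same ranges at the interactive prompt):

$ lncrawl -s https://www.69xinshu.com/book/9969673.htm --range 1 500 --auto-proxy
$ lncrawl -s https://www.69xinshu.com/book/9969673.htm --range 501 1000 --auto-proxy
(and so on for the remaining chapters)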

@ncuxie

ncuxie commented Feb 15, 2024

(screenshot attached)
I found that it started getting errors from chapter 251, so I tried downloading only chapters 1-250 and didn't encounter any problems. However, when I tried to download a second time:

(screenshot attached)
it couldn't even fetch the chapter list, so I opened the website in a browser:

(screenshot attached)
This looks a bit troublesome. (:з」∠)

If the verification isn't completed within a certain period of time, the website becomes inaccessible.

But when reading novels normally you won't hit the verification even after reading more than 250 chapters in a day, so I guess the downloads may be too frequent. 🤔

I will try --auto-proxy later.

@camp00000

@camp00000

There's rate-limiting that can be done on the downloader side, but no way to enforce downloading only X chapters per day.

My hopes are currently on the --auto-proxy approach; IP reputation may or may not break that, but we'll see, I guess.

To note: if I understood correctly, auto-proxy makes the crawler cycle through proxies while downloading, so it may be possible to download an entire novel with lots of chapters at once with the auto-proxy option, given that this is actually what it does and the IPs aren't all/mostly banned already.

Let me know how it goes.

@ncuxie

ncuxie commented Feb 27, 2024

I can get chapters without --auto-proxy but not with --auto-proxy.

$ lncrawl -s https://www.69xinshu.com/book/40107.htm

===================================================

                 [#] Lightnovel Crawler v3.4.2
         https://github.com/dipu-bd/lightnovel-crawler

---------------------------------------------------------------------------------------

-> Press Ctrl + C to exit

Retrieving novel info...

📒 从时间停止开始纵横诸天
14 volumes and 1357 chapters found.
🔗 https://www.69xinshu.com/book/40107.htm

? Enter output directory: C:\Users\XIE\Lightnovels\www-69xinshu-com\Cong Shi Jian Ting Zhi Kai Shi Zong Heng Zhu Tian


$ lncrawl -s https://www.69xinshu.com/book/40107.htm --auto-proxy

===================================================

                 [#] Lightnovel Crawler v3.4.2
         https://github.com/dipu-bd/lightnovel-crawler

---------------------------------------------------------------------------------------
Sources: 100%|█████████████████████| 24/24 [00:03<00:00, 6.20file/s]

-> Press Ctrl + C to exit

Retrieving novel info...
Exception in thread Thread-4:
Traceback (most recent call last):
  File "D:\anaconda3\lib\threading.py", line 980, in _bootstrap_inner
    self.run()
  File "D:\anaconda3\lib\threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\XIE\.lncrawl\sources\zh\69shuba.py", line 70, in read_novel_info
    soup = self.get_soup(self.novel_url, encoding="gbk")
  File "D:\anaconda3\lib\site-packages\lncrawl\core\scraper.py", line 304, in get_soup
    response = self.get_response(url, **kwargs)
  File "D:\anaconda3\lib\site-packages\lncrawl\core\scraper.py", line 201, in get_response
    return self.__process_request(
  File "D:\anaconda3\lib\site-packages\lncrawl\core\scraper.py", line 130, in __process_request
    raise e
  File "D:\anaconda3\lib\site-packages\lncrawl\core\scraper.py", line 123, in __process_request
    response.raise_for_status()
  File "D:\anaconda3\lib\site-packages\requests\models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://www.69xinshu.com/book/40107.htm

! Error: No chapters found
<class 'Exception'>
File "D:\anaconda3\lib\site-packages\lncrawl\bots\console\integration.py", line 107, in start
raise e
File "D:\anaconda3\lib\site-packages\lncrawl\bots\console\integration.py", line 101, in start
_download_novel()
File "D:\anaconda3\lib\site-packages\lncrawl\bots\console\integration.py", line 85, in _download_novel
self.app.get_novel_info()
File "D:\anaconda3\lib\site-packages\lncrawl\core\app.py", line 137, in get_novel_info
raise Exception("No chapters found")

----------------------------------------------------------------------
- https://github.com/dipu-bd/lightnovel-crawler/issues
======================================================================

@camp00000

It looks like some of the proxies are likely already on a blacklist or have a very bad IP reputation.

So the other somewhat simple way forward would be to find working proxies for 69xinshu and test them. Once you have a few suitable ones, you can put them in a custom proxies file and use it as described in the lncrawl help section:

--proxy-file FILE    Proxies as SCHEME://HOST:PORT@USER:PASSWORD format in each line. All except HOST are optional

to hopefully download everything at once (a sketch below).
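
A sketch of such a proxies file, with placeholder addresses (203.0.113.x is a reserved documentation range, not real proxies; per the help text above, everything except HOST is optional):

http://203.0.113.10:8080
socks5://203.0.113.11:1080
203.0.113.12:3128

$ lncrawl -s https://www.69xinshu.com/book/9969673.htm --proxy-file proxies.txt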

Otherwise you can slowly download part by part with your own IP; that might work given enough time, selecting at most a few hundred chapters per day. I suggest this route if you're fine with waiting a bit and downloading in parts. The EPUBs can always be concatenated into one big file with some tool later if you prefer it that way.

To make --auto-proxy viable as-is for this source, I think the whole proxy handling would need to be reworked to treat certain status codes (like the 401 Unauthorized above) as potential proxy issues instead of server/request issues. So that's not very feasible.

@wizerdo37

This site hosting the raws does not rate-limit downloads: https://www.ddxsss.com/

@camp00000

I checked, and lncrawl doesn't currently support this source, but if it indeed has no rate-limiting like 69xinshu's, it would be a viable alternative. The site structure looks relatively similar as well, so adding it shouldn't be too big of an issue.

I even found a novel with the same title as in the logs above, https://www.ddxsss.com/book/46000/ so the catalogs seem to overlap in that part as well.

If someone wants to create an issue to add this source I'll look into doing that later this week.

@camp00000

I actually went ahead and added the crawler already; it's currently a pull request, so once it's merged into dev you can test it by installing the newest dev version locally. #2287 (One way to do that is sketched below.)
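
A sketch of installing the dev branch, assuming a pip-based install (adjust for your environment; "dev" is the branch name mentioned above):

$ pip install -U git+https://github.com/dipu-bd/lightnovel-crawler.git@dev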

I was able to download 1.3k chapters at once without any significant issues. The chapters reported with HTTP 503 did have their content available in the output, so in those instances a request seems to have failed once out of the few retries each chapter gets, but there was no blocking from Cloudflare, captchas, or the like.

Retrieving novel info...

📒 从时间停止开始纵横诸天
14 volumes and 1357 chapters found.
🔗 https://www.ddxsss.com/book/46000

? Enter output directory: /home/.../lightnovel-crawler/Lightnovels/www-ddxsss-com/Cong Shi Jian Ting Zhi Kai Shi Zong Heng Zhu Tian
? Which chapters to download? Everything! (1357 chapters)
? 1357 chapters selected Continue
? Which output formats to create? [epub]
? How many files to generate? Pack everything into a single file
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/148.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/433.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/451.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/457.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/927.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/1135.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/1150.html
HTTPError: 503 Server Error: Service Unavailable for url: https://www.ddxsss.com/book/46000/1225.html
Chapters: 100%|███████████████| 1357/1357 [00:42<00:00, 32.19item/s]
  Images: 100%|███████████████| 1/1 [00:00<00:00,  8.48item/s]
Created: Cong Shi Jian Ting Zhi Kai Shi Zong Heng Zhu Tian c1-1357.epub
✨ Task completed  

@dipu-bd dipu-bd closed this as completed Feb 29, 2024