Add JAVDatabase actor scraping #1548
I was going to look at what would be needed to remove the native function. It looks fairly simple and might just need a couple of new functions that could also be useful elsewhere in the future. But to find some test cases, I looked up a list of popular trans JAV actors from xvideos. The first two that I found in javdatabase, Serina Tachibana and Kaoru Oshima, aren't tagged as Trans anyway, so I'm not sure how robust using the tags will be. My 2 cents, maybe 1 and a half: if you find a Trans tag, set the gender to Transgender Female, because you know they are; otherwise leave it blank, because you don't know either way. A blank gender is not uncommon: about a quarter of the actors from other scrapers in my database have one.
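The suggestion above could be sketched roughly as follows. This is only an illustration, not code from the PR: the function name `genderFromTags`, the plain string-slice input, and the case-insensitive substring match on "trans" are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// genderFromTags is a hypothetical helper (not part of this PR).
// It returns "Transgender Female" when any tag contains "trans"
// (case-insensitive) and an empty string otherwise, so an unknown
// gender stays blank instead of being guessed.
func genderFromTags(tags []string) string {
	for _, tag := range tags {
		if strings.Contains(strings.ToLower(tag), "trans") {
			return "Transgender Female"
		}
	}
	return ""
}

func main() {
	// Tag names are made up for illustration.
	fmt.Printf("%q\n", genderFromTags([]string{"Tall", "Transgender"}))
	fmt.Printf("%q\n", genderFromTags([]string{"Tall"}))
}
```

A plain substring match is deliberately loose; a real rule would probably want to match the site's exact tag text.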
Line 983 has me confused
siteDetails.SiteRules = append(siteDetails.SiteRules, GenericActorScraperRule{XbvrField: "image_url", Selector: `img[data-src^="https://www.javdatabase.com/idolimages/full/"]`, ResultType: "attr", Attribute: "data-src", PostProcessing: []PostProcessing{
{Function: "AbsoluteUrl"},
}})
siteDetails.SiteRules = append(siteDetails.SiteRules, GenericActorScraperRule{XbvrField: "images", Selector: `a[href^="https://pics.dmm.co.jp/digital/video/"]:not([href^="https://pics.dmm.co.jp/digital/video/mdj010/"])`, ResultType: "attr", Attribute: "href", PostProcessing: []PostProcessing{
Why is :not([href^="https://pics.dmm.co.jp/digital/video/mdj010/"]) here?
Take a look at this actress page: https://www.javdatabase.com/idols/yui-nagase/
It does contain "images", but they are placeholders. I wanted to exclude them directly via the selector, so I did. I know I could also have post-processed them.
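For comparison, the post-processing alternative mentioned above might look something like the sketch below. It is not the PR's code: `dropPlaceholders` is a hypothetical helper, and the file names in the example URLs are invented. Only the `mdj010` prefix comes from the selector under review.

```go
package main

import (
	"fmt"
	"strings"
)

// placeholderPrefix is the prefix excluded by the :not() selector above.
const placeholderPrefix = "https://pics.dmm.co.jp/digital/video/mdj010/"

// dropPlaceholders is a hypothetical post-processing step that removes
// every URL starting with placeholderPrefix and keeps the rest in order.
func dropPlaceholders(urls []string) []string {
	kept := make([]string, 0, len(urls))
	for _, u := range urls {
		if !strings.HasPrefix(u, placeholderPrefix) {
			kept = append(kept, u)
		}
	}
	return kept
}

func main() {
	// The file names below are made up for illustration.
	urls := []string{
		"https://pics.dmm.co.jp/digital/video/mdj010/mdj010pl.jpg",
		"https://pics.dmm.co.jp/digital/video/example001/example001pl.jpg",
	}
	fmt.Println(dropPlaceholders(urls))
}
```

Filtering in the selector keeps the rule declarative and avoids collecting values that are thrown away later, which is presumably why the selector form was chosen.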
Yeah, maybe I should just leave it empty. Otherwise we can't process it, because OnHTML is only called when the selector is found, and only in that case do we extract a value and run post-processing.
EDIT: Removed the native function.
I was thinking you would need to select an upper node, then add a simple function to do the DOM Find and another to do Blank if Empty, so you would set it if the DOM Find returned blank. But if you don't need them, I may just add them in a separate PR later. I'll also look at writing something up in the Wiki (which won't help you now); we didn't have it when the original mod was done.
Yeah, we could select the parent p HTML element and then just search for transgender. Will update it.
@toshski Made some changes. Female or Transgender Female is now written to the database.
…ueNotContains'. Gender is now correctly written to the database depending on the tags of the actress
@pl33x You added a substr function to this PR that isn't used anywhere. Can it be removed?
This PR adds the functionality to scrape actors from JAVDatabase. The following XbvrFields will be populated:
I also added the post-processing function "DOMNextText". It tries to locate the NextSibling, checks whether it is a text node, and if so extracts the text.
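The behaviour described for "DOMNextText" can be illustrated with a small standalone sketch. It is not the PR's implementation: the `node` type is a hypothetical, minimal stand-in for a real HTML DOM node so the example runs with the standard library alone, and trimming the extracted text is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

type nodeType int

const (
	textNode nodeType = iota
	elementNode
)

// node is a hypothetical, minimal stand-in for an HTML DOM node.
type node struct {
	Type        nodeType
	Data        string // text content for text nodes, tag name for elements
	NextSibling *node
}

// domNextText returns the trimmed text of n's next sibling when that
// sibling exists and is a text node; otherwise it returns "".
func domNextText(n *node) string {
	if n == nil || n.NextSibling == nil || n.NextSibling.Type != textNode {
		return ""
	}
	return strings.TrimSpace(n.NextSibling.Data)
}

func main() {
	// Models markup such as: <b>Height:</b> 160 cm<br>
	br := &node{Type: elementNode, Data: "br"}
	value := &node{Type: textNode, Data: " 160 cm", NextSibling: br}
	label := &node{Type: elementNode, Data: "b", NextSibling: value}

	fmt.Printf("%q\n", domNextText(label)) // sibling is a text node
	fmt.Printf("%q\n", domNextText(value)) // sibling is an element, so ""
}
```

Returning an empty string when the sibling is missing or is not a text node lets a rule fall through without writing a wrong value to the field.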