Add JAVDatabase actor scraping #1548

Merged: 11 commits merged on Jan 2, 2024

@pl33x pl33x commented Dec 30, 2023

This PR adds actor scraping from JAVDatabase. The following XbvrFields are populated:

  • image_url
  • images
  • biography
  • hair_color
  • birth_date
  • height
  • band_size
  • cup_size
  • waist_size
  • hip_size
  • aliases
  • gender
    • JAVDatabase only lists female actors, but some of them are tagged as transgender. So I added a native function that searches for the transgender tag. If it is found, we return Transgender Female (as on stashdb), otherwise Female.

I also added the post-processing function "DOMNextText". It tries to locate the NextSibling, checks whether it is a text node and, if so, extracts the text.
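A minimal sketch of the next-sibling lookup described above. It is not the code from this PR: `Node` and `NodeType` are simplified stand-ins that mirror the `Type`, `Data` and `NextSibling` fields of `html.Node` in `golang.org/x/net/html`, the function name `nextText` is hypothetical, and trimming the whitespace is an assumption.

```go
package main

import "strings"

// NodeType and Node are simplified stand-ins for the node types in
// golang.org/x/net/html, kept local so the sketch has no dependencies.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
)

type Node struct {
	Type        NodeType
	Data        string // tag name for elements, text content for text nodes
	NextSibling *Node
}

// nextText returns the trimmed text of the node that directly follows n.
// It returns "" when n is nil, has no next sibling, or the sibling is
// not a text node.
func nextText(n *Node) string {
	if n == nil || n.NextSibling == nil {
		return ""
	}
	if sib := n.NextSibling; sib.Type == TextNode {
		return strings.TrimSpace(sib.Data)
	}
	return ""
}
```

For markup shaped like `<b>Height:</b> 160 cm`, selecting the `b` element and calling `nextText` on it would yield `160 cm`. If the element is followed by another element instead of text, the result is empty.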

@toshski

toshski commented Dec 30, 2023

I was going to have a look at what would be needed to remove the native function. It looks fairly simple and might just need a couple of new functions that could also be useful elsewhere in the future.

But, to find some test cases, I looked up a list of popular Trans JAV actors from xvideos. The first 2 that I found in javdatabase, Serina Tachibana and Kaoru Oshima, aren't tagged as Trans anyway. So I'm not sure how robust using the tags will be.

My 2 cents, maybe 1 and a half: if you find a Trans tag, set it to Transgender Female because you know they are; otherwise leave it blank, because you don't know either way. That's not uncommon: about a quarter of the actors from other scrapers in my database have a blank gender.

@vt-idiot vt-idiot left a comment

Line 983 has me confused

siteDetails.SiteRules = append(siteDetails.SiteRules, GenericActorScraperRule{XbvrField: "image_url", Selector: `img[data-src^="https://www.javdatabase.com/idolimages/full/"]`, ResultType: "attr", Attribute: "data-src", PostProcessing: []PostProcessing{
{Function: "AbsoluteUrl"},
}})
siteDetails.SiteRules = append(siteDetails.SiteRules, GenericActorScraperRule{XbvrField: "images", Selector: `a[href^="https://pics.dmm.co.jp/digital/video/"]:not([href^="https://pics.dmm.co.jp/digital/video/mdj010/"])`, ResultType: "attr", Attribute: "href", PostProcessing: []PostProcessing{
Contributor

Why is the `:not([href^="https://pics.dmm.co.jp/digital/video/mdj010/"])` part here?

Contributor Author

Take a look at this actress page: https://www.javdatabase.com/idols/yui-nagase/
It does contain "images", but they are placeholders. I wanted to exclude them in the selector itself, so I did. I know I could also have filtered them out in post-processing.
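For readers less used to CSS attribute selectors, the rule keeps links whose `href` starts with the DMM video prefix and drops those that start with the `mdj010` placeholder prefix. A rough Go equivalent of that filter, as it might look if done in post-processing instead, is sketched below. The function names `keepImage` and `filterImages` are hypothetical and do not exist in the PR; the two prefixes are taken from the selector.

```go
package main

import "strings"

const (
	// Prefixes copied from the selector under review.
	coverPrefix       = "https://pics.dmm.co.jp/digital/video/"
	placeholderPrefix = "https://pics.dmm.co.jp/digital/video/mdj010/"
)

// keepImage mirrors a[href^=coverPrefix]:not([href^=placeholderPrefix]):
// the link must start with the cover prefix and must not start with the
// placeholder prefix.
func keepImage(href string) bool {
	return strings.HasPrefix(href, coverPrefix) &&
		!strings.HasPrefix(href, placeholderPrefix)
}

// filterImages returns only the links that keepImage accepts,
// preserving their original order.
func filterImages(hrefs []string) []string {
	var kept []string
	for _, h := range hrefs {
		if keepImage(h) {
			kept = append(kept, h)
		}
	}
	return kept
}
```

Doing the exclusion in the selector means the placeholder links never reach the post-processing chain at all, which is the trade-off the reply describes.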

@pl33x

pl33x commented Dec 30, 2023

> I was going to have a look at what would be needed to remove the native function. It looks fairly simple and might just need a couple of new functions that could also be useful elsewhere in the future.
>
> But, to find some test cases, I looked up a list of popular Trans JAV actors from xvideos. The first 2 that I found in javdatabase, Serina Tachibana and Kaoru Oshima, aren't tagged as Trans anyway. So I'm not sure how robust using the tags will be.
>
> My 2 cents, maybe 1 and a half: if you find a Trans tag, set it to Transgender Female because you know they are; otherwise leave it blank, because you don't know either way. That's not uncommon: about a quarter of the actors from other scrapers in my database have a blank gender.

Yeah, maybe I should just leave it empty. Otherwise we can't process it, because OnHTML is only called when the selector matches, and only then do we extract a value and run post-processing.

EDIT: Removed the native function

@toshski

toshski commented Dec 30, 2023

> Yeah, maybe I should just leave it empty. Otherwise we can't process it, because OnHTML is only called when the selector matches, and only then do we extract a value and run post-processing.

I was thinking you would need to select an upper node, then add a simple function to do the DOM find and another to do blank-if-empty, so you would set it if the DOM find returned blank. But if you don't need them, I may just add them in a separate PR later.

I'll also look at writing something up in the wiki (which won't help you now); we didn't have it when the original mod was done.

@pl33x

pl33x commented Dec 30, 2023

> > Yeah, maybe I should just leave it empty. Otherwise we can't process it, because OnHTML is only called when the selector matches, and only then do we extract a value and run post-processing.
>
> I was thinking you would need to select an upper node, then add a simple function to do the DOM find and another to do blank-if-empty, so you would set it if the DOM find returned blank. But if you don't need them, I may just add them in a separate PR later.
>
> I'll also look at writing something up in the wiki (which won't help you now); we didn't have it when the original mod was done.

Yeah, we could select the parent p HTML element and then just search for transgender. Will update it.

@toshski I made some changes. Female or Transgender Female is now written to the database.
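To illustrate why selecting the parent element helps, here is a toy sketch of the callback problem discussed above. It does not use the real scraping library: `element` and `onMatch` are hypothetical stand-ins for a selector-driven callback such as OnHTML, and the selectors `p.tags` and `a.transgender-tag` are invented for the example. Only the two gender values come from this thread.

```go
package main

import "strings"

// element is a toy stand-in for a DOM element: the selector it would
// match and its text content.
type element struct {
	selector string
	text     string
}

// onMatch mimics a selector-driven callback: cb runs only for elements
// that match the selector, and never runs if nothing matches.
func onMatch(page []element, selector string, cb func(text string)) {
	for _, e := range page {
		if e.selector == selector {
			cb(e.text)
		}
	}
}

// genderFromTagOnly selects the tag itself. When the tag is absent the
// callback never fires, so no default can be written and the result
// stays blank.
func genderFromTagOnly(page []element) string {
	gender := ""
	onMatch(page, "a.transgender-tag", func(string) {
		gender = "Transgender Female"
	})
	return gender
}

// genderFromParent selects the enclosing paragraph, which is assumed to
// exist on every actor page, and searches its text for the tag.
func genderFromParent(page []element) string {
	gender := ""
	onMatch(page, "p.tags", func(text string) {
		if strings.Contains(strings.ToLower(text), "transgender") {
			gender = "Transgender Female"
		} else {
			gender = "Female"
		}
	})
	return gender
}
```

With the parent selected, the callback fires on every page that has the paragraph, so both outcomes can be written. Whether Female is a safe default is the open question raised earlier in the thread, since some actors are not tagged on the site.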

…ueNotContains'. Gender is now correctly written to the database depending on the tags of the actress
@crwxaj crwxaj merged commit ce68185 into xbapps:master Jan 2, 2024
1 check passed
@pl33x pl33x deleted the feature/javdatabase-scraping branch January 4, 2024 13:15
@theRealKLH

@pl33x You added a substr function to this PR that isn't used anywhere. Can it be removed?
