
feat(hbase): support gen HFile for hbase v2 (BETA) #358

Merged: 40 commits into apache:master, Nov 10, 2022

Conversation

haohao0103 (Contributor) commented Nov 7, 2022

close #357

1. Support writing vertices/edges directly to KV storage.
2. Only customString and customNumber IDs are supported for now.
3. Submit the loader code that bypasses the server for HBase writing.
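For context on point 1, HFile generation for bulk load requires the KeyValues to be written in ascending row-key order, so a direct loader must serialize IDs into order-preserving byte keys and sort them first. A minimal, hypothetical sketch (in Python, not this PR's actual Java code) of how customString and customNumber IDs could map to row keys:

```python
import struct

def rowkey_from_string_id(vid: str) -> bytes:
    # Custom string IDs: UTF-8 bytes already sort lexicographically.
    return vid.encode("utf-8")

def rowkey_from_number_id(vid: int) -> bytes:
    # Custom number IDs (non-negative): big-endian fixed width makes
    # numeric order match byte order, which sorted HFiles require.
    return struct.pack(">q", vid)

def build_sorted_kvs(pairs):
    # pairs: iterable of (rowkey_bytes, value_bytes); an HFile writer
    # expects the stream of KeyValues in ascending row-key order.
    return sorted(pairs, key=lambda kv: kv[0])
```

All function names here are illustrative; the loader's real serialization lives in the builder/direct-loader classes listed in the coverage report above.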

imbajin (Member) commented Nov 7, 2022

@JackyYangPassion Is this an improved part?

codecov bot commented Nov 7, 2022

Codecov Report

Merging #358 (e3c8a90) into master (c893f50) will decrease coverage by 2.37%.
The diff coverage is 6.92%.

@@             Coverage Diff              @@
##             master     #358      +/-   ##
============================================
- Coverage     64.82%   62.44%   -2.38%     
- Complexity     1851     1864      +13     
============================================
  Files           255      260       +5     
  Lines          9081     9462     +381     
  Branches        837      874      +37     
============================================
+ Hits           5887     5909      +22     
- Misses         2810     3169     +359     
  Partials        384      384              
Impacted Files Coverage Δ
...om/baidu/hugegraph/loader/builder/EdgeBuilder.java 67.74% <0.00%> (-25.60%) ⬇️
...baidu/hugegraph/loader/builder/ElementBuilder.java 89.71% <ø> (ø)
.../baidu/hugegraph/loader/builder/VertexBuilder.java 61.29% <0.00%> (-21.32%) ⬇️
...com/baidu/hugegraph/loader/constant/Constants.java 75.00% <ø> (ø)
...u/hugegraph/loader/direct/loader/DirectLoader.java 0.00% <0.00%> (ø)
...egraph/loader/direct/loader/HBaseDirectLoader.java 0.00% <0.00%> (ø)
...aidu/hugegraph/loader/direct/util/SinkToHBase.java 0.00% <0.00%> (ø)
...ugegraph/loader/metrics/LoadDistributeMetrics.java 0.00% <0.00%> (ø)
...u/hugegraph/loader/spark/HugeGraphSparkLoader.java 0.00% <0.00%> (ø)
...m/baidu/hugegraph/loader/executor/LoadOptions.java 70.40% <30.00%> (-4.60%) ⬇️
... and 5 more


@haohao0103 changed the title from "bypass server for hbase writing hugegraph-loader (BETA)" to "feat(hbase): support gen HFile for hbase(BETA)" on Nov 7, 2022
JackyYangPassion (Contributor) commented:

> @JackyYangPassion Is this an improved part?

1. It supports bulkload from Hive with the client-bypass-server feature.
2. This feature has already been launched; it solves the problem that importing large amounts of data through the API affects queries.

@imbajin added the "enhancement" (New feature or request) and "todo" labels on Nov 7, 2022
imbajin (Member) commented Nov 7, 2022

OK, I'll also mark it as to be reviewed.

And could you handle the third-party dependency check?

haohao0103 (Contributor, Author) commented:

1. The code style has been adjusted.
2. The third-party dependencies have been added to known-dependencies.txt.

@JackyYangPassion @javeme @imbajin

imbajin (Member) commented Nov 8, 2022

> 1. The code style has been adjusted.
> 2. The third-party dependencies have been added to known-dependencies.txt.

Thanks, the 3rd-party check seems to have failed. Need some help?

javeme (Contributor) left a comment


Thanks for your contribution~
Please also address the other comments: https://github.com/apache/incubator-hugegraph-toolchain/pull/358/files (search for "ago"), and also address the file LoadOptions.java.

imbajin previously approved these changes Nov 9, 2022
haohao0103 (Contributor, Author) commented:

@imbajin Hi, I can help solve the loader CI check failure.

imbajin (Member) commented Nov 9, 2022

> @imbajin Hi, I can help solve the loader CI check failure.

Thanks, I have already adopted the basic code; the current difference is:

expected:

{
    "version":"2.0",
    "structs":[
        {
            "id":"1",
            "skip":false,
            "input":{
                "type":"FILE",
                "path":"users.dat",
                "file_filter":{
                    "extensions":[
                        "*"
                    ]
                },
                "format":"TEXT",
                "delimiter":"::",
                "date_format":"yyyy-MM-dd HH:mm:ss",
                "time_zone":"GMT+8",
                "skipped_line":{
                    "regex":"(^#|^//).*|"
                },
                "compression":"NONE",
                "batch_size":500,
                "header":[
                    "UserID",
                    "Gender",
                    "Age",
                    "Occupation",
                    "Zip-code"
                ],
                "charset":"UTF-8",
                "list_format":null
            },
            "vertices":[
                {
                    "label":"user",
                    "skip":false,
                    "id":null,
                    "unfold":false,
                    "field_mapping":{
                        "UserID":"id"
                    },
                    "value_mapping":{

                    },
                    "selected":[

                    ],
                    "ignored":[
                        "Occupation",
                        "Zip-code",
                        "Gender",
                        "Age"
                    ],
                    "null_values":[
                        ""
                    ],
                    "update_strategies":{

                    },
                    "batch_size":500
                }
            ],
            "edges":[

            ]
        },
        {
            "id":"2",
            "skip":false,
            "input":{
                "type":"FILE",
                "path":"ratings.dat",
                "file_filter":{
                    "extensions":[
                        "*"
                    ]
                },
                "format":"TEXT",
                "delimiter":"::",
                "date_format":"yyyy-MM-dd HH:mm:ss",
                "time_zone":"GMT+8",
                "skipped_line":{
                    "regex":"(^#|^//).*|"
                },
                "compression":"NONE",
                "batch_size":500,
                "header":[
                    "UserID",
                    "MovieID",
                    "Rating",
                    "Timestamp"
                ],
                "charset":"UTF-8",
                "list_format":null
            },
            "vertices":[

            ],
            "edges":[
                {
                    "label":"rating",
                    "skip":false,
                    "source":[
                        "UserID"
                    ],
                    "unfold_source":false,
                    "target":[
                        "MovieID"
                    ],
                    "unfold_target":false,
                    "field_mapping":{
                        "UserID":"id",
                        "MovieID":"id",
                        "Rating":"rate"
                    },
                    "value_mapping":{

                    },
                    "selected":[

                    ],
                    "ignored":[
                        "Timestamp"
                    ],
                    "null_values":[
                        ""
                    ],
                    "update_strategies":{

                    },
                    "batch_size":500
                }
            ]
        }
    ]
}

actual: identical to the expected JSON above, except for one additional top-level field at the end of the object:

    "backendStoreInfo":null

It seems "backendStoreInfo":null is new; you could fix the other problems~
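The test above fails only because the serialized output gained a new null-valued field. One way such a fixture comparison can tolerate newly added null fields is to compare parsed JSON with nulls stripped, rather than raw strings. A hypothetical sketch, not the project's actual test code:

```python
import json

def json_equal_ignoring_null_fields(expected: str, actual: str) -> bool:
    # Hypothetical helper: parse both documents and drop every key whose
    # value is null before comparing, so an extra "backendStoreInfo":null
    # in the actual output does not break the assertion.
    def strip_nulls(obj):
        if isinstance(obj, dict):
            return {k: strip_nulls(v) for k, v in obj.items() if v is not None}
        if isinstance(obj, list):
            return [strip_nulls(v) for v in obj]
        return obj

    return strip_nulls(json.loads(expected)) == strip_nulls(json.loads(actual))
```

Note this also ignores nulls present in the expected document (e.g. "list_format":null), which is symmetric and keeps the comparison stable.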

haohao0103 (Contributor, Author) commented:

The storage-layer configuration that bulkLoad depends on is specified in struct.json, which is why backendStoreInfo was added. A follow-up iteration will obtain the storage-layer configuration from the server instead.
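For illustration, a struct.json carrying such storage-layer information might look like the fragment below. The field names inside backendStoreInfo are hypothetical placeholders, not confirmed by this PR:

```json
{
    "version": "2.0",
    "structs": [],
    "backendStoreInfo": {
        "vertex_tablename": "hbase_vertex_table",
        "edge_tablename": "hbase_edge_table",
        "hbase_zookeeper_quorum": "zk1,zk2,zk3",
        "hbase_zookeeper_property_clientPort": "2181"
    }
}
```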

imbajin (Member) commented Nov 9, 2022

> The storage-layer configuration that bulkLoad depends on is specified in struct.json, which is why backendStoreInfo was added. A follow-up iteration will obtain the storage-layer configuration from the server.

It's fine, just adopt it in the test 😄 (likewise for any other test problems, if they exist)

javeme previously approved these changes Nov 9, 2022
imbajin (Member) left a comment


Thanks, we could handle the 3rd-party dependencies together before the release (to avoid wasting a lot of time on them).

haohao0103 (Contributor, Author) commented:

> seems "backendStoreInfo":null is newly, other problems u could fix it~

Do I need to fix the failing 3rd-party dependency check? I believe many of the problems are caused by the hadoop-common upgrade from 3.2.4 to 3.3.1.

> thanks, we could handle the 3rd dependencies together before release (to avoid waste a lot time on it)

OK.

haohao0103 (Contributor, Author) commented Nov 10, 2022

> thanks, we could handle the 3rd dependencies together before release (to avoid waste a lot time on it)

Are many of the problems caused by the hadoop-common upgrade from 3.2.4 to 3.3.1?

@simon824 could you exclude it in the pom? (like #363)

@imbajin changed the title from "feat(hbase): support gen HFile for hbase(BETA)" to "feat(hbase): support gen HFile for hbase v2 (BETA)" on Nov 10, 2022
@imbajin merged commit a622f98 into apache:master on Nov 10, 2022
simon824 (Member) commented:

> many of the problems are caused by the hadoop-common upgrade from 3.2.4 to 3.3.1 ?
>
> @simon824 could u exclude it in pom? (like #363)

We can downgrade the version if necessary; the hadoop dependency seemingly cannot be excluded, since the loader needs it to read HDFS files.

haohao0103 (Contributor, Author) commented:

> We can downgrade the version if necessary, hadoop dependency seems can not be excluded, loader needs it to read hdfs files.

Yes, the loader needs the hadoop dependency. Internally, we read data from HDFS and load it into the graph.

Labels: enhancement (New feature or request), todo
Projects: none (Status: Done)
6 participants