[Feature][Connector-V2] Support use EasyExcel as read excel engine #8064

Merged
merged 86 commits into from
Dec 30, 2024

Contributor

@dwave dwave commented Nov 15, 2024

#8040

Purpose of this pull request

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

@Hisoka-X Hisoka-X changed the title [Hotfix][Connector-V2] ExcelReader read more than 65000 rows XSSFWorkbook will cause oom . so change POI to EasyExcel #8040 [Improve][Connector-V2] Change read excel util from POI to EasyExcel Nov 15, 2024
@@ -54,7 +55,7 @@ public class ExcelReadStrategyTest {

     @Test
     public void testExcelRead() throws IOException, URISyntaxException {
-        testExcelRead("/excel/test_read_excel.xlsx");
+        // testExcelRead("/excel/test_read_excel.xlsx");

Member

why disable this?

Contributor Author

The commented-out test reads this Excel file. The date string that needs to be converted is 2024/1/31, and the cell format is
{mso-generic-font-family:auto;
mso-font-charset:134;
mso-number-format:"yyyy/m/d"; }

In POI, we can get the correct data type from the cell's format, but EasyExcel only gives us a string. That string does not match the defined yyyy/MM/dd format, so converting it to a Date fails and the test case fails. That is why I commented out this one test case.
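
To make the mismatch concrete, here is a minimal sketch using `java.time` rather than SeaTunnel's own date utilities. The class name `ExcelDateParseSketch` and the `tryParse` helper are hypothetical, and the sketch assumes the expected pattern is `yyyy/MM/dd`. It shows that a string shaped like the cell's `yyyy/m/d` format is rejected by the two-digit pattern but accepted by `yyyy/M/d`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

public class ExcelDateParseSketch {
    // Pattern the test expects; MM and dd require exactly two digits.
    public static final DateTimeFormatter STRICT = DateTimeFormatter.ofPattern("yyyy/MM/dd");
    // Pattern matching the cell format yyyy/m/d; M and d accept one or two digits.
    public static final DateTimeFormatter LENIENT = DateTimeFormatter.ofPattern("yyyy/M/d");

    // Returns the parsed date, or empty if the text does not match the pattern.
    public static Optional<LocalDate> tryParse(String text, DateTimeFormatter formatter) {
        try {
            return Optional.of(LocalDate.parse(text, formatter));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String cellText = "2024/1/31"; // what a string-only reader hands back
        System.out.println("strict:  " + tryParse(cellText, STRICT));
        System.out.println("lenient: " + tryParse(cellText, LENIENT));
    }
}
```

With POI the cell's number format is available, so the reader can build a real date value instead of guessing a pattern from the text.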


Member

I think we should find a way to make sure the old behavior does not change, or add an option that lets users choose between POI and EasyExcel.

Contributor Author

Okay, I'll find a way to deal with it

@github-actions github-actions bot added the api label Nov 19, 2024
@corgy-w
Contributor

corgy-w commented Nov 19, 2024

@dwave dwave force-pushed the bugfix-large-excel branch from 380e82b to eca2c5d Compare November 20, 2024 03:14
@dwave
Contributor Author

dwave commented Nov 20, 2024

> https://github.com/apache/seatunnel/runs/33188901598 @dwave Please open ci workflow

Okay, it's already opened

Comment on lines +162 to +167

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>${easyexcel.version}</version>
</dependency>

Member

As we all know, EasyExcel is no longer maintained, so introducing it now does not seem like a good idea. We could try alternatives such as fastexcel; there are also reports online that it is faster than EasyExcel. What do you think? cc @hailin0

Contributor

or easyexcel-plus?

Contributor Author

I will give it a try

Contributor Author

> or easyexcel-plus?

easyexcel-plus only appeared on GitHub last night, and I haven't seen it in the Maven repository yet.

Contributor Author

@dwave dwave Nov 22, 2024

> As we all know, easyexcel is no longer maintained. It doesn't seem good to introduce it at this time. We can try other alternatives, such as fastexcel. There are also reports online that it is faster than easyexcel. What do you think? cc @hailin0

I tried fastexcel, but its support for the .xls format (Excel 97-2003) has a problem:

dhatim/fastexcel#287

Member

Contributor Author

I added a configuration item, excel_engine, that selects the engine used to read Excel files. POI is the default; EasyExcel can be enabled through this option.
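
As a rough illustration of how such a switch could behave, the sketch below resolves an engine from a plain configuration map and falls back to POI when `excel_engine` is absent. Only the option name `excel_engine` and the POI default come from this thread; the `ExcelEngine` enum, the accepted value strings, and the `resolve` helper are assumptions made for the example, not the connector's actual code.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;

public class ExcelEngineOptionSketch {
    // Hypothetical enum; the real connector may name its values differently.
    public enum ExcelEngine { POI, EASY_EXCEL }

    public static final String OPTION_KEY = "excel_engine";

    // Picks the read engine, defaulting to POI so existing jobs keep their behavior.
    public static ExcelEngine resolve(Map<String, String> config) {
        String raw = config.get(OPTION_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return ExcelEngine.POI;
        }
        // Accept "EasyExcel", "easy_excel", "EASYEXCEL" and similar spellings (an assumption).
        String normalized = raw.trim().toUpperCase(Locale.ROOT).replace("_", "");
        switch (normalized) {
            case "POI":
                return ExcelEngine.POI;
            case "EASYEXCEL":
                return ExcelEngine.EASY_EXCEL;
            default:
                throw new IllegalArgumentException("Unsupported " + OPTION_KEY + ": " + raw);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(Collections.emptyMap()));
        System.out.println(resolve(Collections.singletonMap(OPTION_KEY, "EasyExcel")));
    }
}
```

Defaulting to POI addresses the earlier review concern: jobs that do not set the option keep the old date handling, and only users who opt in get the EasyExcel reader.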

Contributor Author

During CI, test cases in other modules are failing, which causes the CI run to fail.

Contributor

> In the CI phase, there are errors in the test cases of other modules, causing CI to fail

Retry. Many of these failures go away on a retry [/dog]

Contributor Author

Finally succeeded

dwave and others added 2 commits November 21, 2024 11:26
…/main/java/org/apache/seatunnel/connectors/seatunnel/file/excel/ExcelReaderListener.java

Co-authored-by: corgy-w <73771213+corgy-w@users.noreply.github.com>
# Conflicts:
#	seatunnel-common/src/main/java/org/apache/seatunnel/common/utils/DateUtils.java
#	seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/ExcelReadStrategy.java
#	seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/Reader/ExcelReadStrategyTest.java
@github-actions github-actions bot added file and removed dependencies, CI&CD, core, Zeta, transform-v2, Zeta Rest API, e2e, format labels Dec 24, 2024
return null;
}

@SneakyThrows

Contributor

This annotation cannot ignore all errors
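
For context, Lombok's `@SneakyThrows` does not suppress exceptions; it only lets a checked exception propagate without being declared. One alternative, sketched here in plain Java with hypothetical names (`CellSource`, `readCell`), is to catch the checked exception and rethrow it with context so the failure stays visible to callers. This is an illustration of the general idea, not the change that was made in this PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class ExplicitErrorHandlingSketch {
    // Hypothetical stand-in for a reader call that can fail with a checked exception.
    public interface CellSource {
        String read(int row) throws IOException;
    }

    // Wraps the checked exception instead of hiding it behind an annotation.
    public static String readCell(CellSource source, int row) {
        try {
            return source.read(row);
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to read row " + row, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readCell(r -> "value-" + r, 3));
        try {
            readCell(r -> { throw new IOException("corrupt sheet"); }, 7);
        } catch (UncheckedIOException e) {
            System.out.println(e.getMessage() + " <- " + e.getCause().getMessage());
        }
    }
}
```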

@hailin0 hailin0 merged commit b8e1177 into apache:dev Dec 30, 2024
5 checks passed
4 participants