Implement HDFS list status iterator #18295

Merged

JiamingMai
Contributor

When an HDFS directory contains a large number of files, a listStatus request takes a large amount of time and memory to complete, and it sometimes fails with an out-of-memory error (OOM). This PR adds an iterator to the HDFS under file system so that files can be listed through iteration instead of a single listStatus call.
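
To illustrate the idea, here is a minimal sketch of a lazy, breadth-first listing iterator. It is not the PR's `HdfsUfsStatusIterator`: the in-memory `Map` is a hypothetical stand-in for the under file system, paths are plain strings, and the class and method names are made up for this example. The one detail borrowed from the diff is the queue of directories still to be processed (`mDirPathsToProcess` in the PR). Only one directory is expanded at a time, so the caller never receives the whole tree in a single collection.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

/** Illustrative only: lazy breadth-first listing over a toy in-memory tree. */
final class LazyListingSketch implements Iterator<String> {
  // Hypothetical stand-in for the UFS: directory path -> direct children.
  private final Map<String, List<String>> mChildren;
  // Directories that were discovered but not expanded yet.
  private final Deque<String> mDirsToProcess = new ArrayDeque<>();
  // Entries of the directory that is currently being drained.
  private Iterator<String> mCurrent = Collections.emptyIterator();

  LazyListingSketch(Map<String, List<String>> children, String root) {
    mChildren = children;
    mDirsToProcess.addLast(root);
  }

  @Override
  public boolean hasNext() {
    // Expand queued directories one at a time until an entry is available.
    while (!mCurrent.hasNext() && !mDirsToProcess.isEmpty()) {
      String dir = mDirsToProcess.removeFirst();
      mCurrent = mChildren.getOrDefault(dir, Collections.<String>emptyList()).iterator();
    }
    return mCurrent.hasNext();
  }

  @Override
  public String next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    String path = mCurrent.next();
    if (mChildren.containsKey(path)) {
      // A directory: return it now and queue it so its children come later.
      mDirsToProcess.addLast(path);
    }
    return path;
  }

  /** Demo helper: drains the iterator into a list (a real caller would stream instead). */
  static List<String> listAll(Map<String, List<String>> children, String root) {
    List<String> out = new ArrayList<>();
    Iterator<String> it = new LazyListingSketch(children, root);
    while (it.hasNext()) {
      out.add(it.next());
    }
    return out;
  }
}
```

In a real under file system the per-directory listing would come from the storage client rather than a map, and each entry would carry status metadata rather than a bare path.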

@JiamingMai JiamingMai added the type-feature This issue is a feature request label Oct 18, 2023
@JiamingMai JiamingMai self-assigned this Oct 18, 2023
@JiamingMai JiamingMai force-pushed the add-list-status-iterator-for-hdfs branch from 755e843 to 8ee2191 Compare October 18, 2023 17:11
@jja725 jja725 requested a review from elega October 18, 2023 23:26
/**
 * HDFS under file system status iterator.
 */
public class HdfsUfsStatusIterator implements Iterator<UfsStatus> {

Contributor

What's the HDFS version compatibility of this?

Contributor

Can you try building your code with HDFS 2 to see if it builds?

Contributor Author

I can build the code successfully with HDFS 2.7.2.

Contributor Author

I have tested connecting to an HDFS node (2.7.2) with this PR's code. It works.

} else {
  ufsStatus = new UfsFileStatus(path.getName(), alluxioUri.hash(), fileStatus.getLen(),
      fileStatus.getModificationTime(), fileStatus.getOwner(), fileStatus.getGroup(),
      fileStatus.getPermission().toShort(), mUserBlockSizeBytesDefault);

Contributor

Shouldn't the block size be the HDFS block size instead of the Alluxio one?

Contributor

+1

Contributor Author

Updated. I used HDFS block size instead.
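
The point of this thread can be sketched as follows. This is a hedged illustration, not the PR's code: `FileMeta` is a hypothetical stand-in for the per-file metadata (in Hadoop's API the per-file value is available from `FileStatus.getBlockSize()`), and the fallback to a configured default when no size is reported is an assumption made for this example. The block-count helper shows why the choice matters: the same file maps to a different number of blocks depending on which size is recorded.

```java
/** Illustrative only: prefer the block size reported by the under storage. */
final class BlockSizeSketch {
  /** Hypothetical stand-in for per-file metadata returned by the under storage. */
  static final class FileMeta {
    final String mName;
    final long mLength;
    // Block size reported by the under storage; 0 or less means "unknown" here.
    final long mReportedBlockSize;

    FileMeta(String name, long length, long reportedBlockSize) {
      mName = name;
      mLength = length;
      mReportedBlockSize = reportedBlockSize;
    }
  }

  /** Uses the reported size when present, else the configured default (assumed policy). */
  static long resolveBlockSize(FileMeta meta, long configuredDefault) {
    return meta.mReportedBlockSize > 0 ? meta.mReportedBlockSize : configuredDefault;
  }

  /** Number of blocks needed for a file of the given length (ceiling division). */
  static long blockCount(long length, long blockSize) {
    return (length + blockSize - 1) / blockSize;
  }
}
```

For a 300 MiB file, a reported block size of 128 MiB gives 3 blocks, while a 64 MiB default would give 5, so recording the wrong size misdescribes the file's layout.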

      fileStatus.getModificationTime());
  mDirPathsToProcess.addLast(new Pair<>(path.toString(), ufsStatus));
} else {
  ufsStatus = new UfsFileStatus(path.getName(), alluxioUri.hash(), fileStatus.getLen(),

Contributor

Use approximateContentHash to keep this consistent.

Contributor Author

Updated. I used approximateContentHash instead.
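
The thread does not show how `approximateContentHash` is computed, so the following is only a sketch of the general technique, under the assumption that an approximate content hash is a fingerprint derived from cheap metadata (file length and modification time) rather than from the file bytes. The class name, method name, and key format are hypothetical and are not Alluxio's implementation. Unlike a hash of the URI, such a fingerprint changes when the file is rewritten with a different size or timestamp.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: a metadata-based fingerprint, not Alluxio's helper. */
final class ApproxHashSketch {
  /** Hashes length and modification time into a 32-character hex string. */
  static String approximateHash(long length, long modTimeMs) {
    // Hypothetical key format; any stable encoding of the two values would do.
    String key = "(len:" + length + ", modtime:" + modTimeMs + ")";
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // MD5 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException("MD5 unavailable", e);
    }
  }
}
```

The fingerprint is approximate by design: two different contents with the same length and modification time would collide, which is the trade-off for not reading the file.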

} else {
  ufsStatus = new UfsFileStatus(path.getName(), alluxioUri.hash(), fileStatus.getLen(),
      fileStatus.getModificationTime(), fileStatus.getOwner(), fileStatus.getGroup(),
      fileStatus.getPermission().toShort(), mUserBlockSizeBytesDefault);

Contributor

+1

@JiamingMai JiamingMai force-pushed the add-list-status-iterator-for-hdfs branch from 0849da3 to 36445ac Compare October 19, 2023 07:58
@JiamingMai JiamingMai force-pushed the add-list-status-iterator-for-hdfs branch from e76d7f0 to d562dd4 Compare October 19, 2023 08:08
@jja725 jja725 self-requested a review October 19, 2023 21:23
@JiamingMai
Contributor Author

alluxio-bot, merge this please

@alluxio-bot alluxio-bot merged commit 62cc17a into Alluxio:main Oct 20, 2023
14 checks passed
elega pushed a commit that referenced this pull request Oct 24, 2023
When there are a lot of files in HDFS, it takes a large amount of time and memory to complete a `listStatus` request. Moreover, sometimes OOM occurs. This PR provides an iterator for the HDFS under file system to list files.
			pr-link: #18295
			change-id: cid-11019e8f163210c7664f3f2b6ddf3bae27e8ee8c
ssz1997 pushed a commit to ssz1997/alluxio that referenced this pull request Dec 15, 2023
When there are a lot of files in HDFS, it takes a large amount of time and memory to complete a `listStatus` request. Moreover, sometimes OOM occurs. This PR provides an iterator for the HDFS under file system to list files.
			pr-link: Alluxio#18295
			change-id: cid-11019e8f163210c7664f3f2b6ddf3bae27e8ee8c
Labels
type-feature This issue is a feature request
4 participants