Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process lines fewer than 1000 #2

Closed
wants to merge 1 commit into from
Closed

Conversation

yiyangh-ps
Copy link

No description provided.

@rbehjati
Copy link

@yiyangh-ps I just noticed this. Can you merge it? It will make testing a lot easier.

@yiyangh-ps
Copy link
Author

I am waiting on anyone to review it. You can checkout process-small-chunk branch for now.

char delim = absl::GetFlag(FLAGS_new_line_delim);
for (const auto &filename : rest_args) {
if (filename.empty()) {
LOG(FATAL) << "Pipe input is not supported. Please use --input to specify the names of the input files";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is not pipe input supported? I am relying on this for in-memory processing of large amounts of bytes.

Also, please update the README accordingly. Because I think this basically means that all spm_encode examples in the README will stop to work.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requiring getting data from file will have a huge impact on performance, as it adds the overhead of writing to disk and reading from it.

Copy link
Author

@yiyangh-ps yiyangh-ps Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is not pipe input supported?

I don't know. It bugged me for a while too.

Requiring getting data from file will have a huge impact on performance

Only if the encoding is faster than the disk read+write speed. Otherwise the bottleneck is the CPU.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. It bugged me for a while too.

I tried to pipe in some input. There are not any errors. But the output is empty. Is this also what you had observed?

The logs actually say that 100 sentences were encoded:

spm_encode_main.cc(200) LOG(INFO) Encoded 50 sentences
spm_encode_main.cc(228) LOG(INFO) Encoded 50 sentences

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't remember what I saw.

@vmarkovtsev
Copy link
Member

vmarkovtsev commented Dec 19, 2023

This is superseded by #11

@vmarkovtsev
Copy link
Member

We'll add stdin support, if it does not already exist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants