Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(shellwords)!: change arg handling strategy #11149

Merged
merged 6 commits into from
Jan 5, 2025

Conversation

RoloEdits
Copy link
Contributor

@RoloEdits RoloEdits commented Jul 12, 2024

While attempting to implement issue #11012, it became evident that Shellwords required more substantial changes to correctly interpret how arguments are handled by commands. This was further supported by feedback from @the-mikedavis, highlighting the need for improved parsing strategies, similar to those used by Kakoune.

Summary of Changes

  1. Shellwords Parsing Refactor:

    • Non-Allocating Iterator: Replaced the existing parsing approach, which used two Vec<Cow<str>>, with a non-allocating iterator Args<'_>. This change aims to streamline argument interaction, making it more linear, performant, and idiomatic.
    • Rewritten Parsing Operations: The parsing logic has been restructured to better support specialization cases. Examples include preserving trailing spaces in arguments and handling unclosed quotes by treating the remaining input as a single argument.
  2. Formalize :sh and :pipe escaping

    • Use raw input: By using the raw, literal input, it's no longer a guess as to how escaping is handled, and should always be predictable.
  3. Introduction of Unescape Function:

    • Functionality: Added an unescape function to process escape sequences in strings and convert them into their corresponding literal characters. Supported sequences include \n (newlines), \t (tabs), and Unicode characters (\u{...}).
    • Conservative Handling: Designed to process user input conservatively, returning the original string if it cannot be processed as expected. This approach minimizes unexpected behavior.
    • Integration with Commands: Enhanced the yank-join command to unescape separator sequences before joining selections, aligning with user expectations. Similarly, implemented for the :sh and :pipe-like commands.
  4. Parse JSON Lists

    • With the addition of Args::rest, an argument position aware function that returns the remaining raw text, its now trivial to parse JSON lists from the appropriate syntax, opening up the ability to :toggle and :set config options that are lists. e.g. :toggle shell ["nu", "-c"] ["bash", "-c"].

Visual Examples

  • Newline: separator_unescape_newline
  • Tab: separator_unescape_tab
  • Unicode: separator_unescape_unicode
  • sh Command: sh_unescape

Breaking Changes!

  • Due to the changes to :sh, :yank-join-like and :pipe-like commands using the raw args, previous config commands that relied on specific escaping might no longer function as expected.
  • Due to new unescaping, previous config commands that used \n or \t might now have these replaced with their literal character (This is command specific).

Further work

For now, basic conversions are done at the dap boundary, where the Args iterator is turned into something like a Vec<Cow> or Vec<String> to maintain the previous api. Mainly, Args still needs to be integrated into dap_start_impl and debug_parameter_prompt. This is being left till later when dap is developed more. For now, TODOs are added among the other future work to transition to shellwords::Args when refactoring.

Related Issues

@RoloEdits
Copy link
Contributor Author

@the-mikedavis

In working on this I came across some things I wasn't so sure about, and was hoping you might be able to help me understand them.

For instance, with the checks done to see of the input ended in a space, as well as tests that checked if it preserved it. I retained this behavior in the parsing, but I'm just not sure what use is for it? for instance in this check:

let shellwords = Shellwords::from(input);
let words = shellwords.words();
if words.is_empty() || (words.len() == 1 && !shellwords.ends_with_whitespace()) {
fuzzy_match(
input,
TYPABLE_COMMAND_LIST.iter().map(|command| command.name),
false,
)
.into_iter()
.map(|(name, _)| (0.., name.into()))
.collect()
} else {

It seems that if parsed in a way that removes any whitespace at the end would have been covered by just words.is_empty || words.len() == 1 already? If it was just the command being input and then a space, stripping the space would only require the == 1 check.

The next this is if the : is sent directly to the Shellwords::from and if there is a need to either keep this colon, if its there, or if it should be stripped, if its not already stripped from the input.

And lastly, for now, list support: ["this", "is", "a", "list"]. I have never used them so not sure if this is a legacy holdover or something that still needs support?

Thanks!

@RoloEdits RoloEdits changed the title refactor(shellwords): change arg parsing strategy refactor(shellwords): change arg handling strategy Jul 13, 2024
@RoloEdits RoloEdits force-pushed the shellwords branch 2 times, most recently from affbeb9 to 94740f5 Compare July 13, 2024 14:50
@RoloEdits
Copy link
Contributor Author

Starting to propagate the changes and came across:

ensure!(
args.len() > 2,
"Bad arguments. For string configurations use: `:toggle key val1 val2 ...`",
);

I've never really used the non bool toggle option when changing a setting, so when seeing that two values are required, I am not sure what the second value is supposed to be?

For example, using :get end-of-line-diagnostics I get "hint", to change it I'd think to enter :toggle end-of-line-diagnostics disable if I wanted to change the value, but I get the error above instead. What am I missing?

@kirawi
Copy link
Member

kirawi commented Jul 13, 2024

#10828 (comment)

@RoloEdits RoloEdits force-pushed the shellwords branch 5 times, most recently from 4b136bc to 1ff0084 Compare July 13, 2024 22:42
@RoloEdits
Copy link
Contributor Author

RoloEdits commented Jul 13, 2024

When refactoring the changes to the main entry command function, the completion behavior slightly changed: when typing a partial command, to get the fuzzy matches, and selecting one with tab, and pressing space, the options remain.
shellwords_yj
Pressing tab after space also brings the cursor back to the end.

This is further an issue for a command like :theme:
shellwords_theme

master:

|editor: &Editor, input: &str| {
let shellwords = Shellwords::from(input);
let words = shellwords.words();
if words.is_empty() || (words.len() == 1 && !shellwords.ends_with_whitespace()) {
fuzzy_match(
input,
TYPABLE_COMMAND_LIST.iter().map(|command| command.name),
false,
)
.into_iter()
.map(|(name, _)| (0.., name.into()))
.collect()
} else {
// Otherwise, use the command's completer and the last shellword
// as completion input.
let (word, word_len) = if words.len() == 1 || shellwords.ends_with_whitespace() {
(&Cow::Borrowed(""), 0)
} else {
(words.last().unwrap(), words.last().unwrap().len())
};
let argument_number = argument_number_of(&shellwords);
if let Some(completer) = TYPABLE_COMMAND_MAP
.get(&words[0] as &str)
.map(|tc| tc.completer_for_argument_number(argument_number))
{
completer(editor, word)
.into_iter()
.map(|(range, file)| {
let file = shellwords::escape(file);
// offset ranges to input
let offset = input.len() - word_len;
let range = (range.start + offset)..;
(range, file)
})
.collect()
} else {
Vec::new()
}
}
}, // completion

pr:

|editor: &Editor, input: &str| {
let shellwords = Shellwords::from(input);
let command = shellwords.command();
if command.is_empty() || shellwords.args().next().is_none() {
fuzzy_match(
input.trim_end(),
TYPABLE_COMMAND_LIST.iter().map(|command| command.name),
false,
)
.into_iter()
.map(|(name, _)| (0.., name.into()))
.collect()
} else {
// Otherwise, use the command's completer and the last shellword
// as completion input.
let (word, word_len) = shellwords
.args()
.last()
.map_or(("", 0), |last| (last, last.len()));
TYPABLE_COMMAND_MAP
.get(command)
.map(|tc| tc.completer_for_argument_number(shellwords.args().count()))
.map_or_else(Vec::new, |completer| {
completer(editor, word)
.into_iter()
.map(|(range, file)| {
let file = shellwords::escape(file);
// offset ranges to input
let offset = input.len() - word_len;
let range = (range.start + offset)..;
(range, file)
})
.collect()
})
}
}, // completion

I also have some questions around what the count is expecting here as this seems pretty unintuitive and this might be part of the problem.

#[test]
fn test_argument_number_of() {
let cases = vec![
("set-option", 0),
("set-option ", 0),
("set-option a", 0),
("set-option asdf", 0),
("set-option asdf ", 1),
("set-option asdf xyz", 1),
("set-option asdf xyz abc", 2),
("set-option asdf xyz abc ", 3),
];
for case in cases {
assert_eq!(case.1, argument_number_of(&Shellwords::from(case.0)));
}
}

I suspect this is connected to the earlier question I had about the significance of end whitespace, as this seems like something someone would have done to solve a problem you'd only understand when initially implementing the function.

@RoloEdits RoloEdits force-pushed the shellwords branch 2 times, most recently from baf5128 to 88419ad Compare July 14, 2024 08:27
@RoloEdits RoloEdits marked this pull request as ready for review July 14, 2024 08:43
@RoloEdits RoloEdits force-pushed the shellwords branch 3 times, most recently from 892116b to 7929ea7 Compare July 15, 2024 11:28
@RoloEdits RoloEdits force-pushed the shellwords branch 7 times, most recently from 091740d to b2998f2 Compare July 16, 2024 08:42
@RoloEdits
Copy link
Contributor Author

I believe this should be ready for reviewing.

@kirawi kirawi added the A-helix-term Area: Helix term improvements label Jul 17, 2024
@kirawi
Copy link
Member

kirawi commented Jul 17, 2024

It sounds like this could help address #10549 in a future PR.

@RoloEdits
Copy link
Contributor Author

RoloEdits commented Jul 17, 2024

I added a raw function to Args that returns exactly what was input, minus the command. For now I have kept the logical semantics the same for the commands, but for 1:1 interactions for :sh and others, I believe it should be as simple as let args = args.raw().to_string()(there is a move).

Also I believe that even now the pipe: sed -n "s/test t/not /p" should work, even on master? Same for the pipe: echo "", it should return nothing, not the \"\".

We can still change over to the raw form to formalize how to interact with with for sh and others, piggybacking off rusts handling raw strings. This way the rules would always be clear and something we could reference.

@the-mikedavis
Copy link
Member

Yeah we should restore the behavior of \ on spaces, it was the subject of a few PRs in the past and definitely has some amount of usage. It's not as simple a change as the match condition though, the iterator needs to change its Item type to Cow<'a, str> because the backslash should be dropped from the arg.

@RoloEdits
Copy link
Contributor Author

RoloEdits commented Jan 6, 2025

Ah, you are right, I remember now that I had brought this up too, about how to deal with escaping slashes and this. Would all slashes need to be escaped? Or just when its for spaces? And one more thing I remember was that the Shellwords::ends_with_whitespace would need to be checked differently now as well with a change like this.

Desdaemon pushed a commit to Desdaemon/helix that referenced this pull request Jan 6, 2025
@the-mikedavis
Copy link
Member

the-mikedavis commented Jan 6, 2025

Right yeah, if I remember correctly that function was for the sake of completion and we wanted "a\ " to complete for "a b.txt", so we would consider "a\ " to not end with whitespace. (Maybe that function could use a better name). Edit: here are some tests that regressed

I'm not sure if all backslashes need escaping. With the prior shellwords we may have only used backslash to escape on non-windows? (Thinking that on Windows you would want to use it as a path separator)

@RoloEdits
Copy link
Contributor Author

RoloEdits commented Jan 6, 2025

This is how the completion works now with out the escaping
image
It inserts the quotes for you. Which is actually something that was broken before as this wasnt actually a valid input.

@RoloEdits
Copy link
Contributor Author

Would it be fine to let this sit for a bit longer so more people can see if they need it? I dont think this is needed anymore now that with spaces, 'with spaces' and "with spaces" works properly now. And completions auto add the quotes for you.

@RoloEdits
Copy link
Contributor Author

RoloEdits commented Jan 6, 2025

Although now that I think about it windows completion inserts backslashes right? If you have a path\"space text file.txt" I think this would escape the first quote leaving you with path\"space, text, file.txt and ".

Would it work if windows also inserted / like unix-like OS's? Or should it bring the quotes around the entire completion text? Or should it insert two \\ to always escape the separators?

edit:

helix_completion_slash_windows

Just checked and this is indeed what happens. Though id classify this as a separate issue, as the completer is giving invalid windows paths:
image

@the-mikedavis
Copy link
Member

I believe we should have different backslashing rules between unix and windows. The completion results for unix are different in this example: "a b.txt" is completed as a\ b.txt because of the ascii whitespace handling in shellwords::escape that is cfg!(unix).

I would consider the backslashing behavior pretty important as it follows unix shell conventions so if this isn't a small fix then I'd prefer to revert this PR temporarily.

@RoloEdits
Copy link
Contributor Author

I see. I dont think it should be too bad, so we can leave it for now, and then see what the PR looks like to get this implemented.

I think this only fixes for unix completions though, with the windows one able to produce invalid file paths in the path completion. One solution besides escaping it, would be to handle unix and windows the same with a "all paths in quotes" approach. Would no longer need to escape anything for any platform. I forget if quotes are valid names in linux or not, but that would then be the sole edge case.

@RoloEdits
Copy link
Contributor Author

Also, is this only for completions or are backslashes expected to be used by a user?

@RoloEdits
Copy link
Contributor Author

Yeah I think this will be ok, propagated the Cow change and got these to pass:

        assert_eq!(
            Shellwords::from(r":o a\ ").args().next(),
            Some(Cow::Borrowed("a "))
        );
        assert_eq!(
            Shellwords::from(r":o a\ b.txt").args().next(),
            Some(Cow::Borrowed("a b.txt"))
        );

Though tabs are being weird, so have to figure that out. I think I might just not be representing the expected text correctly but ill leave that to last. Still have to correct the other tests but so far I think its straight forward. Will see if that holds true as I get to them.

@RoloEdits
Copy link
Contributor Author

RoloEdits commented Jan 6, 2025

@the-mikedavis How should backslashes in quotes be handled exactly? Want to make sure we are all on the same page.

Should "\\three" be \three, or should it only be escaped, or I guess unescaped?(view point can make this weird), when there is a quote and leave the rest as is: "\\three with \"escaped quote" -> \\three with escaped quote?

I think this is the last edge case to deal with. If you can perhaps paste some more examples that you feel could be wonky and I can make tests around them

@the-mikedavis
Copy link
Member

Ok actually I'm going to revert for the time being because I think we may also want to re-examine how we store the arguments and that could change what the API looks like. Sorry for the flip/flop on this - Pascal and I both thought this looked pretty promising but I don't want to commit to a really big change like this without being totally satisfied with the API. What I liked reading through this previously was the lack of allocation for the args type but I don't think it's necessary to avoid allocating since parsing the command line is not a hot loop, and I think it will make it harder to add switches in an ergonomic way later. Plus we'll need to allocate sometimes for backslash escapes anyways.

I'd like to explore parsing similar to how Kakoune does it since Kakoune already has variable expansions and switches. The parsing in Kakoune is done in two steps in command_manager.cc (initial parsing of the command line, handling of variables and other expansions, completions) first to set up basically a Vec<String> and then another step in parameters_parser.cc to handle that list of parameters according to some settings on the command's description (see ParameterDesc). We can't reuse Kakoune's logic totally as-is since Kakoune doesn't support Windows but the overall strategy I feel is pretty elegant.

@RoloEdits
Copy link
Contributor Author

before a final choice is made could you look at the progress I have made in this fix?

@RoloEdits
Copy link
Contributor Author

RoloEdits commented Jan 6, 2025

Im only one test away from getting this done. Its just deciding on the exact behavior.

@RoloEdits
Copy link
Contributor Author

RoloEdits commented Jan 6, 2025

Ill take a look at ParameterDesc but the rest seems to fit this api already, unless im missing something. Its a Vec but lazy. And cloning only copies the reference, which is fast. The other issues, to me, dont become easier just because its a Vec<String>. None of the big issues I am facing for flags get solved with a Vec<String>.

@RoloEdits
Copy link
Contributor Author

The parsing rules would also be the same right? So it would be the same as using the existing iterator approach but just collect a String in to a Vec instead. I dont see how that fixes anything or what opportunities it presents going forward.

@the-mikedavis
Copy link
Member

An immediate benefit of collecting the parameters upfront is checking the minimum and maxmimum number of non-flag params a command can accept

the-mikedavis added a commit that referenced this pull request Jan 6, 2025
@RoloEdits
Copy link
Contributor Author

RoloEdits commented Jan 6, 2025

There is arg_count on Args. Though with the potential allocations this is no longer guaranteed to be free, but as you said its not really in a hot loop. This still needs to be done in the commands themselves though, which is already at the end of the chain, no longer just parsing issues. And this check is still taking place, not gotten rid of.

@RoloEdits RoloEdits restored the shellwords branch January 6, 2025 18:01
@RoloEdits
Copy link
Contributor Author

So what kind of changes are wanted? I will get the backslash escaping done, but what you say as benefits of an upfront collect dont make a lot of sense to me, so I assume there is other issues besides not having a Vec<String>?

@RoloEdits
Copy link
Contributor Author

Looking at the kakuone code it looks like it only validates count, as far as validating arguments, something we also do, just in the command body, but even this is validation is only partially done as there could be flags that are mutually exclusive, and this would have to be done in the bulk of the logic on execution as there is no simple way to declare this statically.

@RoloEdits
Copy link
Contributor Author

RoloEdits commented Jan 6, 2025

There is also some commands that have different requirements, like toggle or set, which for bools only are very specific for their requirements but strings or number can have many. No way to know when deciding if its enough or if its valid for the option.

@the-mikedavis
Copy link
Member

Not all validation can be done with static information attached to the command description but we should be setting a min and optional max number of parameters a command will accept so that command_mode can perform an early validation. Checking the number of parameters is currently done very inconsistently because it's done within the command body - it's easy to forget to do. For more complicated commands like :toggle-option for example we can set a minimum of one argument and no maximum and perform the extra check within the function body. Most commands though can be validated based on static min/max parameters in their definition.

I'd like to see the parameters parsed eagerly and stored as a Vec of either Strings or Cow<'a, str> from the input string. This makes that kind of blanket validation possible and should help with flags later as the API to get a flag could be something like args.get_flag("no-format") (reading out of a prepared hashmap).

One difference between Kakoune's and our commands is that Kakoune doesn't provide commands that need to avoid their line being split into parameters like :sh or :pipe-to. That can be done by attaching some more metadata to the command's description that tells the parser to only perform basic replacements like \t and \n and avoid any other splitting/quoting/parsing. :sh would use that for example to fix #10549: the args vec would have one entry, for example :sh echo "hello" would be parsed as ["echo \"hello\""] in this mode. #6171 is another edge-case but as Pascal said that one is possible to do with the right quoting and should really be solved with the config refactor instead. It could also take advantage of that same mode and do a bit of manual parsing to get the config option and then treat the rest as raw though.

@RoloEdits
Copy link
Contributor Author

RoloEdits commented Jan 6, 2025

we should be setting a min and optional max number of parameters a command will accept

Perhaps something like a similar in nature to size_hint?: params_hint: (usize, Option<usize>)?

One difference between Kakoune's and our commands is that Kakoune doesn't provide commands that need to avoid their line being split into parameters ... do a bit of manual parsing to get the config option and then treat the rest as raw though.

This was handled exactly by rest or raw. It just sends the raw, literal string to the shell. (or at least thats one use case for them). Thats why that issue was marked as solved by this.

Pre collecting into a Vec also means that the rest and raw ways of dong things wouldnt be possible in the same way (at least nothing immediate pops into the mind that is as straight forward as the current way). These composed together well with the iterator approach but for instance one thing this PR allowed was the parsing of lists in set or toggle: :toggle shell ["nu", "-c"] ["powershell", "-c"]. This is also something which caused parsing issues for the variable expansion and needing special handling adding complicity to the rest of the parsing.

This makes that kind of blanket validation possible and should help with flags later as the API to get a flag could be something like args.get_flag("no-format") (reading out of a prepared hashmap).

So like a HashMap<String, Param> where Param can be an enum of Param::String(String), Param::Bool, Param::Number(i64), or something to this effect? Though this was would mean a lot more boilerplate than the current implimentation. So perhaps just a Param::Bool and a Param::Value(String). Though this isnt exactly getting us closer to a serde_json::from_str like solution with the flags. There still would need to be boiler plate to get the flags and arguments out as needed.

With my experience of exploring a flags solution and implementing a working one, I'll say once again the issues there wouldn't be magically solved by having a Vec<String>. If the worry was that it accepted only declared flags and skipped any not declared that could be implemented with either "storage" backing, it just doesn't currently.

All I can see this doing is that it moves the validation of basic parameter counts up one level. And this could be done with the current implimentation by adding a validate_param_count(number) to TypableCommand and then internally it would just do basic Boolean logic to check it fits in the range set. And then there are still situations where you need to do an extra check anyways.

Should we try to explore more for flags and see what actually needs to be done, rather than guessing limitations or freedoms one solution offers over another? Because even after reading this, the problems were either solved already by this PR, or the proposed alternative seems to offer no benefits over it, but come with more limitations that need its own work arounds.

I understand the gut feeling to revert something you feel sketchy about, I am committed to this work either way, I just feel this was written off too early for vague issues when this solves existing issues and doesn't clearly create more. (besides the escaping rules which is something I did bring up multiple times and was never given a clear answer for until this was merged)

I will get another PR up with the fixes from the other PRs added in and we can discuss more there with fresh eyes on the problem.

diucicd pushed a commit to diucicd/helix that referenced this pull request Jan 8, 2025
diucicd pushed a commit to diucicd/helix that referenced this pull request Jan 8, 2025
rmburg pushed a commit to rmburg/helix that referenced this pull request Jan 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-helix-term Area: Helix term improvements
Projects
None yet
9 participants