Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect filetype from shebang line #1001

Merged
merged 1 commit into from
Nov 8, 2021

Conversation

ath3
Copy link
Contributor

@ath3 ath3 commented Nov 7, 2021

Most of the shell and perl scripts i have to work on have no extensions.
With this change im adding shebang support for detecting file types.
There is also a new config entry in languages.toml: "shebangs"

if std::io::Read::read(&mut file, &mut buf[..]).is_ok() {
if let Ok(str) = str::from_utf8(&buf) {
static SHEBANG_REGEX: Lazy<Regex> = Lazy::new(|| {
Regex::new(r"#!/[^\s]*/(env\s)*([_a-zA-Z0-9-]+)").unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a pass at cleaning up the shebang regex. Here's an alternative to yours that captures only the interpreter part, and allows more exotic interpreter names (like "php7.4").

^#!\s*/\S*/(?:env\s+)?(\S+)

I wrote a bunch of tests over here so you can see how it breaks down and modify if needed:
https://regex101.com/r/fsxGMO/2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the regex cleanup!
I made some minor changes to also support lines like "#! python"
One thing im not sure of, if we really want to match for minor versions of interpreters, is there a real world usecase for this?
This makes it harder to cut off the .exe/.cmd/.. part of interpreters, while having the need o support minor versions in language toml shebangs entry.

what i have now (excluding minor versions):

^#!\s*(?:\S*/(?:env\s+)?)?([^\s\.]+)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ath3 good point about supporting things like php7.4. My thought was that sometimes you might have more than one version of php installed on a server (like with ondrej's ubuntu repos) that name them this way, and you may possibly want to specifically invoke one. But for our case, we just want to discover the name of the language, not the version, so you're modification makes sense. Although, it does still erroneously catch php7 out of my example. Do we want to exclude digits from the capture group? That would also clean up the python3 case.

Also, for windows, would we need to worry about supporting \ as a directory separator? If so, just modify the / in your version to [/\\].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, for windows, would we need to worry about supporting \ as a directory separator?

I don't think windows supports shebangs, not natively anyway (cygwin can use it I think, with / like in unix).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I searched and saw that some people use Windows style paths in shebang lines which is probably not the most correct way, but we can use a regex handling those cases too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed the new regex, hopefully this works for all the variants.

@ath3 ath3 force-pushed the shebang-filetype branch 3 times, most recently from e057337 to cbd3016 Compare November 8, 2021 01:56
Comment on lines 305 to 322
// If we have not found the configuration_id, see if we can get it from a shebang line
if configuration_id.is_none() {
if let Ok(mut file) = File::open(path) {
let mut buf = [0; 100];
if std::io::Read::read(&mut file, &mut buf[..]).is_ok() {
if let Ok(str) = str::from_utf8(&buf) {
static SHEBANG_REGEX: Lazy<Regex> = Lazy::new(|| {
Regex::new(r"^#!\s*(?:\S*[/\\](?:env\s+)?)?([^\s\.\d]+)").unwrap()
});
configuration_id = SHEBANG_REGEX
.captures(str)
.and_then(|cap| cap.get(1))
.and_then(|cap| self.language_config_ids_by_shebang.get(cap.as_str()))
}
}
}
};

// TODO: content_regex handling conflict resolution
configuration_id.and_then(|&id| self.language_configs.get(id).cloned())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather see this as a separate method, then called via language_config_for_file_name(..).or_else(|| shebang(..))

helix-core/src/syntax.rs Outdated Show resolved Hide resolved
Comment on lines 316 to 317
.and_then(|cap| cap.get(1))
.and_then(|cap| self.language_config_ids_by_shebang.get(cap.as_str()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.and_then(|cap| cap.get(1))
.and_then(|cap| self.language_config_ids_by_shebang.get(cap.as_str()))
.and_then(|cap| self.language_config_ids_by_shebang.get(&cap[1]))

// Read the first 128 bytes of the file. If its a shebang line, try to find the language
let file = File::open(path).ok()?;
let mut buf = String::with_capacity(128);
Read::read_to_string(&mut Read::take(file, 128), &mut buf).ok()?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just file.take(128).read_to_string(&mut buf)? It's a trait

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to avoid unwraps that can cause panic, thats why i used if let on those places.
Is it ok to have those here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah no unwraps, just Read::read(buf).ok() is equivalent to buf.read().ok -- Read is a trait

@ath3 ath3 force-pushed the shebang-filetype branch 2 times, most recently from b658f90 to fb64735 Compare November 8, 2021 03:01
// Read the first 128 bytes of the file. If its a shebang line, try to find the language
let file = File::open(path).ok()?;
let mut buf = String::with_capacity(128);
file.take(128).read_to_string(&mut buf).ok();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
file.take(128).read_to_string(&mut buf).ok();
file.take(128).read_to_string(&mut buf).ok()?;

@archseer
Copy link
Member

archseer commented Nov 8, 2021

So it turns out the reading from disk is unnecessary since the file is already read by the time detection runs. I'll fix it after merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants