Detect filetype from shebang line #1001

ath3 · 2021-11-07T03:26:47Z

Most of the shell and Perl scripts I have to work on have no file extension.
This change adds shebang support for detecting file types.
There is also a new config entry in languages.toml: "shebangs".
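
As an illustration of how a per-language "shebangs" list could feed a lookup table, here is a minimal sketch using only the standard library. The struct LanguageConfig, its fields, and both function names are hypothetical simplifications, not the definitions in helix-core; only the idea of a map keyed by interpreter name (language_config_ids_by_shebang in the diff) comes from this PR.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down language entry. Only the `shebangs`
/// key is named in this PR; everything else is an assumption.
struct LanguageConfig {
    language_id: String,
    shebangs: Vec<String>,
}

/// Build a map from interpreter name to the index of its language config.
fn index_by_shebang(configs: &[LanguageConfig]) -> HashMap<String, usize> {
    let mut index = HashMap::new();
    for (id, config) in configs.iter().enumerate() {
        for shebang in &config.shebangs {
            // If two languages list the same interpreter, the first one wins.
            // That tie-break is an arbitrary choice made for this sketch.
            index.entry(shebang.clone()).or_insert(id);
        }
    }
    index
}

/// Resolve an interpreter name (for example "bash") to a language id.
fn language_for_interpreter<'a>(
    configs: &'a [LanguageConfig],
    index: &HashMap<String, usize>,
    interpreter: &str,
) -> Option<&'a str> {
    let id = *index.get(interpreter)?;
    configs.get(id).map(|config| config.language_id.as_str())
}
```

With a config whose shebangs are ["sh", "bash"], both interpreter names resolve to the same language id, and an unlisted interpreter resolves to None.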

EpocSquadron · 2021-11-07T13:06:35Z

helix-core/src/syntax.rs

+                if std::io::Read::read(&mut file, &mut buf[..]).is_ok() {
+                    if let Ok(str) = str::from_utf8(&buf) {
+                        static SHEBANG_REGEX: Lazy<Regex> = Lazy::new(|| {
+                            Regex::new(r"#!/[^\s]*/(env\s)*([_a-zA-Z0-9-]+)").unwrap()


I took a pass at cleaning up the shebang regex. Here's an alternative to yours that captures only the interpreter part, and allows more exotic interpreter names (like "php7.4").

^#!\s*/\S*/(?:env\s+)?(\S+)

I wrote a bunch of tests over here so you can see how it breaks down and modify if needed:
https://regex101.com/r/fsxGMO/2

Thanks for the regex cleanup!
I made some minor changes to also support lines like "#! python".
One thing I'm not sure of is whether we really want to match minor versions of interpreters. Is there a real-world use case for this?
It makes it harder to cut off the .exe/.cmd/... part of interpreter names, and it means minor versions would have to be listed in each language's shebangs entry in languages.toml.

What I have now (excluding minor versions):

^#!\s*(?:\S*/(?:env\s+)?)?([^\s\.]+)

@ath3 good point about supporting things like php7.4. My thought was that sometimes you might have more than one version of PHP installed on a server (as with ondrej's Ubuntu repos, which name them this way), and you may want to invoke one specifically. But in our case we just want to discover the name of the language, not the version, so your modification makes sense. It does still erroneously capture php7 from my example, though. Do we want to exclude digits from the capture group? That would also clean up the python3 case.

Also, for Windows, would we need to worry about supporting \ as a directory separator? If so, just change the / in your version to [/\\].

Also, for Windows, would we need to worry about supporting \ as a directory separator?

I don't think Windows supports shebangs, at least not natively (Cygwin can use them, I think, with / as on Unix).

I searched and saw that some people use Windows-style paths in shebang lines, which is probably not the most correct way, but we can use a regex that handles those cases too.

I pushed the new regex, hopefully this works for all the variants.
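
To make the cases discussed in this thread concrete, here is a sketch that approximates the matching rules with the standard library only. It is not the code from the PR, which uses the regex crate: interpreter_from_shebang and basename are hypothetical names, and the function only approximates the pattern ^#!\s*(?:\S*[/\\](?:env\s+)?)?([^\s\.\d]+) by taking the last path component, skipping a leading env, and cutting the name at the first '.' or ASCII digit. Edge cases where the regex would backtrack are not reproduced.

```rust
/// Last path component of a token, accepting both '/' and '\' as separators.
fn basename(token: &str) -> &str {
    token
        .rsplit(|c: char| c == '/' || c == '\\')
        .next()
        .unwrap_or(token)
}

/// Hypothetical helper: extract the interpreter name from the first line
/// of a file. Expects a single line, not the whole file.
fn interpreter_from_shebang(first_line: &str) -> Option<&str> {
    // The line must start with "#!"; whitespace after it is allowed.
    let rest = first_line.strip_prefix("#!")?;
    let mut tokens = rest.split_whitespace();

    let mut name = basename(tokens.next()?);
    if name == "env" {
        // "#!/usr/bin/env python3": the interpreter is the next token.
        name = basename(tokens.next()?);
    }

    // Drop version and extension suffixes: "python3", "php7.4", "python.exe".
    let end = name
        .find(|c: char| c == '.' || c.is_ascii_digit())
        .unwrap_or(name.len());
    let name = &name[..end];

    if name.is_empty() {
        None
    } else {
        Some(name)
    }
}
```

Under these assumptions, "#!/usr/bin/env python3" and "#! python" both yield python, "#!/usr/bin/php7.4" yields php, and a line without a leading "#!" yields None.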

archseer · 2021-11-08T01:12:09Z

helix-core/src/syntax.rs

+        // If we have not found the configuration_id, see if we can get it from a shebang line
+        if configuration_id.is_none() {
+            if let Ok(mut file) = File::open(path) {
+                let mut buf = [0; 100];
+                if std::io::Read::read(&mut file, &mut buf[..]).is_ok() {
+                    if let Ok(str) = str::from_utf8(&buf) {
+                        static SHEBANG_REGEX: Lazy<Regex> = Lazy::new(|| {
+                            Regex::new(r"^#!\s*(?:\S*[/\\](?:env\s+)?)?([^\s\.\d]+)").unwrap()
+                        });
+                        configuration_id = SHEBANG_REGEX
+                            .captures(str)
+                            .and_then(|cap| cap.get(1))
+                            .and_then(|cap| self.language_config_ids_by_shebang.get(cap.as_str()))
+                    }
+                }
+            }
+        };

-        // TODO: content_regex handling conflict resolution
+        configuration_id.and_then(|&id| self.language_configs.get(id).cloned())


I'd rather see this as a separate method, then called via language_config_for_file_name(..).or_else(|| shebang(..))
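
The shape of that suggestion can be sketched as follows. This is a simplified stand-in, not the loader in helix-core/src/syntax.rs: the Loader struct, its fields, demo_loader, and the method names are all hypothetical, and the shebang parsing is reduced to taking the last path component of the first token. The point is only the chaining: try the file name first, and fall back to the shebang lookup through or_else.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Hypothetical, simplified loader holding two lookup tables.
struct Loader {
    by_extension: HashMap<String, usize>,
    by_shebang: HashMap<String, usize>,
    languages: Vec<String>,
}

impl Loader {
    /// Lookup by file extension only.
    fn id_for_file_name(&self, path: &Path) -> Option<usize> {
        let ext = path.extension()?.to_str()?;
        self.by_extension.get(ext).copied()
    }

    /// Lookup by shebang, kept as its own method (parsing simplified here).
    fn id_for_shebang(&self, first_line: &str) -> Option<usize> {
        let rest = first_line.strip_prefix("#!")?;
        let interpreter = rest.split_whitespace().next()?.rsplit('/').next()?;
        self.by_shebang.get(interpreter).copied()
    }

    /// File name first; the shebang lookup only runs if that finds nothing.
    fn language_for(&self, path: &Path, first_line: &str) -> Option<&str> {
        self.id_for_file_name(path)
            .or_else(|| self.id_for_shebang(first_line))
            .and_then(|id| self.languages.get(id))
            .map(String::as_str)
    }
}

/// Small fixture used to exercise the sketch.
fn demo_loader() -> Loader {
    Loader {
        by_extension: HashMap::from([("rs".to_string(), 0)]),
        by_shebang: HashMap::from([("bash".to_string(), 1)]),
        languages: vec!["rust".to_string(), "bash".to_string()],
    }
}
```

Because or_else takes a closure, the shebang lookup is evaluated lazily, so files that are already identified by name never pay for it.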

archseer · 2021-11-08T01:16:23Z

helix-core/src/syntax.rs

+                            .and_then(|cap| cap.get(1))
+                            .and_then(|cap| self.language_config_ids_by_shebang.get(cap.as_str()))


Suggested change

-                            .and_then(|cap| cap.get(1))
-                            .and_then(|cap| self.language_config_ids_by_shebang.get(cap.as_str()))
+                            .and_then(|cap| self.language_config_ids_by_shebang.get(&cap[1]))

archseer · 2021-11-08T02:51:09Z

helix-core/src/syntax.rs

+        // Read the first 128 bytes of the file. If its a shebang line, try to find the language
+        let file = File::open(path).ok()?;
+        let mut buf = String::with_capacity(128);
+        Read::read_to_string(&mut Read::take(file, 128), &mut buf).ok()?;


Why not just file.take(128).read_to_string(&mut buf)? Read is a trait.

I was trying to avoid unwraps that can cause a panic; that's why I used if let in those places.
Is it OK to have those here?

Yeah, no unwraps. It's just that Read::read(&mut file, &mut buf).ok() is equivalent to file.read(&mut buf).ok(), since Read is a trait.
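
The equivalence can be checked with a small sketch. The two function names are hypothetical; both bodies use only std::io::Read, and an in-memory byte slice stands in for the opened file. Neither version unwraps: a failed read becomes None through .ok()?.

```rust
use std::io::Read;

/// Method-call syntax: the trait methods are called directly on the reader.
fn read_prefix<R: Read>(reader: R, limit: u64) -> Option<String> {
    let mut buf = String::new();
    reader.take(limit).read_to_string(&mut buf).ok()?;
    Some(buf)
}

/// Fully qualified syntax: the same two trait methods, spelled out.
fn read_prefix_qualified<R: Read>(reader: R, limit: u64) -> Option<String> {
    let mut buf = String::new();
    Read::read_to_string(&mut Read::take(reader, limit), &mut buf).ok()?;
    Some(buf)
}
```

Both functions return the same prefix for the same input; the method-call form simply needs the Read trait to be in scope.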

archseer · 2021-11-08T03:12:23Z

helix-core/src/syntax.rs

+        // Read the first 128 bytes of the file. If its a shebang line, try to find the language
+        let file = File::open(path).ok()?;
+        let mut buf = String::with_capacity(128);
+        file.take(128).read_to_string(&mut buf).ok();


Suggested change

-        file.take(128).read_to_string(&mut buf).ok();
+        file.take(128).read_to_string(&mut buf).ok()?;
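
The difference the trailing ? makes can be seen in a small sketch. Both function names are hypothetical and an in-memory byte slice stands in for the file. Without the ?, a read error (for example, bytes that are not valid UTF-8) is silently discarded and the function carries on; with it, the function returns None early.

```rust
use std::io::Read;

/// Without `?`: a read error is dropped and execution continues.
fn first_bytes_lenient<R: Read>(reader: R) -> Option<String> {
    let mut buf = String::with_capacity(128);
    reader.take(128).read_to_string(&mut buf).ok();
    Some(buf)
}

/// With `?`: a read error makes the whole function return None.
fn first_bytes_strict<R: Read>(reader: R) -> Option<String> {
    let mut buf = String::with_capacity(128);
    reader.take(128).read_to_string(&mut buf).ok()?;
    Some(buf)
}
```

For valid UTF-8 input the two behave identically; they only diverge when the read fails.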

archseer · 2021-11-08T15:19:20Z

So it turns out that reading from disk is unnecessary, since the file has already been read by the time detection runs. I'll fix it after the merge.

ath3 force-pushed the shebang-filetype branch from dc44953 to a7c52af Compare November 7, 2021 03:39

EpocSquadron suggested changes Nov 7, 2021

ath3 force-pushed the shebang-filetype branch 3 times, most recently from e057337 to cbd3016 Compare November 8, 2021 01:56

archseer requested changes Nov 8, 2021

ath3 force-pushed the shebang-filetype branch from cbd3016 to 357dbab Compare November 8, 2021 02:46

archseer reviewed Nov 8, 2021

ath3 force-pushed the shebang-filetype branch 2 times, most recently from b658f90 to fb64735 Compare November 8, 2021 03:01

archseer reviewed Nov 8, 2021

archseer reviewed Nov 8, 2021

Detect filetype from shebang line

a1c2fa0

ath3 force-pushed the shebang-filetype branch from fb64735 to a1c2fa0 Compare November 8, 2021 03:18

archseer merged commit 77dbbc7 into helix-editor:master Nov 8, 2021

sudormrfbin mentioned this pull request Nov 25, 2021

Help with syntax highlighting #1098

Closed

the-mikedavis mentioned this pull request Jan 16, 2024

Allowing osascript -l JavaScript as a JavaScript shebang #9337

Open


ath3 commented Nov 7, 2021

EpocSquadron Nov 7, 2021

ath3 Nov 7, 2021

EpocSquadron Nov 7, 2021

sudormrfbin Nov 7, 2021

ath3 Nov 7, 2021

ath3 Nov 7, 2021

archseer Nov 8, 2021

archseer Nov 8, 2021

archseer Nov 8, 2021

ath3 Nov 8, 2021

archseer Nov 8, 2021

archseer Nov 8, 2021

archseer commented Nov 8, 2021
