Provide a better long-term solution for detection of text files. #4

Open
pyjarrett opened this issue May 25, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@pyjarrett
Owner

Septum currently checks a very limited selection of file extensions to decide whether a file is text. This speeds up loading of large source trees and keeps junk files out of memory, which reduces Septum's memory footprint.
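
For illustration, an extension-only check can be sketched as follows. This is Python rather than Septum's Ada, and the allow-list and the function name `is_text_by_extension` are made up for the example; Septum's actual list is not shown in this thread.

```python
from pathlib import PurePath

# Hypothetical allow-list for illustration only; not Septum's real list.
TEXT_EXTENSIONS = frozenset(
    {".ads", ".adb", ".c", ".h", ".md", ".py", ".rs", ".txt"}
)


def is_text_by_extension(file_name: str, extensions=TEXT_EXTENSIONS) -> bool:
    """Return True if the file's final extension is in the allow-list."""
    # PurePath.suffix is the last dot-component, e.g. ".gz" for "a.tar.gz".
    # The comparison is case-insensitive, so "README.MD" matches ".md".
    return PurePath(file_name).suffix.lower() in extensions
```

A check like this never opens the file, which is what makes it cheap, but it also skips extensionless text files such as `Makefile`.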

@pyjarrett pyjarrett added the enhancement New feature or request label May 25, 2021
@pyjarrett
Owner Author

pyjarrett commented May 27, 2021

Now that Septum supports configuration files, SP.Cache.Is_Text could accept a list of extensions from the Search, making the set of text extensions configurable on a per-user or per-project basis.
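
A rough sketch of that idea, in Python rather than Ada. The one-entry-per-line input, the `#` comment syntax, the merge policy, and the names `parse_extensions` and `effective_extensions` are all assumptions for the example and do not reflect Septum's actual configuration format.

```python
def parse_extensions(lines):
    """Build a normalized extension set from user-supplied entries."""
    result = set()
    for line in lines:
        entry = line.strip().lower()
        # Skip blank lines and comment lines (assumed "#" comment syntax).
        if not entry or entry.startswith("#"):
            continue
        # Accept both "adb" and ".adb" and store the dotted form.
        result.add(entry if entry.startswith(".") else "." + entry)
    return frozenset(result)


def effective_extensions(user_lines, project_lines):
    """Union of per-user and per-project entries (an assumed merge policy)."""
    return parse_extensions(user_lines) | parse_extensions(project_lines)
```

With this policy a project can only add extensions to the user's set; letting a project remove entries would need a different merge rule.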

@pyjarrett pyjarrett added this to the Beta milestone Jun 3, 2021
@kalkin

kalkin commented Oct 5, 2021

Have a look at kalkin/file-expert. I wrote a program that detects the language type based on the data gathered by github/linguist. At some point I will refactor the code to provide C bindings for non-Rust library users, if someone is interested.
The other way is to reuse the data and rewrite file-expert as an Ada library. It should be pretty easy, a weekend-or-two project.

@pyjarrett
Owner Author

@kalkin, your project looks exciting! I'm not sure it helps solve this issue right now, since Septum's search is language-agnostic and the goal is just to determine whether a file is readable text or binary data. If Septum gains this need at some future point, I'd reconsider.
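
To make the "readable text or binary data" distinction concrete, one common content-based heuristic is to sample the start of the file, reject it if it contains a NUL byte, and otherwise require the sample to be valid UTF-8. The Python sketch below shows that heuristic; it is not what Septum does today, and the sample size and function names are assumptions.

```python
import codecs

SAMPLE_SIZE = 8192  # Assumed sample size; only the start of the file is read.


def looks_like_text(sample: bytes) -> bool:
    """Heuristic: no NUL bytes, and the sample decodes as UTF-8."""
    if b"\x00" in sample:
        return False
    # final=False tolerates a multi-byte character cut off at the end of
    # the sample, while still raising on genuinely invalid byte sequences.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def is_text_file(path: str, sample_size: int = SAMPLE_SIZE) -> bool:
    """Apply the heuristic to the first sample_size bytes of a file."""
    with open(path, "rb") as handle:
        return looks_like_text(handle.read(sample_size))
```

This works without any extension list, at the cost of opening every candidate file. It treats empty files as text and misclassifies UTF-16 text as binary, because UTF-16 encodes ASCII characters with NUL bytes.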

@kalkin

kalkin commented Oct 7, 2021

@pyjarrett Thanks! Seems like I misunderstood how Septum works. I thought it did some basic language-specific parsing.

If you ever want to parse different languages, I strongly suggest looking at tree-sitter, if you do not know it yet. It lets you specify a grammar for a language (in JS :() and then generates a parser library that returns a universal AST containing all the line/character-range coordinates. Quite a few popular programming languages have tree-sitter support already. https://github.com/tree-sitter/tree-sitter
