Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(cli): support nested .gitignore and better glob patterns support #1134

Merged
merged 29 commits into from
Dec 17, 2023

Conversation

ematipico
Copy link
Member

@ematipico ematipico commented Dec 9, 2023

Summary

This PR implements a bunch of changes to our file system crawling strategy and how we parse globs inside ignore and include properties.

File system crawling

In order to improve the file system crawling strategy, we decided to use the ignore crate. This was a big judgment call, but I think it's going to pay off in the long run.

This crate depended on regex-automata, which is an engine to compile regexes. For some time we were trying to stay away from crates of this nature because they could slow down compiling times and runtime times, but I believe we should just include them and make sure that we carefully use them. E.g.:

  • compile multiple regex at once
  • use const regexes
  • reuse the same regex
  • cache results

ignore is well maintained, and it's used by ripgrep. This crate provides a bunch of functionalities out of the box, such as:

  • .gitignore support
  • overrides globs
  • multi-threading support

Plus, it would allow us to support glob patterns in the CLI, but this isn't implemented in this PR yet, because it requires some more work.

Technical changes

  • I added a new function to TraversalScope called traverse_paths, which is the first function to call when beginning the traversal. This function accepts all the paths we need to traverse
  • The implementation in os.rs of traverse_paths is the function that uses ignore. We create a WalkerBuilder to pass all the paths we need to traverse and all the overrides (this doesn't work well at the moment, because we don't pass absolute paths to the walker).
  • Errors coming from the first traversal are filtered because our inner traversal already handles them, e.g. symbolic links, hence why I created can_track_error function
  • Correctly replace inputs like . and ./ with the current working directory. This allows us to create absolute paths and apply globs in a more predictable way

While implementing this, I found out that there were a few issues on how we handle files that we don't support. I applied the correct changes and how they should be fixed. They are around FileFeaturesResult.

Better glob support

Our previous implementation of glob support used an internal fork of the glob create. This wasn't ideal, because it didn't support some edge cases and .gitignore patterns.

I removed this fork and used globset. This is another crate that pumps ripgrep, so it's battle-tested and provides the features we want:

  • glob support
  • .gitignore globs support

To support this feature, I had to do some refactoring on how we retrieve this information. Some of those changes lead also to some optimisations that we didn't do before.

Technical changes

  • the function load_configuration was accepting the whole session object, and it was also writing diagnostics to the terminal. This wasn't well architected, because we don't have the console object in the LSP. I changed, by pass only the FileSystem trait, and created a function called validate_configuration_diagnostics that checks whether there are errors and blocks the process.
  • In the LSP, this is done a different way - we just use the error! macro and block the process
  • In the LSP, I addressed VSCode Please add an option disable the warning tooltips! biome-vscode#24 by logging a warning, and not showing the toast. That was indeed annoying
  • The function update_settings now accepts the root path of the VCS and the contents inside the .gitignore file. Before Biome was reading the contents of the file and adding the paths to files.ignore. This isn't the case anymore. We pass this information to the Workspace server so it can properly create the Matcher types. This isn't ideal, I am sure we can optimise this part later.
  • The Matcher object is not an Option anymore, but instead, we use the new empty function to determine whether we should apply its matching logic or not. This is required because we don't want to apply the matching if there isn't anything to match. This is an important requirement. The empty function checks if there are matches defined in the configuration e.g. files.ignore: ["dist/**"] and in the .gitignore.
  • The diagnostics that belonged to VCS have been moved to biome_service because these kinds of errors are expected inside the LSP too (yay!!)

Test Plan

The current tests should pass and shouldn't change, except for a few that should be updated.

I also tested it in our internal script. This revealed a bunch of bugs!! For example, we were not receiving diagnostics for files that we can't handle such as mdx, css etc. Note that the option files.ignoreUnknown wasn't turned on. I added more globs to our biome.json and I could see that these files aren't processed anymore.

I will create a nightly release and test it in other repositories.

Copy link

netlify bot commented Dec 9, 2023

Deploy Preview for biomejs ready!

Name Link
🔨 Latest commit 288e62d
🔍 Latest deploy log https://app.netlify.com/sites/biomejs/deploys/657ed8d8ff7a55000852b941
😎 Deploy Preview https://deploy-preview-1134--biomejs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
Lighthouse
Lighthouse
1 paths audited
Performance: 100 (no change from production)
Accessibility: 97 (no change from production)
Best Practices: 100 (no change from production)
SEO: 93 (no change from production)
PWA: -
View the detailed breakdown and full score reports

To edit notification comments on pull requests, go to your Netlify site configuration.

@github-actions github-actions bot added A-CLI Area: CLI A-Core Area: core A-Project Area: project A-Diagnostic Area: diagnostocis labels Dec 9, 2023
@ematipico ematipico changed the title chore: wip chore(cli): use ignore and globset Dec 9, 2023
@github-actions github-actions bot added the A-LSP Area: language server protocol label Dec 12, 2023
@ematipico ematipico force-pushed the feat/use-ignore branch 2 times, most recently from 08b87d6 to 53e1df8 Compare December 12, 2023 17:49
@ematipico ematipico changed the title chore(cli): use ignore and globset feat(cli): support nested .gitignore and better glob patter support Dec 13, 2023
@github-actions github-actions bot added A-Website Area: website A-Changelog Area: changelog labels Dec 13, 2023
@ematipico ematipico changed the title feat(cli): support nested .gitignore and better glob patter support feat(cli): support nested .gitignore and better glob patterns support Dec 13, 2023
@ematipico ematipico marked this pull request as ready for review December 14, 2023 09:42
@ematipico ematipico requested review from a team December 14, 2023 09:42
@ematipico ematipico force-pushed the feat/use-ignore branch 2 times, most recently from 59b647a to 105eac6 Compare December 14, 2023 09:56
@anonrig
Copy link
Contributor

anonrig commented Dec 15, 2023

Does this have a performance impact?

# pinning to version 1.18 to avoid multiple versions of windows-sys as dependency
tokio = { version = "~1.18.5" }
unicode-bom = "2.0.3"
tokio = { version = "~1.18.5" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spacing before = seems to be wrong

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't understand what you mean

@@ -128,6 +128,16 @@ pub(crate) fn process_file(ctx: &TraversalOptions, path: &Path) -> FileResult {
category!("files/missingHandler"),
)?;

// first we stop if there are some files that don't have ALL features enabled, e.g. images, fonts, etc.
if file_features.is_ignored() || file_features.is_not_enabled() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use experimental Rust features in Biome? If yes, can we surround this with std::intrinsics::unlikely?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't use experimental features, and I don't think we plan for now (at least for stable releases)

crates/biome_cli/src/execute/traverse.rs Show resolved Hide resolved
@ematipico ematipico requested a review from a team as a code owner December 15, 2023 11:29
@ematipico ematipico removed the request for review from a team December 16, 2023 12:13
@ematipico ematipico merged commit 22055d0 into main Dec 17, 2023
21 checks passed
@ematipico ematipico deleted the feat/use-ignore branch December 17, 2023 12:01
@Conaclos
Copy link
Member

I had not the time to review the PR. Does this also add support for .ignore files? and global gitignore file $HOME/.git/config/ignore?

Also, we should certainly update our website documentation for which glob patterns we support (ref to gitignore docs?) and the fact that we support nested .gitignore files.

@ematipico
Copy link
Member Author

@Conaclos you can still leave your review, I will address your comments in another PR.

@ematipico
Copy link
Member Author

Does this also add support for .ignore files? and global gitignore file $HOME/.git/config/ignore?

No and yes (maybe we should add this to the changelog and documentation).

@LinusU
Copy link

LinusU commented Dec 20, 2023

@ematipico not sure if you want this as a separate issue, but I'm seeing issues with this in Biome 1.4.1-nightly.e087146.

My repo looks like so:

.
├── .gitignore
├── biome.json
├── dist
│  └── (generated .js files)
├── lambda
│  ├── foobar
│  │  ├── .gitignore
│  │  └── package.json
│  └── (other folders with package.json and .gitignore files)
├── package.json
├── src
│  ├── postgres (folder generated by prisma)
│  ├── types.ts (file generated by graphql-codegen)
│  └── (other files and folders)
└── (other files and folders)

The .gitignores looks like so:

.gitignore

/.env
/dist/
/node_modules/
/src/postgres/
/src/types.ts

lambda/foobar/.gitignore

/node_modules/

Biome incorrectly looks in lambda/foobar/node_modules and reports errors:

/Users/linus/coding/my-project/lambda/foobar/node_modules/symbol-observable/lib/ponyfill.js:1:1 lint/suspicious/noRedundantUseStrict  FIXABLE  ━━━━━━━━━━

  ✖ Redundant use strict directive.
  
  > 1 │ 'use strict';
      │ ^^^^^^^^^^^^^
    2 │ 
    3 │ Object.defineProperty(exports, "__esModule", {
  
  ℹ The entire contents of JavaScript modules are automatically in strict mode, with no statement needed to initiate it.
  
  ℹ Safe fix: Remove the redundant use strict directive.
  
     1    │ - 'use·strict';
     2  1 │   
     3  2 │   Object.defineProperty(exports, "__esModule", {
  

It seems like src/postgres/** and src/types.ts is correctly ignored.

@ematipico
Copy link
Member Author

We reverted the support for nested git ignore files.

After a bunch of testing, we realised that there were some issues in the crawling of the file system.

We will implement the feature in a separate PR with a different logic.

Also, FYI the result you're seeing seems expected, because the glob /node_modules/ has a slash at the beginning, which means that the crawler ignores only the node_modules folder at the root of your project.

You should specify **/node_modules instead.

@LinusU
Copy link

LinusU commented Dec 20, 2023

@ematipico

Also, FYI the result you're seeing seems expected, because the glob /node_modules/ has a slash at the beginning, which means that the crawler ignores only the node_modules folder at the root of your project.

This is how Git handles there .gitignore files, when a line begins with / it's anchored to the directory which holds the .gitignore file. I can tell that this works since git doesn't try to add the entire lambda/foobar/node_modules folder when I run git add.

You should specify **/node_modules instead.

I believe that this would be the equivalent to just doing node_modules/, without the leading slash, if one goes by Gits parsing.


Is the intent of your crawler to be compatible with how Git uses .gitignore files?

In that case I believe that some care needs to be taken to account for these edge cases. I'd be happy to lend some help since I've worked with this recently when creating a tool that converts Git ignore files to a Docker ignore file.

I would also recommend to make it compatible with Git from the start, since otherwise it's easy to get stuck in a place where a change to Git's model is desired, but would be considered a breaking change. As seems to have happened to Docker 😅

@ematipico
Copy link
Member Author

I can tell that this works since git doesn't try to add the entire lambda/foobar/node_modules folder when I run git add.

That is another proof that I did well to remove the feature for the time being 😅

Is the intent of your crawler to be compatible with how Git uses .gitignore files?

Yeah, that's the idea. The next release (your current nightly) should be compatible with git blogs inside files.ignore and files.include, so that's a good start.

In that case I believe that some care needs to be taken to account for these edge cases. I'd be happy to lend some help since I've worked with this recently when creating a tool that converts Git ignore files to a Docker ignore file.

That would be awesome! I always appreciate help from people who are more expert than me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Changelog Area: changelog A-CLI Area: CLI A-Core Area: core A-Diagnostic Area: diagnostocis A-LSP Area: language server protocol A-Project Area: project A-Website Area: website
Projects
None yet
4 participants