Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add migrate_to_toml_config.py script to automatically update INI config files to TOML #9054

Merged
merged 4 commits into from
Feb 7, 2020

Conversation

Eric-Arellano
Copy link
Contributor

This script removes 80% of the tedium of migrating from INI config files to TOML config files.

It does not attempt to handle the remaining 20%:

  • multiline options
  • comments
  • nested lists and nested dictionaries

Users will be able to manually fix the remaining issues through either the validation from their editor or the site https://www.toml-lint.com.

--

Similar to fix_deprecated_globs_usage.py, this is not officially shipped with the pantsbuild.pants wheel, but we keep it in this repository for version control, tests, and to have a stable URL that we can instruct users to curl.

@cosmicexplorer
Copy link
Contributor

This script removes 80% of the tedium

Yes!

but we keep it in this repository for version control, tests, and to have a stable URL that we can instruct users to curl.

Where will this be documented?

Copy link
Contributor

@cosmicexplorer cosmicexplorer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm totally fine accepting this if we're very confident the regex solution will work 80%, but I feel like the ini parser route might not be hard to drop in instead, and to me it feels much more likely that it would work for whatever our users' code might be, and while it's very good that we've identified the toml lint service to clean up the result, it still seems like there's a lot of value in trying to make the automated park work as much as possible.



def generate_new_config(config: Path) -> List[str]:
original_text = config.read_text()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were entirely correct that I would love the regex, and I am aware that this is an 80% script, but it seems like using a python ini config parser wouldn't be a huge amount of effort, because we'll then be working with parsed objects instead of raw strings. Is there a reason we decided to go the regex route here over an ini parser? I would love to toss up an example ini -> toml parser or pair on it if that seems like the right way to go.

assert result == expected.splitlines()


def test_fully_automatable_config() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing great tests for this script and acknowledging that everything we ship to users should be tested!!! This makes me happy.

@blorente
Copy link
Contributor

blorente commented Feb 4, 2020

I'm on the same boat as @cosmicexplorer. Maybe, if using the full ini parser is too much of a hassle[1], we can fail visibly on the lines we leave alone (e.g. line 45, "opt1: False # Good Riddance" was not converted because it had a comment in it.).

Also on the same boat with regards to tests. Thanks for them!

[1]: Or if it swallows comments, which I'm not sure it does, but if it does it's a dealbreaker.

@Eric-Arellano
Copy link
Contributor Author

Is there a reason we decided to go the regex route here over an ini parser?

Yes. See #9052 (comment). The issue with using a parser is that you lose information meaningful to the user, such as comments and whitespace between sections. Parsers throw away information like a section having 4 blank lines between the next section vs. only 1 blank line, and IMO we should do our best to preserve that.

We could possibly combine the parser with a regex-based approach. We do that in fix_deprecated_globs_usages.py: use ast to parse the globs() function and then use a simple if # in line check to nope out of the change if it has a comment. But, the allowable symbols are much more complex here than in a globs() function, so we would still need to keep the pre-existing regex to detect if there are comments. In other words, we'd need both the regex and the configparser. At that point, I'm not sure configparser would bring much value?

Outside of converting multiline options, comments, and nested lists/dictionaries, are there any use cases this script does not do that you think it should? Or one of those 3 that you think are important to automatically fix?

Maybe, if using the full ini parser is too much of a hassle[1], we can fail visibly on the lines we leave alone

I agree that this is nice to have, but did not pursue generating warnings—unlike with fix_deprecated_globs_usages.py—because a) the scope updating 1-3 config files is much smaller than updating hundreds of BUILD files and b) there are good tools to draw attention already to what the remaining issues are (editors and https://www.toml-lint.com/).

--

Generally, with this script, my main pitch is the Pareto principle. It would be neat if we could get this script converting things 100% perfectly, but that would mean far more time into this script than I have budgeted.

I don't expect users to have more than 50-500 lines of config(s). If this converts 80%, that leaves them 10-50 lines to manually fix. They'll be able to use all these tools to figure out how to fix those remaining issues:

  • the example provided by the 80% we did convert
  • our email explaining the highlights, especially with list options and dict options
  • their editors
  • https://www.toml-lint.com/
  • the TOML specification

So, I am not too concerned about getting us from 80% to 100%.

@cosmicexplorer
Copy link
Contributor

Parsers throw away information like a section having 4 blank lines between the next section vs. only 1 blank line, and IMO we should do our best to preserve that.

This should probably read "the existing set of ini parsers". "Good" parsers will expose every part of the language to the user, including comments. I haven't immediately been able to find one that does this and I've ruled out writing one from scratch, but it might be possible to dive into the ConfigParser code and make some subclass which does retain this information.

Outside of converting multiline options, comments, and nested lists/dictionaries, are there any use cases this script does not do that you think it should?

I'm concerned that this script might e.g. accidentally miss a compiler warnings option and secretly cause people to start missing bugs. This should be an extremely important concern for us to avoid, imho. The twitter repo makes heavy use of compiler_option_sets, which you'll note is an option that has both compiler warnings options and is also a nested dict.

I'm a big fan of @blorente's suggestion to scream loudly if something is recognized as a construct that's not supported by this script. If it did something like:

[compile.zinc]
####### UNTRANSLATED INI CONTENT -- MUST BE HAND-CONVERTED TO TOML
compiler_option_sets: {
  'asdf': [...],
  ...
}
####### END UNTRANSLATED INI CONTENT

@Eric-Arellano
Copy link
Contributor Author

I'm concerned that this script might e.g. accidentally miss a compiler warnings option and secretly cause people to start missing bugs.

I don't think this is possible. The TOML file will not load correctly when running ./pants if it is not entirely valid TOML. They can't silently have bugs - the bugs are loud.

suggestion to scream loudly if something is recognized as a construct that's not supported by this script

That makes this script much harder to implement. For example, the regex currently will not match if there is a comment, it's a multiline option, or there's a nested list. We don't know which of these 3 causes it is, only that it's not possible to safely fix. Now, we would have to change the regex to check which of those 3 situations is happening so that we can generate the correct warning message.

I don't currently have the time to get that last 20%, especially given that users will see the errors in the 5-50 lines that we could not convert through all the different mechanisms described in #9054 (comment).

Perhaps we could land this and then you can add the extra 20%?

@blorente
Copy link
Contributor

blorente commented Feb 5, 2020

I'm fine with merging without the following, but my additional 2 cents:

We don't know which of these 3 causes it is, only that it's not possible to safely fix.

For me, it would be enough to output "I couldn't fix line x, refer to for a list of possible reasons"
Let me spend a little bit of time to prototype what it would look like.

option, value = parsed_dict_line.groups()
updated_text_lines[i] = f'{option} = """{value}"""'
continue

Copy link
Contributor

@blorente blorente Feb 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I'll go on a limb here and suggest some plaintext code:
This is supposed to go after the last if and before the return

Suggested change
print(f'We couldn't automatically convert line {i}: {line}. Please refer to https://github.com/pantsbuild/pants/pull/9054 for a description of what might be the case.', file=sys.stderr)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the code snippet!

The one challenge with this snippet is that it will trigger on every non-option line, such as every single one of these lines, except for the very last one:

# Comment not attached to any option
[cache.java]
args: [
     'foo',
     'bar'
  ]


next_option: foo

I think that's more output than we would want. We could, of course, teach the script how to handle those different things like what a section header looks like, but then we're getting to the complexity I wanted to avoid.

--

Favor: can you please pull down this PR and run build-support/migration-support/migrate_to_toml_config.py pants.ini pants.travis-ci.ini pants.remote.ini, then open up the new TOML files in your editor? (Or, do this on Twitter's config files).

I suspect you'll find that there aren't many remaining issues and that those errors are fairly intuitive to fix, thanks to the editor warnings. This is why I'm pushing back on going from 80-100%; after trying out the script, it doesn't seem worth it to make it even better.

@benjyw's point is that we don't even need to provide a script in the first place - it's not very difficult to convert from INI to TOML, only a little tedious. We're providing this script as a courtesy to users, and I don't think we need to have it be perfect for them to be happy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think I'd prefer to be overly verbose, but I agree that nobody is going to die for figuring out 20 manual changes. Particularly, because "users" here most likely means the same people that are seeing these changes happen, not actual end-users.

So not going to block on that, thanks for putting the effort to consider this point :)

@blorente blorente self-requested a review February 7, 2020 11:36
@Eric-Arellano
Copy link
Contributor Author

Looks like TOML will land soon after 1-2 more reviews—i.e. we are going to move forward with TOML support—so I'm going to merge this. This PR works without official TOML support as it's only a conversion script. I want to merge now so that we aren't blocked by CI later in the day.

@Eric-Arellano Eric-Arellano merged commit 4fb7572 into pantsbuild:master Feb 7, 2020
@Eric-Arellano Eric-Arellano deleted the ini-to-toml branch February 7, 2020 20:36
Eric-Arellano added a commit that referenced this pull request Feb 7, 2020
## Problem

INI files have some issues that make them frustrating for users:

* Gotcha with when to use quotes:
   * If the option is a string, you cannot use quotes, e.g. `version: ipython==4.8`.
   * If the value appears in a list, you _must_ use quotes, e.g. `config: [".isort.cfg"]`
* Gotcha with indenting lists and dict values:
    * The 7th paragraph of our instructions for new users warns: 

 > Note that the formatting of the plugins list is important; all lines below the plugins: line must be indented by at least one white space to form logical continuation lines. This is standard for Python ini files. See Options for a guide on modifying your pants.ini.

* Awkard to append and filter to a list option at the same time:

```ini
backend_packages: +[
  'pants.backend.scala.lint.scalafmt'],-[
  'pants.backend.codegen.antlr.java',
  'pants.backend.codegen.antlr.python',
  'pants.backend.codegen.wire.java']
```

* No defined standard for INI. Users may try using INI features not supported or interpreted differently by Python's `configparser`.
* No syntax highlighting or validation from editors, given that there's no standard.

## Why TOML?

### PEP 518's thoughts on file formats
Python core developers recently researched modern config values to choose the format for `pyproject.toml`. They compiled their results into [PEP 518](https://www.python.org/dev/peps/pep-0518/#other-file-formats), comparing INI, JSON, YAML, and TOML.

Reasons they rejected INI:
* No standard.

Reasons they rejected JSON:
* Difficult for humans to edit, e.g. no support for comments.

Reasons they rejected YAML:
* Specification is 86 pages when printed!
* Not safe by default due to arbitrary execution of code.
* PyYAML is large and has a C extension.

Reasons they liked TOML:
* Simple while still being flexible enough.
* Rust core developers said they have been "quite happy" with using TOML for Cargo.

### Reasons for us to use TOML
* Very similar syntax to INI, as TOML was inspired by INI.
    * Less churn for users.
* Simpler than YAML.
    * I don't think we want users using anchors/aliases, for example.
* Alignment with Rust and Python ecosystems.

## Solution

Implement a `config._TomlValues` class that behaves identically to `_IniValues`. Both of these are subclassses of `_ConfigValues`, meaning that the two config formats may be interchanged without any file outside of `config.py` knowing which config format was originally used. 

We use extremely comprehensive unit tests to ensure that `_TomlValues` behaves identically to `_IniValues`.

### How we implement interpolation

TOML does not have native support for string interpolation, i.e. substituting `%(foo)s` for the `DEFAULT` value `foo`. So, we provide our own implementation that behaves identically.

### How we implement list options

TOML has native list support, but does not have syntax for `+[]` and `-[]`. 

We could require users to use multiline strings, like `config = """+[".isort.cfg"],-["bad.cfg"]"""`. This will work, and in fact the format we convert into in `config.py` so that the rest of Pants understands the value.

But, `config = """+[".isort.cfg"],-["bad.cfg"]"""` is clunky, so we introduce syntatic sugar:

```toml
[jvm]
options = ["-Xmx1g"]

[isort]
config.add= [".isort.cfg"]
config.remove = ["bad.cfg"]
```

This sugar allows users to use native TOML lists (more ergonomic than strings) and makes it easy to both add to and remove from a list option at the same time.

### How we implement dict options

TOML has native support for dicts through both [Tables](https://github.com/toml-lang/toml#user-content-table) and [Inline Tables](https://github.com/toml-lang/toml#inline-table).

However, we already use Tables to implement the distinct option scopes. So, if we also used Tables to implement dictionary values, we would introduce an ambiguity whether a table refers to an options scope or a dictionary value. Originally, this PR tried to take this approach, but quickly rejected support for using Tables because it dramatically complicates the solution and would cause a leaky API.

Instead, users should use a multiline string for dictionary options:

```toml
[jvm-platform]
default_platform = "java8"
platforms = """
{
  "java7": {"source": 7, "target": 7, "args": []},
  "java8": {"source": 8, "target": 8, "args": []},
  "java9": {"source": 9, "target": 9, "args": []},
}
"""
``` 

(Note the indentation - this would error in INI because of being indented too much to the left, but it works in TOML.)

## Remaining followup

1. [ ] Update our own usage to use TOML for dogfooding.
     * We'll also update Toolchain's config.
1. [x] Create a script that will automatically convert 80%-90% of INI files to TOML.
     * See #9054.
1. [ ] Update our docs to use TOML.
1. [ ] Deprecate INI.
     * Will allow us to remove a lot of awkwardness around the TOML implementation.
     * Will allow us to confidently generate suggested config values in help/deprecation/error messages, whereas now we have to deal with things like indentation of lists.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants