Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge branch 'filter-by-regex' #57

Closed
wants to merge 8 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions .github/workflows/rust.yml
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,12 @@ jobs:
run: cargo test --color=always -- --color=always
- name: Run tests serde
run: cargo test --features serde --color=always -- --color=always
- name: Run regex tests
working-directory: ./tests/check-regex-filtering
env:
CHRONO_TZ_BUILD_TIMEZONES: (Europe/London|GMT)
run: cargo test --color=always -- --color=always

- name: Check with no default features
run: cargo check --no-default-features --color=always

Expand Down
1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ std = []

[build-dependencies]
parse-zoneinfo = { version = "0.3" }
regex = { default-features = false, version = "1" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it work to make this optional, and put it behind a "filter-timezones" feature? I'm genuinely not sure if you can cfg(feature = "filter-timezones") in a build.rs file.

Copy link
Contributor Author

@PhilipDaniels PhilipDaniels Aug 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't swear to it, but I don't think it's possible. I tried making a commit here and it is easy enough to refactor chrono-tz itself to use the feature and conditionally compile out all the relevant regex code, but then using it from the test example doesn't seem to work. I think features are not passed through as one would want. This might be relevant:rust-lang/cargo#5499


[dev-dependencies]
serde_test = "1"
Expand Down
19 changes: 19 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -160,6 +160,25 @@ lto = true
Otherwise, the additional binary size added by this library may overflow
available program space and trigger a linker error.

## Limiting the Timezone Table to Zones of Interest

`Chrono-tz` by default generates timezones for all entries in the
[IANA database](http://www.iana.org/time-zones). If you are interested
in only a few timezones you can use an environment variable to
select them. The environment variable is called CHRONO_TZ_BUILD_TIMEZONES
and is a regular expression. It should be specified in your top-level build:

```sh
CHRONO_TZ_BUILD_TIMEZONES="(Europe/London|US/.*)" cargo build
```

This can significantly reduce the size of the generated database, depending
on how many timezones you are interested in. Wikipedia has an [article
listing the timezone names](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).

The filtering applied is liberal; if you use a pattern such as "US/.*" then `chrono-tz` will
include all the zones that are linked, such as 'America/Denver', not just 'US/Mountain'.

## Future Improvements

- Handle leap seconds
Expand Down
107 changes: 105 additions & 2 deletions build.rs
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
extern crate parse_zoneinfo;
extern crate regex;

use std::collections::BTreeSet;
use std::collections::{BTreeSet, HashSet};
use std::env;
use std::fs::File;
use std::io::{self, BufRead, BufReader, Write};
Expand All @@ -12,6 +13,8 @@ use parse_zoneinfo::table::{Table, TableBuilder};
use parse_zoneinfo::transitions::FixedTimespan;
use parse_zoneinfo::transitions::TableTransitions;

use regex::Regex;

// This function is needed until zoneinfo_parse handles comments correctly.
// Technically a '#' symbol could occur between double quotes and should be
// ignored in this case, however this never happens in the tz database as it
Expand Down Expand Up @@ -274,7 +277,101 @@ fn write_directory_file(directory_file: &mut File, table: &Table) -> io::Result<
Ok(())
}

/// The name of the environment variable which possibly holds the filter regex.
const FILTER_ENV_VAR_NAME: &str = "CHRONO_TZ_BUILD_TIMEZONES";

/// Checks the CHRONO_TZ_BUILD_TIMEZONES environment variable.
/// Converts it to a regex if set. Panics if the regex is not valid, as we want
/// to fail the build if that happens.
fn get_filter_regex() -> Option<Regex> {
match env::var(FILTER_ENV_VAR_NAME) {
Ok(val) => {
let val = val.trim();
if val.is_empty() {
return None;
}
match Regex::new(val) {
Ok(regex) => Some(regex),
Err(err) => panic!(
"The value '{:?}' for environment variable {} is not a valid regex, err={}",
val, FILTER_ENV_VAR_NAME, err
),
}
}
Err(env::VarError::NotPresent) => None,
Err(env::VarError::NotUnicode(s)) => panic!(
"The value '{:?}' for environment variable {} is not valid Unicode",
s, FILTER_ENV_VAR_NAME
),
}
}

/// Insert a new name in the list of names to keep. If the name has 3
/// parts, then also insert the 2-part prefix. If we don't do this we will lose
/// half of Indiana in `directory.rs`. But we *don't* want to keep one-part names,
/// otherwise we will inevitably end up with 'America' and include too much as
/// a consequence.
fn insert_keep_entry(keep: &mut HashSet<String>, new_value: &str) {
let mut parts = new_value.split('/');
if let (Some(p1), Some(p2), Some(_), None) =
(parts.next(), parts.next(), parts.next(), parts.next())
{
keep.insert(format!("{}/{}", p1, p2));
}

keep.insert(new_value.to_string());
}

/// Filter `table` by applying `filter_regex`.
fn filter_timezone_table(table: &mut Table, filter_regex: Regex) {
// Compute the transitive closure of things to keep.
// Doing this, instead of just filtering `zonesets` and `links` by the
// regiex, helps to keep the `structure()` intact.
let mut keep = HashSet::new();
for (k, v) in &table.links {
if filter_regex.is_match(k) {
insert_keep_entry(&mut keep, k);
}
if filter_regex.is_match(v) {
insert_keep_entry(&mut keep, v);
}
}

let mut n = 0;
loop {
let len = keep.len();

for (k, v) in &table.links {
if keep.contains(k) && !keep.contains(v) {
insert_keep_entry(&mut keep, v);
}
if keep.contains(v) && !keep.contains(k) {
insert_keep_entry(&mut keep, k);
}
}

if keep.len() == len {
break;
}

n += 1;
if n == 50 {
println!("cargo:warning=Recursion limit reached while building filter list");
break;
}
}

// Actually do the filtering.
table.links.retain(|k, v| keep.contains(k) || keep.contains(v));

table
.zonesets
.retain(|k, _| filter_regex.is_match(&k) || keep.iter().any(|s| k.starts_with(s)));
}

fn main() {
println!("cargo:rerun-if-env-changed={}", FILTER_ENV_VAR_NAME);

let parser = LineParser::new();
let mut table = TableBuilder::new();

Expand Down Expand Up @@ -310,10 +407,16 @@ fn main() {
Line::Space => {}
}
}
let table = table.build();

let mut table = table.build();
if let Some(filter_regex) = get_filter_regex() {
filter_timezone_table(&mut table, filter_regex);
}

let timezone_path = Path::new(&env::var("OUT_DIR").unwrap()).join("timezones.rs");
let mut timezone_file = File::create(&timezone_path).unwrap();
write_timezone_file(&mut timezone_file, &table).unwrap();

let directory_path = Path::new(&env::var("OUT_DIR").unwrap()).join("directory.rs");
let mut directory_file = File::create(&directory_path).unwrap();
write_directory_file(&mut directory_file, &table).unwrap();
Expand Down
9 changes: 9 additions & 0 deletions tests/check-regex-filtering/Cargo.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
[package]
name = "check-regex-filtering"
version = "0.1.0"
authors = ["Philip Daniels <Philip.Daniels1971@gmail.com>"]
edition = "2018"

[dependencies]
chrono = "0.4"
chrono-tz = { path = "../../", default-features = false }
67 changes: 67 additions & 0 deletions tests/check-regex-filtering/src/lib.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
/// This test is compiled by the Github workflows with the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fantastic, exactly what I hoped for.

Could you either either document the way that this checks the recursive timezone check, or change the filter to filter out something that would only be filtered out if the filter applied recursively? For example, changing the filter to Europe and then asserting that at least some European countries do not appear. (Assuming that that's a correct implication of the hierarchy, I'm not particularly familiar with all the DB rules.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few more tests and some comments explaining what I think should be included and excluded. This stuff does get tricky...if we take your example of 'Europe' for example you get some Asian timezones - because of Istanbul! It all depends what's in the links, and I guess that is subject to change so we should be careful of how restrictive we are.

/// filter regex set thusly: CHRONO_TZ_BUILD_TIMEZONES="(Europe/London|GMT)"
///
/// We use it to check two things:
/// 1) That the compiled chrono-tz contains the correct timezones (a compilation
/// failure will result if it doesn't).
/// 2) That the compiled chrono-tz DOES NOT contain other, non-matched,
/// timezones. This is rather trickier to do without triggering a compilation
/// failure: we try our best by looking over the TZ_VARIANTS array to try and
/// ascertain if it contains anything obviously wrong.

#[cfg(test)]
mod tests {
use chrono::offset::TimeZone;
use chrono_tz::{Europe, Europe::London, Tz, TZ_VARIANTS};
use std::str::FromStr;

#[test]
fn london_compiles() {
// This line will be a compilation failure if the code generation
// mistakenly excluded Europe::London.
let _london_time = London.ymd(2013, 12, 25).and_hms(14, 0, 0);
assert_eq!("Europe/London", London.name());

// Since London is included, converting from the corresponding
// string representation should also work.
assert_eq!(Tz::from_str("Europe/London"), Ok(London));

// We did not explicitly ask for Isle Of Man or Belfast in our regex, but there is a link
// from Europe::London to Isle_of_Man and Belfast (amongst others)
// so these conversions should also work.
assert_eq!(Tz::from_str("Europe/Isle_of_Man"), Ok(Europe::Isle_of_Man));
assert_eq!(Tz::from_str("Europe/Belfast"), Ok(Europe::Belfast));
}

#[test]
fn excluded_things_are_missing() {
// Timezones from outside Europe should not be included.
// We can't test all possible strings, here we just handle a
// representative set.
assert!(Tz::from_str("Australia/Melbourne").is_err());
assert!(Tz::from_str("Indian/Maldives").is_err());
assert!(Tz::from_str("Mexico/BajaSur").is_err());
assert!(Tz::from_str("Pacific/Kwajalein").is_err());
assert!(Tz::from_str("US/Central").is_err());

// The link table caused us to include some extra items from the UK (see
// `london_compiles()`), but it should NOT include various other timezones
// from around Europe since there is no linkage between them.
assert!(Tz::from_str("Europe/Brussels").is_err());
assert!(Tz::from_str("Europe/Dublin").is_err());
assert!(Tz::from_str("Europe/Warsaw").is_err());

// Also, entire continents outside Europe should be excluded.
for tz in TZ_VARIANTS.iter() {
assert!(!tz.name().starts_with("Africa"));
assert!(!tz.name().starts_with("Asia"));
assert!(!tz.name().starts_with("Australia"));
assert!(!tz.name().starts_with("Canada"));
assert!(!tz.name().starts_with("Chile"));
assert!(!tz.name().starts_with("Indian"));
assert!(!tz.name().starts_with("Mexico"));
assert!(!tz.name().starts_with("Pacific"));
assert!(!tz.name().starts_with("US"));
}
}
}