Inclusive schema name filtering, and file logging changes #218
Conversation
I upvote this PR; please consider testing and merging/approving it. Thanks!
Thanks for your contributions @devinshteyn2! The work may now begin: we have a lot to talk about in your PR. I'll try to give you an overall review first and then use the GitHub UI for a detailed review. First thing I notice is that we should split the PR into several with contained changes. For instance the log file is a good idea, but it requires its own review (command line switch, rotate on SIGHUP signal, etc.).
About the log file: that was a workaround on my end to facilitate faster debugging when testing with many schema objects, and grepping logs for errors. IMHO, file logging is also a good thing for use in automation projects and pipelines where running things manually from the console isn't an option, like in a production environment. If you like the idea, I can add a new parameter; please name it. But why stop there? I would also suggest adding a separate option for log formatting. Logging in plain text format isn't always the best option. Modern log monitoring solutions, especially in cloud-based environments, prefer JSON for ease of log parsing and aggregation. It seems that the existing log library doesn't support that; it may need to be updated to support different formatting options. Optional suppression of color-coding symbols might be desired too, so they don't get in the way of integrated log monitoring in Splunk, CloudWatch, Azure Log Monitor, and the like.
Looking forward to your PR review and further comments.
Please have a look at #234 that implements your ideas about the log file, including JSON support. Note that we already have support for bypassing colors when we detect that stderr is not a tty, so that should be fine for logging pipelines already. |
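The color-bypass behaviour mentioned above boils down to a tty check on the output stream. A minimal sketch of that idea follows; the function name is illustrative, not pgcopydb's actual API:

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Sketch of the color-bypass idea: emit ANSI color codes only when the
 * stream is attached to a terminal, so logs piped into files or log
 * collectors stay plain text. Illustrative only, not pgcopydb's API.
 */
static bool
should_colorize(FILE *stream)
{
	return isatty(fileno(stream)) == 1;
}
```

When stderr is redirected to a file or a pipe, `isatty` returns 0 and colors are skipped, which is why existing logging pipelines are already safe.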
Here is a first round of review. In the meantime several other PRs have made it to the main branch that require a rebase and your attention, with changes in areas you're also working on. In particular, PR #247 introduces support for SourceFilters in dump_restore.c in a way that you can benefit from now.
README.md
Outdated
- clone    Clone an entire database from source to target
- fork     Clone an entire database from source to target
+ clone    Clone an entire database from source to target, schema and table
+          filters may apply (see Filters for more info)
+ fork     Clone an entire database from source to target, schema and table
+          filters may apply (see Filters for more info)
  follow   Replay changes from the source database to the target database
- copy-db  Clone an entire database from source to target
+ copy-db  Clone an entire database from source to target, schema and table
+          filters may apply (see Filters for more info)
I would prefer the short descriptions of the commands to fit on a single line, even if that means we don't mention filtering abilities. Also, we don't have filtering support for logical decoding at the moment. Let's revert that change.
README.md
Outdated
## Filters

Filtering allows to skip some object definitions and data when copying from the source to the target database.
The pgcopydb commands accept the option --filter (or --filters) which expects an existing filename as the option argument.
The given filename is read in the INI file format, but only uses sections and option keys. Option values are not used.
Filtering supports two kinds of filters - exclusion and inclusion-only filters. Their complete description and examples
can be found in pgcopydb documentation.

The following filter types can be used for exclusions:
exclude-schema
exclude-index
exclude-table-data
exclude-table

The following filter types can be used for inclusions only:
include-only-schema
include-only-table

Multiple filters can be specified in the same filter file, yet not all combinations of filters are compatible and can be used
together. For example, using include-only-schema together with exclude-table is fine, but using exclude-schema together with
include-only-schema is not.
The README is just a README. See the docs directory for full documentation, including https://pgcopydb.readthedocs.io/en/latest/ref/pgcopydb_config.html which details our filtering options. This PR should probably edit the docs to enhance our filtering support with the added functionality, and maybe make it more visible in the general discussions.
src/bin/pgcopydb/dump_restore.c
Outdated
&(specs->filters.includeOnlySchemaList),
&(specs->filters.excludeSchemaList)))
It should be enough to pass the filter itself as an argument and let copydb_dump_source_schema deal with the detailed filtering support.
src/bin/pgcopydb/dump_restore.c
Outdated
&(specs->filters.includeOnlySchemaList),
&(specs->filters.excludeSchemaList)))
Same comment as before: let's pass the whole SourceFilters structure here as an argument and let the function deal with the support it has for it.
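The suggested refactoring can be sketched as follows. The struct layouts below are hypothetical stand-ins for pgcopydb's real types; they only show the shape of passing one SourceFilters argument instead of its individual lists:

```c
#include <stdbool.h>
#include <stddef.h>

/* hypothetical stand-ins for pgcopydb's filter types */
typedef struct SourceFilterSchemaList
{
	size_t count;
} SourceFilterSchemaList;

typedef struct SourceFilters
{
	SourceFilterSchemaList includeOnlySchemaList;
	SourceFilterSchemaList excludeSchemaList;
} SourceFilters;

/*
 * Instead of passing &(filters.includeOnlySchemaList) and
 * &(filters.excludeSchemaList) separately, pass the whole structure
 * and let the callee pick the lists it supports.
 */
static bool
uses_schema_filters(const SourceFilters *filters)
{
	return filters->includeOnlySchemaList.count > 0
		|| filters->excludeSchemaList.count > 0;
}
```

The callers then stay unchanged when a new filter list is added to the structure; only the callee grows support for it.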
src/bin/pgcopydb/dump_restore.c
Outdated
/* if restoring specific schemas as specified in the inclusion filter
   make sure they exist in the target database, if not create them.
 */
if (specs->filters.includeOnlySchemaList.count > 0)
{
	if (!copydb_target_create_snpname(specs))
	{
		/* errors have already been logged */
		return false;
	}
}
Why is that needed in the workflow? pg_dump and pg_restore should be taking care of that, right?
We need either to comment on why pg_restore fails to restore the needed schema, or better yet fix our code to make sure that pg_restore does the work for us here.
src/bin/pgcopydb/filtering.c
Outdated
/*
 * schemaFiltersJoin concatenates elements of a filter array to
 * a single string (for use with pg_dump and pg_restore)
 */
char *
schemaFiltersJoin(char *dest, size_t dest_size, SourceFilterSchemaList *list, char *separator)
{
	size_t separator_size = strlen(separator);
	char *target = dest;                 /* start of buffer, where to copy to */
	char *target_end = dest + dest_size; /* end of buffer, can't go beyond that */
	*target = '\0';
	size_t nspname_size = 0;

	for (size_t i = 0; i < list->count; i++)
	{
		nspname_size = strlen(list->array[i].nspname);

		if (i > 0) /* first element or not? */
		{
			if (target_end <= (target + separator_size + nspname_size)) /* no more space */
			{
				return dest;
			}

			/* add separator */
			strcat(target, separator);
			target += separator_size;
		}
		else if (target_end <= (target + nspname_size)) /* no more space */
		{
			return dest;
		}

		/* add element */
		strcat(target, list->array[i].nspname);
		target += nspname_size; /* move pointer to the end of string */
	}

	return dest;
}
Do we really need that function? It looks like something that would better be done at the SQL level, or at least exposing the schemas as a VALUES statement (one row per namespace, rather than a composite value). If you really want to send the namespace list as a composite value, could we maybe prepare a JSON array and send its text representation then?
Otherwise, if we need to keep that code, please use the project formatting rules, and then use a PQExpBuffer API to build the string.
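For comparison, here is a self-contained sketch of a bounded join. It uses snprintf for truncation checks where the review suggests libpq's PQExpBuffer API (createPQExpBuffer / appendPQExpBufferStr), which grows dynamically and would avoid the fixed-size buffer altogether; the function name and signature are illustrative:

```c
#include <stdio.h>
#include <string.h>

/*
 * Join a list of names into dest with a separator, never writing past
 * dest_size. snprintf reports the length it would have needed, which
 * makes the truncation check explicit. Illustrative sketch only.
 */
static char *
join_names(char *dest, size_t dest_size, const char **names, size_t count,
		   const char *separator)
{
	size_t used = 0;
	dest[0] = '\0';

	for (size_t i = 0; i < count; i++)
	{
		int n = snprintf(dest + used, dest_size - used, "%s%s",
						 i > 0 ? separator : "", names[i]);

		if (n < 0 || (size_t) n >= dest_size - used)
		{
			dest[used] = '\0'; /* drop the truncated element */
			break;
		}
		used += (size_t) n;
	}
	return dest;
}
```

With PQExpBuffer the loop body would instead call appendPQExpBufferStr twice and no size bookkeeping would be needed.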
@@ -32,6 +33,7 @@ size_t ps_buffer_size; /* space determined at run time */
size_t last_status_len; /* use to minimize length of clobber */

Semaphore log_semaphore = { 0 }; /* allows inter-process locking */
FILE *log_fp = NULL; /* file handle for optional log file */
Please rebase with the current main branch, where support for logging to file (and optionally in JSON format) has been added thanks to your suggestion. You can now remove your own approach to it ;-)
src/bin/pgcopydb/pgcmd.c
Outdated
const SourceFilterSchemaList *includeSchemaList,
const SourceFilterSchemaList *excludeSchemaList)
SourceFilters
src/bin/pgcopydb/pgcmd.c
Outdated
const SourceFilterSchemaList *includeSchemaList,
const SourceFilterSchemaList *excludeSchemaList)
SourceFilters
# copy one schema only
pgcopydb clone --${LOG_LEVEL} --source=${PGCOPYDB_SOURCE_PGURI} --target=${PGCOPYDB_TARGET_PGURI} --filter=./include-1-schema.ini --no-acl --no-owner
Can we add a check that the other two schemas were indeed not copied over?
ping @devinshteyn2; do you think you will have more time to spend on finishing this PR?
Hi. Unfortunately I've been completely swamped at work and haven't had a chance to review the latest changes and merge them into my fork. I may or may not have an opportunity next week to return to the project; I would say there is a 50% chance of that happening. I will keep you posted on the status.
Quick status update for you: I merged the latest changes from the main branch into my local fork today. I'm going to spend some time tomorrow on testing, and if all goes well, commit and push them back.
Pushed the merged code with minor additions, reverted README.md to your liking, and addressed, I think, all comments in the PR except going back to a single filter parameter. I can look into that tomorrow. IMHO, the filtering interface is hidden too deep to be found by many people: not a single reference appears in the original README, nor in the help output.
Thanks for taking the time and effort to prepare this PR for merge; still some more to do!
.gitignore
Outdated
@@ -41,3 +41,5 @@ lib*.pc
/env/
/GIT-VERSION-FILE
/version
/tests/schema-filter
.vscode/
Please refrain from adding personal preferences to the repository, add .vscode to your local .gitignore setup!
src/bin/pgcopydb/dump_restore.c
Outdated
/* if restoring specific schemas as specified in the inclusion filter
   make sure they exist in the target database, if not create them.
 */
if (specs->filters.includeOnlySchemaList.count > 0)
{
	if (!copydb_target_prepare_namespaces(specs))
	{
		/* errors have already been logged */
		return false;
	}
}
I thought we would take care of that automatically within the pg_dump and pg_restore calls. If that's not happening, I think the comment ought to say why and give some context. I would prefer that we don't have to create the schemas here, that said, because what if the user made a typo in their filtering setup? Then we create unwanted schemas on the target databases...
Also nitpicking: Please follow the code style in use all around in the same code/file. It looks like this:
/*
* If restoring specific schemas as specified in the inclusion filter
* make sure they exist in the target database, if not create them.
*/
Note the capital, empty lines, and stars at the beginning of all lines within the comment.
/* duplicate output to log file if PGCOPYDB_LOG env var is set.
 * Warning: if the value is incorrect, it may cause segfault.
 */
if (env_exists("PGCOPYDB_LOG"))
{
	char env_log_file[BUFSIZE] = { 0 };

	if (get_env_copy("PGCOPYDB_LOG", env_log_file, BUFSIZE) > 0)
	{
		/* extra parenthesis around assignment to avoid compiler warnings */
		if ((log_fp = fopen(env_log_file, "w")))
		{
			log_set_fp(log_fp);
		}
		else
		{
			log_error("Failed to open log file for writing %s. Error reason: %s",
					  env_log_file, strerror(errno));
			exit(EXIT_CODE_BAD_ARGS);
		}
	}
}
Please remove that part; the corresponding feature has been added in the meantime already. See https://pgcopydb.readthedocs.io/en/latest/ref/pgcopydb_clone.html#environment and PGCOPYDB_LOG_TIME_FORMAT, PGCOPYDB_LOG_JSON, PGCOPYDB_LOG_FILENAME, and PGCOPYDB_LOG_JSON_FILE.
src/bin/pgcopydb/pgcmd.c
Outdated
/* verbose output */
//args[argsIndex++] = "--verbose";
I suspect this is meant to be removed later, before final review and merge? You might also be able to use log_get_level to selectively add --verbose to the pg_dump call when using --debug or --trace at the pgcopydb level?
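The suggested log_get_level approach can be sketched like this; the enum values are illustrative placeholders for the log library's own level constants:

```c
#include <stdbool.h>

/* illustrative placeholders for the log library's level enum */
enum log_level { LOG_TRACE, LOG_DEBUG, LOG_INFO, LOG_WARN, LOG_ERROR };

/*
 * Only add --verbose to the pg_dump argument list when pgcopydb itself
 * runs with --debug or --trace, instead of hard-coding the flag.
 */
static bool
want_pg_dump_verbose(enum log_level level)
{
	return level <= LOG_DEBUG; /* trace or debug */
}
```

In pgcmd.c this predicate would guard the `args[argsIndex++] = "--verbose";` line, with the real log_get_level() call supplying the current level.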
#define PG_CMD_MAX_ARG 256
#define PG_DUMP_CMD_RESERVED_ARG 10
#define PG_RESTORE_CMD_RESERVED_ARG 15
Nice addition, still deserves some commenting.
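One possible way to address the commenting remark; the values are the PR's own, while the explanations are an assumed reading of what the constants reserve and would need checking against the actual option counts:

```c
/* hard upper bound on argv slots prepared for a pg_dump/pg_restore call */
#define PG_CMD_MAX_ARG 256

/*
 * argv slots reserved for each tool's fixed options, leaving the remaining
 * slots for filter-driven --schema / --exclude-schema arguments
 * (assumed reading of the PR; pg_restore takes a few more fixed options)
 */
#define PG_DUMP_CMD_RESERVED_ARG 10
#define PG_RESTORE_CMD_RESERVED_ARG 15
```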
/* include-only-schema */
" left join pg_temp.filter_include_only_schema fn "
"        on n.nspname = fn.nspname "

/* include-only-table */
" join pg_temp.filter_include_only_table inc "
"   on n.nspname = inc.nspname "
"  and c.relname = inc.relname "
" left join pg_temp.filter_include_only_table ft "
"        on n.nspname = ft.nspname "
"       and c.relname = ft.relname "
Why use a LEFT JOIN for an “include-only” filter implementation? I expected a JOIN here.
There are two left joins, one for the inclusive-schema filter and another for the inclusive-table filter; both can be handled by the same code, which avoids unions with nearly duplicate code.
Ah yeah, nice trick!
Removed redundant parameters in pg_dump_db and pg_restore_db. Cleaned up old comments.
Committed the latest version with code cleanup as per your comments. Please review.
I just found some time to review this PR again, and when playing with the SQL queries and filtering I realized the behaviour was a little surprising. Thinking about it more, what I wanted from this feature is the exact same behaviour as the already existing "exclude-schema" filter, just built the other way around, by listing the schema names to filter in rather than the schema names to filter out. Maybe the best way to explain it is this part of the implementation for the new "include-only-schema":

insert into pg_temp.filter_exclude_schema
     select n.nspname
       from pg_namespace n
            left join pg_temp.filter_include_only_schema inc
                   on n.nspname = inc.nspname
      where inc.nspname is null;
Thank you very much for merging this PR! If I understand correctly, you expected an inverse filter: having multiple exclusion filters inserted into the filter table, leaving just one schema (or a few) to be copied. Perhaps that inverse method required much fewer code changes; can't argue with that. In our use case we deal with hundreds, sometimes thousands, of schemas in a single database, and we copy just one when we need to move it to a different server. Exclusion filters feel very non-intuitive in such cases. Maybe that's why the inverse method feels to me like... if I go to a grocery store to buy an apple, I bring all the other apples in the store to the cashier and tell them these are the apples I don't want to buy :-)
I agree it's counter-intuitive. Making the SQL query actually work with the "include-only-schema" approach was also very complex, because when the
Please review the proposed changes