
Inclusive schema name filtering, and file logging changes #218

Closed
wants to merge 7 commits

Conversation

devinshteyn2

Please review the proposed changes

@lospejos
Contributor

lospejos commented Apr 1, 2023

I upvote this PR; please consider testing and merging/approving it. Thanks!

@dimitri
Owner

dimitri commented Apr 3, 2023

Thanks for your contributions, @devinshteyn2! The work may now begin: we have a lot to talk about in your PR. I'll give you an overall review first and then use the GitHub UI for a detailed review. The first thing I notice is that we should split the PR into several PRs with self-contained changes. For instance, the log file is a good idea, but it requires its own review (command-line switch, rotation on SIGHUP, etc.).

@devinshteyn2
Author

About the log file: that was a workaround on my end to speed up debugging when testing with many schema objects and grepping the logs for errors.

IMHO, file logging is also useful in automation projects and pipelines where running things manually from a console isn't an option, as in a production environment. If you like the idea, I can add a new parameter; please name it. But why stop there? I would also suggest adding a separate option for log formatting. Plain text isn't always the best format: modern log monitoring solutions, especially in cloud-based environments, prefer JSON for ease of log parsing and aggregation. It seems the existing log library doesn't support that, so it may need to be updated to support different formatting options. Optional suppression of color-coding escape sequences might be desirable too, so they don't get in the way of integrated log monitoring in Splunk, CloudWatch, Azure Monitor, and the like.

@devinshteyn2
Author

Looking forward to your PR review and further comments

@dimitri
Owner

dimitri commented Apr 5, 2023

Please have a look at #234, which implements your ideas about the log file, including JSON support. Note that we already have support for bypassing colors when we detect that stderr is not a tty, so logging pipelines should already be fine.
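
For reference, that kind of tty detection is typically a one-liner along these lines (an illustration of the technique, not the actual pgcopydb code):

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* illustration only: use colors on stderr only when it is a terminal,
 * so redirected output (logging pipelines) stays free of escape codes */
static bool
want_colors(void)
{
	return isatty(fileno(stderr)) == 1;
}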

Owner

@dimitri dimitri left a comment

Here is a first round of review. In the meantime several other PRs have made it to the main branch that require a rebase and your attention, with changes in areas you're also working on. In particular PR #247 introduces support for SourceFilters in dump_restore.c in a way that you can benefit from now.

README.md Outdated
Comment on lines 49 to 55
- clone    Clone an entire database from source to target
- fork     Clone an entire database from source to target
+ clone    Clone an entire database from source to target, schema and table
+          filters may apply (see Filters for more info)
+ fork     Clone an entire database from source to target, schema and table
+          filters may apply (see Filters for more info)
  follow   Replay changes from the source database to the target database
- copy-db  Clone an entire database from source to target
+ copy-db  Clone an entire database from source to target, schema and table
+          filters may apply (see Filters for more info)
Owner

I would prefer the short description of each command to fit on a single line, even if that means we don't mention the filtering abilities. Also, we don't have filtering support for logical decoding at the moment. Let's revert that change.

README.md Outdated
Comment on lines 329 to 350
## Filters
Filtering allows skipping some object definitions and data when copying from the source to the target database.
The pgcopydb commands accept the --filter (or --filters) option, which expects an existing filename as its argument.
The given file is read in the INI file format, but only sections and option keys are used; option values are ignored.
Filtering supports two kinds of filters, exclusion and inclusion-only filters. Their complete description and examples
can be found in the pgcopydb documentation.

The following filter types can be used for exclusions:
exclude-schema
exclude-index
exclude-table-data
exclude-table

The following filter types can be used for inclusions only:
include-only-schema
include-only-table

Multiple filters can be specified in the same filter file, yet not all combinations of filters are compatible. For
example, using include-only-schema together with exclude-table is fine, but using exclude-schema together with
include-only-schema is not.


Owner

The README is just a README. See the docs directory for full documentation, including https://pgcopydb.readthedocs.io/en/latest/ref/pgcopydb_config.html, which details our filtering options. This PR should probably edit the docs to cover the added filtering functionality, and maybe make it more visible in the general discussions.

Comment on lines 65 to 66
&(specs->filters.includeOnlySchemaList),
&(specs->filters.excludeSchemaList)))
Owner

It should be enough to pass the filter itself as an argument and let copydb_dump_source_schema deal with the detailed filtering support.

Comment on lines 96 to 97
&(specs->filters.includeOnlySchemaList),
&(specs->filters.excludeSchemaList)))
Owner

Same comment as before: let's pass the whole SourceFilters structure here as an argument and let the function deal with whatever support it has for it.
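
Concretely, the call sites quoted above would then pass something like the following (a sketch of the suggested change, keeping the rest of each call as it is):

&(specs->filters)   /* the whole SourceFilters, instead of its two schema lists */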

Comment on lines 160 to 172
/* if restoring specific schemas as specified in the inclusion filter
make sure they exist in the target database, if not create them.
*/
if (specs->filters.includeOnlySchemaList.count > 0)
{
if (!copydb_target_create_snpname(specs))
{
/* errors have already been logged */
return false;
}
}


Owner

Why is that needed in the workflow? pg_dump and pg_restore should be taking care of that, right?

We either need a comment explaining why pg_restore fails to restore the needed schemas, or better yet a fix to our code to make sure that pg_restore does the work for us here.

Comment on lines 439 to 473
/*
* schemaFiltersJoin concatenates elements of a filter array into
* a single string (for use with pg_dump and pg_restore)
*/
char *
schemaFiltersJoin(char *dest, size_t dest_size, SourceFilterSchemaList *list, char *separator)
{
size_t separator_size = strlen(separator);
char *target = dest; /* start of buffer, where to copy to */
char *target_end = dest + dest_size; /* end of buffer, can't go beyond that */
*target = '\0';
size_t nspname_size = 0;

for (size_t i = 0; i < list->count; i++)
{
nspname_size = strlen(list->array[i].nspname);

if (i > 0) /* first element or not? */
{
if (target_end <= (target + separator_size + nspname_size)) /* no more space */
return dest;
/* add separator */
strcat(target, separator);
target += separator_size;
}
else if (target_end <= (target + nspname_size)) /* no more space */
return dest;

/* add element */
strcat(target, list->array[i].nspname);
target += nspname_size; /* move pointer to the end of string */
}

return dest;
}
Owner

Do we really need that function? It looks like something that would be better done at the SQL level, or at least by exposing the schemas as a VALUES statement (one row per namespace rather than a composite value). If you really want to send the namespace list as a composite value, could we maybe prepare a JSON array and send its text representation instead?

Otherwise, if we need to keep that code, please follow the project formatting rules, and use the PQExpBuffer API to build the string.
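
For reference, a PQExpBuffer-based version could look roughly like this (a hypothetical sketch assuming the SourceFilterSchemaList layout from this PR; out-of-memory checking via PQExpBufferBroken is omitted):

#include <string.h>
#include "pqexpbuffer.h"

/*
 * Sketch only: join the schema names with a separator. The buffer grows
 * as needed, so no manual bounds checking is required. The caller owns
 * (and must free) the returned string.
 */
static char *
schema_list_join(SourceFilterSchemaList *list, const char *separator)
{
	PQExpBuffer buf = createPQExpBuffer();

	for (size_t i = 0; i < list->count; i++)
	{
		if (i > 0)
		{
			appendPQExpBufferStr(buf, separator);
		}
		appendPQExpBufferStr(buf, list->array[i].nspname);
	}

	char *result = strdup(buf->data);
	destroyPQExpBuffer(buf);

	return result;
}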

@@ -32,6 +33,7 @@ size_t ps_buffer_size; /* space determined at run time */
size_t last_status_len; /* use to minimize length of clobber */

Semaphore log_semaphore = { 0 }; /* allows inter-process locking */
FILE *log_fp = NULL; /* file handle for optional log file */
Owner

Please rebase onto the current main branch, where support for logging to a file (optionally in JSON format) has been added thanks to your suggestion. You can now remove your own approach to it ;-)

Comment on lines 344 to 345
const SourceFilterSchemaList *includeSchemaList,
const SourceFilterSchemaList *excludeSchemaList)
Owner

SourceFilters

Comment on lines 766 to 767
const SourceFilterSchemaList *includeSchemaList,
const SourceFilterSchemaList *excludeSchemaList)
Owner

SourceFilters

Comment on lines +29 to +30
# copy one schema only
pgcopydb clone --${LOG_LEVEL} --source=${PGCOPYDB_SOURCE_PGURI} --target=${PGCOPYDB_TARGET_PGURI} --filter=./include-1-schema.ini --no-acl --no-owner
Owner

Can we add a check that the other two schemas were indeed not copied over?
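
A check of that sort could look roughly like this in the test script (the schema names here are placeholders for the two schemas that must be excluded):

# assumed schema names: fail if either excluded schema made it to the target
count=$(psql "${PGCOPYDB_TARGET_PGURI}" -tAc "select count(*) from pg_namespace where nspname in ('schema2', 'schema3')")
test "${count}" = "0"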

@dimitri
Owner

dimitri commented May 22, 2023

ping @devinshteyn2; do you think you will have more time to spend on finishing this PR?

@devinshteyn

Hi. Unfortunately I've been completely swamped at work and haven't had a chance to review the latest changes and merge them into my fork. I may or may not have an opportunity to return to the project next week; I'd say there is a 50% chance of that happening. I will keep you posted on the status.

@devinshteyn

Quick status update for you: today I merged the latest changes from the main branch into my local fork. I am going to spend some time tomorrow on testing and, if all goes well, commit and push them back.

@devinshteyn

Pushed the merged code with minor additions, reverted README.md to your liking, and addressed, I think, all comments in the PR except going back to a single filter parameter. I can look into that tomorrow.

IMHO, the filtering interface is hidden too deep for many people to find it. Not a single reference appears in the original README, nor in the help output.
In my opinion, simple parameters as in pg_dump/pg_restore are more transparent and more convenient. Maybe you would consider an enhancement for supplying filters via the command line? As it stands now, in cloud and service-provider environments serving many users, the existing interface requires automation scripts to write dynamic filter files and then remove them after execution.

Owner

@dimitri dimitri left a comment

Thanks for taking the time and effort to prepare this PR for merge; still some more to do!

.gitignore Outdated
@@ -41,3 +41,5 @@ lib*.pc
/env/
/GIT-VERSION-FILE
/version
/tests/schema-filter
.vscode/
Owner

Please refrain from adding personal preferences to the repository; add .vscode to your local .gitignore setup!

Comment on lines 160 to 170
/* if restoring specific schemas as specified in the inclusion filter
make sure they exist in the target database, if not create them.
*/
if (specs->filters.includeOnlySchemaList.count > 0)
{
if (!copydb_target_prepare_namespaces(specs))
{
/* errors have already been logged */
return false;
}
}
Owner

I thought we would take care of that automatically within the pg_dump and pg_restore calls. If that's not happening, I think the comment ought to say why and give some context. That said, I would prefer that we don't have to create the schemas here, because what if the user made a typo in their filtering setup? Then we would create unwanted schemas on the target database...

Also, nitpicking: please follow the code style in use throughout the same code/file. It looks like this:

	/*
	 * If restoring specific schemas as specified in the inclusion filter,
	 * make sure they exist in the target database; if not, create them.
	 */

Note the capital letter, the empty lines, and the stars at the beginning of all lines within the comment.

Comment on lines +283 to +303

/* duplicate output to log file if PGCOPYDB_LOG env var is set.
* Warning: if the value is incorrect, it may cause segfault.
*/
if (env_exists("PGCOPYDB_LOG"))
{
char env_log_file[BUFSIZE] = { 0 };

if (get_env_copy("PGCOPYDB_LOG", env_log_file, BUFSIZE) > 0)
{
if ((log_fp = fopen(env_log_file, "w"))) /* extra parentheses around the assignment to avoid a compiler warning */
{
log_set_fp(log_fp);
}
else
{
log_error("Failed to open log file for writing %s. Error reason: %s", env_log_file, strerror(errno));
exit(EXIT_CODE_BAD_ARGS);
}
}
}
Owner

Please remove that part; the corresponding feature has already been added in the meantime. See https://pgcopydb.readthedocs.io/en/latest/ref/pgcopydb_clone.html#environment and the PGCOPYDB_LOG_TIME_FORMAT, PGCOPYDB_LOG_JSON, PGCOPYDB_LOG_FILENAME, and PGCOPYDB_LOG_JSON_FILE environment variables.
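
For example, the built-in feature can be driven from the environment along these lines (a usage sketch; the variable names come from the docs linked above):

PGCOPYDB_LOG_FILENAME=/tmp/pgcopydb.log pgcopydb clone --source=${PGCOPYDB_SOURCE_PGURI} --target=${PGCOPYDB_TARGET_PGURI}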

Comment on lines 883 to 885
/* verbose output */
//args[argsIndex++] = "--verbose";

Owner

I suspect this is meant to be removed later, before final review and merge? You might also be able to use log_get_level to selectively add --verbose to the pg_dump call when using --debug or --trace at the pgcopydb level?
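
That suggestion could look roughly like this (a sketch; log_get_level is named in the comment above, while the LOG_DEBUG constant and the comparison direction are assumptions about the logging library):

/* sketch: forward --verbose to pg_dump only when pgcopydb itself runs
 * at debug or trace level (assuming lower values mean more verbose) */
if (log_get_level() <= LOG_DEBUG)
{
	args[argsIndex++] = "--verbose";
}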

Comment on lines +24 to +26
#define PG_CMD_MAX_ARG 256
#define PG_DUMP_CMD_RESERVED_ARG 10
#define PG_RESTORE_CMD_RESERVED_ARG 15
Owner

Nice addition, still deserves some commenting.
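
For instance, the commenting could go along these lines (a sketch; the stated meanings are assumptions about how the constants are used):

#define PG_CMD_MAX_ARG 256              /* upper bound on argv entries built for pg_dump/pg_restore */
#define PG_DUMP_CMD_RESERVED_ARG 10     /* argv slots reserved for fixed pg_dump options */
#define PG_RESTORE_CMD_RESERVED_ARG 15  /* argv slots reserved for fixed pg_restore options */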

Comment on lines +432 to +439
/* include-only-schema */
" left join pg_temp.filter_include_only_schema fn "
" on n.nspname = fn.nspname "

/* include-only-table */
" join pg_temp.filter_include_only_table inc "
" on n.nspname = inc.nspname "
" and c.relname = inc.relname "
" left join pg_temp.filter_include_only_table ft "
" on n.nspname = ft.nspname "
" and c.relname = ft.relname "
Owner

Why use a LEFT JOIN for an “include-only” filter implementation? I expected a JOIN here.

Author

There are two left joins, one for the inclusive-schema filter and another for the inclusive-table filter. Both can be handled by the same code, which avoids unions with nearly duplicate code.

Owner

Ah yeah, nice trick!
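
To spell the trick out: a single predicate over such a LEFT JOIN can serve both the filtered and unfiltered cases, roughly like this (a hypothetical sketch in the style of the query above, not the PR's actual WHERE clause):

/* keep a row when no include-only-schema filter was given at all,
 * or when the LEFT JOIN found a match for it */
"  where (not exists (select 1 from pg_temp.filter_include_only_schema) "
"         or fn.nspname is not null) "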

Removed redundant parameters in pg_dump_db and pg_restore_db
Cleaned up old comments
@devinshteyn

Committed the latest version with code cleanup as per your comments. Please review.

@dimitri
Owner

dimitri commented Jul 25, 2023

I just found some time to review this PR again, and when playing with the SQL queries and filtering I realized the behavior was a little surprising. Thinking about it more, what I wanted from this feature is the exact same behavior as the already existing "exclude-schema" filter, just built the other way around: listing the schema names to filter in rather than the schema names to filter out.

Maybe the best way to explain it is this part of the implementation for the new "include-only-schema":

insert into pg_temp.filter_exclude_schema
     select n.nspname
       from pg_namespace n
  left join pg_temp.filter_include_only_schema inc
         on n.nspname = inc.nspname
      where inc.nspname is null;

@devinshteyn2
Author

Thank you very much for merging this PR!

If I understand correctly, you expected an inverse filter, having multiple exclusion filters inserted into the filter table, leaving just one schema (or a few) to be copied. Perhaps using that inverse method required much less code changes, can't argue with that. In our use case, we deal with hundreds sometimes thouthands of schemas in a single database and we copy just one when we need to move it to a different server. Exclusion filters feel very non intuitive in such cases. Maybe that's why to me the inverse method feels like... if I go to a grocery store to buy an apple, I bring all other apples found in the store to the cashier and tell them these are the apples I don't want to buy :-)

@dimitri
Owner

dimitri commented Jul 26, 2023

I agree it's counter-intuitive.

Making the SQL query actually work with the "include-only-schema" approach was also very complex, because when the filter_include_only_schema table is empty (0 rows) you want to keep everything rather than filter out everything, and inverting the filter condition based on whether a table is empty in SQL is... well, I couldn't find a nicer way to do it than inverting the filter and reusing a framework and a set of queries that already work.
