
Inclusive schema name filtering, and file logging changes #218

Closed
wants to merge 7 commits

Conversation

devinshteyn2

Please review the proposed changes

@lospejos
Contributor

lospejos commented Apr 1, 2023

I upvote this PR; please consider testing and merging/approving it. Thanks!

@dimitri
Owner

dimitri commented Apr 3, 2023

Thanks for your contributions, @devinshteyn2! The work may now begin: we have a lot to talk about in your PR. I'll give you an overall review first and then use the GitHub UI for a detailed review. The first thing I notice is that we should split the PR into several PRs with self-contained changes. For instance, the log file is a good idea, but it requires its own review (command-line switch, rotation on SIGHUP, etc.).

@devinshteyn2
Author

About the log file: that was a workaround on my end to speed up debugging when testing with many schema objects and grepping the logs for errors.

IMHO, file logging is also useful in automation projects and pipelines where running things manually from a console isn't an option, as in a production environment. If you like the idea, I can add a new parameter; please name it. But why stop there? I would also suggest adding a separate option for log formatting. Plain text isn't always the best format: modern log monitoring solutions, especially in cloud-based environments, prefer JSON for ease of log parsing and aggregation. It seems the existing log library doesn't support that, so it may need to be updated to support different formatting options. Optional suppression of color-coding escape sequences might be desirable too, so they don't get in the way of integrated log monitoring in Splunk, CloudWatch, Azure Monitor, and the like.

@devinshteyn2
Author

Looking forward to your PR review and further comments

@dimitri
Owner

dimitri commented Apr 5, 2023

Please have a look at #234, which implements your ideas about the log file, including JSON support. Note that we already have support for bypassing colors when we detect that stderr is not a tty, so logging pipelines should already be fine.
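
For reference, that kind of tty detection is typically a one-liner along these lines (an illustration of the technique, not the actual pgcopydb code):

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* illustration only: use colors on stderr only when it is a terminal,
 * so redirected output (logging pipelines) stays free of escape codes */
static bool
want_colors(void)
{
	return isatty(fileno(stderr)) == 1;
}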

Owner

@dimitri dimitri left a comment

Here is a first round of review. In the meantime several other PRs have made it to the main branch that require a rebase and your attention, with changes in areas you're also working on. In particular PR #247 introduces support for SourceFilters in dump_restore.c in a way that you can benefit from now.

README.md Outdated
Comment on lines 49 to 55
- clone    Clone an entire database from source to target
- fork     Clone an entire database from source to target
+ clone    Clone an entire database from source to target, schema and table
+          filters may apply (see Filters for more info)
+ fork     Clone an entire database from source to target, schema and table
+          filters may apply (see Filters for more info)
  follow   Replay changes from the source database to the target database
- copy-db  Clone an entire database from source to target
+ copy-db  Clone an entire database from source to target, schema and table
+          filters may apply (see Filters for more info)
Owner

I would prefer the short description of each command to fit on a single line, even if that means we don't mention the filtering abilities. Also, we don't have filtering support for logical decoding at the moment. Let's revert that change.

README.md Outdated
Comment on lines 329 to 350
## Filters
Filtering allows skipping some object definitions and data when copying from the source to the target database.
The pgcopydb commands accept the --filter (or --filters) option, which expects an existing filename as its argument.
The given file is read in the INI file format, but only sections and option keys are used; option values are ignored.
Filtering supports two kinds of filters, exclusion and inclusion-only filters. Their complete description and examples
can be found in the pgcopydb documentation.

The following filter types can be used for exclusions:
exclude-schema
exclude-index
exclude-table-data
exclude-table

The following filter types can be used for inclusions only:
include-only-schema
include-only-table

Multiple filters can be specified in the same filter file, yet not all combinations of filters are compatible. For
example, using include-only-schema together with exclude-table is fine, but using exclude-schema together with
include-only-schema is not.


Owner

The README is just a README. See the docs directory for full documentation, including https://pgcopydb.readthedocs.io/en/latest/ref/pgcopydb_config.html, which details our filtering options. This PR should probably edit the docs to cover the added filtering functionality, and maybe make it more visible in the general discussions.

Comment on lines 65 to 66
&(specs->filters.includeOnlySchemaList),
&(specs->filters.excludeSchemaList)))
Owner

It should be enough to pass the filter itself as an argument and let copydb_dump_source_schema deal with the detailed filtering support.

Comment on lines 96 to 97
&(specs->filters.includeOnlySchemaList),
&(specs->filters.excludeSchemaList)))
Owner

Same comment as before: let's pass the whole SourceFilters structure here as an argument and let the function deal with whatever support it has for it.
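
Concretely, the call sites quoted above would then pass something like the following (a sketch of the suggested change, keeping the rest of each call as it is):

&(specs->filters)   /* the whole SourceFilters, instead of its two schema lists */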

Comment on lines 160 to 172
/* if restoring specific schemas as specified in the inclusion filter
make sure they exist in the target database, if not create them.
*/
if (specs->filters.includeOnlySchemaList.count > 0)
{
if (!copydb_target_create_snpname(specs))
{
/* errors have already been logged */
return false;
}
}


Owner

Why is that needed in the workflow? pg_dump and pg_restore should be taking care of that, right?

We either need a comment explaining why pg_restore fails to restore the needed schemas, or better yet a fix to our code to make sure that pg_restore does the work for us here.

Comment on lines 439 to 473
/*
* schemaFiltersJoin concatenates elements of a filter array into
* a single string (for use with pg_dump and pg_restore)
*/
char *
schemaFiltersJoin(char *dest, size_t dest_size, SourceFilterSchemaList *list, char *separator)
{
size_t separator_size = strlen(separator);
char *target = dest; /* start of buffer, where to copy to */
char *target_end = dest + dest_size; /* end of buffer, can't go beyond that */
*target = '\0';
size_t nspname_size = 0;

for (size_t i = 0; i < list->count; i++)
{
nspname_size = strlen(list->array[i].nspname);

if (i > 0) /* first element or not? */
{
if (target_end <= (target + separator_size + nspname_size)) /* no more space */
return dest;
/* add separator */
strcat(target, separator);
target += separator_size;
}
else if (target_end <= (target + nspname_size)) /* no more space */
return dest;

/* add element */
strcat(target, list->array[i].nspname);
target += nspname_size; /* move pointer to the end of string */
}

return dest;
}
Owner

Do we really need that function? It looks like something that would be better done at the SQL level, or at least by exposing the schemas as a VALUES statement (one row per namespace rather than a composite value). If you really want to send the namespace list as a composite value, could we maybe prepare a JSON array and send its text representation instead?

Otherwise, if we need to keep that code, please follow the project formatting rules, and use the PQExpBuffer API to build the string.
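
For reference, a PQExpBuffer-based version could look roughly like this (a hypothetical sketch assuming the SourceFilterSchemaList layout from this PR; out-of-memory checking via PQExpBufferBroken is omitted):

#include <string.h>
#include "pqexpbuffer.h"

/*
 * Sketch only: join the schema names with a separator. The buffer grows
 * as needed, so no manual bounds checking is required. The caller owns
 * (and must free) the returned string.
 */
static char *
schema_list_join(SourceFilterSchemaList *list, const char *separator)
{
	PQExpBuffer buf = createPQExpBuffer();

	for (size_t i = 0; i < list->count; i++)
	{
		if (i > 0)
		{
			appendPQExpBufferStr(buf, separator);
		}
		appendPQExpBufferStr(buf, list->array[i].nspname);
	}

	char *result = strdup(buf->data);
	destroyPQExpBuffer(buf);

	return result;
}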

@@ -32,6 +33,7 @@ size_t ps_buffer_size; /* space determined at run time */
size_t last_status_len; /* use to minimize length of clobber */

Semaphore log_semaphore = { 0 }; /* allows inter-process locking */
FILE *log_fp = NULL; /* file handle for optional log file */
Owner

Please rebase onto the current main branch, where support for logging to a file (optionally in JSON format) has been added thanks to your suggestion. You can now remove your own approach to it ;-)

Comment on lines 344 to 345
const SourceFilterSchemaList *includeSchemaList,
const SourceFilterSchemaList *excludeSchemaList)
Owner

SourceFilters

Comment on lines 766 to 767
const SourceFilterSchemaList *includeSchemaList,
const SourceFilterSchemaList *excludeSchemaList)
Owner

SourceFilters

Comment on lines +29 to +30
# copy one schema only
pgcopydb clone --${LOG_LEVEL} --source=${PGCOPYDB_SOURCE_PGURI} --target=${PGCOPYDB_TARGET_PGURI} --filter=./include-1-schema.ini --no-acl --no-owner
Owner

Can we add a check that the other two schemas were indeed not copied over?
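
A check of that sort could look roughly like this in the test script (the schema names here are placeholders for the two schemas that must be excluded):

# assumed schema names: fail if either excluded schema made it to the target
count=$(psql "${PGCOPYDB_TARGET_PGURI}" -tAc "select count(*) from pg_namespace where nspname in ('schema2', 'schema3')")
test "${count}" = "0"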

@dimitri
Owner

dimitri commented May 22, 2023

ping @devinshteyn2; do you think you will have more time to spend on finishing this PR?

@devinshteyn

Hi. Unfortunately I've been completely swamped at work and haven't had a chance to review the latest changes and merge them into my fork. I may or may not have an opportunity to return to the project next week; I'd say there is a 50% chance of that happening. I will keep you posted on the status.

@devinshteyn

Quick status update for you: today I merged the latest changes from the main branch into my local fork. I am going to spend some time tomorrow on testing and, if all goes well, commit and push them back.

@devinshteyn

Pushed the merged code with minor additions, reverted README.md to your liking, and addressed, I think, all comments in the PR except going back to a single filter parameter. I can look into that tomorrow.

IMHO, the filtering interface is hidden too deep for many people to find it. Not a single reference appears in the original README, nor in the help output.
In my opinion, simple parameters as in pg_dump/pg_restore are more transparent and more convenient. Maybe you would consider an enhancement for supplying filters via the command line? As it stands now, in cloud and service-provider environments serving many users, the existing interface requires automation scripts to write dynamic filter files and then remove them after execution.

Owner

@dimitri dimitri left a comment

Thanks for taking the time and effort to prepare this PR for merge; still some more to do!

.gitignore Outdated
@@ -41,3 +41,5 @@ lib*.pc
/env/
/GIT-VERSION-FILE
/version
/tests/schema-filter
.vscode/
Owner

Please refrain from adding personal preferences to the repository; add .vscode to your local .gitignore setup!

Comment on lines 160 to 170
/* if restoring specific schemas as specified in the inclusion filter
make sure they exist in the target database, if not create them.
*/
if (specs->filters.includeOnlySchemaList.count > 0)
{
if (!copydb_target_prepare_namespaces(specs))
{
/* errors have already been logged */
return false;
}
}
Owner

I thought we would take care of that automatically within the pg_dump and pg_restore calls. If that's not happening, I think the comment ought to say why and give some context. That said, I would prefer that we don't have to create the schemas here, because what if the user made a typo in their filtering setup? Then we would create unwanted schemas on the target database...

Also, nitpicking: please follow the code style in use throughout the same code/file. It looks like this:

	/*
	 * If restoring specific schemas as specified in the inclusion filter,
	 * make sure they exist in the target database; if not, create them.
	 */

Note the capital letter, the empty lines, and the stars at the beginning of all lines within the comment.

Comment on lines +283 to +303

/* duplicate output to log file if PGCOPYDB_LOG env var is set.
* Warning: if the value is incorrect, it may cause segfault.
*/
if (env_exists("PGCOPYDB_LOG"))
{
char env_log_file[BUFSIZE] = { 0 };

if (get_env_copy("PGCOPYDB_LOG", env_log_file, BUFSIZE) > 0)
{
if ((log_fp = fopen(env_log_file, "w"))) /* extra parentheses around the assignment to avoid a compiler warning */
{
log_set_fp(log_fp);
}
else
{
log_error("Failed to open log file for writing %s. Error reason: %s", env_log_file, strerror(errno));
exit(EXIT_CODE_BAD_ARGS);
}
}
}
Owner

Please remove that part; the corresponding feature has already been added in the meantime. See https://pgcopydb.readthedocs.io/en/latest/ref/pgcopydb_clone.html#environment and the PGCOPYDB_LOG_TIME_FORMAT, PGCOPYDB_LOG_JSON, PGCOPYDB_LOG_FILENAME, and PGCOPYDB_LOG_JSON_FILE environment variables.
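
For example, the built-in feature can be driven from the environment along these lines (a usage sketch; the variable names come from the docs linked above):

PGCOPYDB_LOG_FILENAME=/tmp/pgcopydb.log pgcopydb clone --source=${PGCOPYDB_SOURCE_PGURI} --target=${PGCOPYDB_TARGET_PGURI}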

Comment on lines 883 to 885
/* verbose output */
//args[argsIndex++] = "--verbose";

Owner

I suspect this is meant to be removed later, before final review and merge? You might also be able to use log_get_level to selectively add --verbose to the pg_dump call when using --debug or --trace at the pgcopydb level?
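
That suggestion could look roughly like this (a sketch; log_get_level is named in the comment above, while the LOG_DEBUG constant and the comparison direction are assumptions about the logging library):

/* sketch: forward --verbose to pg_dump only when pgcopydb itself runs
 * at debug or trace level (assuming lower values mean more verbose) */
if (log_get_level() <= LOG_DEBUG)
{
	args[argsIndex++] = "--verbose";
}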

Comment on lines +24 to +26
#define PG_CMD_MAX_ARG 256
#define PG_DUMP_CMD_RESERVED_ARG 10
#define PG_RESTORE_CMD_RESERVED_ARG 15
Owner

Nice addition, still deserves some commenting.
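
For instance, the commenting could go along these lines (a sketch; the stated meanings are assumptions about how the constants are used):

#define PG_CMD_MAX_ARG 256              /* upper bound on argv entries built for pg_dump/pg_restore */
#define PG_DUMP_CMD_RESERVED_ARG 10     /* argv slots reserved for fixed pg_dump options */
#define PG_RESTORE_CMD_RESERVED_ARG 15  /* argv slots reserved for fixed pg_restore options */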

Comment on lines +432 to +439
/* include-only-schema */
" left join pg_temp.filter_include_only_schema fn "
" on n.nspname = fn.nspname "

/* include-only-table */
" join pg_temp.filter_include_only_table inc "
" on n.nspname = inc.nspname "
" and c.relname = inc.relname "
" left join pg_temp.filter_include_only_table ft "
" on n.nspname = ft.nspname "
" and c.relname = ft.relname "
Owner

Why use a LEFT JOIN for an “include-only” filter implementation? I expected a JOIN here.

Author

There are two left joins, one for the inclusive-schema filter and another for the inclusive-table filter. Both can be handled by the same code, which avoids unions with nearly duplicate code.

Owner

Ah yeah, nice trick!
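
To spell the trick out: a single predicate over such a LEFT JOIN can serve both the filtered and unfiltered cases, roughly like this (a hypothetical sketch in the style of the query above, not the PR's actual WHERE clause):

/* keep a row when no include-only-schema filter was given at all,
 * or when the LEFT JOIN found a match for it */
"  where (not exists (select 1 from pg_temp.filter_include_only_schema) "
"         or fn.nspname is not null) "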

Removed redundant parameters in pg_dump_db and pg_restore_db
Cleaned up old comments
@devinshteyn

Committed the latest version with code cleanup as per your comments. Please review.

@dimitri
Owner

dimitri commented Jul 25, 2023

I just found some time to review this PR again, and when playing with the SQL queries and filtering I realized the behavior was a little surprising. Thinking about it more, what I wanted from this feature is the exact same behavior as the already existing "exclude-schema" filter, just built the other way around: listing the schema names to filter in rather than the schema names to filter out.

Maybe the best way to explain it is this part of the implementation for the new "include-only-schema":

insert into pg_temp.filter_exclude_schema
     select n.nspname
       from pg_namespace n
  left join pg_temp.filter_include_only_schema inc
         on n.nspname = inc.nspname
      where inc.nspname is null;

@devinshteyn2
Author

Thank you very much for merging this PR!

If I understand correctly, you expected an inverse filter, having multiple exclusion filters inserted into the filter table, leaving just one schema (or a few) to be copied. Perhaps using that inverse method required much less code changes, can't argue with that. In our use case, we deal with hundreds sometimes thouthands of schemas in a single database and we copy just one when we need to move it to a different server. Exclusion filters feel very non intuitive in such cases. Maybe that's why to me the inverse method feels like... if I go to a grocery store to buy an apple, I bring all other apples found in the store to the cashier and tell them these are the apples I don't want to buy :-)

@dimitri
Owner

dimitri commented Jul 26, 2023

I agree it's counter-intuitive.

Making the SQL query actually work with the "include-only-schema" approach was also very complex, because when the filter_include_only_schema table is empty (0 rows) you want to keep everything rather than filter out everything, and inverting the filter condition based on whether a table is empty in SQL is... well, I couldn't find a nicer way to do it than inverting the filter and reusing a framework and a set of queries that already work.
