Using XML / XSLT instead of string manipulation? #389

Closed
bclothier opened this issue Apr 2, 2023 · 5 comments
Comments

@bclothier
Contributor

Is there a particular reason to not use MSXML2's native methods for handling XML contents?

Using the XML methods should be faster than string processing and would avoid bugs like the one addressed in this commit: a66e4fb

I do see you have already invested a lot of effort in optimizing string manipulation operations, which are usually the biggest bottleneck, but I wanted to confirm whether anyone had done performance tests comparing XML manipulation with string manipulation.

FWIW, here's an example of using XSLT to control the indentation in a declarative manner.
https://github.com/joyfullservice/msaccess-vcs-integration/blob/a66e4fbcb899b77e6eff1b5851ed0734c52784f5/Version%20Control.accda.src/modules/modSanitize.bas#L738-L774
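
For readers who have not used XSLT from VBA, here is a minimal sketch of the declarative approach. It is not the code from the linked module: the function name `FormatXmlWithXslt` and the stylesheet are illustrative assumptions. It uses an identity transform with `indent="yes"` and lets MSXML 6.0 (late-bound) handle the indentation.

```vba
' Sketch only: pretty-print an XML string with an XSLT identity transform.
' FormatXmlWithXslt is a hypothetical name, not a function from the add-in.
Public Function FormatXmlWithXslt(ByVal strXml As String) As String

    ' Identity transform: copy every node and attribute unchanged,
    ' and ask the serializer to indent the output.
    Const cstrXslt As String = _
        "<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">" & _
        "<xsl:output method=""xml"" indent=""yes""/>" & _
        "<xsl:strip-space elements=""*""/>" & _
        "<xsl:template match=""@*|node()"">" & _
        "<xsl:copy><xsl:apply-templates select=""@*|node()""/></xsl:copy>" & _
        "</xsl:template>" & _
        "</xsl:stylesheet>"

    Dim objXml As Object
    Dim objXsl As Object

    Set objXml = CreateObject("MSXML2.DOMDocument.6.0")
    Set objXsl = CreateObject("MSXML2.DOMDocument.6.0")
    objXml.async = False
    objXsl.async = False

    ' loadXML returns False when the content cannot be parsed.
    If Not objXml.loadXML(strXml) Then
        Err.Raise vbObjectError + 1, "FormatXmlWithXslt", objXml.parseError.reason
    End If
    If Not objXsl.loadXML(cstrXslt) Then
        Err.Raise vbObjectError + 2, "FormatXmlWithXslt", objXsl.parseError.reason
    End If

    ' transformNode applies the stylesheet and returns the result as a string.
    FormatXmlWithXslt = objXml.transformNode(objXsl)

End Function
```

Calling `Debug.Print FormatXmlWithXslt("<root><item>1</item></root>")` in the Immediate window shows the result. The appeal of this design is that indentation rules live in the stylesheet rather than in hand-written string parsing.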

@joyfullservice
Owner

Thanks for the suggestion on this! I have never actually used XSL transformations, but I really like the idea of a declarative stylesheet that handles the formatting of the XML content.

On the performance side, most of the time involved in running a full export on a complex database tended to be in sanitizing form objects, especially ones with embedded images or other components that added large amounts of binary data to the form definition sections. Developing and using clsConcat for string manipulation brought massive performance gains to the parsing process.

I would imagine you have probably seen the export log files, but at the very end of the file you can find the performance report that breaks down the performance of various operations during the export or build process. This was one of the key tools that I used when refining and tuning the performance of exports. This would be a great place to review the difference between string manipulation and XSLT.

@bclothier
Contributor Author

bclothier commented Apr 4, 2023

Using a highly sourced[citation needed], rigorous[citation needed], peer-reviewed[citation needed] scientific study[citation needed], I was able to determine that with version 4.0.9, I got the following times:

Attempt 1
Sanitize XML                  22        0.24
Format XML                    22        0.16

Attempt 2
Sanitize XML                  22        0.23
Format XML                    22        0.16

With version 4.0.10, I got the following times:

Sanitize XML                  22        0.07
Format XML                    22        0.10

Sanitize XML                  22        0.07
Format XML                    22        0.10

This seems to be approximately faster[citation needed].

NOTE: the overall time, however, is at best a wash or at worst slightly slower for 4.0.10, due to the other changes introduced. The variance in run times was too great to properly assess the overall performance impact.

@joyfullservice
Owner

I love the [citation needed] quotes!! 🤣 Well, I would help with the peer review except that I don't think I would quite measure up as a peer. 😄 I ran some testing this morning using a tweaked version of the sanitize function that removes the file writing and logging to keep it just down to the actual processing of the content.

Using an 86KB XML file (1500 lines) as a sample, and processing the file 1000 times, I was getting a pretty consistent output.

Public Sub TestSanitizing()

    Const cstrFile As String = "C:\...\tblSubscriptionMeta.xml"

    Dim strData As String
    Dim lngCnt As Long

    strData = ReadFile(cstrFile)
    Perf.StartTiming
    
    For lngCnt = 1 To 1000
        SanitizeXML2 (strData), False
    Next lngCnt
    
    Perf.EndTiming
    Debug.Print Perf.GetReports

End Sub

--------------------------------------------------
                PERFORMANCE REPORTS
--------------------------------------------------
Operations                    Count     Seconds
--------------------------------------------------
Sanitize XML                  1000      13.37
Format XML                    1000      16.43
--------------------------------------------------
Other Operations                        0.68
--------------------------------------------------

The processing time for formatting the XML was virtually identical whether I used the string manipulation or XSLT functions. Based on this I am pretty confident that we have no performance loss for taking the XSLT approach, and it gives us some other definite advantages. Great idea!! 👍

@hecon5
Contributor

hecon5 commented Apr 20, 2023

I like it, too! Been away for a while (other projects ate me), but glad to see this is still being improved!

joyfullservice added a commit that referenced this issue Dec 12, 2023
Large XML files may cause memory errors with XSLT operations. Adding an alternate approach to simply replace the leading tabs with two spaces. This should allow the add-in to export even extremely large table data files as formatted XML. #389, fixes #474
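
As a rough illustration of the fallback this commit message describes, the sketch below swaps each leading tab for two spaces without loading the document into MSXML. It is an assumption-laden example rather than the add-in's actual function: the name `ReplaceLeadingTabs` is hypothetical, lines are assumed to end in CRLF, and a real implementation for very large files would likely build the output with something like clsConcat instead of `Split`/`Join`.

```vba
' Sketch only: replace each leading tab on a line with two spaces.
' ReplaceLeadingTabs is a hypothetical name; CRLF line endings are assumed.
Public Function ReplaceLeadingTabs(ByVal strText As String) As String

    Dim varLines As Variant
    Dim lngLine As Long
    Dim lngTabs As Long
    Dim strLine As String

    varLines = Split(strText, vbCrLf)

    For lngLine = LBound(varLines) To UBound(varLines)
        strLine = varLines(lngLine)

        ' Count the tabs at the start of the line.
        lngTabs = 0
        Do While Mid$(strLine, lngTabs + 1, 1) = vbTab
            lngTabs = lngTabs + 1
        Loop

        ' Only the leading tabs are replaced; tabs inside the line are kept.
        If lngTabs > 0 Then
            varLines(lngLine) = Space$(lngTabs * 2) & Mid$(strLine, lngTabs + 1)
        End If
    Next lngLine

    ReplaceLeadingTabs = Join(varLines, vbCrLf)

End Function
```

Because this never parses the XML, memory use is bounded by the string itself, which is the trade-off the commit makes for extremely large table data files.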
josef-poetzl added a commit to josef-poetzl/msaccess-vcs-addin that referenced this issue Jan 19, 2024
* Updating clsPerformance, as some objects never restart timing, and when resetting some objects are not cleared. Fixes joyfullservice#331

* Fixing Private/Public declarations.

* This isn't actually used.

* Bump Version

* Update API examples

Removed dependency on an external function, and added an example for building from source.

* Refine some dialect-specific SQL string quotations

Backticks only apply to MySQL, while square brackets are used with MSSQL and Access. joyfullservice#442

* Allow wrapping of long names in performance class

Extending the performance class to allow the wrapping of long names used for categories or operations. (Not really needed within this project, but could potentially be helpful in the future with translations.) joyfullservice#441

* Update based on feedback from @joyfullservice.

* Resolve conflict with upstream file

Putting the comma after the argument seems to be the preferred industry-standard approach, based on ChatGPT and Bard.

* Add option to pass path to build API

You can now specify the source files path when you initiate a build through the API. This allows automated builds to be run even if a copy of the database does not yet exist. (Such as after checking out a project in an automated CI workflow.) joyfullservice#430

* The logic for checking of existence of git files wasn't always working as expected due to searching the current directory rather than using the export folder.

* Add a check when loading XML to verify it was successfully parsed. This avoids generating a bad export where the data are not actually exported due to invalid XML being generated by Application.ExportXML. Unfortunately, if a table contains any characters that aren't valid in an XML document, the export won't try to escape them and will include them as literals. Even if they were escaped, they might not be accepted anyway. The XML 1.0 specification forbids control characters in the 0x01-0x1F range (other than tab, line feed, and carriage return), so if table data contains such characters, this can cause the XML export to fail. In this case, tab-delimited will have to be used instead. However, the previous version was simply silently exporting as if everything was hunky-dory when it was not. Hence, the error.

* The export log was littered with a bunch of warnings about unclosed blocks. This seems to be due to not closing the block when evaluating the UseTheme. Even if we skipped it, we still need to remove it from m_colBlocks to balance everything out.

* Fix a subscript out of range error where the tokens advance beyond the end of the string but the function GetNextTokenID returns 0, which then fails within the FormatSQL function since there is no member at index 0. It's not clear why this only fails every second time a query is exported, but if an export fails, exporting the query again will not yield the error. Export it a third time, and it fails again.

* Add more types of queries that should not be formatted by SQL formatter because they are a variant of pass-through queries.

* The AutoClose may run after the form has closed (e.g. if the user is quick to close it), which may result in an error about object members not being available. Since the form is closed, there's no point in setting the timer interval. To avoid the error when debugging, we add an IsLoaded check and skip it if the form is not loaded.

* Fix issue with LogUnhandledErrors and simplify use. (joyfullservice#449)

* Add option to SplitLayoutFromVBA

This option (on by default) will save the VBA code from forms and reports as a related .cls file. (Still under development.) joyfullservice#378

Also removed the "Strip out Publish Option" from the options form. I have never heard of a case where this needs to be changed, and it frees up space for the new option we are adding without cluttering the form.

* Refactor code module export to shared function

This logic will be shared when exporting code modules from forms and reports.

* Support "|" in performance log entry names

Refactored parsing the key from the performance item so that we are not dependent upon a unique delimiter. The timing value is always a number, so we can be confident that the first pipe character is the delimiter. The text after that can be anything, including pipe characters. joyfullservice#450

* Adjust indenting

(minor change)

* Convert Sanitize module to class

In some cases sanitizing a source file actually creates two distinct outputs. A layout file and a code file. Rather than making the sanitize function more complicated with byref outputs and non-obvious side effects, I am taking the approach of a more explicit object-oriented route where the code is easier to understand and maintain. (And also allows for future enhancements such as SQL extraction for query definition files.)

* Refactor sanitizing to use class

Updating the existing code to use the new class.

* Refactor class variables

* Refactor form/report export to split VBA

Export is now splitting the VBA from Form and Report objects to separate files with a .cls extension. Moving on to the code that will stitch these files back together before import.

* Rename Sanitize class to SourceParser

This better reflects the expanded role of the class.

* Refactor for name change

* Verify ribbon active state when the add-in loads

Ensure that the ribbon is active when installing or activating the add-in. See joyfullservice#451

* Don't auto split layout/VBA for existing projects

For existing projects in git repositories, form and report layouts should not be automatically split from the VBA code classes. There is another process that will allow us to split the files while preserving history in both files, but this involves a couple commits and requires a clean branch. For existing projects, this is a manual upgrade (option changes). For new projects, it can happen by default.

* Move print settings processing to clsSourceParser

This keeps the LoadComponentFromText function cleaner and easier to read.

* Move source reading function

This is used in several areas, and allows us to maintain the source file encoding determination in a single location.

* Rework merging source content before import

Cleaning this up to avoid reading and writing the file additional times while merging content from different sources. (Print settings, VBA code)

* Add support to overlay VBA code after import

For some (rare) situations, it is necessary to push the VBA code directly using VBE to preserve certain extended characters that may be corrupted in a regular round-trip export/import cycle.

* Code cleanup and minor tweaks

* Fix bugs in build logic

Uncovered these while testing.

* Check for diff tool before comparing objects

* Implement correction according to rubberduck (joyfullservice#453)

replace VBA commands:
format with format$
trim with trim$

* Add wiki page for Split Files

Describes the process in a little more detail.

* Add change hook for options

Used for special processing when certain options change.

* Automate splitting forms and reports

Adds a link and some code automation to split forms and reports in existing projects to layout and class files.

* Rename function

Git.Installed sounds better than Git.GitInstalled, and will almost always be called in the context of the git class.

* Fixes joyfullservice#354 and Fixes joyfullservice#452 (joyfullservice#454)

From @hecon5:

Bump version minor number because it's not clear that the index will allow round trip from prior types in all cases; it worked on my machine, but that may not always be the case.

The date types for the index are handled natively by modJsonConverter and should import/export correctly regardless of user's date / time zone or date encoding on machines.

* Add performance timing to ISO date parsing

See joyfullservice#354

* Add high-performance wrapper functions

Avoids the use of RegEx when it is not necessary to parse a standard date format. joyfullservice#354

* Fix copy-paste oversight

joyfullservice#354

* Update error handling

Refactored a number of locations to use the new syntax for On Error Resume Next, and added code to clear expected errors.

* Use faster date parsing for date only values

* Add Split Files utility to ribbon (Advanced Tools)

Also added an informational message box when the split is complete.

* Rename as new files

* Restore original files

* Split layout from VBA in testing database

Separates the VBA code from the layout definition in the source files. (Applying to testing database now, will apply to main project soon.)

* Adjust version number

I am using the minor version number to represent releases from the main branch, and the build number to continuously increment during the development cycle.

* Revert the ConvDateUTC and ConvTimeUTC functions to always parse the "Fast" way first and fall back otherwise. This allows the optimization to be used everywhere with no code changes. Ensure that millisecond accuracy is kept for others using the function. No speed impact is noted on my end from doing this.

* Pass by ref so we don't need to build more memory use. Optimize Offset string building to only do math when it's required and fix whitespace.

* Cache the format types instead of needing to build them every time.

* Bump Version

* Verify consistent naming and byref passing of strings

* Implement dialect in SQL formatting

This was previously only partially implemented. joyfullservice#457

* Add support for ! boundary character

This character is used in Microsoft Access in a query when referring directly to a control on a form, and should be treated similar to a period as a separator between elements. joyfullservice#457

* Add SQL formatting query to testing database

This query demonstrates that we can properly parse and format expressions that refer to controls. joyfullservice#457

* Solve rare edge case with SQL IN clause

Just in case a user has an embedded unquoted path in a string, the colon will be treated as a non-spaced boundary character during formatting. (For Microsoft Access SQL only) Fixes joyfullservice#447

* Addresses joyfullservice#459 (joyfullservice#460)

Addresses joyfullservice#459

* Allow sorting of operation names with leading spaces (joyfullservice#463)

If an operation name has a leading space (like " MyOperation"), the function "SortItemsByTime" fails.
Sorting now succeeds.

* Update comment

After removing string padding in the previous commit.

* Adjust detection of system tables

Switching to just using a bit flag check to solve joyfullservice#462

* Log warning for UNC path access errors

Failing to convert a path to UNC may not prevent the operation from completing, but it should be handled and logged. Fixes joyfullservice#461

* Refactor date conversion for DB Properties

Save custom date properties in ISO (UTC) format in source files, without converting other property types like strings that may parse as dates. joyfullservice#459

* Turn off date ISO conversion by default

This is only used in the index and certain database properties. joyfullservice#459

* Turn on date ISO conversion before reading index

These dates need to be converted to local dates for internal processing. joyfullservice#459

* Add saved date property to testing database

Verifies that the round trip conversion of saved date properties is working correctly. (The dates are stored as UTC in source files, but converted to local dates when imported into the database properties.) joyfullservice#459

* Add dates stored as text to testing database

One stored as a regular date string, and the other as an ISO8601 string. (Neither should convert when reading or writing to JSON.) joyfullservice#459

* Add default git files if dbs in repository root

If you use the default options of no special export folder path defined, the project may likely be in the repository root. Add the default .gitignore and .gitattributes files if they are missing. (This would be the default setup for most projects.)

* Add note about Access 2007 support

See joyfullservice#464

* Add test for Public Creatable class instance

This is an undocumented property value that is sometimes used in the wild. Currently when you build from source, PublicCreatable (5) classes are converted to PublicNotCreatable (2). This instancing property can be set in VBA, and we want the imported class to match what was exported. This test currently fails, but will pass when the add-in is updated to support this property.

* Support for PublicCreatable instancing for classes

This (technically undocumented) technique allows class objects to be created by external projects without using factory methods. This approach was used in some of my projects, so it was important for me to see this property correctly set when the application was built from source.

* Check VCS version before export

Checks the VCS version before export to warn the user if we are running an export with an older version of VCS than was last used on this project. joyfullservice#465

* Check VCS version on build

Check the VCS version before building, and warn if the project version is greater than the installed version. joyfullservice#465

* Reset error break mode after loading options

The ability to break and debug VBA errors is dependent on an option value that is saved with each project. The on error directive should be reset after loading the project options to ensure that we can successfully break on errors.

* Include name prefix with VBA code overlay

Testing this on a machine using the Unicode BETA option in Windows.

* Resolve build path during options upgrade

We may not have a database open when upgrading options. Fall back to the path used to load the project options to determine the source file path. joyfullservice#467

* Add some documentation for merge build

Taking some time to document the intended behavior of the merge build functionality as we work through some bugs. joyfullservice#471, joyfullservice#81

* Update Merge-Build.md

Expand table of expected behavior.

* Remove git integration for getting modified source

I don't think this is actually being used in the wild, and it simplifies the process to have only a single code path for detecting changed source files.

* Add missing database objects on merge

If a source file exists, but no matching database object exists, we should merge the source file into the database. joyfullservice#471

* Fix SQL export of non-formatted queries

Pass-through queries are now exported as SQL again.

* Refactor source modified date for multiple files

Some types of components, such as tables, forms, reports, queries, and shared images may use multiple source files to represent a single database component. We may need to check all of the related source files to accurately determine the latest modification date.

* Log performance of clearing files by extension

* Refactor components to provide file extension list

Parent functionality such as determining the most recent file modified, getting the last modified date, and checking for alternate source files is better done by having the class provide the list of file extensions that might be used by the class, and having single external functions perform these tasks. (Avoids some redundant code.)

* Add multi-file support to file properties hash

This will allow us to more accurately detect changes in non-primary source files. (Such as a change in a shared image.)

* Simplify component class

Removing three functions that are not uniquely specific to each component type and are handled by external functions now that we have exposed the source code file extensions.

* Remove Upgrade function on IDbComponent

We don't need to try to support mixes of various versions of export files. Use the same version of VCS to build a fresh copy of the project, then export with the latest VCS to upgrade source file.

* Move logic to clear legacy files

Moving this to the Export function.

* Rework processing of conflicts & orphaned objects

Refactored the detection and processing of source conflicts and orphaned source files & database objects to better handle source file types that involve multiple files. joyfullservice#473 joyfullservice#471 joyfullservice#472

* Compare source contents of related files

When checking for changes in source files, we need to check all the related source files for each component, not just the primary source file.

* Adapt export comparison to support multiple files

Further changes to compare all related source files.

* Update testing database

Updated to latest version of VCS.

* Include class instancing in code module hash

Class modules have an instancing property that needs to be checked for changes along with the VBA code to ensure that the database object matches the last export. A module will now be flagged as changed if the instancing property is changed.

* Trap any XML import errors

* Add alternate XML format function for big files

Large XML files may cause memory errors with XSLT operations. Adding an alternate approach to simply replace the leading tabs with two spaces. This should allow the add-in to export even extremely large table data files as formatted XML. joyfullservice#389, fixes joyfullservice#474

* Remove format version from custom groups

Any recent version of export file should be using the new format, and we don't need to carry this conversion forward into v4.

* Fix issue with orphaned file detection

Need to pass a dictionary, not a collection to the CompareToIndex function.

* Move testing code to testing module

* Require hash on index update

Any update to the index is now required to provide a hash to match the source file. joyfullservice#472

* Save schema filter rules as collection

Saving each filter line as a single element in a collection makes a much more readable section in the options.json file, especially when the rules become more complex. Previously this was saved as a combined string value which makes it harder to read changes to individual rules.

* Support AfterBuild hooks in add-in project

Made a tweak so we can use the RunAfterBuild hook in the add-in project to verify (load) the resources immediately after building from source. This will help prevent accidentally deploying the add-in without the needed resource records, as happened in joyfullservice#477.

* Rename as new files

* Restore original files

* Split forms from VBA code in add-in project

Going forward, this will allow us to edit the VBA code without affecting the layout definition files in forms.

* Add region support type double (joyfullservice#481)

Co-authored-by: Festiis <festim.nuredini@axami.se>

* Add some additional comments to code changes

Clarifies why we are using the Val() function when parsing ISO dates.

* Add initial support for CommandBar popup menus

This is still a work in progress, but has the basic functionality of exporting and importing custom CommandBars.

* Add error handling to linked table refresh

This could be related to a recent Access bug, but it is helpful to trap the error if it occurs. joyfullservice#484

---------

Co-authored-by: Hecon5 <54177882+hecon5@users.noreply.github.com>
Co-authored-by: joyfullservice <joyfullservice@users.noreply.github.com>
Co-authored-by: bclothier <bgclothier@gmail.com>
Co-authored-by: Tanarri <Tanarri@users.noreply.github.com>
Co-authored-by: Festim Nuredini <44016065+Festiis@users.noreply.github.com>
Co-authored-by: Festiis <festim.nuredini@axami.se>
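
One of the commit notes above describes verifying that XML loaded successfully before treating an export as good. A minimal sketch of that kind of check with MSXML 6.0 might look like the following. The function name `IsWellFormedXml` and its `strReason` parameter are hypothetical and are not taken from the add-in's source.

```vba
' Sketch only: report whether a string parses as XML, and why not if it fails.
' IsWellFormedXml and strReason are hypothetical names for illustration.
Public Function IsWellFormedXml(ByVal strXml As String, _
    Optional ByRef strReason As String) As Boolean

    Dim objXml As Object

    Set objXml = CreateObject("MSXML2.DOMDocument.6.0")
    objXml.async = False

    ' loadXML returns False if parsing fails, for example when the
    ' content contains a control character that XML 1.0 does not allow.
    If objXml.loadXML(strXml) Then
        IsWellFormedXml = True
    Else
        With objXml.parseError
            strReason = "Line " & .Line & ": " & Trim$(.reason)
        End With
    End If

End Function
```

A caller could log `strReason` and switch to another export format when the check fails, rather than silently writing an unusable file.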
@joyfullservice
Owner

This enhancement was a success. Closing this out as completed.
