Double escaping text in XML files? #352

Closed
jhgarrison opened this issue Sep 7, 2022 · 18 comments

Comments

@jhgarrison

When exporting table contents, the latest version (3.4.23) seems to be double-escaping text in XML files.

Here's one sample diff from within a file that hasn't been exported since before I upgraded to 3.4.23

-<Comment>Observation Req'd</Comment>
+<Comment>Observation Req&amp;apos;d</Comment>

If you were just escaping I'd expect Observation Req&apos;d, but it seems the escaping has been applied twice. I haven't tried building from source but I suspect this is not correct.

@jhgarrison
Author

This seems to be a bug introduced by #314. While Access ExportXML may incorrectly encode the ampersand character in a TableDef index name, it DOES correctly encode table data.

The end result is that the output of ExportXML for table data is correct, but then calling SanitizeXML on the output file results in double encoding.

It appears the application of SanitizeXML must be restricted to only those files (or even parts of files) that require it.
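
The double encoding described above can be reproduced outside Access. The following is a minimal Python sketch, not the add-in's VBA `SanitizeXML` routine: `escape_once` is a hypothetical stand-in for any step that escapes XML special characters, and applying it to text that is already escaped produces the `&amp;apos;` seen in the diff.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the apostrophe and quote
# mappings are supplied explicitly so all five predefined XML entities
# are covered. EXTRA and escape_once are illustrative names only.
EXTRA = {"'": "&apos;", '"': "&quot;"}


def escape_once(text: str) -> str:
    """Escape XML special characters exactly one time."""
    return escape(text, EXTRA)


raw = "Observation Req'd"
once = escape_once(raw)    # what a single, correct escaping pass yields
twice = escape_once(once)  # what a second pass over escaped text yields

print(once)
print(twice)
```

The second pass treats the `&` of `&apos;` as a literal ampersand, which is why an escaping step should only ever run on text that has not been escaped yet.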

@jhgarrison
Author

For a local fix I'm just removing the call to SanitizeXML from clsDbTableData::IDbComponent_Export(). Not sure if this is the correct fix globally (it may need to be removed other places) so I won't submit a PR.

@jhgarrison
Author

Another thought... It appears MS have made a change to ExportXML sometime in the last few months. That same file, exported with my local fix, now produces this diff:

-<Comment>Observation Req'd</Comment>
+<Comment>Observation Req&apos;d</Comment>

which is what I would expect, but does mean ExportXML is now encoding characters that previously were not encoded.

It might be necessary to evaluate if SanitizeXML is still needed.

@joyfullservice
Owner

Thanks for the research and testing on this! We can certainly limit the sanitizing by version if different versions of Access behave differently on the character encoding. I will try to do a little testing on Access 2010 when I have a chance...

@bgsmase

bgsmase commented Oct 14, 2022

Apart from comment text, this can break things if you have escaped characters in validation rules. For example, I found that when I had a ValidationRule on a field of >=1 this was getting exported as <od:fieldProperty name="ValidationRule" type="12" value="&amp;gt;=1"/> and when trying to build from source again it came in as "&gt;"=1 in the Validation Rule box.

Microsoft® Access® for Microsoft 365 MSO (Version 2209 Build 16.0.15629.20152) 32-bit
VCS add in version 3.4.23
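
To illustrate why the double-escaped attribute breaks the round trip, here is a small Python sketch using the standard library XML parser. The element is a simplified stand-in for the exported `od:fieldProperty` (the namespace prefix is dropped so the fragment parses on its own), and `read_rule` is a hypothetical helper, not part of the add-in.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the exported element; the "od:" prefix is
# omitted because this fragment carries no namespace declaration.
correct = '<fieldProperty name="ValidationRule" type="12" value="&gt;=1"/>'
double_escaped = '<fieldProperty name="ValidationRule" type="12" value="&amp;gt;=1"/>'


def read_rule(xml_text: str) -> str:
    """Return the decoded value attribute, as an XML parser would see it."""
    return ET.fromstring(xml_text).get("value")


print(read_rule(correct))
print(read_rule(double_escaped))
```

A parser decodes entities only once, so the double-escaped attribute comes back as the literal text `&gt;=1` instead of the rule `>=1`. How Access then renders that text in the Validation Rule box is its own behavior and is not modelled here.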

@bgsmase

bgsmase commented Oct 14, 2022

My current workaround is a script that edits the exported tbldef XML files, changing &amp; only where it begins a double-escaped entity (&amp;gt; to &gt; and &amp;lt; to &lt;) and leaving it alone where it simply escapes a literal & that isn't the start of another entity reference.
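
A workaround along these lines could be sketched as follows. This is not the script actually used, which was not posted; `undo_double_escape` is a hypothetical name, and including `apos` and `quot` alongside `lt` and `gt` is an assumption.

```python
import re

# Match "&amp;" only when it is immediately followed by the remainder of
# a predefined entity name and a semicolon. "amp" is deliberately left
# out so that a plain escaped ampersand is never touched.
DOUBLE_ESCAPED = re.compile(r"&amp;(lt|gt|apos|quot);")


def undo_double_escape(xml_text: str) -> str:
    """Collapse double-escaped entities such as &amp;gt; back to &gt;."""
    return DOUBLE_ESCAPED.sub(r"&\1;", xml_text)


print(undo_double_escape('value="&amp;gt;=1"'))
print(undo_double_escape("<Comment>Tom &amp; Jerry</Comment>"))
```

One caveat with any such post-processing: if the data legitimately contains the literal text `&gt;`, a correct export would also write it as `&amp;gt;`, and this rewrite cannot tell the two cases apart.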

@Indigo744
Contributor

I just came across this issue recently when I tried to migrate my few local tables from Tab Delimited to XML.

In particular, it botched the USysRibbons table, which has a RibbonXml field used to customize the ribbon.

Before (in this custom ribbon, I only hide the Help toolbar):
image

After exporting and rebuilding from source:
image

For the time being, I will keep the exported data as Tab Delimited for this table.
image

But I do think this issue should be fixed. @joyfullservice did you look into it? Do you need help?

@joyfullservice
Owner

Does it have this problem in the latest build on the dev branch? I am thinking we had worked through some XML issues a little while back... We also have some recent updates to rework the entire XML structuring, thanks to @bclothier's excellent contributions to refactor the XML processing to use XSL.

Let me know if you still see this issue after building the dev branch from source...

@Indigo744
Contributor

Ah, I didn't think of checking the dev branch! I was waiting for a stable release 😸
Anyway, checking with latest release 4.0.9, I was able to export and reimport successfully, and it works as expected! So this is great. Thank you and I'm sorry to have bothered you without properly searching.

Maybe you can add this information as a "known issue" of the 3.x branch.

As a side note, the export takes roughly 30% longer on the 4.x branch than on 3.x, mainly in the Tables category (x5) and the Read File operation (x10). I'm not really sure why; I haven't looked into it closely.

@bclothier
Contributor

One concern is whether it's due to the changes introduced in #388. A quick test would be to compare against the version from this commit, which is the last one before #388 was merged.

I do expect some slowdown due to the new table connection checks introduced in the PR but would like to confirm whether it's actually 30% slower which seems surprising.

@Indigo744
Contributor

Please note that I used the latest released build, which is version 4.0.9. I'll try again with 4.0.10 once it is available and share the results.

@joyfullservice
Owner

@Indigo744 - Attached is a fresh build from the current dev branch, which would have all the latest updates, including the enhancements from @bclothier. Let me know what you see on the performance side... (You will notice that the performance reports now sort the items by time, so you can see the slowest items at the top of the list.)
Version Control_v4.0.10(dev).zip

@joyfullservice
Owner

I did a fair bit of testing on this with one of my most complex databases, comparing between the 4.04 and the current dev build. After testing back and forth between the versions, running multiple full exports, the current version is coming up consistently faster.

The chart below compares the two versions by the seconds spent running some of the bigger operations:

image

Here are the actual performance reports from version 4.0.10 on this database during a full export:

--------------------------------------------------
                PERFORMANCE REPORTS
--------------------------------------------------
Category                      Count     Seconds
--------------------------------------------------
Forms                         209       4.79
Tables                        138       2.42
Reports                       46        1.98
Queries                       75        1.20
Modules                       42        0.51
DB Connections                5         0.37
Doc Properties                1         0.29
DB Properties                 54        0.09
VBE Forms                     1         0.07
Macros                        2         0.06
Shared Images                 3         0.05
Nav Pane Groups               1         0.05
Table Data                    1         0.04
Themes                        1         0.04
VBE References                11        0.04
Hidden Attributes             0         0.04
VB Project                    1         0.03
Project                       1         0.03
Table Data Macros             0         0.01
Proj Properties               0         0.00
Relations                     0         0.00
IMEX Specs                    0         0.00
Saved Specs                   0         0.00
--------------------------------------------------
TOTALS:                       592       12.11
--------------------------------------------------

--------------------------------------------------
Operations                    Count     Seconds
--------------------------------------------------
Read File                     590       2.51
Sanitize File                 332       2.03
App.SaveAsText()              332       1.92
Scan DB Objects               1         0.88
Increment Progress            1815      0.83
Save Table SQL                138       0.64
Convert to JSON               3606      0.57
Read File Bytes               632       0.39
Write File                    723       0.35
Clear Orphaned Files          10        0.32
Get File Property Hash        1048      0.26
Save Query SQL                75        0.23
Console Updates               8         0.19
Compute SHA256                2076      0.18
Read File DevMode             255       0.15
Get VBA Hash                  297       0.12
Check for linked table        138       0.12
Parse JSON                    1         0.12
Export VBE Module             42        0.11
Delete File                   377       0.10
Verify Path                   1103      0.10
Get Modified Date             524       0.09
Enc. Windows-1252 as utf-8    42        0.06
App.ExportXML()               2         0.01
RunBeforeExport               1         0.01
Export Table Data as TDF      1         0.01
Quick Count Objects           1         0.00
Quick Count Files             1         0.00
Sanitize XML                  2         0.00
Format XML                    2         0.00
Write Binary File             3         0.00
Export Theme                  1         0.00
Clear Orphaned Folders        1         0.00
--------------------------------------------------
Other Operations                        0.06
--------------------------------------------------

@Indigo744
Contributor

@joyfullservice @bclothier
After testing the latest 4.0.10 build, it is only about 10% slower (4s), which is marginal.

                      3.4.23                      		                         4.0.10
--------------------------------------------------		--------------------------------------------------
                PERFORMANCE REPORTS               		                PERFORMANCE REPORTS
--------------------------------------------------		--------------------------------------------------
Object Type                   Count     Seconds   		Category                      Count     Seconds
--------------------------------------------------		--------------------------------------------------
Project                       1         0.01      		Queries                       348       17.96
VB Project                    1         0.01      		Forms                         265       11.40
VBE References                7         0.03      		Tables                        249       6.10
Proj Properties               2         0.00      		Reports                       134       4.57
DB Properties                 61        0.14      		VB Project                    1         0.92
Shared Images                 16        0.10      		Modules                       48        0.89
Themes                        1         0.00      		DB Connections                1         0.76
Tables                        249       3.38      		Doc Properties                3         0.71
Queries                       348       18.97     		Shared Images                 16        0.22
Forms                         265       11.22     		Hidden Attributes             3         0.19
Macros                        14        0.13      		DB Properties                 61        0.19
Reports                       134       4.54      		Macros                        14        0.17
Table Data                    4         0.09      		Table Data                    4         0.14
Modules                       48        0.70      		Nav Pane Groups               1         0.04
Doc Properties                3         0.02      		VBE References                7         0.04
Nav Pane Groups               1         0.01      		Proj Properties               2         0.03
Hidden Attributes             3         0.00      		Table Data Macros             0         0.02
--------------------------------------------------		Themes                        1         0.02
TOTALS:                       1158      39.36     		Project                       1         0.01
--------------------------------------------------		Relations                     0         0.00
                                                  		IMEX Specs                    0         0.00
                                                  		VBE Forms                     0         0.00
                                                  		Saved Specs                   0         0.00
                                                  		--------------------------------------------------
                                                  		TOTALS:                       1159      44.38
                                                  		--------------------------------------------------
                                                  		
--------------------------------------------------		--------------------------------------------------
Operations                    Count     Seconds   		Operations                    Count     Seconds
--------------------------------------------------		--------------------------------------------------
Read File                     1564      0.30      		Read File                     1179      11.00
Parse JSON                    386       0.29      		App.SaveAsText()              761       10.95
Convert to JSON               10736     2.41      		Read File Bytes               1236      4.60
Compute SHA256                1751      0.08      		Scan DB Objects               1         3.34
Console Updates               7         0.08      		Save Query SQL                348       2.90
Compare Dictionary            384       0.00      		Sanitize File                 761       2.90
Get Modified Date             1061      0.11      		Convert to JSON               8449      2.61
Get File Property Hash        1061      0.20      		Save Table SQL                249       1.86
Clear Orphaned                8         0.42      		Increment Progress            3994      0.61
Write File                    1394      0.47      		Write File                    1762      0.56
Verify Path                   2238      0.12      		Enc. Windows-1252 as utf-8    48        0.42
Write to Disk                 16        0.00      		Clear Orphaned Files          9         0.42
Create Folder                 1         0.00      		Check for linked table        249       0.40
Export Theme                  1         0.00      		Get File Property Hash        2168      0.38
Save Table SQL                249       0.22      		Get VBA Hash                  447       0.23
App.ExportXML()               18        0.17      		Compute SHA256                4005      0.22
Read File Bytes               1178      4.72      		App.ExportXML()               18        0.20
Sanitize XML                  18        0.01      		Export VBE Module             48        0.17
Format XML                    18        0.01      		Delete File                   837       0.15
Increment Progress            69        0.48      		Read File DevMode             399       0.15
App.SaveAsText()              761       11.33     		Parse JSON                    1         0.13
Delete File                   813       0.15      		Verify Path                   2607      0.11
Sanitize File                 761       2.88      		Get Modified Date             1108      0.10
Save Query SQL                348       3.51      		Console Updates               7         0.03
Read File DevMode             399       0.15      		Move File                     21        0.02
Get VBA Hash                  447       0.25      		Sanitize XML                  18        0.02
Export VBE Module             48        0.16      		Quick Count Objects           1         0.01
Enc. Windows-1252 as utf-8    48        0.40      		Format XML                    18        0.01
--------------------------------------------------		Quick Count Files             1         0.01
Other Operations                        11.63     		Write Binary File             16        0.00
--------------------------------------------------		Export Theme                  2         0.00
                                                  		Create Folder                 1         0.00
                                                  		Clear Orphaned Folders        1         0.00
                                                  		--------------------------------------------------
                                                  		Other Operations                        0.12
                                                  		--------------------------------------------------

@joyfullservice How much work is still needed before a 4.0.x release? Is it PROD ready yet or should I wait for more stability?

@bclothier
Contributor

One thing I can't help noticing is that in 3.4.23, Other Operations takes 11.63 seconds whereas it's 0.12 in 4.0.10. That may indicate that the higher times reported for some objects, such as tables, are actually a more accurate measurement. Read File went from 0.30 to 11.00.

I do not know enough about the performance measurement implementation to be sure whether that affects categories. For example, 3.4.23 reports 3.38 seconds for Tables whereas 4.0.10 reports 6.10 seconds, but is that because it is now measuring the time more accurately than before? That does have the unfortunate side effect of obscuring where the actual slowdown is. I'm glad it's only 10% slower given the other changes that were added.

@joyfullservice
Owner

Thanks for posting the performance reports! That is really helpful, especially on a very large, complex database. One thing that has me a bit mystified is why we see such a performance difference in the Read File function... I have verified in the source code, and the function itself is identical between these versions. 🤔 If you take out this difference, version 4.0.10 is actually a few seconds faster overall, which would make sense to me, given some of the additional optimizations in the newer version. (In my testing I was finding the newer version generally slightly faster.)

I did notice something interesting with the Read File function on my computer yesterday. I noticed that the read times seemed higher than I was expecting, and my computer was doing a lot more with memory, CPU and disk IO. Windows Explorer seemed to be using quite a bit of CPU, so I restarted the process. Subsequent exports went much faster, and the Read File function was back in the expected range. This might have been a fluke thing with my computer, but it was interesting to note.

Regarding the performance tracking, the newer version is going to be more accurate, especially in regard to the Other Operations. I got that cleaned up a bit more in the newer version to ensure we were tracking more operations that were slipping through the cracks in earlier versions.
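
The arithmetic behind these observations can be checked with a short Python sketch. The figures are copied from the two reports posted above; the dictionary names are illustrative only, and the category totals are used as a proxy for overall export time, which may not match a wall-clock measurement exactly.

```python
# Seconds taken from the 3.4.23 and 4.0.10 export reports in this thread.
totals = {"3.4.23": 39.36, "4.0.10": 44.38}     # TOTALS rows
read_file = {"3.4.23": 0.30, "4.0.10": 11.00}   # "Read File" operation
other_ops = {"3.4.23": 11.63, "4.0.10": 0.12}   # "Other Operations" row

# Overall difference between the two category totals.
slowdown = totals["4.0.10"] - totals["3.4.23"]
slowdown_pct = 100 * slowdown / totals["3.4.23"]

# Category totals with the Read File time taken out of each version.
without_read_file = {v: round(totals[v] - read_file[v], 2) for v in totals}

# How far each of the two rows moved between versions.
read_file_shift = read_file["4.0.10"] - read_file["3.4.23"]
other_ops_shift = other_ops["3.4.23"] - other_ops["4.0.10"]

print(f"Difference in totals: {slowdown:.2f}s ({slowdown_pct:.1f}%)")
print(f"Totals without Read File: {without_read_file}")
print(f"Read File grew by {read_file_shift:.2f}s; "
      f"Other Operations shrank by {other_ops_shift:.2f}s")
```

The Read File increase and the Other Operations decrease are of similar size, which fits the suggestion that time previously left untracked is now attributed to a named operation. The sketch does not establish that this is the cause.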

@joyfullservice
Owner

> @joyfullservice How much work is still needed before a 4.0.x release? Is it PROD ready yet or should I wait for more stability?

Great question! I have been using it in production, and it is working great for me. The main things remaining before release are to finish working through the last few objects that still need merge support, then finish out the merge build functionality. (This will allow you to merge a few changed source files into an existing database without needing to build the entire thing from scratch.) The merge build will be a game-changer in a multi-developer context because it allows you to quickly and easily merge in another developer's changes without having to stop and build everything from source.

The other significant change I am planning to implement before the general 4.0 rollout is splitting out the VBA code from form and report exports. This is discussed in more detail in #378, and would mean that a form has two source files: one with the object definition, and a corresponding class file with the VBA code. I am pretty close to finishing a way to make this split while still preserving the git history for those using git as their VCS back end.

I am pretty comfortable with v4 at this point, and don't really anticipate any other major breaking changes in this version as we head towards the general release. There is a little fine tuning left on the conflict detection (particularly in relation to orphaned files), but that is all new functionality anyway.

@bgsmase

bgsmase commented May 3, 2024

I have re-tested with v4.0.34 rather than 3.4.23, and the export and import of validation rules containing '<' characters now works, as do table field comments with ', ", <, > and & characters. So I think this issue can be closed.
