Diff Command
The diff tool compares GEDCOM files and produces an HTML report.
gedcom diff -left-gedcom file1.ged -right-gedcom file2.ged -output out.html
For a complete list of options use:
gedcom diff -help
- -allow-invalid-indents
- -allow-multi-line
- -google-analytics-id-string
- -hide-equal
- -jobs int
- -left-gedcom string (required)
- -minimum-similarity float
- -minimum-weighted-similarity float
- -name-format string
- -no-censuses
- -no-changes
- -no-duplicate-names
- -no-empty-deaths
- -no-events
- -no-labels
- -no-maps
- -no-objects
- -no-places
- -no-residences
- -no-sources
- -only-official
- -only-vitals
- -output string
- -prefer-pointer-above float
- -progress
- -right-gedcom string (required)
- -show string
- -sort string
When enabled, -allow-invalid-indents allows a child node to have an indent more than one level deeper than its parent. It is disabled by default because a file with such indents is broken, possibly seriously, and is certainly not a valid GEDCOM file.
The biggest problem with having the indents wrongly aligned is that nodes that are expected to be a certain depth (such as NPFX inside a NAME) will probably break or interfere with a traversal algorithm that is not expecting the node to be there/at that level. This may lead to unexpected behavior.
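To make the indent rule concrete, here is a minimal Python sketch that flags lines whose level jumps by more than one. It is an illustration only, not the tool's implementation: the function name `find_invalid_indents` is hypothetical, and the sketch assumes the previous line's level is a good enough stand-in for the parent's level.

```python
def find_invalid_indents(lines):
    """Return 1-based line numbers whose level is more than one deeper
    than the level of the line before them (assumed proxy for the parent)."""
    invalid = []
    previous_level = -1  # so that the first line is expected to be level 0
    for number, line in enumerate(lines, start=1):
        parts = line.split(maxsplit=1)
        if not parts or not parts[0].isdigit():
            continue  # not a GEDCOM line; ignored in this sketch
        level = int(parts[0])
        if level > previous_level + 1:
            invalid.append(number)
        previous_level = level
    return invalid


sample = [
    "0 @I1@ INDI",
    "1 NAME John /Smith/",
    "3 NPFX Dr.",  # jumps from level 1 to level 3
    "1 SEX M",
]
print(find_invalid_indents(sample))
```

In the sample, the NPFX line is reported because it skips level 2, which is the kind of misplaced node a traversal would not expect.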
It is not valid for GEDCOM values to contain new lines or carriage returns. However, some applications dump data without correctly using the CONT tags.
Strictly speaking we should bail out with an error, but there are too many cases that are difficult for consumers to clean up, so we offer an option to permit it.
When enabled, any line that cannot be parsed will be considered an extension of the previous line (including the new line character).
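The merging behaviour described above can be sketched in Python as follows. This is a simplified illustration rather than the tool's parser: the regular expression `GEDCOM_LINE` and the function `merge_unparsable_lines` are assumptions made for this example, and the pattern is only a rough approximation of a GEDCOM line (level, optional pointer, tag, optional value).

```python
import re

# Assumed, simplified shape of a GEDCOM line:
#   level, optional @pointer@, tag, optional value.
GEDCOM_LINE = re.compile(r"^\s*\d+\s+(@[^@]+@\s+)?\w+(\s.*)?$")


def merge_unparsable_lines(lines):
    """Append every line that does not look like a GEDCOM line to the
    previous line, keeping the new line character between them."""
    merged = []
    for line in lines:
        if GEDCOM_LINE.match(line) or not merged:
            merged.append(line)
        else:
            merged[-1] += "\n" + line
    return merged


raw = [
    "0 @I1@ INDI",
    "1 NOTE First line",
    "second line of the note",  # should have used a CONT tag
    "1 SEX M",
]
print(merge_unparsable_lines(raw))
```

One limitation of this sketch: a stray continuation line that happens to start with a number followed by a word (for example a street address) would be mistaken for a valid line.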
The Google Analytics ID, like 'UA-78454410-2'.
Hide equal values.
Number of jobs to run in parallel. If you are comparing large trees this will make the process faster but will consume more CPU. (default 1)
Left GEDCOM file.
The minimum similarity is the threshold for matching individuals as the same person. It compares only the individual, not the surrounding family such as spouses and children.
This value must be between 0 and 1 and should be set to the same value as "minimum-weighted-similarity" if you are unsure. (default 0.733)
The weighted minimum similarity is the threshold for whether two individuals should be seen as the same person when the surrounding immediate family is taken into consideration.
This value must be between 0 and 1 and is the primary way to adjust the sensitivity of matches. It is best to also set "-minimum-similarity" to the same value.
A higher value means you will get fewer matches but they will be of higher quality. If you are comparing trees that do not share many of the same individuals you should consider raising this to prevent false positives. (default 0.733)
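The effect of raising the threshold can be illustrated with a small Python sketch. It does not compute similarity the way the tool does; the scores below are made-up sample values, the function `filter_matches` is hypothetical, and treating a score equal to the threshold as a match is an assumption.

```python
def filter_matches(scored_pairs, minimum_similarity=0.733):
    """Keep only the (left, right, score) pairs that reach the threshold."""
    if not 0.0 <= minimum_similarity <= 1.0:
        raise ValueError("minimum similarity must be between 0 and 1")
    return [pair for pair in scored_pairs if pair[2] >= minimum_similarity]


# Made-up scores, for illustration only.
pairs = [
    ("@I1@", "@I7@", 0.95),
    ("@I2@", "@I9@", 0.74),
    ("@I3@", "@I4@", 0.60),
]
print(len(filter_matches(pairs)))       # matches at the default threshold
print(len(filter_matches(pairs, 0.9)))  # fewer matches at a stricter threshold
```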
The NAME node can be represented as a single string, or as name parts such as Given name, Surname, Title, etc. When enabled, this option flattens name parts into a single string with the given format:
- written (default): Flatten names to their written names, like "John Smith".
- gedcom: Flatten names to their GEDCOM name, like "John /Smith/".
- index: Flatten names to their index name, like "Smith, John".
- unmodified: Do not make any modifications to the name or name parts.
You can also provide a custom format (see NameFormat) by not using one of the presets above. (default "written")
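The three flattening presets can be sketched in Python using the examples given above. This is an illustration under assumptions: `flatten_name` is a hypothetical helper that handles only a given name and a surname, and it does not cover custom formats or the other name parts.

```python
def flatten_name(given, surname, preset="written"):
    """Flatten a given name and surname into one string for a preset."""
    if preset == "written":
        return f"{given} {surname}"
    if preset == "gedcom":
        return f"{given} /{surname}/"
    if preset == "index":
        return f"{surname}, {given}"
    raise ValueError(f"unsupported preset in this sketch: {preset}")


for preset in ("written", "gedcom", "index"):
    print(preset, "->", flatten_name("John", "Smith", preset))
```

The "unmodified" preset is left out because it performs no flattening at all.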
Exclude censuses.
Exclude change timestamps.
Exclude names that are duplicates.
Remove death nodes (DEAT) that do not have children. Applications write such nodes to signal that the individual is not living, but they can lead to unwanted discrepancies in the comparison.
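As a rough illustration of what removing empty death nodes means, the Python sketch below drops DEAT nodes that have no children from an individual. The dictionary layout (`tag`, `children`) and the function `remove_empty_deaths` are assumptions for this example, not the tool's data model.

```python
def remove_empty_deaths(individual):
    """Return a copy of the individual without childless DEAT nodes."""
    kept = [
        child
        for child in individual["children"]
        if not (child["tag"] == "DEAT" and not child["children"])
    ]
    return {**individual, "children": kept}


person = {
    "tag": "INDI",
    "children": [
        {"tag": "NAME", "children": []},
        {"tag": "DEAT", "children": []},  # only signals "not living"
    ],
}
print([child["tag"] for child in remove_empty_deaths(person)["children"]])
```

A DEAT node that carries a date or place as a child would be kept by this sketch.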
Exclude events.
Exclude labels.
Exclude maps (locations).
Exclude objects.
Exclude places.
Exclude residence events.
Exclude sources.
Only include official GEDCOM tags.
Remove all data except for vital information. The vital nodes are name, birth, baptism, death and burial (including multiples of each within the same individual). Within these, only the date and place are retained.
Output file.
Controls if two individuals should be considered a match by their pointer value.
The default value is 0.733000 which means that the individuals will be considered a match if they share the same pointer and hit the same default minimum similarity.
A value of 1.0 means the individuals would have to be a perfect match to be considered equal by their pointer; this is the same as disabling the feature.
A value of 0.0 would mean that it always trusts the pointer match, even if the individuals are nothing alike.
This option makes sense when you are comparing documents that have come from the same base and retained the pointers between individuals of the existing data. (default 0.733)
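The rule described above can be written as a short Python sketch. It is one reading of the description, not the tool's code: the function `is_pointer_match` is hypothetical, and using "greater than or equal" for the comparison is an assumption chosen so that 0.0 always trusts the pointer.

```python
def is_pointer_match(left_pointer, right_pointer, similarity,
                     prefer_pointer_above=0.733):
    """Treat two individuals as a match when they share a pointer and
    their similarity reaches the threshold (assumed to be inclusive)."""
    return left_pointer == right_pointer and similarity >= prefer_pointer_above


print(is_pointer_match("@I1@", "@I1@", 0.80))       # same pointer, similar enough
print(is_pointer_match("@I1@", "@I1@", 0.10, 0.0))  # 0.0 always trusts the pointer
print(is_pointer_match("@I1@", "@I1@", 0.99, 1.0))  # 1.0 needs a perfect match
```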
Show progress bar.
Right GEDCOM file.
The "-show" option controls which individuals are shown in the output:
- all (default): Show all individuals from both files.
- only-matches: Only show individuals that match in both files. You can control the threshold with the "-minimum-weighted-similarity" and "-minimum-similarity" options. This is useful when comparing trees that are unlikely to have many matches.
- subset: The right side will be considered a smaller part of the larger left side. This means that individuals that entirely exist on the left side will not be shown. This is useful when comparing a smaller part of a tree with a larger tree.
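The three modes can be pictured as filters over pairs of individuals. The Python sketch below is an interpretation, not the tool's logic: each row is an assumed `(left, right)` tuple where `None` means the individual is missing on that side, `filter_rows` is a hypothetical helper, and "subset" is read as hiding individuals that exist only on the left.

```python
def filter_rows(rows, show="all"):
    """Select which (left, right) rows appear for a given show mode."""
    if show == "all":
        return list(rows)
    if show == "only-matches":
        return [row for row in rows if row[0] is not None and row[1] is not None]
    if show == "subset":
        # Assumed reading: hide rows that exist only on the left side.
        return [row for row in rows if row[1] is not None]
    raise ValueError(f"unknown show mode: {show}")


rows = [
    ("John Smith", "John Smith"),  # matched in both files
    ("Jane Doe", None),            # only in the left file
    (None, "Bob Jones"),           # only in the right file
]
for mode in ("all", "only-matches", "subset"):
    print(mode, len(filter_rows(rows, mode)))
```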
Controls how the individuals are sorted in the output:
- written-name (default): Sort individuals by their written name.
- highest-similarity: Sort the individuals by their match similarity. Highest matches will appear first.
(default "written-name")
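The two sort orders amount to the following, shown as a minimal Python sketch. The row layout (`name`, `similarity`), the sample values and the function `sort_rows` are assumptions for illustration only.

```python
def sort_rows(rows, sort="written-name"):
    """Order rows by written name, or by similarity with the highest first."""
    if sort == "written-name":
        return sorted(rows, key=lambda row: row["name"])
    if sort == "highest-similarity":
        return sorted(rows, key=lambda row: row["similarity"], reverse=True)
    raise ValueError(f"unknown sort order: {sort}")


# Made-up rows, for illustration only.
people = [
    {"name": "John Smith", "similarity": 0.80},
    {"name": "Alice Brown", "similarity": 0.75},
    {"name": "Mary Jones", "similarity": 0.95},
]
print([row["name"] for row in sort_rows(people)])
print([row["name"] for row in sort_rows(people, "highest-similarity")])
```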