Meerkat Name Parser

The Meerkat.NameParser library allows you to parse personal names into their constituent parts.

This is most useful when you've been given a list of full names, e.g. "Mr Bob Smith BSc", that you then need to store in a domain model and/or use for other purposes such as personalising an email or a website, addressing a letter, etc.

Contributions are welcome from anyone.

You can see the version history in the release notes (RELEASE_NOTES.md).

Build the project

Windows: Run build.cmd

The tooling should be installed automatically by Paket/FAKE. The default build will compile and test the project, and also produce a NuGet package.

Library License

The library is available under the MIT License; for more information, see the License file in the GitHub repository.

Getting Started

Name Structure

The library defines an interface that exposes the standardized parts of a name. All of the parts are optional, to allow for flexible handling and for poor data quality:

T)itle - Typically "Mr", "Mrs", "Ms" but we can also recognise professional ("Dr", "Prof" etc) and military ("Captain", "Major", "Lt" etc) titles
G)iven name(s) - The personal names e.g. "Bob", "Jean-Paul"
P)refix - Prefix before the family name e.g. "van", "de", kept separate as we might want/need to collate without the prefix for searching/sorting.
F)amily name - The family name e.g. "Smith", "Gonzalez-Byaz"
S)uffix - Part of the family name, typically generational e.g. "Jnr", "II", "III"; again, this might not be needed for collating/searching
L)etters - Can be civil awards e.g. "OBE", "CBE", military awards e.g. "VC", "GC", "DSM", "DSO", degrees e.g. "BSc", "PhD", professional awards e.g. "CEng" or membership of professional societies e.g. "FRCS", "FBCS", "MACM" etc

We use given and family name in the library rather than forename/surname to avoid confusion with non-Western names, where the display order is "Family Given" rather than "Given Family".
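As an illustration of why the prefix is held apart from the family name, here is a minimal sketch of a sort key that collates without the prefix. The `PersonName` class and the `NameCollation` helper are hypothetical names invented for this example; they are not the library's own types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for this example only; any part may be left null.
public sealed class PersonName
{
    public string Title { get; set; }
    public string Given { get; set; }
    public string Prefix { get; set; }
    public string Family { get; set; }
    public string Suffix { get; set; }
    public string Letters { get; set; }
}

public static class NameCollation
{
    // Build a key from Family then Given, deliberately skipping Prefix and Suffix.
    public static string SortKey(PersonName name) =>
        ((name.Family ?? string.Empty) + "|" + (name.Given ?? string.Empty)).ToUpperInvariant();

    // Order names by that key using an ordinal comparison.
    public static List<PersonName> Sort(IEnumerable<PersonName> names) =>
        names.OrderBy(n => SortKey(n), StringComparer.Ordinal).ToList();
}
```

With this key, a name with Prefix "van" and Family "Gogh" sorts under G, ahead of "Smith", while the prefix is still available for display.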

We also have two other properties:

Salutation - For the most part this is simple e.g. "Dear {Given}" or "Dear {Title} {Prefix} {Family} {Suffix}", depending on how formal you want to be. However, British English has anomalies: "Sir George Bingham" is addressed as "Dear Sir George", and the "Bishop Of Canterbury" as "My Lord Bishop"
Envelope - The name you would put on the envelope e.g. "Mr Bob Smith"

These are controlled by SalutationFormat and EnvelopeFormat so that they may be generated from the name parts or set explicitly.
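A minimal sketch of how a format such as "Dear {Title} {Prefix} {Family} {Suffix}" could be expanded when some parts are missing. It assumes the format is a template with brace placeholders, as in the examples above; the library's actual format syntax may differ, and `NameFormatter` is a hypothetical helper, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NameFormatter
{
    // Replace each {Part} placeholder with its value (empty when the part is absent),
    // then collapse the whitespace left behind by the missing parts.
    public static string Format(string template, IReadOnlyDictionary<string, string> parts)
    {
        var expanded = Regex.Replace(template, @"\{(\w+)\}", m =>
            parts.TryGetValue(m.Groups[1].Value, out var value) ? value : string.Empty);
        return Regex.Replace(expanded, @"\s+", " ").Trim();
    }
}
```

Given only Title "Mr" and Family "Smith", the formal template above yields "Dear Mr Smith", with no stray spaces where the prefix and suffix would have been.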

Parsing Methodology

The approach is to split the name into tokens and then identify the various name parts present, either by using heuristics or by recognising explicit tokens. This is probably best explained with a few examples.

Bob Smith -> (GF) : Using the heuristic that without any other information the last token is the Family name and others are Given names
Bob George Smith -> (GGF) : Same as above
Bob Smith-Johnson -> (GF) : The hyphen binds the two family name parts together, then the GF heuristic applies
Jean-Paul Gautier -> (GF) : Same heuristic, with the hyphen binding applied to the given names. We ignore whitespace around the hyphen, so "Jean - Paul Gautier" etc. would parse the same way.
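The examples above can be sketched roughly as follows. This is an illustration of the default heuristic and the hyphen binding only: it does not recognise titles, prefixes, suffixes or letters, and `NameHeuristics` is a hypothetical class written for this example rather than the library's parser.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NameHeuristics
{
    // Remove whitespace around hyphens so "Jean - Paul" becomes the single
    // token "Jean-Paul", then split the result on spaces and tabs.
    public static List<string> Tokenize(string input)
    {
        var bound = Regex.Replace(input, @"\s*-\s*", "-");
        return bound.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries).ToList();
    }

    // Default heuristic: the last token is the Family name (F) and every
    // earlier token is a Given name (G). A single token is treated as F.
    public static string Classify(string input)
    {
        var tokens = Tokenize(input);
        if (tokens.Count == 0) return string.Empty;
        return new string('G', tokens.Count - 1) + "F";
    }
}
```

Here `Classify("Bob George Smith")` gives "GGF", while "Bob Smith-Johnson" and "Jean - Paul Gautier" both give "GF" because the hyphenated parts are bound into one token before the heuristic runs.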

When we analyse the input string we create Symbols, which implement IToken:

public interface IToken
{
    /// <summary>
    /// Class of the symbol, application dependent
    /// </summary>
    TokenClass Token { get; set; }

    /// <summary>
    /// Value of token, e.g. the keyword/variable name/constant
    /// </summary>
    object Value { get; set; }
}

For titles, suffixes and letters we have explicit keyword lists that can be extended if your data set has values not already handled. These are all stored as NameSymbol entities:

public class NameSymbol : Symbol
{
    public NameSymbol()
    {
        Token = TokenClass.Value;
        Properties = new Dictionary<string, object>();
    }

    public NameToken NameType { get; set; }

    public Dictionary<string, object> Properties { get; set; }
}

The rationale for the additional property bag is to allow other information to be stored against the NameSymbol, which can be used to derive more information about the person or how they are displayed e.g.

{ 
    "NameType": "Title", 
    "Value": "Bishop", 
    "Properties":{ 
        "Gender": "M", 
        "SalutationFormat": "My Lord Bishop" 
    } 
},
{ 
    "NameType": "Title", 
    "Value": "Sir", 
    "Properties":{ 
        "Gender": "M", 
        "SalutationFormat": "DG" 
    }
}

This is limited only by what you can infer from the data and what you need for your application. Here are some examples from the Letter entries:

{ "NameType": "Academic", "Value": "BEcon", "Properties":{ "Degree": "Bachelors", "Subject": "Economics" } },
{ "NameType": "Academic", "Value": "BEd", "Properties":{ "Degree": "Bachelors", "Subject": "Education" } },
{ "NameType": "Academic", "Value": "BEng", "Properties":{ "Degree": "Bachelors", "Subject": "Engineering" } },
{ "NameType": "Academic", "Value": "BPharm", "Properties":{ "Degree": "Bachelors", "Subject": "Pharmacology" } },
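A minimal sketch of how entries like these could be indexed and their properties read back. `KeywordEntry` and `KeywordTable` are hypothetical stand-ins that mirror the JSON shape shown above, and the case-insensitive matching is an assumption of this example, not documented behaviour of the library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one keyword entry, mirroring the JSON shape.
public sealed class KeywordEntry
{
    public string NameType { get; set; }
    public string Value { get; set; }
    public Dictionary<string, object> Properties { get; set; } = new Dictionary<string, object>();
}

public static class KeywordTable
{
    // Index entries by Value; case-insensitive so "beng" and "BEng" match (an assumption).
    public static Dictionary<string, KeywordEntry> Build(IEnumerable<KeywordEntry> entries)
    {
        var index = new Dictionary<string, KeywordEntry>(StringComparer.OrdinalIgnoreCase);
        foreach (var entry in entries) index[entry.Value] = entry;
        return index;
    }

    // Read an optional property, returning null when the token or the key is unknown.
    public static object Lookup(Dictionary<string, KeywordEntry> index, string token, string key) =>
        index.TryGetValue(token, out var entry) && entry.Properties.TryGetValue(key, out var value)
            ? value
            : null;
}
```

Returning null for an unknown token or key keeps the property bag optional, in line with the name parts themselves: callers can use extra information where it exists without requiring it on every entry.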
