Addition of Shahmukhi rules, fixes #108 #200
For the character which does not appear, that is a relatively new addition to Unicode from 2020 - you can see a picture here https://en.wikipedia.org/wiki/Lam_with_tah_above (This is actually the first unicode character added to the extended character set specifically for writing Punjabi in Shahmukhi.) There are a handful of fonts which support it, but the best open source one is Noto Urdu Nastaleeq. Mac OS actually includes this font I think, but you may still need the latest version from https://github.com/notofonts/nastaliq/releases I realize it would definitely be helpful to explain the rest of the rules I've included here - I will share a full write up later tonight when I have time. |
moved contents to where it's used
This pull request introduces 1 alert when merging afb3cd1 into 24a2e06 - view on LGTM.com new alerts:
|
This pull request introduces 1 alert when merging e72e68f into 24a2e06 - view on LGTM.com new alerts:
|
Most of these notes can be observed in these dictionaries:
Note that these are both old enough that they do not include the newer Unicode characters used in Shahmukhi. Punjabi University's dictionary is mostly very good but a small issue is that they do not use the character ۓ where it is supposed to be used, instead they use ئی. There are instances where it is correct to use ۓ however, otherwise there would be no reason for the character to exist, and you can find plenty of examples of the words it is used in searching elsewhere. Going in order from the original issue, then getting into some of the rest:
This is from Gulshan-i-Urdu, published in Malerkotla (predominantly Muslim part of Indian Punjab) which is a short book intended to teach the Urdu writing conventions to people familiar with reading in Gurmukhi by way of Shahmukhi representations of Punjabi words and instructions based on Gurmukhi character combinations. This is quite a good reference table for the vowel combinations which require a hamza. Note however that there are independent unicode characters for each of the letters hamza can attach to; these are used instead of a combining character. The rules for hamza I included are the same as those listed above, including ۓ, but only where appropriate at the end of a word.
Further, I do have plans to account for some of the more complex cases beyond these general rules. There is a minority of words which cannot be accounted for by any general rule and just have to be explicitly hardcoded. This is a longer-term project I am working on: creating a table of these words that could be used to fill in the remaining gaps in coverage for transliterating from Gurmukhi to Shahmukhi. The most common of these are words which end in ہ in Shahmukhi, which could have multiple different endings in Gurmukhi. ਇਹ in Shahmukhi is ایہہ, with the letter ہ repeated twice at the ending. ਸਬਜ਼ਾ is سبزہ, ਕਹਿ is کہہ, and so on.
There are also several letters (ح خ ث غ ع ط ظ ص ض ق) which do not represent any particular sound used in Punjabi or similar Indic languages but are still used in the spelling of various words in Punjabi and Urdu. These are mostly words derived from Persian or Arabic which are not necessarily pronounced like the words they are derived from. Gurmukhi spells these closer to phonetically, but the Shahmukhi spellings preserve the older characters. You can see some examples of these words in Waris Shah's poetry as a reference point.
Then there are some very strange words which you can find here and there which just break the rules above, including even an example of ے in the middle of a word. Shahmukhi also has a tendency to break the connection of a character before ਗਾ / گا and its inflections where they appear at the end of verb forms. I am not sure how to implement this yet but I intend to investigate this as well. If you look at the very end of the page of the book I posted a photo of above, you can actually see an example which illustrates multiple elements of the exceptional cases discussed here. ۓ is permitted in the middle of that last word even though it is normally an end-of-word character, because it is followed by گا (ਗਾ).
I think it is possible to come up with a way to handle ਗਾ word endings specifically, but I want to think about it more because I want to make sure it only gets applied to the right words. It would also be interesting to try working on a "toSaraiki" version of this for the Saraiki extended Shahmukhi. There are actually a number of words which appear in the Guru Granth Sahib which are no longer used in "standard" Punjabi but which have been preserved in Saraiki. The Saraiki version of Shahmukhi also has additional characters for the rarer consonants like ਞ which are still used in some common Saraiki words. |
Thank you for this incredible write up! It's very comprehensive and helpful. My notes from above:
I see you have a toShahmukhi repo based on this repo. It used to be in MIT and had to be switched to GPLv3. Do you have a preference to work on MIT projects? As an org we started shifting our repos over to MIT (such as our python gurmukhiutils), but have yet to do that with this project. If you are willing to work on the nastaliq functions from scratch (meaning no copying of tests, no copying of this current implementation, and written by hand yourself), then we can start an MIT version of this repo too. I would be happy to help you out with the set up, so just let me know! In completely other news, I'm very happy you're contributing as it's rare for us to have anyone help out. Would you mind sharing a little bit about how you got to this project and why you're willing to work on it? |
So ਲ਼ is actually different from the other letters with bindi in that it is supposed to represent a native Punjabi sound - it's a very slight distinction between ਲ and ਲ਼ but it is one that has been present for a long time. Some common words that have the sound are ਚੌਲ਼ (rice) and ਉਂਗਲ਼ (finger), but I think whether or not someone pronounces the sound differently is dialect-dependent (and what they write may not necessarily correspond to how they pronounce the word). It is definitely optional to use ਲ਼ as many writers have not used it, but this entry in Punjabi University's dictionary is a good example of why they choose to use it in their headwords:
Verbs are generally an interesting place to look at Punjabi-specific phonetic tendencies since very few Punjabi verbs are loaned from elsewhere; they typically have a Sanskrit origin and are inflected based on rules internal to the language. The dictionary is trying to represent two different dialectal pronunciations of this word, which you can hear in the audio sample: ਸੁਆਲਣਾ vs. ਸੁਆਲ਼ਨਾ. This is telling about what type of sound ਲ਼ is, because its presence means that ਣਾ gets replaced with ਨਾ. ਲ਼ involves a "retroflex" sound that requires more tongue movement than ਲ, which makes it harder to pronounce the nasalized ਣਾ afterwards. This occurs in a variety of verbs; for example, ਕਰਨਾ is not ਕਰਣਾ because ਰ represents another sound which makes it hard to pronounce ਣਾ right afterwards. It follows that if you see a word written ending in -ਲਨਾ rather than -ਲਣਾ, that is a hint that at least some speakers are pronouncing ਲ਼ in the word even if they are not writing it.
Overall it is a small detail, but I figure if a writer is going out of their way to use ਲ਼ instead of ਲ, to clarify situations like the one above, it would make sense for that to get converted to the Shahmukhi character for the same purpose.
(That it is a Punjabi-specific sound is why it had to be added to the Unicode set, whereas there are already characters corresponding to the other bindi letters in Arabic-based scripts because those are related to loan words.) What I am thinking of doing to leave an option for people who may have concerns about script/font support as this is a newly added character is adding a "toUrdu" function which just wraps around "toShahmukhi" using ن and ل instead of the newer characters (limiting it to just the "original" Urdu alphabet). The goal I would say is just for Punjabi speakers who are not used to reading Gurmukhi to be able to read resources originally written in it (this may make it easier for them to learn Gurmukhi if they are interested, too). Shahmukhi and Arabic-based scripts generally are much less dependent on phonetics, and most of what Shahmukhi essentially is doing is presenting words in a way that is easy to read for people who are used to reading Urdu. Most native Punjabi speakers in Pakistan never write in Punjabi using any script even if that is the language they speak 90% of the time as Urdu is the language required in school / professional settings that involve writing, so it is more a matter of making words people can already pronounce look recognizable to them. If you look at اوہ, you could take apart the letters and say that it is supposed to be pronounced "avo" or "awh" or a number of other things, but that is just what Punjabi speakers in Pakistan would recognize as ਉਹ. They are just looking at what the whole word looks like and the context within the sentence to know how to pronounce it, there's not enough information in the script for Shahmukhi to be readable to someone who doesn't already know Punjabi fluently. I do intend to eventually get to a reverse Shahmukhi to Gurmukhi conversion function working, but I have that on hold for now as fine-tuning the Gurmukhi to Shahmukhi process can be done on a much shorter time scale. 
The reverse conversion is much more challenging and I don't think there's a tool that can quite do it properly - the biggest issue is that single letters are used to represent several sounds. و is used as both a consonant and a vowel, and to figure out if it should become ਵ or ਔ or ਓ or ਉ or ਊ in Gurmukhi you would have to consider the word in the context of the sentence and/or have a probability table of letter combinations to compare to and determine what the most likely character replacement is. I think this is doable but it is a level of complexity that will take some time to work up to. When I do get to it, the code and tests will have to be from scratch because of the different considerations - I really appreciate you offering to incorporate something like it into the project and can let you know once I have a plan for how to implement it. I do slightly prefer the MIT license where possible just because it's more flexible in allowing other projects to use it without conflicting with whatever other existing terms they might have.
The initial reason I got to this project and got interested in this problem is that I am learning Punjabi as a second language in order to communicate with my family better. My parents and most of my relatives are native Punjabi speakers of Pakistani heritage. I found it helpful to learn Gurmukhi to start learning the language from books, especially since it includes distinctions about the phonetics of Punjabi, but at the moment I can't share most of what I've been reading with any of the people I'm trying to speak Punjabi with since they've never been exposed to it. Then, looking more into it, it seems like this is a problem that is solvable with an open source tool, but hasn't been taken quite all the way yet - I learned that the Serbian Wikipedia has a transliteration tool to switch between the two scripts Serbians use, Cyrillic and Latin script.
That way people contributing only have to write it once in one script, and anybody can read it regardless of which Serbian script they write with. Punjabi doesn't have this yet and any website or software that has both Gurmukhi and Shahmukhi language options has split sets of translation strings, meaning twice the work is required to provide content in one language. My more ambitious goal is to have it be as easy to switch between scripts in an application as it is now for Serbian and likely some other languages which use multiple scripts. Good point on ZWNJ, when I use it I will note it that way. I am about to fix the testing / linting issues. |
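For reference, the "toUrdu" wrapper idea from this comment could look roughly like the sketch below. This is only an illustration under assumptions: `makeToUrdu` and the stand-in converter are hypothetical names and not code from this PR; the only facts relied on are the code points for arlam (U+08C7) and arnun (U+0768) and their plain counterparts ل (U+0644) and ن (U+0646).

```javascript
// Hypothetical sketch, not the PR's implementation.
const ARLAM = "\u08C7"; // lam with small tah above (added in Unicode 13)
const ARNUN = "\u0768"; // noon with small tah
const LAM = "\u0644"; // plain lam
const NOON = "\u0646"; // plain noon

// Wraps any Gurmukhi -> Shahmukhi converter and downgrades the newer
// letters to the "original" Urdu alphabet for fonts that lack them.
function makeToUrdu(toShahmukhi) {
  return function toUrdu(text) {
    return toShahmukhi(text)
      .replace(/\u08C7/g, LAM)
      .replace(/\u0768/g, NOON);
  };
}

// Stand-in converter so the sketch runs on its own (assumption: the real
// toShahmukhi maps U+0A33 to arlam and U+0A23 to arnun).
const standInToShahmukhi = (text) =>
  text.replace(/\u0A33/g, ARLAM).replace(/\u0A23/g, ARNUN);

const toUrdu = makeToUrdu(standInToShahmukhi);
console.log(toUrdu("\u0A33\u0A23") === LAM + NOON); // true
```

Keeping the downgrade as a post-processing step means the main function can always emit the more precise characters, and callers opt out only when font support is a concern.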
|
…indi characters; addition of khw rule
…h for word ending matches since regex \b is apparently inconsistent with non-latin characters
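A note on the `\b` problem mentioned in this commit message: in JavaScript, `\b` is defined in terms of `\w`, which only covers ASCII letters, digits and underscore, so it never sees a boundary next to Gurmukhi or Arabic-script characters. One possible workaround, sketched here with a hypothetical helper `wordFinal` (not the function used in this PR), is a negative lookahead on Unicode letters, marks and digits. It assumes `source` contains no regex metacharacters.

```javascript
// Hypothetical helper: matches `source` only when it is not followed by
// another letter, combining mark, or digit (i.e. at the end of a word).
function wordFinal(source) {
  return new RegExp(`${source}(?![\\p{L}\\p{M}\\p{N}])`, "gu");
}

const ki = "\u0A15\u0A3F"; // the word ਕਿ
const sample = `${ki} ${ki}\u0A24\u0A3E\u0A2C`; // ਕਿ followed by ਕਿਤਾਬ

// \b finds no boundary here because none of these characters count as \w.
console.log(new RegExp(`${ki}\\b`).test(sample)); // false

// The lookahead version only touches the standalone word.
console.log(sample.replace(wordFinal(ki), "[END]"));
```

The same lookahead works for Shahmukhi output, since `\p{L}` and `\p{M}` cover Arabic-script letters and diacritics as well.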
@bgo-eiu apologies I've not been able to reply sooner! Really grateful for you sharing this knowledge and your contribution! Let me know if you need assistance fixing the test cases or linting |
Wow this is great work, this contribution is greatly welcomed! Not being a Shahmukhi reader myself, I was not able to work on the issue at all. Quick question: I have been thinking for a while about removing the replacement that happens here: https://github.com/bgo-eiu/gurmukhi-utils/blob/patch-1/lib/toShahmukhi.js#L178. For example, there are words like: ਮੁਖਹੁ or ਸਿਮਰਿ. These are not found in regular Punjabi, but in older texts that employ the Gurmukhi script. Currently, the endings are stripped so they become, for example: ਮੁਖਹ or ਸਿਮਰ. If the endings are not stripped, would that interfere with the transliteration in any way? |
@sarabveer I was actually going to ask about removing strip endings - it would be helpful to know how words like ਮੁਖਹੁ are meant to be pronounced. It is conventional to omit vowel diacritics in regular writing in Arabic-based scripts, and if anything Urdu and Shahmukhi writers do this more than Arabic writers. This can become confusing, because in Arabic, it is not typical for a word to end in a short vowel, but in Punjabi, short vowel endings are common and can be important for distinguishing words. For a transliteration function, it makes sense to preserve those vowels because you can always strip them after if you prefer, but you cannot add them back to an ambiguous word without context. So I would lean towards not stripping the endings preliminarily, but if these particular older words are meant to be pronounced differently than written, I could incorporate some rules for those since it will be necessary to account for some exceptions anyway. Some interesting examples from Punjabi University's dictionary:
In the audio samples, you can hear that the vowel at the end of ਆਦਿ is not really being said, and so there would be no reason to indicate it in Shahmukhi. Words like the first one where the vowel is pronounced are more common though and do benefit from this kind of clarification. ਕਿ is a very common word which has a weird Shahmukhi spelling کہ where the ending vowel is important enough that it is represented by a whole consonant letter choti he ہ so it cannot be omitted. (I had not thought of it before, but this is kind of like the Punjabi equivalent of ta marbouta ة in Arabic where a consonant is used for certain vowel endings.) There is a logic to it and I think it may be as simple as applying to any word ending in ਕ followed by a short vowel but I want to investigate a little bit more to be sure. |
Sorry to have gone quiet for a bit, I realized I had to go back to the drawing board to address some issues in testing it, but it's closer to where I intended now. I've been feeding in various Gurmukhi strings and noticed some details I had missed before. I intend to update the unit tests with some strings extracted from different sources, like the poems on the Punjabi Kavita site in both Gurmukhi and Shahmukhi, as there are some quirks which may not be obvious when writing the strings manually.
|
…mmented out replace endings to see how that looks
This pull request introduces 1 alert when merging 7829ab4 into 24a2e06 - view on LGTM.com new alerts:
|
This pull request introduces 1 alert when merging 7c999f7 into 24a2e06 - view on LGTM.com new alerts:
|
Goal of transliteration: To be able to convert Gurmukhi script to a different script and then back to Gurmukhi as closely as possible. It does not have to make 100% sense to the reader, but where it can make sense it should.
Goal of transcription: To help speakers of a certain language pronounce Gurmukhi script.
There are questions regarding whether some of the short vowels at the end of words are to be pronounced or not. However, by common convention there are many places where they are not pronounced. It's actually funny because Gurmukhi script is employed for multiple languages, so using the rules of Punjabi for all of those languages is a potential pitfall (though very commonly accepted as non-problematic).
I would recommend portioning out the unit tests and commenting them with the sources. For example if you're getting some unit test examples from the Punjabi Kavita site, group them together and prepend that block with a comment saying so. This is what I've done here for example:
If you're using VS Code, you're not limited to using fixed-width fonts in your editor. I personally use the following fonts: This means if the character is not found in SF Pro, it will try to render it with Sant Lipi. If it's not found in Sant Lipi, then it tries to render it with Noto Sans Gurmukhi. And so on and so forth. You can set your main monospace font at the beginning, and for any characters it does not contain (I honestly have never heard of a monospace font with nastaliq characters 😆 ), you can set up a fallback font which can be proportional width (not required to be monospace). Just sharing this in case you didn't know, it may help you too.
That is very interesting, I'll have to keep that in mind, thank you.
Avoid writing rules for ਜ + ਼ . The combined character exists as a proper unicode point. The input text should be sanitized with a different function to normalize gurmukhi if needed. In short, assume normalized/proper gurmukhi input for your function.
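To illustrate what such a separate sanitizing step might look like, here is a minimal sketch. `composeNukta` is a hypothetical name, not an existing function in this repo. It folds a base letter followed by the nukta (U+0A3C) into the single code point. As far as I know, the built-in `String.prototype.normalize("NFC")` cannot be used for this, because these Gurmukhi letters are composition exclusions in Unicode and NFC leaves them decomposed.

```javascript
// Hypothetical sanitizer, not part of this PR: base letter + nukta (U+0A3C)
// is folded into the single precomposed code point.
const PRECOMPOSED = {
  "\u0A16": "\u0A59", // KHA + nukta -> KHHA
  "\u0A17": "\u0A5A", // GA  + nukta -> GHHA
  "\u0A1C": "\u0A5B", // JA  + nukta -> ZA
  "\u0A2B": "\u0A5E", // PHA + nukta -> FA
  "\u0A32": "\u0A33", // LA  + nukta -> LLA
  "\u0A38": "\u0A36", // SA  + nukta -> SHA
};

function composeNukta(text) {
  return text.replace(
    /([\u0A16\u0A17\u0A1C\u0A2B\u0A32\u0A38])\u0A3C/g,
    (match, base) => PRECOMPOSED[base]
  );
}

// Both spellings of ਜ਼ end up as the single code point U+0A5B.
console.log(composeNukta("\u0A1C\u0A3C") === "\u0A5B"); // true
console.log(composeNukta("\u0A5B") === "\u0A5B"); // true
```

With input cleaned this way, the transliteration rules only need to handle one form of each letter.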
No need to swap 1-for-1, but rather to be able to go back and forth with whatever rules you come up with. This seems to be a non-issue to me. But a good note to know! |
This pull request introduces 1 alert when merging dc6e851 into 24a2e06 - view on LGTM.com new alerts:
|
These ending vowels indicate grammar in Sri Guru Granth Sahib. At the time, the information I had available made me implement the omission. But after doing research, these vowels are supposed to be part of pronunciation (even if the masses do not pronounce them). So ਆਦਿ is pronounced ਆਦਿ, not ਆਦ. Good to know we agree. |
Also, I am not sure if In pronunciation, these are pronounced as follows:
Not sure how that would work in Shahmukhi. |
These are good examples which I was not aware of. Do you know any text samples which use these words, or describe their meaning? That might help to track down a Shahmukhi source that has used them. My initial thoughts are: ਰਖੵਾ = رکھیا ਤ੍ਵ = توَ |
ਰਖਿਆ means protection, defense. It can also be spelled out as ਰੱਖਿਆ. According to the Punjabi University transliterator, they do as follows: ਰੱਖਿਆ => رکھیا (However, this includes the adhak.) Another example is ਆਗੵਿ => ਆਗਿਆ آگیا What could be done for this case with ੵ (U+0A75) is that the word can be transformed into the pronunciation form (ਰਖਿਆ, ਆਗਿਆ, etc.) and then put into the Shahmukhi transliteration. I think it's better that way as there are edge cases I know of with this character where it may be difficult to implement directly, and it will make it easier for you.
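As a rough sketch of that pre-processing idea: the helper below (hypothetical name `expandYakash`) rewrites ੵ followed by ਾ into ਿ + ਆ, which turns ਰਖੵਾ into ਰਖਿਆ. It is an assumption that this single rule is enough for the common case; it deliberately ignores the edge cases mentioned above, and any other sequence containing ੵ is left untouched.

```javascript
// Hypothetical pre-processing step, not the PR's code.
// yakash (U+0A75) + kanna (U+0A3E)  ->  sihari (U+0A3F) + letter AA (U+0A06)
function expandYakash(text) {
  return text.replace(/\u0A75\u0A3E/g, "\u0A3F\u0A06");
}

const rakhya = "\u0A30\u0A16\u0A75\u0A3E"; // ਰਖੵਾ
const rakhia = "\u0A30\u0A16\u0A3F\u0A06"; // ਰਖਿਆ
console.log(expandYakash(rakhya) === rakhia); // true
```

The expanded form can then go through the general rules without a dedicated yakash rule in the Shahmukhi function.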
Here is an example: ਤ੍ਵ ਪ੍ਰਸਾਦਿ - means by your(ਤ੍ਵ) grace(ਪ੍ਰਸਾਦਿ). According to the Punjabi University transliterator, they do as follows: ਤ੍ਵ => تو The phrase "ਤ੍ਵ ਪ੍ਰਸਾਦ" might also be in this photo (3rd line from top), seems like they are using تو. Another example is ਬਿਸ਼੍ਵਾਸ => ਬਿਸ਼ੁਆਸ بشواس |
Adhak would be indicated on رکھیا as رکھّیا. Most writers in practice would omit any indication of it unless absolutely necessary. Punjabi University's dictionary and transliteration tool are oddly inconsistent about this - sometimes they include the ّ character, sometimes they do not, and sometimes they put it in the wrong place (کّھ is incorrect and should be کھّ but they have this swapped often.) It makes sense for a transliteration tool to always include it in the output though, since this information makes it easier to convert back and we can just remove this from the output if we want. I do agree converting to the Gurmukhi pronunciation spelling would make things easier as then the yakash words would be covered by more general rules. I am interested in making sure the edge cases work, but it will be easier to spend time on those once the function covers the 95% or so of word forms which can be derived from the general rules. I see تو پرساد there, in both of the ੍ਵ examples I am leaning towards simply transcribing as و because adding anything to it like وَ may be confused for indicating a consonant sound. In context a reader would be able to tell تو is not ਤੂ because of the presence of پرساد. |
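The two shadda points above (placement after the aspirate, and removing it afterwards if desired) could be handled with small post-processing helpers along these lines. The function names are hypothetical and this is only a sketch of the idea, assuming the aspirate is always written with do-chashmi he (U+06BE).

```javascript
// Hypothetical helpers, not taken from the PR.
// shadda = U+0651, do-chashmi he = U+06BE.

// Moves a shadda that was placed before the aspirate marker to after it.
function fixShaddaOnAspirates(text) {
  return text.replace(/\u0651\u06BE/g, "\u06BE\u0651");
}

// Drops every shadda, for output that follows everyday writing habits.
function stripShadda(text) {
  return text.replace(/\u0651/g, "");
}

const misplaced = "\u06A9\u0651\u06BE"; // kaf, shadda, do-chashmi he
const corrected = "\u06A9\u06BE\u0651"; // kaf, do-chashmi he, shadda
console.log(fixShaddaOnAspirates(misplaced) === corrected); // true
console.log(stripShadda(corrected) === "\u06A9\u06BE"); // true
```

Since stripping is lossy and adding the shadda back is not possible without context, it makes sense for stripping to stay an optional last step.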
Good to know.
Yea I can look into implementing this.
Alright, so و it is. |
Summary
This PR addresses the three concerns listed in #108:
I have also added additional rules for the correct word-position use of nun gunna and alif maddah, and support for the arlam and arnun characters corresponding to their respective Gurmukhi letters ਲ਼ and ਣ.
Test
Duration