Chinese Pīnyīn instead of Chinese characters #49

Open
alfons opened this issue Dec 26, 2024 · 6 comments

Comments

@alfons

alfons commented Dec 26, 2024

Please provide a setting to display Chinese Pīnyīn (with tone marks) in HanziGraph instead of Chinese characters. This would help in getting a better feeling for the homonyms and thus the true character of the Chinese language. My reasoning for this feature is as follows:

My mother tongue is German and I have been studying Chinese for almost 20 years, in all settings: university courses in China, living in China with native Chinese romantic partners, private courses, tutors, online tutoring, apps, graded readers (of which I read about 20), CDs, YouTube, everything in every configuration imaginable… yet I still can't even follow a simple conversation. However, last year I

  • read several dozen books on Chinese grammar and history,
  • as well as on typography and historical script development in Europe,
  • read about the (real) history of romanisation in Vietnam (pushed by the Vietnamese revolutionaries and Vietnamese leaders, not the European foreigners) and the modern Vietnamese script,
  • studied Vietnamese for 6 months, 2 hours daily, and purchased almost 50 books in Vietnamese,
  • studied the work of John Taylor Gatto and [moderated], which provided me with the necessary emotional power.

Thus I made the monumental and upright revolutionary choice to disregard Chinese characters from now on, to regard them as a specialised subject of study like ancient Chinese history or herbarium science, and to continue studying with Chinese Pīnyīn only. Since then I have made REMARKABLE strides in studying Chinese in a short time, as it is suddenly as easy as studying Spanish or Italian or any other language. It is suddenly a true joy and much fun. The only drawbacks are the lack of reading materials in Chinese Pīnyīn (and the lack of "allies") and the many writing mistakes Chinese tutors make, since most tutors don't yet know the orthography rules of Chinese Pīnyīn as laid out in the Chinese government's Pinyin rules, GB/T 16159-2012 (an update of the 1996 version, which has been available for 28+ years but is still largely unknown). Luckily ChatGPT is quite strong in Chinese Pīnyīn.

All that said to help your motivation :) Thank you for considering this feature.

@mreichhoff
Owner

To check if I'm understanding the idea:

  • the nodes in the graph would be pinyin syllables, e.g., diǎn or guān
  • clicking one would then show the most common characters with that pronunciation, along with definitions, example sentences, etc.
  • edges in the graph would be words formed by connecting the two pronunciations, so as an example, guān is connected to diǎn because of 观点, and other words with pronunciation guāndiǎn or diǎnguān would be shown.
  • the graph's edge behavior would continue as-is, showing definitions and examples for the words.
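A minimal sketch of the structure described in the bullets above, assuming a word list where each word comes with its per-character pinyin. The sample data and the names `WORDS` and `build_pinyin_graph` are made up for illustration and are not taken from the HanziGraph code base.

```python
from collections import defaultdict

# Hypothetical sample data: (word, one pinyin syllable per character).
WORDS = [
    ("观点", ["guān", "diǎn"]),
    ("观看", ["guān", "kàn"]),
    ("看点", ["kàn", "diǎn"]),
]


def build_pinyin_graph(words):
    """Build a syllable graph from (word, syllables) pairs.

    nodes: syllable -> set of characters seen with that pronunciation
    edges: unordered pair of syllables -> list of words connecting them
    """
    nodes = defaultdict(set)
    edges = defaultdict(list)
    for word, syllables in words:
        # Assumes one syllable per character, which holds for this sample.
        for char, syllable in zip(word, syllables):
            nodes[syllable].add(char)
        # Adjacent syllables in a word are connected by an undirected edge,
        # so guāndiǎn and diǎnguān would land on the same edge.
        for first, second in zip(syllables, syllables[1:]):
            edges[frozenset((first, second))].append(word)
    return dict(nodes), dict(edges)


nodes, edges = build_pinyin_graph(WORDS)
print(nodes["guān"])
print(edges[frozenset(("guān", "diǎn"))])
```

Using an unordered pair as the edge key is one possible reading of "guān is connected to diǎn"; a directed edge would keep the two orders apart instead.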

I should note that showing pinyin in the example sentences is already supported (there's an option in the menu accessible from the upper right), and all definitions include pinyin.

I think adding an option for showing pinyin in the graph could be interesting, though, so I can look into that.

@alfons
Author

alfons commented Dec 29, 2024

Glad you might look into it! There are a couple of things that probably need some consideration, since we will suddenly be looking at the vernacular, spoken Chinese language. In Hanzi, the morpheme (one character) is the smallest meaningful unit, but in spoken Chinese it's the 词 (cí), the word, that forms the smallest meaningful unit.

The rules of Chinese Pīnyīn are defined in GB/T 16159-2012. I'll attach a Simplified Chinese text version of this document that I made (including the corrections by Mark Swofford / pinyin.info).
Pinyin Rules GB:T 16159-2012 simplified formatted.txt

Concerning your bullet points, and additional remarks:

Tone marks. According to the rules in GB/T 16159-2012, Chinese Pīnyīn is written with diacritics, not numbers.

Nodes. Yes, a node would display a syllable in Chinese Pīnyīn. How would you plan to deal with ambiguity? For example 行 (xíng / háng). In the Chinese language the majority of words have either 1 or 2 syllables, so there's a good chance that a node is a word.

Edges. Yes, edges would display two-syllable words. I think it's viable to just put the two syllables together, without even changing the diacritics. According to the rules (GB/T 16159-2012), the diacritics always stay true to the syllable; they do not change when syllables are combined. There is one exception, however: sometimes the diacritic on the second syllable is dropped (as in 看看 kànkan), but there is no explicit rule for that. In other cases a missing diacritic on the second syllable forms a different word; for example, 东西 can mean dōngxī (east-west) or dōngxi (thing).

Examples. The examples show pinyin with tone numbers (Pin1yin1), but only for the node names, and there is no option to show the correct spelling with diacritics. The example sentences do not seem to be written in Chinese Pīnyīn. Unfortunately, there is currently no library that implements the rules of GB/T 16159-2012 to produce compliant Pīnyīn from Hanzi. Furthermore, even large language models make many mistakes when producing Chinese Pīnyīn. I guess there just isn't a sufficient volume of Pīnyīn text available (yet) to train models.
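To illustrate the difference between tone numbers and diacritics, here is a small sketch that turns numbered pinyin into tone-marked pinyin. It uses the commonly taught placement rule (the mark goes on a or e if present, on the o of ou, otherwise on the last vowel). It only handles tone marks; it does not attempt the word division or capitalization rules of GB/T 16159-2012. The function names are hypothetical, and treating `v` or `u:` as ü and `5` as the neutral tone are assumptions about the input format.

```python
import re

# Tone-marked forms for tones 1 to 4 of each vowel.
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}


def mark_syllable(syllable):
    """Convert one numbered syllable, e.g. 'yin1', to 'yīn' (lowercased)."""
    s = syllable.lower().replace("u:", "ü").replace("v", "ü")
    if not s or s[-1] not in "12345":
        return s  # no tone number: leave the syllable unmarked
    base, tone = s[:-1], int(s[-1])
    if tone == 5:
        return base  # neutral tone carries no diacritic
    if "a" in base:
        target = "a"
    elif "e" in base:
        target = "e"
    elif "ou" in base:
        target = "o"
    else:
        vowels = [c for c in base if c in TONE_MARKS]
        if not vowels:
            return base
        target = vowels[-1]
    idx = base.rindex(target)
    return base[:idx] + TONE_MARKS[target][tone - 1] + base[idx + 1:]


def to_diacritics(text):
    """Replace every numbered syllable in text; other text is kept as-is."""
    return re.sub(r"[a-zü:]+[1-5]", lambda m: mark_syllable(m.group()), text.lower())


print(to_diacritics("pin1yin1"))
print(to_diacritics("dong1xi1"), to_diacritics("dong1xi5"))
```

The last line shows the 东西 case from above: the same two syllables give a different word depending on whether the second one carries a tone.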

Using Chinese Pīnyīn would indeed be very interesting to see on HanziGraph, to engage in research, and to get a feeling for the vernacular, spoken language, as opposed to the more academic, abstract Simplified Chinese writing system.

@alfons
Author

alfons commented Dec 30, 2024

I made a mockup of how it could look. First, the original HanziGraph:
hanzgraph_simplified_example
Then my mockup in Pīnyīn. Maybe not as "impressive" and "culturally rich" at first glance, but for sure easier to read (and containing more information about how words are currently spelled and pronounced):
hanzgraph_pinyin_example

I don't know how the edges are made, but diànshāng (diànzǐ shāngwù) would be a nice word too: e-commerce. Which touches on the subject of which meaning is chosen for the word, as diàn can mean "shop, store, location", or "electricity", amongst many others (the PLECO built-in dictionary has 16 entries for "diàn").

Furthermore, I'm quite excited about the edges; they really make the short words shine and demonstrate how Standard (Mandarin) Chinese depends on context and on a certain number of syllables/words being spoken before the intended meaning becomes apparent. It also really shows the difference between the vernacular (spoken, living) language and the two very different writing systems. While Pīnyīn aims to accurately record the spoken language for writing and reading, Simplified seems to have completely different goals.

@mreichhoff
Owner

Yeah, that mockup is what I had imagined would be interesting to build. I'll try to get it prototyped in the next few weeks.

@mreichhoff
Owner

I did a bit of work on this today and have an initial graph and wordlist in the linked branch (pinyin-graph). I'm debating the best way to display it; it might be a standalone tool like I did with component breakdowns at first.

@alfons
Author

alfons commented Jan 25, 2025

For me, as a language learner and user, it makes sense for it to be a standalone tool. PinyinGraph might turn out to be quite a different tool from HanziGraph. Getting deeper into it, I guess you might eventually take it in a very different direction, due to its very different nature.

One design question I'm curious to see you solve is this: whether you put Pinyin merely as an extra representation layer on top of Hanzi, or whether you treat the new tool as truly "written spoken Chinese" and group together all syllables (and words) that sound alike (and are spelled alike).

This comment section might not be the right place, but as it concerns this design choice, and because I've spent considerable time thinking about this, I will share with you my own reasoning:

Here's what we know:

  • Spoken Chinese words can be broken down into syllables. The majority of Chinese words are one to three syllables long, with one-syllable words being very frequent.

  • Chinese has fewer syllables than, for example, English. Chinese has approximately 1,600 distinct syllables, whereas English has potentially tens of thousands of syllables.

What are the implications when we start to write Chinese with letters similar to those we use for English?

a) The meaning of words in sentences is defined by their "Part of Speech" (POS), such as noun, verb, adverb, adjective, etc. Within each POS, a word can have various senses.

  • For example, in English the noun "bank" can mean (1) "a financial institution" or (2) "the side of a river."

b) However, in English this doesn't seem to be too overwhelming, especially if the broader context is at least vaguely known.
I suspect this is due to the vast number of possible syllables, which helps us distinguish meaning beforehand.

c) In Chinese character-based (Hànzì) writing, many syllables that sound the same but represent different senses have been grouped into single characters, which makes looking up written Chinese characters rather straightforward, too.

For example:

  • 行 (xíng) has 6 senses as a verb in the Apple dictionary ((1) go, (2) travel, (3) be current, etc.), 3 senses as an adjective, 1 sense as a noun, and 1 sense as an adverb.

  • 形 (xíng) has 2 senses as a noun, and 2 senses as a verb.

  • In total there are 7 entries in the Apple dictionary that are all spoken as "xíng" but are assigned distinct characters: 刑, 行, 饧, 形, 陉, 型, 邢

d) In Chinese Pinyin writing, however, all definitions for words (made of one or more syllables) that sound alike (and are spelled alike) would be found under the same entry, just like in English.

  • For example "xíng" would contain all definitions from all 7 entries, suddenly making it a rather large page to look through, and we really need to know what we are looking for (from POS and context.)
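To make the difference between (c) and (d) concrete, here is a small sketch that takes character-keyed entries and regroups them under their pinyin spelling, split by part of speech. The entries, glosses, and the name `group_by_pinyin` are illustrative assumptions and are not taken from the Apple dictionary or from HanziGraph.

```python
from collections import defaultdict

# Hypothetical entries: (character, pinyin, part of speech, gloss).
ENTRIES = [
    ("行", "xíng", "verb", "go"),
    ("行", "xíng", "verb", "travel"),
    ("行", "háng", "noun", "row; line"),
    ("形", "xíng", "noun", "shape"),
    ("刑", "xíng", "noun", "punishment"),
    ("型", "xíng", "noun", "model; type"),
]


def group_by_pinyin(entries):
    """Group senses as pinyin -> part of speech -> [(character, gloss)]."""
    grouped = defaultdict(lambda: defaultdict(list))
    for hanzi, pinyin, pos, gloss in entries:
        grouped[pinyin][pos].append((hanzi, gloss))
    # Convert to plain dicts so missing keys raise instead of being created.
    return {pinyin: dict(by_pos) for pinyin, by_pos in grouped.items()}


by_pinyin = group_by_pinyin(ENTRIES)
for pos, senses in by_pinyin["xíng"].items():
    print(pos, senses)
```

In this grouping the single "xíng" entry collects senses from four different characters, while the second reading of 行 (háng) ends up under its own entry. Grouping by part of speech is one way to keep such a large entry navigable.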

This means that spoken Chinese relies very heavily on context. Chinese Pinyin, the written counterpart of the spoken, standard, Mandarin Chinese Pǔtōnghuà, relies very heavily on context, too.

From my perspective, as a language learner and user, it will be very, very interesting to see how a dictionary or lookup / categorization tool like PinyinGraph turns out to work and look, if we have the courage to look at spoken Chinese through a Chinese writing system like Chinese Pinyin, one which holds true to the spoken, standard Mandarin Chinese language.
