Implement a simple search feature #8

GoogleCodeExporter · 2015-04-26T12:43:33Z

There is existing code for a simple search feature in the search.java file. 
However, it is not yet connected to the Search menu item, and I'm not yet sure 
if the code works.

Let's discuss the status of the code, and any technical difficulties such as 
the Saint ID problems that Aleks mentioned by email.

Original issue reported on code.google.com by ps008v...@gmail.com on 6 Feb 2015 at 5:05


lemtom · 2020-12-24T12:53:45Z

I'm currently working on this.

I've changed the search results from text to tabular data. Clicking on a row opens the corresponding commemoration in a new window.

Are there any specific features that should be added?

mamyt · 2020-12-28T15:55:02Z

I think this is a great feature that is really needed. Some comments regarding Unicode.

Firstly, we must make sure that, irrespective of how the user enters the text, it is decomposed so that searching works properly. The problem is that diacritical marks (mostly for the Latin and Greek alphabets) can be entered in one of two ways: either as a precomposed character *ä* or as a decomposed sequence *ä* (that is, as *a + ◌̈*). Although both look identical, the underlying representation is different. According to the Unicode standard, the two forms are canonically equivalent and should be treated identically. However, we need to check that this has actually been implemented; I am afraid that Java may not handle it correctly.

Secondly, regarding Church Slavonic searching, I think it would be mandatory to have two options: strict and relaxed. In strict mode, the search engine looks for the exact spelling of the word. In relaxed mode, it searches using a normalised form of the word (for example, diacritical marks are stripped and {и, і}, {е, є}, {о, ѻ, Ѡ}, {ꙗ, ѧ} (as examples) are treated within each set as equivalent). Superscript letters would also need to be handled somehow. Finally, abbreviations could be expanded (I have a list of all (modern) Church Slavonic abbreviations, which would cover us for all cases). The same could also apply to Greek with respect to stripping the diacritical marks.

This is especially important since not everyone will be familiar with exactly how to spell a word in Church Slavonic, and the spelling of a word can change during word formation, *e.g.* ѻ҆те́цъ (nominative singular), ѻ҆тє́цъ (genitive plural), and then пра́ѻтецъ, all of which should be found if we search for “ѻтецъ”. Normalising the forms would give *отецъ*, *отецъ*, and *праотецъ*, which can then be found easily.
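To make the relaxed mode concrete, here is a minimal sketch of how such folding might look in Java. The class and method names (`SlavonicFolding`, `relaxed`, `relaxedContains`) are hypothetical and not part of the existing code, and the mapping contains only the example sets listed above, not the full list. Superscript letters and abbreviations are not handled; they are combining marks too, so this sketch simply drops them.

```java
import java.text.Normalizer;
import java.util.Map;

/** Hypothetical helper for a "relaxed" Church Slavonic search; an illustration only. */
final class SlavonicFolding {

    // Only the example sets from the comment; each variant maps to one representative.
    private static final Map<Character, Character> FOLD = Map.of(
            '\u0456', '\u0438',  // CYRILLIC I (Byelorussian-Ukrainian) -> CYRILLIC I
            '\u0454', '\u0435',  // UKRAINIAN IE -> IE
            '\u047B', '\u043E',  // ROUND OMEGA -> O
            '\u0460', '\u043E',  // CAPITAL OMEGA -> O (as in the example set)
            '\u0461', '\u043E',  // SMALL OMEGA -> O
            '\u0467', '\uA657'); // LITTLE YUS -> IOTIFIED A

    /** Decomposes the text, drops every combining mark, then folds the variants. */
    static String relaxed(String text) {
        // Caveat: NFD also splits SHORT I (U+0439) into I + breve, so it folds to I here.
        String stripped = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        StringBuilder out = new StringBuilder(stripped.length());
        for (char c : stripped.toCharArray()) {
            char folded = FOLD.getOrDefault(c, c);
            out.append(folded);
        }
        return out.toString();
    }

    /** True if the folded needle occurs anywhere in the folded haystack. */
    static boolean relaxedContains(String haystack, String needle) {
        return relaxed(haystack).contains(relaxed(needle));
    }

    public static void main(String[] args) {
        String term = "\u047B\u0442\u0435\u0446\u044A";                          // search term from the comment
        String name = "\u043F\u0440\u0430\u0301\u047B\u0442\u0435\u0446\u044A"; // compound form with an accent
        System.out.println(relaxedContains(name, term)); // prints "true"
    }
}
```

A strict search would skip `relaxed` entirely and compare the raw strings, ideally after bringing both to the same normalization form.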


lemtom · 2020-12-28T20:42:40Z

I tested with French, and it seems to handle both versions of é fine.

I've currently implemented a checkbox that strips the accents from both the search term and the saint name. So far I've been testing in French, since that's a language I actually know. With the checkbox unchecked, the search term "Melece" doesn't return "St. Mélèce"; with it checked, it does. I've also added a similar checkbox to ignore capitalization.

The class I'm using (java.text.Normalizer) can probably normalize the Church Slavonic to some degree, but I'll probably have to find a way to handle the abbreviations (hardcoding per your list, I guess) and the spelling differences related to word formation.

I'm fairly sure the normalization I've implemented so far can handle diacritical marks in Greek, though I'll have to find some examples to be certain.
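For readers following along, the checkbox behaviour described above could be sketched roughly as follows. This is not the code from the branch: `NameMatcher`, `stripMarks`, and `matches` are hypothetical names, and the only library call assumed is `java.text.Normalizer.normalize`. Both inputs are first brought to the same normalization form, so precomposed and decomposed input compare equal even when no checkbox is ticked.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical sketch of accent-insensitive and case-insensitive name matching. */
final class NameMatcher {

    /** Decomposes the text (NFD) and removes every combining mark. */
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    /** True if the (optionally simplified) term occurs in the (optionally simplified) name. */
    static boolean matches(String term, String name,
                           boolean ignoreDiacritics, boolean ignoreCapitalization) {
        // Same normalization form on both sides, regardless of how the text was typed.
        String t = Normalizer.normalize(term, Normalizer.Form.NFC);
        String n = Normalizer.normalize(name, Normalizer.Form.NFC);
        if (ignoreDiacritics) {
            t = stripMarks(t);
            n = stripMarks(n);
        }
        if (ignoreCapitalization) {
            t = t.toLowerCase(Locale.ROOT);
            n = n.toLowerCase(Locale.ROOT);
        }
        return n.contains(t);
    }

    public static void main(String[] args) {
        String name = "St. M\u00e9l\u00e8ce"; // "St. Melece" with acute and grave accents
        System.out.println(matches("Melece", name, false, false)); // prints "false"
        System.out.println(matches("Melece", name, true, false));  // prints "true"
    }
}
```

Because the marks are removed after decomposition, the same method should also cover Greek accents and breathings, though that is worth confirming with real examples.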

mamyt · 2020-12-29T06:48:42Z

For polytonic Greek, I can suggest the form ἅγιος (masculine form of *holy*). With diacritical marks stripped, it should also match the monotonic Greek form άγιος (and vice versa). If you need any help with the Church Slavonic, let me know and I can send you the required files.

There is also the question of Chinese normalisation regarding the two forms of Chinese: simplified and traditional. Can Java handle this or not? If it can, then we should enable it; otherwise there is little point in implementing it. An example to try: traditional: 格奧爾吉; simplified: 格奥尔吉 (both forms correspond to George in Chinese). Only the middle two characters are different.

Another question: do you only search the name of the commemoration, or do you search any text in the corresponding html file?


lemtom · 2020-12-29T10:19:46Z

To easily test the cases you give me, I think I'm gonna extract some of the methods I've written to a utility class and write tests for them. I'll probably try to write tests for some of the existing classes as well later on.

I could use help with the Church Slavonic as well, since I can't even read Cyrillic. (At first I interpreted the і in your equivalent sets as the Latin i and was looking into romanization; I know better now.) Do you know a good source for all the equivalent sets?

I'll implement normalization under the "strip diacritical marks" checkbox in languages that require it, and then the translation strings can be different to indicate it.

Currently I'm only searching the name, but I can easily add a checkbox to search the result of getLife() as well.

mamyt · 2020-12-30T08:46:29Z

I can send you the information about equivalent sets and also all the abbreviations in Church Slavonic. Would you mind if I e-mailed the files directly to you? I do not wish them to be made public just yet. Would the e-mail address from your website work? I think searching on the life as an option could be useful, especially if we are trying to weed out any errors that may be found in the texts.


lemtom · 2020-12-30T09:59:14Z

The e-mail on my website should work. My spam filter seems a bit overzealous (it caught e-mails from someone from a different project), so it might be prudent to reply here once you've mailed me, so I know when to check.

Searching through the life is now implemented.

I currently have these test cases, based on your comments and my own test in French:

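	// Needs JUnit 5: org.junit.jupiter.api.Test, org.junit.jupiter.api.Disabled,
	// and static imports of Assertions.assertTrue / Assertions.assertFalse.
	// Arguments: search term, name, language code, ignoreDiacritics, ignoreCapitalization.

	// Disabled: traditional/simplified Chinese conversion is not supported.
	@Disabled
	@Test
	void chineseCases(){
		assertTrue(searchName("格奥尔吉", "格奧爾吉", "lang", true, false));
	}

	@Test
	void greekCases(){
		assertTrue(searchName("άγιος", "ἅγιος", "gr", true, false));
		assertFalse(searchName("άγιος", "ἅγιος", "gr", false, false));
	}
	
	// Note: the term is spelled with є and the name with е, so the first assertion
	// relies on the {е, є} equivalence in addition to stripping the marks.
	@Test
	void slavonicCases(){
		assertTrue(searchName("ѻ҆тє́цъ", "пра́ѻтецъ", "cu", true, false));
		assertFalse(searchName("ѻ҆тє́цъ", "пра́ѻтецъ", "cu", false, false));
	}
	
	// @Test was missing here, so JUnit would never have run this method.
	@Test
	void frenchCases(){
		assertTrue(searchName("melece", "Mélèce", "fr", true, true));
		assertFalse(searchName("melece", "Mélèce", "fr", true, false));
		assertFalse(searchName("melece", "Mélèce", "fr", false, true));
	}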

I've had to expand the scope of the characters I'm stripping to catch the "COMBINING CYRILLIC PSILI PNEUMATA", but it's caught now.
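As an illustration of why the scope matters (the thread does not say which pattern was used before or after, so both patterns below are assumptions): U+0486 COMBINING CYRILLIC PSILI PNEUMATA lives in the Cyrillic block, not in the Combining Diacritical Marks block (U+0300 to U+036F). A pattern restricted to that block leaves it in place, whereas matching the general category Mark removes it. The class name `MarkStripping` is hypothetical.

```java
import java.text.Normalizer;

/** Compares a block-based and a category-based way of stripping combining marks. */
final class MarkStripping {

    /** Removes only marks from the Combining Diacritical Marks block (U+0300..U+036F). */
    static String stripNarrow(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    /** Removes every code point in general category Mark, whatever block it is in. */
    static String stripAllMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Genitive plural form from the thread: round omega + psili pneumata (U+0486),
        // te, ukrainian ie + acute accent (U+0301), tse, hard sign.
        String word = "\u047B\u0486\u0442\u0454\u0301\u0446\u044A";
        System.out.println(stripNarrow(word).length());   // prints "6": U+0486 survives
        System.out.println(stripAllMarks(word).length()); // prints "5": both marks removed
    }
}
```

The broader pattern also removes the combining superscript letters used in Church Slavonic, which is one reason those will need separate handling.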

As expected, there's no easy way to switch from traditional to simplified Chinese and vice versa. There's a library that might handle this, but that seems a bit excessive for such a minor feature (and its documentation is in Chinese).

GoogleCodeExporter added Priority-Medium auto-migrated Type-Enhancement labels Apr 26, 2015
