php-dom-parser-translation-tool

This project demonstrates translation of live web pages. It parses the HTML DOM, extracts the text elements, and matches them against an English-to-Odia dictionary stored in a local database. The parsed result is then previewed as a translated version of the web page.

Team

  1. Anshuman Pattnaik
  2. Deeptiman Pattnaik

ParserTool

The tool extracts text from a local directory that contains collections of XML and HTML files and writes the output to a text file, line by line, for each folder. The resulting text can be used as training data for a statistical machine translation engine (Moses) to improve translation accuracy.

Installation

The project is built with PHP, HTML, and MySQL, so WampServer 2.0i is required to set up the web server environment on the local machine.

Download WampServer 2.0i: https://sourceforge.net/projects/wampserver/files/latest/download

Import the databases

The two tools use separate databases. Go to "http://localhost/phpmyadmin" and create the "odia_translate" and "dom_parser" databases in the phpMyAdmin panel. Then import the corresponding SQL files from this repo into phpMyAdmin.

Add the folders

Now place the "_OdiaTranslation" and "ParserTool" folders inside the www folder of the WampServer installation.

Run the application

Odia Translation

  • Type "http://localhost/Odia_Translation/" in the browser and press Enter. The application home page should load.
  • Enter a URL and press the "Translate to ODIA" button to translate that page into Odia.
  • This example uses "https://www.news18.com" to translate its home page, but any URL can be used.
  • The first result view shows the DOM extraction table with each Tag Name and its translated Tag Value.
  • Click the "Click here to see the Translated Webpage" button to view the translated web page itself.
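The flow above (extract text from the DOM, translate it, then show both a table and the translated page) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function name `translateHtmlSketch`, the two-word dictionary, and the whole-string lookup are all assumptions, and the sketch assumes PHP 7.4 or newer with the DOM extension enabled.

```php
<?php
// Illustrative sketch only; all names here are hypothetical.

/**
 * Walk every non-empty text node of an HTML string, translate it with the
 * supplied callback, and return the extraction table plus the rewritten HTML.
 */
function translateHtmlSketch(string $html, callable $translate): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    // The XML prefix is a common workaround to make loadHTML() read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    // Select text nodes, skipping the contents of script and style elements.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text === '') {
            continue;
        }
        $translated = $translate($text);
        $rows[] = [
            'tag'        => $node->parentNode->nodeName,
            'value'      => $text,
            'translated' => $translated,
        ];
        // Rewrite the node in place so the saved document is the translated page.
        $node->nodeValue = $translated;
    }
    return ['rows' => $rows, 'html' => $doc->saveHTML()];
}

// Usage with a tiny in-memory dictionary (sample entries only).
$dictionary = ['hello' => 'ନମସ୍କାର', 'water' => 'ପାଣି'];
$lookup = function (string $text) use ($dictionary): string {
    return $dictionary[strtolower($text)] ?? $text;
};
$result = translateHtmlSketch(
    '<html><body><h1>Hello</h1><p>Water</p><script>var x = 1;</script></body></html>',
    $lookup
);
foreach ($result['rows'] as $row) {
    echo $row['tag'], "\t", $row['value'], "\t", $row['translated'], PHP_EOL;
}
```

Passing the translator as a callback keeps the DOM walk independent of where the dictionary lives, so a database-backed lookup could be swapped in without touching the traversal.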

ParserTool

  • Users can click on the "HTML Parser Tool" button from the application home page to use the ParserTool.
  • Enter the local directory path that has a large collection of HTML and XML files and press the "Extract the Directory" button.
  • The extracted text is written to several "Result-[num].txt" files inside the Result folder in ParserTool.
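A directory extraction of this kind might look like the sketch below. It is only an approximation under stated assumptions: one result file per source folder, `strip_tags()` as a crude stand-in for a real DOM walk (it keeps script and style contents and does not decode entities), and hypothetical function names. It assumes PHP 7.4 or newer.

```php
<?php
// Illustrative sketch only; names and the output layout are assumptions.

/** Return the text of one HTML or XML file, one fragment per line. */
function extractFileTextSketch(string $path): array
{
    $content = file_get_contents($path);
    if ($content === false) {
        return [];
    }
    // Put every tag on its own line so adjacent elements do not run together,
    // then drop the tags themselves.
    $text  = strip_tags(str_replace('<', "\n<", $content));
    $lines = array_map('trim', explode("\n", $text));
    return array_values(array_filter($lines, fn(string $l): bool => $l !== ''));
}

/** Write one Result-[num].txt per folder under $sourceDir; return the file count. */
function extractDirectorySketch(string $sourceDir, string $resultDir): int
{
    $byFolder = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($sourceDir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        $ext = strtolower($file->getExtension());
        if ($file->isFile() && in_array($ext, ['html', 'htm', 'xml'], true)) {
            $byFolder[$file->getPath()][] = $file->getPathname();
        }
    }
    ksort($byFolder); // deterministic folder order

    if (!is_dir($resultDir)) {
        mkdir($resultDir, 0777, true);
    }
    $num = 0;
    foreach ($byFolder as $paths) {
        sort($paths);
        $lines = [];
        foreach ($paths as $path) {
            $lines = array_merge($lines, extractFileTextSketch($path));
        }
        $num++;
        file_put_contents("$resultDir/Result-$num.txt", implode(PHP_EOL, $lines) . PHP_EOL);
    }
    return $num;
}

// Usage: build a tiny sample tree in the system temp directory.
$src = sys_get_temp_dir() . '/parser_sketch_' . uniqid();
mkdir("$src/a", 0777, true);
mkdir("$src/b", 0777, true);
file_put_contents("$src/a/one.html", '<p>Hello world</p><p>Second line</p>');
file_put_contents("$src/b/two.xml", '<?xml version="1.0"?><doc><item>From XML</item></doc>');
$written = extractDirectorySketch($src, "$src/Result");
echo "Wrote $written result file(s)", PHP_EOL;
```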

How does it work?

The ParserTool includes a section that walks through, step by step, how textual elements are extracted from each node of an HTML or XML file.

Browse for any HTML or XML file on the local machine, or use the sample "_Html Basic.html" file from this repo, and upload it into the module.

  • Step 1: The user sees the parser table listing each Tag Name and Tag Value, with the total tag count in a separate view.
  • Step 2: After clicking the "Proceed To Step.2" button, the user sees the number of lines that can be formed from the extracted text.
  • Step 3: In the last step, the complete list of lines is shown, with the total line count in a separate view.
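The three steps can be approximated in a few lines of PHP. The README does not say how lines are formed from the extracted text, so the sentence-splitting rule below (split after ".", "!" or "?") is purely an assumption, as are the function names. The sketch assumes PHP 7.4 or newer with the DOM extension enabled.

```php
<?php
// Illustrative sketch only; the line-forming rule is an assumption.

/** Step 1: list every element that directly contains text as [tag, value]. */
function buildTagTableSketch(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $table = [];
    $elements = (new DOMXPath($doc))->query('//*[text()[normalize-space()]]');
    foreach ($elements as $el) {
        // Collect only the element's own text, not that of nested elements.
        $own = '';
        foreach ($el->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $own .= $child->nodeValue;
            }
        }
        $table[] = [
            'tag'   => $el->nodeName,
            'value' => trim(preg_replace('/\s+/', ' ', $own)),
        ];
    }
    return $table;
}

/** Steps 2 and 3: split the extracted values into sentence-like lines. */
function formLinesSketch(array $table): array
{
    $lines = [];
    foreach ($table as $row) {
        // Split after sentence-ending punctuation that is followed by whitespace.
        $parts = preg_split('/(?<=[.!?])\s+/', $row['value'], -1, PREG_SPLIT_NO_EMPTY);
        foreach ($parts as $part) {
            $lines[] = $part;
        }
    }
    return $lines;
}

$html = '<html><head><title>Demo</title></head><body><h1>Welcome</h1>'
      . '<p>First sentence. Second sentence!</p></body></html>';
$table = buildTagTableSketch($html);  // Step 1: tag table
$lines = formLinesSketch($table);     // Steps 2 and 3: lines
printf("Tags: %d, lines: %d\n", count($table), count($lines));
```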

Important PHP APIs

The application uses the following PHP functions to analyze the DOM input and apply the translation logic to the web page.

  1. parse_url - https://www.php.net/manual/en/function.parse-url.php
  2. file_get_contents - https://www.php.net/manual/en/function.file-get-contents.php
  3. htmlentities - https://www.php.net/manual/en/function.htmlentities.php
  4. explode - https://www.php.net/manual/en/function.explode.php
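As a quick orientation, the snippet below shows each of the four functions in isolation. It does not reproduce how the application combines them. The URL path, the temporary file, and the sample markup are made-up values, and a local file stands in for a remote page so the snippet runs without network access.

```php
<?php
// Illustrative sketch; the URL path, file, and markup are made-up sample values.

// parse_url: split a URL into its components.
$parts = parse_url('https://www.news18.com/india/index.html?page=2');
$base  = $parts['scheme'] . '://' . $parts['host'];

// file_get_contents: read a whole resource into a string. It also accepts
// http(s) URLs when allow_url_fopen is enabled; a temporary file is used here.
$tmp = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($tmp, '<p>Breaking news & updates</p>');
$html = file_get_contents($tmp);
unlink($tmp);

// htmlentities: escape markup so it can be displayed as literal text,
// for example inside a table cell.
$escaped = htmlentities($html, ENT_QUOTES, 'UTF-8');

// explode: split a string on a delimiter, here into words on single spaces.
$words = explode(' ', strip_tags($html));

echo $base, PHP_EOL, $escaped, PHP_EOL, implode('|', $words), PHP_EOL;
```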

Dictionary

A small dictionary (912 words) is used in this application to replace English words with their Odia equivalents. The Odia word column in the SQL table uses a Unicode collation.
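Word-by-word replacement against such a dictionary could be sketched as below. The three entries are sample data rather than the repository's dictionary, an in-memory array stands in for the SQL table, and the function name is hypothetical. The `u` modifier makes the pattern treat the text as UTF-8, so Odia output and any non-ASCII input are handled as whole characters. The sketch assumes PHP 7.0 or newer.

```php
<?php
// Illustrative sketch; the entries below are sample data only.

/** Replace each English word found in $dictionary with its Odia equivalent. */
function replaceWordsSketch(string $text, array $dictionary): string
{
    return preg_replace_callback(
        '/\p{L}+/u', // one run of letters = one word; punctuation is left untouched
        function (array $m) use ($dictionary): string {
            // Dictionary keys are lowercase English words.
            $key = strtolower($m[0]);
            return $dictionary[$key] ?? $m[0];
        },
        $text
    );
}

$dictionary = ['water' => 'ପାଣି', 'house' => 'ଘର', 'news' => 'ଖବର'];
$translated = replaceWordsSketch('Water near the house.', $dictionary);
echo $translated, PHP_EOL;
```

Words that are missing from the dictionary pass through unchanged, which matches what one would expect from a 912-word vocabulary applied to arbitrary pages.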

Notice

The project is no longer maintained or supported.

References

  1. Odia Language - https://en.wikipedia.org/wiki/Odia_language
  2. Oriya (Unicode block) - https://en.wikipedia.org/wiki/Oriya_(Unicode_block)
  3. MySQL API - https://www.php.net/manual/en/book.mysql.php
  4. PHP API - https://www.php.net/manual/en/mysqlinfo.api.choosing.php
  5. Document Object Model - https://en.wikipedia.org/wiki/Document_Object_Model

License

This project is licensed under the MIT License.