
Automatically extracting keyphrases that are salient to the document meanings is an essential step to semantic document understanding. An effective keyphrase extraction (KPE) system can benefit a wide range of natural language processing and information retrieval tasks. Recent neural methods formulate the task as a document-to-keyphrase sequence…


microsoft/OpenKP


OpenKP

Automatically extracting keyphrases that are salient to a document's meaning is an essential step in semantic document understanding. To facilitate research in this area we have created OpenKeyPhrase (OpenKP), a large-scale, open-domain keyphrase extraction dataset. The dataset features 148,124 real-world web documents, each with a human annotation indicating its 1-3 most relevant keyphrases. More information about the dataset and our initial experiments can be found in the paper Open Domain Web Keyphrase Extraction Beyond Language Modeling, which will be an oral presentation at EMNLP-IJCNLP 2019. OpenKP is part of the MSMARCO dataset family, and research projects like this power the core document understanding pipeline that Bing uses.

Keyphrase Extraction

Keyphrase extraction is a language problem that can be stated as follows: a document D contains 1 to n keyphrases that can be used to understand what the document is about, find other relevant documents, and improve many downstream NLP tasks. In OpenKP we have formalized this problem to focus on the general web domain. The corpus consists of web pages that were human-annotated with their most relevant keyphrases. It's worth noting that during the expert annotation, judges only copied relevant text from the document, so no language generation is required.
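Because the annotated keyphrases are copied from the document, each one should be locatable as a token span in the text. The following is a minimal sketch of that lookup, not part of the OpenKP tooling; the function name `find_phrase_spans` and the case-insensitive matching are assumptions made for illustration.

```python
from typing import List, Tuple


def find_phrase_spans(doc_tokens: List[str], phrase_tokens: List[str]) -> List[Tuple[int, int]]:
    """Return every (start, end) token span where the phrase occurs in the document.

    Matching is case-insensitive, which is an assumption of this sketch.
    The end index is exclusive.
    """
    n = len(phrase_tokens)
    if n == 0:
        return []
    doc = [t.lower() for t in doc_tokens]
    phrase = [t.lower() for t in phrase_tokens]
    return [(i, i + n) for i in range(len(doc) - n + 1) if doc[i:i + n] == phrase]


# A few tokens taken from the example document shown later in this README.
tokens = "Online Doctor Appointment System Java Project".split(" ")
spans = find_phrase_spans(tokens, ["Doctor", "Appointment"])
print(spans)  # [(1, 3)]
```

Spans found this way could serve as labels for a tagging or span-scoring model, though that design choice is left to the reader.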

Corpus Generation

To generate the corpus we sampled ~100,000 URLs from the Bing index to get a representative sample of true domain diversity. Additionally, we sampled ~40,000 URLs from the MSMARCO QA corpus, since it can be considered a representative sample of open-domain web document search. Once the URLs were selected, each was given to an expert judge who visited the website, explored its content, and then annotated the 1-3 keyphrases they believed to be most salient to the overall document. The judge pool was trained specifically for this task and received regular quality checks and feedback to ensure a consistent understanding of what a document's relevant keyphrases may be. Once the judges had annotated a website, the HTML was downloaded, parsed, and passed through our CleanBody pipeline. The CleanBody pipeline produces a text representation (without any menus, ads, images, etc.) and a visual representation of the document; more information and specifics can be found below.

Examples

{
    'url': 'http://1000projects.org/online-doctor-appointment-system-java-project.html', 
    'text': 'April 30 2018 by nikhith P Online Doctor Appointment System Java Project Project Title Secure Web Application for Online Doctor Appointment System Online Doctor appointment is a smart web application this provides a registration and login for both doctors and patients Doctors can register by giving his necessary details like timings fee category etc After successful registration the doctor can log in by giving username and password The doctor can view the booking request by patients and if he accepts the patient requests the status will be shown as booking confirmed to the patient He can also view the feedback given by the patient The patients must be registered and log in to book a doctor basing the category and the type The Application has following modules Admin Doctor Patient Admin Admin needs to login with username and password and in the admin home screen he can see the basic functionalities of admin Admin can view the registered doctors and patients He can also view the patients request and doctors requests and he will confirm the patients and doctors requests Doctor Doctor need to be registered by giving the necessary details like experience timing fees etc After registering he need to log in and in the home screen he can view the basic functionalities He can view the patient request forwarded from admin and he can accept and he can also view the feedback given by patients Patient The patient needs to be registered and log in after logging on he can search for the doctor by giving the location the reason or problem Basing on the doctor availability the admin will confirm the booking request and will send to mail that the booking is confirmed he can also view in the status and he can also give feedback basing the performance of the doctor Existing System In the existing system the patient needs to visit the doctor for booking we need to wait and the booking will be done manually so to maintain everything is always a problem Proposed System In 
the proposed system the doctors patients are brought to one platform will allow patients to be more flexible they can register and search for the doctors basing on the location the list of doctors will be shown and patient can book by selecting the time slots and the admin will confirm the booking so everything is computerized an done very fast which will save time Software Requirements NetBeans74JDK 17MySQL 55SQL Yog HTML JavaScript and CSS Screens Home Page This screen shows the basic view of the application home page and the list of modules Admin Login Page In this page admin can log in by giving username and password Admin Home Page After successful login the application shows the admin home page in which the basic functionalities are shown View doctors Page In this page admin can view the list of doctors registered View patients Page In this admin can view the list of patients registered Patients request Page In this page admin can view the requests sent by the user for booking a doctor View doctors request Page In this page the request from a doctor is shown and admin will send the confirmation to the user that the booking is confirmed Doctor registration Page In this page the can register into the application by providing all the necessary details like experience fee timings etc Doctor login Page In this page the doctor can login by giving the username and password Doctor home Page After the successful login the doctor home page shows basic functionalities View request Page In this page the doctor can view the patient requests which are forwarded by the admin and he responds to the request View feedback Page In this page the doctor can view the patients feedback Patient registration Page In this page the patient can register into the application by providing all necessary details Patient login Page In this page patient can log in by giving username and password Patient home Page After successful login the application shows the patient homepage with basic 
functionalities Search Results Page In this page patient can search the doctor by giving the category reason location by selecting on the map In this page after giving the details for searching the doctor the search results will be shown like as in above screen In this page patient can view the status of his booking whether the booking is confirmed or not Feedback Page In this page patient can give the feedback for the doctor based on his performance 201718 Java Projects CSE Projects Java Abstracts Java Based Projects MySQL Projects Previous Venue Booking System Java Project Next Campus Recruitment System Java Project', 
    'VDOM': '[{"Id":0,"text":"April 30 2018","feature":[48.0,115.0,97.0,14.0,0.0,0.0,0.0,0.0,11.0,0.0,48.0,619.0,96.0,18.0,1.0,0.0,0.0,0.0,11.0,0.0],"start_idx":0,"end_idx":3},{"Id":0,"text":"by","feature":[162.0,105.0,97.0,14.0,0.0,0.0,0.0,0.0,11.0,0.0,48.0,619.0,96.0,18.0,1.0,0.0,0.0,0.0,11.0,0.0],"start_idx":3,"end_idx":4},{"Id":0,"text":"nikhith P","feature":[190.0,77.0,97.0,14.0,0.0,0.0,0.0,0.0,11.0,0.0,48.0,619.0,96.0,18.0,1.0,0.0,0.0,0.0,11.0,0.0],"start_idx":4,"end_idx":6},{"Id":0,"text":"Online Doctor Appointment System Java Project","feature":[48.0,619.0,114.0,36.0,1.0,0.0,1.0,0.0,26.0,0.0,48.0,619.0,114.0,36.0,1.0,0.0,1.0,0.0,26.0,0.0],"start_idx":6,"end_idx":12},{"Id":0,"text":"Project Title","feature":[48.0,96.0,174.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,172.0,24.0,1.0,0.0,0.0,0.0,16.0,1.0],"start_idx":12,"end_idx":14},{"Id":0,"text":"Secure Web Application for Online Doctor Appointment System","feature":[48.0,619.0,172.0,24.0,1.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,172.0,24.0,1.0,0.0,0.0,0.0,16.0,1.0],"start_idx":14,"end_idx":22},{"Id":0,"text":"Online Doctor appointment is a smart web application this provides a registration and login for both doctors and patients Doctors can register by giving his necessary details like timings fee category etc After successful registration the doctor can log in by giving username and password The doctor can view the booking request by patients and if he accepts the patient requests the status will be shown as booking confirmed to the patient He can also view the feedback given by the patient The patients must be registered and log in to book a doctor basing the category and the type","feature":[48.0,619.0,272.0,336.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,272.0,336.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":22,"end_idx":122},{"Id":0,"text":"The Application has following 
modules","feature":[48.0,352.0,634.0,23.0,0.0,0.0,0.0,0.0,19.0,1.0,48.0,619.0,632.0,28.0,1.0,0.0,0.0,0.0,19.0,1.0],"start_idx":122,"end_idx":127},{"Id":0,"text":"Admin","feature":[48.0,619.0,684.0,24.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,684.0,24.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":127,"end_idx":128},{"Id":0,"text":"Doctor","feature":[48.0,619.0,708.0,24.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,708.0,24.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":128,"end_idx":129},{"Id":0,"text":"Patient","feature":[48.0,619.0,732.0,24.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,732.0,24.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":129,"end_idx":130},{"Id":0,"text":"Admin","feature":[48.0,56.0,782.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,780.0,24.0,1.0,0.0,0.0,0.0,16.0,1.0],"start_idx":130,"end_idx":131},{"Id":0,"text":"Admin needs to login with username and password and in the admin home screen he can see the basic functionalities of admin Admin can view the registered doctors and patients He can also view the patients request and doctors requests and he will confirm the patients and doctors requests","feature":[48.0,619.0,828.0,96.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,828.0,96.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":131,"end_idx":180},{"Id":0,"text":"Doctor","feature":[48.0,59.0,950.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,948.0,24.0,1.0,0.0,0.0,0.0,16.0,1.0],"start_idx":180,"end_idx":181},{"Id":0,"text":"Doctor need to be registered by giving the necessary details like experience timing fees etc After registering he need to log in and in the home screen he can view the basic functionalities He can view the patient request forwarded from admin and he can accept and he can also view the feedback given by 
patients","feature":[48.0,619.0,996.0,96.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,996.0,96.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":181,"end_idx":237},{"Id":0,"text":"Patient","feature":[48.0,62.0,1118.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,1116.0,24.0,1.0,0.0,0.0,0.0,16.0,1.0],"start_idx":237,"end_idx":238},{"Id":0,"text":"The patient needs to be registered and log in after logging on he can search for the doctor by giving the location the reason or problem Basing on the doctor availability the admin will confirm the booking request and will send to mail that the booking is confirmed he can also view in the status and he can also give feedback basing the performance of the doctor","feature":[48.0,619.0,1164.0,120.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,1164.0,120.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":238,"end_idx":305},{"Id":0,"text":"Existing System","feature":[48.0,158.0,1310.0,23.0,0.0,0.0,0.0,0.0,19.0,1.0,48.0,619.0,1308.0,28.0,1.0,0.0,0.0,0.0,19.0,1.0],"start_idx":305,"end_idx":307},{"Id":0,"text":"In the existing system the patient needs to visit the doctor for booking we need to wait and the booking will be done manually so to maintain everything is always a problem","feature":[48.0,619.0,1360.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,1360.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":307,"end_idx":339},{"Id":0,"text":"Proposed System","feature":[48.0,170.0,1458.0,23.0,0.0,0.0,0.0,0.0,19.0,1.0,48.0,619.0,1456.0,28.0,1.0,0.0,0.0,0.0,19.0,1.0],"start_idx":339,"end_idx":341},{"Id":0,"text":"In the proposed system the doctors patients are brought to one platform will allow patients to be more flexible they can register and search for the doctors basing on the location the list of doctors will be shown and patient can book by selecting the time slots and the admin will confirm the booking so everything is computerized an done very fast which will save 
time","feature":[48.0,619.0,1508.0,120.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,1508.0,120.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":341,"end_idx":407},{"Id":0,"text":"Software Requirements","feature":[48.0,227.0,1654.0,23.0,0.0,0.0,0.0,0.0,19.0,1.0,48.0,619.0,1652.0,28.0,1.0,0.0,0.0,0.0,19.0,1.0],"start_idx":407,"end_idx":409},{"Id":0,"text":"NetBeans74JDK 17MySQL 55SQL Yog HTML JavaScript and CSS","feature":[48.0,619.0,1704.0,24.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,1704.0,24.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":409,"end_idx":417},{"Id":0,"text":"Screens","feature":[48.0,80.0,1754.0,23.0,0.0,0.0,0.0,0.0,19.0,1.0,48.0,619.0,1752.0,28.0,1.0,0.0,0.0,0.0,19.0,1.0],"start_idx":417,"end_idx":418},{"Id":0,"text":"Home Page","feature":[48.0,93.0,1806.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,1804.0,24.0,1.0,0.0,0.0,0.0,16.0,1.0],"start_idx":418,"end_idx":420},{"Id":0,"text":"This screen shows the basic view of the application home page and the list of modules","feature":[48.0,619.0,1876.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,1876.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":420,"end_idx":436},{"Id":0,"text":"Admin Login Page","feature":[48.0,146.0,1950.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,1948.0,24.0,1.0,0.0,0.0,0.0,16.0,1.0],"start_idx":436,"end_idx":439},{"Id":0,"text":"In this page admin can log in by giving username and password","feature":[48.0,619.0,2020.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,2020.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":439,"end_idx":451},{"Id":0,"text":"Admin Home Page","feature":[48.0,148.0,2046.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,2020.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":451,"end_idx":454},{"Id":0,"text":"After successful login the application shows the admin home page in which the basic functionalities are shown","feature":[48.0,619.0,2116.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,2116.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":454,"end_idx":471},{"Id":0,"text":"View doctors 
Page","feature":[48.0,150.0,2166.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,2116.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":471,"end_idx":474},{"Id":0,"text":"In this page admin can view the list of doctors registered","feature":[48.0,619.0,2236.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,2236.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":474,"end_idx":485},{"Id":0,"text":"View patients Page","feature":[48.0,153.0,2262.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,2236.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":485,"end_idx":488},{"Id":0,"text":"In this admin can view the list of patients registered","feature":[48.0,619.0,2332.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,2332.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":488,"end_idx":498},{"Id":0,"text":"Patients request Page","feature":[48.0,182.0,2358.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,2332.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":498,"end_idx":501},{"Id":0,"text":"In this page admin can view the requests sent by the user for booking a doctor","feature":[48.0,619.0,2428.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,2428.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":501,"end_idx":517},{"Id":0,"text":"View doctors request Page","feature":[48.0,218.0,2454.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,2428.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":517,"end_idx":521},{"Id":0,"text":"In this page the request from a doctor is shown and admin will send the confirmation to the user that the booking is confirmed","feature":[48.0,619.0,2524.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,2524.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":521,"end_idx":545},{"Id":0,"text":"Doctor registration Page","feature":[48.0,198.0,2574.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,2524.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":545,"end_idx":548},{"Id":0,"text":"In this page the can register into the application by providing all the necessary details like experience fee timings 
etc","feature":[48.0,619.0,2644.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,2644.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":548,"end_idx":568},{"Id":0,"text":"Doctor login Page","feature":[48.0,143.0,2694.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,2644.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":568,"end_idx":571},{"Id":0,"text":"In this page the doctor can login by giving the username and password","feature":[48.0,619.0,2764.0,24.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,2764.0,24.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":571,"end_idx":584},{"Id":0,"text":"Doctor home Page","feature":[48.0,149.0,2814.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,2812.0,24.0,1.0,0.0,0.0,0.0,16.0,1.0],"start_idx":584,"end_idx":587},{"Id":0,"text":"After the successful login the doctor home page shows basic functionalities","feature":[48.0,619.0,2884.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,2884.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":587,"end_idx":598},{"Id":0,"text":"View request Page","feature":[48.0,149.0,2910.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,2884.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":598,"end_idx":601},{"Id":0,"text":"In this page the doctor can view the patient requests which are forwarded by the admin and he responds to the request","feature":[48.0,619.0,2980.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,2980.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":601,"end_idx":623},{"Id":0,"text":"View feedback Page","feature":[48.0,161.0,3030.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,2980.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":623,"end_idx":626},{"Id":0,"text":"In this page the doctor can view the patients feedback","feature":[48.0,619.0,3100.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,3100.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":626,"end_idx":636},{"Id":0,"text":"Patient registration Page","feature":[48.0,201.0,3126.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,3100.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":636,"end_idx":639},{"Id":0,"text":"In this page the 
patient can register into the application by providing all necessary details","feature":[48.0,619.0,3196.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,3196.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":639,"end_idx":654},{"Id":0,"text":"Patient login Page","feature":[48.0,146.0,3270.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,3268.0,24.0,1.0,0.0,0.0,0.0,16.0,1.0],"start_idx":654,"end_idx":657},{"Id":0,"text":"In this page patient can log in by giving username and password","feature":[48.0,619.0,3340.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,3340.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":657,"end_idx":669},{"Id":0,"text":"Patient home Page","feature":[48.0,152.0,3366.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,3340.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":669,"end_idx":672},{"Id":0,"text":"After successful login the application shows the patient homepage with basic functionalities","feature":[48.0,619.0,3436.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,3436.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":672,"end_idx":684},{"Id":0,"text":"Search Results Page","feature":[48.0,165.0,3582.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,3580.0,24.0,1.0,0.0,0.0,0.0,16.0,1.0],"start_idx":684,"end_idx":687},{"Id":0,"text":"In this page patient can search the doctor by giving the category reason location by selecting on the map","feature":[48.0,619.0,3652.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,3652.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":687,"end_idx":706},{"Id":0,"text":"In this page after giving the details for searching the doctor the search results will be shown like as in above screen","feature":[48.0,619.0,3724.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,3724.0,72.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":706,"end_idx":728},{"Id":0,"text":"In this page patient can view the status of his booking whether the booking is confirmed or 
not","feature":[48.0,619.0,3844.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,3844.0,48.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":728,"end_idx":746},{"Id":0,"text":"Feedback Page","feature":[48.0,123.0,3918.0,19.0,0.0,0.0,0.0,0.0,16.0,1.0,48.0,619.0,3916.0,24.0,1.0,0.0,0.0,0.0,16.0,1.0],"start_idx":746,"end_idx":748},{"Id":0,"text":"In this page patient can give the feedback for the doctor based on his performance","feature":[48.0,619.0,3988.0,24.0,1.0,0.0,0.0,0.0,16.0,0.0,48.0,619.0,3988.0,24.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":748,"end_idx":763},{"Id":0,"text":"201718 Java Projects","feature":[76.0,184.0,4102.0,14.0,0.0,0.0,0.0,0.0,11.0,0.0,48.0,619.0,4068.0,98.0,1.0,0.0,0.0,0.0,11.0,0.0],"start_idx":763,"end_idx":766},{"Id":0,"text":"CSE Projects","feature":[268.0,111.0,4102.0,14.0,0.0,0.0,0.0,0.0,11.0,0.0,48.0,619.0,4068.0,98.0,1.0,0.0,0.0,0.0,11.0,0.0],"start_idx":766,"end_idx":768},{"Id":0,"text":"Java Abstracts","feature":[387.0,136.0,4102.0,14.0,0.0,0.0,0.0,0.0,11.0,0.0,48.0,619.0,4068.0,98.0,1.0,0.0,0.0,0.0,11.0,0.0],"start_idx":768,"end_idx":770},{"Id":0,"text":"Java Based Projects","feature":[76.0,549.0,4102.0,30.0,0.0,0.0,0.0,0.0,11.0,0.0,48.0,619.0,4068.0,98.0,1.0,0.0,0.0,0.0,11.0,0.0],"start_idx":770,"end_idx":773},{"Id":0,"text":"MySQL Projects","feature":[161.0,135.0,4118.0,14.0,0.0,0.0,0.0,0.0,11.0,0.0,48.0,619.0,4068.0,98.0,1.0,0.0,0.0,0.0,11.0,0.0],"start_idx":773,"end_idx":775},{"Id":0,"text":"Previous","feature":[48.0,309.0,5379.0,16.0,0.0,0.0,0.0,0.0,11.0,0.0,48.0,309.0,5379.0,51.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":775,"end_idx":776},{"Id":0,"text":"Venue Booking System Java Project","feature":[48.0,266.0,5409.0,18.0,0.0,0.0,0.0,0.0,15.0,0.0,48.0,309.0,5379.0,51.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":776,"end_idx":781},{"Id":0,"text":"Next","feature":[358.0,309.0,5379.0,16.0,0.0,0.0,0.0,0.0,11.0,0.0,358.0,309.0,5379.0,75.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":781,"end_idx":782},{"Id":0,"text":"Campus Recruitment System Java 
Project","feature":[413.0,254.0,5409.0,42.0,0.0,0.0,0.0,0.0,15.0,0.0,358.0,309.0,5379.0,75.0,1.0,0.0,0.0,0.0,16.0,0.0],"start_idx":782,"end_idx":787}]', 
    'KeyPhrases': '[["Doctor", "Appointment"], ["Web", "Application"], []]'
}
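In the example above, 'VDOM' and 'KeyPhrases' appear as JSON-encoded strings, and each VDOM item's start_idx and end_idx look like token offsets into the space-split text. The sketch below decodes a heavily shortened toy record under those assumptions; `decode_record` is a hypothetical helper, and the empty feature arrays are placeholders rather than real data.

```python
import json

# Toy record shaped like the example above, shortened for illustration.
record = {
    "url": "http://1000projects.org/online-doctor-appointment-system-java-project.html",
    "text": "April 30 2018 by nikhith P Online Doctor Appointment System Java Project",
    "VDOM": '[{"Id":0,"text":"April 30 2018","feature":[],"start_idx":0,"end_idx":3},'
            '{"Id":0,"text":"by","feature":[],"start_idx":3,"end_idx":4}]',
    "KeyPhrases": '[["Doctor", "Appointment"], ["Web", "Application"], []]',
}


def decode_record(rec):
    """Split the text on spaces and decode the JSON-encoded fields."""
    tokens = rec["text"].split(" ")
    # Accept either a JSON string (as in the example) or an already decoded list.
    vdom = json.loads(rec["VDOM"]) if isinstance(rec["VDOM"], str) else rec["VDOM"]
    raw = json.loads(rec["KeyPhrases"]) if isinstance(rec["KeyPhrases"], str) else rec["KeyPhrases"]
    # Drop empty slots (fewer than three keyphrases) and join tokens into phrases.
    phrases = [" ".join(p) for p in raw if p]
    return tokens, vdom, phrases


tokens, vdom, phrases = decode_record(record)
print(phrases)  # ['Doctor Appointment', 'Web Application']

# Check that each VDOM item lines up with the token offsets it declares.
aligned = all(" ".join(tokens[n["start_idx"]:n["end_idx"]]) == n["text"] for n in vdom)
print(aligned)  # True
```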

Text

The text in this corpus is a representation of what Bing considers to be the cleanbody text of a document. Think of this as a reader view for a web page, where all ads, menus, footers and other general page content are removed and just the core content is left.

VDOM

VDOM is a textual virtual representation of the web document, derived from the DOM of the website at the time of the snapshot. The VDOM provides rich markup for the textual data in the cleanbody text. Each item contains its Id, the text referenced, start_idx, end_idx and an array of 20 features that describe the text. The first 10 represent the node's values, while 11-20 represent the same values for the text node's parent.

{ "Id":0, "text":"Average Monthly Salary for 72 Countries in the World", "feature": [46.0,915.0,384.0,106.0,1.0,0.0,1.0,0.0,36.0,0.0,46.0,915.0,384.0,106.0,1.0,0.0,1.0,0.0,36.0,0.0], "start_idx":0, "end_idx":9 }

What the feature array values represent can be found below. Booleans are encoded as floats (false -> 0.0, true -> 1.0).

  1. Node X pixel position: Upper left X coordinate.
  2. Node width in pixels: Width of the node.
  3. Node Y pixel position: Upper left Y coordinate.
  4. Node height in pixels: Height of the node (x and y refer to the upper left corner).
  5. Is the node an HTML block element (Boolean): See https://developer.mozilla.org/en-US/docs/Web/HTML/Block-level_elements. A node is a Block Element if it is an Element Node and its tag is one of the following: { "address", "article", "aside", "blockquote", "canvas", "dd", "div", "dl", "dt", "fieldset", "figcaption", "caption", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hgroup", "hr", "li", "main", "nav", "noscript", "ol", "output", "p", "pre", "section", "table", "tr", "td", "th", "tbody", "thead", "tfoot", "ul", "video" }
  6. Is the node an HTML inline element (Boolean): See https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elements. A node is an Inline Block Element if it is an Element Node and its tag is one of the following: { "select", "button" }
  7. Does the node have a heading tag (Boolean): Starting from the node and moving up through its parents, we check whether each node is a Heading Element. We stop when we encounter a Heading Element or when the parent becomes null, indicating we have reached the topmost ancestor.
  8. Is the node a leaf element (Boolean): It is a Block Element and does not contain any Block Element Nodes, i.e. none of its children are Block Element Nodes.
  9. Text font size: For Text Nodes it is the font size of the text it contains. Only text nodes have a font size, so we take the font size of the first text node as the font size of the entire block node.
  10. Is the text bolded (Boolean): Starting from the first non-empty text node and moving up through its parents, we check whether each node's tag is either "b" or "strong". We stop when we find a bold tag in one of the nodes or when the parent becomes null, indicating we have reached the topmost ancestor.
  11. Node's Parent X pixel position
  12. Node's Parent width pixel
  13. Node's Parent Y pixel position
  14. Node's Parent height
  15. Is the node's parent an HTML block element?
  16. Is the node's parent an HTML inline element?
  17. Does the node's parent have a heading tag?
  18. Is the node's parent a leaf element?
  19. Node's Parent text font size
  20. Is the node's parent text bolded?
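As a minimal sketch of how the 20 positions listed above could be given names, the snippet below splits a feature array into a node half and a parent half. The field names and the `decode_features` helper are hypothetical labels chosen for this example; only the ordering and the 0.0/1.0 Boolean encoding come from the list above.

```python
from typing import Dict, List

# Hypothetical names for positions 1-10; positions 11-20 repeat them for the parent.
NODE_FIELDS = [
    "x", "width", "y", "height",
    "is_block", "is_inline", "has_heading", "is_leaf",
    "font_size", "is_bold",
]
BOOLEAN_FIELDS = {"is_block", "is_inline", "has_heading", "is_leaf", "is_bold"}


def decode_features(feature: List[float]) -> Dict[str, Dict[str, object]]:
    """Map a 20-element VDOM feature array to named node and parent fields."""
    if len(feature) != 20:
        raise ValueError(f"expected 20 features, got {len(feature)}")

    def decode(values: List[float]) -> Dict[str, object]:
        # Booleans are stored as 0.0 / 1.0, so convert them back to bool.
        return {
            name: (bool(value) if name in BOOLEAN_FIELDS else value)
            for name, value in zip(NODE_FIELDS, values)
        }

    return {"node": decode(feature[:10]), "parent": decode(feature[10:])}


# Feature array from the VDOM item shown in the VDOM section.
feature = [46.0, 915.0, 384.0, 106.0, 1.0, 0.0, 1.0, 0.0, 36.0, 0.0,
           46.0, 915.0, 384.0, 106.0, 1.0, 0.0, 1.0, 0.0, 36.0, 0.0]
decoded = decode_features(feature)
print(decoded["node"]["font_size"], decoded["node"]["has_heading"])  # 36.0 True
```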

Definitions

  1. Text Node: Innermost nodes; the smallest unit in the node hierarchy.
  2. Element Node: Comprises most standard HTML tags, e.g. p, span, etc.
  3. Parent Node: The parent node of a node is one level higher and has the node as its child.
  4. Heading Element: A node which is an Element Node and whose tag is one of the following: {h1,h2,h3,h4,h5,h6}
  5. Non Empty Text Node: We ignore nodes if they satisfy one of the following conditions: a. TagValue, i.e. the text of the node, is empty or whitespace. b. The node is invisible. c. The parent is an Element Node and its tag is one of the following: { "script", "noscript", "style", "xmp", "embed", "noembed", "object" }, i.e. we ignore service tags.
  6. First Non Empty Text Node: Each Block Node consists of a sequence of Text Nodes. We iterate over the sequence and return the first Text Node which is Non Empty.
  7. Visible Node: Nodes with height and width greater than 2px are considered visible. We also include title nodes under the head tag as visible.
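The heading and bold checks described above both walk from a node up through its parents until a matching tag is found or the topmost ancestor is reached. The sketch below illustrates that walk on a hypothetical `Node` class; it is not the pipeline's actual implementation, and the visibility check omits the special case for title nodes under the head tag.

```python
from dataclasses import dataclass
from typing import Optional, Set

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BOLD_TAGS = {"b", "strong"}


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node; only what the checks need."""
    tag: str
    parent: Optional["Node"] = None
    width: float = 0.0
    height: float = 0.0


def has_ancestor_tag(node: Optional[Node], tags: Set[str]) -> bool:
    """Walk from the node up through its parents; stop at the first match or at the root."""
    current = node
    while current is not None:
        if current.tag in tags:
            return True
        current = current.parent
    return False


def is_visible(node: Node) -> bool:
    """Height and width must both exceed 2px (title-under-head rule omitted here)."""
    return node.width > 2 and node.height > 2


# h2 > strong > text node
heading = Node("h2", width=619, height=36)
strong = Node("strong", parent=heading, width=300, height=30)
text = Node("#text", parent=strong, width=300, height=30)

print(has_ancestor_tag(text, HEADING_TAGS))  # True
print(has_ancestor_tag(text, BOLD_TAGS))     # True
print(is_visible(Node("div", width=1, height=50)))  # False
```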

Evaluation

To evaluate keyphrase extraction systems we ask researchers to submit their system's predictions in the following format:

{"url": "http://54pizzaexpress.com/", "KeyPhrases": ["54 Pizza Express", "wholesome quality ingredient", "homemade product"]}
{"url": "http://admissions.ucr.edu/visit-ucr/campus-events.html", "KeyPhrases": ["UNIVERSITY OF CALIFORNIA", "Campus Events", "Transfer Day"]}

Each line is a JSON object with a "url" field and a "KeyPhrases" array (max 3 keyphrases per URL). The evaluation script is run as follows:
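A submission in this shape can be produced with the standard json module, one object per line. The helper below is a minimal sketch; the name `format_predictions` and the `max_phrases` parameter are assumptions for illustration, and the sample predictions are the ones shown above.

```python
import json
from typing import Dict, List


def format_predictions(predictions: Dict[str, List[str]], max_phrases: int = 3) -> str:
    """Serialize predictions as one JSON object per line, keeping at most max_phrases each."""
    lines = []
    for url, phrases in predictions.items():
        lines.append(json.dumps({"url": url, "KeyPhrases": list(phrases)[:max_phrases]}))
    return "\n".join(lines) + "\n"


predictions = {
    "http://54pizzaexpress.com/": [
        "54 Pizza Express", "wholesome quality ingredient", "homemade product",
    ],
    "http://admissions.ucr.edu/visit-ucr/campus-events.html": [
        "UNIVERSITY OF CALIFORNIA", "Campus Events", "Transfer Day",
    ],
}
output = format_predictions(predictions)
print(output, end="")
```

Writing `output` to a file then gives a candidate file that can be passed to the evaluation script.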

python evaluate.py <candidate file> <reference file>

To test, please run the evaluation script on the Dev set. If your pipeline is running correctly, your score will be similar to that shown below.

(base) spacemanidol@spacemanidol:/mnt/c/Users/dacamp/Documents/MSMARCO-OpenKP$ python evaluate.py eval_data/candidate.json eval_data/reference.json
########################
Metrics
@1
F1:0.0
P:0.0
R:0.0
@3
F1:0.6666666666666666
P:0.6666666666666666
R:0.6666666666666666
@5
F1:0.5
P:0.4
R:0.6666666666666666
#########################
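For readers unfamiliar with precision, recall and F1 at a cutoff k, the following sketch computes them for a single document using exact full-phrase matching. It is not the official evaluate.py: the lowercasing and whitespace normalization, the de-duplication of predictions, and the function names are assumptions of this example, and how the official script normalizes or aggregates over documents should be checked in the script itself.

```python
from typing import Dict, List


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace (an assumption of this sketch)."""
    return " ".join(phrase.lower().split())


def prf_at_k(candidates: List[str], references: List[str], k: int) -> Dict[str, float]:
    """Precision, recall and F1 over the top-k predictions, using exact phrase match."""
    top_k: List[str] = []
    for phrase in candidates:  # de-duplicate while preserving rank order
        norm = normalize(phrase)
        if norm not in top_k:
            top_k.append(norm)
    top_k = top_k[:k]
    gold = {normalize(p) for p in references}

    hits = sum(1 for p in top_k if p in gold)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"P": precision, "R": recall, "F1": f1}


# Toy data: the 2nd and 3rd predictions match two of the three references.
candidates = ["a b", "c d", "e f", "g h", "i j"]
references = ["c d", "e f", "x y"]
for k in (1, 3, 5):
    print(k, prf_at_k(candidates, references, k))
```

With this toy input, nothing matches at k=1, two of three predictions match at k=3, and two of five match at k=5.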

Download the Dataset

To download the OpenKP dataset, please navigate to msmarco.org and agree to our Terms and Conditions. If there is data you think we are missing that would be useful, please open an issue in this repo.

FAQ

What is the input for keyphrase extraction?

The keyphrases are all derived from the textual data, but the visual markup (aka VDOM) can be used to produce a more accurate model.

Is the released dataset processed?

Yes. The raw text has been processed. The only tokenization needed is splitting on the space character.

Is there punctuation in the document text?

No. All punctuation was removed as part of the cleanbody text processing.

Are all keyphrases short?

No! While many keyphrases are very short (the average length is 2), there are documents where the most salient keyphrases are quite long.

Is the TITLE attribute of the document considered in the processed text?

No. It is not.

In evaluation do you perform stemming to compute the exact match?

No. To keep this task focused on keyphrase prediction, matches are evaluated on full phrases. Please see the eval script for further details.

MSMARCO Dataset Family

A family of datasets built using technology and data from Microsoft's Bing.

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset Paper URL : https://arxiv.org/abs/1611.09268

MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset focused on machine reading comprehension, question answering, passage ranking, keyphrase extraction, conversational search studies, or whatever else the community thinks would be useful.

First released at NIPS 2016, the current dataset has 1,010,916 unique real queries that were generated by sampling and anonymizing Bing usage logs. The dataset started off focusing on QnA but has since evolved to focus on any problem related to search. For task specifics, please explore some of the tasks that have been built out of the dataset. If you think there is a relevant task we have missed, please open an issue explaining your ideas.

For more information about TREC 2019 Deep Learning

For more information about Q&A

For more information about Ranking

For more information about Conversational Search

For more information about Polite Crawling

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
