A Python and SQL implementation to find patterns in the metadata of about 7000 publications from Scopus
This is part of the Wicked Science project at United Nations University - FLORES, Dresden.
We were curious about how the problem structure keywords (Wicked, Complex, Uncertain, and Conflict) trend alongside the social science dimension (Policy and Governance) within the Resource Nexus (Water, Soil, Food, Waste, and Energy) across different regions of the world.
We first downloaded the CSV data from Scopus using the following keyword combination:
( AUTHKEY ( wicked* ) OR AUTHKEY ( uncertain* ) OR AUTHKEY ( complex* ) OR AUTHKEY ( conflict* ) AND AUTHKEY ( "Water" ) OR AUTHKEY ( "Soil" ) OR AUTHKEY ( "Waste" ) OR AUTHKEY ( "Energy" ) OR AUTHKEY ( "Food" ) AND AUTHKEY ( "Governance" ) OR AUTHKEY ( "Policy" ) ) OR ( TITLE ( wicked* ) OR TITLE ( uncertain* ) OR TITLE ( complex* ) OR TITLE ( conflict* ) AND TITLE ( "Water" ) OR TITLE ( "Soil" ) OR TITLE ( "Waste" ) OR TITLE ( "Energy" ) OR TITLE ( "Food" ) AND TITLE ( "Governance" ) OR TITLE ( "Policy" ) ) OR ( ABS ( wicked* ) OR ABS ( uncertain* ) OR ABS ( complex* ) OR ABS ( conflict* ) AND ABS ( "Water" ) OR ABS ( "Soil" ) OR ABS ( "Waste" ) OR ABS ( "Energy" ) OR ABS ( "Food" ) AND ABS ( "Governance" ) OR ABS ( "Policy" ) ) AND ( LIMIT-TO ( SUBJAREA , "SOCI" ) ) AND ( EXCLUDE ( PUBYEAR , 2021 ) ) AND ( LIMIT-TO ( LANGUAGE , "English" ) )
This generated 7041 results!
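A query of this shape is easy to mistype by hand, so a small generator can help. The following is a minimal sketch, not the script used in the project: it rebuilds a query from the three keyword groups and wraps each group in explicit parentheses, so the intended grouping does not depend on operator precedence. The function names (`any_of`, `field_clause`, `build_query`) are hypothetical.

```python
# Keyword groups and field codes taken from the query above.
PROBLEM_STRUCTURE = ["wicked*", "uncertain*", "complex*", "conflict*"]
RESOURCE_NEXUS = ['"Water"', '"Soil"', '"Waste"', '"Energy"', '"Food"']
SOCIAL_DIMENSION = ['"Governance"', '"Policy"']
FIELDS = ["AUTHKEY", "TITLE", "ABS"]


def any_of(field, terms):
    """Return '( FIELD ( a ) OR FIELD ( b ) ... )' for one keyword group."""
    return "( " + " OR ".join(f"{field} ( {t} )" for t in terms) + " )"


def field_clause(field):
    """Require one term from each keyword group within a single field."""
    groups = [PROBLEM_STRUCTURE, RESOURCE_NEXUS, SOCIAL_DIMENSION]
    return "( " + " AND ".join(any_of(field, g) for g in groups) + " )"


def build_query():
    """Combine the per-field clauses and append the filters from the query."""
    fields = " OR ".join(field_clause(f) for f in FIELDS)
    filters = (
        'AND ( LIMIT-TO ( SUBJAREA , "SOCI" ) ) '
        "AND ( EXCLUDE ( PUBYEAR , 2021 ) ) "
        'AND ( LIMIT-TO ( LANGUAGE , "English" ) )'
    )
    return f"( {fields} ) {filters}"


query = build_query()
print(query)
```

The printed string can be pasted into the Scopus advanced search box; whether it returns exactly the same result set as the hand-written query has not been checked here.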
We downloaded the following data
Data Cleaning and Concatenation (Tool: Python, Environment: Google Colab)
However, we mostly need the author names, author keywords, title, abstract, and affiliations. Scopus allows exporting only 2000 records at a time, so the data was downloaded in four chunks by year range:
- 1953-2010.csv
- 2011-2014.csv
- 2015-2017.csv
- 2018-2020.csv
A Python joiner was built with pandas to concatenate the files. The joined file is saved as 1953-2020.csv and 1953-2020.xlsx.
The joiner also assigns an index and outputs a CSV file for pattern matching in SQL.
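The joining step can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual joiner: the function name `join_chunks` and the index column name `id` are hypothetical, and the last chunk is assumed to cover 2018-2020.

```python
import pandas as pd

# Chunk file names; the last one is assumed to cover 2018-2020.
CHUNK_FILES = ["1953-2010.csv", "2011-2014.csv", "2015-2017.csv", "2018-2020.csv"]


def join_chunks(paths):
    """Read each Scopus export and stack them into one DataFrame."""
    frames = [pd.read_csv(p) for p in paths]
    joined = pd.concat(frames, ignore_index=True)
    # Assign a 1-based index column to serve as a row key in SQL.
    # The column name "id" is a placeholder, not taken from the project.
    joined.insert(0, "id", range(1, len(joined) + 1))
    return joined
```

With the four chunk files in the working directory, `join_chunks(CHUNK_FILES).to_csv("1953-2020.csv", index=False)` would write the combined file. Writing the .xlsx copy with `to_excel` additionally needs an Excel engine such as openpyxl.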
SQL Pattern Matching (Tool: PostgreSQL)
The pattern matching is intended to find co-occurrences, or intersections, of keywords from the Problem Structure, Social Science Dimensions, and Resource Nexus groups. The implementation uses the following keywords:
Problem Structure:
- Wicked
- Conflict
- Complex
- Uncertain
Social Science Dimensions:
- Governance
- Policy
Resource Nexus:
- Water
- Soil
- Food
- Energy
- Waste
The details of column creation, the pattern-matching code, and the formulas are listed in this file.
The results are included in the results folder.
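To illustrate the idea of co-occurrence matching, here is a minimal sketch that counts records containing one keyword from each of the three groups. It is not the project's PostgreSQL code: it uses SQLite through Python's standard `sqlite3` module as a stand-in, and the table name `pubs`, its columns, and the sample rows are all hypothetical. SQLite's `LIKE` is case-insensitive for ASCII text by default; in PostgreSQL the case-insensitive operator is `ILIKE`.

```python
import sqlite3

# Hypothetical sample rows standing in for the joined Scopus export:
# (id, title, author keywords).
ROWS = [
    (1, "Wicked problems in water governance", ""),
    (2, "Soil erosion under uncertain rainfall", "policy; soil"),
    (3, "Energy policy and conflict", "energy"),
    (4, "Complex food systems", "food"),
]

PROBLEM = ["wicked", "conflict", "complex", "uncertain"]
SOCIAL = ["governance", "policy"]
NEXUS = ["water", "soil", "food", "energy", "waste"]


def cooccurrence_counts(rows):
    """Count records matching each (problem, social, nexus) keyword triple."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE pubs (id INTEGER, title TEXT, keywords TEXT)")
    con.executemany("INSERT INTO pubs VALUES (?, ?, ?)", rows)
    # A record matches when all three patterns occur in title or keywords.
    sql = """
        SELECT COUNT(*) FROM pubs
        WHERE (title || ' ' || keywords) LIKE ?
          AND (title || ' ' || keywords) LIKE ?
          AND (title || ' ' || keywords) LIKE ?
    """
    counts = {}
    for p in PROBLEM:
        for s in SOCIAL:
            for n in NEXUS:
                params = (f"%{p}%", f"%{s}%", f"%{n}%")
                counts[(p, s, n)] = con.execute(sql, params).fetchone()[0]
    con.close()
    return counts


counts = cooccurrence_counts(ROWS)
```

Each of the 40 keyword triples gets its own count, so the result can be pivoted into a table of Problem Structure by Resource Nexus for each Social Science Dimension.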
Country/Regional Diffusion (Tool: Google Sheets)
We were curious about how the keywords diffused across different regions over time, in order to identify trends in publication. The results were generated by detecting countries in the affiliation column and joining them with the SQL outputs.
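The country detection was done in a spreadsheet; the sketch below only illustrates the same idea in Python and is not the method used in the project. It assumes that affiliations within one record are separated by semicolons and that each affiliation ends with the country name after its last comma. The country list, function names, and sample records are hypothetical.

```python
from collections import Counter

# Small illustrative country list; a real run would need a complete one.
COUNTRIES = {"Germany", "India", "United States", "Brazil", "China"}


def countries_in(affiliations):
    """Return the set of known countries named in one affiliation string.

    Assumes affiliations are separated by ';' and that each one ends with
    the country name after the last comma.
    """
    found = set()
    for affiliation in affiliations.split(";"):
        country = affiliation.rsplit(",", 1)[-1].strip()
        if country in COUNTRIES:
            found.add(country)
    return found


def count_by_country_and_year(records):
    """Count publications per (country, year); a paper counts once per country."""
    counts = Counter()
    for year, affiliations in records:
        for country in countries_in(affiliations):
            counts[(country, year)] += 1
    return counts


# Hypothetical (year, affiliation string) pairs.
records = [
    (2015, "Institute A, Dresden, Germany; University B, Delhi, India"),
    (2015, "University C, Dresden, Germany"),
    (2018, "University D, Sao Paulo, Brazil"),
]
diffusion = count_by_country_and_year(records)
```

Joining these counts with the keyword matches from the SQL step, for example on the shared record index, would give keyword counts per country and year.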