Commit

Updated logs and folder structure.
mihaeladuta committed Feb 8, 2024
1 parent e7c7304 commit d41791b
Showing 15 changed files with 205 additions and 155 deletions.
Binary file not shown.
Binary file not shown.
15 changes: 11 additions & 4 deletions logs/EnvironmentStatementsInstitutionLevel.log
Original file line number Diff line number Diff line change
@@ -1,4 +1,11 @@
2024-02-06 17:05:39,048 [INFO] EnvironmentStatementsInstitutionLevel - read data from 'data/processed/environment_statements/extracted/institution/'
2024-02-06 17:05:39,048 [INFO] EnvironmentStatementsInstitutionLevel - statements: 143, sections: 4
2024-02-06 17:05:39,485 [INFO] EnvironmentStatementsInstitutionLevel - prepared institution statements: 143 records, 5 columns
2024-02-06 17:05:39,508 [INFO] EnvironmentStatementsInstitutionLevel - write dataset to 'data/processed/environment_statements/prepared/EnvironmentStatementsInstitutionLevel.parquet'
2024-02-08 19:40:09,081 [INFO] EnvironmentStatementsInstitutionLevel - read data from 'data/processed/environment_statements/extracted/institution/'
2024-02-08 19:40:09,081 [INFO] EnvironmentStatementsInstitutionLevel - statements: 143, sections: 4
2024-02-08 19:40:09,082 [INFO] EnvironmentStatementsInstitutionLevel - split statements into lines
2024-02-08 19:40:09,082 [INFO] EnvironmentStatementsInstitutionLevel - deleted empty lines
2024-02-08 19:40:09,082 [INFO] EnvironmentStatementsInstitutionLevel - replaced tabs with spaces
2024-02-08 19:40:09,082 [INFO] EnvironmentStatementsInstitutionLevel - replaced multiple spaces with a single space
2024-02-08 19:40:09,083 [INFO] EnvironmentStatementsInstitutionLevel - deleted lines with page numbers
2024-02-08 19:40:09,084 [INFO] EnvironmentStatementsInstitutionLevel - deleted lines equal to any of ['Institutional level environment template (REF5a)', 'Institutional level environment template (REF5b)', 'Unit-level environment template (REF5a)', 'Unit-level environment template (REF5b)', 'REF5a - Institution Environment Statement', 'Institutional-Level Environment Statement (REF5a)']
2024-02-08 19:40:09,558 [INFO] EnvironmentStatementsInstitutionLevel - processed all 143 available statements
2024-02-08 19:40:09,559 [INFO] EnvironmentStatementsInstitutionLevel - make categorical ['Institution name']
2024-02-08 19:40:09,609 [INFO] EnvironmentStatementsInstitutionLevel - write dataset to 'data/processed/environment_statements/EnvironmentStatementsInstitutionLevel.parquet'
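The new log entries above list the text-cleaning steps applied to each statement. The sketch below illustrates how that sequence could look in Python. It is not the repository's code: the function name `clean_statement`, the `PAGE_NUMBER` pattern, and the decision to treat "Page N" or a bare number as a page-number line are all assumptions made for illustration. Only the list of template headers is taken from the log.

```python
import re

# Header lines copied from the "deleted lines equal to any of" log entry.
TEMPLATE_HEADERS = {
    "Institutional level environment template (REF5a)",
    "Institutional level environment template (REF5b)",
    "Unit-level environment template (REF5a)",
    "Unit-level environment template (REF5b)",
    "REF5a - Institution Environment Statement",
    "Institutional-Level Environment Statement (REF5a)",
}

# Assumption: a page-number line is either "Page N" or a bare integer.
PAGE_NUMBER = re.compile(r"(?:Page\s+)?\d+", re.IGNORECASE)


def clean_statement(text: str) -> list[str]:
    """Apply the logged cleaning steps, in the logged order (hypothetical helper)."""
    lines = text.splitlines()                                     # split statements into lines
    lines = [ln for ln in lines if ln.strip()]                    # delete empty lines
    lines = [ln.replace("\t", " ") for ln in lines]               # replace tabs with spaces
    lines = [re.sub(r" {2,}", " ", ln).strip() for ln in lines]   # collapse multiple spaces
    lines = [ln for ln in lines if not PAGE_NUMBER.fullmatch(ln)]  # delete page-number lines
    lines = [ln for ln in lines if ln not in TEMPLATE_HEADERS]    # delete template headers
    return lines


sample = "Unit-level environment template (REF5a)\n\nContext\tand  mission\nPage 3\n12\nResearch strategy"
cleaned = clean_statement(sample)
```

Collapsing whitespace before the page-number and header checks means those comparisons run against normalised lines, which matches the order in which the steps are logged.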
15 changes: 11 additions & 4 deletions logs/EnvironmentStatementsUnitLevel.log
@@ -1,4 +1,11 @@
2024-02-06 17:05:39,030 [INFO] EnvironmentStatementsUnitLevel - read data from 'data/processed/environment_statements/extracted/unit/'
2024-02-06 17:05:39,030 [INFO] EnvironmentStatementsUnitLevel - statements: 1874, sections: 4
2024-02-06 17:05:57,989 [INFO] EnvironmentStatementsUnitLevel - prepared statements: 1874 records
2024-02-06 17:05:58,524 [INFO] EnvironmentStatementsUnitLevel - write dataset to 'data/processed/environment_statements/prepared/EnvironmentStatementsUnitLevel.parquet'
2024-02-08 19:40:09,059 [INFO] EnvironmentStatementsUnitLevel - read data from 'data/processed/environment_statements/extracted/unit/'
2024-02-08 19:40:09,060 [INFO] EnvironmentStatementsUnitLevel - statements: 1874, sections: 4
2024-02-08 19:40:09,061 [INFO] EnvironmentStatementsUnitLevel - split statements into lines
2024-02-08 19:40:09,061 [INFO] EnvironmentStatementsUnitLevel - deleted empty lines
2024-02-08 19:40:09,061 [INFO] EnvironmentStatementsUnitLevel - replaced tabs with spaces
2024-02-08 19:40:09,062 [INFO] EnvironmentStatementsUnitLevel - replaced multiple spaces with a single space
2024-02-08 19:40:09,063 [INFO] EnvironmentStatementsUnitLevel - deleted lines with page numbers
2024-02-08 19:40:09,063 [INFO] EnvironmentStatementsUnitLevel - deleted lines equal to any of ['Institutional level environment template (REF5a)', 'Institutional level environment template (REF5b)', 'Unit-level environment template (REF5a)', 'Unit-level environment template (REF5b)', 'REF5a - Institution Environment Statement', 'Institutional-Level Environment Statement (REF5a)']
2024-02-08 19:40:28,754 [INFO] EnvironmentStatementsUnitLevel - processed all 1874 available statements
2024-02-08 19:40:28,754 [INFO] EnvironmentStatementsUnitLevel - make categorical ['Institution name', 'Unit of assessment name', 'Multiple submission letter']
2024-02-08 19:40:29,145 [INFO] EnvironmentStatementsUnitLevel - write dataset to 'data/processed/environment_statements/EnvironmentStatementsUnitLevel.parquet'
20 changes: 10 additions & 10 deletions logs/ImpactCaseStudies.log
@@ -1,10 +1,10 @@
2024-02-06 17:05:42,732 [INFO] ImpactCaseStudies - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx'
2024-02-06 17:05:43,321 [INFO] ImpactCaseStudies - parsed sheet: 6361 records
2024-02-06 17:05:43,322 [INFO] ImpactCaseStudies - rename 'Main panel' to 'Main panel code'
2024-02-06 17:05:43,324 [INFO] ImpactCaseStudies - replace '['/', ':']' with '_' in 'Institution name'
2024-02-06 17:05:43,325 [INFO] ImpactCaseStudies - add columns for panel names
2024-02-06 17:05:43,326 [INFO] ImpactCaseStudies - shift columns from title to the left to fix raw data issue
2024-02-06 17:05:46,539 [INFO] ImpactCaseStudies - replace styling characters in ['1. Summary of the impact', '2. Underpinning research', '3. References to the research', '4. Details of the impact', '5. Sources to corroborate the impact']
2024-02-06 17:05:46,543 [INFO] ImpactCaseStudies - drop columns '['Researcher ORCIDs', 'Institution UKPRN code', '5. Sources to corroborate the impact', 'Unit of assessment number', 'Global research identifiers', 'Main panel code', '3. References to the research', 'Formal partners', 'Is continued from 2014', 'Grant funding', '2. Underpinning research', 'Countries']'
2024-02-06 17:05:46,566 [INFO] ImpactCaseStudies - make categorical ['Institution name', 'Main panel name', 'Joint submission', 'Unit of assessment name', 'Multiple submission letter', 'Multiple submission name']
2024-02-06 17:05:46,765 [INFO] ImpactCaseStudies - write dataset to 'data/processed/sheets/ImpactCaseStudies.parquet'
2024-02-08 19:40:12,766 [INFO] ImpactCaseStudies - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx'
2024-02-08 19:40:13,319 [INFO] ImpactCaseStudies - parsed sheet: 6361 records
2024-02-08 19:40:13,319 [INFO] ImpactCaseStudies - rename 'Main panel' to 'Main panel code'
2024-02-08 19:40:13,321 [INFO] ImpactCaseStudies - replace '['/', ':']' with '_' in 'Institution name'
2024-02-08 19:40:13,323 [INFO] ImpactCaseStudies - add columns for panel names
2024-02-08 19:40:13,324 [INFO] ImpactCaseStudies - shift columns from title to the left to fix raw data issue
2024-02-08 19:40:16,184 [INFO] ImpactCaseStudies - replace styling characters in ['1. Summary of the impact', '2. Underpinning research', '3. References to the research', '4. Details of the impact', '5. Sources to corroborate the impact']
2024-02-08 19:40:16,187 [INFO] ImpactCaseStudies - drop columns '['3. References to the research', '5. Sources to corroborate the impact', 'Researcher ORCIDs', 'Institution UKPRN code', 'Is continued from 2014', 'Unit of assessment number', 'Countries', 'Grant funding', 'Main panel code', 'Formal partners', 'Global research identifiers', '2. Underpinning research']'
2024-02-08 19:40:16,215 [INFO] ImpactCaseStudies - make categorical ['Main panel name', 'Institution name', 'Multiple submission letter', 'Unit of assessment name', 'Multiple submission name', 'Joint submission']
2024-02-08 19:40:16,429 [INFO] ImpactCaseStudies - write dataset to 'data/processed/sheets/ImpactCaseStudies.parquet'
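In the two runs above, the "drop columns" and "make categorical" entries name the same columns in a different order, so apart from timestamps these lines differ only by ordering. One plausible cause, assumed here rather than confirmed by the commit, is that the column lists come from a Python set, whose iteration order for strings can vary between interpreter runs. A minimal sketch of logging such collections in sorted order, using a hypothetical helper `format_columns`:

```python
def format_columns(columns) -> str:
    """Render a collection of column names in a stable, sorted order (hypothetical helper)."""
    return str(sorted(columns))


# Example with a set, whose iteration order is not guaranteed across runs.
to_drop = {"Main panel code", "Countries", "Grant funding"}
message = f"drop columns '{format_columns(to_drop)}'"
```

With sorted output, repeated runs over unchanged data would produce log lines that differ only in their timestamps.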
24 changes: 12 additions & 12 deletions logs/Outputs.log
@@ -1,12 +1,12 @@
2024-02-06 17:05:42,713 [INFO] Outputs - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx'
2024-02-06 17:05:59,201 [INFO] Outputs - parsed sheet: 185353 records
2024-02-06 17:05:59,221 [INFO] Outputs - rename 'Main panel' to 'Main panel code'
2024-02-06 17:05:59,252 [INFO] Outputs - rename 'Output type' to 'Output type code'
2024-02-06 17:05:59,298 [INFO] Outputs - replace '['/', ':']' with '_' in 'Institution name'
2024-02-06 17:05:59,338 [INFO] Outputs - add columns for panel names
2024-02-06 17:06:00,062 [INFO] Outputs - replace styling characters in ['Title']
2024-02-06 17:06:00,097 [INFO] Outputs - add columns for output types names
2024-02-06 17:06:00,098 [INFO] Outputs - make output year categorical
2024-02-06 17:06:00,141 [INFO] Outputs - drop columns '['Main panel code', 'Institution UKPRN code', 'Unit of assessment number', 'Output type code']'
2024-02-06 17:06:00,155 [INFO] Outputs - make categorical ['Institution name', 'Multiple submission letter', 'Research group', 'Delayed by COVID19', 'Interdisciplinary', 'Output type', 'Joint submission', 'Is reserve output', 'Open access status', 'Cross-referral requested', 'Propose double weighting', 'Citations applicable', 'Main panel name', 'Unit of assessment name', 'Multiple submission name', 'Forensic science', 'Non-English', 'Criminology']
2024-02-06 17:06:00,469 [INFO] Outputs - write dataset to 'data/processed/sheets/Outputs.parquet'
2024-02-08 19:40:12,758 [INFO] Outputs - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx'
2024-02-08 19:40:29,215 [INFO] Outputs - parsed sheet: 185353 records
2024-02-08 19:40:29,240 [INFO] Outputs - rename 'Main panel' to 'Main panel code'
2024-02-08 19:40:29,273 [INFO] Outputs - rename 'Output type' to 'Output type code'
2024-02-08 19:40:29,319 [INFO] Outputs - replace '['/', ':']' with '_' in 'Institution name'
2024-02-08 19:40:29,363 [INFO] Outputs - add columns for panel names
2024-02-08 19:40:30,092 [INFO] Outputs - replace styling characters in ['Title']
2024-02-08 19:40:30,128 [INFO] Outputs - add columns for output types names
2024-02-08 19:40:30,132 [INFO] Outputs - make output year categorical
2024-02-08 19:40:30,175 [INFO] Outputs - drop columns '['Institution UKPRN code', 'Unit of assessment number', 'Output type code', 'Main panel code']'
2024-02-08 19:40:30,189 [INFO] Outputs - make categorical ['Unit of assessment name', 'Institution name', 'Multiple submission name', 'Interdisciplinary', 'Cross-referral requested', 'Main panel name', 'Multiple submission letter', 'Forensic science', 'Open access status', 'Citations applicable', 'Propose double weighting', 'Non-English', 'Is reserve output', 'Delayed by COVID19', 'Joint submission', 'Research group', 'Output type', 'Criminology']
2024-02-08 19:40:30,511 [INFO] Outputs - write dataset to 'data/processed/sheets/Outputs.parquet'
18 changes: 9 additions & 9 deletions logs/ResearchDoctoralDegreesAwarded.log
@@ -1,9 +1,9 @@
2024-02-06 17:05:42,746 [INFO] ResearchDoctoralDegreesAwarded - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx'
2024-02-06 17:05:42,829 [INFO] ResearchDoctoralDegreesAwarded - parsed sheet: 1888 records
2024-02-06 17:05:42,830 [INFO] ResearchDoctoralDegreesAwarded - rename 'Main panel' to 'Main panel code'
2024-02-06 17:05:42,831 [INFO] ResearchDoctoralDegreesAwarded - replace '['/', ':']' with '_' in 'Institution name'
2024-02-06 17:05:42,832 [INFO] ResearchDoctoralDegreesAwarded - add columns for panel names
2024-02-06 17:05:42,833 [INFO] ResearchDoctoralDegreesAwarded - calculate total number of degrees awarded
2024-02-06 17:05:42,833 [INFO] ResearchDoctoralDegreesAwarded - drop columns '['Unit of assessment number', 'Main panel code', 'Institution UKPRN code']'
2024-02-06 17:05:42,833 [INFO] ResearchDoctoralDegreesAwarded - make categorical ['Multiple submission name', 'Joint submission', 'Main panel name', 'Multiple submission letter', 'Unit of assessment name', 'Institution name']
2024-02-06 17:05:42,851 [INFO] ResearchDoctoralDegreesAwarded - write dataset to 'data/processed/sheets/ResearchDoctoralDegreesAwarded.parquet'
2024-02-08 19:40:12,755 [INFO] ResearchDoctoralDegreesAwarded - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx'
2024-02-08 19:40:12,839 [INFO] ResearchDoctoralDegreesAwarded - parsed sheet: 1888 records
2024-02-08 19:40:12,839 [INFO] ResearchDoctoralDegreesAwarded - rename 'Main panel' to 'Main panel code'
2024-02-08 19:40:12,840 [INFO] ResearchDoctoralDegreesAwarded - replace '['/', ':']' with '_' in 'Institution name'
2024-02-08 19:40:12,841 [INFO] ResearchDoctoralDegreesAwarded - add columns for panel names
2024-02-08 19:40:12,842 [INFO] ResearchDoctoralDegreesAwarded - calculate total number of degrees awarded
2024-02-08 19:40:12,842 [INFO] ResearchDoctoralDegreesAwarded - drop columns '['Unit of assessment number', 'Institution UKPRN code', 'Main panel code']'
2024-02-08 19:40:12,843 [INFO] ResearchDoctoralDegreesAwarded - make categorical ['Unit of assessment name', 'Multiple submission name', 'Joint submission', 'Main panel name', 'Multiple submission letter', 'Institution name']
2024-02-08 19:40:12,856 [INFO] ResearchDoctoralDegreesAwarded - write dataset to 'data/processed/sheets/ResearchDoctoralDegreesAwarded.parquet'
18 changes: 9 additions & 9 deletions logs/ResearchGroups.log
@@ -1,9 +1,9 @@
2024-02-06 17:05:42,734 [INFO] ResearchGroups - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx'
2024-02-06 17:05:42,800 [INFO] ResearchGroups - parsed sheet: 2036 records
2024-02-06 17:05:42,800 [INFO] ResearchGroups - rename 'Main panel' to 'Main panel code'
2024-02-06 17:05:42,801 [INFO] ResearchGroups - replace '['/', ':']' with '_' in 'Institution name'
2024-02-06 17:05:42,802 [INFO] ResearchGroups - add columns for panel names
2024-02-06 17:05:42,803 [INFO] ResearchGroups - make group code categorical
2024-02-06 17:05:42,803 [INFO] ResearchGroups - drop columns '['Main panel code', 'Unit of assessment number', 'Institution UKPRN code']'
2024-02-06 17:05:42,803 [INFO] ResearchGroups - make categorical ['Institution name', 'Joint submission', 'Main panel name', 'Unit of assessment name', 'Multiple submission name', 'Multiple submission letter']
2024-02-06 17:05:42,817 [INFO] ResearchGroups - write dataset to 'data/processed/sheets/ResearchGroups.parquet'
2024-02-08 19:40:12,754 [INFO] ResearchGroups - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx'
2024-02-08 19:40:12,808 [INFO] ResearchGroups - parsed sheet: 2036 records
2024-02-08 19:40:12,809 [INFO] ResearchGroups - rename 'Main panel' to 'Main panel code'
2024-02-08 19:40:12,810 [INFO] ResearchGroups - replace '['/', ':']' with '_' in 'Institution name'
2024-02-08 19:40:12,811 [INFO] ResearchGroups - add columns for panel names
2024-02-08 19:40:12,811 [INFO] ResearchGroups - make group code categorical
2024-02-08 19:40:12,812 [INFO] ResearchGroups - drop columns '['Institution UKPRN code', 'Unit of assessment number', 'Main panel code']'
2024-02-08 19:40:12,812 [INFO] ResearchGroups - make categorical ['Unit of assessment name', 'Main panel name', 'Multiple submission name', 'Joint submission', 'Multiple submission letter', 'Institution name']
2024-02-08 19:40:12,824 [INFO] ResearchGroups - write dataset to 'data/processed/sheets/ResearchGroups.parquet'
18 changes: 9 additions & 9 deletions logs/ResearchIncome.log
@@ -1,9 +1,9 @@
2024-02-06 17:05:42,714 [INFO] ResearchIncome - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx'
2024-02-06 17:05:43,917 [INFO] ResearchIncome - parsed sheet: 28637 records
2024-02-06 17:05:43,918 [INFO] ResearchIncome - rename 'Main panel' to 'Main panel code'
2024-02-06 17:05:43,926 [INFO] ResearchIncome - replace '['/', ':']' with '_' in 'Institution name'
2024-02-06 17:05:43,931 [INFO] ResearchIncome - add columns for panel names
2024-02-06 17:05:43,934 [INFO] ResearchIncome - make income source categorical
2024-02-06 17:05:43,936 [INFO] ResearchIncome - drop columns '['Unit of assessment number', 'Institution UKPRN code', 'Main panel code']'
2024-02-06 17:05:43,936 [INFO] ResearchIncome - make categorical ['Unit of assessment name', 'Multiple submission name', 'Joint submission', 'Multiple submission letter', 'Main panel name', 'Institution name']
2024-02-06 17:05:43,965 [INFO] ResearchIncome - write dataset to 'data/processed/sheets/ResearchIncome.parquet'
2024-02-08 19:40:12,755 [INFO] ResearchIncome - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx'
2024-02-08 19:40:13,935 [INFO] ResearchIncome - parsed sheet: 28637 records
2024-02-08 19:40:13,936 [INFO] ResearchIncome - rename 'Main panel' to 'Main panel code'
2024-02-08 19:40:13,942 [INFO] ResearchIncome - replace '['/', ':']' with '_' in 'Institution name'
2024-02-08 19:40:13,945 [INFO] ResearchIncome - add columns for panel names
2024-02-08 19:40:13,947 [INFO] ResearchIncome - make income source categorical
2024-02-08 19:40:13,948 [INFO] ResearchIncome - drop columns '['Institution UKPRN code', 'Main panel code', 'Unit of assessment number']'
2024-02-08 19:40:13,949 [INFO] ResearchIncome - make categorical ['Unit of assessment name', 'Multiple submission letter', 'Joint submission', 'Multiple submission name', 'Main panel name', 'Institution name']
2024-02-08 19:40:13,973 [INFO] ResearchIncome - write dataset to 'data/processed/sheets/ResearchIncome.parquet'
18 changes: 9 additions & 9 deletions logs/ResearchIncomeInKind.log
@@ -1,9 +1,9 @@
2024-02-06 17:05:42,756 [INFO] ResearchIncomeInKind - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx'
2024-02-06 17:05:42,964 [INFO] ResearchIncomeInKind - parsed sheet: 4093 records
2024-02-06 17:05:42,965 [INFO] ResearchIncomeInKind - rename 'Main panel' to 'Main panel code'
2024-02-06 17:05:42,967 [INFO] ResearchIncomeInKind - replace '['/', ':']' with '_' in 'Institution name'
2024-02-06 17:05:42,968 [INFO] ResearchIncomeInKind - add columns for panel names
2024-02-06 17:05:42,969 [INFO] ResearchIncomeInKind - make income source categorical
2024-02-06 17:05:42,969 [INFO] ResearchIncomeInKind - drop columns '['Unit of assessment number', 'Main panel code', 'Institution UKPRN code']'
2024-02-06 17:05:42,969 [INFO] ResearchIncomeInKind - make categorical ['Multiple submission name', 'Main panel name', 'Joint submission', 'Unit of assessment name', 'Multiple submission letter', 'Institution name']
2024-02-06 17:05:42,983 [INFO] ResearchIncomeInKind - write dataset to 'data/processed/sheets/ResearchIncomeInKind.parquet'
2024-02-08 19:40:12,767 [INFO] ResearchIncomeInKind - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx'
2024-02-08 19:40:12,974 [INFO] ResearchIncomeInKind - parsed sheet: 4093 records
2024-02-08 19:40:12,974 [INFO] ResearchIncomeInKind - rename 'Main panel' to 'Main panel code'
2024-02-08 19:40:12,975 [INFO] ResearchIncomeInKind - replace '['/', ':']' with '_' in 'Institution name'
2024-02-08 19:40:12,976 [INFO] ResearchIncomeInKind - add columns for panel names
2024-02-08 19:40:12,977 [INFO] ResearchIncomeInKind - make income source categorical
2024-02-08 19:40:12,977 [INFO] ResearchIncomeInKind - drop columns '['Unit of assessment number', 'Institution UKPRN code', 'Main panel code']'
2024-02-08 19:40:12,977 [INFO] ResearchIncomeInKind - make categorical ['Unit of assessment name', 'Multiple submission letter', 'Joint submission', 'Institution name', 'Main panel name', 'Multiple submission name']
2024-02-08 19:40:12,989 [INFO] ResearchIncomeInKind - write dataset to 'data/processed/sheets/ResearchIncomeInKind.parquet'