Adding first draft of the API data standards #93

Merged
merged 21 commits into from
Sep 13, 2024
f1615f2
First draft of API standards guidance, including section on tidy data.
rmbielby Sep 12, 2024
9fc0db0
Rearranged good and bad examples of tidy data structures
rmbielby Sep 12, 2024
2c7682c
Added api section on filter standardisation
rmbielby Sep 12, 2024
e33d96c
Added filter standardisation to API guidance
rmbielby Sep 12, 2024
a61c142
Added additional col_name examples to filter standardisation in the A…
rmbielby Sep 12, 2024
f21fa08
Added indicator naming examples and character limits on different ele…
rmbielby Sep 12, 2024
d1fe036
Few updates based on first PR comments
rmbielby Sep 13, 2024
3a6fad6
Adjusting col_names in examples
rmbielby Sep 13, 2024
9938749
Added example on hierarchical filtering
rmbielby Sep 13, 2024
0f0b042
Adjusting language around standardised filter sets...
rmbielby Sep 13, 2024
1a844e9
Merge branch 'main' into api-data-standards
rmbielby Sep 13, 2024
9a3cf40
Shifting tidy data examples around a little more and renaming examples
rmbielby Sep 13, 2024
188161c
Removing rogue extra table
rmbielby Sep 13, 2024
40ca12c
Adding example of pivoted data leading to unecessary levels of not ap…
rmbielby Sep 13, 2024
e723ac4
Updated title on badly pivoted data
rmbielby Sep 13, 2024
adc14bd
Merge branch 'main' into api-data-standards
rmbielby Sep 13, 2024
2f3c066
Removing extra table and adding location code limit
rmbielby Sep 13, 2024
a7d3fb6
Moving some text around due to paragraph line spacing not looking gre…
rmbielby Sep 13, 2024
82f8dc8
Merge branch 'api-data-standards' of https://github.com/dfe-analytica…
rmbielby Sep 13, 2024
40d38d4
Minor rewording
rmbielby Sep 13, 2024
69b29ac
More tweaking of text structure around examples
rmbielby Sep 13, 2024
1 change: 1 addition & 0 deletions _quarto.yml
Expand Up @@ -89,6 +89,7 @@ website:
- statistics-production/pub.qmd
- RAP/rap-statistics.qmd
- statistics-production/ud.qmd
- statistics-production/api-data-standards.qmd
- statistics-production/ees.qmd
- statistics-production/examples.qmd
- statistics-production/embedded-charts.qmd
3 changes: 3 additions & 0 deletions index.qmd
Expand Up @@ -49,6 +49,9 @@ We hope it can prove a useful community driven resource for everyone from the mo
[Open data standards](statistics-production/ud.html)
- Guidance on how to structure data files

[Statistics API data standards](statistics-production/api-data-standards.html)
- Guidance on the standards to meet for API data sets

[Explore education statistics (EES)](statistics-production/ees.html)
- Tips on using the explore education statistics service

262 changes: 262 additions & 0 deletions statistics-production/api-data-standards.qmd
@@ -0,0 +1,262 @@
---
title: "Statistics API Data Standards"
---

```{r include=FALSE}
```

<p class="text-muted">Guidance on how to structure data files specifically for the EES API</p>

---

## Introduction

The API gives analysts, both within the DfE and among external consumers and communicators of education statistics,
a way to access EES data programmatically. To keep the service fit for purpose, not all EES
data will be available via the API, and any that is must pass a higher bar for quality. In effect, API data
**must** meet all the criteria laid out in our [Open data standards guidance](../statistics-production/ud.qmd).

Whilst the EES data screener tests for a significant base level of data quality and consistency, some
additional criteria are either too awkward to test rigorously with the screener or are tested for but
only returned as warnings. Data intended for the EES API must pass all the base level screener tests, a number
of the tests that only return warnings, and manual inspection by the platform gatekeepers. The additional criteria are primarily:

- Strict tidy data structures - i.e. appropriate use of filters and indicators.
- Standardised filter col_names and items consistent with the harmonised standards.
- Standardised indicator col_names meeting the naming standards.
- Character limits for col_names and filter items.

Examples of data that does and doesn't meet the API data standards are provided in the following sections.

## Character limits for col_names and filter items

Character limits for fields in data uploaded to the API are:

::: {.table-responsive}

| Element | Character limit |
|---------------------------------|-----------------|
|Filter / indicator column names | 50 characters |
|Filter / indicator column labels | 80 characters |
|Filter items / location names | 120 characters |

: Character limits on column names, column labels and filter items.

:::
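
As a rough illustration of how these limits could be checked before upload, the following Python sketch flags any values that are too long. The dictionary keys and the `over_limit` function are hypothetical names chosen for this example and are not part of the EES screener. It also assumes that a value exactly at the limit is acceptable.

```python
# Limits taken from the table above; the key names are illustrative only.
CHARACTER_LIMITS = {
    "column_name": 50,
    "column_label": 80,
    "filter_item": 120,
}


def over_limit(values, element):
    """Return the values that are longer than the limit for the given element."""
    limit = CHARACTER_LIMITS[element]
    # Assumption: a value exactly at the limit is allowed.
    return [value for value in values if len(value) > limit]


col_names = ["sex", "pupil_count", "x" * 51]
too_long = over_limit(col_names, "column_name")
print(too_long)  # only the 51 character name is reported
```

The same function can be pointed at column labels or filter items by changing the `element` argument.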

## Tidy data structure

The key requirement for a tidy data structure is that filter items are not embedded in indicator col_names. Where
collections of related terms appear in indicator names (e.g. male, female, total), these
should be moved into a filter column, with the data pivoted accordingly.

All data uploaded to EES should be in a tidy data structure form, but this is more strictly regulated for data intended for use with the API. More information on building tidy data structures can be found in the [tidy data structure section](../statistics-production/ud.html#tidy-data-structure).

The following examples show how different data structures could be adapted.


### Example 1 - Three metrics with a single filter

#### Example of bad practice

::: {.table-responsive}

| school_count | pupil_count_male | pupil_count_female | pupil_count_total | pupil_percent_male | pupil_percent_female | pupil_percent_total |
|---------------|-------------------|---------------------|--------------------|---------------------|-----------------------|----------------------|
| 2 | 120 | 130 | 250 | 48 | 52 | 100 |

: Pupil counts and percentages in non-tidy format

:::

#### Example of good practice

::: {.table-responsive}

| sex | school_count | pupil_count | pupil_percent |
|--------------------|---------------|--------------|----------------|
| Male               | 2             | 120          | 48             |
| Female             | 2             | 130          | 52             |
| Total              | 2             | 250          | 100            |

: Pupil counts and percentages in tidy format

:::
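
The pivot described in this example can be sketched in Python as follows. It takes the single wide row from the bad practice table and moves the male / female / total suffixes into a `sex` filter column. The function name `pivot_by_sex`, and the assumption that the filter item is always the final underscore-separated part of the col_name, are illustrative only and not part of any EES tooling.

```python
# The single wide row from the bad practice table above.
wide_row = {
    "school_count": 2,
    "pupil_count_male": 120,
    "pupil_count_female": 130,
    "pupil_count_total": 250,
    "pupil_percent_male": 48,
    "pupil_percent_female": 52,
    "pupil_percent_total": 100,
}

SEX_ITEMS = ["male", "female", "total"]


def pivot_by_sex(row):
    """Move sex suffixes out of indicator names and into a `sex` filter column."""
    tidy_rows = []
    for item in SEX_ITEMS:
        tidy = {"sex": item.capitalize()}
        for col_name, value in row.items():
            # Assumption: the filter item is the last underscore-separated part.
            stem, _, suffix = col_name.rpartition("_")
            if suffix == item:
                tidy[stem] = value
            elif suffix not in SEX_ITEMS:
                # Columns without a sex suffix are repeated on every row.
                tidy[col_name] = value
        tidy_rows.append(tidy)
    return tidy_rows


for tidy_row in pivot_by_sex(wide_row):
    print(tidy_row)
```

Each output row carries one filter item and the three indicators `school_count`, `pupil_count` and `pupil_percent`.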

### Example 2 - Metrics with hierarchical filters

#### Example of bad practice

The following would not be accepted for publication via the API.

::: {.table-responsive}

| attendance |overall_absence | authorised_absence | unauthorised_absence | attendance_percent |overall_absence_percent | authorised_absence_percent | unauthorised_absence_percent |
|------------|----------------|--------------------|----------------------|------------|----------------|-------------------|----------------------|
| 180 | 20 | 12 | 8 | 90 | 10 | 6 | 4 |

: Attendance statistics in non-tidy format

:::

#### Example of good practice

The following would be accepted for publication via the API. In this case, creating a hierarchical filter combination allows a clear representation of the data.

::: {.table-responsive}

| attendance_status | attendance_type | session_count | session_percent |
|--------------------|----------------------|----------------|-----------------|
| Attendance | Total | 180 | 90 |
| Absence | Total | 20 | 10 |
| Absence | Authorised absence | 12 | 6 |
| Absence | Unauthorised absence | 8 | 4 |

: Attendance statistics in tidy format with hierarchical filters

:::
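
One property of a hierarchical filter like this is that the breakdown rows should add up to their parent Total row. The Python sketch below checks that for the table above. The `breakdown_gap` function is a hypothetical helper written for this example, not an existing screener test.

```python
# Session counts from the good practice table above.
rows = [
    {"attendance_status": "Attendance", "attendance_type": "Total", "session_count": 180},
    {"attendance_status": "Absence", "attendance_type": "Total", "session_count": 20},
    {"attendance_status": "Absence", "attendance_type": "Authorised absence", "session_count": 12},
    {"attendance_status": "Absence", "attendance_type": "Unauthorised absence", "session_count": 8},
]


def breakdown_gap(rows, status):
    """Return the Total for one attendance_status minus the sum of its breakdown rows."""
    matching = [row for row in rows if row["attendance_status"] == status]
    total = sum(r["session_count"] for r in matching if r["attendance_type"] == "Total")
    parts = sum(r["session_count"] for r in matching if r["attendance_type"] != "Total")
    return total - parts


print(breakdown_gap(rows, "Absence"))  # 0, as 12 + 8 matches the Total of 20
```

A status with no breakdown rows, such as Attendance here, simply returns its Total, so the check is only meaningful where lower levels exist.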

### Example 3 - Metrics with non-compatible filters

#### Example of bad practice

The following would not be accepted for publication via the API.

::: {.table-responsive}

| pupil_count_grade9to5 | pupil_count_grade9to4 | pupil_count_grade9to1 | pupil_percent_grade9to5 | pupil_percent_grade9to4 | pupil_percent_grade9to1 | progress8_score_male | progress8_score_female | progress8_score |attainment8_score_male | attainment8_score_female | attainment8_score |
|------------------------|------------------------|------------------------|--------------------------|--------------------------|-------------------------|----------------------|------------------------|-----------------|----------------------|------------------------|-----------------|
| 30                     | 40                     | 50                     | 60                       | 80                       | 100                     | 0.20                 | 0.21                   | 0.21            |0.09                  | 0.08                   | 0.08            |

: Attainment grade rates and scores in non-tidy format

:::

In this case, the metrics contain different types of values that are split by very different filters. Specifically, pupil counts and pupil percents are split by grade threshold, whereas the score-based metrics are split by sex. If we were to try to pivot this data into one file, it would lead to an unreasonably large number of cells with no valid entries (i.e. large numbers of z's). For example, pivoting might create something like the following table, which suffers from both a large number of not applicable cells and unnecessary duplication of data.

::: {.table-responsive}

| sex | grade_range |accountability_measure | pupil_count | pupil_percent | score_average |
|--------|--------------------|-----------------------|--------------|----------------|---------------|
| Total | Grades 9-5 | z | 30 | 60 | z |
| Total | Grades 9-4 | z | 40 | 80 | z |
| Total | Grades 9-1 | z | 50 | 100 | z |
| Total  | z                  | Progress 8            | 50           | 100            | 0.21          |
| Female | z                  | Progress 8            | 50           | 100            | 0.21          |
| Male   | z                  | Progress 8            | 50           | 100            | 0.20          |
| Total  | z                  | Attainment 8          | 50           | 100            | 0.08          |
| Female | z                  | Attainment 8          | 50           | 100            | 0.08          |
| Male   | z                  | Attainment 8          | 50           | 100            | 0.09          |

: Example of pivoted data showing excessive duplicated and not applicable fields.

:::


#### Example of good practice

The following would be accepted for publication via the API. In this case, splitting the data into separate data files is required in order to create tidy data structures.

::: {.table-responsive}

| grade_range | pupil_count | pupil_percent |
|--------------------|--------------|----------------|
| Grades 9-5 | 30 | 60 |
| Grades 9-4 | 40 | 80 |
| Grades 9-1 | 50 | 100 |

: Attainment grade rates in tidy format

:::

::: {.table-responsive}

| sex | accountability_measure | score_average |
|--------|------------------------|----------------|
| Female | Progress 8 | 0.21 |
| Male | Progress 8 | 0.20 |
| Total | Progress 8 | 0.21 |
| Female | Attainment 8 | 0.08 |
| Male | Attainment 8 | 0.09 |
| Total | Attainment 8 | 0.08 |

: Attainment scores in tidy format

:::
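
To illustrate why the split is preferable, the Python sketch below stacks the two tidy tables above back into a single file, filling every column that does not apply to a row with `z`, and then measures how much of the result is not applicable. This is a simplified stand-in for the rejected pivot rather than a reproduction of it, and `stack_with_z` and `z_share` are hypothetical helper names.

```python
# The two tidy tables from the good practice example above.
grade_rates = [
    {"grade_range": "Grades 9-5", "pupil_count": 30, "pupil_percent": 60},
    {"grade_range": "Grades 9-4", "pupil_count": 40, "pupil_percent": 80},
    {"grade_range": "Grades 9-1", "pupil_count": 50, "pupil_percent": 100},
]
scores = [
    {"sex": "Female", "accountability_measure": "Progress 8", "score_average": 0.21},
    {"sex": "Male", "accountability_measure": "Progress 8", "score_average": 0.20},
    {"sex": "Total", "accountability_measure": "Progress 8", "score_average": 0.21},
    {"sex": "Female", "accountability_measure": "Attainment 8", "score_average": 0.08},
    {"sex": "Male", "accountability_measure": "Attainment 8", "score_average": 0.09},
    {"sex": "Total", "accountability_measure": "Attainment 8", "score_average": 0.08},
]


def stack_with_z(*tables):
    """Stack tables into one, using 'z' for columns that a row does not have."""
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return [
        {name: row.get(name, "z") for name in columns}
        for table in tables
        for row in table
    ]


def z_share(rows):
    """Return the proportion of cells that contain 'z'."""
    cells = [value for row in rows for value in row.values()]
    return sum(value == "z" for value in cells) / len(cells)


combined = stack_with_z(grade_rates, scores)
print(z_share(combined))  # 0.5, half of the 54 cells are not applicable
```

Kept as two files, neither table contains any `z` cells at all.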

## Standardised filter col_names and items

The explore education statistics platforms team, alongside the data harmonisation champions group and publication teams, is developing a series of [standardised filters](../statistics-production/ud.html#common-harmonised-variables) that teams are required to use when creating data for the API. These are being built iteratively as more data is put forward for the API, so if the current standards don't cater for your data set, you can contribute to building the harmonised standards for others to follow.

The standards can be used to create individual filter columns or combined filters (i.e. breakdown_topic / breakdown).

Areas for which harmonised standards are currently available are:

- [establishment / school / provider characteristics](../statistics-production/ud.html#establishment-characteristics)
- [ethnicity](../statistics-production/ud.html#ethnicity)
- [sex and gender](../statistics-production/ud.html#sex-and-gender)
- [special educational needs](../statistics-production/ud.html#special-educational-needs)

Areas which are currently under development are:

- attainment metrics
- disadvantaged status
- free school meal status

We encourage contributions to and feedback on all of the above and any other filter topic.

### Examples of common non-standard filter col_names

::: {.table-responsive}

| Non-standard | Potential standard equivalents |
|-----------------------------|------------------------------------|
| ethnicity | ethnicity_major, ethnicity_minor |
| characteristic_sex | sex |
| school_type | establishment_type, establishment_type_group or education_phase |
| pupil_sen_status | sen_status |
| characteristic_primary_need | sen_primary_need |
| characteristic_topic | breakdown_topic, breakdown_topic_establishment |
| characteristic | breakdown, breakdown_establishment |

: Example non-standard col_names and their potential equivalents in the standardised framework.

:::
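
A lookup built from the table above gives a quick way to spot these names in an existing file. The Python sketch below is illustrative only: the `suggest_standard_names` function is a hypothetical helper, and the mapping covers just the examples listed here rather than the full harmonised standards.

```python
# Mapping copied from the table above; not an exhaustive list.
STANDARD_EQUIVALENTS = {
    "ethnicity": ["ethnicity_major", "ethnicity_minor"],
    "characteristic_sex": ["sex"],
    "school_type": ["establishment_type", "establishment_type_group", "education_phase"],
    "pupil_sen_status": ["sen_status"],
    "characteristic_primary_need": ["sen_primary_need"],
    "characteristic_topic": ["breakdown_topic", "breakdown_topic_establishment"],
    "characteristic": ["breakdown", "breakdown_establishment"],
}


def suggest_standard_names(col_names):
    """Return the known non-standard col_names with their potential equivalents."""
    return {
        name: STANDARD_EQUIVALENTS[name]
        for name in col_names
        if name in STANDARD_EQUIVALENTS
    }


suggestions = suggest_standard_names(["time_period", "characteristic_sex", "school_type"])
for old_name, candidates in suggestions.items():
    print(f"{old_name}: consider {', '.join(candidates)}")
```

Where several equivalents are listed, the right choice depends on what the column actually contains, so the output is a prompt for review rather than an automatic rename.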

## Standardised indicator col_names

Indicators should be named in line with the [indicator naming conventions set out in the open data standards](../statistics-production/ud.html#indicator-names).

### Examples of common non-standard indicator col_names

::: {.table-responsive}

| Non-standard | Potential standard equivalents |
|----------------------------------------|------------------------------------------|
| number_of_pupils | pupil_count |
| NumberOfLearners, NumLearners | pupil_count, learner_count |
| total_male, total_female | pupil_count (plus sex filter) |
| pt_SEN_support | pupil_percent (plus SEN status filter) |
| num_provider, num_providers | establishment_count |
| no_schools, num_schools, total_schools | establishment_count |
| num_inst, total_institutions, number_institutions, inst_count | establishment_count |

: Example non-standard indicator col_names and their potential equivalents in the standardised framework.

:::

2 changes: 1 addition & 1 deletion statistics-production/ud.qmd
Expand Up @@ -1260,7 +1260,7 @@ knitr::include_graphics("../images/change_date_format.PNG")
The indicators are the variables showing the measurements/statistics themselves, such as the number of pupils. These can be of different formats (e.g. text, numeric), although are numeric by default. The number of indicators will vary across publications and data files.

<div class="alert alert-dismissible alert-danger">
- <p> **Every variable in your dataset should have its own column, and each column should be a single data type**. E.g. do not create an indicator column called "pupils" that has both the number and percentage of pupils in it. Instead, create two separate columns for each measure.</p>
+ <p> **Every variable in your data set should have its own column, and each column should be a single data type**. E.g. do not create an indicator column called "pupils" that has both the number and percentage of pupils in it. Instead, create two separate columns for each measure.</p>
</div>

As an example, the number and percentage of pupil enrolments are the indicators in this dataset: