Merge remote-tracking branch 'origin/main' into nb/manipulation_function_basics
nathanrboyer committed Oct 10, 2024
2 parents d9864ba + 85815e4 commit efde542
Showing 16 changed files with 610 additions and 307 deletions.
9 changes: 9 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,9 @@
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "monthly"
labels:
- "dependencies"
- "no changelog"
16 changes: 10 additions & 6 deletions .github/workflows/ci.yml
@@ -22,31 +22,35 @@ jobs:
- os: windows-latest
version: '1'
arch: x86
- os: macos-latest
version: '1'
arch: aarch64
- os: ubuntu-latest
version: 'nightly'
arch: x64
allow_failure: true
steps:
- uses: actions/checkout@v2
- uses: julia-actions/setup-julia@v1
- uses: actions/checkout@v4
- uses: julia-actions/setup-julia@v2
with:
version: ${{ matrix.version }}
arch: ${{ matrix.arch }}
- uses: julia-actions/cache@v1
- uses: julia-actions/cache@v2
- uses: julia-actions/julia-buildpkg@v1
- uses: julia-actions/julia-runtest@v1
env:
JULIA_NUM_THREADS: 4,1
- uses: julia-actions/julia-processcoverage@v1
- uses: codecov/codecov-action@v1
- uses: codecov/codecov-action@v4
with:
file: lcov.info
token: ${{ secrets.CODECOV_TOKEN }}
docs:
name: Documentation
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: julia-actions/cache@v1
- uses: actions/checkout@v4
- uses: julia-actions/cache@v2
- uses: julia-actions/julia-buildpkg@latest
- uses: julia-actions/julia-docdeploy@latest
env:
10 changes: 4 additions & 6 deletions Project.toml
@@ -18,13 +18,12 @@ PrecompileTools = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
PrettyTables = "08abe8d2-0d0c-5749-adfa-8a2ac140af0d"
Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
REPL = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"
Reexport = "189a3867-3050-52da-a836-e630ba90ab69"
SentinelArrays = "91c51154-3ec4-41a3-a24f-3f23e20d615c"
SortingAlgorithms = "a2af1166-a08f-5f64-846c-94a0d3cef48c"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
TableTraits = "3783bdb8-4a98-5b6b-af9a-565f29a5fe9c"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
SentinelArrays = "91c51154-3ec4-41a3-a24f-3f23e20d615c"
Unicode = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

[compat]
@@ -46,6 +45,7 @@ Reexport = "1"
SentinelArrays = "1.2"
ShiftedArrays = "1, 2"
SortingAlgorithms = "0.3, 1"
Statistics = "1"
TableTraits = "0.4, 1"
Tables = "1.9.0"
Unitful = "1"
@@ -58,12 +58,10 @@ DataValues = "e7dc6d0d-1eca-5fa6-8ad6-5aecde8b7ea5"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
OffsetArrays = "6fe1bfb0-de20-5000-8ca7-80f57d26f881"
ShiftedArrays = "1277b4bf-5013-50f5-be3d-901d8477a67a"
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
Unitful = "1986cc42-f94f-5a68-af5c-568840ba703d"
ShiftedArrays = "1277b4bf-5013-50f5-be3d-901d8477a67a"

[targets]
test = ["CategoricalArrays", "Combinatorics", "DataValues",
"Dates", "Logging", "OffsetArrays", "Test",
"Unitful", "ShiftedArrays", "SparseArrays"]
test = ["CategoricalArrays", "Combinatorics", "DataValues", "Dates", "Logging", "OffsetArrays", "Test", "Unitful", "ShiftedArrays", "SparseArrays"]
2 changes: 1 addition & 1 deletion README.md
@@ -1,7 +1,7 @@
DataFrames.jl
=============

[![Coverage Status](http://codecov.io/github/JuliaData/DataFrames.jl/coverage.svg?branch=main)](http://codecov.io/github/JuliaData/DataFrames.jl?branch=main)
[![codecov](https://codecov.io/gh/JuliaData/DataFrames.jl/graph/badge.svg?token=DHYzeKcumV)](https://codecov.io/gh/JuliaData/DataFrames.jl)
[![CI Testing](https://github.com/JuliaData/DataFrames.jl/workflows/CI/badge.svg)](https://github.com/JuliaData/DataFrames.jl/actions?query=workflow%3ACI+branch%3Amain)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7632427.svg)](https://doi.org/10.5281/zenodo.7632427)

1 change: 1 addition & 0 deletions docs/Project.toml
@@ -9,6 +9,7 @@ Missings = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28"
Query = "1a8c2f83-1ff3-5112-b086-8aa67b057ba1"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
TidierData = "fe2206b3-d496-4ee9-a338-6a095c4ece80"

[compat]
Documenter = "1"
139 changes: 139 additions & 0 deletions docs/src/man/querying_frameworks.md
@@ -8,6 +8,145 @@ DataFramesMeta.jl, DataFrameMacros.jl and Query.jl. They implement a functionali
These frameworks are designed both to make it easier for new users to start working with data frames in Julia
and to allow advanced users to write more compact code.

## TidierData.jl
[TidierData.jl](https://tidierorg.github.io/TidierData.jl/latest/), part of
the [Tidier](https://tidierorg.github.io/Tidier.jl/dev/) ecosystem, is a macro-based
data analysis interface that wraps DataFrames.jl. The instructions below are for version
0.16.0 of TidierData.jl.

First, install the TidierData.jl package:

```julia
using Pkg
Pkg.add("TidierData")
```

TidierData.jl enables clean, readable, and fast code for all major data transformation
functions including
[aggregating](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/summarize/),
[pivoting](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/pivots/),
[nesting](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/nesting/),
and [joining](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/joins/)
data frames. TidierData re-exports `DataFrame` from DataFrames.jl, `@chain` from Chain.jl, and
Statistics.jl to streamline data operations.

TidierData.jl is heavily inspired by the `dplyr` and `tidyr` R packages (part of the R
`tidyverse`), which it aims to implement using pure Julia by wrapping DataFrames.jl. While
TidierData.jl borrows conventions from the `tidyverse`, it is important to note that the
`tidyverse` itself is often not considered idiomatic R code. TidierData.jl brings
data analysis conventions from `tidyverse` into Julia to have the best of both worlds:
tidy syntax and the speed and flexibility of the Julia language.

TidierData.jl has two major differences from other macro-based packages. First, TidierData.jl
uses tidy expressions. An example of a tidy expression is `a = mean(b)`, where `b` refers
to an existing column in the data frame, and `a` refers to either a new or existing column.
Referring to variables outside of the data frame requires prefixing variables with `!!`.
For example, `a = mean(!!b)` refers to a variable `b` outside the data frame. Second,
TidierData.jl aims to make broadcasting mostly invisible through
[auto-vectorization](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/). TidierData.jl currently uses a lookup table to decide which functions not to
vectorize; all other functions are automatically vectorized. This allows
concise expressions: `@mutate(df, a = a - mean(a))` transforms the `a` column
by subtracting the column mean from each value. Behind the scenes, the right-hand
expression is converted to `a .- mean(a)` because `mean()` is in the lookup table as a
function that should not be vectorized. Take a look at the
[auto-vectorization](https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/) documentation for details.
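
To make the rewrite concrete, here is a minimal sketch in plain Julia of the broadcast expression that `a - mean(a)` is described as being converted to. It uses only the Statistics standard library, not TidierData.jl itself, and the vector `a` and the name `centered` are made up for illustration.

```julia
using Statistics

# Hypothetical column values, standing in for the `a` column of a data frame.
a = [1.0, 2.0, 3.0]

# `mean` reduces the whole column to a single number and is not broadcast;
# only the subtraction is applied elementwise.
centered = a .- mean(a)
```

Writing `mean.(a)` instead would take the mean of each element separately, which is why reducing functions need to be excluded from vectorization.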

One major benefit of combining tidy expressions with auto-vectorization is that
TidierData.jl code (which uses DataFrames.jl as its backend) can work directly on
databases using [TidierDB.jl](https://github.com/TidierOrg/TidierDB.jl),
which converts tidy expressions into SQL, supporting DuckDB and several other backends.

```jldoctest tidierdata
julia> using TidierData
julia> df = DataFrame(
name = ["John", "Sally", "Roger"],
age = [54.0, 34.0, 79.0],
children = [0, 2, 4]
)
3×3 DataFrame
Row │ name age children
│ String Float64 Int64
─────┼───────────────────────────
1 │ John 54.0 0
2 │ Sally 34.0 2
3 │ Roger 79.0 4
julia> @chain df begin
@filter(children != 2)
@select(name, num_children = children)
end
2×2 DataFrame
Row │ name num_children
│ String Int64
─────┼──────────────────────
1 │ John 0
2 │ Roger 4
```

Below are examples showcasing `@group_by` with `@summarize` or `@mutate`, analogous to the split-apply-combine pattern.

```jldoctest tidierdata
julia> df = DataFrame(
groups = repeat('a':'e', inner = 2),
b_col = 1:10,
c_col = 11:20,
d_col = 111:120
)
10×4 DataFrame
Row │ groups b_col c_col d_col
│ Char Int64 Int64 Int64
─────┼─────────────────────────────
1 │ a 1 11 111
2 │ a 2 12 112
3 │ b 3 13 113
4 │ b 4 14 114
5 │ c 5 15 115
6 │ c 6 16 116
7 │ d 7 17 117
8 │ d 8 18 118
9 │ e 9 19 119
10 │ e 10 20 120
julia> @chain df begin
@filter(b_col > 2)
@group_by(groups)
@summarise(median_b = median(b_col),
across((b_col:d_col), mean))
end
4×5 DataFrame
Row │ groups median_b b_col_mean c_col_mean d_col_mean
│ Char Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────
1 │ b 3.5 3.5 13.5 113.5
2 │ c 5.5 5.5 15.5 115.5
3 │ d 7.5 7.5 17.5 117.5
4 │ e 9.5 9.5 19.5 119.5
julia> @chain df begin
@filter(b_col > 4 && c_col <= 18)
@group_by(groups)
@mutate(
new_col = b_col + maximum(d_col),
new_col2 = c_col - maximum(d_col),
new_col3 = case_when(c_col >= 18 => "high",
c_col > 15 => "medium",
true => "low"))
@select(starts_with("new"))
@ungroup # required because `@mutate` does not ungroup
end
4×4 DataFrame
Row │ groups new_col new_col2 new_col3
│ Char Int64 Int64 String
─────┼─────────────────────────────────────
1 │ c 121 -101 low
2 │ c 122 -100 medium
3 │ d 125 -101 medium
4 │ d 126 -100 high
```

For more examples, please visit the [TidierData.jl](https://tidierorg.github.io/TidierData.jl/latest/) documentation.

## DataFramesMeta.jl

The [DataFramesMeta.jl](https://github.com/JuliaStats/DataFramesMeta.jl) package
95 changes: 93 additions & 2 deletions docs/src/man/working_with_dataframes.md
@@ -812,14 +812,21 @@ julia> df = DataFrame(A=1:4, B=4.0:-1.0:1.0)
3 │ 3 2.0
4 │ 4 1.0
julia> combine(df, names(df) .=> sum)
julia> combine(df, All() .=> sum)
1×2 DataFrame
Row │ A_sum B_sum
│ Int64 Float64
─────┼────────────────
1 │ 10 10.0
julia> combine(df, names(df) .=> sum, names(df) .=> prod)
julia> combine(df, All() .=> sum, All() .=> prod)
1×4 DataFrame
Row │ A_sum B_sum A_prod B_prod
│ Int64 Float64 Int64 Float64
─────┼─────────────────────────────────
1 │ 10 10.0 24 24.0
julia> combine(df, All() .=> [sum prod]) # the same using 2-dimensional broadcasting
1×4 DataFrame
Row │ A_sum B_sum A_prod B_prod
│ Int64 Float64 Int64 Float64
@@ -830,6 +837,90 @@ julia> combine(df, names(df) .=> sum, names(df) .=> prod)
If you would prefer the result to have the same number of rows as the source
data frame, use `select` instead of `combine`.
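
The two-dimensional broadcasting form may be easier to follow once you see what the broadcast produces before `combine` receives it. The sketch below uses plain Julia and an explicit vector of column names (as in the `names(df) .=> sum` form) rather than `All()`, which DataFrames.jl expands itself; the variable names `cols`, `ops` and `flat` are illustrative only.

```julia
cols = ["A", "B"]          # column names, as names(df) would return them

# A vector broadcast against a 1×2 matrix yields a 2×2 matrix of pairs.
ops = cols .=> [sum prod]

# Column-major order lists both `sum` pairs first, then both `prod` pairs,
# matching the A_sum, B_sum, A_prod, B_prod order of the result columns.
flat = vec(ops)
```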

In the remainder of this section we will discuss more advanced topics related
to the operation specification syntax, so you may decide to skip them if you
want to focus on the most common usage patterns.

A `DataFrame` can store values of any type in its columns. For example,
below we show how one can store a `Tuple`:

```
julia> df2 = combine(df, All() .=> extrema)
1×2 DataFrame
Row │ A_extrema B_extrema
│ Tuple… Tuple…
─────┼───────────────────────
1 │ (1, 4) (1.0, 4.0)
```

Later you might want to expand the tuples into separate columns storing the computed
minima and maxima. This can be achieved by passing multiple column names for the output.
Here is an example of how this can be done by writing the column names by hand for a single
input column:

```
julia> combine(df2, "A_extrema" => identity => ["A_min", "A_max"])
1×2 DataFrame
Row │ A_min A_max
│ Int64 Int64
─────┼──────────────
1 │ 1 4
```

You can extend this to handle all columns in `df2` using broadcasting:

```
julia> combine(df2, All() .=> identity .=> [["A_min", "A_max"], ["B_min", "B_max"]])
1×4 DataFrame
Row │ A_min A_max B_min B_max
│ Int64 Int64 Float64 Float64
─────┼────────────────────────────────
1 │ 1 4 1.0 4.0
```

This approach works, but can be improved. Instead of writing all the column names
manually, we can use a function to specify target column names
based on source column names:

```
julia> combine(df2, All() .=> identity .=> c -> first(c) .* ["_min", "_max"])
1×4 DataFrame
Row │ A_min A_max B_min B_max
│ Int64 Int64 Float64 Float64
─────┼────────────────────────────────
1 │ 1 4 1.0 4.0
```

Note that in this example we needed to pass `identity` explicitly since with
`All() => (c -> first(c) .* ["_min", "_max"])` the right-hand side would be
treated as a transformation and not as a rule for generating target column names.
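
To see what the renaming function does on its own, the sketch below applies it to a column name in plain Julia, outside of `combine`. The names `target_names` and `strip_suffix`, and the column name `"price_extrema"`, are hypothetical and only serve the illustration.

```julia
# The renaming function from the example above, given a name for illustration.
target_names = c -> first(c) .* ["_min", "_max"]

# `first` on a string returns its first character, so "A_extrema" becomes 'A'.
a_names = target_names("A_extrema")

# Hypothetical variant for longer column names: strip the suffix instead of
# keeping only the first character.
strip_suffix = c -> replace(c, "_extrema" => "") .* ["_min", "_max"]
price_names = strip_suffix("price_extrema")
```

Taking `first(c)` is enough here only because the source columns are named with a single letter; for longer names a suffix-stripping variant like the second function is one possible alternative.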

You might want to perform the transformation of the source data frame into the result
we have just shown in one step. This can be achieved with the following expression:

```
julia> combine(df, All() .=> Ref∘extrema .=> c -> c .* ["_min", "_max"])
1×4 DataFrame
Row │ A_min A_max B_min B_max
│ Int64 Int64 Float64 Float64
─────┼────────────────────────────────
1 │ 1 4 1.0 4.0
```

Note that in this case we needed to add a `Ref` call in the `Ref∘extrema` operation specification.
Without `Ref`, `combine` iterates the contents of the value returned by the operation specification function,
which in our case is a tuple of numbers, and tries to expand it assuming that each produced value represents one row,
so one gets an error:

```
julia> combine(df, All() .=> extrema .=> [c -> c .* ["_min", "_max"]])
ERROR: ArgumentError: 'Tuple{Int64, Int64}' iterates 'Int64' values,
which doesn't satisfy the Tables.jl `AbstractRow` interface
```

Note that we used `Ref` as it is the container typically used in DataFrames.jl when one
wants to store one row. In general, however, it could be another iterator (e.g. a tuple).
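
As a rough illustration in plain Julia, and not a description of DataFrames.jl internals, the sketch below contrasts the tuple returned by `extrema` with the same tuple wrapped by `Ref`. The variable names are illustrative only.

```julia
# `extrema` returns a tuple; iterating it yields two separate values.
bounds = extrema(1:4)

# Composing with `Ref` wraps that tuple in a one-element container.
wrapped = (Ref ∘ extrema)(1:4)

n_plain = length(bounds)     # number of values produced by iterating the tuple
n_wrapped = length(wrapped)  # a `Ref` always holds exactly one value
inner = wrapped[]            # unwrap to recover the original tuple
```

Because the wrapped value iterates as a single element, the whole tuple can be treated as one row and then split across the target columns.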

## Handling of Columns Stored in a `DataFrame`

Functions that transform a `DataFrame` to produce a
2 changes: 1 addition & 1 deletion src/DataFrames.jl
@@ -1,6 +1,6 @@
module DataFrames

using Statistics, Printf, REPL
using Statistics, Printf
using Reexport, SortingAlgorithms, Compat, Unicode, PooledArrays
@reexport using Missings, InvertedIndices
using Base.Sort, Base.Order, Base.Iterators, Base.Threads