From 5785c72b960a979a5aad8e86411bf96749f9abe1 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Tue, 18 Jul 2023 15:50:24 -0400 Subject: [PATCH 01/30] Initial commit --- docs/src/man/basics.md | 1221 ++++++++++++++++++++++++++++++++++++++-- 1 file changed, 1183 insertions(+), 38 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 9ddede8cf..ad68bf691 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1565,40 +1565,1187 @@ julia> german[Not(5), r"S"] 984 rows omitted ``` -## Basic Usage of Transformation Functions - -In DataFrames.jl we have five functions that we can be used to perform -transformations of columns of a data frame: - -- `combine`: creates a new data frame populated with columns that are results of - transformation applied to the source data frame columns, potentially combining - its rows; -- `select`: creates a new data frame that has the same number of rows as the - source data frame populated with columns that are results of transformations - applied to the source data frame columns; -- `select!`: the same as `select` but updates the passed data frame in place; -- `transform`: the same as `select` but keeps the columns that were already - present in the data frame (note though that these columns can be potentially - modified by the transformation passed to `transform`); -- `transform!`: the same as `transform` but updates the passed data frame in - place. - -The fundamental ways to specify a transformation are: - -- `source_column => transformation => target_column_name`; In this scenario the - `source_column` is passed as an argument to `transformation` function and - stored in `target_column_name` column. -- `source_column => transformation`; In this scenario we apply the - transformation function to `source_column` and the target column names is - automatically generated. -- `source_column => target_column_name` renames the `source_column` to - `target_column_name`. 
-- `source_column` just keep the source column as is in the result without any - transformation; - -These rules are typically called transformation mini-language. - -Let us move to the examples of application of these rules +## Basic Usage of Manipulation Functions + +In DataFrames.jl there are seven functions that can be used +to manipulate data frame columns: + +| Function | Memory Usage | Column Retention | Row Retention | +| ------------ | -------------------------------- | -------------------------------------------- | ------------------------------------------------- | +| `transform` | Creates a new data frame. | Retains both source and manipulated columns. | Retains same number of rows as source data frame. | +| `transform!` | Modifies an existing data frame. | Retains both source and manipulated columns. | Retains same number of rows as source data frame. | +| `select` | Creates a new data frame. | Retains only manipulated columns. | Retains same number of rows as source data frame. | +| `select!` | Modifies an existing data frame. | Retains only manipulated columns. | Retains same number of rows as source data frame. | +| `subset` | Creates a new data frame. | Retains only source columns. | Number of rows is determined by the manipulation. | +| `subset!` | Modifies an existing data frame. | Retains only source columns. | Number of rows is determined by the manipulation. | +| `combine` | Creates a new data frame. | Retains only manipulated columns. | Number of rows is determined by the manipulation. | + +### Constructing Operation Pairs +All of the functions above use the same syntax which is commonly +`manipulation_function(dataframe, operation)`. 
+The `operation` argument is a `Pair` which defines the +operation to be applied to the source `dataframe`, +and it can take any of the following common forms explained below: + +`source_column_selector` +: selects source column(s) without manipulating or renaming them + +`source_column_selector => operation_function` +: passes source column(s) as arguments to a function +and automatically names the resulting column(s) + +`source_column_selector => operation_function => new_column_names` +: passes source column(s) as arguments to a function +and names the resulting column(s) `new_column_names` + +`source_column_selector => new_column_names` +: renames a source column, +or splits a column containing collection elements into multiple new columns + +!!! Note + The `source_column_selector` + and the `source_column_selector => new_column_names` operation forms + are not available for the `subset` and `subset!` manipulation functions. + +#### `source_column_selector` +Inside an `operation`, `source_column_selector` is usually a column name +or column index which identifies a data frame column. +`source_column_selector` may be used as the entire `operation` +with `select` or `select!` to isolate or reorder columns. + +```julia +julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = [7, 8, 9]) +3×3 DataFrame + Row │ a b c + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 7 + 2 │ 2 5 8 + 3 │ 3 6 9 + +julia> select(df, :b) +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 + +julia> select(df, "b") +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 + +julia> select(df, 2) +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 +``` + +`source_column_selector` may also be a collection of columns such as a vector, +a [regular expression](https://docs.julialang.org/en/v1/manual/strings/#Regular-Expressions), +a `Not`, `Between`, `All`, or `Cols` expression, +or a `:`. 
+See the [Indexing](@ref) API for the full list of possible values with references. + +!!! Note + The Julia parser sometimes prevents `:` from being used by itself. + `ERROR: syntax: whitespace not allowed after ":" used for quoting` + means your `:` must be wrapped in either `(:)` or `Cols(:)` + to be properly interpreted. + +```julia +julia> df = DataFrame( + id = [1, 2, 3], + first_name = ["José", "Emma", "Nathan"], + last_name = ["Garcia", "Marino", "Boyer"], + age = [61, 24, 33] + ) +3×4 DataFrame + Row │ id first_name last_name age + │ Int64 String String Int64 +─────┼───────────────────────────────────── + 1 │ 1 José Garcia 61 + 2 │ 2 Emma Marino 24 + 3 │ 3 Nathan Boyer 33 + +julia> select(df, [:last_name, :first_name]) +3×2 DataFrame + Row │ last_name first_name + │ String String +─────┼─────────────────────── + 1 │ Garcia José + 2 │ Marino Emma + 3 │ Boyer Nathan + +julia> select(df, r"name") +3×2 DataFrame + Row │ first_name last_name + │ String String +─────┼─────────────────────── + 1 │ José Garcia + 2 │ Emma Marino + 3 │ Nathan Boyer + +julia> select(df, Not(:id)) +3×3 DataFrame + Row │ first_name last_name age + │ String String Int64 +─────┼────────────────────────────── + 1 │ José Garcia 61 + 2 │ Emma Marino 24 + 3 │ Nathan Boyer 33 + +julia> select(df, Between(2,4)) +3×3 DataFrame + Row │ first_name last_name age + │ String String Int64 +─────┼────────────────────────────── + 1 │ José Garcia 61 + 2 │ Emma Marino 24 + 3 │ Nathan Boyer 33 +``` + +`AsTable(source_column_selector)` is a special `source_column_selector` +that can be used to select multiple columns into a single `NamedTuple`. +This is not useful on its own, so the function of this selector +will be explained in the next section. + + +#### `operation_function` +Inside an `operation` pair, `operation_function` is a function +which operates on data frame columns passed as vectors. 
+When multiple columns are selected by `source_column_selector`, +the `operation_function` will receive the columns as multiple positional arguments +in the order they were selected, e.g. `f(column1, column2, column3)`. + +```julia +julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 4]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 4 + +julia> combine(df, :a => sum) +1×1 DataFrame + Row │ a_sum + │ Int64 +─────┼─────── + 1 │ 6 + +julia> transform(df, :b => maximum) # `transform` and `select` copy result to all rows +3×3 DataFrame + Row │ a b b_maximum + │ Int64 Int64 Int64 +─────┼───────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 5 + 3 │ 3 4 5 + +julia> transform(df, [:b, :a] => -) # vector subtraction is okay +3×3 DataFrame + Row │ a b b_a_- + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 3 + 2 │ 2 5 3 + 3 │ 3 4 1 + +julia> transform(df, [:a, :b] => *) # vector multiplication is not defined +ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64}) +``` + +Don't worry! There is a quick fix for the previous error. +If you want to apply a function to each element in a column +instead of to the entire column vector, +then you can wrap your element-wise function in `ByRow` like +`ByRow(my_elementwise_function)`. +This will apply `my_elementwise_function` to every element in the column +and then collect the results back into a vector. 
+ +```julia +julia> transform(df, [:a, :b] => ByRow(*)) +3×3 DataFrame + Row │ a b a_b_* + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 4 + 2 │ 2 5 10 + 3 │ 3 4 12 + +julia> transform(df, Cols(:) => ByRow(max)) +3×3 DataFrame + Row │ a b a_b_max + │ Int64 Int64 Int64 +─────┼─────────────────────── + 1 │ 1 4 4 + 2 │ 2 5 5 + 3 │ 3 4 4 + +julia> f(x) = x + 1 +f (generic function with 1 method) + +julia> transform(df, :a => ByRow(f)) +3×3 DataFrame + Row │ a b a_f + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 2 + 2 │ 2 5 3 + 3 │ 3 4 4 +``` + +Alternatively, you may just want to define the function itself so it +[broadcasts](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) +over vectors. + +```julia +julia> g(x) = x .+ 1 +g (generic function with 1 method) + +julia> transform(df, :a => g) +3×3 DataFrame + Row │ a b a_g + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 2 + 2 │ 2 5 3 + 3 │ 3 4 4 +``` + +[Anonymous functions](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions) +are a convenient way to define and use an `operation_function` +all within the manipulation function call. + +```julia +julia> select(df, :a => ByRow(x -> x + 1)) +3×1 DataFrame + Row │ a_function + │ Int64 +─────┼──────────── + 1 │ 2 + 2 │ 3 + 3 │ 4 + +julia> transform(df, [:a, :b] => ByRow((x, y) -> 2x + y)) +3×3 DataFrame + Row │ a b a_b_function + │ Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 4 6 + 2 │ 2 5 9 + 3 │ 3 4 10 + +julia> subset(df, :b => ByRow(x -> x < 5)) +2×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 3 4 + +julia> subset(df, :b => ByRow(<(5))) # shorter version of the previous +2×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 3 4 +``` + +!!! Note + `operation_functions` within `subset` or `subset!` function calls + must return a boolean vector. 
+ `true` elements in the boolean vector will determine + which rows are retained in the resulting data frame. + +As demonstrated above, `DataFrame` columns are usually passed +from `source_column_selector` to `operation_function` as one or more +vector arguments. +However, when `AsTable(source_column_selector)` is used, +the selected columns are collected and passed as a single `NamedTuple` +to `operation_function`. + +This is often useful when your `operation_function` is defined to operate +on a single collection argument rather than on multiple positional arguments. +The distinction is somewhat similar to the difference between the built-in +`min` and `minimum` functions. +`min` is defined to find the minimum value among multiple positional arguments, +while `minimum` is defined to find the minimum value +among the elements of a single collection argument. + +```julia +julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 2:-1:1) +2×4 DataFrame + Row │ a b c d + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 3 5 2 + 2 │ 2 4 6 1 + +julia> select(df, Cols(:) => ByRow(min)) # min works on multiple arguments +2×1 DataFrame + Row │ a_b_etc_min + │ Int64 +─────┼───────────── + 1 │ 1 + 2 │ 1 + +julia> select(df, AsTable(:) => ByRow(minimum)) # minimum works on a collection +2×1 DataFrame + Row │ a_b_etc_minimum + │ Int64 +─────┼───────────────── + 1 │ 1 + 2 │ 1 + +julia> select(df, [:a,:b] => ByRow(+)) # `+` works on multiple arguments +2×1 DataFrame + Row │ a_b_+ + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 6 + +julia> select(df, AsTable([:a,:b]) => ByRow(sum)) # `sum` works on a collection +2×1 DataFrame + Row │ a_b_sum + │ Int64 +─────┼───────── + 1 │ 4 + 2 │ 6 + +julia> using Statistics # contains the `mean` function + +julia> select(df, AsTable(Between(:b, :d)) => ByRow(mean)) +2×1 DataFrame + Row │ b_c_d_mean + │ Float64 +─────┼──────────── + 1 │ 3.33333 + 2 │ 3.66667 +``` + +`AsTable` can also be used to pass columns to a function which
operates +on fields of a `NamedTuple`. + +```julia +julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 7:8) +2×4 DataFrame + Row │ a b c d + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 3 5 7 + 2 │ 2 4 6 8 + +julia> f(nt) = nt.a + nt.d +f (generic function with 1 method) + +julia> transform(df, AsTable(:) => ByRow(f)) +2×5 DataFrame + Row │ a b c d a_b_etc_f + │ Int64 Int64 Int64 Int64 Int64 +─────┼─────────────────────────────────────── + 1 │ 1 3 5 7 8 + 2 │ 2 4 6 8 10 +``` + +As demonstrated above, +in the `source_column_selector => operation_function` operation pair form, +the results of an operation will be placed into a new column with an +automatically-generated name based on the operation; +the new column name will be the `operation_function` name +appended to the source column name(s) with an underscore. + +This automatic column naming behavior can be avoided in two ways. +First, the operation result can be placed back into the original column +with the original column name by switching the keyword argument `renamecols` +from its default value (`true`) to `renamecols=false`. + +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :a => ByRow(x->x+10), renamecols=false) # add 10 in-place +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 11 5 + 2 │ 12 6 + 3 │ 13 7 + 4 │ 14 8 +``` + +The second method to avoid the default manipulation column naming is to +specify your own `new_column_names`. + +#### `new_column_names` + +`new_column_names` can be included at the end of an `operation` pair to specify +the name of the new column(s). +`new_column_names` may be a symbol or a string. 
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, Cols(:) => ByRow(+) => :c) +4×3 DataFrame + Row │ a b c + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, Cols(:) => ByRow(+) => "a+b") +4×3 DataFrame + Row │ a b a+b + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, :a => ByRow(x->x+10) => "a+10") +4×3 DataFrame + Row │ a b a+10 + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 11 + 2 │ 2 6 12 + 3 │ 3 7 13 + 4 │ 4 8 14 +``` + +The `source_column_selector => new_column_names` operation form +can be used to rename columns without an intermediate function. +However, there are `rename` and `rename!` functions, +which accept the same syntax, +that tend to be more useful for this operation. + +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :a => :α) # adds column α +4×3 DataFrame + Row │ a b α + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 1 + 2 │ 2 6 2 + 3 │ 3 7 3 + 4 │ 4 8 4 + +julia> select(df, :a => :α) # retains only column α +4×1 DataFrame + Row │ α + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 + 4 │ 4 + +julia> rename(df, :a => :α) # renames column a to α +4×2 DataFrame + Row │ α b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 +``` + +Additionally, in the +`source_column_selector => operation_function => new_column_names` operation form, +`new_column_names` may be a renaming function which operates on a string +to create the destination column names programmatically.
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> add_prefix(s) = "new_" * s +add_prefix (generic function with 1 method) + +julia> transform(df, :a => (x -> 10 .* x) => add_prefix) # with named renaming function +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 10 + 2 │ 2 6 20 + 3 │ 3 7 30 + 4 │ 4 8 40 + +julia> transform(df, :a => (x -> 10 .* x) => (s -> "new_" * s)) # with anonymous renaming function +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 10 + 2 │ 2 6 20 + 3 │ 3 7 30 + 4 │ 4 8 40 +``` + +Note that a renaming function will not work in the +`source_column_selector => new_column_names` operation form +because a function in the second element of the operation pair is assumed to take +the `source_column_selector => operation_function` operation form. +To work around this limitation, use the +`source_column_selector => operation_function => new_column_names` operation form +with `identity` as the `operation_function`. + +```julia +julia> transform(df, :a => add_prefix) +ERROR: MethodError: no method matching *(::String, ::Vector{Int64}) + +julia> transform(df, :a => identity => add_prefix) +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 1 + 2 │ 2 6 2 + 3 │ 3 7 3 + 4 │ 4 8 4 +``` + +!!! Note + Renaming functions are not currently supported within `Pair` arguments + to the `rename` and `rename!` functions. + However, renaming functions can be applied to an entire data frame + with the `rename(renaming_function, dataframe)` method. + +In the `source_column_selector => new_column_names` operation form, +only a single source column may be selected per operation, +so why is `new_column_names` plural? 
+It is possible to split the data contained inside a single column +into multiple new columns by supplying a vector of strings or symbols +as `new_column_names`. + +```julia +julia> df = DataFrame(data = [(1,2), (3,4)]) # vector of tuples +2×1 DataFrame + Row │ data + │ Tuple… +─────┼──────── + 1 │ (1, 2) + 2 │ (3, 4) + +julia> transform(df, :data => [:first, :second]) # manual naming +2×3 DataFrame + Row │ data first second + │ Tuple… Int64 Int64 +─────┼─────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` + +This kind of data splitting can even be done automatically with `AsTable`. + +```julia +julia> transform(df, :data => AsTable) # default automatic naming with tuples +2×3 DataFrame + Row │ data x1 x2 + │ Tuple… Int64 Int64 +─────┼────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` + +If a data frame column contains `NamedTuple`s, +then `AsTable` will preserve the field names. +```julia +julia> df = DataFrame(data = [(a=1,b=2), (a=3,b=4)]) # vector of named tuples +2×1 DataFrame + Row │ data + │ NamedTup… +─────┼──────────────── + 1 │ (a = 1, b = 2) + 2 │ (a = 3, b = 4) + +julia> transform(df, :data => AsTable) # keeps names from named tuples +2×3 DataFrame + Row │ data a b + │ NamedTup… Int64 Int64 +─────┼────────────────────────────── + 1 │ (a = 1, b = 2) 1 2 + 2 │ (a = 3, b = 4) 3 4 +``` + +!!! Note + To pack multiple columns into a single column of `NamedTuple`s + (reverse of the above operation) + apply the `identity` function with `ByRow`, e.g. + `transform(df, AsTable([:a, :b]) => ByRow(identity) => :data)`. + +Renaming functions also work for multi-column transformations, +but they must operate on a vector of strings.
+ +```julia +julia> df = DataFrame(data = [(1,2), (3,4)]) +2×1 DataFrame + Row │ data + │ Tuple… +─────┼──────── + 1 │ (1, 2) + 2 │ (3, 4) + +julia> new_names(v) = ["primary ", "secondary "] .* v +new_names (generic function with 1 method) + +julia> transform(df, :data => identity => new_names) +2×3 DataFrame + Row │ data primary data secondary data + │ Tuple… Int64 Int64 +─────┼────────────────────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` + +#### Multiple Operations per Manipulation +All data frame manipulation functions can accept multiple `operation` pairs +at once using any of the following methods: +- `manipulation_function(dataframe, operation1, operation2)` : multiple arguments +- `manipulation_function(dataframe, [operation1, operation2])` : vector argument +- `manipulation_function(dataframe, [operation1 operation2])` : matrix argument + +Passing multiple operations is especially useful for the `select`, `select!`, +and `combine` manipulation functions, +since they only retain columns which are a result of the passed operations. 
+ +```julia +julia> df = DataFrame(a = 1:4, b = [50,50,60,60], c = ["hat","bat","cat","dog"]) +4×3 DataFrame + Row │ a b c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 1 50 hat + 2 │ 2 50 bat + 3 │ 3 60 cat + 4 │ 4 60 dog + +julia> combine(df, :a => maximum, :b => sum, :c => join) # 3 combine operations +1×3 DataFrame + Row │ a_maximum b_sum c_join + │ Int64 Int64 String +─────┼──────────────────────────────── + 1 │ 4 220 hatbatcatdog + +julia> select(df, :c, :b, :a) # re-order columns +4×3 DataFrame + Row │ c b a + │ String Int64 Int64 +─────┼────────────────────── + 1 │ hat 50 1 + 2 │ bat 50 2 + 3 │ cat 60 3 + 4 │ dog 60 4 + +julia> select(df, :b, :) # `:` here means all other columns +4×3 DataFrame + Row │ b a c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 50 1 hat + 2 │ 50 2 bat + 3 │ 60 3 cat + 4 │ 60 4 dog + +julia> select( + df, + :c => (x -> "a " .* x) => :one_c, + :a => (x -> 100x), + :b, + renamecols=false + ) # can mix operation forms +4×3 DataFrame + Row │ one_c a b + │ String Int64 Int64 +─────┼────────────────────── + 1 │ a hat 100 50 + 2 │ a bat 200 50 + 3 │ a cat 300 60 + 4 │ a dog 400 60 + +julia> select( + df, + :c => ByRow(reverse), + :c => ByRow(uppercase) + ) # multiple operations on same column +4×2 DataFrame + Row │ c_reverse c_uppercase + │ String String +─────┼──────────────────────── + 1 │ tah HAT + 2 │ tab BAT + 3 │ tac CAT + 4 │ god DOG +``` + +In the last two examples, +the manipulation function arguments were split across multiple lines. +This is a good way to make manipulations with many operations more readable. + +Passing multiple operations to `subset` or `subset!` is an easy way to narrow in +on a particular row of data.
+ +```julia +julia> subset( + df, + :b => ByRow(==(60)), + :c => ByRow(contains("at")) + ) # rows with 60 and "at" +1×3 DataFrame + Row │ a b c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 3 60 cat +``` + +Note that all operations within a single manipulation must use the data +as it existed before the function call +i.e. you cannot use newly created columns for subsequent operations +within the same manipulation. + +```julia +julia> transform( + df, + [:a, :b] => ByRow(+) => :d, + :d => (x -> x ./ 2), + ) # requires two separate transformations +ERROR: ArgumentError: column name :d not found in the data frame; existing most similar names are: :a, :b and :c + +julia> new_df = transform(df, [:a, :b] => ByRow(+) => :d) +4×4 DataFrame + Row │ a b c d + │ Int64 Int64 String Int64 +─────┼───────────────────────────── + 1 │ 1 50 hat 51 + 2 │ 2 50 bat 52 + 3 │ 3 60 cat 63 + 4 │ 4 60 dog 64 + +julia> transform!(new_df, :d => (x -> x ./ 2) => :d_2) +4×5 DataFrame + Row │ a b c d d_2 + │ Int64 Int64 String Int64 Float64 +─────┼────────────────────────────────────── + 1 │ 1 50 hat 51 25.5 + 2 │ 2 50 bat 52 26.0 + 3 │ 3 60 cat 63 31.5 + 4 │ 4 60 dog 64 32.0 +``` + + +#### Broadcasting Operation Pairs + +[Broadcasting](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) +pairs with `.=>` is often a convenient way to generate multiple +similar `operation`s to be applied within a single manipulation. +Broadcasting within the `Pair` of an `operation` is no different than +broadcasting in base Julia. +The broadcasting `.=>` will be expanded into a vector of pairs +(`[operation1, operation2, ...]`), +and this expansion will occur before the manipulation function is invoked. +Then the manipulation function will use the +`manipulation_function(dataframe, [operation1, operation2, ...])` method. +This process will be explained in more detail below. + +To illustrate these concepts, let us first examine the `Type` of a basic `Pair`. 
+In DataFrames.jl, a symbol, string, or integer +may be used to select a single column. +Some `Pair`s with these types are below. + +```julia +julia> typeof(:x => :a) +Pair{Symbol, Symbol} + +julia> typeof("x" => "a") +Pair{String, String} + +julia> typeof(1 => "a") +Pair{Int64, String} +``` + +Any of the `Pair`s above could be used to rename the first column +of the data frame below to `a`. + +```julia +julia> df = DataFrame(x = 1:3, y = 4:6) +3×2 DataFrame + Row │ x y + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 + +julia> select(df, :x => :a) +3×1 DataFrame + Row │ a + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 + +julia> select(df, 1 => "a") +3×1 DataFrame + Row │ a + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 +``` + +What should we do if we want to keep and rename both the `x` and `y` column? +One option is to supply a `Vector` of operation `Pair`s to `select`. +`select` will process all of these operations in order. + +```julia +julia> ["x" => "a", "y" => "b"] +2-element Vector{Pair{String, String}}: + "x" => "a" + "y" => "b" + +julia> select(df, ["x" => "a", "y" => "b"]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 +``` + +We can use broadcasting to simplify the syntax above. + +```julia +julia> ["x", "y"] .=> ["a", "b"] +2-element Vector{Pair{String, String}}: + "x" => "a" + "y" => "b" + +julia> select(df, ["x", "y"] .=> ["a", "b"]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 +``` + +Notice that `select` sees the same `Vector{Pair{String, String}}` operation +argument whether the individual pairs are written out explicitly or +constructed with broadcasting. +The broadcasting is applied before the call to `select`. + +```julia +julia> ["x" => "a", "y" => "b"] == (["x", "y"] .=> ["a", "b"]) +true +``` + +!!! Note + These operation pairs (or vector of pairs) can be given variable names. 
+ This is uncommon in practice but could be helpful for intermediate + inspection and testing. + ```julia + df = DataFrame(x = 1:3, y = 4:6) # create data frame + operation = ["x", "y"] .=> ["a", "b"] # save operation to variable + typeof(operation) # check type of operation + first(operation) # check first pair in operation + last(operation) # check last pair in operation + select(df, operation) # manipulate `df` with `operation` + ``` + +If a function is used as part of a transformation `Pair`, +like in the `source_column_selector => function => new_column_names` form, +then the function is repeated in each pair of the resultant vector. +This is an easy way to apply a function to multiple columns at the same time. + +```julia +julia> f(x) = 2 * x +f (generic function with 1 method) + +julia> ["x", "y"] .=> f .=> ["a", "b"] +2-element Vector{Pair{String, Pair{typeof(f), String}}}: + "x" => (f => "a") + "y" => (f => "b") + +julia> select(df, ["x", "y"] .=> f .=> ["a", "b"]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 + ``` + +A renaming function can be applied to multiple columns in the same way. +It will also be repeated in each operation `Pair`. + +```julia +julia> newname(s::String) = s * "_new" +newname (generic function with 1 method) + +julia> ["x", "y"] .=> f .=> newname +2-element Vector{Pair{String, Pair{typeof(f), typeof(newname)}}}: + "x" => (f => newname) + "y" => (f => newname) + +julia> select(df, ["x", "y"] .=> f .=> newname) +3×2 DataFrame + Row │ x_new y_new + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 +``` + +You can see from the type output above +that a three element pair does not actually exist. +A `Pair` (as the name implies) can only contain two elements. +Thus, `:x => :y => :z` becomes a nested `Pair`, +where `:x` is the first element and points to the `Pair` `:y => :z`, +which is the second element. 
+ +```julia +julia> p = :x => :y => :z +:x => (:y => :z) + +julia> p[1] +:x + +julia> p[2] +:y => :z + +julia> p[2][1] +:y + +julia> p[2][2] +``` + +In the previous examples, the source columns have been individually selected. +When broadcasting multiple columns to the same function, +often similarities in the column names or position can be exploited to avoid +tedious selection. +Consider a data frame with temperature data at three different locations +taken over time. +```julia +julia> df = DataFrame(Time = 1:4, + Temperature1 = [20, 23, 25, 28], + Temperature2 = [33, 37, 41, 44], + Temperature3 = [15, 10, 4, 0]) +4×4 DataFrame + Row │ Time Temperature1 Temperature2 Temperature3 + │ Int64 Int64 Int64 Int64 +─────┼───────────────────────────────────────────────── + 1 │ 1 20 33 15 + 2 │ 2 23 37 10 + 3 │ 3 25 41 4 + 4 │ 4 28 44 0 +``` + +To convert all of the temperature data in one transformation, +we just need to define a conversion function and broadcast +it to all of the "Temperature" columns. 
+ +```julia +julia> celsius_to_kelvin(x) = x + 273 +celsius_to_kelvin (generic function with 1 method) + +julia> transform( + df, + Cols(r"Temp") .=> ByRow(celsius_to_kelvin), + renamecols = false + ) +4×4 DataFrame + Row │ Time Temperature1 Temperature2 Temperature3 + │ Int64 Int64 Int64 Int64 +─────┼───────────────────────────────────────────────── + 1 │ 1 293 306 288 + 2 │ 2 296 310 283 + 3 │ 3 298 314 277 + 4 │ 4 301 317 273 +``` +Or, simultaneously changing the column names: + +```julia +julia> rename_function(s) = "Temperature $(last(s)) (°K)" +rename_function (generic function with 1 method) + +julia> select( + df, + "Time", + Cols(r"Temp") .=> ByRow(celsius_to_kelvin) .=> rename_function + ) +4×4 DataFrame + Row │ Time Temperature 1 (°K) Temperature 2 (°K) Temperature 3 (°K) + │ Int64 Int64 Int64 Int64 +─────┼─────────────────────────────────────────────────────────────────── + 1 │ 1 293 306 288 + 2 │ 2 296 310 283 + 3 │ 3 298 314 277 + 4 │ 4 301 317 273 +``` + +!!! Note Notes + * `Not("Time")` or `2:4` would have been equally good choices for `source_column_selector` in the above operations. + * Don't forget `ByRow` if your function is to be applied to elements rather than entire column vectors. + Without `ByRow`, the manipulations above would have thrown + `ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)`. + * Regular expression (`r""`) and `:` `source_column_selectors` + must be wrapped in `Cols` to be properly broadcasted + because otherwise the broadcasting occurs before the expression is expanded into a vector of matches. + +You could also broadcast different columns to different functions +by supplying a vector of functions. 
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> f1(x) = x .+ 1 +f1 (generic function with 1 method) + +julia> f2(x) = x ./ 10 +f2 (generic function with 1 method) + +julia> transform(df, [:a, :b] .=> [f1, f2]) +4×4 DataFrame + Row │ a b a_f1 b_f2 + │ Int64 Int64 Int64 Float64 +─────┼────────────────────────────── + 1 │ 1 5 2 0.5 + 2 │ 2 6 3 0.6 + 3 │ 3 7 4 0.7 + 4 │ 4 8 5 0.8 +``` + +However, this form is not much more convenient than supplying +multiple individual operations. + +```julia +julia> transform(df, [:a => f1, :b => f2]) # same manipulation as previous +4×4 DataFrame + Row │ a b a_f1 b_f2 + │ Int64 Int64 Int64 Float64 +─────┼────────────────────────────── + 1 │ 1 5 2 0.5 + 2 │ 2 6 3 0.6 + 3 │ 3 7 4 0.7 + 4 │ 4 8 5 0.8 +``` + +Perhaps more useful for broadcasting syntax +is to apply multiple functions to multiple columns +by changing the vector of functions to a 1-by-x matrix of functions. +(Recall that a list, a vector, or a matrix of operation pairs are all valid +for passing to the manipulation functions.) + +```julia +julia> [:a, :b] .=> [f1 f2] # No comma `,` between f1 and f2 +2×2 Matrix{Pair{Symbol}}: + :a=>f1 :a=>f2 + :b=>f1 :b=>f2 + +julia> transform(df, [:a, :b] .=> [f1 f2]) # No comma `,` between f1 and f2 +4×6 DataFrame + Row │ a b a_f1 b_f1 a_f2 b_f2 + │ Int64 Int64 Int64 Int64 Float64 Float64 +─────┼────────────────────────────────────────────── + 1 │ 1 5 2 6 0.1 0.5 + 2 │ 2 6 3 7 0.2 0.6 + 3 │ 3 7 4 8 0.3 0.7 + 4 │ 4 8 5 9 0.4 0.8 +``` + +In this way, every combination of selected columns and functions will be applied. + +Pair broadcasting is a simple but powerful tool +that can be used in any of the manipulation functions listed under +[Basic Usage of Manipulation Functions](@ref). +Experiment for yourself to discover other useful operations. 
+ +#### Additional Resources +The operation pair syntax is sometimes referred to as the DataFrames mini-language +or domain-specific language (DSL). +More details and examples of the opertation mini-language can be found in +[this blog post](https://bkamins.github.io/julialang/2020/12/24/minilanguage.html). + +For additional syntax niceties, +many users find the [Chain.jl](https://github.com/jkrumbiegel/Chain.jl) +and [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl) +packages useful +to help simplify manipulations that may be tedious with operation pairs alone. + +For additional practice, +an interactive tutorial is provided by the DataFrames package author +[here](https://github.com/bkamins/Julia-DataFrames-Tutorial). + +#### More Manipulation Examples with the German Dataset + +Let us move to the examples of application of these rules using the German dataset. ```jldoctest dataframe julia> using Statistics @@ -2162,7 +3309,5 @@ julia> select(german, :Age, :Job, [:Age, :Job] => (+) => :res) ``` In the examples given in this introductory tutorial we did not cover all -options of the transformation mini-language. More advanced examples, in particular -showing how to pass or produce multiple columns using the `AsTable` operation -(which you might have seen in some DataFrames.jl demos) are given in the later -sections of the manual. +options of the DataFrames.jl operation mini-language. +More advanced examples, are given in the later sections of the manual. 
From f5cc15fec921aac9097cf9d2d5cfd904b648a071 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Wed, 19 Jul 2023 11:32:33 -0400 Subject: [PATCH 02/30] assumes requested method is added --- docs/src/man/basics.md | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index ad68bf691..f8fdada79 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -2159,10 +2159,29 @@ julia> transform(df, :a => identity => add_prefix) ``` !!! Note - Renaming functions are not currently supported within `Pair` arguments - to the `rename` and `rename!` functions. - However, renaming functions can be applied to an entire data frame - with the `rename(renaming_function, dataframe)` method. + The `rename` and `rename!` functions are a simpler way + to apply a renaming function without an intermediate `operation_function`. + ```julia + julia> rename(df, :a => add_prefix) # rename some columns + 4×2 DataFrame + Row │ new_a b + │ Int64 Int64 + ─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + + julia> rename(add_prefix, df) # rename all columns + 4×2 DataFrame + Row │ new_a new_b + │ Int64 Int64 + ─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + ``` In the `source_column_selector => new_column_names` operation form, only a single source column may be selected per operation, From 7c3db8a82a86eb98ecf6656a38783ab9bc5ab19a Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Wed, 19 Jul 2023 13:30:32 -0400 Subject: [PATCH 03/30] Typo: missing :z --- docs/src/man/basics.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index f8fdada79..9842033b4 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -2595,6 +2595,7 @@ julia> p[2][1] :y julia> p[2][2] +:z ``` In the previous examples, the source columns have been individually selected. 
From 27d7e3220133b3230cd0bf3019efb3b3a1c04ae6 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Tue, 25 Jul 2023 12:10:53 -0400 Subject: [PATCH 04/30] added subset(df, source_column_selector) --- docs/src/man/basics.md | 56 +++++++++++++++++++++++++++++++++++++----- 1 file changed, 50 insertions(+), 6 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 9842033b4..55981cb50 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1601,15 +1601,12 @@ and names the resulting column(s) `new_column_names` `source_column_selector => new_column_names` : renames a source column, or splits a column containing collection elements into multiple new columns - -!!! Note - The `source_column_selector` - and the `source_column_selector => new_column_names` operation forms - are not available for the `subset` and `subset!` manipulation functions. +(not available for `subset` or `subset!`) #### `source_column_selector` Inside an `operation`, `source_column_selector` is usually a column name or column index which identifies a data frame column. + `source_column_selector` may be used as the entire `operation` with `select` or `select!` to isolate or reorder columns. @@ -1651,7 +1648,33 @@ julia> select(df, 2) 3 │ 6 ``` -`source_column_selector` may also be a collection of columns such as a vector, +`source_column_selector` may also be used as the entire `operation` +with `subset` or `subset!` if the source column contains `Bool` values. 
+ +```julia +julia> df = DataFrame( + name = ["Scott", "Jill", "Erica", "Jimmy"], + minor = [false, true, false, true], + ) +4×2 DataFrame + Row │ name minor + │ String Bool +─────┼─────────────── + 1 │ Scott false + 2 │ Jill true + 3 │ Erica false + 4 │ Jimmy true + +julia> subset(df, :minor) +2×2 DataFrame + Row │ name minor + │ String Bool +─────┼─────────────── + 1 │ Jill true + 2 │ Jimmy true +``` + +`source_column_selector` may instead be a collection of columns such as a vector, a [regular expression](https://docs.julialang.org/en/v1/manual/strings/#Regular-Expressions), a `Not`, `Between`, `All`, or `Cols` expression, or a `:`. @@ -1713,6 +1736,27 @@ julia> select(df, Between(2,4)) 1 │ José Garcia 61 2 │ Emma Marino 24 3 │ Nathan Boyer 33 + +julia> df2 = DataFrame( + name = ["Scott", "Jill", "Erica", "Jimmy"], + minor = [false, true, false, true], + male = [true, false, false, true], + ) +4×3 DataFrame + Row │ name minor male + │ String Bool Bool +─────┼────────────────────── + 1 │ Scott false true + 2 │ Jill true false + 3 │ Erica false false + 4 │ Jimmy true true + +julia> subset(df2, [:minor, :male]) +1×3 DataFrame + Row │ name minor male + │ String Bool Bool +─────┼───────────────────── + 1 │ Jimmy true true ``` `AsTable(source_column_selector)` is a special `source_column_selector` From be5fa9e9a940228b19d61ee43abde17ef39c526a Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Tue, 25 Jul 2023 12:17:44 -0400 Subject: [PATCH 05/30] Added italics --- docs/src/man/basics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 55981cb50..e6d1e1bd5 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1601,7 +1601,7 @@ and names the resulting column(s) `new_column_names` `source_column_selector => new_column_names` : renames a source column, or splits a column containing collection elements into multiple new columns -(not available for `subset` or `subset!`) +(*not available 
for `subset` or `subset!`*) #### `source_column_selector` Inside an `operation`, `source_column_selector` is usually a column name From b1b3babd6188ec18e6b6a193641865ef237481bd Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Wed, 2 Aug 2023 14:23:01 -0400 Subject: [PATCH 06/30] Moved note to main text --- docs/src/man/basics.md | 49 +++++++++++++++++++++--------------------- 1 file changed, 25 insertions(+), 24 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index e6d1e1bd5..518d72450 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -2202,30 +2202,31 @@ julia> transform(df, :a => identity => add_prefix) 4 │ 4 8 4 ``` -!!! Note - The `rename` and `rename!` functions are a simpler way - to apply a renaming function without an intermediate `operation_function`. - ```julia - julia> rename(df, :a => add_prefix) # rename some columns - 4×2 DataFrame - Row │ new_a b - │ Int64 Int64 - ─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - - julia> rename(add_prefix, df) # rename all columns - 4×2 DataFrame - Row │ new_a new_b - │ Int64 Int64 - ─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - ``` +In this case though, +it is probably again more useful to use the `rename` or `rename!` function +rather than one of the manipulation functions +in order to rename in-place and avoid the intermediate `operation_function`. 
+```julia +julia> rename(df, :a => add_prefix) # rename some columns +4×2 DataFrame +Row │ new_a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> rename(add_prefix, df) # rename all columns +4×2 DataFrame +Row │ new_a new_b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 +``` In the `source_column_selector => new_column_names` operation form, only a single source column may be selected per operation, From da6607d27eae9b4a6ff38b896e55293ba1866465 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Wed, 2 Aug 2023 15:02:46 -0400 Subject: [PATCH 07/30] Added error example and removed ° symbol MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/src/man/basics.md | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 518d72450..08e3dc7a8 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -2641,6 +2641,9 @@ julia> p[2][1] julia> p[2][2] :z + +julia> p[3] # there is no index 3 for a pair +ERROR: BoundsError: attempt to access Pair{Symbol, Pair{Symbol, Symbol}} at index [3] ``` In the previous examples, the source columns have been individually selected. 
@@ -2689,7 +2692,7 @@ julia> transform( Or, simultaneously changing the column names: ```julia -julia> rename_function(s) = "Temperature $(last(s)) (°K)" +julia> rename_function(s) = "Temperature $(last(s)) (K)" rename_function (generic function with 1 method) julia> select( @@ -2698,13 +2701,13 @@ julia> select( Cols(r"Temp") .=> ByRow(celsius_to_kelvin) .=> rename_function ) 4×4 DataFrame - Row │ Time Temperature 1 (°K) Temperature 2 (°K) Temperature 3 (°K) - │ Int64 Int64 Int64 Int64 -─────┼─────────────────────────────────────────────────────────────────── - 1 │ 1 293 306 288 - 2 │ 2 296 310 283 - 3 │ 3 298 314 277 - 4 │ 4 301 317 273 + Row │ Time Temperature 1 (K) Temperature 2 (K) Temperature 3 (K) + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────────────────── + 1 │ 1 293 306 288 + 2 │ 2 296 310 283 + 3 │ 3 298 314 277 + 4 │ 4 301 317 273 ``` !!! Note Notes From 0c47d10c66d89b58d6b8d11420920c8259a2b23e Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Thu, 17 Aug 2023 10:10:38 -0400 Subject: [PATCH 08/30] Moved Additional Resources to the end and cleaned --- docs/src/man/basics.md | 42 +++++++++++++++++++++++------------------- 1 file changed, 23 insertions(+), 19 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 08e3dc7a8..bf5caf9c4 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -2795,22 +2795,6 @@ that can be used in any of the manipulation functions listed under [Basic Usage of Manipulation Functions](@ref). Experiment for yourself to discover other useful operations. -#### Additional Resources -The operation pair syntax is sometimes referred to as the DataFrames mini-language -or domain-specific language (DSL). -More details and examples of the opertation mini-language can be found in -[this blog post](https://bkamins.github.io/julialang/2020/12/24/minilanguage.html). 
- -For additional syntax niceties, -many users find the [Chain.jl](https://github.com/jkrumbiegel/Chain.jl) -and [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl) -packages useful -to help simplify manipulations that may be tedious with operation pairs alone. - -For additional practice, -an interactive tutorial is provided by the DataFrames package author -[here](https://github.com/bkamins/Julia-DataFrames-Tutorial). - #### More Manipulation Examples with the German Dataset Let us move to the examples of application of these rules using the German dataset. @@ -3376,6 +3360,26 @@ julia> select(german, :Age, :Job, [:Age, :Job] => (+) => :res) 985 rows omitted ``` -In the examples given in this introductory tutorial we did not cover all -options of the DataFrames.jl operation mini-language. -More advanced examples, are given in the later sections of the manual. +This concludes the introductory explanation of data frame manipulations. +For more advanced examples, +see later sections of the manual or the additional resources below. + +#### Additional Resources +More details and examples of operation pair syntax can be found in +[this blog post](https://bkamins.github.io/julialang/2020/12/24/minilanguage.html). +(The official wording describing the syntax has changed since the blog post was written, +but the examples are still illustrative. +The operation pair syntax is sometimes referred to as the DataFrames.jl mini-language +or Domain-Specific Language.) + +For additional practice, +an interactive tutorial is provided on a variety of introductory topics +by the DataFrames.jl package author +[here](https://github.com/bkamins/Julia-DataFrames-Tutorial). + + +For additional syntax niceties, +many users find the [Chain.jl](https://github.com/jkrumbiegel/Chain.jl) +and [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl) +packages useful +to help simplify manipulations that may be tedious with operation pairs alone. 
\ No newline at end of file From 6f5dfc5835daeaf261c611092f80abbdebe9dae0 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Thu, 17 Aug 2023 10:16:29 -0400 Subject: [PATCH 09/30] Capitalized Boolean --- docs/src/man/basics.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index bf5caf9c4..a8f2a9892 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1911,8 +1911,8 @@ julia> subset(df, :b => ByRow(<(5))) # shorter version of the previous !!! Note `operation_functions` within `subset` or `subset!` function calls - must return a boolean vector. - `true` elements in the boolean vector will determine + must return a Boolean vector. + `true` elements in the Boolean vector will determine which rows are retained in the resulting data frame. As demonstrated above, `DataFrame` columns are usually passed From 886d9980f6d592180a28e5032f982575c9a748b0 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Thu, 17 Aug 2023 10:28:25 -0400 Subject: [PATCH 10/30] Removed extra space character --- docs/src/man/basics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index a8f2a9892..f8f8d74fe 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -2595,7 +2595,7 @@ julia> select(df, ["x", "y"] .=> f .=> ["a", "b"]) 1 │ 2 8 2 │ 4 10 3 │ 6 12 - ``` +``` A renaming function can be applied to multiple columns in the same way. It will also be repeated in each operation `Pair`. 
From cabd73fc7275b539d7f446bb93c7845d115a4253 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Mon, 18 Sep 2023 12:02:11 -0400 Subject: [PATCH 11/30] Change function broadcasting to avoid old language --- docs/src/man/basics.md | 42 ++++++++++++++++++++++++++++++++++-------- 1 file changed, 34 insertions(+), 8 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index f8f8d74fe..4626d2d4a 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -2573,21 +2573,47 @@ true select(df, operation) # manipulate `df` with `operation` ``` -If a function is used as part of a transformation `Pair`, -like in the `source_column_selector => function => new_column_names` form, -then the function is repeated in each pair of the resultant vector. -This is an easy way to apply a function to multiple columns at the same time. +In Julia, +a non-vector broadcasted with a vector will be repeated in each resultant pair element. + +```julia +julia> ["x", "y"] .=> :a # :a is repeated +2-element Vector{Pair{String, Symbol}}: + "x" => :a + "y" => :a + +julia> 1 .=> [:a, :b] # 1 is repeated +2-element Vector{Pair{Int64, Symbol}}: + 1 => :a + 1 => :b +``` + +We can use this fact to easily broadcast an `operation_function` to multiple columns. 
```julia julia> f(x) = 2 * x f (generic function with 1 method) -julia> ["x", "y"] .=> f .=> ["a", "b"] +julia> ["x", "y"] .=> f # f is repeated +2-element Vector{Pair{String, typeof(f)}}: + "x" => f + "y" => f + +julia> select(df, ["x", "y"] .=> f) # apply f with automatic column renaming +3×2 DataFrame + Row │ x_f y_f + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 + +julia> ["x", "y"] .=> f .=> ["a", "b"] # f is repeated 2-element Vector{Pair{String, Pair{typeof(f), String}}}: "x" => (f => "a") "y" => (f => "b") -julia> select(df, ["x", "y"] .=> f .=> ["a", "b"]) +julia> select(df, ["x", "y"] .=> f .=> ["a", "b"]) # apply f with manual column renaming 3×2 DataFrame Row │ a b │ Int64 Int64 @@ -2604,12 +2630,12 @@ It will also be repeated in each operation `Pair`. julia> newname(s::String) = s * "_new" newname (generic function with 1 method) -julia> ["x", "y"] .=> f .=> newname +julia> ["x", "y"] .=> f .=> newname # both f and newname are repeated 2-element Vector{Pair{String, Pair{typeof(f), typeof(newname)}}}: "x" => (f => newname) "y" => (f => newname) -julia> select(df, ["x", "y"] .=> f .=> newname) +julia> select(df, ["x", "y"] .=> f .=> newname) # apply f then rename column with newname 3×2 DataFrame Row │ x_new y_new │ Int64 Int64 From 2e9d2aff807983152b1a77f735441ebe424e2bf8 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Mon, 18 Sep 2023 12:31:49 -0400 Subject: [PATCH 12/30] Made consistent with current proposal #3361 --- docs/src/man/basics.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 4626d2d4a..12fea13fe 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -2207,7 +2207,7 @@ it is probably again more useful to use the `rename` or `rename!` function rather than one of the manipulation functions in order to rename in-place and avoid the intermediate `operation_function`. 
```julia -julia> rename(df, :a => add_prefix) # rename some columns +julia> rename(df, :a => add_prefix) # rename one column 4×2 DataFrame Row │ new_a b │ Int64 Int64 @@ -2226,6 +2226,9 @@ Row │ new_a new_b 2 │ 2 6 3 │ 3 7 4 │ 4 8 + +# Broadcasting syntax can be used to rename only some columns. +# See the Broadcasting Operation Pairs section below. ``` In the `source_column_selector => new_column_names` operation form, From b0777b180c02aa6e197d1a57eac8de2092745aa7 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Thu, 21 Sep 2023 11:12:53 -0400 Subject: [PATCH 13/30] Change α to apple and make consistent with #3380 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/src/man/basics.md | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 12fea13fe..0f432a90b 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -2108,9 +2108,9 @@ julia> df = DataFrame(a=1:4, b=5:8) 3 │ 3 7 4 │ 4 8 -julia> transform(df, :a => :α) # adds column α +julia> transform(df, :a => :apple) # adds column `apple` 4×3 DataFrame - Row │ a b α + Row │ a b apple │ Int64 Int64 Int64 ─────┼───────────────────── 1 │ 1 5 1 @@ -2118,9 +2118,9 @@ julia> transform(df, :a => :α) # adds column α 3 │ 3 7 3 4 │ 4 8 4 -julia> select(df, :a => :α) # retains only column α +julia> select(df, :a => :apple) # retains only column `apple` 4×1 DataFrame - Row │ α + Row │ apple │ Int64 ─────┼─────── 1 │ 1 @@ -2128,9 +2128,9 @@ julia> select(df, :a => :α) # retains only column α 3 │ 3 4 │ 4 -julia> rename(df, :a => :α) # renames column α in-place +julia> rename(df, :a => :apple) # renames column `a` to `apple` in-place 4×2 DataFrame - Row │ α b + Row │ apple b │ Int64 Int64 ─────┼────────────── 1 │ 1 5 @@ -2207,28 +2207,25 @@ it is probably again more useful to use the `rename` or `rename!` function rather than one of 
the manipulation functions in order to rename in-place and avoid the intermediate `operation_function`. ```julia -julia> rename(df, :a => add_prefix) # rename one column +julia> rename(add_prefix, df) # rename all columns with a function 4×2 DataFrame -Row │ new_a b - │ Int64 Int64 + Row │ new_a new_b + │ Int64 Int64 ─────┼────────────── 1 │ 1 5 2 │ 2 6 3 │ 3 7 4 │ 4 8 -julia> rename(add_prefix, df) # rename all columns +julia> rename(add_prefix, df; cols=:a) # rename some columns with a function 4×2 DataFrame -Row │ new_a new_b - │ Int64 Int64 + Row │ new_a b + │ Int64 Int64 ─────┼────────────── 1 │ 1 5 2 │ 2 6 3 │ 3 7 4 │ 4 8 - -# Broadcasting syntax can be used to rename only some columns. -# See the Broadcasting Operation Pairs section below. ``` In the `source_column_selector => new_column_names` operation form, From a111ef8a1e6ecd437832ef18e4afc5b022e7e997 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Thu, 28 Sep 2023 16:26:37 -0400 Subject: [PATCH 14/30] First round review corrections --- docs/src/man/basics.md | 117 ++++++++++++++++++++++++++++++----------- 1 file changed, 87 insertions(+), 30 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 0f432a90b..485175ffa 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1570,38 +1570,63 @@ julia> german[Not(5), r"S"] In DataFrames.jl there are seven functions that can be used to manipulate data frame columns: -| Function | Memory Usage | Column Retention | Row Retention | -| ------------ | -------------------------------- | -------------------------------------------- | ------------------------------------------------- | -| `transform` | Creates a new data frame. | Retains both source and manipulated columns. | Retains same number of rows as source data frame. | -| `transform!` | Modifies an existing data frame. | Retains both source and manipulated columns. | Retains same number of rows as source data frame. | -| `select` | Creates a new data frame. 
| Retains only manipulated columns. | Retains same number of rows as source data frame. | -| `select!` | Modifies an existing data frame. | Retains only manipulated columns. | Retains same number of rows as source data frame. | -| `subset` | Creates a new data frame. | Retains only source columns. | Number of rows is determined by the manipulation. | -| `subset!` | Modifies an existing data frame. | Retains only source columns. | Number of rows is determined by the manipulation. | -| `combine` | Creates a new data frame. | Retains only manipulated columns. | Number of rows is determined by the manipulation. | +| Function | Memory Usage | Column Retention | Row Retention | +| ------------ | -------------------------------- | --------------------------------------- | --------------------------------------------------- | +| `transform` | Creates a new data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | +| `transform!` | Modifies an existing data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | +| `select` | Creates a new data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | +| `select!` | Modifies an existing data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | +| `subset` | Creates a new data frame. | Retains original columns. | Retains only rows where condition is true. | +| `subset!` | Modifies an existing data frame. | Retains original columns. | Retains only rows where condition is true. | +| `combine` | Creates a new data frame. | Retains only resultant columns. | Retains only resultant rows. | ### Constructing Operation Pairs + All of the functions above use the same syntax which is commonly `manipulation_function(dataframe, operation)`. 
-The `operation` argument is a `Pair` which defines the +The `operation` argument defines the operation to be applied to the source `dataframe`, and it can take any of the following common forms explained below: `source_column_selector` : selects source column(s) without manipulating or renaming them + Examples: `:a`, `[:a, :b]`, `All()`, `Not(:a)` + `source_column_selector => operation_function` : passes source column(s) as arguments to a function and automatically names the resulting column(s) + Examples: `:a => sum`, `[:a, :b] => +`, `:a => ByRow(==(3))` + `source_column_selector => operation_function => new_column_names` : passes source column(s) as arguments to a function and names the resulting column(s) `new_column_names` + Examples: `:a => sum => :sum_of_a`, `[:a, :b] => + => :a_plus_b` + + (*Not available for `subset` or `subset!`*) + `source_column_selector => new_column_names` : renames a source column, or splits a column containing collection elements into multiple new columns -(*not available for `subset` or `subset!`*) + + Examples: `:a => :new_a`, `:a_b => [:a, :b]`, `:nt => AsTable` + + (*Not available for `subset` or `subset!`*) + +The `=>` operator constructs a +[Pair](https://docs.julialang.org/en/v1/base/collections/#Core.Pair), +which is a type to link one object to another. +(Pairs are commonly used to create elements of a +[Dictionary](https://docs.julialang.org/en/v1/base/collections/#Dictionaries).) +In DataFrames.jl manipulation functions, +`Pair` arguments are used to define column `operations` to be performed. +The provided examples will be explained in more detail below. + +The manipulation functions also have methods for applying multiple operations. +See the later sections [Multiple Operations per Manipulation](@ref) +and [Broadcasting Operation Pairs](@ref) for more information. 
#### `source_column_selector` Inside an `operation`, `source_column_selector` is usually a column name @@ -1682,9 +1707,9 @@ See the [Indexing](@ref) API for the full list of possible values with reference !!! Note The Julia parser sometimes prevents `:` from being used by itself. - `ERROR: syntax: whitespace not allowed after ":" used for quoting` - means your `:` must be wrapped in either `(:)` or `Cols(:)` - to be properly interpreted. + If you get + `ERROR: syntax: whitespace not allowed after ":" used for quoting`, + try using `All()`, `Cols(:)`, or `(:)` instead to select all columns. ```julia julia> df = DataFrame( @@ -1759,17 +1784,11 @@ julia> subset(df2, [:minor, :male]) 1 │ Jimmy true true ``` -`AsTable(source_column_selector)` is a special `source_column_selector` -that can be used to select multiple columns into a single `NamedTuple`. -This is not useful on its own, so the function of this selector -will be explained in the next section. - - #### `operation_function` Inside an `operation` pair, `operation_function` is a function which operates on data frame columns passed as vectors. When multiple columns are selected by `source_column_selector`, -the `operation_function` will receive the columns as multiple positional arguments +the `operation_function` will receive the columns as separate positional arguments in the order they were selected, e.g. `f(column1, column2, column3)`. 
```julia @@ -1789,7 +1808,7 @@ julia> combine(df, :a => sum) ─────┼─────── 1 │ 6 -julia> transform(df, :b => maximum) # `transform` and `select` copy result to all rows +julia> transform(df, :b => maximum) # `transform` and `select` copy scalar result to all rows 3×3 DataFrame Row │ a b b_maximum │ Int64 Int64 Int64 @@ -1867,6 +1886,18 @@ julia> transform(df, :a => g) 1 │ 1 4 2 2 │ 2 5 3 3 │ 3 4 4 + +julia> h(x, y) = 2x .+ y +h (generic function with 1 method) + +julia> transform(df, [:a, :b] => h) +3×3 DataFrame + Row │ a b a_b_h + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 6 + 2 │ 2 5 9 + 3 │ 3 4 10 ``` [Anonymous functions](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions) @@ -1939,7 +1970,7 @@ julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 2:-1:1) 1 │ 1 3 5 2 2 │ 2 4 6 1 -julia> select(df, Cols(:) => ByRow(min)) # min works on multiple arguments +julia> select(df, Cols(:) => ByRow(min)) # min operates on multiple arguments 2×1 DataFrame Row │ a_b_etc_min │ Int64 @@ -1947,7 +1978,7 @@ julia> select(df, Cols(:) => ByRow(min)) # min works on multiple arguments 1 │ 1 2 │ 1 -julia> select(df, AsTable(:) => ByRow(minimum)) # minimum works on a collection +julia> select(df, AsTable(:) => ByRow(minimum)) # minimum operates on a collection 2×1 DataFrame Row │ a_b_etc_minimum │ Int64 @@ -1955,7 +1986,7 @@ julia> select(df, AsTable(:) => ByRow(minimum)) # minimum works on a collection 1 │ 1 2 │ 1 -julia> select(df, [:a,:b] => ByRow(+)) # `+` works on a multiple arguments +julia> select(df, [:a,:b] => ByRow(+)) # `+` operates on multiple arguments 2×1 DataFrame Row │ a_b_+ │ Int64 @@ -1963,7 +1994,7 @@ julia> select(df, [:a,:b] => ByRow(+)) # `+` works on a multiple arguments 1 │ 4 2 │ 6 -julia> select(df, AsTable([:a,:b]) => ByRow(sum)) # `sum` works on a collection +julia> select(df, AsTable([:a,:b]) => ByRow(sum)) # `sum` operates on a collection 2×1 DataFrame Row │ a_b_sum │ Int64 @@ -1973,7 +2004,7 @@ julia> select(df, 
AsTable([:a,:b]) => ByRow(sum)) # `sum` works on a collection julia> using Statistics # contains the `mean` function -julia> select(df, AsTable(Between(:b, :d)) => ByRow(mean)) +julia> select(df, AsTable(Between(:b, :d)) => ByRow(mean)) # `mean` operates on a collection 2×1 DataFrame Row │ b_c_d_mean │ Float64 @@ -2047,7 +2078,7 @@ specify your own `new_column_names`. `new_column_names` can be included at the end of an `operation` pair to specify the name of the new column(s). -`new_column_names` may be a symbol or a string. +`new_column_names` may be a symbol, string, function, vector of symbols, vector of strings, or `AsTable`. ```julia julia> df = DataFrame(a=1:4, b=5:8) @@ -2094,7 +2125,7 @@ julia> transform(df, :a => ByRow(x->x+10) => "a+10") The `source_column_selector => new_column_names` operation form can be used to rename columns without an intermediate function. However, there are `rename` and `rename!` functions, -which accept the same syntax, +which accept similar syntax, that tend to be more useful for this operation. ```julia @@ -2179,7 +2210,33 @@ julia> transform(df, :a => (x -> 10 .* x) => (s -> "new_" * s)) # with anonymous 4 │ 4 8 40 ``` -Note that a renaming function will not work in the +!!! Note + It is a good idea to wrap anonymous functions in parentheses + to avoid the `=>` operator accidentally becoming part of the anonymous function. + The examples above do not work correctly without the parentheses! + ```julia + julia> transform(df, :a => x -> 10 .* x => add_prefix) # Not what we wanted! + 4×3 DataFrame + Row │ a b a_function + │ Int64 Int64 Pair… + ─────┼──────────────────────────────────────────── + 1 │ 1 5 [10, 20, 30, 40]=>add_prefix + 2 │ 2 6 [10, 20, 30, 40]=>add_prefix + 3 │ 3 7 [10, 20, 30, 40]=>add_prefix + 4 │ 4 8 [10, 20, 30, 40]=>add_prefix + + julia> transform(df, :a => x -> 10 .* x => s -> "new_" * s) # Not what we wanted! 
+ 4×3 DataFrame + Row │ a b a_function + │ Int64 Int64 Pair… + ─────┼───────────────────────────────────── + 1 │ 1 5 [10, 20, 30, 40]=>#18 + 2 │ 2 6 [10, 20, 30, 40]=>#18 + 3 │ 3 7 [10, 20, 30, 40]=>#18 + 4 │ 4 8 [10, 20, 30, 40]=>#18 + ``` + +A renaming function will not work in the `source_column_selector => new_column_names` operation form because a function in the second element of the operation pair is assumed to take the `source_column_selector => operation_function` operation form. From ce55607655f6136060abf21d1fbbbf4023e63c27 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Fri, 29 Sep 2023 11:07:18 -0400 Subject: [PATCH 15/30] Move to its own section --- docs/src/man/basics.md | 1361 +----------------------- docs/src/man/manipulation_functions.md | 1345 +++++++++++++++++++++++ 2 files changed, 1369 insertions(+), 1337 deletions(-) create mode 100644 docs/src/man/manipulation_functions.md diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 485175ffa..daadf7a00 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1567,1320 +1567,28 @@ julia> german[Not(5), r"S"] ## Basic Usage of Manipulation Functions -In DataFrames.jl there are seven functions that can be used -to manipulate data frame columns: - -| Function | Memory Usage | Column Retention | Row Retention | -| ------------ | -------------------------------- | --------------------------------------- | --------------------------------------------------- | -| `transform` | Creates a new data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | -| `transform!` | Modifies an existing data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | -| `select` | Creates a new data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | -| `select!` | Modifies an existing data frame. | Retains only resultant columns. 
| Retains same number of rows as original data frame. | -| `subset` | Creates a new data frame. | Retains original columns. | Retains only rows where condition is true. | -| `subset!` | Modifies an existing data frame. | Retains original columns. | Retains only rows where condition is true. | -| `combine` | Creates a new data frame. | Retains only resultant columns. | Retains only resultant rows. | - -### Constructing Operation Pairs - -All of the functions above use the same syntax which is commonly -`manipulation_function(dataframe, operation)`. -The `operation` argument defines the -operation to be applied to the source `dataframe`, -and it can take any of the following common forms explained below: - -`source_column_selector` -: selects source column(s) without manipulating or renaming them - - Examples: `:a`, `[:a, :b]`, `All()`, `Not(:a)` - -`source_column_selector => operation_function` -: passes source column(s) as arguments to a function -and automatically names the resulting column(s) - - Examples: `:a => sum`, `[:a, :b] => +`, `:a => ByRow(==(3))` - -`source_column_selector => operation_function => new_column_names` -: passes source column(s) as arguments to a function -and names the resulting column(s) `new_column_names` - - Examples: `:a => sum => :sum_of_a`, `[:a, :b] => + => :a_plus_b` - - (*Not available for `subset` or `subset!`*) - -`source_column_selector => new_column_names` -: renames a source column, -or splits a column containing collection elements into multiple new columns - - Examples: `:a => :new_a`, `:a_b => [:a, :b]`, `:nt => AsTable` - - (*Not available for `subset` or `subset!`*) - -The `=>` operator constructs a -[Pair](https://docs.julialang.org/en/v1/base/collections/#Core.Pair), -which is a type to link one object to another. -(Pairs are commonly used to create elements of a -[Dictionary](https://docs.julialang.org/en/v1/base/collections/#Dictionaries).) 
-In DataFrames.jl manipulation functions, -`Pair` arguments are used to define column `operations` to be performed. -The provided examples will be explained in more detail below. - -The manipulation functions also have methods for applying multiple operations. -See the later sections [Multiple Operations per Manipulation](@ref) -and [Broadcasting Operation Pairs](@ref) for more information. - -#### `source_column_selector` -Inside an `operation`, `source_column_selector` is usually a column name -or column index which identifies a data frame column. - -`source_column_selector` may be used as the entire `operation` -with `select` or `select!` to isolate or reorder columns. - -```julia -julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = [7, 8, 9]) -3×3 DataFrame - Row │ a b c - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 7 - 2 │ 2 5 8 - 3 │ 3 6 9 - -julia> select(df, :b) -3×1 DataFrame - Row │ b - │ Int64 -─────┼─────── - 1 │ 4 - 2 │ 5 - 3 │ 6 - -julia> select(df, "b") -3×1 DataFrame - Row │ b - │ Int64 -─────┼─────── - 1 │ 4 - 2 │ 5 - 3 │ 6 - -julia> select(df, 2) -3×1 DataFrame - Row │ b - │ Int64 -─────┼─────── - 1 │ 4 - 2 │ 5 - 3 │ 6 -``` - -`source_column_selector` may also be used as the entire `operation` -with `subset` or `subset!` if the source column contains `Bool` values. - -```julia -julia> df = DataFrame( - name = ["Scott", "Jill", "Erica", "Jimmy"], - minor = [false, true, false, true], - ) -4×2 DataFrame - Row │ name minor - │ String Bool -─────┼─────────────── - 1 │ Scott false - 2 │ Jill true - 3 │ Erica false - 4 │ Jimmy true - -julia> subset(df, :minor) -2×2 DataFrame - Row │ name minor - │ String Bool -─────┼─────────────── - 1 │ Jill true - 2 │ Jimmy true -``` - -`source_column_selector` may instead be a collection of columns such as a vector, -a [regular expression](https://docs.julialang.org/en/v1/manual/strings/#Regular-Expressions), -a `Not`, `Between`, `All`, or `Cols` expression, -or a `:`. 
-See the [Indexing](@ref) API for the full list of possible values with references. - -!!! Note - The Julia parser sometimes prevents `:` from being used by itself. - If you get - `ERROR: syntax: whitespace not allowed after ":" used for quoting`, - try using `All()`, `Cols(:)`, or `(:)` instead to select all columns. - -```julia -julia> df = DataFrame( - id = [1, 2, 3], - first_name = ["José", "Emma", "Nathan"], - last_name = ["Garcia", "Marino", "Boyer"], - age = [61, 24, 33] - ) -3×4 DataFrame - Row │ id first_name last_name age - │ Int64 String String Int64 -─────┼───────────────────────────────────── - 1 │ 1 José Garcia 61 - 2 │ 2 Emma Marino 24 - 3 │ 3 Nathan Boyer 33 - -julia> select(df, [:last_name, :first_name]) -3×2 DataFrame - Row │ last_name first_name - │ String String -─────┼─────────────────────── - 1 │ Garcia José - 2 │ Marino Emma - 3 │ Boyer Nathan - -julia> select(df, r"name") -3×2 DataFrame - Row │ first_name last_name - │ String String -─────┼─────────────────────── - 1 │ José Garcia - 2 │ Emma Marino - 3 │ Nathan Boyer - -julia> select(df, Not(:id)) -3×3 DataFrame - Row │ first_name last_name age - │ String String Int64 -─────┼────────────────────────────── - 1 │ José Garcia 61 - 2 │ Emma Marino 24 - 3 │ Nathan Boyer 33 - -julia> select(df, Between(2,4)) -3×3 DataFrame - Row │ first_name last_name age - │ String String Int64 -─────┼────────────────────────────── - 1 │ José Garcia 61 - 2 │ Emma Marino 24 - 3 │ Nathan Boyer 33 - -julia> df2 = DataFrame( - name = ["Scott", "Jill", "Erica", "Jimmy"], - minor = [false, true, false, true], - male = [true, false, false, true], - ) -4×3 DataFrame - Row │ name minor male - │ String Bool Bool -─────┼────────────────────── - 1 │ Scott false true - 2 │ Jill true false - 3 │ Erica false false - 4 │ Jimmy true true - -julia> subset(df2, [:minor, :male]) -1×3 DataFrame - Row │ name minor male - │ String Bool Bool -─────┼───────────────────── - 1 │ Jimmy true true -``` - -#### `operation_function` -Inside an 
`operation` pair, `operation_function` is a function -which operates on data frame columns passed as vectors. -When multiple columns are selected by `source_column_selector`, -the `operation_function` will receive the columns as separate positional arguments -in the order they were selected, e.g. `f(column1, column2, column3)`. - -```julia -julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 4]) -3×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 2 5 - 3 │ 3 4 - -julia> combine(df, :a => sum) -1×1 DataFrame - Row │ a_sum - │ Int64 -─────┼─────── - 1 │ 6 - -julia> transform(df, :b => maximum) # `transform` and `select` copy scalar result to all rows -3×3 DataFrame - Row │ a b b_maximum - │ Int64 Int64 Int64 -─────┼───────────────────────── - 1 │ 1 4 5 - 2 │ 2 5 5 - 3 │ 3 4 5 - -julia> transform(df, [:b, :a] => -) # vector subtraction is okay -3×3 DataFrame - Row │ a b b_a_- - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 3 - 2 │ 2 5 3 - 3 │ 3 4 1 - -julia> transform(df, [:a, :b] => *) # vector multiplication is not defined -ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64}) -``` - -Don't worry! There is a quick fix for the previous error. -If you want to apply a function to each element in a column -instead of to the entire column vector, -then you can wrap your element-wise function in `ByRow` like -`ByRow(my_elementwise_function)`. -This will apply `my_elementwise_function` to every element in the column -and then collect the results back into a vector. 
- -```julia -julia> transform(df, [:a, :b] => ByRow(*)) -3×3 DataFrame - Row │ a b a_b_* - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 4 - 2 │ 2 5 10 - 3 │ 3 4 12 - -julia> transform(df, Cols(:) => ByRow(max)) -3×3 DataFrame - Row │ a b a_b_max - │ Int64 Int64 Int64 -─────┼─────────────────────── - 1 │ 1 4 4 - 2 │ 2 5 5 - 3 │ 3 4 4 - -julia> f(x) = x + 1 -f (generic function with 1 method) - -julia> transform(df, :a => ByRow(f)) -3×3 DataFrame - Row │ a b a_f - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 2 - 2 │ 2 5 3 - 3 │ 3 4 4 -``` - -Alternatively, you may just want to define the function itself so it -[broadcasts](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) -over vectors. - -```julia -julia> g(x) = x .+ 1 -g (generic function with 1 method) - -julia> transform(df, :a => g) -3×3 DataFrame - Row │ a b a_g - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 2 - 2 │ 2 5 3 - 3 │ 3 4 4 - -julia> h(x, y) = 2x .+ y -h (generic function with 1 method) - -julia> transform(df, [:a, :b] => h) -3×3 DataFrame - Row │ a b a_b_h - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 6 - 2 │ 2 5 9 - 3 │ 3 4 10 -``` - -[Anonymous functions](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions) -are a convenient way to define and use an `operation_function` -all within the manipulation function call. 
- -```julia -julia> select(df, :a => ByRow(x -> x + 1)) -3×1 DataFrame - Row │ a_function - │ Int64 -─────┼──────────── - 1 │ 2 - 2 │ 3 - 3 │ 4 - -julia> transform(df, [:a, :b] => ByRow((x, y) -> 2x + y)) -3×3 DataFrame - Row │ a b a_b_function - │ Int64 Int64 Int64 -─────┼──────────────────────────── - 1 │ 1 4 6 - 2 │ 2 5 9 - 3 │ 3 4 10 - -julia> subset(df, :b => ByRow(x -> x < 5)) -2×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 3 4 - -julia> subset(df, :b => ByRow(<(5))) # shorter version of the previous -2×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 3 4 -``` - -!!! Note - `operation_functions` within `subset` or `subset!` function calls - must return a Boolean vector. - `true` elements in the Boolean vector will determine - which rows are retained in the resulting data frame. - -As demonstrated above, `DataFrame` columns are usually passed -from `source_column_selector` to `operation_function` as one or more -vector arguments. -However, when `AsTable(source_column_selector)` is used, -the selected columns are collected and passed as a single `NamedTuple` -to `operation_function`. - -This is often useful when your `operation_function` is defined to operate -on a single collection argument rather than on multiple positional arguments. -The distinction is somewhat similar to the difference between the built-in -`min` and `minimum` functions. -`min` is defined to find the minimum value among multiple positional arguments, -while `minimum` is defined to find the minimum value -among the elements of a single collection argument. 
- -```julia -julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 2:-1:1) -2×4 DataFrame - Row │ a b c d - │ Int64 Int64 Int64 Int64 -─────┼──────────────────────────── - 1 │ 1 3 5 2 - 2 │ 2 4 6 1 - -julia> select(df, Cols(:) => ByRow(min)) # min operates on multiple arguments -2×1 DataFrame - Row │ a_b_etc_min - │ Int64 -─────┼───────────── - 1 │ 1 - 2 │ 1 - -julia> select(df, AsTable(:) => ByRow(minimum)) # minimum operates on a collection -2×1 DataFrame - Row │ a_b_etc_minimum - │ Int64 -─────┼───────────────── - 1 │ 1 - 2 │ 1 - -julia> select(df, [:a,:b] => ByRow(+)) # `+` operates on a multiple arguments -2×1 DataFrame - Row │ a_b_+ - │ Int64 -─────┼─────── - 1 │ 4 - 2 │ 6 - -julia> select(df, AsTable([:a,:b]) => ByRow(sum)) # `sum` operates on a collection -2×1 DataFrame - Row │ a_b_sum - │ Int64 -─────┼───────── - 1 │ 4 - 2 │ 6 - -julia> using Statistics # contains the `mean` function - -julia> select(df, AsTable(Between(:b, :d)) => ByRow(mean)) # `mean` operates on a collection -2×1 DataFrame - Row │ b_c_d_mean - │ Float64 -─────┼──────────── - 1 │ 3.33333 - 2 │ 3.66667 -``` - -`AsTable` can also be used to pass columns to a function which operates -on fields of a `NamedTuple`. 
- -```julia -julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 7:8) -2×4 DataFrame - Row │ a b c d - │ Int64 Int64 Int64 Int64 -─────┼──────────────────────────── - 1 │ 1 3 5 7 - 2 │ 2 4 6 8 - -julia> f(nt) = nt.a + nt.d -f (generic function with 1 method) - -julia> transform(df, AsTable(:) => ByRow(f)) -2×5 DataFrame - Row │ a b c d a_b_etc_f - │ Int64 Int64 Int64 Int64 Int64 -─────┼─────────────────────────────────────── - 1 │ 1 3 5 7 8 - 2 │ 2 4 6 8 10 -``` - -As demonstrated above, -in the `source_column_selector => operation_function` operation pair form, -the results of an operation will be placed into a new column with an -automatically-generated name based on the operation; -the new column name will be the `operation_function` name -appended to the source column name(s) with an underscore. - -This automatic column naming behavior can be avoided in two ways. -First, the operation result can be placed back into the original column -with the original column name by switching the keyword argument `renamecols` -from its default value (`true`) to `renamecols=false`. - -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> transform(df, :a => ByRow(x->x+10), renamecols=false) # add 10 in-place -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 11 5 - 2 │ 12 6 - 3 │ 13 7 - 4 │ 14 8 -``` - -The second method to avoid the default manipulation column naming is to -specify your own `new_column_names`. - -#### `new_column_names` - -`new_column_names` can be included at the end of an `operation` pair to specify -the name of the new column(s). -`new_column_names` may be a symbol, string, function, vector of symbols, vector of strings, or `AsTable`. 
- -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> transform(df, Cols(:) => ByRow(+) => :c) -4×3 DataFrame - Row │ a b c - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 6 - 2 │ 2 6 8 - 3 │ 3 7 10 - 4 │ 4 8 12 - -julia> transform(df, Cols(:) => ByRow(+) => "a+b") -4×3 DataFrame - Row │ a b a+b - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 6 - 2 │ 2 6 8 - 3 │ 3 7 10 - 4 │ 4 8 12 - -julia> transform(df, :a => ByRow(x->x+10) => "a+10") -4×3 DataFrame - Row │ a b a+10 - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 11 - 2 │ 2 6 12 - 3 │ 3 7 13 - 4 │ 4 8 14 -``` - -The `source_column_selector => new_column_names` operation form -can be used to rename columns without an intermediate function. -However, there are `rename` and `rename!` functions, -which accept similar syntax, -that tend to be more useful for this operation. - -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> transform(df, :a => :apple) # adds column `apple` -4×3 DataFrame - Row │ a b apple - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 1 - 2 │ 2 6 2 - 3 │ 3 7 3 - 4 │ 4 8 4 - -julia> select(df, :a => :apple) # retains only column `apple` -4×1 DataFrame - Row │ apple - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - 3 │ 3 - 4 │ 4 - -julia> rename(df, :a => :apple) # renames column `a` to `apple` in-place -4×2 DataFrame - Row │ apple b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 -``` - -Additionally, in the -`source_column_selector => operation_function => new_column_names` operation form, -`new_column_names` may be a renaming function which operates on a string -to create the destination column names programmatically. 
- -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> add_prefix(s) = "new_" * s -add_prefix (generic function with 1 method) - -julia> transform(df, :a => (x -> 10 .* x) => add_prefix) # with named renaming function -4×3 DataFrame - Row │ a b new_a - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 10 - 2 │ 2 6 20 - 3 │ 3 7 30 - 4 │ 4 8 40 - -julia> transform(df, :a => (x -> 10 .* x) => (s -> "new_" * s)) # with anonymous renaming function -4×3 DataFrame - Row │ a b new_a - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 10 - 2 │ 2 6 20 - 3 │ 3 7 30 - 4 │ 4 8 40 -``` - -!!! Note - It is a good idea to wrap anonymous functions in parentheses - to avoid the `=>` operator accidently becoming part of the anonymous function. - The examples above do not work correctly without the parentheses! - ```julia - julia> transform(df, :a => x -> 10 .* x => add_prefix) # Not what we wanted! - 4×3 DataFrame - Row │ a b a_function - │ Int64 Int64 Pair… - ─────┼──────────────────────────────────────────── - 1 │ 1 5 [10, 20, 30, 40]=>add_prefix - 2 │ 2 6 [10, 20, 30, 40]=>add_prefix - 3 │ 3 7 [10, 20, 30, 40]=>add_prefix - 4 │ 4 8 [10, 20, 30, 40]=>add_prefix - - julia> transform(df, :a => x -> 10 .* x => s -> "new_" * s) # Not what we wanted! - 4×3 DataFrame - Row │ a b a_function - │ Int64 Int64 Pair… - ─────┼───────────────────────────────────── - 1 │ 1 5 [10, 20, 30, 40]=>#18 - 2 │ 2 6 [10, 20, 30, 40]=>#18 - 3 │ 3 7 [10, 20, 30, 40]=>#18 - 4 │ 4 8 [10, 20, 30, 40]=>#18 - ``` - -A renaming function will not work in the -`source_column_selector => new_column_names` operation form -because a function in the second element of the operation pair is assumed to take -the `source_column_selector => operation_function` operation form. 
-To work around this limitation, use the -`source_column_selector => operation_function => new_column_names` operation form -with `identity` as the `operation_function`. - -```julia -julia> transform(df, :a => add_prefix) -ERROR: MethodError: no method matching *(::String, ::Vector{Int64}) - -julia> transform(df, :a => identity => add_prefix) -4×3 DataFrame - Row │ a b new_a - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 1 - 2 │ 2 6 2 - 3 │ 3 7 3 - 4 │ 4 8 4 -``` - -In this case though, -it is probably again more useful to use the `rename` or `rename!` function -rather than one of the manipulation functions -in order to rename in-place and avoid the intermediate `operation_function`. -```julia -julia> rename(add_prefix, df) # rename all columns with a function -4×2 DataFrame - Row │ new_a new_b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> rename(add_prefix, df; cols=:a) # rename some columns with a function -4×2 DataFrame - Row │ new_a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 -``` - -In the `source_column_selector => new_column_names` operation form, -only a single source column may be selected per operation, -so why is `new_column_names` plural? -It is possible to split the data contained inside a single column -into multiple new columns by supplying a vector of strings or symbols -as `new_column_names`. - -```julia -julia> df = DataFrame(data = [(1,2), (3,4)]) # vector of tuples -2×1 DataFrame - Row │ data - │ Tuple… -─────┼──────── - 1 │ (1, 2) - 2 │ (3, 4) - -julia> transform(df, :data => [:first, :second]) # manual naming -2×3 DataFrame - Row │ data first second - │ Tuple… Int64 Int64 -─────┼─────────────────────── - 1 │ (1, 2) 1 2 - 2 │ (3, 4) 3 4 -``` - -This kind of data splitting can even be done automatically with `AsTable`. 
- -```julia -julia> transform(df, :data => AsTable) # default automatic naming with tuples -2×3 DataFrame - Row │ data x1 x2 - │ Tuple… Int64 Int64 -─────┼────────────────────── - 1 │ (1, 2) 1 2 - 2 │ (3, 4) 3 4 -``` - -If a data frame column contains `NamedTuple`s, -then `AsTable` will preserve the field names. -```julia -julia> df = DataFrame(data = [(a=1,b=2), (a=3,b=4)]) # vector of named tuples -2×1 DataFrame - Row │ data - │ NamedTup… -─────┼──────────────── - 1 │ (a = 1, b = 2) - 2 │ (a = 3, b = 4) - -julia> transform(df, :data => AsTable) # keeps names from named tuples -2×3 DataFrame - Row │ data a b - │ NamedTup… Int64 Int64 -─────┼────────────────────────────── - 1 │ (a = 1, b = 2) 1 2 - 2 │ (a = 3, b = 4) 3 4 -``` - -!!! Note - To pack multiple columns into a single column of `NamedTuple`s - (reverse of the above operation) - apply the `identity` function `ByRow`, e.g. - `transform(df, AsTable([:a, :b]) => ByRow(identity) => :data)`. - -Renaming functions also work for multi-column transformations, -but they must operate on a vector of strings. 
- -```julia -julia> df = DataFrame(data = [(1,2), (3,4)]) -2×1 DataFrame - Row │ data - │ Tuple… -─────┼──────── - 1 │ (1, 2) - 2 │ (3, 4) - -julia> new_names(v) = ["primary ", "secondary "] .* v -new_names (generic function with 1 method) - -julia> transform(df, :data => identity => new_names) -2×3 DataFrame - Row │ data primary data secondary data - │ Tuple… Int64 Int64 -─────┼────────────────────────────────────── - 1 │ (1, 2) 1 2 - 2 │ (3, 4) 3 4 -``` - -#### Multiple Operations per Manipulation -All data frame manipulation functions can accept multiple `operation` pairs -at once using any of the following methods: -- `manipulation_function(dataframe, operation1, operation2)` : multiple arguments -- `manipulation_function(dataframe, [operation1, operation2])` : vector argument -- `manipulation_function(dataframe, [operation1 operation2])` : matrix argument - -Passing multiple operations is especially useful for the `select`, `select!`, -and `combine` manipulation functions, -since they only retain columns which are a result of the passed operations. 
- -```julia -julia> df = DataFrame(a = 1:4, b = [50,50,60,60], c = ["hat","bat","cat","dog"]) -4×3 DataFrame - Row │ a b c - │ Int64 Int64 String -─────┼────────────────────── - 1 │ 1 50 hat - 2 │ 2 50 bat - 3 │ 3 60 cat - 4 │ 4 60 dog - -julia> combine(df, :a => maximum, :b => sum, :c => join) # 3 combine operations -1×3 DataFrame - Row │ a_maximum b_sum c_join - │ Int64 Int64 String -─────┼──────────────────────────────── - 1 │ 4 220 hatbatcatdog - -julia> select(df, :c, :b, :a) # re-order columns -4×3 DataFrame - Row │ c b a - │ String Int64 Int64 -─────┼────────────────────── - 1 │ hat 50 1 - 2 │ bat 50 2 - 3 │ cat 60 3 - 4 │ dog 60 4 - -ulia> select(df, :b, :) # `:` here means all other columns -4×3 DataFrame - Row │ b a c - │ Int64 Int64 String -─────┼────────────────────── - 1 │ 50 1 hat - 2 │ 50 2 bat - 3 │ 60 3 cat - 4 │ 60 4 dog - -julia> select( - df, - :c => (x -> "a " .* x) => :one_c, - :a => (x -> 100x), - :b, - renamecols=false - ) # can mix operation forms -4×3 DataFrame - Row │ one_c a b - │ String Int64 Int64 -─────┼────────────────────── - 1 │ a hat 100 50 - 2 │ a bat 200 50 - 3 │ a cat 300 60 - 4 │ a dog 400 60 - -julia> select( - df, - :c => ByRow(reverse), - :c => ByRow(uppercase) - ) # multiple operations on same column -4×2 DataFrame - Row │ c_reverse c_uppercase - │ String String -─────┼──────────────────────── - 1 │ tah HAT - 2 │ tab BAT - 3 │ tac CAT - 4 │ god DOG -``` - -In the last two examples, -the manipulation function arguments were split across multiple lines. -This is a good way to make manipulations with many operations more readable. - -Passing multiple operations to `subset` or `subset!` is an easy way to narrow in -on a particular row of data. 
- -```julia -julia> subset( - df, - :b => ByRow(==(60)), - :c => ByRow(contains("at")) - ) # rows with 60 and "at" -1×3 DataFrame - Row │ a b c - │ Int64 Int64 String -─────┼────────────────────── - 1 │ 3 60 cat -``` - -Note that all operations within a single manipulation must use the data -as it existed before the function call -i.e. you cannot use newly created columns for subsequent operations -within the same manipulation. - -```julia -julia> transform( - df, - [:a, :b] => ByRow(+) => :d, - :d => (x -> x ./ 2), - ) # requires two separate transformations -ERROR: ArgumentError: column name :d not found in the data frame; existing most similar names are: :a, :b and :c - -julia> new_df = transform(df, [:a, :b] => ByRow(+) => :d) -4×4 DataFrame - Row │ a b c d - │ Int64 Int64 String Int64 -─────┼───────────────────────────── - 1 │ 1 50 hat 51 - 2 │ 2 50 bat 52 - 3 │ 3 60 cat 63 - 4 │ 4 60 dog 64 - -julia> transform!(new_df, :d => (x -> x ./ 2) => :d_2) -4×5 DataFrame - Row │ a b c d d_2 - │ Int64 Int64 String Int64 Float64 -─────┼────────────────────────────────────── - 1 │ 1 50 hat 51 25.5 - 2 │ 2 50 bat 52 26.0 - 3 │ 3 60 cat 63 31.5 - 4 │ 4 60 dog 64 32.0 -``` - - -#### Broadcasting Operation Pairs - -[Broadcasting](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) -pairs with `.=>` is often a convenient way to generate multiple -similar `operation`s to be applied within a single manipulation. -Broadcasting within the `Pair` of an `operation` is no different than -broadcasting in base Julia. -The broadcasting `.=>` will be expanded into a vector of pairs -(`[operation1, operation2, ...]`), -and this expansion will occur before the manipulation function is invoked. -Then the manipulation function will use the -`manipulation_function(dataframe, [operation1, operation2, ...])` method. -This process will be explained in more detail below. - -To illustrate these concepts, let us first examine the `Type` of a basic `Pair`. 
-In DataFrames.jl, a symbol, string, or integer -may be used to select a single column. -Some `Pair`s with these types are below. - -```julia -julia> typeof(:x => :a) -Pair{Symbol, Symbol} - -julia> typeof("x" => "a") -Pair{String, String} - -julia> typeof(1 => "a") -Pair{Int64, String} -``` - -Any of the `Pair`s above could be used to rename the first column -of the data frame below to `a`. - -```julia -julia> df = DataFrame(x = 1:3, y = 4:6) -3×2 DataFrame - Row │ x y - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 2 5 - 3 │ 3 6 - -julia> select(df, :x => :a) -3×1 DataFrame - Row │ a - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - 3 │ 3 - -julia> select(df, 1 => "a") -3×1 DataFrame - Row │ a - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - 3 │ 3 -``` - -What should we do if we want to keep and rename both the `x` and `y` column? -One option is to supply a `Vector` of operation `Pair`s to `select`. -`select` will process all of these operations in order. - -```julia -julia> ["x" => "a", "y" => "b"] -2-element Vector{Pair{String, String}}: - "x" => "a" - "y" => "b" - -julia> select(df, ["x" => "a", "y" => "b"]) -3×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 2 5 - 3 │ 3 6 -``` - -We can use broadcasting to simplify the syntax above. - -```julia -julia> ["x", "y"] .=> ["a", "b"] -2-element Vector{Pair{String, String}}: - "x" => "a" - "y" => "b" - -julia> select(df, ["x", "y"] .=> ["a", "b"]) -3×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 2 5 - 3 │ 3 6 -``` - -Notice that `select` sees the same `Vector{Pair{String, String}}` operation -argument whether the individual pairs are written out explicitly or -constructed with broadcasting. -The broadcasting is applied before the call to `select`. - -```julia -julia> ["x" => "a", "y" => "b"] == (["x", "y"] .=> ["a", "b"]) -true -``` - -!!! Note - These operation pairs (or vector of pairs) can be given variable names. 
- This is uncommon in practice but could be helpful for intermediate - inspection and testing. - ```julia - df = DataFrame(x = 1:3, y = 4:6) # create data frame - operation = ["x", "y"] .=> ["a", "b"] # save operation to variable - typeof(operation) # check type of operation - first(operation) # check first pair in operation - last(operation) # check last pair in operation - select(df, operation) # manipulate `df` with `operation` - ``` - -In Julia, -a non-vector broadcasted with a vector will be repeated in each resultant pair element. - -```julia -julia> ["x", "y"] .=> :a # :a is repeated -2-element Vector{Pair{String, Symbol}}: - "x" => :a - "y" => :a - -julia> 1 .=> [:a, :b] # 1 is repeated -2-element Vector{Pair{Int64, Symbol}}: - 1 => :a - 1 => :b -``` - -We can use this fact to easily broadcast an `operation_function` to multiple columns. - -```julia -julia> f(x) = 2 * x -f (generic function with 1 method) - -julia> ["x", "y"] .=> f # f is repeated -2-element Vector{Pair{String, typeof(f)}}: - "x" => f - "y" => f - -julia> select(df, ["x", "y"] .=> f) # apply f with automatic column renaming -3×2 DataFrame - Row │ x_f y_f - │ Int64 Int64 -─────┼────────────── - 1 │ 2 8 - 2 │ 4 10 - 3 │ 6 12 - -julia> ["x", "y"] .=> f .=> ["a", "b"] # f is repeated -2-element Vector{Pair{String, Pair{typeof(f), String}}}: - "x" => (f => "a") - "y" => (f => "b") - -julia> select(df, ["x", "y"] .=> f .=> ["a", "b"]) # apply f with manual column renaming -3×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 2 8 - 2 │ 4 10 - 3 │ 6 12 -``` - -A renaming function can be applied to multiple columns in the same way. -It will also be repeated in each operation `Pair`. 
- -```julia -julia> newname(s::String) = s * "_new" -newname (generic function with 1 method) - -julia> ["x", "y"] .=> f .=> newname # both f and newname are repeated -2-element Vector{Pair{String, Pair{typeof(f), typeof(newname)}}}: - "x" => (f => newname) - "y" => (f => newname) - -julia> select(df, ["x", "y"] .=> f .=> newname) # apply f then rename column with newname -3×2 DataFrame - Row │ x_new y_new - │ Int64 Int64 -─────┼────────────── - 1 │ 2 8 - 2 │ 4 10 - 3 │ 6 12 -``` - -You can see from the type output above -that a three element pair does not actually exist. -A `Pair` (as the name implies) can only contain two elements. -Thus, `:x => :y => :z` becomes a nested `Pair`, -where `:x` is the first element and points to the `Pair` `:y => :z`, -which is the second element. - -```julia -julia> p = :x => :y => :z -:x => (:y => :z) - -julia> p[1] -:x - -julia> p[2] -:y => :z - -julia> p[2][1] -:y - -julia> p[2][2] -:z - -julia> p[3] # there is no index 3 for a pair -ERROR: BoundsError: attempt to access Pair{Symbol, Pair{Symbol, Symbol}} at index [3] -``` - -In the previous examples, the source columns have been individually selected. -When broadcasting multiple columns to the same function, -often similarities in the column names or position can be exploited to avoid -tedious selection. -Consider a data frame with temperature data at three different locations -taken over time. -```julia -julia> df = DataFrame(Time = 1:4, - Temperature1 = [20, 23, 25, 28], - Temperature2 = [33, 37, 41, 44], - Temperature3 = [15, 10, 4, 0]) -4×4 DataFrame - Row │ Time Temperature1 Temperature2 Temperature3 - │ Int64 Int64 Int64 Int64 -─────┼───────────────────────────────────────────────── - 1 │ 1 20 33 15 - 2 │ 2 23 37 10 - 3 │ 3 25 41 4 - 4 │ 4 28 44 0 -``` - -To convert all of the temperature data in one transformation, -we just need to define a conversion function and broadcast -it to all of the "Temperature" columns. 
- -```julia -julia> celsius_to_kelvin(x) = x + 273 -celsius_to_kelvin (generic function with 1 method) - -julia> transform( - df, - Cols(r"Temp") .=> ByRow(celsius_to_kelvin), - renamecols = false - ) -4×4 DataFrame - Row │ Time Temperature1 Temperature2 Temperature3 - │ Int64 Int64 Int64 Int64 -─────┼───────────────────────────────────────────────── - 1 │ 1 293 306 288 - 2 │ 2 296 310 283 - 3 │ 3 298 314 277 - 4 │ 4 301 317 273 -``` -Or, simultaneously changing the column names: - -```julia -julia> rename_function(s) = "Temperature $(last(s)) (K)" -rename_function (generic function with 1 method) - -julia> select( - df, - "Time", - Cols(r"Temp") .=> ByRow(celsius_to_kelvin) .=> rename_function - ) -4×4 DataFrame - Row │ Time Temperature 1 (K) Temperature 2 (K) Temperature 3 (K) - │ Int64 Int64 Int64 Int64 -─────┼──────────────────────────────────────────────────────────────── - 1 │ 1 293 306 288 - 2 │ 2 296 310 283 - 3 │ 3 298 314 277 - 4 │ 4 301 317 273 -``` - -!!! Note Notes - * `Not("Time")` or `2:4` would have been equally good choices for `source_column_selector` in the above operations. - * Don't forget `ByRow` if your function is to be applied to elements rather than entire column vectors. - Without `ByRow`, the manipulations above would have thrown - `ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)`. - * Regular expression (`r""`) and `:` `source_column_selectors` - must be wrapped in `Cols` to be properly broadcasted - because otherwise the broadcasting occurs before the expression is expanded into a vector of matches. - -You could also broadcast different columns to different functions -by supplying a vector of functions. 
- -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> f1(x) = x .+ 1 -f1 (generic function with 1 method) - -julia> f2(x) = x ./ 10 -f2 (generic function with 1 method) - -julia> transform(df, [:a, :b] .=> [f1, f2]) -4×4 DataFrame - Row │ a b a_f1 b_f2 - │ Int64 Int64 Int64 Float64 -─────┼────────────────────────────── - 1 │ 1 5 2 0.5 - 2 │ 2 6 3 0.6 - 3 │ 3 7 4 0.7 - 4 │ 4 8 5 0.8 -``` - -However, this form is not much more convenient than supplying -multiple individual operations. - -```julia -julia> transform(df, [:a => f1, :b => f2]) # same manipulation as previous -4×4 DataFrame - Row │ a b a_f1 b_f2 - │ Int64 Int64 Int64 Float64 -─────┼────────────────────────────── - 1 │ 1 5 2 0.5 - 2 │ 2 6 3 0.6 - 3 │ 3 7 4 0.7 - 4 │ 4 8 5 0.8 -``` - -Perhaps more useful for broadcasting syntax -is to apply multiple functions to multiple columns -by changing the vector of functions to a 1-by-x matrix of functions. -(Recall that a list, a vector, or a matrix of operation pairs are all valid -for passing to the manipulation functions.) - -```julia -julia> [:a, :b] .=> [f1 f2] # No comma `,` between f1 and f2 -2×2 Matrix{Pair{Symbol}}: - :a=>f1 :a=>f2 - :b=>f1 :b=>f2 - -julia> transform(df, [:a, :b] .=> [f1 f2]) # No comma `,` between f1 and f2 -4×6 DataFrame - Row │ a b a_f1 b_f1 a_f2 b_f2 - │ Int64 Int64 Int64 Int64 Float64 Float64 -─────┼────────────────────────────────────────────── - 1 │ 1 5 2 6 0.1 0.5 - 2 │ 2 6 3 7 0.2 0.6 - 3 │ 3 7 4 8 0.3 0.7 - 4 │ 4 8 5 9 0.4 0.8 -``` - -In this way, every combination of selected columns and functions will be applied. - -Pair broadcasting is a simple but powerful tool -that can be used in any of the manipulation functions listed under -[Basic Usage of Manipulation Functions](@ref). -Experiment for yourself to discover other useful operations. 
- -#### More Manipulation Examples with the German Dataset - -Let us move to the examples of application of these rules using the German dataset. +In DataFrames.jl there are seven functions +which can be used to perform operations on data frame columns: + +- `combine`: creates a new data frame populated with columns that result from + operations applied to the source data frame columns, potentially combining + its rows; +- `select`: creates a new data frame that has the same number of rows as the + source data frame populated with columns that result from operations + applied to the source data frame columns; +- `select!`: the same as `select` but updates the passed data frame in place; +- `transform`: the same as `select` but keeps the columns that were already + present in the data frame (note though that these columns can be potentially + modified by the transformation passed to `transform`); +- `transform!`: the same as `transform` but updates the passed data frame in + place. +- `subset`: creates a new data frame populated with the same columns +as the source data frame, but with only the rows where the passed operations are true; +- `subset!`: the same as `subset` but updates the passed data frame in place; + +These functions and their methods are explained in more detail in the section +[Data Frame Manipulation Functions](@ref). +In this section, we will move straight to examples using the German dataset. ```jldoctest dataframe julia> using Statistics @@ -3443,26 +2151,5 @@ julia> select(german, :Age, :Job, [:Age, :Job] => (+) => :res) 985 rows omitted ``` -This concludes the introductory explaination of data frame manipulations. -For more advanced examples, -see later sections of the manual or the additional resources below. - -#### Additional Resources -More details and examples of operation pair syntax can be found in -[this blog post](https://bkamins.github.io/julialang/2020/12/24/minilanguage.html). 
-(The official wording describing the syntax has changed since the blog post was written, -but the examples are still illustrative. -The operation pair syntax is sometimes referred to as the DataFrames.jl mini-language -or Domain-Specific Language.) - -For additional practice, -an interactive tutorial is provided on a variety of introductory topics -by the DataFrames.jl package author -[here](https://github.com/bkamins/Julia-DataFrames-Tutorial). - - -For additional syntax niceties, -many users find the [Chain.jl](https://github.com/jkrumbiegel/Chain.jl) -and [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl) -packages useful -to help simplify manipulations that may be tedious with operation pairs alone. \ No newline at end of file +This concludes the introductory explanation of data frame manipulations. +For more advanced examples, see later sections of the manual. diff --git a/docs/src/man/manipulation_functions.md b/docs/src/man/manipulation_functions.md new file mode 100644 index 000000000..da4fb1e63 --- /dev/null +++ b/docs/src/man/manipulation_functions.md @@ -0,0 +1,1345 @@ +# Data Frame Manipulation Functions + +The seven functions below can be used to manipulate data frames +by applying operations to them. + +The functions without a `!` in their name +will create a new data frame based on the source data frame, +so you will probably want to store the new data frame to a new variable name, +e.g. `new_df = transform(source_df, operation)`. +The functions with a `!` at the end of their name +will modify an existing data frame in-place, +so there is typically no need to assign the result to a variable, +e.g. `transform!(source_df, operation)` instead of +`source_df = transform(source_df, operation)`. + +The number of columns and rows in the resultant data frame varies +depending on the manipulation function employed. 
+ +| Function | Memory Usage | Column Retention | Row Retention | +| ------------ | -------------------------------- | --------------------------------------- | --------------------------------------------------- | +| `transform` | Creates a new data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | +| `transform!` | Modifies an existing data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | +| `select` | Creates a new data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | +| `select!` | Modifies an existing data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | +| `subset` | Creates a new data frame. | Retains original columns. | Retains only rows where condition is true. | +| `subset!` | Modifies an existing data frame. | Retains original columns. | Retains only rows where condition is true. | +| `combine` | Creates a new data frame. | Retains only resultant columns. | Retains only resultant rows. | + +## Constructing Operations + +All of the functions above use the same syntax which is commonly +`manipulation_function(dataframe, operation)`. 
+The `operation` argument defines the +operation to be applied to the source `dataframe`, +and it can take any of the following common forms explained below: + +`source_column_selector` +: selects source column(s) without manipulating or renaming them + + Examples: `:a`, `[:a, :b]`, `All()`, `Not(:a)` + +`source_column_selector => operation_function` +: passes source column(s) as arguments to a function +and automatically names the resulting column(s) + + Examples: `:a => sum`, `[:a, :b] => +`, `:a => ByRow(==(3))` + +`source_column_selector => operation_function => new_column_names` +: passes source column(s) as arguments to a function +and names the resulting column(s) `new_column_names` + + Examples: `:a => sum => :sum_of_a`, `[:a, :b] => + => :a_plus_b` + + *(Not available for `subset` or `subset!`)* + +`source_column_selector => new_column_names` +: renames a source column, +or splits a column containing collection elements into multiple new columns + + Examples: `:a => :new_a`, `:a_b => [:a, :b]`, `:nt => AsTable` + + (*Not available for `subset` or `subset!`*) + +The `=>` operator constructs a +[Pair](https://docs.julialang.org/en/v1/base/collections/#Core.Pair), +which is a type to link one object to another. +(Pairs are commonly used to create elements of a +[Dictionary](https://docs.julialang.org/en/v1/base/collections/#Dictionaries).) +In DataFrames.jl manipulation functions, +`Pair` arguments are used to define column `operations` to be performed. +The provided examples will be explained in more detail below. + +The manipulation functions also have methods for applying multiple operations. +See the later sections [Multiple Operations per Manipulation](@ref) +and [Broadcasting Operation Pairs](@ref) for more information. + +### `source_column_selector` +Inside an `operation`, `source_column_selector` is usually a column name +or column index which identifies a data frame column. 
+ +`source_column_selector` may be used as the entire `operation` +with `select` or `select!` to isolate or reorder columns. + +```julia +julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = [7, 8, 9]) +3×3 DataFrame + Row │ a b c + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 7 + 2 │ 2 5 8 + 3 │ 3 6 9 + +julia> select(df, :b) +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 + +julia> select(df, "b") +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 + +julia> select(df, 2) +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 +``` + +`source_column_selector` may also be used as the entire `operation` +with `subset` or `subset!` if the source column contains `Bool` values. + +```julia +julia> df = DataFrame( + name = ["Scott", "Jill", "Erica", "Jimmy"], + minor = [false, true, false, true], + ) +4×2 DataFrame + Row │ name minor + │ String Bool +─────┼─────────────── + 1 │ Scott false + 2 │ Jill true + 3 │ Erica false + 4 │ Jimmy true + +julia> subset(df, :minor) +2×2 DataFrame + Row │ name minor + │ String Bool +─────┼─────────────── + 1 │ Jill true + 2 │ Jimmy true +``` + +`source_column_selector` may instead be a collection of columns such as a vector, +a [regular expression](https://docs.julialang.org/en/v1/manual/strings/#Regular-Expressions), +a `Not`, `Between`, `All`, or `Cols` expression, +or a `:`. +See the [Indexing](@ref) API for the full list of possible values with references. + +!!! Note + The Julia parser sometimes prevents `:` from being used by itself. + If you get + `ERROR: syntax: whitespace not allowed after ":" used for quoting`, + try using `All()`, `Cols(:)`, or `(:)` instead to select all columns. 
+ +```julia +julia> df = DataFrame( + id = [1, 2, 3], + first_name = ["José", "Emma", "Nathan"], + last_name = ["Garcia", "Marino", "Boyer"], + age = [61, 24, 33] + ) +3×4 DataFrame + Row │ id first_name last_name age + │ Int64 String String Int64 +─────┼───────────────────────────────────── + 1 │ 1 José Garcia 61 + 2 │ 2 Emma Marino 24 + 3 │ 3 Nathan Boyer 33 + +julia> select(df, [:last_name, :first_name]) +3×2 DataFrame + Row │ last_name first_name + │ String String +─────┼─────────────────────── + 1 │ Garcia José + 2 │ Marino Emma + 3 │ Boyer Nathan + +julia> select(df, r"name") +3×2 DataFrame + Row │ first_name last_name + │ String String +─────┼─────────────────────── + 1 │ José Garcia + 2 │ Emma Marino + 3 │ Nathan Boyer + +julia> select(df, Not(:id)) +3×3 DataFrame + Row │ first_name last_name age + │ String String Int64 +─────┼────────────────────────────── + 1 │ José Garcia 61 + 2 │ Emma Marino 24 + 3 │ Nathan Boyer 33 + +julia> select(df, Between(2,4)) +3×3 DataFrame + Row │ first_name last_name age + │ String String Int64 +─────┼────────────────────────────── + 1 │ José Garcia 61 + 2 │ Emma Marino 24 + 3 │ Nathan Boyer 33 + +julia> df2 = DataFrame( + name = ["Scott", "Jill", "Erica", "Jimmy"], + minor = [false, true, false, true], + male = [true, false, false, true], + ) +4×3 DataFrame + Row │ name minor male + │ String Bool Bool +─────┼────────────────────── + 1 │ Scott false true + 2 │ Jill true false + 3 │ Erica false false + 4 │ Jimmy true true + +julia> subset(df2, [:minor, :male]) +1×3 DataFrame + Row │ name minor male + │ String Bool Bool +─────┼───────────────────── + 1 │ Jimmy true true +``` + +### `operation_function` +Inside an `operation` pair, `operation_function` is a function +which operates on data frame columns passed as vectors. +When multiple columns are selected by `source_column_selector`, +the `operation_function` will receive the columns as separate positional arguments +in the order they were selected, e.g. 
`f(column1, column2, column3)`. + +```julia +julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 4]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 4 + +julia> combine(df, :a => sum) +1×1 DataFrame + Row │ a_sum + │ Int64 +─────┼─────── + 1 │ 6 + +julia> transform(df, :b => maximum) # `transform` and `select` copy scalar result to all rows +3×3 DataFrame + Row │ a b b_maximum + │ Int64 Int64 Int64 +─────┼───────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 5 + 3 │ 3 4 5 + +julia> transform(df, [:b, :a] => -) # vector subtraction is okay +3×3 DataFrame + Row │ a b b_a_- + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 3 + 2 │ 2 5 3 + 3 │ 3 4 1 + +julia> transform(df, [:a, :b] => *) # vector multiplication is not defined +ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64}) +``` + +Don't worry! There is a quick fix for the previous error. +If you want to apply a function to each element in a column +instead of to the entire column vector, +then you can wrap your element-wise function in `ByRow` like +`ByRow(my_elementwise_function)`. +This will apply `my_elementwise_function` to every element in the column +and then collect the results back into a vector. + +```julia +julia> transform(df, [:a, :b] => ByRow(*)) +3×3 DataFrame + Row │ a b a_b_* + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 4 + 2 │ 2 5 10 + 3 │ 3 4 12 + +julia> transform(df, Cols(:) => ByRow(max)) +3×3 DataFrame + Row │ a b a_b_max + │ Int64 Int64 Int64 +─────┼─────────────────────── + 1 │ 1 4 4 + 2 │ 2 5 5 + 3 │ 3 4 4 + +julia> f(x) = x + 1 +f (generic function with 1 method) + +julia> transform(df, :a => ByRow(f)) +3×3 DataFrame + Row │ a b a_f + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 2 + 2 │ 2 5 3 + 3 │ 3 4 4 +``` + +Alternatively, you may just want to define the function itself so it +[broadcasts](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) +over vectors. 
+ +```julia +julia> g(x) = x .+ 1 +g (generic function with 1 method) + +julia> transform(df, :a => g) +3×3 DataFrame + Row │ a b a_g + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 2 + 2 │ 2 5 3 + 3 │ 3 4 4 + +julia> h(x, y) = 2x .+ y +h (generic function with 1 method) + +julia> transform(df, [:a, :b] => h) +3×3 DataFrame + Row │ a b a_b_h + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 6 + 2 │ 2 5 9 + 3 │ 3 4 10 +``` + +[Anonymous functions](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions) +are a convenient way to define and use an `operation_function` +all within the manipulation function call. + +```julia +julia> select(df, :a => ByRow(x -> x + 1)) +3×1 DataFrame + Row │ a_function + │ Int64 +─────┼──────────── + 1 │ 2 + 2 │ 3 + 3 │ 4 + +julia> transform(df, [:a, :b] => ByRow((x, y) -> 2x + y)) +3×3 DataFrame + Row │ a b a_b_function + │ Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 4 6 + 2 │ 2 5 9 + 3 │ 3 4 10 + +julia> subset(df, :b => ByRow(x -> x < 5)) +2×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 3 4 + +julia> subset(df, :b => ByRow(<(5))) # shorter version of the previous +2×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 3 4 +``` + +!!! Note + `operation_functions` within `subset` or `subset!` function calls + must return a Boolean vector. + `true` elements in the Boolean vector will determine + which rows are retained in the resulting data frame. + +As demonstrated above, `DataFrame` columns are usually passed +from `source_column_selector` to `operation_function` as one or more +vector arguments. +However, when `AsTable(source_column_selector)` is used, +the selected columns are collected and passed as a single `NamedTuple` +to `operation_function`. + +This is often useful when your `operation_function` is defined to operate +on a single collection argument rather than on multiple positional arguments. 
+The distinction is somewhat similar to the difference between the built-in +`min` and `minimum` functions. +`min` is defined to find the minimum value among multiple positional arguments, +while `minimum` is defined to find the minimum value +among the elements of a single collection argument. + +```julia +julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 2:-1:1) +2×4 DataFrame + Row │ a b c d + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 3 5 2 + 2 │ 2 4 6 1 + +julia> select(df, Cols(:) => ByRow(min)) # min operates on multiple arguments +2×1 DataFrame + Row │ a_b_etc_min + │ Int64 +─────┼───────────── + 1 │ 1 + 2 │ 1 + +julia> select(df, AsTable(:) => ByRow(minimum)) # minimum operates on a collection +2×1 DataFrame + Row │ a_b_etc_minimum + │ Int64 +─────┼───────────────── + 1 │ 1 + 2 │ 1 + +julia> select(df, [:a,:b] => ByRow(+)) # `+` operates on multiple arguments +2×1 DataFrame + Row │ a_b_+ + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 6 + +julia> select(df, AsTable([:a,:b]) => ByRow(sum)) # `sum` operates on a collection +2×1 DataFrame + Row │ a_b_sum + │ Int64 +─────┼───────── + 1 │ 4 + 2 │ 6 + +julia> using Statistics # contains the `mean` function + +julia> select(df, AsTable(Between(:b, :d)) => ByRow(mean)) # `mean` operates on a collection +2×1 DataFrame + Row │ b_c_d_mean + │ Float64 +─────┼──────────── + 1 │ 3.33333 + 2 │ 3.66667 +``` + +`AsTable` can also be used to pass columns to a function which operates +on fields of a `NamedTuple`. 
+ +```julia +julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 7:8) +2×4 DataFrame + Row │ a b c d + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 3 5 7 + 2 │ 2 4 6 8 + +julia> f(nt) = nt.a + nt.d +f (generic function with 1 method) + +julia> transform(df, AsTable(:) => ByRow(f)) +2×5 DataFrame + Row │ a b c d a_b_etc_f + │ Int64 Int64 Int64 Int64 Int64 +─────┼─────────────────────────────────────── + 1 │ 1 3 5 7 8 + 2 │ 2 4 6 8 10 +``` + +As demonstrated above, +in the `source_column_selector => operation_function` operation pair form, +the results of an operation will be placed into a new column with an +automatically-generated name based on the operation; +the new column name will be the `operation_function` name +appended to the source column name(s) with an underscore. + +This automatic column naming behavior can be avoided in two ways. +First, the operation result can be placed back into the original column +with the original column name by switching the keyword argument `renamecols` +from its default value (`true`) to `renamecols=false`. + +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :a => ByRow(x->x+10), renamecols=false) # add 10 in-place +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 11 5 + 2 │ 12 6 + 3 │ 13 7 + 4 │ 14 8 +``` + +The second method to avoid the default manipulation column naming is to +specify your own `new_column_names`. + +### `new_column_names` + +`new_column_names` can be included at the end of an `operation` pair to specify +the name of the new column(s). +`new_column_names` may be a symbol, string, function, vector of symbols, vector of strings, or `AsTable`. 
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, Cols(:) => ByRow(+) => :c) +4×3 DataFrame + Row │ a b c + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, Cols(:) => ByRow(+) => "a+b") +4×3 DataFrame + Row │ a b a+b + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, :a => ByRow(x->x+10) => "a+10") +4×3 DataFrame + Row │ a b a+10 + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 11 + 2 │ 2 6 12 + 3 │ 3 7 13 + 4 │ 4 8 14 +``` + +The `source_column_selector => new_column_names` operation form +can be used to rename columns without an intermediate function. +However, there are `rename` and `rename!` functions, +which accept similar syntax, +that tend to be more useful for this operation. + +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :a => :apple) # adds column `apple` +4×3 DataFrame + Row │ a b apple + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 1 + 2 │ 2 6 2 + 3 │ 3 7 3 + 4 │ 4 8 4 + +julia> select(df, :a => :apple) # retains only column `apple` +4×1 DataFrame + Row │ apple + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 + 4 │ 4 + +julia> rename(df, :a => :apple) # renames column `a` to `apple`, keeping its position +4×2 DataFrame + Row │ apple b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 +``` + +Additionally, in the +`source_column_selector => operation_function => new_column_names` operation form, +`new_column_names` may be a renaming function which operates on a string +to create the destination column names programmatically. 
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> add_prefix(s) = "new_" * s +add_prefix (generic function with 1 method) + +julia> transform(df, :a => (x -> 10 .* x) => add_prefix) # with named renaming function +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 10 + 2 │ 2 6 20 + 3 │ 3 7 30 + 4 │ 4 8 40 + +julia> transform(df, :a => (x -> 10 .* x) => (s -> "new_" * s)) # with anonymous renaming function +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 10 + 2 │ 2 6 20 + 3 │ 3 7 30 + 4 │ 4 8 40 +``` + +!!! Note + It is a good idea to wrap anonymous functions in parentheses + to avoid the `=>` operator accidentally becoming part of the anonymous function. + The examples above do not work correctly without the parentheses! + ```julia + julia> transform(df, :a => x -> 10 .* x => add_prefix) # Not what we wanted! + 4×3 DataFrame + Row │ a b a_function + │ Int64 Int64 Pair… + ─────┼──────────────────────────────────────────── + 1 │ 1 5 [10, 20, 30, 40]=>add_prefix + 2 │ 2 6 [10, 20, 30, 40]=>add_prefix + 3 │ 3 7 [10, 20, 30, 40]=>add_prefix + 4 │ 4 8 [10, 20, 30, 40]=>add_prefix + + julia> transform(df, :a => x -> 10 .* x => s -> "new_" * s) # Not what we wanted! + 4×3 DataFrame + Row │ a b a_function + │ Int64 Int64 Pair… + ─────┼───────────────────────────────────── + 1 │ 1 5 [10, 20, 30, 40]=>#18 + 2 │ 2 6 [10, 20, 30, 40]=>#18 + 3 │ 3 7 [10, 20, 30, 40]=>#18 + 4 │ 4 8 [10, 20, 30, 40]=>#18 + ``` + +A renaming function will not work in the +`source_column_selector => new_column_names` operation form +because a function in the second element of the operation pair is assumed to take +the `source_column_selector => operation_function` operation form. 
+To work around this limitation, use the +`source_column_selector => operation_function => new_column_names` operation form +with `identity` as the `operation_function`. + +```julia +julia> transform(df, :a => add_prefix) +ERROR: MethodError: no method matching *(::String, ::Vector{Int64}) + +julia> transform(df, :a => identity => add_prefix) +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 1 + 2 │ 2 6 2 + 3 │ 3 7 3 + 4 │ 4 8 4 +``` + +In this case though, +it is probably again more useful to use the `rename` or `rename!` function +rather than one of the manipulation functions +in order to rename the columns directly and avoid the intermediate `operation_function`. +```julia +julia> rename(add_prefix, df) # rename all columns with a function +4×2 DataFrame + Row │ new_a new_b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> rename(add_prefix, df; cols=:a) # rename some columns with a function +4×2 DataFrame + Row │ new_a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 +``` + +In the `source_column_selector => new_column_names` operation form, +only a single source column may be selected per operation, +so why is `new_column_names` plural? +It is possible to split the data contained inside a single column +into multiple new columns by supplying a vector of strings or symbols +as `new_column_names`. + +```julia +julia> df = DataFrame(data = [(1,2), (3,4)]) # vector of tuples +2×1 DataFrame + Row │ data + │ Tuple… +─────┼──────── + 1 │ (1, 2) + 2 │ (3, 4) + +julia> transform(df, :data => [:first, :second]) # manual naming +2×3 DataFrame + Row │ data first second + │ Tuple… Int64 Int64 +─────┼─────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` + +This kind of data splitting can even be done automatically with `AsTable`. 
+ +```julia +julia> transform(df, :data => AsTable) # default automatic naming with tuples +2×3 DataFrame + Row │ data x1 x2 + │ Tuple… Int64 Int64 +─────┼────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` + +If a data frame column contains `NamedTuple`s, +then `AsTable` will preserve the field names. +```julia +julia> df = DataFrame(data = [(a=1,b=2), (a=3,b=4)]) # vector of named tuples +2×1 DataFrame + Row │ data + │ NamedTup… +─────┼──────────────── + 1 │ (a = 1, b = 2) + 2 │ (a = 3, b = 4) + +julia> transform(df, :data => AsTable) # keeps names from named tuples +2×3 DataFrame + Row │ data a b + │ NamedTup… Int64 Int64 +─────┼────────────────────────────── + 1 │ (a = 1, b = 2) 1 2 + 2 │ (a = 3, b = 4) 3 4 +``` + +!!! Note + To pack multiple columns into a single column of `NamedTuple`s + (the reverse of the above operation), + apply the `identity` function with `ByRow`, e.g. + `transform(df, AsTable([:a, :b]) => ByRow(identity) => :data)`. + +Renaming functions also work for multi-column transformations, +but they must operate on a vector of strings. 
+ +```julia +julia> df = DataFrame(data = [(1,2), (3,4)]) +2×1 DataFrame + Row │ data + │ Tuple… +─────┼──────── + 1 │ (1, 2) + 2 │ (3, 4) + +julia> new_names(v) = ["primary ", "secondary "] .* v +new_names (generic function with 1 method) + +julia> transform(df, :data => identity => new_names) +2×3 DataFrame + Row │ data primary data secondary data + │ Tuple… Int64 Int64 +─────┼────────────────────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` + +## Applying Multiple Operations per Manipulation +All data frame manipulation functions can accept multiple `operation` pairs +at once using any of the following methods: +- `manipulation_function(dataframe, operation1, operation2)` : multiple arguments +- `manipulation_function(dataframe, [operation1, operation2])` : vector argument +- `manipulation_function(dataframe, [operation1 operation2])` : matrix argument + +Passing multiple operations is especially useful for the `select`, `select!`, +and `combine` manipulation functions, +since they only retain columns which are a result of the passed operations. 
+ +```julia +julia> df = DataFrame(a = 1:4, b = [50,50,60,60], c = ["hat","bat","cat","dog"]) +4×3 DataFrame + Row │ a b c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 1 50 hat + 2 │ 2 50 bat + 3 │ 3 60 cat + 4 │ 4 60 dog + +julia> combine(df, :a => maximum, :b => sum, :c => join) # 3 combine operations +1×3 DataFrame + Row │ a_maximum b_sum c_join + │ Int64 Int64 String +─────┼──────────────────────────────── + 1 │ 4 220 hatbatcatdog + +julia> select(df, :c, :b, :a) # re-order columns +4×3 DataFrame + Row │ c b a + │ String Int64 Int64 +─────┼────────────────────── + 1 │ hat 50 1 + 2 │ bat 50 2 + 3 │ cat 60 3 + 4 │ dog 60 4 + +julia> select(df, :b, :) # `:` here means all other columns +4×3 DataFrame + Row │ b a c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 50 1 hat + 2 │ 50 2 bat + 3 │ 60 3 cat + 4 │ 60 4 dog + +julia> select( + df, + :c => (x -> "a " .* x) => :one_c, + :a => (x -> 100x), + :b, + renamecols=false + ) # can mix operation forms +4×3 DataFrame + Row │ one_c a b + │ String Int64 Int64 +─────┼────────────────────── + 1 │ a hat 100 50 + 2 │ a bat 200 50 + 3 │ a cat 300 60 + 4 │ a dog 400 60 + +julia> select( + df, + :c => ByRow(reverse), + :c => ByRow(uppercase) + ) # multiple operations on same column +4×2 DataFrame + Row │ c_reverse c_uppercase + │ String String +─────┼──────────────────────── + 1 │ tah HAT + 2 │ tab BAT + 3 │ tac CAT + 4 │ god DOG +``` + +In the last two examples, +the manipulation function arguments were split across multiple lines. +This is a good way to make manipulations with many operations more readable. + +Passing multiple operations to `subset` or `subset!` is an easy way to narrow in +on a particular row of data. 
+ +```julia +julia> subset( + df, + :b => ByRow(==(60)), + :c => ByRow(contains("at")) + ) # rows with 60 and "at" +1×3 DataFrame + Row │ a b c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 3 60 cat +``` + +Note that all operations within a single manipulation must use the data +as it existed before the function call, +i.e., you cannot use newly created columns for subsequent operations +within the same manipulation. + +```julia +julia> transform( + df, + [:a, :b] => ByRow(+) => :d, + :d => (x -> x ./ 2), + ) # requires two separate transformations +ERROR: ArgumentError: column name :d not found in the data frame; existing most similar names are: :a, :b and :c + +julia> new_df = transform(df, [:a, :b] => ByRow(+) => :d) +4×4 DataFrame + Row │ a b c d + │ Int64 Int64 String Int64 +─────┼───────────────────────────── + 1 │ 1 50 hat 51 + 2 │ 2 50 bat 52 + 3 │ 3 60 cat 63 + 4 │ 4 60 dog 64 + +julia> transform!(new_df, :d => (x -> x ./ 2) => :d_2) +4×5 DataFrame + Row │ a b c d d_2 + │ Int64 Int64 String Int64 Float64 +─────┼────────────────────────────────────── + 1 │ 1 50 hat 51 25.5 + 2 │ 2 50 bat 52 26.0 + 3 │ 3 60 cat 63 31.5 + 4 │ 4 60 dog 64 32.0 +``` + + +## Broadcasting Operation Pairs + +[Broadcasting](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) +pairs with `.=>` is often a convenient way to generate multiple +similar `operation`s to be applied within a single manipulation. +Broadcasting within the `Pair` of an `operation` is no different than +broadcasting in base Julia. +The broadcasting `.=>` will be expanded into a vector of pairs +(`[operation1, operation2, ...]`), +and this expansion will occur before the manipulation function is invoked. +Then the manipulation function will use the +`manipulation_function(dataframe, [operation1, operation2, ...])` method. +This process will be explained in more detail below. + +To illustrate these concepts, let us first examine the `Type` of a basic `Pair`. 
+In DataFrames.jl, a symbol, string, or integer +may be used to select a single column. +Some `Pair`s with these types are below. + +```julia +julia> typeof(:x => :a) +Pair{Symbol, Symbol} + +julia> typeof("x" => "a") +Pair{String, String} + +julia> typeof(1 => "a") +Pair{Int64, String} +``` + +Any of the `Pair`s above could be used to rename the first column +of the data frame below to `a`. + +```julia +julia> df = DataFrame(x = 1:3, y = 4:6) +3×2 DataFrame + Row │ x y + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 + +julia> select(df, :x => :a) +3×1 DataFrame + Row │ a + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 + +julia> select(df, 1 => "a") +3×1 DataFrame + Row │ a + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 +``` + +What should we do if we want to keep and rename both the `x` and `y` columns? +One option is to supply a `Vector` of operation `Pair`s to `select`. +`select` will process all of these operations in order. + +```julia +julia> ["x" => "a", "y" => "b"] +2-element Vector{Pair{String, String}}: + "x" => "a" + "y" => "b" + +julia> select(df, ["x" => "a", "y" => "b"]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 +``` + +We can use broadcasting to simplify the syntax above. + +```julia +julia> ["x", "y"] .=> ["a", "b"] +2-element Vector{Pair{String, String}}: + "x" => "a" + "y" => "b" + +julia> select(df, ["x", "y"] .=> ["a", "b"]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 +``` + +Notice that `select` sees the same `Vector{Pair{String, String}}` operation +argument whether the individual pairs are written out explicitly or +constructed with broadcasting. +The broadcasting is applied before the call to `select`. + +```julia +julia> ["x" => "a", "y" => "b"] == (["x", "y"] .=> ["a", "b"]) +true +``` + +!!! Note + These operation pairs (or vector of pairs) can be given variable names. 
+ This is uncommon in practice but could be helpful for intermediate + inspection and testing. + ```julia + df = DataFrame(x = 1:3, y = 4:6) # create data frame + operation = ["x", "y"] .=> ["a", "b"] # save operation to variable + typeof(operation) # check type of operation + first(operation) # check first pair in operation + last(operation) # check last pair in operation + select(df, operation) # manipulate `df` with `operation` + ``` + +In Julia, +a non-vector broadcasted with a vector will be repeated in each resultant pair element. + +```julia +julia> ["x", "y"] .=> :a # :a is repeated +2-element Vector{Pair{String, Symbol}}: + "x" => :a + "y" => :a + +julia> 1 .=> [:a, :b] # 1 is repeated +2-element Vector{Pair{Int64, Symbol}}: + 1 => :a + 1 => :b +``` + +We can use this fact to easily broadcast an `operation_function` to multiple columns. + +```julia +julia> f(x) = 2 * x +f (generic function with 1 method) + +julia> ["x", "y"] .=> f # f is repeated +2-element Vector{Pair{String, typeof(f)}}: + "x" => f + "y" => f + +julia> select(df, ["x", "y"] .=> f) # apply f with automatic column renaming +3×2 DataFrame + Row │ x_f y_f + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 + +julia> ["x", "y"] .=> f .=> ["a", "b"] # f is repeated +2-element Vector{Pair{String, Pair{typeof(f), String}}}: + "x" => (f => "a") + "y" => (f => "b") + +julia> select(df, ["x", "y"] .=> f .=> ["a", "b"]) # apply f with manual column renaming +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 +``` + +A renaming function can be applied to multiple columns in the same way. +It will also be repeated in each operation `Pair`. 
+ +```julia +julia> newname(s::String) = s * "_new" +newname (generic function with 1 method) + +julia> ["x", "y"] .=> f .=> newname # both f and newname are repeated +2-element Vector{Pair{String, Pair{typeof(f), typeof(newname)}}}: + "x" => (f => newname) + "y" => (f => newname) + +julia> select(df, ["x", "y"] .=> f .=> newname) # apply f then rename column with newname +3×2 DataFrame + Row │ x_new y_new + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 +``` + +You can see from the type output above +that a three element pair does not actually exist. +A `Pair` (as the name implies) can only contain two elements. +Thus, `:x => :y => :z` becomes a nested `Pair`, +where `:x` is the first element and points to the `Pair` `:y => :z`, +which is the second element. + +```julia +julia> p = :x => :y => :z +:x => (:y => :z) + +julia> p[1] +:x + +julia> p[2] +:y => :z + +julia> p[2][1] +:y + +julia> p[2][2] +:z + +julia> p[3] # there is no index 3 for a pair +ERROR: BoundsError: attempt to access Pair{Symbol, Pair{Symbol, Symbol}} at index [3] +``` + +In the previous examples, the source columns have been individually selected. +When broadcasting multiple columns to the same function, +often similarities in the column names or position can be exploited to avoid +tedious selection. +Consider a data frame with temperature data at three different locations +taken over time. +```julia +julia> df = DataFrame(Time = 1:4, + Temperature1 = [20, 23, 25, 28], + Temperature2 = [33, 37, 41, 44], + Temperature3 = [15, 10, 4, 0]) +4×4 DataFrame + Row │ Time Temperature1 Temperature2 Temperature3 + │ Int64 Int64 Int64 Int64 +─────┼───────────────────────────────────────────────── + 1 │ 1 20 33 15 + 2 │ 2 23 37 10 + 3 │ 3 25 41 4 + 4 │ 4 28 44 0 +``` + +To convert all of the temperature data in one transformation, +we just need to define a conversion function and broadcast +it to all of the "Temperature" columns. 
+ +```julia +julia> celsius_to_kelvin(x) = x + 273 +celsius_to_kelvin (generic function with 1 method) + +julia> transform( + df, + Cols(r"Temp") .=> ByRow(celsius_to_kelvin), + renamecols = false + ) +4×4 DataFrame + Row │ Time Temperature1 Temperature2 Temperature3 + │ Int64 Int64 Int64 Int64 +─────┼───────────────────────────────────────────────── + 1 │ 1 293 306 288 + 2 │ 2 296 310 283 + 3 │ 3 298 314 277 + 4 │ 4 301 317 273 +``` +Or, simultaneously changing the column names: + +```julia +julia> rename_function(s) = "Temperature $(last(s)) (K)" +rename_function (generic function with 1 method) + +julia> select( + df, + "Time", + Cols(r"Temp") .=> ByRow(celsius_to_kelvin) .=> rename_function + ) +4×4 DataFrame + Row │ Time Temperature 1 (K) Temperature 2 (K) Temperature 3 (K) + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────────────────── + 1 │ 1 293 306 288 + 2 │ 2 296 310 283 + 3 │ 3 298 314 277 + 4 │ 4 301 317 273 +``` + +!!! Note Notes + * `Not("Time")` or `2:4` would have been equally good choices for `source_column_selector` in the above operations. + * Don't forget `ByRow` if your function is to be applied to elements rather than entire column vectors. + Without `ByRow`, the manipulations above would have thrown + `ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)`. + * Regular expression (`r""`) and `:` `source_column_selectors` + must be wrapped in `Cols` to be properly broadcasted + because otherwise the broadcasting occurs before the expression is expanded into a vector of matches. + +You could also broadcast different columns to different functions +by supplying a vector of functions. 
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> f1(x) = x .+ 1 +f1 (generic function with 1 method) + +julia> f2(x) = x ./ 10 +f2 (generic function with 1 method) + +julia> transform(df, [:a, :b] .=> [f1, f2]) +4×4 DataFrame + Row │ a b a_f1 b_f2 + │ Int64 Int64 Int64 Float64 +─────┼────────────────────────────── + 1 │ 1 5 2 0.5 + 2 │ 2 6 3 0.6 + 3 │ 3 7 4 0.7 + 4 │ 4 8 5 0.8 +``` + +However, this form is not much more convenient than supplying +multiple individual operations. + +```julia +julia> transform(df, [:a => f1, :b => f2]) # same manipulation as previous +4×4 DataFrame + Row │ a b a_f1 b_f2 + │ Int64 Int64 Int64 Float64 +─────┼────────────────────────────── + 1 │ 1 5 2 0.5 + 2 │ 2 6 3 0.6 + 3 │ 3 7 4 0.7 + 4 │ 4 8 5 0.8 +``` + +Perhaps more useful for broadcasting syntax +is to apply multiple functions to multiple columns +by changing the vector of functions to a 1-by-x matrix of functions. +(Recall that a list, a vector, or a matrix of operation pairs are all valid +for passing to the manipulation functions.) + +```julia +julia> [:a, :b] .=> [f1 f2] # No comma `,` between f1 and f2 +2×2 Matrix{Pair{Symbol}}: + :a=>f1 :a=>f2 + :b=>f1 :b=>f2 + +julia> transform(df, [:a, :b] .=> [f1 f2]) # No comma `,` between f1 and f2 +4×6 DataFrame + Row │ a b a_f1 b_f1 a_f2 b_f2 + │ Int64 Int64 Int64 Int64 Float64 Float64 +─────┼────────────────────────────────────────────── + 1 │ 1 5 2 6 0.1 0.5 + 2 │ 2 6 3 7 0.2 0.6 + 3 │ 3 7 4 8 0.3 0.7 + 4 │ 4 8 5 9 0.4 0.8 +``` + +In this way, every combination of selected columns and functions will be applied. + +Pair broadcasting is a simple but powerful tool +that can be used in any of the manipulation functions listed under +[Basic Usage of Manipulation Functions](@ref). +Experiment for yourself to discover other useful operations. 
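
As one possible starting point for such experiments, the sketch below broadcasts reducing functions over every column with `combine`. The data frame `df`, the use of `names(df)` as the column selector, and the choice of `sum`, `minimum`, and `maximum` are assumptions made here for illustration; they are not taken from the examples above.

```julia
using DataFrames

# Hypothetical example data, chosen only for illustration.
df = DataFrame(a = 1:4, b = 5:8)

# `names(df)` returns the column names as a Vector{String}, so the broadcast
# expands to ["a" => sum, "b" => sum] before `combine` is ever called.
totals = combine(df, names(df) .=> sum)
# 1×2 data frame with columns `a_sum` and `b_sum`

# A 1-by-2 matrix of functions (no comma between them) pairs every selected
# column with every function, giving four operations in total.
extrema_df = combine(df, names(df) .=> [minimum maximum])
# 1×4 data frame with columns `a_minimum`, `b_minimum`, `a_maximum`, `b_maximum`
```

Because the broadcast is resolved by Julia before the manipulation function runs, the same `names(df) .=> ...` expressions could be stored in a variable and inspected with `typeof` or `first` before being passed to `combine`.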
+ +## Additional Resources +More details and examples of operation pair syntax can be found in +[this blog post](https://bkamins.github.io/julialang/2020/12/24/minilanguage.html). +(The official wording describing the syntax has changed since the blog post was written, +but the examples are still illustrative. +The operation pair syntax is sometimes referred to as the DataFrames.jl mini-language +or Domain-Specific Language.) + +For additional practice, +an interactive tutorial is provided on a variety of introductory topics +by the DataFrames.jl package author +[here](https://github.com/bkamins/Julia-DataFrames-Tutorial). + + +For additional syntax niceties, +many users find the [Chain.jl](https://github.com/jkrumbiegel/Chain.jl) +and [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl) +packages useful +to help simplify manipulations that may be tedious with operation pairs alone. \ No newline at end of file From 46363d9a1a075160d404705e89c0c5a35c671bf4 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Fri, 29 Sep 2023 11:11:44 -0400 Subject: [PATCH 16/30] Add new file to make and index --- docs/make.jl | 1 + docs/src/index.md | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/make.jl b/docs/make.jl index fa64782da..c35d55b0b 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -26,6 +26,7 @@ makedocs( "Working with DataFrames" => "man/working_with_dataframes.md", "Importing and Exporting Data (I/O)" => "man/importing_and_exporting.md", "Joins" => "man/joins.md", + "Data Frame Manipulation Functions" => "man/manipulation_functions.md", "Split-apply-combine" => "man/split_apply_combine.md", "Reshaping" => "man/reshaping_and_pivoting.md", "Sorting" => "man/sorting.md", diff --git a/docs/src/index.md b/docs/src/index.md index 1d7511908..ea8697e9b 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -218,6 +218,7 @@ page](https://github.com/JuliaData/DataFrames.jl/releases). 
Pages = ["man/basics.md", "man/getting_started.md", "man/joins.md", + "man/manipulation_functions.md", "man/split_apply_combine.md", "man/reshaping_and_pivoting.md", "man/sorting.md", @@ -277,7 +278,7 @@ missing please kindly report an issue during which it is deprecated. The situations where such a breaking change might be allowed are (still such breaking changes will be avoided if possible): - + * the affected functionality was previously clearly identified in the documentation as being subject to changes (for example in DataFrames.jl 1.4 release propagation rules of `:note`-style metadata are documented as such); From cd4c539e08beca0ddfe07a3ce322c07eae628f3d Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Fri, 29 Sep 2023 11:25:37 -0400 Subject: [PATCH 17/30] Rewrite Basics.md conclusion --- docs/src/man/basics.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index daadf7a00..5980083d0 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -2151,5 +2151,9 @@ julia> select(german, :Age, :Job, [:Age, :Job] => (+) => :res) 985 rows omitted ``` -This concludes the introductory explanation of data frame manipulations. -For more advanced examples, see later sections of the manual. +This concludes the introductory examples of data frame manipulations. +See later sections of the manual, +particularly [Data Frame Manipulation Functions](@ref), +for additional explanations and functionality, +including how to broadcast operation functions and operation pairs +and how to pass or produce multiple columns using `AsTable`. 
From d70af831ca01d77f63fbaec52141980de6c68874 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Mon, 2 Oct 2023 15:27:17 -0400 Subject: [PATCH 18/30] Review Edits Round 2 --- docs/make.jl | 2 +- docs/src/index.md | 8 +- docs/src/man/basics.md | 13 +++- docs/src/man/manipulation_functions.md | 100 +++++++++++++++++++++++-- 4 files changed, 110 insertions(+), 13 deletions(-) diff --git a/docs/make.jl b/docs/make.jl index c35d55b0b..d854981e2 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -26,7 +26,6 @@ makedocs( "Working with DataFrames" => "man/working_with_dataframes.md", "Importing and Exporting Data (I/O)" => "man/importing_and_exporting.md", "Joins" => "man/joins.md", - "Data Frame Manipulation Functions" => "man/manipulation_functions.md", "Split-apply-combine" => "man/split_apply_combine.md", "Reshaping" => "man/reshaping_and_pivoting.md", "Sorting" => "man/sorting.md", @@ -35,6 +34,7 @@ makedocs( "Data manipulation frameworks" => "man/querying_frameworks.md", "Comparison with Python/R/Stata" => "man/comparisons.md" ], + "A Gentle Introduction to Data Frame Manipulation Functions" => "man/manipulation_functions.md", "API" => Any[ "Types" => "lib/types.md", "Functions" => "lib/functions.md", diff --git a/docs/src/index.md b/docs/src/index.md index ea8697e9b..78c9ecd92 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -218,7 +218,6 @@ page](https://github.com/JuliaData/DataFrames.jl/releases). Pages = ["man/basics.md", "man/getting_started.md", "man/joins.md", - "man/manipulation_functions.md", "man/split_apply_combine.md", "man/reshaping_and_pivoting.md", "man/sorting.md", @@ -229,6 +228,13 @@ Pages = ["man/basics.md", Depth = 2 ``` +## A Gentle Introduction to Data Frame Manipulation Functions + +```@contents +Pages = ["man/manipulation_functions.md"] +Depth = 1 +``` + ## API Only exported (i.e. 
available for use without `DataFrames.` qualifier after diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 5980083d0..7f77d555b 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1586,9 +1586,14 @@ which can be used to perform operations on data frame columns: as the source data frame, but with only the rows where the passed operations are true; - `subset!`: the same as `subset` but updates the passed data frame in place; -These functions and their methods are explained in more detail in the section -[Data Frame Manipulation Functions](@ref). -In this section, we will move straight to examples using the German dataset. +!!! Note Other Resources + * For formal, comprehensive explanations of all manipulation functions, + see the [Functions](@ref) API. + + * For an informal, long-form tutorial on these functions, + see [A Gentle Introduction to Data Frame Manipulation Functions](@ref). + +Let us now move straight to examples using the German dataset. ```jldoctest dataframe julia> using Statistics @@ -2153,7 +2158,7 @@ julia> select(german, :Age, :Job, [:Age, :Job] => (+) => :res) This concludes the introductory examples of data frame manipulations. See later sections of the manual, -particularly [Data Frame Manipulation Functions](@ref), +particularly [A Gentle Introduction to Data Frame Manipulation Functions](@ref), for additional explanations and functionality, including how to broadcast operation functions and operation pairs and how to pass or produce multiple columns using `AsTable`. diff --git a/docs/src/man/manipulation_functions.md b/docs/src/man/manipulation_functions.md index da4fb1e63..db62e7adb 100644 --- a/docs/src/man/manipulation_functions.md +++ b/docs/src/man/manipulation_functions.md @@ -1,7 +1,10 @@ -# Data Frame Manipulation Functions +# A Gentle Introduction to Data Frame Manipulation Functions The seven functions below can be used to manipulate data frames by applying operations to them. 
+This section of the documentation aims to methodically build understanding +of these functions and their possible arguments +by reinforcing foundational concepts and slowly increasing complexity. The functions without a `!` in their name will create a new data frame based on the source data frame, @@ -68,11 +71,11 @@ which is a type to link one object to another. [Dictionary](https://docs.julialang.org/en/v1/base/collections/#Dictionaries).) In DataFrames.jl manipulation functions, `Pair` arguments are used to define column `operations` to be performed. -The provided examples will be explained in more detail below. +The examples shown above will be explained in more detail later. -The manipulation functions also have methods for applying multiple operations. +*The manipulation functions also have methods for applying multiple operations. See the later sections [Multiple Operations per Manipulation](@ref) -and [Broadcasting Operation Pairs](@ref) for more information. +and [Broadcasting Operation Pairs](@ref) for more information.* ### `source_column_selector` Inside an `operation`, `source_column_selector` is usually a column name @@ -494,6 +497,8 @@ This automatic column naming behavior can be avoided in two ways. First, the operation result can be placed back into the original column with the original column name by switching the keyword argument `renamecols` from its default value (`true`) to `renamecols=false`. +This option prevents the function name from being appended to the column name +as it usually would be. 
```julia julia> df = DataFrame(a=1:4, b=5:8) @@ -616,9 +621,90 @@ julia> rename(df, :a => :apple) # renames column `a` to `apple` in-place 4 │ 4 8 ``` -Additionally, in the -`source_column_selector => operation_function => new_column_names` operation form, -`new_column_names` may be a renaming function which operates on a string +If `new_column_names` already exist in the source data frame, +those columns will be replaced in the existing column location +rather than being added to the end. +This can be done by manually specifying an existing column name +or by using the `renamecols=false` keyword argument. + +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :b => (x -> x .+ 10)) # automatic new column and column name +4×3 DataFrame + Row │ a b b_function + │ Int64 Int64 Int64 +─────┼────────────────────────── + 1 │ 1 5 15 + 2 │ 2 6 16 + 3 │ 3 7 17 + 4 │ 4 8 18 + +julia> transform(df, :b => (x -> x .+ 10), renamecols=false) # transform column in-place +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 15 + 2 │ 2 16 + 3 │ 3 17 + 4 │ 4 18 + +julia> transform(df, :b => (x -> x .+ 10) => :a) # replace column :a +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 15 5 + 2 │ 16 6 + 3 │ 17 7 + 4 │ 18 8 +``` + +Actually, `renamecols=false` just prevents the function name from being appended to the final column name such that the operation is *usually* returned to the same column. 
+ +```julia +julia> transform(df, [:a, :b] => +) # new column name is all source columns and function name +4×3 DataFrame + Row │ a b a_b_+ + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, [:a, :b] => +, renamecols=false) # same as above but with no function name +4×3 DataFrame + Row │ a b a_b + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, [:a, :b] => (+) => :a) # manually overwrite column :a (see Note below about parentheses) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 6 5 + 2 │ 8 6 + 3 │ 10 7 + 4 │ 12 8 +``` + +In the `source_column_selector => operation_function => new_column_names` operation form, +`new_column_names` may also be a renaming function which operates on a string to create the destination column names programmatically. ```julia From 6377441dec4ea67082fe259d10025a540ec8395b Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Mon, 2 Oct 2023 16:30:25 -0400 Subject: [PATCH 19/30] Fix reference? --- docs/src/man/manipulation_functions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/man/manipulation_functions.md b/docs/src/man/manipulation_functions.md index db62e7adb..cabda4fd7 100644 --- a/docs/src/man/manipulation_functions.md +++ b/docs/src/man/manipulation_functions.md @@ -74,7 +74,7 @@ In DataFrames.jl manipulation functions, The examples shown above will be explained in more detail later. *The manipulation functions also have methods for applying multiple operations. 
-See the later sections [Multiple Operations per Manipulation](@ref) +See the later sections [Applying Multiple Operations per Manipulation](@ref) and [Broadcasting Operation Pairs](@ref) for more information.* ### `source_column_selector` From 6e7ed849fdd90287e84cf5a38366b0a3469a16a0 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Mon, 2 Oct 2023 16:51:16 -0400 Subject: [PATCH 20/30] maybe fix documenter? --- docs/src/man/basics.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 7f77d555b..0e9874301 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1589,7 +1589,6 @@ as the source data frame, but with only the rows where the passed operations are !!! Note Other Resources * For formal, comprehensive explanations of all manipulation functions, see the [Functions](@ref) API. - * For an informal, long-form tutorial on these functions, see [A Gentle Introduction to Data Frame Manipulation Functions](@ref). From d2d3de85b166a26b220ec8410e76a84d141ba4aa Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Thu, 5 Oct 2023 12:29:32 -0400 Subject: [PATCH 21/30] make h function require broadcasting --- docs/src/man/manipulation_functions.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/src/man/manipulation_functions.md b/docs/src/man/manipulation_functions.md index cabda4fd7..72df94476 100644 --- a/docs/src/man/manipulation_functions.md +++ b/docs/src/man/manipulation_functions.md @@ -336,7 +336,7 @@ julia> transform(df, :a => g) 2 │ 2 5 3 3 │ 3 4 4 -julia> h(x, y) = 2x .+ y +julia> h(x, y) = x .+ y .+ 1 h (generic function with 1 method) julia> transform(df, [:a, :b] => h) @@ -345,8 +345,8 @@ julia> transform(df, [:a, :b] => h) │ Int64 Int64 Int64 ─────┼───────────────────── 1 │ 1 4 6 - 2 │ 2 5 9 - 3 │ 3 4 10 + 2 │ 2 5 8 + 3 │ 3 4 8 ``` [Anonymous functions](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions) From 
0bdfc44fa0dc4abb3ba05e0bea46507e0a017547 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Thu, 12 Oct 2023 16:46:10 -0400 Subject: [PATCH 22/30] Fix existing typos in basics.md --- docs/src/man/basics.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 0e9874301..4e8ba02f7 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1075,7 +1075,7 @@ true If in indexing you select a subset of rows from a data frame the mutation is performed in place, i.e. writing to an existing vector. -Below setting values of column `:Job` in rows `1:3` to values `[2, 4, 6]`: +Below setting values of column `:Job` in rows `1:3` to values `[2, 3, 2]`: ```jldoctest dataframe julia> df1[1:3, :Job] = [2, 3, 2] @@ -1181,7 +1181,7 @@ DataFrameRow 2 │ 98 male 2 ``` -This operations updated the data stored in the `df1` data frame. +These operations updated the data stored in the `df1` data frame. In a similar fashion views can be used to update data stored in their parent data frame. 
Here are some examples: From 72d87d26d9391026baf8baaeb601fd857bf62fb9 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Fri, 13 Oct 2023 17:04:13 -0400 Subject: [PATCH 23/30] Move back to basics.md and add comparison --- docs/make.jl | 1 - docs/src/index.md | 7 - docs/src/man/basics.md | 2108 ++++++++++++++++++------ docs/src/man/manipulation_functions.md | 1431 ---------------- 4 files changed, 1568 insertions(+), 1979 deletions(-) delete mode 100644 docs/src/man/manipulation_functions.md diff --git a/docs/make.jl b/docs/make.jl index d854981e2..fa64782da 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -34,7 +34,6 @@ makedocs( "Data manipulation frameworks" => "man/querying_frameworks.md", "Comparison with Python/R/Stata" => "man/comparisons.md" ], - "A Gentle Introduction to Data Frame Manipulation Functions" => "man/manipulation_functions.md", "API" => Any[ "Types" => "lib/types.md", "Functions" => "lib/functions.md", diff --git a/docs/src/index.md b/docs/src/index.md index e259fd7f1..66ed6f3e5 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -229,13 +229,6 @@ Pages = ["man/basics.md", Depth = 2 ``` -## A Gentle Introduction to Data Frame Manipulation Functions - -```@contents -Pages = ["man/manipulation_functions.md"] -Depth = 1 -``` - ## API Only exported (i.e. 
available for use without `DataFrames.` qualifier after diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 4e8ba02f7..55937b849 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1565,599 +1565,1627 @@ julia> german[Not(5), r"S"] 984 rows omitted ``` -## Basic Usage of Manipulation Functions - -In DataFrames.jl there are seven functions -which can be used to perform operations on data frame columns: - -- `combine`: creates a new data frame populated with columns that result from - operations applied to the source data frame columns, potentially combining - its rows; -- `select`: creates a new data frame that has the same number of rows as the - source data frame populated with columns that result from operations - applied to the source data frame columns; -- `select!`: the same as `select` but updates the passed data frame in place; -- `transform`: the same as `select` but keeps the columns that were already - present in the data frame (note though that these columns can be potentially - modified by the transformation passed to `transform`); -- `transform!`: the same as `transform` but updates the passed data frame in - place. -- `subset`: creates a new data frame populated with the same columns -as the source data frame, but with only the rows where the passed operations are true; -- `subset!`: the same as `subset` but updates the passed data frame in place; - -!!! Note Other Resources - * For formal, comprehensive explanations of all manipulation functions, - see the [Functions](@ref) API. - * For an informal, long-form tutorial on these functions, - see [A Gentle Introduction to Data Frame Manipulation Functions](@ref). - -Let us now move straight to examples using the German dataset. +## Manipulation Functions -```jldoctest dataframe -julia> using Statistics +The seven functions below can be used to manipulate data frames +by applying operations to them. 
+ +The functions without a `!` in their name +will create a new data frame based on the source data frame, +so you will probably want to store the new data frame to a new variable name, +e.g. `new_df = transform(source_df, operation)`. +The functions with a `!` at the end of their name +will modify an existing data frame in-place, +so there is typically no need to assign the result to a variable, +e.g. `transform!(source_df, operation)` instead of +`source_df = transform(source_df, operation)`. + +The number of columns and rows in the resultant data frame varies +depending on the manipulation function employed. + +| Function | Memory Usage | Column Retention | Row Retention | +| ------------ | -------------------------------- | --------------------------------------- | --------------------------------------------------- | +| `transform` | Creates a new data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | +| `transform!` | Modifies an existing data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | +| `select` | Creates a new data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | +| `select!` | Modifies an existing data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | +| `subset` | Creates a new data frame. | Retains original columns. | Retains only rows where condition is true. | +| `subset!` | Modifies an existing data frame. | Retains original columns. | Retains only rows where condition is true. | +| `combine` | Creates a new data frame. | Retains only resultant columns. | Retains only resultant rows. | + +### Constructing Operations + +All of the functions above use the same syntax which is commonly +`manipulation_function(dataframe, operation)`. 
+The `operation` argument defines the +operation to be applied to the source `dataframe`, +and it can take any of the following common forms explained below: + +`source_column_selector` +: selects source column(s) without manipulating or renaming them + + Examples: `:a`, `[:a, :b]`, `All()`, `Not(:a)` + +`source_column_selector => operation_function` +: passes source column(s) as arguments to a function +and automatically names the resulting column(s) + + Examples: `:a => sum`, `[:a, :b] => +`, `:a => ByRow(==(3))` + +`source_column_selector => operation_function => new_column_names` +: passes source column(s) as arguments to a function +and names the resulting column(s) `new_column_names` + + Examples: `:a => sum => :sum_of_a`, `[:a, :b] => + => :a_plus_b` + + *(Not available for `subset` or `subset!`)* + +`source_column_selector => new_column_names` +: renames a source column, +or splits a column containing collection elements into multiple new columns + + Examples: `:a => :new_a`, `:a_b => [:a, :b]`, `:nt => AsTable` + + (*Not available for `subset` or `subset!`*) + +The `=>` operator constructs a +[Pair](https://docs.julialang.org/en/v1/base/collections/#Core.Pair), +which is a type to link one object to another. +(Pairs are commonly used to create elements of a +[Dictionary](https://docs.julialang.org/en/v1/base/collections/#Dictionaries).) +In DataFrames.jl manipulation functions, +`Pair` arguments are used to define column `operations` to be performed. +The examples shown above will be explained in more detail later. + +*The manipulation functions also have methods for applying multiple operations. +See the later sections [Applying Multiple Operations per Manipulation](@ref) +and [Broadcasting Operation Pairs](@ref) for more information.* + +#### `source_column_selector` +Inside an `operation`, `source_column_selector` is usually a column name +or column index which identifies a data frame column. 
+ +`source_column_selector` may be used as the entire `operation` +with `select` or `select!` to isolate or reorder columns. + +```julia +julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = [7, 8, 9]) +3×3 DataFrame + Row │ a b c + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 7 + 2 │ 2 5 8 + 3 │ 3 6 9 + +julia> select(df, :b) +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 + +julia> select(df, "b") +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 + +julia> select(df, 2) +3×1 DataFrame + Row │ b + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 5 + 3 │ 6 +``` + +`source_column_selector` may also be used as the entire `operation` +with `subset` or `subset!` if the source column contains `Bool` values. + +```julia +julia> df = DataFrame( + name = ["Scott", "Jill", "Erica", "Jimmy"], + minor = [false, true, false, true], + ) +4×2 DataFrame + Row │ name minor + │ String Bool +─────┼─────────────── + 1 │ Scott false + 2 │ Jill true + 3 │ Erica false + 4 │ Jimmy true + +julia> subset(df, :minor) +2×2 DataFrame + Row │ name minor + │ String Bool +─────┼─────────────── + 1 │ Jill true + 2 │ Jimmy true +``` + +`source_column_selector` may instead be a collection of columns such as a vector, +a [regular expression](https://docs.julialang.org/en/v1/manual/strings/#Regular-Expressions), +a `Not`, `Between`, `All`, or `Cols` expression, +or a `:`. +See the [Indexing](@ref) API for the full list of possible values with references. + +!!! Note + The Julia parser sometimes prevents `:` from being used by itself. + If you get + `ERROR: syntax: whitespace not allowed after ":" used for quoting`, + try using `All()`, `Cols(:)`, or `(:)` instead to select all columns. 
-julia> combine(german, :Age => mean => :mean_age) +```julia +julia> df = DataFrame( + id = [1, 2, 3], + first_name = ["José", "Emma", "Nathan"], + last_name = ["Garcia", "Marino", "Boyer"], + age = [61, 24, 33] + ) +3×4 DataFrame + Row │ id first_name last_name age + │ Int64 String String Int64 +─────┼───────────────────────────────────── + 1 │ 1 José Garcia 61 + 2 │ 2 Emma Marino 24 + 3 │ 3 Nathan Boyer 33 + +julia> select(df, [:last_name, :first_name]) +3×2 DataFrame + Row │ last_name first_name + │ String String +─────┼─────────────────────── + 1 │ Garcia José + 2 │ Marino Emma + 3 │ Boyer Nathan + +julia> select(df, r"name") +3×2 DataFrame + Row │ first_name last_name + │ String String +─────┼─────────────────────── + 1 │ José Garcia + 2 │ Emma Marino + 3 │ Nathan Boyer + +julia> select(df, Not(:id)) +3×3 DataFrame + Row │ first_name last_name age + │ String String Int64 +─────┼────────────────────────────── + 1 │ José Garcia 61 + 2 │ Emma Marino 24 + 3 │ Nathan Boyer 33 + +julia> select(df, Between(2,4)) +3×3 DataFrame + Row │ first_name last_name age + │ String String Int64 +─────┼────────────────────────────── + 1 │ José Garcia 61 + 2 │ Emma Marino 24 + 3 │ Nathan Boyer 33 + +julia> df2 = DataFrame( + name = ["Scott", "Jill", "Erica", "Jimmy"], + minor = [false, true, false, true], + male = [true, false, false, true], + ) +4×3 DataFrame + Row │ name minor male + │ String Bool Bool +─────┼────────────────────── + 1 │ Scott false true + 2 │ Jill true false + 3 │ Erica false false + 4 │ Jimmy true true + +julia> subset(df2, [:minor, :male]) +1×3 DataFrame + Row │ name minor male + │ String Bool Bool +─────┼───────────────────── + 1 │ Jimmy true true +``` + +!!! Note + Using `Symbol` in `source_column_selector` will perform slightly faster than using `String`. + However, `String` is convenient when column names contain spaces. + + All elements of `source_column_selector` must be the same type + (unless wrapped in `Cols`), + e.g. 
`subset(df2, [:minor, "male"])` will error + since `Symbol` and `String` are used simultaneously. + +#### `operation_function` +Inside an `operation` pair, `operation_function` is a function +which operates on data frame columns passed as vectors. +When multiple columns are selected by `source_column_selector`, +the `operation_function` will receive the columns as separate positional arguments +in the order they were selected, e.g. `f(column1, column2, column3)`. + +```julia +julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 4]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 4 + +julia> combine(df, :a => sum) 1×1 DataFrame - Row │ mean_age + Row │ a_sum + │ Int64 +─────┼─────── + 1 │ 6 + +julia> transform(df, :b => maximum) # `transform` and `select` copy scalar result to all rows +3×3 DataFrame + Row │ a b b_maximum + │ Int64 Int64 Int64 +─────┼───────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 5 + 3 │ 3 4 5 + +julia> transform(df, [:b, :a] => -) # vector subtraction is okay +3×3 DataFrame + Row │ a b b_a_- + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 3 + 2 │ 2 5 3 + 3 │ 3 4 1 + +julia> transform(df, [:a, :b] => *) # vector multiplication is not defined +ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64}) +``` + +Don't worry! There is a quick fix for the previous error. +If you want to apply a function to each element in a column +instead of to the entire column vector, +then you can wrap your element-wise function in `ByRow` like +`ByRow(my_elementwise_function)`. +This will apply `my_elementwise_function` to every element in the column +and then collect the results back into a vector. 
+ +```julia +julia> transform(df, [:a, :b] => ByRow(*)) +3×3 DataFrame + Row │ a b a_b_* + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 4 + 2 │ 2 5 10 + 3 │ 3 4 12 + +julia> transform(df, Cols(:) => ByRow(max)) +3×3 DataFrame + Row │ a b a_b_max + │ Int64 Int64 Int64 +─────┼─────────────────────── + 1 │ 1 4 4 + 2 │ 2 5 5 + 3 │ 3 4 4 + +julia> f(x) = x + 1 +f (generic function with 1 method) + +julia> transform(df, :a => ByRow(f)) +3×3 DataFrame + Row │ a b a_f + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 2 + 2 │ 2 5 3 + 3 │ 3 4 4 +``` + +Alternatively, you may just want to define the function itself so it +[broadcasts](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) +over vectors. + +```julia +julia> g(x) = x .+ 1 +g (generic function with 1 method) + +julia> transform(df, :a => g) +3×3 DataFrame + Row │ a b a_g + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 2 + 2 │ 2 5 3 + 3 │ 3 4 4 + +julia> h(x, y) = x .+ y .+ 1 +h (generic function with 1 method) + +julia> transform(df, [:a, :b] => h) +3×3 DataFrame + Row │ a b a_b_h + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 6 + 2 │ 2 5 8 + 3 │ 3 4 8 +``` + +[Anonymous functions](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions) +are a convenient way to define and use an `operation_function` +all within the manipulation function call. 
+ +```julia +julia> select(df, :a => ByRow(x -> x + 1)) +3×1 DataFrame + Row │ a_function + │ Int64 +─────┼──────────── + 1 │ 2 + 2 │ 3 + 3 │ 4 + +julia> transform(df, [:a, :b] => ByRow((x, y) -> 2x + y)) +3×3 DataFrame + Row │ a b a_b_function + │ Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 4 6 + 2 │ 2 5 9 + 3 │ 3 4 10 + +julia> subset(df, :b => ByRow(x -> x < 5)) +2×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 3 4 + +julia> subset(df, :b => ByRow(<(5))) # shorter version of the previous +2×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 3 4 +``` + +!!! Note + `operation_functions` within `subset` or `subset!` function calls + must return a Boolean vector. + `true` elements in the Boolean vector will determine + which rows are retained in the resulting data frame. + +As demonstrated above, `DataFrame` columns are usually passed +from `source_column_selector` to `operation_function` as one or more +vector arguments. +However, when `AsTable(source_column_selector)` is used, +the selected columns are collected and passed as a single `NamedTuple` +to `operation_function`. + +This is often useful when your `operation_function` is defined to operate +on a single collection argument rather than on multiple positional arguments. +The distinction is somewhat similar to the difference between the built-in +`min` and `minimum` functions. +`min` is defined to find the minimum value among multiple positional arguments, +while `minimum` is defined to find the minimum value +among the elements of a single collection argument. 
+ +```julia +julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 2:-1:1) +2×4 DataFrame + Row │ a b c d + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 3 5 2 + 2 │ 2 4 6 1 + +julia> select(df, Cols(:) => ByRow(min)) # `min` operates on multiple arguments +2×1 DataFrame + Row │ a_b_etc_min + │ Int64 +─────┼───────────── + 1 │ 1 + 2 │ 1 + +julia> select(df, AsTable(:) => ByRow(minimum)) # `minimum` operates on a collection +2×1 DataFrame + Row │ a_b_etc_minimum + │ Int64 +─────┼───────────────── + 1 │ 1 + 2 │ 1 + +julia> select(df, [:a,:b] => ByRow(+)) # `+` operates on multiple arguments +2×1 DataFrame + Row │ a_b_+ + │ Int64 +─────┼─────── + 1 │ 4 + 2 │ 6 + +julia> select(df, AsTable([:a,:b]) => ByRow(sum)) # `sum` operates on a collection +2×1 DataFrame + Row │ a_b_sum + │ Int64 +─────┼───────── + 1 │ 4 + 2 │ 6 + +julia> using Statistics # contains the `mean` function + +julia> select(df, AsTable(Between(:b, :d)) => ByRow(mean)) # `mean` operates on a collection +2×1 DataFrame + Row │ b_c_d_mean │ Float64 -─────┼────────── - 1 │ 35.546 +─────┼──────────── + 1 │ 3.33333 + 2 │ 3.66667 +``` -julia> select(german, :Age => mean => :mean_age) -1000×1 DataFrame - Row │ mean_age - │ Float64 -──────┼────────── - 1 │ 35.546 - 2 │ 35.546 - 3 │ 35.546 - 4 │ 35.546 - 5 │ 35.546 - 6 │ 35.546 - 7 │ 35.546 - 8 │ 35.546 - ⋮ │ ⋮ - 994 │ 35.546 - 995 │ 35.546 - 996 │ 35.546 - 997 │ 35.546 - 998 │ 35.546 - 999 │ 35.546 - 1000 │ 35.546 - 985 rows omitted -``` - -As you can see in both cases the `mean` function was applied to `:Age` column -and the result was stored in the `:mean_age` column. The difference between -the `combine` and `select` functions is that the `combine` aggregates data -and produces as many rows as were returned by the transformation function. -On the other hand the `select` function always keeps the number of rows in a -data frame to be the same as in the source data frame. 
Therefore in this case -the result of the `mean` function got broadcasted. - -As `combine` potentially allows any number of rows to be produced as a result -of the transformation if we have a combination of transformations where some of -them produce a vector, and other produce scalars then scalars get broadcasted -exactly like in `select`. Here is an example: +`AsTable` can also be used to pass columns to a function which operates +on fields of a `NamedTuple`. -```jldoctest dataframe -julia> combine(german, :Age => mean => :mean_age, :Housing => unique => :housing) -3×2 DataFrame - Row │ mean_age housing - │ Float64 String7 -─────┼─────────────────── - 1 │ 35.546 own - 2 │ 35.546 free - 3 │ 35.546 rent +```julia +julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 7:8) +2×4 DataFrame + Row │ a b c d + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────── + 1 │ 1 3 5 7 + 2 │ 2 4 6 8 + +julia> f(nt) = nt.a + nt.d +f (generic function with 1 method) + +julia> transform(df, AsTable(:) => ByRow(f)) +2×5 DataFrame + Row │ a b c d a_b_etc_f + │ Int64 Int64 Int64 Int64 Int64 +─────┼─────────────────────────────────────── + 1 │ 1 3 5 7 8 + 2 │ 2 4 6 8 10 ``` -Note, however, that it is not allowed to return vectors of different lengths in -different transformations: +As demonstrated above, +in the `source_column_selector => operation_function` operation pair form, +the results of an operation will be placed into a new column with an +automatically-generated name based on the operation; +the new column name will be the `operation_function` name +appended to the source column name(s) with an underscore. -```jldoctest dataframe -julia> combine(german, :Age, :Housing => unique => :Housing) -ERROR: ArgumentError: New columns must have the same length as old columns +This automatic column naming behavior can be avoided in two ways. 
+First, the operation result can be placed back into the original column +with the original column name by switching the keyword argument `renamecols` +from its default value (`true`) to `renamecols=false`. +This option prevents the function name from being appended to the column name +as it usually would be. + +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :a => ByRow(x->x+10), renamecols=false) # add 10 in-place +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 11 5 + 2 │ 12 6 + 3 │ 13 7 + 4 │ 14 8 ``` -Let us discuss some other examples using `select`. Often we want to apply some -function not to the whole column of a data frame, but rather to its individual -elements. Normally we can achieve this using broadcasting like this: +The second method to avoid the default manipulation column naming is to +specify your own `new_column_names`. -```jldoctest dataframe -julia> select(german, :Sex => (x -> uppercase.(x)) => :Sex) -1000×1 DataFrame - Row │ Sex - │ String -──────┼──────── - 1 │ MALE - 2 │ FEMALE - 3 │ MALE - 4 │ MALE - 5 │ MALE - 6 │ MALE - 7 │ MALE - 8 │ MALE - ⋮ │ ⋮ - 994 │ MALE - 995 │ MALE - 996 │ FEMALE - 997 │ MALE - 998 │ MALE - 999 │ MALE - 1000 │ MALE -985 rows omitted +#### `new_column_names` + +`new_column_names` can be included at the end of an `operation` pair to specify +the name of the new column(s). +`new_column_names` may be a symbol, string, function, vector of symbols, vector of strings, or `AsTable`. 
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, Cols(:) => ByRow(+) => :c) +4×3 DataFrame + Row │ a b c + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, Cols(:) => ByRow(+) => "a+b") +4×3 DataFrame + Row │ a b a+b + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, :a => ByRow(x->x+10) => "a+10") +4×3 DataFrame + Row │ a b a+10 + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 11 + 2 │ 2 6 12 + 3 │ 3 7 13 + 4 │ 4 8 14 ``` -This pattern is encountered very often in practice, therefore there is a `ByRow` -convenience wrapper for a function that creates its broadcasted variant. In -these examples `ByRow` is a special type used for selection operations to signal -that the wrapped function should be applied to each element (row) of the -selection. Here we are passing `ByRow` wrapper to target column name `:Sex` -using `uppercase` function: +The `source_column_selector => new_column_names` operation form +can be used to rename columns without an intermediate function. +However, there are `rename` and `rename!` functions, +which accept similar syntax, +that tend to be more useful for this operation. 
-```jldoctest dataframe -julia> select(german, :Sex => ByRow(uppercase) => :SEX) -1000×1 DataFrame - Row │ SEX - │ String -──────┼──────── - 1 │ MALE - 2 │ FEMALE - 3 │ MALE - 4 │ MALE - 5 │ MALE - 6 │ MALE - 7 │ MALE - 8 │ MALE - ⋮ │ ⋮ - 994 │ MALE - 995 │ MALE - 996 │ FEMALE - 997 │ MALE - 998 │ MALE - 999 │ MALE - 1000 │ MALE -985 rows omitted +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :a => :apple) # adds column `apple` +4×3 DataFrame + Row │ a b apple + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 1 + 2 │ 2 6 2 + 3 │ 3 7 3 + 4 │ 4 8 4 + +julia> select(df, :a => :apple) # retains only column `apple` +4×1 DataFrame + Row │ apple + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 + 4 │ 4 + +julia> rename(df, :a => :apple) # renames column `a` to `apple` in-place +4×2 DataFrame + Row │ apple b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 ``` -In this case we transform our source column `:Age` using `ByRow` wrapper and -automatically generate the target column name: +If `new_column_names` already exist in the source data frame, +those columns will be replaced in the existing column location +rather than being added to the end. +This can be done by manually specifying an existing column name +or by using the `renamecols=false` keyword argument. 
-```jldoctest dataframe -julia> select(german, :Age, :Age => ByRow(sqrt)) -1000×2 DataFrame - Row │ Age Age_sqrt - │ Int64 Float64 -──────┼───────────────── - 1 │ 67 8.18535 - 2 │ 22 4.69042 - 3 │ 49 7.0 - 4 │ 45 6.7082 - 5 │ 53 7.28011 - 6 │ 35 5.91608 - 7 │ 53 7.28011 - 8 │ 35 5.91608 - ⋮ │ ⋮ ⋮ - 994 │ 30 5.47723 - 995 │ 50 7.07107 - 996 │ 31 5.56776 - 997 │ 40 6.32456 - 998 │ 38 6.16441 - 999 │ 23 4.79583 - 1000 │ 27 5.19615 - 985 rows omitted +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> transform(df, :b => (x -> x .+ 10)) # automatic new column and column name +4×3 DataFrame + Row │ a b b_function + │ Int64 Int64 Int64 +─────┼────────────────────────── + 1 │ 1 5 15 + 2 │ 2 6 16 + 3 │ 3 7 17 + 4 │ 4 8 18 + +julia> transform(df, :b => (x -> x .+ 10), renamecols=false) # transform column in-place +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 15 + 2 │ 2 16 + 3 │ 3 17 + 4 │ 4 18 + +julia> transform(df, :b => (x -> x .+ 10) => :a) # replace column :a +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 15 5 + 2 │ 16 6 + 3 │ 17 7 + 4 │ 18 8 ``` -When we pass just a column (without the `=>` part) we can use any column selector -that is allowed in indexing. +Actually, `renamecols=false` just prevents the function name from being appended to the final column name such that the operation is *usually* returned to the same column. 
-Here we exclude the column `:Age` from the resulting data frame: +```julia +julia> transform(df, [:a, :b] => +) # new column name is all source columns and function name +4×3 DataFrame + Row │ a b a_b_+ + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 + +julia> transform(df, [:a, :b] => +, renamecols=false) # same as above but with no function name +4×3 DataFrame + Row │ a b a_b + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 6 + 2 │ 2 6 8 + 3 │ 3 7 10 + 4 │ 4 8 12 -```jldoctest dataframe -julia> select(german, Not(:Age)) -1000×9 DataFrame - Row │ id Sex Job Housing Saving accounts Checking account Cre ⋯ - │ Int64 String7 Int64 String7 String15 String15 Int ⋯ -──────┼───────────────────────────────────────────────────────────────────────── - 1 │ 0 male 2 own NA little ⋯ - 2 │ 1 female 2 own little moderate - 3 │ 2 male 1 own little NA - 4 │ 3 male 2 free little little - 5 │ 4 male 2 free little little ⋯ - 6 │ 5 male 1 free NA NA - 7 │ 6 male 2 own quite rich NA - 8 │ 7 male 3 rent little moderate - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ - 994 │ 993 male 3 own little little ⋯ - 995 │ 994 male 2 own NA NA - 996 │ 995 female 1 own little NA - 997 │ 996 male 3 own little little - 998 │ 997 male 2 own little NA ⋯ - 999 │ 998 male 2 free little little - 1000 │ 999 male 2 own moderate moderate - 3 columns and 985 rows omitted +julia> transform(df, [:a, :b] => (+) => :a) # manually overwrite column :a (see Note below about parentheses) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 6 5 + 2 │ 8 6 + 3 │ 10 7 + 4 │ 12 8 ``` -In the next example we drop columns `"Age"`, `"Saving accounts"`, -`"Checking account"`, `"Credit amount"`, and `"Purpose"`. 
Note that this time -we use string column selectors because some of the column names have spaces -in them: +In the `source_column_selector => operation_function => new_column_names` operation form, +`new_column_names` may also be a renaming function which operates on a string +to create the destination column names programmatically. -```jldoctest dataframe -julia> select(german, Not(["Age", "Saving accounts", "Checking account", - "Credit amount", "Purpose"])) -1000×5 DataFrame - Row │ id Sex Job Housing Duration - │ Int64 String7 Int64 String7 Int64 -──────┼────────────────────────────────────────── - 1 │ 0 male 2 own 6 - 2 │ 1 female 2 own 48 - 3 │ 2 male 1 own 12 - 4 │ 3 male 2 free 42 - 5 │ 4 male 2 free 24 - 6 │ 5 male 1 free 36 - 7 │ 6 male 2 own 24 - 8 │ 7 male 3 rent 36 - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ - 994 │ 993 male 3 own 36 - 995 │ 994 male 2 own 12 - 996 │ 995 female 1 own 12 - 997 │ 996 male 3 own 30 - 998 │ 997 male 2 own 12 - 999 │ 998 male 2 free 45 - 1000 │ 999 male 2 own 45 - 985 rows omitted - -``` - -As another example let us present that the `r"S"` regular expression we used -above also works with `select`: +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 -```jldoctest dataframe -julia> select(german, r"S") -1000×2 DataFrame - Row │ Sex Saving accounts - │ String7 String15 -──────┼────────────────────────── - 1 │ male NA - 2 │ female little - 3 │ male little - 4 │ male little - 5 │ male little - 6 │ male NA - 7 │ male quite rich - 8 │ male little - ⋮ │ ⋮ ⋮ - 994 │ male little - 995 │ male NA - 996 │ female little - 997 │ male little - 998 │ male little - 999 │ male little - 1000 │ male moderate - 985 rows omitted -``` - -The benefit of `select` or `combine` over indexing is that it is easier -to get the union of several column selectors, e.g.: +julia> add_prefix(s) = "new_" * s +add_prefix (generic function with 1 method) -```jldoctest dataframe -julia> 
select(german, r"S", "Job", 1) -1000×4 DataFrame - Row │ Sex Saving accounts Job id - │ String7 String15 Int64 Int64 -──────┼──────────────────────────────────────── - 1 │ male NA 2 0 - 2 │ female little 2 1 - 3 │ male little 1 2 - 4 │ male little 2 3 - 5 │ male little 2 4 - 6 │ male NA 1 5 - 7 │ male quite rich 2 6 - 8 │ male little 3 7 - ⋮ │ ⋮ ⋮ ⋮ ⋮ - 994 │ male little 3 993 - 995 │ male NA 2 994 - 996 │ female little 1 995 - 997 │ male little 3 996 - 998 │ male little 2 997 - 999 │ male little 2 998 - 1000 │ male moderate 2 999 - 985 rows omitted -``` - -Taking advantage of this flexibility here is an idiomatic pattern to move some -column to the front of a data frame: +julia> transform(df, :a => (x -> 10 .* x) => add_prefix) # with named renaming function +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 10 + 2 │ 2 6 20 + 3 │ 3 7 30 + 4 │ 4 8 40 + +julia> transform(df, :a => (x -> 10 .* x) => (s -> "new_" * s)) # with anonymous renaming function +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 10 + 2 │ 2 6 20 + 3 │ 3 7 30 + 4 │ 4 8 40 +``` + +!!! Note + It is a good idea to wrap anonymous functions in parentheses + to avoid the `=>` operator accidentally becoming part of the anonymous function. + The examples above do not work correctly without the parentheses! + ```julia + julia> transform(df, :a => x -> 10 .* x => add_prefix) # Not what we wanted! + 4×3 DataFrame + Row │ a b a_function + │ Int64 Int64 Pair… + ─────┼──────────────────────────────────────────── + 1 │ 1 5 [10, 20, 30, 40]=>add_prefix + 2 │ 2 6 [10, 20, 30, 40]=>add_prefix + 3 │ 3 7 [10, 20, 30, 40]=>add_prefix + 4 │ 4 8 [10, 20, 30, 40]=>add_prefix + + julia> transform(df, :a => x -> 10 .* x => s -> "new_" * s) # Not what we wanted! 
+ 4×3 DataFrame + Row │ a b a_function + │ Int64 Int64 Pair… + ─────┼───────────────────────────────────── + 1 │ 1 5 [10, 20, 30, 40]=>#18 + 2 │ 2 6 [10, 20, 30, 40]=>#18 + 3 │ 3 7 [10, 20, 30, 40]=>#18 + 4 │ 4 8 [10, 20, 30, 40]=>#18 + ``` + +A renaming function will not work in the +`source_column_selector => new_column_names` operation form +because a function in the second element of the operation pair is assumed to take +the `source_column_selector => operation_function` operation form. +To work around this limitation, use the +`source_column_selector => operation_function => new_column_names` operation form +with `identity` as the `operation_function`. -```jldoctest dataframe -julia> select(german, "Sex", :) -1000×10 DataFrame - Row │ Sex id Age Job Housing Saving accounts Checking accou ⋯ - │ String7 Int64 Int64 Int64 String7 String15 String15 ⋯ -──────┼───────────────────────────────────────────────────────────────────────── - 1 │ male 0 67 2 own NA little ⋯ - 2 │ female 1 22 2 own little moderate - 3 │ male 2 49 1 own little NA - 4 │ male 3 45 2 free little little - 5 │ male 4 53 2 free little little ⋯ - 6 │ male 5 35 1 free NA NA - 7 │ male 6 53 2 own quite rich NA - 8 │ male 7 35 3 rent little moderate - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ - 994 │ male 993 30 3 own little little ⋯ - 995 │ male 994 50 2 own NA NA - 996 │ female 995 31 1 own little NA - 997 │ male 996 40 3 own little little - 998 │ male 997 38 2 own little NA ⋯ - 999 │ male 998 23 2 free little little - 1000 │ male 999 27 2 own moderate moderate - 4 columns and 985 rows omitted +```julia +julia> transform(df, :a => add_prefix) +ERROR: MethodError: no method matching *(::String, ::Vector{Int64}) + +julia> transform(df, :a => identity => add_prefix) +4×3 DataFrame + Row │ a b new_a + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 5 1 + 2 │ 2 6 2 + 3 │ 3 7 3 + 4 │ 4 8 4 ``` -Below, we are simply passing source column and target column name to rename them -(without specifying the transformation 
part): +In this case though, +it is probably again more useful to use the `rename` or `rename!` function +rather than one of the manipulation functions +in order to rename in-place and avoid the intermediate `operation_function`. +```julia +julia> rename(add_prefix, df) # rename all columns with a function +4×2 DataFrame + Row │ new_a new_b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> rename(add_prefix, df; cols=:a) # rename some columns with a function +4×2 DataFrame + Row │ new_a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 +``` -```jldoctest dataframe -julia> select(german, :Sex => :x1, :Age => :x2) -1000×2 DataFrame - Row │ x1 x2 - │ String7 Int64 -──────┼──────────────── - 1 │ male 67 - 2 │ female 22 - 3 │ male 49 - 4 │ male 45 - 5 │ male 53 - 6 │ male 35 - 7 │ male 53 - 8 │ male 35 - ⋮ │ ⋮ ⋮ - 994 │ male 30 - 995 │ male 50 - 996 │ female 31 - 997 │ male 40 - 998 │ male 38 - 999 │ male 23 - 1000 │ male 27 - 985 rows omitted +In the `source_column_selector => new_column_names` operation form, +only a single source column may be selected per operation, +so why is `new_column_names` plural? +It is possible to split the data contained inside a single column +into multiple new columns by supplying a vector of strings or symbols +as `new_column_names`. + +```julia +julia> df = DataFrame(data = [(1,2), (3,4)]) # vector of tuples +2×1 DataFrame + Row │ data + │ Tuple… +─────┼──────── + 1 │ (1, 2) + 2 │ (3, 4) + +julia> transform(df, :data => [:first, :second]) # manual naming +2×3 DataFrame + Row │ data first second + │ Tuple… Int64 Int64 +─────┼─────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 ``` -It is important to note that `select` always returns a data frame, even if a -single column selected as opposed to indexing syntax. Compare the following: +This kind of data splitting can even be done automatically with `AsTable`. 
-```jldoctest dataframe -julia> select(german, :Age) -1000×1 DataFrame - Row │ Age - │ Int64 -──────┼─────── - 1 │ 67 - 2 │ 22 - 3 │ 49 - 4 │ 45 - 5 │ 53 - 6 │ 35 - 7 │ 53 - 8 │ 35 - ⋮ │ ⋮ - 994 │ 30 - 995 │ 50 - 996 │ 31 - 997 │ 40 - 998 │ 38 - 999 │ 23 - 1000 │ 27 -985 rows omitted +```julia +julia> transform(df, :data => AsTable) # default automatic naming with tuples +2×3 DataFrame + Row │ data x1 x2 + │ Tuple… Int64 Int64 +─────┼────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` -julia> german[:, :Age] -1000-element Vector{Int64}: - 67 - 22 - 49 - 45 - 53 - 35 - 53 - 35 - 61 - 28 - ⋮ - 34 - 23 - 30 - 50 - 31 - 40 - 38 - 23 - 27 -``` - -By default `select` copies columns of a passed source data frame. In order to -avoid copying, pass the `copycols=false` keyword argument: +If a data frame column contains `NamedTuple`s, +then `AsTable` will preserve the field names. +```julia +julia> df = DataFrame(data = [(a=1,b=2), (a=3,b=4)]) # vector of named tuples +2×1 DataFrame + Row │ data + │ NamedTup… +─────┼──────────────── + 1 │ (a = 1, b = 2) + 2 │ (a = 3, b = 4) -```jldoctest dataframe -julia> df = select(german, :Sex) -1000×1 DataFrame - Row │ Sex - │ String7 -──────┼───────── - 1 │ male - 2 │ female - 3 │ male - 4 │ male - 5 │ male - 6 │ male - 7 │ male - 8 │ male - ⋮ │ ⋮ - 994 │ male - 995 │ male - 996 │ female - 997 │ male - 998 │ male - 999 │ male - 1000 │ male -985 rows omitted +julia> transform(df, :data => AsTable) # keeps names from named tuples +2×3 DataFrame + Row │ data a b + │ NamedTup… Int64 Int64 +─────┼────────────────────────────── + 1 │ (a = 1, b = 2) 1 2 + 2 │ (a = 3, b = 4) 3 4 +``` -julia> df.Sex === german.Sex # copy -false +!!! Note + To pack multiple columns into a single column of `NamedTuple`s + (reverse of the above operation) + apply the `identity` function `ByRow`, e.g. + `transform(df, AsTable([:a, :b]) => ByRow(identity) => :data)`. 
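The packing operation described in the note above can be sketched as follows. This is a minimal illustration, assuming a small data frame with columns `a` and `b`; the variable name `packed` is hypothetical and used only for this example.

```julia
using DataFrames

# Illustrative data frame (an assumption for this sketch).
df = DataFrame(a = [1, 3], b = [2, 4])

# `AsTable([:a, :b])` hands each row to `identity` as a `NamedTuple`,
# and `ByRow` collects those named tuples into the single column `:data`.
packed = transform(df, AsTable([:a, :b]) => ByRow(identity) => :data)

# The new column holds one named tuple per row.
packed.data
```

Because `transform` retains the source columns, `a` and `b` remain next to the new `data` column; `select` with the same operation would keep only `data`.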
-julia> df = select(german, :Sex, copycols=false) -1000×1 DataFrame - Row │ Sex - │ String7 -──────┼───────── - 1 │ male - 2 │ female - 3 │ male - 4 │ male - 5 │ male - 6 │ male - 7 │ male - 8 │ male - ⋮ │ ⋮ - 994 │ male - 995 │ male - 996 │ female - 997 │ male - 998 │ male - 999 │ male - 1000 │ male -985 rows omitted +Renaming functions also work for multi-column transformations, +but they must operate on a vector of strings. + +```julia +julia> df = DataFrame(data = [(1,2), (3,4)]) +2×1 DataFrame + Row │ data + │ Tuple… +─────┼──────── + 1 │ (1, 2) + 2 │ (3, 4) + +julia> new_names(v) = ["primary ", "secondary "] .* v +new_names (generic function with 1 method) + +julia> transform(df, :data => identity => new_names) +2×3 DataFrame + Row │ data primary data secondary data + │ Tuple… Int64 Int64 +─────┼────────────────────────────────────── + 1 │ (1, 2) 1 2 + 2 │ (3, 4) 3 4 +``` + +### Applying Multiple Operations per Manipulation +All data frame manipulation functions can accept multiple `operation` pairs +at once using any of the following methods: +- `manipulation_function(dataframe, operation1, operation2)` : multiple arguments +- `manipulation_function(dataframe, [operation1, operation2])` : vector argument +- `manipulation_function(dataframe, [operation1 operation2])` : matrix argument + +Passing multiple operations is especially useful for the `select`, `select!`, +and `combine` manipulation functions, +since they only retain columns which are a result of the passed operations. 
+ +```julia +julia> df = DataFrame(a = 1:4, b = [50,50,60,60], c = ["hat","bat","cat","dog"]) +4×3 DataFrame + Row │ a b c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 1 50 hat + 2 │ 2 50 bat + 3 │ 3 60 cat + 4 │ 4 60 dog + +julia> combine(df, :a => maximum, :b => sum, :c => join) # 3 combine operations +1×3 DataFrame + Row │ a_maximum b_sum c_join + │ Int64 Int64 String +─────┼──────────────────────────────── + 1 │ 4 220 hatbatcatdog + +julia> select(df, :c, :b, :a) # re-order columns +4×3 DataFrame + Row │ c b a + │ String Int64 Int64 +─────┼────────────────────── + 1 │ hat 50 1 + 2 │ bat 50 2 + 3 │ cat 60 3 + 4 │ dog 60 4 + +julia> select(df, :b, :) # `:` here means all other columns +4×3 DataFrame + Row │ b a c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 50 1 hat + 2 │ 50 2 bat + 3 │ 60 3 cat + 4 │ 60 4 dog + +julia> select( + df, + :c => (x -> "a " .* x) => :one_c, + :a => (x -> 100x), + :b, + renamecols=false + ) # can mix operation forms +4×3 DataFrame + Row │ one_c a b + │ String Int64 Int64 +─────┼────────────────────── + 1 │ a hat 100 50 + 2 │ a bat 200 50 + 3 │ a cat 300 60 + 4 │ a dog 400 60 + +julia> select( + df, + :c => ByRow(reverse), + :c => ByRow(uppercase) + ) # multiple operations on same column +4×2 DataFrame + Row │ c_reverse c_uppercase + │ String String +─────┼──────────────────────── + 1 │ tah HAT + 2 │ tab BAT + 3 │ tac CAT + 4 │ god DOG +``` + +In the last two examples, +the manipulation function arguments were split across multiple lines. +This is a good way to make manipulations with many operations more readable. + +Passing multiple operations to `subset` or `subset!` is an easy way to narrow in +on a particular row of data. 
+ +```julia +julia> subset( + df, + :b => ByRow(==(60)), + :c => ByRow(contains("at")) + ) # rows with 60 and "at" +1×3 DataFrame + Row │ a b c + │ Int64 Int64 String +─────┼────────────────────── + 1 │ 3 60 cat +``` -julia> df.Sex === german.Sex # no-copy is performed +Note that all operations within a single manipulation must use the data +as it existed before the function call +i.e. you cannot use newly created columns for subsequent operations +within the same manipulation. + +```julia +julia> transform( + df, + [:a, :b] => ByRow(+) => :d, + :d => (x -> x ./ 2), + ) # requires two separate transformations +ERROR: ArgumentError: column name :d not found in the data frame; existing most similar names are: :a, :b and :c + +julia> new_df = transform(df, [:a, :b] => ByRow(+) => :d) +4×4 DataFrame + Row │ a b c d + │ Int64 Int64 String Int64 +─────┼───────────────────────────── + 1 │ 1 50 hat 51 + 2 │ 2 50 bat 52 + 3 │ 3 60 cat 63 + 4 │ 4 60 dog 64 + +julia> transform!(new_df, :d => (x -> x ./ 2) => :d_2) +4×5 DataFrame + Row │ a b c d d_2 + │ Int64 Int64 String Int64 Float64 +─────┼────────────────────────────────────── + 1 │ 1 50 hat 51 25.5 + 2 │ 2 50 bat 52 26.0 + 3 │ 3 60 cat 63 31.5 + 4 │ 4 60 dog 64 32.0 +``` + + +### Broadcasting Operation Pairs + +[Broadcasting](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) +pairs with `.=>` is often a convenient way to generate multiple +similar `operation`s to be applied within a single manipulation. +Broadcasting within the `Pair` of an `operation` is no different than +broadcasting in base Julia. +The broadcasting `.=>` will be expanded into a vector of pairs +(`[operation1, operation2, ...]`), +and this expansion will occur before the manipulation function is invoked. +Then the manipulation function will use the +`manipulation_function(dataframe, [operation1, operation2, ...])` method. +This process will be explained in more detail below. 
+ +To illustrate these concepts, let us first examine the `Type` of a basic `Pair`. +In DataFrames.jl, a symbol, string, or integer +may be used to select a single column. +Some `Pair`s with these types are below. + +```julia +julia> typeof(:x => :a) +Pair{Symbol, Symbol} + +julia> typeof("x" => "a") +Pair{String, String} + +julia> typeof(1 => "a") +Pair{Int64, String} +``` + +Any of the `Pair`s above could be used to rename the first column +of the data frame below to `a`. + +```julia +julia> df = DataFrame(x = 1:3, y = 4:6) +3×2 DataFrame + Row │ x y + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 + +julia> select(df, :x => :a) +3×1 DataFrame + Row │ a + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 + +julia> select(df, 1 => "a") +3×1 DataFrame + Row │ a + │ Int64 +─────┼─────── + 1 │ 1 + 2 │ 2 + 3 │ 3 +``` + +What should we do if we want to keep and rename both the `x` and `y` column? +One option is to supply a `Vector` of operation `Pair`s to `select`. +`select` will process all of these operations in order. + +```julia +julia> ["x" => "a", "y" => "b"] +2-element Vector{Pair{String, String}}: + "x" => "a" + "y" => "b" + +julia> select(df, ["x" => "a", "y" => "b"]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 +``` + +We can use broadcasting to simplify the syntax above. + +```julia +julia> ["x", "y"] .=> ["a", "b"] +2-element Vector{Pair{String, String}}: + "x" => "a" + "y" => "b" + +julia> select(df, ["x", "y"] .=> ["a", "b"]) +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 +``` + +Notice that `select` sees the same `Vector{Pair{String, String}}` operation +argument whether the individual pairs are written out explicitly or +constructed with broadcasting. +The broadcasting is applied before the call to `select`. 
+ +```julia +julia> ["x" => "a", "y" => "b"] == (["x", "y"] .=> ["a", "b"]) true ``` -To perform the selection operation in-place use `select!`: +!!! note + These operation pairs (or vector of pairs) can be given variable names. + This is uncommon in practice but could be helpful for intermediate + inspection and testing. + ```julia + df = DataFrame(x = 1:3, y = 4:6) # create data frame + operation = ["x", "y"] .=> ["a", "b"] # save operation to variable + typeof(operation) # check type of operation + first(operation) # check first pair in operation + last(operation) # check last pair in operation + select(df, operation) # manipulate `df` with `operation` + ``` -```jldoctest dataframe -julia> select!(german, Not(:Age)); +In Julia, +a scalar (non-vector) value broadcast with a vector is repeated in each element of the resulting vector of pairs. -julia> german -1000×9 DataFrame - Row │ id Sex Job Housing Saving accounts Checking account Cre ⋯ - │ Int64 String7 Int64 String7 String15 String15 Int ⋯ -──────┼───────────────────────────────────────────────────────────────────────── - 1 │ 0 male 2 own NA little ⋯ - 2 │ 1 female 2 own little moderate - 3 │ 2 male 1 own little NA - 4 │ 3 male 2 free little little - 5 │ 4 male 2 free little little ⋯ - 6 │ 5 male 1 free NA NA - 7 │ 6 male 2 own quite rich NA - 8 │ 7 male 3 rent little moderate - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ - 994 │ 993 male 3 own little little ⋯ - 995 │ 994 male 2 own NA NA - 996 │ 995 female 1 own little NA - 997 │ 996 male 3 own little little - 998 │ 997 male 2 own little NA ⋯ - 999 │ 998 male 2 free little little - 1000 │ 999 male 2 own moderate moderate - 3 columns and 985 rows omitted +```julia +julia> ["x", "y"] .=> :a # :a is repeated +2-element Vector{Pair{String, Symbol}}: + "x" => :a + "y" => :a + +julia> 1 .=> [:a, :b] # 1 is repeated +2-element Vector{Pair{Int64, Symbol}}: + 1 => :a + 1 => :b ``` -As you can see the `:Age` column was dropped from the `german` data frame.
+We can use this fact to easily broadcast an `operation_function` to multiple columns. -The `transform` and `transform!` functions work identically to `select` and -`select!` with the only difference that they retain all columns that are present -in the source data frame. Here are some examples: +```julia +julia> f(x) = 2 * x +f (generic function with 1 method) -```jldoctest dataframe -julia> german = copy(german_ref); +julia> ["x", "y"] .=> f # f is repeated +2-element Vector{Pair{String, typeof(f)}}: + "x" => f + "y" => f -julia> df = german_ref[1:8, 1:5] -8×5 DataFrame - Row │ id Age Sex Job Housing - │ Int64 Int64 String7 Int64 String7 -─────┼─────────────────────────────────────── - 1 │ 0 67 male 2 own - 2 │ 1 22 female 2 own - 3 │ 2 49 male 1 own - 4 │ 3 45 male 2 free - 5 │ 4 53 male 2 free - 6 │ 5 35 male 1 free - 7 │ 6 53 male 2 own - 8 │ 7 35 male 3 rent - -julia> transform(df, :Age => maximum) -8×6 DataFrame - Row │ id Age Sex Job Housing Age_maximum - │ Int64 Int64 String7 Int64 String7 Int64 +julia> select(df, ["x", "y"] .=> f) # apply f with automatic column renaming +3×2 DataFrame + Row │ x_f y_f + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 + +julia> ["x", "y"] .=> f .=> ["a", "b"] # f is repeated +2-element Vector{Pair{String, Pair{typeof(f), String}}}: + "x" => (f => "a") + "y" => (f => "b") + +julia> select(df, ["x", "y"] .=> f .=> ["a", "b"]) # apply f with manual column renaming +3×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 +``` + +A renaming function can be applied to multiple columns in the same way. +It will also be repeated in each operation `Pair`. 
+ +```julia +julia> newname(s::String) = s * "_new" +newname (generic function with 1 method) + +julia> ["x", "y"] .=> f .=> newname # both f and newname are repeated +2-element Vector{Pair{String, Pair{typeof(f), typeof(newname)}}}: + "x" => (f => newname) + "y" => (f => newname) + +julia> select(df, ["x", "y"] .=> f .=> newname) # apply f then rename column with newname +3×2 DataFrame + Row │ x_new y_new + │ Int64 Int64 +─────┼────────────── + 1 │ 2 8 + 2 │ 4 10 + 3 │ 6 12 +``` + +You can see from the type output above +that a three element pair does not actually exist. +A `Pair` (as the name implies) can only contain two elements. +Thus, `:x => :y => :z` becomes a nested `Pair`, +where `:x` is the first element and points to the `Pair` `:y => :z`, +which is the second element. + +```julia +julia> p = :x => :y => :z +:x => (:y => :z) + +julia> p[1] +:x + +julia> p[2] +:y => :z + +julia> p[2][1] +:y + +julia> p[2][2] +:z + +julia> p[3] # there is no index 3 for a pair +ERROR: BoundsError: attempt to access Pair{Symbol, Pair{Symbol, Symbol}} at index [3] +``` + +In the previous examples, the source columns have been individually selected. +When broadcasting multiple columns to the same function, +often similarities in the column names or position can be exploited to avoid +tedious selection. +Consider a data frame with temperature data at three different locations +taken over time. +```julia +julia> df = DataFrame(Time = 1:4, + Temperature1 = [20, 23, 25, 28], + Temperature2 = [33, 37, 41, 44], + Temperature3 = [15, 10, 4, 0]) +4×4 DataFrame + Row │ Time Temperature1 Temperature2 Temperature3 + │ Int64 Int64 Int64 Int64 +─────┼───────────────────────────────────────────────── + 1 │ 1 20 33 15 + 2 │ 2 23 37 10 + 3 │ 3 25 41 4 + 4 │ 4 28 44 0 +``` + +To convert all of the temperature data in one transformation, +we just need to define a conversion function and broadcast +it to all of the "Temperature" columns. 
+ +```julia +julia> celsius_to_kelvin(x) = x + 273 +celsius_to_kelvin (generic function with 1 method) + +julia> transform( + df, + Cols(r"Temp") .=> ByRow(celsius_to_kelvin), + renamecols = false + ) +4×4 DataFrame + Row │ Time Temperature1 Temperature2 Temperature3 + │ Int64 Int64 Int64 Int64 +─────┼───────────────────────────────────────────────── + 1 │ 1 293 306 288 + 2 │ 2 296 310 283 + 3 │ 3 298 314 277 + 4 │ 4 301 317 273 +``` + +Or, simultaneously changing the column names: + +```julia +julia> rename_function(s) = "Temperature $(last(s)) (K)" +rename_function (generic function with 1 method) + +julia> select( + df, + "Time", + Cols(r"Temp") .=> ByRow(celsius_to_kelvin) .=> rename_function + ) +4×4 DataFrame + Row │ Time Temperature 1 (K) Temperature 2 (K) Temperature 3 (K) + │ Int64 Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────────────────── + 1 │ 1 293 306 288 + 2 │ 2 296 310 283 + 3 │ 3 298 314 277 + 4 │ 4 301 317 273 +``` + +!!! note "Notes" + * `Not("Time")` or `2:4` would have been equally good choices for `source_column_selector` in the above operations. + * Don't forget `ByRow` if your function is to be applied to elements rather than entire column vectors. + Without `ByRow`, the manipulations above would have thrown + `ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)`. + * Regular expression (`r""`) and `:` `source_column_selectors` + must be wrapped in `Cols` to be broadcast properly + because otherwise the broadcasting occurs before the expression is expanded into a vector of matches. + +You could also broadcast different columns to different functions +by supplying a vector of functions.
+ +```julia +julia> df = DataFrame(a=1:4, b=5:8) +4×2 DataFrame + Row │ a b + │ Int64 Int64 +─────┼────────────── + 1 │ 1 5 + 2 │ 2 6 + 3 │ 3 7 + 4 │ 4 8 + +julia> f1(x) = x .+ 1 +f1 (generic function with 1 method) + +julia> f2(x) = x ./ 10 +f2 (generic function with 1 method) + +julia> transform(df, [:a, :b] .=> [f1, f2]) +4×4 DataFrame + Row │ a b a_f1 b_f2 + │ Int64 Int64 Int64 Float64 +─────┼────────────────────────────── + 1 │ 1 5 2 0.5 + 2 │ 2 6 3 0.6 + 3 │ 3 7 4 0.7 + 4 │ 4 8 5 0.8 +``` + +However, this form is not much more convenient than supplying +multiple individual operations. + +```julia +julia> transform(df, [:a => f1, :b => f2]) # same manipulation as previous +4×4 DataFrame + Row │ a b a_f1 b_f2 + │ Int64 Int64 Int64 Float64 +─────┼────────────────────────────── + 1 │ 1 5 2 0.5 + 2 │ 2 6 3 0.6 + 3 │ 3 7 4 0.7 + 4 │ 4 8 5 0.8 +``` + +Perhaps more useful for broadcasting syntax +is to apply multiple functions to multiple columns +by changing the vector of functions to a 1-by-x matrix of functions. +(Recall that a list, a vector, or a matrix of operation pairs are all valid +for passing to the manipulation functions.) + +```julia +julia> [:a, :b] .=> [f1 f2] # No comma `,` between f1 and f2 +2×2 Matrix{Pair{Symbol}}: + :a=>f1 :a=>f2 + :b=>f1 :b=>f2 + +julia> transform(df, [:a, :b] .=> [f1 f2]) # No comma `,` between f1 and f2 +4×6 DataFrame + Row │ a b a_f1 b_f1 a_f2 b_f2 + │ Int64 Int64 Int64 Int64 Float64 Float64 +─────┼────────────────────────────────────────────── + 1 │ 1 5 2 6 0.1 0.5 + 2 │ 2 6 3 7 0.2 0.6 + 3 │ 3 7 4 8 0.3 0.7 + 4 │ 4 8 5 9 0.4 0.8 +``` + +In this way, every combination of selected columns and functions will be applied. + +Pair broadcasting is a simple but powerful tool +that can be used in any of the manipulation functions listed under +[Basic Usage of Manipulation Functions](@ref). +Experiment for yourself to discover other useful operations. 
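The shape rules behind these results come from Base Julia broadcasting, so they can be checked without a data frame. The sketch below is an illustration rather than part of the original examples: `add1` and `div10` are hypothetical operation functions, and the `Ref` wrapper is a Base Julia idiom that this section does not otherwise cover.

```julia
# Base Julia only: these lines build operation pairs but never call DataFrames.jl.
add1(x) = x .+ 1     # hypothetical operation function
div10(x) = x ./ 10   # hypothetical operation function

cols = [:a, :b]

elementwise = cols .=> [add1, div10]       # vector .=> vector: one pair per position
crossed     = cols .=> [add1 div10]        # vector .=> row matrix: every combination
shared      = Ref(cols) .=> [add1, div10]  # Ref(cols) acts as a scalar, so `cols` is repeated whole

size(elementwise)   # (2,)
size(crossed)       # (2, 2)
first(shared)       # [:a, :b] => add1
```

Under the positional-argument rule described earlier, each pair in `shared` is expected to pass both columns to its function when given to a manipulation function; that step is an assumption here and is not run.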
+ +### Additional Resources +More details and examples of operation pair syntax can be found in +[this blog post](https://bkamins.github.io/julialang/2020/12/24/minilanguage.html). +(The official wording describing the syntax has changed since the blog post was written, +but the examples are still illustrative. +The operation pair syntax is sometimes referred to as the DataFrames.jl mini-language +or Domain-Specific Language.) + +For additional syntax niceties, +many users find the [Chain.jl](https://github.com/jkrumbiegel/Chain.jl) +and [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl) +packages useful +to help simplify manipulations that may be tedious with operation pairs alone. + +## Approach Comparison + +After that deep dive into [Manipulation Functions](@ref), +it is a good idea to review the alternative approaches covered in +[Getting and Setting Data in a Data Frame](@ref). +Let us compare the two approaches with a few examples. + +### Convenience + +For simple operations, +often getting/setting data with dot syntax +is simpler than the equivalent data frame manipulation. +Here we will add the two columns of our data frame together +and place the result in a new third column. 
+ +Setup: + +```julia +julia> df = DataFrame(x = 1:3, y = 4:6) # define data frame +3×2 DataFrame + Row │ x y + │ Int64 Int64 +─────┼────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 +``` + +Manipulation: + +```julia +julia> transform!(df, [:x, :y] => (+) => :z) +3×3 DataFrame + Row │ x y z + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` + +Dot Syntax: + +```julia +julia> df.x # dot syntax returns a vector +3-element Vector{Int64}: + 1 + 2 + 3 + +julia> df.z = df.x + df.y +3-element Vector{Int64}: + 5 + 7 + 9 + +julia> df # see that the previous expression updated the data frame `df` +3×3 DataFrame + Row │ x y z + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` + +Recall that the return type from a data frame manipulation function call is always a `DataFrame`. +The return type of a data frame column accessed with dot syntax is a `Vector`. +Thus the expression `df.x + df.y` gets the column data as vectors +and returns the result of the vector addition. +However, in that same line, +we assigned the resultant `Vector` to a new column `z` in the data frame `df`. +We could have instead assigned the resultant `Vector` to some other variable, +and then `df` would not have been altered. +The approach with dot syntax is very versatile +since the data getting, mathematics, and data setting can be separate steps. + +```julia +julia> df.x +3-element Vector{Int64}: + 1 + 2 + 3 + +julia> v = df.x + df.y +3-element Vector{Int64}: + 5 + 7 + 9 + +julia> df.z = v +3-element Vector{Int64}: + 5 + 7 + 9 +``` + +One downside to dot syntax is that the column name must be explicitly written in the code. +Indexing syntax can perform a similar operation with dynamic column names. +(Manipulation functions can also work with dynamic column names as will be shown in the next example.) 
+ +```julia +julia> df = DataFrame("My First Column" => 1:3, "My Second Column" => 4:6) # define data frame +3×2 DataFrame + Row │ My First Column My Second Column + │ Int64 Int64 +─────┼─────────────────────────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 + +julia> c1 = "My First Column"; c2 = "My Second Column"; c3 = "My Third Column"; # define column names + +# Imagine the above data was read from a file or entered by a user at runtime. + +julia> df.c1 # dot syntax expects an explicit column name and cannot be used +ERROR: ArgumentError: column name :c1 not found in the data frame + +julia> df[:, c3] = df[:, c1] + df[:, c2] # access columns with names stored in variables +3-element Vector{Int64}: + 5 + 7 + 9 + +julia> df # see that the previous expression updated the data frame `df` +3×3 DataFrame + Row │ My First Column My Second Column My Third Column + │ Int64 Int64 Int64 ─────┼──────────────────────────────────────────────────── - 1 │ 0 67 male 2 own 67 - 2 │ 1 22 female 2 own 67 - 3 │ 2 49 male 1 own 67 - 4 │ 3 45 male 2 free 67 - 5 │ 4 53 male 2 free 67 - 6 │ 5 35 male 1 free 67 - 7 │ 6 53 male 2 own 67 - 8 │ 7 35 male 3 rent 67 + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 ``` -In the example below we are swapping values stored in columns `:Sex` and `:Age`: +One benefit of using manipulation functions is that +the name of the data frame only needs to be written once. 
-```jldoctest dataframe -julia> transform(german, :Age => :Sex, :Sex => :Age) -1000×10 DataFrame - Row │ id Age Sex Job Housing Saving accounts Checking accou ⋯ - │ Int64 String7 Int64 Int64 String7 String15 String15 ⋯ -──────┼───────────────────────────────────────────────────────────────────────── - 1 │ 0 male 67 2 own NA little ⋯ - 2 │ 1 female 22 2 own little moderate - 3 │ 2 male 49 1 own little NA - 4 │ 3 male 45 2 free little little - 5 │ 4 male 53 2 free little little ⋯ - 6 │ 5 male 35 1 free NA NA - 7 │ 6 male 53 2 own quite rich NA - 8 │ 7 male 35 3 rent little moderate - ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ - 994 │ 993 male 30 3 own little little ⋯ - 995 │ 994 male 50 2 own NA NA - 996 │ 995 female 31 1 own little NA - 997 │ 996 male 40 3 own little little - 998 │ 997 male 38 2 own little NA ⋯ - 999 │ 998 male 23 2 free little little - 1000 │ 999 male 27 2 own moderate moderate - 4 columns and 985 rows omitted +Setup: + +```julia +julia> my_very_long_data_frame_name = DataFrame( + "My First Column" => 1:3, + "My Second Column" => 4:6 + ) # define data frame +3×2 DataFrame + Row │ My First Column My Second Column + │ Int64 Int64 +─────┼─────────────────────────────────── + 1 │ 1 4 + 2 │ 2 5 + 3 │ 3 6 + +julia> c1 = "My First Column"; c2 = "My Second Column"; c3 = "My Third Column"; # define column names ``` -If we give more than one source column to a transformation they are passed as -consecutive positional arguments. 
So for example the -`[:Age, :Job] => (+) => :res` transformation below evaluates `+(df1.Age, df1.Job)` -(which adds two columns) and stores the result in the `:res` column: +Manipulation: -```jldoctest dataframe -julia> select(german, :Age, :Job, [:Age, :Job] => (+) => :res) -1000×3 DataFrame - Row │ Age Job res - │ Int64 Int64 Int64 -──────┼───────────────────── - 1 │ 67 2 69 - 2 │ 22 2 24 - 3 │ 49 1 50 - 4 │ 45 2 47 - 5 │ 53 2 55 - 6 │ 35 1 36 - 7 │ 53 2 55 - 8 │ 35 3 38 - ⋮ │ ⋮ ⋮ ⋮ - 994 │ 30 3 33 - 995 │ 50 2 52 - 996 │ 31 1 32 - 997 │ 40 3 43 - 998 │ 38 2 40 - 999 │ 23 2 25 - 1000 │ 27 2 29 - 985 rows omitted -``` - -This concludes the introductory examples of data frame manipulations. -See later sections of the manual, -particularly [A Gentle Introduction to Data Frame Manipulation Functions](@ref), -for additional explanations and functionality, -including how to broadcast operation functions and operation pairs -and how to pass or produce multiple columns using `AsTable`. +```julia +julia> transform!(my_very_long_data_frame_name, [c1, c2] => (+) => c3) +3×3 DataFrame + Row │ My First Column My Second Column My Third Column + │ Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` + +Indexing: + +```julia +julia> my_very_long_data_frame_name[:, c3] = my_very_long_data_frame_name[:, c1] + my_very_long_data_frame_name[:, c2] +3-element Vector{Int64}: + 5 + 7 + 9 + +julia> my_very_long_data_frame_name # see that the previous expression updated the data frame +3×3 DataFrame + Row │ My First Column My Second Column My Third Column + │ Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` + +### Speed + +TODO: Compare speed, memory, and view options (@view, !, :, copycols=false). +(May need someone else to write this part unless I do more studying.)
diff --git a/docs/src/man/manipulation_functions.md b/docs/src/man/manipulation_functions.md deleted file mode 100644 index 72df94476..000000000 --- a/docs/src/man/manipulation_functions.md +++ /dev/null @@ -1,1431 +0,0 @@ -# A Gentle Introduction to Data Frame Manipulation Functions - -The seven functions below can be used to manipulate data frames -by applying operations to them. -This section of the documentation aims to methodically build understanding -of these functions and their possible arguments -by reinforcing foundational concepts and slowly increasing complexity. - -The functions without a `!` in their name -will create a new data frame based on the source data frame, -so you will probably want to store the new data frame to a new variable name, -e.g. `new_df = transform(source_df, operation)`. -The functions with a `!` at the end of their name -will modify an existing data frame in-place, -so there is typically no need to assign the result to a variable, -e.g. `transform!(source_df, operation)` instead of -`source_df = transform(source_df, operation)`. - -The number of columns and rows in the resultant data frame varies -depending on the manipulation function employed. - -| Function | Memory Usage | Column Retention | Row Retention | -| ------------ | -------------------------------- | --------------------------------------- | --------------------------------------------------- | -| `transform` | Creates a new data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | -| `transform!` | Modifies an existing data frame. | Retains original and resultant columns. | Retains same number of rows as original data frame. | -| `select` | Creates a new data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. | -| `select!` | Modifies an existing data frame. | Retains only resultant columns. | Retains same number of rows as original data frame. 
| -| `subset` | Creates a new data frame. | Retains original columns. | Retains only rows where condition is true. | -| `subset!` | Modifies an existing data frame. | Retains original columns. | Retains only rows where condition is true. | -| `combine` | Creates a new data frame. | Retains only resultant columns. | Retains only resultant rows. | - -## Constructing Operations - -All of the functions above use the same syntax which is commonly -`manipulation_function(dataframe, operation)`. -The `operation` argument defines the -operation to be applied to the source `dataframe`, -and it can take any of the following common forms explained below: - -`source_column_selector` -: selects source column(s) without manipulating or renaming them - - Examples: `:a`, `[:a, :b]`, `All()`, `Not(:a)` - -`source_column_selector => operation_function` -: passes source column(s) as arguments to a function -and automatically names the resulting column(s) - - Examples: `:a => sum`, `[:a, :b] => +`, `:a => ByRow(==(3))` - -`source_column_selector => operation_function => new_column_names` -: passes source column(s) as arguments to a function -and names the resulting column(s) `new_column_names` - - Examples: `:a => sum => :sum_of_a`, `[:a, :b] => + => :a_plus_b` - - *(Not available for `subset` or `subset!`)* - -`source_column_selector => new_column_names` -: renames a source column, -or splits a column containing collection elements into multiple new columns - - Examples: `:a => :new_a`, `:a_b => [:a, :b]`, `:nt => AsTable` - - (*Not available for `subset` or `subset!`*) - -The `=>` operator constructs a -[Pair](https://docs.julialang.org/en/v1/base/collections/#Core.Pair), -which is a type to link one object to another. -(Pairs are commonly used to create elements of a -[Dictionary](https://docs.julialang.org/en/v1/base/collections/#Dictionaries).) -In DataFrames.jl manipulation functions, -`Pair` arguments are used to define column `operations` to be performed. 
-The examples shown above will be explained in more detail later. - -*The manipulation functions also have methods for applying multiple operations. -See the later sections [Applying Multiple Operations per Manipulation](@ref) -and [Broadcasting Operation Pairs](@ref) for more information.* - -### `source_column_selector` -Inside an `operation`, `source_column_selector` is usually a column name -or column index which identifies a data frame column. - -`source_column_selector` may be used as the entire `operation` -with `select` or `select!` to isolate or reorder columns. - -```julia -julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = [7, 8, 9]) -3×3 DataFrame - Row │ a b c - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 7 - 2 │ 2 5 8 - 3 │ 3 6 9 - -julia> select(df, :b) -3×1 DataFrame - Row │ b - │ Int64 -─────┼─────── - 1 │ 4 - 2 │ 5 - 3 │ 6 - -julia> select(df, "b") -3×1 DataFrame - Row │ b - │ Int64 -─────┼─────── - 1 │ 4 - 2 │ 5 - 3 │ 6 - -julia> select(df, 2) -3×1 DataFrame - Row │ b - │ Int64 -─────┼─────── - 1 │ 4 - 2 │ 5 - 3 │ 6 -``` - -`source_column_selector` may also be used as the entire `operation` -with `subset` or `subset!` if the source column contains `Bool` values. - -```julia -julia> df = DataFrame( - name = ["Scott", "Jill", "Erica", "Jimmy"], - minor = [false, true, false, true], - ) -4×2 DataFrame - Row │ name minor - │ String Bool -─────┼─────────────── - 1 │ Scott false - 2 │ Jill true - 3 │ Erica false - 4 │ Jimmy true - -julia> subset(df, :minor) -2×2 DataFrame - Row │ name minor - │ String Bool -─────┼─────────────── - 1 │ Jill true - 2 │ Jimmy true -``` - -`source_column_selector` may instead be a collection of columns such as a vector, -a [regular expression](https://docs.julialang.org/en/v1/manual/strings/#Regular-Expressions), -a `Not`, `Between`, `All`, or `Cols` expression, -or a `:`. -See the [Indexing](@ref) API for the full list of possible values with references. - -!!! 
Note - The Julia parser sometimes prevents `:` from being used by itself. - If you get - `ERROR: syntax: whitespace not allowed after ":" used for quoting`, - try using `All()`, `Cols(:)`, or `(:)` instead to select all columns. - -```julia -julia> df = DataFrame( - id = [1, 2, 3], - first_name = ["José", "Emma", "Nathan"], - last_name = ["Garcia", "Marino", "Boyer"], - age = [61, 24, 33] - ) -3×4 DataFrame - Row │ id first_name last_name age - │ Int64 String String Int64 -─────┼───────────────────────────────────── - 1 │ 1 José Garcia 61 - 2 │ 2 Emma Marino 24 - 3 │ 3 Nathan Boyer 33 - -julia> select(df, [:last_name, :first_name]) -3×2 DataFrame - Row │ last_name first_name - │ String String -─────┼─────────────────────── - 1 │ Garcia José - 2 │ Marino Emma - 3 │ Boyer Nathan - -julia> select(df, r"name") -3×2 DataFrame - Row │ first_name last_name - │ String String -─────┼─────────────────────── - 1 │ José Garcia - 2 │ Emma Marino - 3 │ Nathan Boyer - -julia> select(df, Not(:id)) -3×3 DataFrame - Row │ first_name last_name age - │ String String Int64 -─────┼────────────────────────────── - 1 │ José Garcia 61 - 2 │ Emma Marino 24 - 3 │ Nathan Boyer 33 - -julia> select(df, Between(2,4)) -3×3 DataFrame - Row │ first_name last_name age - │ String String Int64 -─────┼────────────────────────────── - 1 │ José Garcia 61 - 2 │ Emma Marino 24 - 3 │ Nathan Boyer 33 - -julia> df2 = DataFrame( - name = ["Scott", "Jill", "Erica", "Jimmy"], - minor = [false, true, false, true], - male = [true, false, false, true], - ) -4×3 DataFrame - Row │ name minor male - │ String Bool Bool -─────┼────────────────────── - 1 │ Scott false true - 2 │ Jill true false - 3 │ Erica false false - 4 │ Jimmy true true - -julia> subset(df2, [:minor, :male]) -1×3 DataFrame - Row │ name minor male - │ String Bool Bool -─────┼───────────────────── - 1 │ Jimmy true true -``` - -### `operation_function` -Inside an `operation` pair, `operation_function` is a function -which operates on data frame columns 
passed as vectors. -When multiple columns are selected by `source_column_selector`, -the `operation_function` will receive the columns as separate positional arguments -in the order they were selected, e.g. `f(column1, column2, column3)`. - -```julia -julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 4]) -3×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 2 5 - 3 │ 3 4 - -julia> combine(df, :a => sum) -1×1 DataFrame - Row │ a_sum - │ Int64 -─────┼─────── - 1 │ 6 - -julia> transform(df, :b => maximum) # `transform` and `select` copy scalar result to all rows -3×3 DataFrame - Row │ a b b_maximum - │ Int64 Int64 Int64 -─────┼───────────────────────── - 1 │ 1 4 5 - 2 │ 2 5 5 - 3 │ 3 4 5 - -julia> transform(df, [:b, :a] => -) # vector subtraction is okay -3×3 DataFrame - Row │ a b b_a_- - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 3 - 2 │ 2 5 3 - 3 │ 3 4 1 - -julia> transform(df, [:a, :b] => *) # vector multiplication is not defined -ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64}) -``` - -Don't worry! There is a quick fix for the previous error. -If you want to apply a function to each element in a column -instead of to the entire column vector, -then you can wrap your element-wise function in `ByRow` like -`ByRow(my_elementwise_function)`. -This will apply `my_elementwise_function` to every element in the column -and then collect the results back into a vector. 
- -```julia -julia> transform(df, [:a, :b] => ByRow(*)) -3×3 DataFrame - Row │ a b a_b_* - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 4 - 2 │ 2 5 10 - 3 │ 3 4 12 - -julia> transform(df, Cols(:) => ByRow(max)) -3×3 DataFrame - Row │ a b a_b_max - │ Int64 Int64 Int64 -─────┼─────────────────────── - 1 │ 1 4 4 - 2 │ 2 5 5 - 3 │ 3 4 4 - -julia> f(x) = x + 1 -f (generic function with 1 method) - -julia> transform(df, :a => ByRow(f)) -3×3 DataFrame - Row │ a b a_f - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 2 - 2 │ 2 5 3 - 3 │ 3 4 4 -``` - -Alternatively, you may just want to define the function itself so it -[broadcasts](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) -over vectors. - -```julia -julia> g(x) = x .+ 1 -g (generic function with 1 method) - -julia> transform(df, :a => g) -3×3 DataFrame - Row │ a b a_g - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 2 - 2 │ 2 5 3 - 3 │ 3 4 4 - -julia> h(x, y) = x .+ y .+ 1 -h (generic function with 1 method) - -julia> transform(df, [:a, :b] => h) -3×3 DataFrame - Row │ a b a_b_h - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 4 6 - 2 │ 2 5 8 - 3 │ 3 4 8 -``` - -[Anonymous functions](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions) -are a convenient way to define and use an `operation_function` -all within the manipulation function call. 
- -```julia -julia> select(df, :a => ByRow(x -> x + 1)) -3×1 DataFrame - Row │ a_function - │ Int64 -─────┼──────────── - 1 │ 2 - 2 │ 3 - 3 │ 4 - -julia> transform(df, [:a, :b] => ByRow((x, y) -> 2x + y)) -3×3 DataFrame - Row │ a b a_b_function - │ Int64 Int64 Int64 -─────┼──────────────────────────── - 1 │ 1 4 6 - 2 │ 2 5 9 - 3 │ 3 4 10 - -julia> subset(df, :b => ByRow(x -> x < 5)) -2×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 3 4 - -julia> subset(df, :b => ByRow(<(5))) # shorter version of the previous -2×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 3 4 -``` - -!!! Note - `operation_functions` within `subset` or `subset!` function calls - must return a Boolean vector. - `true` elements in the Boolean vector will determine - which rows are retained in the resulting data frame. - -As demonstrated above, `DataFrame` columns are usually passed -from `source_column_selector` to `operation_function` as one or more -vector arguments. -However, when `AsTable(source_column_selector)` is used, -the selected columns are collected and passed as a single `NamedTuple` -to `operation_function`. - -This is often useful when your `operation_function` is defined to operate -on a single collection argument rather than on multiple positional arguments. -The distinction is somewhat similar to the difference between the built-in -`min` and `minimum` functions. -`min` is defined to find the minimum value among multiple positional arguments, -while `minimum` is defined to find the minimum value -among the elements of a single collection argument. 
- -```julia -julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 2:-1:1) -2×4 DataFrame - Row │ a b c d - │ Int64 Int64 Int64 Int64 -─────┼──────────────────────────── - 1 │ 1 3 5 2 - 2 │ 2 4 6 1 - -julia> select(df, Cols(:) => ByRow(min)) # min operates on multiple arguments -2×1 DataFrame - Row │ a_b_etc_min - │ Int64 -─────┼───────────── - 1 │ 1 - 2 │ 1 - -julia> select(df, AsTable(:) => ByRow(minimum)) # minimum operates on a collection -2×1 DataFrame - Row │ a_b_etc_minimum - │ Int64 -─────┼───────────────── - 1 │ 1 - 2 │ 1 - -julia> select(df, [:a,:b] => ByRow(+)) # `+` operates on a multiple arguments -2×1 DataFrame - Row │ a_b_+ - │ Int64 -─────┼─────── - 1 │ 4 - 2 │ 6 - -julia> select(df, AsTable([:a,:b]) => ByRow(sum)) # `sum` operates on a collection -2×1 DataFrame - Row │ a_b_sum - │ Int64 -─────┼───────── - 1 │ 4 - 2 │ 6 - -julia> using Statistics # contains the `mean` function - -julia> select(df, AsTable(Between(:b, :d)) => ByRow(mean)) # `mean` operates on a collection -2×1 DataFrame - Row │ b_c_d_mean - │ Float64 -─────┼──────────── - 1 │ 3.33333 - 2 │ 3.66667 -``` - -`AsTable` can also be used to pass columns to a function which operates -on fields of a `NamedTuple`. 
- -```julia -julia> df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 7:8) -2×4 DataFrame - Row │ a b c d - │ Int64 Int64 Int64 Int64 -─────┼──────────────────────────── - 1 │ 1 3 5 7 - 2 │ 2 4 6 8 - -julia> f(nt) = nt.a + nt.d -f (generic function with 1 method) - -julia> transform(df, AsTable(:) => ByRow(f)) -2×5 DataFrame - Row │ a b c d a_b_etc_f - │ Int64 Int64 Int64 Int64 Int64 -─────┼─────────────────────────────────────── - 1 │ 1 3 5 7 8 - 2 │ 2 4 6 8 10 -``` - -As demonstrated above, -in the `source_column_selector => operation_function` operation pair form, -the results of an operation will be placed into a new column with an -automatically-generated name based on the operation; -the new column name will be the `operation_function` name -appended to the source column name(s) with an underscore. - -This automatic column naming behavior can be avoided in two ways. -First, the operation result can be placed back into the original column -with the original column name by switching the keyword argument `renamecols` -from its default value (`true`) to `renamecols=false`. -This option prevents the function name from being appended to the column name -as it usually would be. - -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> transform(df, :a => ByRow(x->x+10), renamecols=false) # add 10 in-place -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 11 5 - 2 │ 12 6 - 3 │ 13 7 - 4 │ 14 8 -``` - -The second method to avoid the default manipulation column naming is to -specify your own `new_column_names`. - -### `new_column_names` - -`new_column_names` can be included at the end of an `operation` pair to specify -the name of the new column(s). -`new_column_names` may be a symbol, string, function, vector of symbols, vector of strings, or `AsTable`. 
- -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> transform(df, Cols(:) => ByRow(+) => :c) -4×3 DataFrame - Row │ a b c - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 6 - 2 │ 2 6 8 - 3 │ 3 7 10 - 4 │ 4 8 12 - -julia> transform(df, Cols(:) => ByRow(+) => "a+b") -4×3 DataFrame - Row │ a b a+b - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 6 - 2 │ 2 6 8 - 3 │ 3 7 10 - 4 │ 4 8 12 - -julia> transform(df, :a => ByRow(x->x+10) => "a+10") -4×3 DataFrame - Row │ a b a+10 - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 11 - 2 │ 2 6 12 - 3 │ 3 7 13 - 4 │ 4 8 14 -``` - -The `source_column_selector => new_column_names` operation form -can be used to rename columns without an intermediate function. -However, there are `rename` and `rename!` functions, -which accept similar syntax, -that tend to be more useful for this operation. - -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> transform(df, :a => :apple) # adds column `apple` -4×3 DataFrame - Row │ a b apple - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 1 - 2 │ 2 6 2 - 3 │ 3 7 3 - 4 │ 4 8 4 - -julia> select(df, :a => :apple) # retains only column `apple` -4×1 DataFrame - Row │ apple - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - 3 │ 3 - 4 │ 4 - -julia> rename(df, :a => :apple) # renames column `a` to `apple` without adding a column (`rename!` modifies `df` itself) -4×2 DataFrame - Row │ apple b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 -``` - -If `new_column_names` already exist in the source data frame, -those columns will be replaced in the existing column location -rather than being added to the end. -This can be done by manually specifying an existing column name -or by using the `renamecols=false` keyword argument.
- -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> transform(df, :b => (x -> x .+ 10)) # automatic new column and column name -4×3 DataFrame - Row │ a b b_function - │ Int64 Int64 Int64 -─────┼────────────────────────── - 1 │ 1 5 15 - 2 │ 2 6 16 - 3 │ 3 7 17 - 4 │ 4 8 18 - -julia> transform(df, :b => (x -> x .+ 10), renamecols=false) # transform column in-place -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 15 - 2 │ 2 16 - 3 │ 3 17 - 4 │ 4 18 - -julia> transform(df, :b => (x -> x .+ 10) => :a) # replace column :a -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 15 5 - 2 │ 16 6 - 3 │ 17 7 - 4 │ 18 8 -``` - -Actually, `renamecols=false` just prevents the function name from being appended to the final column name such that the operation is *usually* returned to the same column. - -```julia -julia> transform(df, [:a, :b] => +) # new column name is all source columns and function name -4×3 DataFrame - Row │ a b a_b_+ - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 6 - 2 │ 2 6 8 - 3 │ 3 7 10 - 4 │ 4 8 12 - -julia> transform(df, [:a, :b] => +, renamecols=false) # same as above but with no function name -4×3 DataFrame - Row │ a b a_b - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 6 - 2 │ 2 6 8 - 3 │ 3 7 10 - 4 │ 4 8 12 - -julia> transform(df, [:a, :b] => (+) => :a) # manually overwrite column :a (see Note below about parentheses) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 6 5 - 2 │ 8 6 - 3 │ 10 7 - 4 │ 12 8 -``` - -In the `source_column_selector => operation_function => new_column_names` operation form, -`new_column_names` may also be a renaming function which operates on a string -to create the destination column names programmatically. 
- -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> add_prefix(s) = "new_" * s -add_prefix (generic function with 1 method) - -julia> transform(df, :a => (x -> 10 .* x) => add_prefix) # with named renaming function -4×3 DataFrame - Row │ a b new_a - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 10 - 2 │ 2 6 20 - 3 │ 3 7 30 - 4 │ 4 8 40 - -julia> transform(df, :a => (x -> 10 .* x) => (s -> "new_" * s)) # with anonymous renaming function -4×3 DataFrame - Row │ a b new_a - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 10 - 2 │ 2 6 20 - 3 │ 3 7 30 - 4 │ 4 8 40 -``` - -!!! Note - It is a good idea to wrap anonymous functions in parentheses - to avoid the `=>` operator accidentally becoming part of the anonymous function. - The examples above do not work correctly without the parentheses! - ```julia - julia> transform(df, :a => x -> 10 .* x => add_prefix) # Not what we wanted! - 4×3 DataFrame - Row │ a b a_function - │ Int64 Int64 Pair… - ─────┼──────────────────────────────────────────── - 1 │ 1 5 [10, 20, 30, 40]=>add_prefix - 2 │ 2 6 [10, 20, 30, 40]=>add_prefix - 3 │ 3 7 [10, 20, 30, 40]=>add_prefix - 4 │ 4 8 [10, 20, 30, 40]=>add_prefix - - julia> transform(df, :a => x -> 10 .* x => s -> "new_" * s) # Not what we wanted! - 4×3 DataFrame - Row │ a b a_function - │ Int64 Int64 Pair… - ─────┼───────────────────────────────────── - 1 │ 1 5 [10, 20, 30, 40]=>#18 - 2 │ 2 6 [10, 20, 30, 40]=>#18 - 3 │ 3 7 [10, 20, 30, 40]=>#18 - 4 │ 4 8 [10, 20, 30, 40]=>#18 - ``` - -A renaming function will not work in the -`source_column_selector => new_column_names` operation form -because a function in the second element of the operation pair is assumed to take -the `source_column_selector => operation_function` operation form.
-To work around this limitation, use the -`source_column_selector => operation_function => new_column_names` operation form -with `identity` as the `operation_function`. - -```julia -julia> transform(df, :a => add_prefix) -ERROR: MethodError: no method matching *(::String, ::Vector{Int64}) - -julia> transform(df, :a => identity => add_prefix) -4×3 DataFrame - Row │ a b new_a - │ Int64 Int64 Int64 -─────┼───────────────────── - 1 │ 1 5 1 - 2 │ 2 6 2 - 3 │ 3 7 3 - 4 │ 4 8 4 -``` - -In this case though, -it is probably again more useful to use the `rename` or `rename!` function -rather than one of the manipulation functions -in order to rename in-place and avoid the intermediate `operation_function`. -```julia -julia> rename(add_prefix, df) # rename all columns with a function -4×2 DataFrame - Row │ new_a new_b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> rename(add_prefix, df; cols=:a) # rename some columns with a function -4×2 DataFrame - Row │ new_a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 -``` - -In the `source_column_selector => new_column_names` operation form, -only a single source column may be selected per operation, -so why is `new_column_names` plural? -It is possible to split the data contained inside a single column -into multiple new columns by supplying a vector of strings or symbols -as `new_column_names`. - -```julia -julia> df = DataFrame(data = [(1,2), (3,4)]) # vector of tuples -2×1 DataFrame - Row │ data - │ Tuple… -─────┼──────── - 1 │ (1, 2) - 2 │ (3, 4) - -julia> transform(df, :data => [:first, :second]) # manual naming -2×3 DataFrame - Row │ data first second - │ Tuple… Int64 Int64 -─────┼─────────────────────── - 1 │ (1, 2) 1 2 - 2 │ (3, 4) 3 4 -``` - -This kind of data splitting can even be done automatically with `AsTable`. 
- -```julia -julia> transform(df, :data => AsTable) # default automatic naming with tuples -2×3 DataFrame - Row │ data x1 x2 - │ Tuple… Int64 Int64 -─────┼────────────────────── - 1 │ (1, 2) 1 2 - 2 │ (3, 4) 3 4 -``` - -If a data frame column contains `NamedTuple`s, -then `AsTable` will preserve the field names. -```julia -julia> df = DataFrame(data = [(a=1,b=2), (a=3,b=4)]) # vector of named tuples -2×1 DataFrame - Row │ data - │ NamedTup… -─────┼──────────────── - 1 │ (a = 1, b = 2) - 2 │ (a = 3, b = 4) - -julia> transform(df, :data => AsTable) # keeps names from named tuples -2×3 DataFrame - Row │ data a b - │ NamedTup… Int64 Int64 -─────┼────────────────────────────── - 1 │ (a = 1, b = 2) 1 2 - 2 │ (a = 3, b = 4) 3 4 -``` - -!!! Note - To pack multiple columns into a single column of `NamedTuple`s - (reverse of the above operation) - apply the `identity` function `ByRow`, e.g. - `transform(df, AsTable([:a, :b]) => ByRow(identity) => :data)`. - -Renaming functions also work for multi-column transformations, -but they must operate on a vector of strings. 
- -```julia -julia> df = DataFrame(data = [(1,2), (3,4)]) -2×1 DataFrame - Row │ data - │ Tuple… -─────┼──────── - 1 │ (1, 2) - 2 │ (3, 4) - -julia> new_names(v) = ["primary ", "secondary "] .* v -new_names (generic function with 1 method) - -julia> transform(df, :data => identity => new_names) -2×3 DataFrame - Row │ data primary data secondary data - │ Tuple… Int64 Int64 -─────┼────────────────────────────────────── - 1 │ (1, 2) 1 2 - 2 │ (3, 4) 3 4 -``` - -## Applying Multiple Operations per Manipulation -All data frame manipulation functions can accept multiple `operation` pairs -at once using any of the following methods: -- `manipulation_function(dataframe, operation1, operation2)` : multiple arguments -- `manipulation_function(dataframe, [operation1, operation2])` : vector argument -- `manipulation_function(dataframe, [operation1 operation2])` : matrix argument - -Passing multiple operations is especially useful for the `select`, `select!`, -and `combine` manipulation functions, -since they only retain columns which are a result of the passed operations. 
- -```julia -julia> df = DataFrame(a = 1:4, b = [50,50,60,60], c = ["hat","bat","cat","dog"]) -4×3 DataFrame - Row │ a b c - │ Int64 Int64 String -─────┼────────────────────── - 1 │ 1 50 hat - 2 │ 2 50 bat - 3 │ 3 60 cat - 4 │ 4 60 dog - -julia> combine(df, :a => maximum, :b => sum, :c => join) # 3 combine operations -1×3 DataFrame - Row │ a_maximum b_sum c_join - │ Int64 Int64 String -─────┼──────────────────────────────── - 1 │ 4 220 hatbatcatdog - -julia> select(df, :c, :b, :a) # re-order columns -4×3 DataFrame - Row │ c b a - │ String Int64 Int64 -─────┼────────────────────── - 1 │ hat 50 1 - 2 │ bat 50 2 - 3 │ cat 60 3 - 4 │ dog 60 4 - -julia> select(df, :b, :) # `:` here means all other columns -4×3 DataFrame - Row │ b a c - │ Int64 Int64 String -─────┼────────────────────── - 1 │ 50 1 hat - 2 │ 50 2 bat - 3 │ 60 3 cat - 4 │ 60 4 dog - -julia> select( - df, - :c => (x -> "a " .* x) => :one_c, - :a => (x -> 100x), - :b, - renamecols=false - ) # can mix operation forms -4×3 DataFrame - Row │ one_c a b - │ String Int64 Int64 -─────┼────────────────────── - 1 │ a hat 100 50 - 2 │ a bat 200 50 - 3 │ a cat 300 60 - 4 │ a dog 400 60 - -julia> select( - df, - :c => ByRow(reverse), - :c => ByRow(uppercase) - ) # multiple operations on same column -4×2 DataFrame - Row │ c_reverse c_uppercase - │ String String -─────┼──────────────────────── - 1 │ tah HAT - 2 │ tab BAT - 3 │ tac CAT - 4 │ god DOG -``` - -In the last two examples, -the manipulation function arguments were split across multiple lines. -This is a good way to make manipulations with many operations more readable. - -Passing multiple operations to `subset` or `subset!` is an easy way to narrow in -on a particular row of data.
- -```julia -julia> subset( - df, - :b => ByRow(==(60)), - :c => ByRow(contains("at")) - ) # rows with 60 and "at" -1×3 DataFrame - Row │ a b c - │ Int64 Int64 String -─────┼────────────────────── - 1 │ 3 60 cat -``` - -Note that all operations within a single manipulation must use the data -as it existed before the function call -i.e. you cannot use newly created columns for subsequent operations -within the same manipulation. - -```julia -julia> transform( - df, - [:a, :b] => ByRow(+) => :d, - :d => (x -> x ./ 2), - ) # requires two separate transformations -ERROR: ArgumentError: column name :d not found in the data frame; existing most similar names are: :a, :b and :c - -julia> new_df = transform(df, [:a, :b] => ByRow(+) => :d) -4×4 DataFrame - Row │ a b c d - │ Int64 Int64 String Int64 -─────┼───────────────────────────── - 1 │ 1 50 hat 51 - 2 │ 2 50 bat 52 - 3 │ 3 60 cat 63 - 4 │ 4 60 dog 64 - -julia> transform!(new_df, :d => (x -> x ./ 2) => :d_2) -4×5 DataFrame - Row │ a b c d d_2 - │ Int64 Int64 String Int64 Float64 -─────┼────────────────────────────────────── - 1 │ 1 50 hat 51 25.5 - 2 │ 2 50 bat 52 26.0 - 3 │ 3 60 cat 63 31.5 - 4 │ 4 60 dog 64 32.0 -``` - - -## Broadcasting Operation Pairs - -[Broadcasting](https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting) -pairs with `.=>` is often a convenient way to generate multiple -similar `operation`s to be applied within a single manipulation. -Broadcasting within the `Pair` of an `operation` is no different than -broadcasting in base Julia. -The broadcasting `.=>` will be expanded into a vector of pairs -(`[operation1, operation2, ...]`), -and this expansion will occur before the manipulation function is invoked. -Then the manipulation function will use the -`manipulation_function(dataframe, [operation1, operation2, ...])` method. -This process will be explained in more detail below. - -To illustrate these concepts, let us first examine the `Type` of a basic `Pair`. 
-In DataFrames.jl, a symbol, string, or integer -may be used to select a single column. -Some `Pair`s with these types are below. - -```julia -julia> typeof(:x => :a) -Pair{Symbol, Symbol} - -julia> typeof("x" => "a") -Pair{String, String} - -julia> typeof(1 => "a") -Pair{Int64, String} -``` - -Any of the `Pair`s above could be used to rename the first column -of the data frame below to `a`. - -```julia -julia> df = DataFrame(x = 1:3, y = 4:6) -3×2 DataFrame - Row │ x y - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 2 5 - 3 │ 3 6 - -julia> select(df, :x => :a) -3×1 DataFrame - Row │ a - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - 3 │ 3 - -julia> select(df, 1 => "a") -3×1 DataFrame - Row │ a - │ Int64 -─────┼─────── - 1 │ 1 - 2 │ 2 - 3 │ 3 -``` - -What should we do if we want to keep and rename both the `x` and `y` column? -One option is to supply a `Vector` of operation `Pair`s to `select`. -`select` will process all of these operations in order. - -```julia -julia> ["x" => "a", "y" => "b"] -2-element Vector{Pair{String, String}}: - "x" => "a" - "y" => "b" - -julia> select(df, ["x" => "a", "y" => "b"]) -3×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 2 5 - 3 │ 3 6 -``` - -We can use broadcasting to simplify the syntax above. - -```julia -julia> ["x", "y"] .=> ["a", "b"] -2-element Vector{Pair{String, String}}: - "x" => "a" - "y" => "b" - -julia> select(df, ["x", "y"] .=> ["a", "b"]) -3×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 4 - 2 │ 2 5 - 3 │ 3 6 -``` - -Notice that `select` sees the same `Vector{Pair{String, String}}` operation -argument whether the individual pairs are written out explicitly or -constructed with broadcasting. -The broadcasting is applied before the call to `select`. - -```julia -julia> ["x" => "a", "y" => "b"] == (["x", "y"] .=> ["a", "b"]) -true -``` - -!!! Note - These operation pairs (or vector of pairs) can be given variable names. 
- This is uncommon in practice but could be helpful for intermediate - inspection and testing. - ```julia - df = DataFrame(x = 1:3, y = 4:6) # create data frame - operation = ["x", "y"] .=> ["a", "b"] # save operation to variable - typeof(operation) # check type of operation - first(operation) # check first pair in operation - last(operation) # check last pair in operation - select(df, operation) # manipulate `df` with `operation` - ``` - -In Julia, -a non-vector broadcasted with a vector will be repeated in each resultant pair element. - -```julia -julia> ["x", "y"] .=> :a # :a is repeated -2-element Vector{Pair{String, Symbol}}: - "x" => :a - "y" => :a - -julia> 1 .=> [:a, :b] # 1 is repeated -2-element Vector{Pair{Int64, Symbol}}: - 1 => :a - 1 => :b -``` - -We can use this fact to easily broadcast an `operation_function` to multiple columns. - -```julia -julia> f(x) = 2 * x -f (generic function with 1 method) - -julia> ["x", "y"] .=> f # f is repeated -2-element Vector{Pair{String, typeof(f)}}: - "x" => f - "y" => f - -julia> select(df, ["x", "y"] .=> f) # apply f with automatic column renaming -3×2 DataFrame - Row │ x_f y_f - │ Int64 Int64 -─────┼────────────── - 1 │ 2 8 - 2 │ 4 10 - 3 │ 6 12 - -julia> ["x", "y"] .=> f .=> ["a", "b"] # f is repeated -2-element Vector{Pair{String, Pair{typeof(f), String}}}: - "x" => (f => "a") - "y" => (f => "b") - -julia> select(df, ["x", "y"] .=> f .=> ["a", "b"]) # apply f with manual column renaming -3×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 2 8 - 2 │ 4 10 - 3 │ 6 12 -``` - -A renaming function can be applied to multiple columns in the same way. -It will also be repeated in each operation `Pair`. 
- -```julia -julia> newname(s::String) = s * "_new" -newname (generic function with 1 method) - -julia> ["x", "y"] .=> f .=> newname # both f and newname are repeated -2-element Vector{Pair{String, Pair{typeof(f), typeof(newname)}}}: - "x" => (f => newname) - "y" => (f => newname) - -julia> select(df, ["x", "y"] .=> f .=> newname) # apply f then rename column with newname -3×2 DataFrame - Row │ x_new y_new - │ Int64 Int64 -─────┼────────────── - 1 │ 2 8 - 2 │ 4 10 - 3 │ 6 12 -``` - -You can see from the type output above -that a three element pair does not actually exist. -A `Pair` (as the name implies) can only contain two elements. -Thus, `:x => :y => :z` becomes a nested `Pair`, -where `:x` is the first element and points to the `Pair` `:y => :z`, -which is the second element. - -```julia -julia> p = :x => :y => :z -:x => (:y => :z) - -julia> p[1] -:x - -julia> p[2] -:y => :z - -julia> p[2][1] -:y - -julia> p[2][2] -:z - -julia> p[3] # there is no index 3 for a pair -ERROR: BoundsError: attempt to access Pair{Symbol, Pair{Symbol, Symbol}} at index [3] -``` - -In the previous examples, the source columns have been individually selected. -When broadcasting multiple columns to the same function, -often similarities in the column names or position can be exploited to avoid -tedious selection. -Consider a data frame with temperature data at three different locations -taken over time. -```julia -julia> df = DataFrame(Time = 1:4, - Temperature1 = [20, 23, 25, 28], - Temperature2 = [33, 37, 41, 44], - Temperature3 = [15, 10, 4, 0]) -4×4 DataFrame - Row │ Time Temperature1 Temperature2 Temperature3 - │ Int64 Int64 Int64 Int64 -─────┼───────────────────────────────────────────────── - 1 │ 1 20 33 15 - 2 │ 2 23 37 10 - 3 │ 3 25 41 4 - 4 │ 4 28 44 0 -``` - -To convert all of the temperature data in one transformation, -we just need to define a conversion function and broadcast -it to all of the "Temperature" columns. 
- -```julia -julia> celsius_to_kelvin(x) = x + 273 -celsius_to_kelvin (generic function with 1 method) - -julia> transform( - df, - Cols(r"Temp") .=> ByRow(celsius_to_kelvin), - renamecols = false - ) -4×4 DataFrame - Row │ Time Temperature1 Temperature2 Temperature3 - │ Int64 Int64 Int64 Int64 -─────┼───────────────────────────────────────────────── - 1 │ 1 293 306 288 - 2 │ 2 296 310 283 - 3 │ 3 298 314 277 - 4 │ 4 301 317 273 -``` -Or, simultaneously changing the column names: - -```julia -julia> rename_function(s) = "Temperature $(last(s)) (K)" -rename_function (generic function with 1 method) - -julia> select( - df, - "Time", - Cols(r"Temp") .=> ByRow(celsius_to_kelvin) .=> rename_function - ) -4×4 DataFrame - Row │ Time Temperature 1 (K) Temperature 2 (K) Temperature 3 (K) - │ Int64 Int64 Int64 Int64 -─────┼──────────────────────────────────────────────────────────────── - 1 │ 1 293 306 288 - 2 │ 2 296 310 283 - 3 │ 3 298 314 277 - 4 │ 4 301 317 273 -``` - -!!! Note Notes - * `Not("Time")` or `2:4` would have been equally good choices for `source_column_selector` in the above operations. - * Don't forget `ByRow` if your function is to be applied to elements rather than entire column vectors. - Without `ByRow`, the manipulations above would have thrown - `ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)`. - * Regular expression (`r""`) and `:` `source_column_selectors` - must be wrapped in `Cols` to be properly broadcasted - because otherwise the broadcasting occurs before the expression is expanded into a vector of matches. - -You could also broadcast different columns to different functions -by supplying a vector of functions. 
- -```julia -julia> df = DataFrame(a=1:4, b=5:8) -4×2 DataFrame - Row │ a b - │ Int64 Int64 -─────┼────────────── - 1 │ 1 5 - 2 │ 2 6 - 3 │ 3 7 - 4 │ 4 8 - -julia> f1(x) = x .+ 1 -f1 (generic function with 1 method) - -julia> f2(x) = x ./ 10 -f2 (generic function with 1 method) - -julia> transform(df, [:a, :b] .=> [f1, f2]) -4×4 DataFrame - Row │ a b a_f1 b_f2 - │ Int64 Int64 Int64 Float64 -─────┼────────────────────────────── - 1 │ 1 5 2 0.5 - 2 │ 2 6 3 0.6 - 3 │ 3 7 4 0.7 - 4 │ 4 8 5 0.8 -``` - -However, this form is not much more convenient than supplying -multiple individual operations. - -```julia -julia> transform(df, [:a => f1, :b => f2]) # same manipulation as previous -4×4 DataFrame - Row │ a b a_f1 b_f2 - │ Int64 Int64 Int64 Float64 -─────┼────────────────────────────── - 1 │ 1 5 2 0.5 - 2 │ 2 6 3 0.6 - 3 │ 3 7 4 0.7 - 4 │ 4 8 5 0.8 -``` - -Perhaps more useful for broadcasting syntax -is to apply multiple functions to multiple columns -by changing the vector of functions to a 1-by-x matrix of functions. -(Recall that a list, a vector, or a matrix of operation pairs are all valid -for passing to the manipulation functions.) - -```julia -julia> [:a, :b] .=> [f1 f2] # No comma `,` between f1 and f2 -2×2 Matrix{Pair{Symbol}}: - :a=>f1 :a=>f2 - :b=>f1 :b=>f2 - -julia> transform(df, [:a, :b] .=> [f1 f2]) # No comma `,` between f1 and f2 -4×6 DataFrame - Row │ a b a_f1 b_f1 a_f2 b_f2 - │ Int64 Int64 Int64 Int64 Float64 Float64 -─────┼────────────────────────────────────────────── - 1 │ 1 5 2 6 0.1 0.5 - 2 │ 2 6 3 7 0.2 0.6 - 3 │ 3 7 4 8 0.3 0.7 - 4 │ 4 8 5 9 0.4 0.8 -``` - -In this way, every combination of selected columns and functions will be applied. - -Pair broadcasting is a simple but powerful tool -that can be used in any of the manipulation functions listed under -[Basic Usage of Manipulation Functions](@ref). -Experiment for yourself to discover other useful operations. 
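
The shapes produced by pair broadcasting can be checked in base Julia, without a data frame. The following is a minimal sketch under that assumption; the function names `double` and `halve` are hypothetical stand-ins for any `operation_function` and do not come from DataFrames.jl.

```julia
# Hypothetical operation functions, used only for illustration.
double(x) = 2 .* x
halve(x) = x ./ 2

# A vector of columns broadcast with a single function:
# the function is repeated, giving one pair per column.
ops_vector = [:a, :b] .=> double        # size (2,)

# A vector of columns broadcast with a vector of functions:
# columns and functions are matched element by element.
ops_zip = [:a, :b] .=> [double, halve]  # size (2,)

# A vector of columns broadcast with a 1-by-2 matrix of functions
# (no comma between the functions): every column is paired with every function.
ops_matrix = [:a, :b] .=> [double halve]  # size (2, 2)

# Each element is an ordinary `Pair` that could have been written by hand.
first_pair = ops_matrix[1, 1]           # same as `:a => double`
```

Inspecting these values at the REPL before passing them to a manipulation function can help confirm that the broadcast expanded into the operations you intended.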
- -## Additional Resources -More details and examples of operation pair syntax can be found in -[this blog post](https://bkamins.github.io/julialang/2020/12/24/minilanguage.html). -(The official wording describing the syntax has changed since the blog post was written, -but the examples are still illustrative. -The operation pair syntax is sometimes referred to as the DataFrames.jl mini-language -or Domain-Specific Language.) - -For additional practice, -an interactive tutorial is provided on a variety of introductory topics -by the DataFrames.jl package author -[here](https://github.com/bkamins/Julia-DataFrames-Tutorial). - - -For additional syntax niceties, -many users find the [Chain.jl](https://github.com/jkrumbiegel/Chain.jl) -and [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl) -packages useful -to help simplify manipulations that may be tedious with operation pairs alone. \ No newline at end of file From 043605a15647ba24b38afa161f76bb67d3fa5f45 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Fri, 13 Oct 2023 17:17:17 -0400 Subject: [PATCH 24/30] Add more comments --- docs/src/man/basics.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 55937b849..19d6e8492 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -3075,19 +3075,19 @@ The approach with dot syntax is very versatile since the data getting, mathematics, and data setting can be separate steps. 
```julia -julia> df.x +julia> df.x # dot syntax returns a vector 3-element Vector{Int64}: 1 2 3 -julia> v = df.x + df.y +julia> v = df.x + df.y # assign mathematical result to a vector `v` 3-element Vector{Int64}: 5 7 9 -julia> df.z = v +julia> df.z = v # place `v` into the data frame `df` with the column name `z` 3-element Vector{Int64}: 5 7 From 26b503e3b9baf42b06d78a8e6a56c8fb17150630 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Fri, 13 Oct 2023 17:22:59 -0400 Subject: [PATCH 25/30] Add link to @with macro --- docs/src/man/basics.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 19d6e8492..5c4df8bca 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -3133,6 +3133,9 @@ julia> df # see that the previous expression updated the data frame `df` One benefit of using manipulation functions is that the name of the data frame only needs to be written once. +(The `@with` macro from the +[DataFramesMeta](https://juliadata.github.io/DataFramesMeta.jl/stable/#@with) package +can somewhat relieve this issue in the other approaches.) 
Setup: From 679f65f4723c811bdb05942c4f1e26cd93f2d87f Mon Sep 17 00:00:00 2001 From: Nathan Boyer <65452054+nathanrboyer@users.noreply.github.com> Date: Sat, 14 Oct 2023 21:42:47 -0400 Subject: [PATCH 26/30] Delete redundant expression --- docs/src/man/basics.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 5c4df8bca..cac082b31 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -3041,12 +3041,6 @@ julia> transform!(df, [:x, :y] => (+) => :z) Dot Syntax: ```julia -julia> df.x # dot syntax returns a vector -3-element Vector{Int64}: - 1 - 2 - 3 - julia> df.z = df.x + df.y 3-element Vector{Int64}: 5 From 79a11711b713ae777638c25f562b26679e9ef8bd Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Mon, 16 Oct 2023 16:20:38 -0400 Subject: [PATCH 27/30] Clean up new section and delete with reference --- docs/src/man/basics.md | 135 +++++++++++++++++++++++++++++++++-------- 1 file changed, 110 insertions(+), 25 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index cac082b31..6f2427c56 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -3002,9 +3002,7 @@ to help simplify manipulations that may be tedious with operation pairs alone. After that deep dive into [Manipulation Functions](@ref), it is a good idea to review the alternative approaches covered in [Getting and Setting Data in a Data Frame](@ref). -Let us compare the two approaches with a few examples. - -### Convenience +Let us compare the approaches with a few examples. For simple operations, often getting/setting data with dot syntax @@ -3012,10 +3010,10 @@ is simpler than the equivalent data frame manipulation. Here we will add the two columns of our data frame together and place the result in a new third column. 
-Setup: +**Setup:** ```julia -julia> df = DataFrame(x = 1:3, y = 4:6) # define data frame +julia> df = DataFrame(x = 1:3, y = 4:6) # define a data frame 3×2 DataFrame Row │ x y │ Int64 Int64 @@ -3025,7 +3023,7 @@ julia> df = DataFrame(x = 1:3, y = 4:6) # define data frame 3 │ 3 6 ``` -Manipulation: +**Manipulation:** ```julia julia> transform!(df, [:x, :y] => (+) => :z) @@ -3038,7 +3036,7 @@ julia> transform!(df, [:x, :y] => (+) => :z) 3 │ 3 6 9 ``` -Dot Syntax: +**Dot Syntax:** ```julia julia> df.z = df.x + df.y @@ -3088,12 +3086,19 @@ julia> df.z = v # place `v` into the data frame `df` with the column name `z` 9 ``` -One downside to dot syntax is that the column name must be explicitly written in the code. -Indexing syntax can perform a similar operation with dynamic column names. -(Manipulation functions can also work with dynamic column names as will be shown in the next example.) +However, one way in which dot syntax is less versatile +is that the column name must be explicitly written in the code. +Indexing syntax is a good alternative in these cases +which is only slightly longer to write than dot syntax. +Both indexing syntax and manipulation functions can operate on dynamic column names +stored in variables. + +**Setup:** + +Imagine this setup data was read from a file and/or entered by a user at runtime. ```julia -julia> df = DataFrame("My First Column" => 1:3, "My Second Column" => 4:6) # define data frame +julia> df = DataFrame("My First Column" => 1:3, "My Second Column" => 4:6) # define a data frame 3×2 DataFrame Row │ My First Column My Second Column │ Int64 Int64 @@ -3103,12 +3108,18 @@ julia> df = DataFrame("My First Column" => 1:3, "My Second Column" => 4:6) # de 3 │ 3 6 julia> c1 = "My First Column"; c2 = "My Second Column"; c3 = "My Third Column"; # define column names +``` -# Imagine the above data was read from a file or entered by a user at runtime. 
+**Dot Syntax:** -julia> df.c1 # dot syntax expects an explicit column name and cannot be used +```julia +julia> df.c1 # dot syntax expects an explicit column name and cannot be used to access a column whose name is stored in a variable ERROR: ArgumentError: column name :c1 not found in the data frame +``` +**Indexing:** + +```julia julia> df[:, c3] = df[:, c1] + df[:, c2] # access columns with names stored in variables 3-element Vector{Int64}: 5 @@ -3125,19 +3136,30 @@ julia> df # see that the previous expression updated the data frame `df` 3 │ 3 6 9 ``` -One benefit of using manipulation functions is that -the name of the data frame only needs to be written once. -(The `@with` macro from the -[DataFramesMeta](https://juliadata.github.io/DataFramesMeta.jl/stable/#@with) package -can somewhat relieve this issue in the other approaches.) +**Manipulation:** -Setup: +```julia +julia> transform!(df, [c1, c2] => (+) => c3) # access columns with names stored in variables +3×3 DataFrame + Row │ My First Column My Second Column My Third Column + │ Int64 Int64 Int64 +─────┼──────────────────────────────────────────────────── + 1 │ 1 4 5 + 2 │ 2 5 7 + 3 │ 3 6 9 +``` + +Additionally, manipulation functions only require +the name of the data frame to be written once. +This can be helpful when dealing with long variable and column names.
+ +**Setup:** ```julia julia> my_very_long_data_frame_name = DataFrame( "My First Column" => 1:3, "My Second Column" => 4:6 - ) # define data frame + ) # define a data frame 3×2 DataFrame Row │ My First Column My Second Column │ Int64 Int64 @@ -3149,7 +3171,7 @@ julia> my_very_long_data_frame_name = DataFrame( julia> c1 = "My First Column"; c2 = "My Second Column"; c3 = "My Third Column"; # define column names ``` -Manipulation: +**Manipulation:** ```julia @@ -3163,7 +3185,7 @@ julia> transform!(my_very_long_data_frame_name, [c1, c2] => (+) => c3) 3 │ 3 6 9 ``` -Indexing: +**Indexing:** ```julia julia> my_very_long_data_frame_name[:, c3] = my_very_long_data_frame_name[:, c1] + my_very_long_data_frame_name[:, c2] @@ -3182,7 +3204,70 @@ julia> df # see that the previous expression updated the data frame `df` 3 │ 3 6 9 ``` -### Speed +Another benefit of manipulation functions and indexing over dot syntax is that +it is easier to operate on a subset of columns. + +**Setup:** + +```julia +julia> df = DataFrame(x = 1:3, y = 4:6, z = 7:9) # define data frame +3×3 DataFrame + Row │ x y z + │ Int64 Int64 Int64 +─────┼───────────────────── + 1 │ 1 4 7 + 2 │ 2 5 8 + 3 │ 3 6 9 +``` + +**Dot Syntax:** + +```julia +julia> df.Not(:x) # will not work; requires a literal column name +ERROR: ArgumentError: column name :Not not found in the data frame +``` + +**Indexing:** + +```julia +julia> df[:, :y_z_max] = maximum.(eachrow(df[:, Not(:x)])) # find maximum value across all rows except for column `x` +3-element Vector{Int64}: + 7 + 8 + 9 + +julia> df # see that the previous expression updated the data frame `df` +3×4 DataFrame + Row │ x y z y_z_max + │ Int64 Int64 Int64 Int64 +─────┼────────────────────────────── + 1 │ 1 4 7 7 + 2 │ 2 5 8 8 + 3 │ 3 6 9 9 +``` + +**Manipulation:** + +```julia +julia> transform!(df, Not(:x) => ByRow(max)) # find maximum value across all rows except for column `x` +3×4 DataFrame + Row │ x y z y_z_max + │ Int64 Int64 Int64 Int64 
+─────┼────────────────────────────── + 1 │ 1 4 7 7 + 2 │ 2 5 8 8 + 3 │ 3 6 9 9 +``` + +Moreover, indexing can operate on a subset of columns *and* rows. + +**Indexing:** + +```julia +julia> y_z_max_row3 = maximum(df[3, Not(:x)]) # find maximum value across row 3 except for column `x` +9 +``` -TODO: Compare speed, memory, and view options (@view, !, :, copycols=false). -(May need someone else to write this part unless I do more studying.) +Hopefully this small comparison has illustrated some of the benefits and drawbacks +of the various syntaxes available in DataFrames.jl. +The best syntax to use depends on the situation. From 82d935c39a15949c464aacdad98629df145ca697 Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Thu, 30 May 2024 09:50:11 -0400 Subject: [PATCH 28/30] Fix admonitions --- docs/src/man/basics.md | 132 +++++++++++++++++++++-------------------- 1 file changed, 69 insertions(+), 63 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index d51038c2d..5449fb0e9 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1752,11 +1752,12 @@ a `Not`, `Between`, `All`, or `Cols` expression, or a `:`. See the [Indexing](@ref) API for the full list of possible values with references. -!!! Note - The Julia parser sometimes prevents `:` from being used by itself. - If you get - `ERROR: syntax: whitespace not allowed after ":" used for quoting`, - try using `All()`, `Cols(:)`, or `(:)` instead to select all columns. +!!! note + + The Julia parser sometimes prevents `:` from being used by itself. + If you get + `ERROR: syntax: whitespace not allowed after ":" used for quoting`, + try using `All()`, `Cols(:)`, or `(:)` instead to select all columns. ```julia julia> df = DataFrame( @@ -1831,14 +1832,15 @@ julia> subset(df2, [:minor, :male]) 1 │ Jimmy true true ``` -!!! Note - Using `Symbol` in `source_column_selector` will perform slightly faster than using `String`. - However, `String` is convenient when column names contain spaces. 
+!!! note + + Using `Symbol` in `source_column_selector` will perform slightly faster than using `String`. + However, `String` is convenient when column names contain spaces. - All elements of `source_column_selector` must be the same type - (unless wrapped in `Cols`), - e.g. `subset(df2, [:minor, "male"])` will error - since `Symbol` and `String` are used simultaneously.) + All elements of `source_column_selector` must be the same type + (unless wrapped in `Cols`), + e.g. `subset(df2, [:minor, "male"])` will error + since `Symbol` and `String` are used simultaneously. #### `operation_function` Inside an `operation` pair, `operation_function` is a function @@ -1996,7 +1998,8 @@ julia> subset(df, :b => ByRow(<(5))) # shorter version of the previous 2 │ 3 4 ``` -!!! Note +!!! note + `operation_functions` within `subset` or `subset!` function calls must return a Boolean vector. `true` elements in the Boolean vector will determine @@ -2349,31 +2352,31 @@ julia> transform(df, :a => (x -> 10 .* x) => (s -> "new_" * s)) # with anonymous 4 │ 4 8 40 ``` -!!! Note - It is a good idea to wrap anonymous functions in parentheses - to avoid the `=>` operator accidently becoming part of the anonymous function. - The examples above do not work correctly without the parentheses! - ```julia - julia> transform(df, :a => x -> 10 .* x => add_prefix) # Not what we wanted! - 4×3 DataFrame - Row │ a b a_function - │ Int64 Int64 Pair… - ─────┼──────────────────────────────────────────── - 1 │ 1 5 [10, 20, 30, 40]=>add_prefix - 2 │ 2 6 [10, 20, 30, 40]=>add_prefix - 3 │ 3 7 [10, 20, 30, 40]=>add_prefix - 4 │ 4 8 [10, 20, 30, 40]=>add_prefix - - julia> transform(df, :a => x -> 10 .* x => s -> "new_" * s) # Not what we wanted! - 4×3 DataFrame - Row │ a b a_function - │ Int64 Int64 Pair… - ─────┼───────────────────────────────────── - 1 │ 1 5 [10, 20, 30, 40]=>#18 - 2 │ 2 6 [10, 20, 30, 40]=>#18 - 3 │ 3 7 [10, 20, 30, 40]=>#18 - 4 │ 4 8 [10, 20, 30, 40]=>#18 - ``` +!!! 
note + + It is a good idea to wrap anonymous functions in parentheses + to avoid the `=>` operator accidentally becoming part of the anonymous function. + The examples above do not work correctly without the parentheses! + ```julia + julia> transform(df, :a => x -> 10 .* x => add_prefix) # Not what we wanted! + 4×3 DataFrame + Row │ a b a_function + │ Int64 Int64 Pair… + ─────┼──────────────────────────────────────────── + 1 │ 1 5 [10, 20, 30, 40]=>add_prefix + 2 │ 2 6 [10, 20, 30, 40]=>add_prefix + 3 │ 3 7 [10, 20, 30, 40]=>add_prefix + 4 │ 4 8 [10, 20, 30, 40]=>add_prefix + julia> transform(df, :a => x -> 10 .* x => s -> "new_" * s) # Not what we wanted! + 4×3 DataFrame + Row │ a b a_function + │ Int64 Int64 Pair… + ─────┼───────────────────────────────────── + 1 │ 1 5 [10, 20, 30, 40]=>#18 + 2 │ 2 6 [10, 20, 30, 40]=>#18 + 3 │ 3 7 [10, 20, 30, 40]=>#18 + 4 │ 4 8 [10, 20, 30, 40]=>#18 + ``` A renaming function will not work in the `source_column_selector => new_column_names` operation form @@ -2481,11 +2484,12 @@ julia> transform(df, :data => AsTable) # keeps names from named tuples 2 │ (a = 3, b = 4) 3 4 ``` -!!! Note - To pack multiple columns into a single column of `NamedTuple`s - (reverse of the above operation) - apply the `identity` function `ByRow`, e.g. - `transform(df, AsTable([:a, :b]) => ByRow(identity) => :data)`. +!!! note + + To pack multiple columns into a single column of `NamedTuple`s + (reverse of the above operation) + apply the `identity` function with `ByRow`, e.g. + `transform(df, AsTable([:a, :b]) => ByRow(identity) => :data)`. Renaming functions also work for multi-column transformations, but they must operate on a vector of strings. @@ -2756,18 +2760,19 @@ julia> ["x" => "a", "y" => "b"] == (["x", "y"] .=> ["a", "b"]) true ``` -!!! Note - These operation pairs (or vector of pairs) can be given variable names. - This is uncommon in practice but could be helpful for intermediate - inspection and testing.
- ```julia - df = DataFrame(x = 1:3, y = 4:6) # create data frame - operation = ["x", "y"] .=> ["a", "b"] # save operation to variable - typeof(operation) # check type of operation - first(operation) # check first pair in operation - last(operation) # check last pair in operation - select(df, operation) # manipulate `df` with `operation` - ``` +!!! note + + These operation pairs (or vector of pairs) can be given variable names. + This is uncommon in practice but could be helpful for intermediate + inspection and testing. + ```julia + df = DataFrame(x = 1:3, y = 4:6) # create data frame + operation = ["x", "y"] .=> ["a", "b"] # save operation to variable + typeof(operation) # check type of operation + first(operation) # check first pair in operation + last(operation) # check last pair in operation + select(df, operation) # manipulate `df` with `operation` + ``` In Julia, a non-vector broadcasted with a vector will be repeated in each resultant pair element. @@ -2932,14 +2937,15 @@ julia> select( 4 │ 4 301 317 273 ``` -!!! Note Notes - * `Not("Time")` or `2:4` would have been equally good choices for `source_column_selector` in the above operations. - * Don't forget `ByRow` if your function is to be applied to elements rather than entire column vectors. - Without `ByRow`, the manipulations above would have thrown - `ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)`. - * Regular expression (`r""`) and `:` `source_column_selectors` - must be wrapped in `Cols` to be properly broadcasted - because otherwise the broadcasting occurs before the expression is expanded into a vector of matches. +!!! note "Notes" + + * `Not("Time")` or `2:4` would have been equally good choices for `source_column_selector` in the above operations. + * Don't forget `ByRow` if your function is to be applied to elements rather than entire column vectors. 
+ Without `ByRow`, the manipulations above would have thrown + `ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)`. + * Regular expression (`r""`) and `:` `source_column_selectors` + must be wrapped in `Cols` to be properly broadcasted + because otherwise the broadcasting occurs before the expression is expanded into a vector of matches. You could also broadcast different columns to different functions by supplying a vector of functions. From d9864bac42ffe6d50bce57271de3f507ce07306e Mon Sep 17 00:00:00 2001 From: nathanrboyer Date: Thu, 30 May 2024 12:03:34 -0400 Subject: [PATCH 29/30] Fix manipulation function reference --- docs/src/man/basics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 5449fb0e9..242bab3f5 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -3020,7 +3020,7 @@ In this way, every combination of selected columns and functions will be applied Pair broadcasting is a simple but powerful tool that can be used in any of the manipulation functions listed under -[Basic Usage of Manipulation Functions](@ref). +[Manipulation Functions](@ref). Experiment for yourself to discover other useful operations. 
### Additional Resources From 9d03a64aaea9dd33decff00452bdd5066ad515cb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Fri, 13 Dec 2024 12:31:31 +0100 Subject: [PATCH 30/30] Apply suggestions from code review --- docs/src/man/basics.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/src/man/basics.md b/docs/src/man/basics.md index 242bab3f5..03e5c5082 100644 --- a/docs/src/man/basics.md +++ b/docs/src/man/basics.md @@ -1650,7 +1650,7 @@ and automatically names the resulting column(s) : passes source column(s) as arguments to a function and names the resulting column(s) `new_column_names` - Examples: `:a => sum => :sum_of_a`, `[:a, :b] => + => :a_plus_b` + Examples: `:a => sum => :sum_of_a`, `[:a, :b] => (+) => :a_plus_b` *(Not available for `subset` or `subset!`)* @@ -1834,13 +1834,13 @@ julia> subset(df2, [:minor, :male]) !!! note - Using `Symbol` in `source_column_selector` will perform slightly faster than using `String`. - However, `String` is convenient when column names contain spaces. + Using `Symbol` in `source_column_selector` will perform slightly faster than using string. + However, a string is convenient when column names contain spaces. All elements of `source_column_selector` must be the same type (unless wrapped in `Cols`), e.g. `subset(df2, [:minor, "male"])` will error - since `Symbol` and `String` are used simultaneously. + since `Symbol` and string are used simultaneously. #### `operation_function` Inside an `operation` pair, `operation_function` is a function @@ -3095,7 +3095,7 @@ julia> df # see that the previous expression updated the data frame `df` 3 │ 3 6 9 ``` -Recall that the return type from a data frame manipulation function call is always a `DataFrame`. +Recall that the return type from a data frame manipulation function call is always a data frame. The return type of a data frame column accessed with dot syntax is a `Vector`. 
Thus the expression `df.x + df.y` gets the column data as vectors and returns the result of the vector addition.