diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index 29854cf58..341f6b0dc 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.5","generation_timestamp":"2024-09-07T11:00:40","documenter_version":"1.7.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.5","generation_timestamp":"2024-09-08T08:54:13","documenter_version":"1.7.0"}} \ No newline at end of file diff --git a/dev/assets/README/index.html b/dev/assets/README/index.html index 52c09ce2e..b240f3cc7 100644 --- a/dev/assets/README/index.html +++ b/dev/assets/README/index.html @@ -1,2 +1,2 @@ -Introduction · DataFrames.jl

Introduction

In this folder we store the following data sets:

  • german_credit.csv
  • iris.csv

German Credit data set

License:

https://opendatacommons.org/licenses/dbcl/1-0/

Source:

https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data) Professor Dr. Hans Hofmann Institut für Statistik und Ökonometrie Universität Hamburg FB Wirtschaftswissenschaften Von-Melle-Park 5 2000 Hamburg 13

The original data is from UCI; the file stored here is from Kaggle.

Iris data set

License:

https://creativecommons.org/publicdomain/zero/1.0/

Source:

https://archive.ics.uci.edu/ml/datasets/Iris Creator: R.A. Fisher

diff --git a/dev/index.html b/dev/index.html index 57deb7272..63d9e8eeb 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Introduction · DataFrames.jl

DataFrames.jl

Welcome to the DataFrames.jl documentation!

This resource aims to teach you everything you need to know to get up and running with tabular data manipulation using the DataFrames.jl package.

For more illustrations of DataFrames.jl usage, in particular in conjunction with other packages, you can check out the following resources (they are kept up to date with the released version of DataFrames.jl):

If you prefer to learn DataFrames.jl from a book you can consider reading:

What is DataFrames.jl?

DataFrames.jl provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R), making it a great general purpose data science tool.
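As a minimal sketch of that analogy, a data frame is built from named columns, much like a pandas DataFrame or an R data.frame (the column names and values below are made up for illustration):

```julia
using DataFrames

# Build a table from named columns (toy data, for illustration only):
df = DataFrame(name=["Alice", "Bob", "Carol"], age=[29, 34, 21])

nrow(df)  # 3 rows
ncol(df)  # 2 columns
```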

DataFrames.jl plays a central role in the Julia Data ecosystem, and has tight integrations with a range of different libraries. DataFrames.jl isn't the only tool for working with tabular data in Julia – as noted below, there are some other great libraries for certain use-cases – but it provides great data wrangling functionality through a familiar interface.

To understand the toolchain in more detail, have a look at the tutorials in this manual. New users can start with the First Steps with DataFrames.jl section.

You may find the DataFramesMeta.jl package or one of the other convenience packages discussed in the Data manipulation frameworks section of this manual helpful when writing more advanced data transformations, especially if you do not have significant programming experience. These packages provide convenience syntax similar to dplyr in R.

If you use metadata when working with DataFrames.jl you might find the TableMetadataTools.jl package useful. This package defines several convenience functions for performing typical metadata operations.

DataFrames.jl and the Julia Data Ecosystem

The Julia data ecosystem can be a difficult space for new users to navigate, in part because the Julia ecosystem tends to distribute functionality across different libraries more than some other languages. Because many people coming to DataFrames.jl are just starting to explore the Julia data ecosystem, below is a list of well-supported libraries that provide different data science tools, along with a few notes about what makes each library special, and how well integrated they are with DataFrames.jl.

  • Statistics
    • StatsKit.jl: A convenience meta-package which loads a set of essential packages for statistics, including those mentioned below in this section and DataFrames.jl itself.
    • Statistics: The Julia standard library comes with a wide range of statistics functionality, but to gain access to these functions you must call using Statistics.
    • LinearAlgebra: Like Statistics, many linear algebra features (factorizations, inversions, etc.) live in a library you have to load to use.
    • SparseArrays are also in the standard library but must be loaded to be used.
    • FreqTables.jl: Create frequency tables / cross-tabulations. Tightly integrated with DataFrames.jl.
    • HypothesisTests.jl: A range of hypothesis testing tools.
    • GLM.jl: Tools for estimating linear and generalized linear models. Tightly integrated with DataFrames.jl.
    • StatsModels.jl: For converting heterogeneous DataFrames into homogeneous matrices for use with linear algebra libraries or machine learning applications that don't directly support DataFrames. Will do things like convert categorical variables into indicators/one-hot encodings, create interaction terms, etc.
    • MultivariateStats.jl: Linear regression, ridge regression, PCA, and component analysis tools. Not well integrated with DataFrames.jl, but easily used in combination with StatsModels.jl.
  • Machine Learning
    • MLJ.jl: If you're more of an applied user, there is a single package that pulls from all these different libraries and provides a single, scikit-learn-inspired API: MLJ.jl. MLJ.jl provides a common interface for a wide range of machine learning algorithms.
    • ScikitLearn.jl: A Julia wrapper around the full Python scikit-learn machine learning library. Not well integrated with DataFrames.jl, but can be combined using StatsModels.jl.
    • AutoMLPipeline: A package that makes it trivial to create complex ML pipeline structures using simple expressions. It leverages the built-in macro programming features of Julia to symbolically process and manipulate pipeline expressions, making it easy to discover optimal structures for machine learning regression and classification.
    • Deep learning: Knet.jl and Flux.jl.
  • Plotting
    • Plots.jl: Powerful, modern plotting library with a syntax akin to that of matplotlib (in Python) or plot (in R). StatsPlots.jl provides Plots.jl with recipes for many standard statistical plots.
    • Gadfly.jl: High-level plotting library with a "grammar of graphics" syntax akin to that of ggplot (in R).
    • AlgebraOfGraphics.jl: A "grammar of graphics" library built upon Makie.jl.
    • VegaLite.jl: High-level plotting library that uses a different "grammar of graphics" syntax and has an emphasis on interactive graphics.
  • Data Wrangling:
    • Impute.jl: various methods for handling missing data in vectors, matrices and tables.
    • DataFramesMeta.jl: A range of convenience functions for DataFrames.jl that augment select and transform to provide a user experience similar to that provided by dplyr in R.
    • DataFrameMacros.jl: Provides macro versions of the common DataFrames.jl functions similar to DataFramesMeta.jl, with convenient syntax for the manipulation of multiple columns at once.
    • Query.jl: Query.jl provides a single framework for data wrangling that works with a range of libraries, including DataFrames.jl, other tabular data libraries (more on those below), and even non-tabular data. Provides many convenience functions analogous to those in dplyr in R or LINQ.
    • You can find more information on these packages in the Data manipulation frameworks section of this manual.
  • And More!
    • Graphs.jl: A pure-Julia, high performance network analysis library. Edgelists in DataFrames can be easily converted into graphs using the GraphDataFrameBridge.jl package.
  • IO:
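The standard-library loading pattern mentioned above for Statistics (and, analogously, LinearAlgebra and SparseArrays) can be sketched as follows, using a toy data frame:

```julia
using DataFrames
using Statistics  # stdlib: ships with Julia but must be loaded explicitly

df = DataFrame(x=[1.0, 2.0, 3.0, 4.0])

m = mean(df.x)  # a data frame column is an ordinary Julia vector
s = std(df.x)
```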

While not all of these libraries are tightly integrated with DataFrames.jl, DataFrames are essentially collections of aligned Julia vectors, so it is easy to (a) pull out a vector for use with a non-DataFrames-integrated library, or (b) convert your table into a homogeneously-typed matrix using the Matrix constructor or StatsModels.jl.
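Both escape hatches can be sketched in a few lines (toy data; the column names are illustrative):

```julia
using DataFrames

df = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0])

# (a) pull out a single column as a plain Julia vector:
v = df.a            # Vector{Float64}

# (b) convert the whole table to a homogeneously-typed matrix:
M = Matrix(df)      # 2×2 Matrix{Float64}
```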

Other Julia Tabular Libraries

DataFrames.jl is a great general purpose tool for data manipulation and wrangling, but it's not ideal for all applications. For users with more specialized needs, consider using:

  • TypedTables.jl: Type-stable heterogeneous tables. Useful for improved performance when the structure of your table is relatively stable and does not feature thousands of columns.
  • JuliaDB.jl: For users working with data that is too large to fit in memory, we suggest JuliaDB.jl, which offers better performance for large datasets, and can handle out-of-core data manipulations (Python users can think of JuliaDB.jl as the Julia version of dask).

Note that most tabular data libraries in the Julia ecosystem (including DataFrames.jl) support a common interface (defined in the Tables.jl package). As a result, some libraries are capable of working with a range of tabular data structures, making it easy to move between tabular libraries as your needs change. A user of Query.jl, for example, can use the same code to manipulate data in a DataFrame, a Table (defined by TypedTables.jl), or a JuliaDB table.
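As a small sketch of that shared interface, any Tables.jl table can be materialized as a DataFrame and vice versa (toy data):

```julia
using DataFrames
using Tables

df = DataFrame(a=[1, 2], b=["x", "y"])

nt = Tables.columntable(df)  # NamedTuple of column vectors (a Tables.jl table)
df2 = DataFrame(nt)          # any Tables.jl table can become a DataFrame again
```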

Questions?

If there is something you expect DataFrames to be capable of, but cannot figure out how to do, please reach out with questions in Domains/Data on Discourse. Additionally, you might want to listen to an introduction to DataFrames.jl on JuliaAcademy.

Please report bugs by opening an issue.

You can follow the source links throughout the documentation to jump right to the source files on GitHub to make pull requests for improving the documentation and function capabilities.

Please review DataFrames contributing guidelines before submitting your first PR!

Information on specific versions can be found on the Release page.

Package Manual

API

Only exported (i.e. available for use without DataFrames. qualifier after loading the DataFrames.jl package with using DataFrames) types and functions are considered a part of the public API of the DataFrames.jl package. In general all such objects are documented in this manual (in case some documentation is missing please kindly report an issue here).

Note

Breaking changes to public and documented API are avoided in DataFrames.jl where possible.

The following changes are not considered breaking:

  • specific floating point values computed by operations may change at any time; users should rely only on approximate accuracy;
  • in functions that use the default random number generator provided by Base Julia the specific random numbers computed may change across Julia versions;
  • if the changed functionality is classified as a bug;
  • if the changed behavior was not documented; two major cases are:
    1. in its implementation some function accepted a wider range of arguments than it was documented to handle - changes in handling of undocumented arguments are not considered as breaking;
    2. the type of the value returned by a function changes, but it still follows the contract specified in the documentation; for example if a function is documented to return a vector then changing its type from Vector to PooledVector is not considered as breaking;
  • error behavior: code that threw an exception can change exception type thrown or stop throwing an exception;
  • changes in display (how objects are printed);
  • changes to the state of global objects from Base Julia whose state normally is considered volatile (e.g. state of global random number generator).

All types and functions that are part of the public API are guaranteed to go through a deprecation period before a breaking change is made to them or they are removed.

The standard practice is that breaking changes are implemented when a major release of DataFrames.jl is made (e.g. functionalities deprecated in a 1.x release would be changed in the 2.0 release).

In rare cases a breaking change might be introduced in a minor release. In such a case the changed behavior still goes through one minor release during which it is deprecated. The situations where such a breaking change might be allowed are (even then, such breaking changes are avoided if possible):

  • the affected functionality was previously clearly identified in the documentation as being subject to changes (for example in DataFrames.jl 1.4 release propagation rules of :note-style metadata are documented as such);
  • the change is on the border of being classified as a bug (in rare cases even if a behavior of some function was documented its consequences for certain argument combinations could be decided to be unintended and not wanted);
  • the change is needed to adjust DataFrames.jl functionality to changes in Base Julia.

Please be warned that while Julia allows you to access internal functions or types of DataFrames.jl, these can change without warning between versions of DataFrames.jl. In particular, it is not safe to directly access fields of types that are part of the public API of the DataFrames.jl package using e.g. the getfield function. Whenever some operation on the fields of defined types is considered allowed, an appropriate exported function should be used instead.
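A hypothetical illustration of the point above: query a data frame through exported functions rather than its internal fields (toy data).

```julia
using DataFrames

df = DataFrame(x=[1, 2, 3])

# Supported: exported accessors, stable across releases
n = nrow(df)
cols = names(df)

# Unsupported: reaching into internals with getfield, e.g.
# getfield(df, some_internal_field_name), may break without
# warning between DataFrames.jl versions.
```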

Index

diff --git a/dev/lib/functions/index.html b/dev/lib/functions/index.html index 916d950d4..2aef46abc 100644 --- a/dev/lib/functions/index.html +++ b/dev/lib/functions/index.html @@ -22,7 +22,7 @@ 3 │ 1 b const 4 │ 2 b const 5 │ 1 c const - 6 │ 2 c const
source
Base.copyFunction
copy(df::DataFrame; copycols::Bool=true)

Copy data frame df. If copycols=true (the default), return a new DataFrame holding copies of column vectors in df. If copycols=false, return a new DataFrame sharing column vectors with df.

Metadata: this function preserves all table-level and column-level metadata.
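The copycols distinction can be checked directly with === on the underlying column vectors (toy data):

```julia
using DataFrames

df = DataFrame(x=[1, 2, 3])

deep = copy(df)                     # default copycols=true: columns are copied
shallow = copy(df, copycols=false)  # columns are shared with df

deep.x === df.x     # false: an independent vector
shallow.x === df.x  # true: the very same vector object
```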

source
copy(dfr::DataFrameRow)

Construct a NamedTuple with the same contents as the DataFrameRow. This method returns a NamedTuple so that the returned object is not affected by changes to the parent data frame of which dfr is a view.
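A short sketch of that independence (toy data):

```julia
using DataFrames

df = DataFrame(a=[1, 2], b=["x", "y"])
dfr = df[1, :]      # DataFrameRow: a view into df

nt = copy(dfr)      # NamedTuple snapshot of the row
df[1, :a] = 100     # mutate the parent data frame

dfr.a  # 100: the row view reflects the change
nt.a   # 1: the NamedTuple copy does not
```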

source
copy(key::GroupKey)

Construct a NamedTuple with the same contents as the GroupKey.

source
Base.similarFunction
similar(df::AbstractDataFrame, rows::Integer=nrow(df))

Create a new DataFrame with the same column names and column element types as df. An optional second argument can be provided to request a number of rows that is different than the number of rows present in df.

Metadata: this function preserves table-level and column-level :note-style metadata.
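For example (toy data; note that for most element types the new columns are left uninitialized):

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3], b=[1.0, 2.0, 3.0])

s = similar(df, 2)  # same column names and element types, but 2 rows

names(s)     # ["a", "b"]
eltype(s.a)  # Int (entries are uninitialized)
nrow(s)      # 2
```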

source

Summary information

DataAPI.describeFunction
describe(df::AbstractDataFrame; cols=:)
+   6 │     2  c     const
source
 describe(df::AbstractDataFrame, stats::Union{Symbol, Pair}...; cols=:)

Return descriptive statistics for a data frame as a new DataFrame where each row represents a variable and each column a summary statistic.

Arguments

  • df : the AbstractDataFrame
  • stats::Union{Symbol, Pair}... : the summary statistics to report. Arguments can be:
    • A symbol from the list :mean, :std, :min, :q25, :median, :q75, :max, :sum, :eltype, :nunique, :nuniqueall, :first, :last, :nnonmissing, and :nmissing. The default statistics used are :mean, :min, :median, :max, :nmissing, and :eltype.
    • :detailed as the only Symbol argument to return all statistics except :first, :last, :sum, :nuniqueall, and :nnonmissing.
    • :all as the only Symbol argument to return all statistics.
    • A function => name pair where name is a Symbol or string. This will create a column of summary statistics with the provided name.
  • cols : a keyword argument allowing to select only a subset or transformation of columns from df to describe. Can be any column selector or transformation accepted by select.

Details

For Real columns, compute the mean, standard deviation, minimum, first quantile, median, third quantile, and maximum. If a column does not derive from Real, describe will attempt to calculate all statistics, using nothing as a fall-back in the case of an error.

When stats contains :nunique, describe will report the number of unique values in a column. If a column's base type derives from Real, :nunique will return nothings. Use :nuniqueall to report the number of unique values in all columns.

Missing values are filtered in the calculation of all statistics, however the column :nmissing will report the number of missing values of that variable and :nnonmissing the number of non-missing values.

If custom functions are provided, they are called repeatedly with the vector corresponding to each column as the only argument. For columns allowing for missing values, the vector is wrapped in a call to skipmissing: custom functions must therefore support such objects (and not only vectors), and cannot access missing values.

Metadata: this function drops all metadata.

Examples

julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j');
 
 julia> describe(df)
@@ -57,7 +57,7 @@
  Row │ variable  min      sum
      │ Symbol    Float64  Float64
 ─────┼────────────────────────────
-   1 │ x             0.1      5.5
source
Base.isemptyFunction
isempty(df::AbstractDataFrame)

Return true if data frame df has zero rows, and false otherwise.

source
Base.lengthFunction
length(dfr::DataFrameRow)

Return the number of elements of dfr.

See also: size

Examples

julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :]
+   1 │ x             0.1      5.5
 DataFrameRow
  Row │ a      b
      │ Int64  Char
@@ -65,15 +65,15 @@
    1 │     1  a
 
 julia> length(dfr)
-2
source
DataAPI.ncolFunction
ncol(df::AbstractDataFrame)

Return the number of columns in an AbstractDataFrame df.

See also nrow, size.

Examples

julia> df = DataFrame(i=1:10, x=rand(10), y=rand(["a", "b", "c"], 10));
+2
 
 julia> ncol(df)
-3
source
Base.ndimsFunction
ndims(::AbstractDataFrame)
-ndims(::Type{<:AbstractDataFrame})

Return the number of dimensions of a data frame, which is always 2.

source
ndims(::DataFrameRow)
-ndims(::Type{<:DataFrameRow})

Return the number of dimensions of a data frame row, which is always 1.

source
DataAPI.nrowFunction
nrow(df::AbstractDataFrame)

Return the number of rows in an AbstractDataFrame df.

See also: ncol, size.

Examples

julia> df = DataFrame(i=1:10, x=rand(10), y=rand(["a", "b", "c"], 10));
 
 julia> nrow(df)
-10
source
DataAPI.rownumberFunction
rownumber(dfr::DataFrameRow)

Return the row number in the AbstractDataFrame that dfr was created from.

Note that this differs from the first element of the tuple returned by parentindices. The latter gives the row number in parent(dfr), the source DataFrame where the data that dfr gives access to is stored.
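As a minimal sketch, taking a row from a row-subsetting view makes the difference visible:

```julia
using DataFrames

df = DataFrame(x=1:3)
sdf = view(df, [3, 1], :)  # a SubDataFrame with reordered rows
dfr = sdf[1, :]

rownumber(dfr)         # 1: position in sdf, the frame the row came from
parentindices(dfr)[1]  # 3: position in parent(dfr), i.e. df
```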

Examples

julia> df = DataFrame(reshape(1:12, 3, 4), :auto)
 3×4 DataFrame
  Row │ x1     x2     x3     x4
      │ Int64  Int64  Int64  Int64
@@ -132,7 +132,7 @@
 ─────┼────────────────────────────
    1 │     1      4      7     10
    2 │     2      5      8     11
-   3 │     3      6      9     12
source
Base.showFunction
show([io::IO, ]df::AbstractDataFrame;
      allrows::Bool = !get(io, :limit, false),
      allcols::Bool = !get(io, :limit, false),
      allgroups::Bool = !get(io, :limit, false),
@@ -151,7 +151,7 @@
 ───────────────
      1  x
      2  y
-     3  z
source
show(io::IO, mime::MIME, df::AbstractDataFrame)

Render a data frame to an I/O stream in MIME type mime.

Arguments

  • io::IO: The I/O stream to which df will be printed.
  • mime::MIME: supported MIME types are: "text/plain", "text/html", "text/latex", "text/csv", "text/tab-separated-values" (the last two MIME types do not support showing #undef values)
  • df::AbstractDataFrame: The data frame to print.

Additionally selected MIME types support passing the following keyword arguments:

  • MIME type "text/plain" accepts all listed keyword arguments and their behavior is identical as for show(::IO, ::AbstractDataFrame)
  • MIME type "text/html" accepts the following keyword arguments:
    • eltypes::Bool = true: Whether to print the column types under column names.
    • summary::Bool = true: Whether to print a brief string summary of the data frame.
    • max_column_width::AbstractString = "": The maximum column width. It must be a string containing a valid CSS length. For example, passing "100px" will limit the width of all columns to 100 pixels. If empty, the columns will be rendered without limits.
    • kwargs...: Any keyword argument supported by the function pretty_table of PrettyTables.jl can be passed here to customize the output.

Examples

julia> show(stdout, MIME("text/latex"), DataFrame(A=1:3, B=["x", "y", "z"]))
 \begin{tabular}{r|cc}
 	& A & B\\
 	\hline
@@ -167,13 +167,13 @@
 "A","B"
 1,"x"
 2,"y"
-3,"z"
source
Base.sizeFunction
size(df::AbstractDataFrame[, dim])

Return a tuple containing the number of rows and columns of df. Optionally a dimension dim can be specified, where 1 corresponds to rows and 2 corresponds to columns.

See also: nrow, ncol

Examples

julia> df = DataFrame(a=1:3, b='a':'c');
 
 julia> size(df)
 (3, 2)
 
 julia> size(df, 1)
-3
source
size(dfr::DataFrameRow[, dim])

Return a 1-tuple containing the number of elements of dfr. If an optional dimension dim is specified, it must be 1, and the number of elements is returned directly as a number.

See also: length

Examples

julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :]
 DataFrameRow
  Row │ a      b
      │ Int64  Char
@@ -184,7 +184,7 @@
 (2,)
 
 julia> size(dfr, 1)
-2
source

Working with column names

Base.namesFunction
names(df::AbstractDataFrame, cols=:)
 names(df::DataFrameRow, cols=:)
 names(df::GroupedDataFrame, cols=:)
 names(df::DataFrameRows, cols=:)
@@ -229,7 +229,7 @@
 julia> names(df, any.(ismissing, eachcol(df))) # pick columns that contain missing values
 2-element Vector{String}:
  "x1"
- "x3"
source
Base.propertynamesFunction
propertynames(df::AbstractDataFrame)

Return a freshly allocated Vector{Symbol} of names of columns contained in df.

source
DataFrames.renameFunction
rename(df::AbstractDataFrame, vals::AbstractVector{Symbol};
        makeunique::Bool=false)
 rename(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
        makeunique::Bool=false)
@@ -291,7 +291,7 @@
      │ Int64  Int64  Int64
 ─────┼─────────────────────
    1 │     1      2      3
-
source
DataFrames.rename!Function
rename!(df::AbstractDataFrame, vals::AbstractVector{Symbol};
         makeunique::Bool=false)
 rename!(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
         makeunique::Bool=false)
@@ -342,7 +342,7 @@
      │ Int64  Int64  Int64
 ─────┼─────────────────────
    1 │     1      2      3
-
source

Mutating and transforming data frames and grouped data frames

Base.append!Function
append!(df::DataFrame, tables...; cols::Symbol=:setequal,
         promote::Bool=(cols in [:union, :subset]))

Add the rows of the tables passed as tables to the end of df. If a table (referred to as df2 below) is not an AbstractDataFrame then it is converted using DataFrame(table, copycols=false) before being appended.

The exact behavior of append! depends on the cols argument:

  • If cols == :setequal (this is the default) then df2 must contain exactly the same columns as df (but possibly in a different order).
  • If cols == :orderequal then df2 must contain the same columns in the same order (for AbstractDict this option requires that keys(row) matches propertynames(df) to allow for support of ordered dicts; however, if df2 is a Dict an error is thrown as it is an unordered collection).
  • If cols == :intersect then df2 may contain more columns than df, but all column names that are present in df must be present in df2 and only these are used.
  • If cols == :subset then append! behaves like for :intersect but if some column is missing in df2 then a missing value is pushed to df.
  • If cols == :union then append! adds columns missing in df that are present in df2, for columns present in df but missing in df2 a missing value is pushed.

If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.

The above rule has the following exceptions:

  • If df has no columns then copies of columns from df2 are added to it.
  • If df2 has no columns then calling append! leaves df unchanged.

Please note that append! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).

Metadata: table-level :note-style metadata and column-level :note-style metadata for columns present in df are preserved. If new columns are added their :note-style metadata is copied from the appended table. Other metadata is dropped.

See also: use push! to add individual rows to a data frame, prepend! to add a table at the beginning, and vcat to vertically concatenate data frames.
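For example (a minimal sketch), cols=:union adds the extra column and fills the gaps with missing:

```julia
using DataFrames

df = DataFrame(a=[1, 2])
append!(df, DataFrame(a=[3], b=["x"]); cols=:union)
# df now has 3 rows; column :b is [missing, missing, "x"]
# (promote defaults to true when cols is :union or :subset)
```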

Examples

julia> df1 = DataFrame(A=1:3, B=1:3)
 3×2 DataFrame
  Row │ A      B
@@ -385,7 +385,7 @@
    3 │       6.0        6  missing
    4 │       1.0  missing  missing
    5 │ missing    missing        1
-   6 │ missing    missing        2
source
DataFrames.combineFunction
combine(df::AbstractDataFrame, args...;
         renamecols::Bool=true, threads::Bool=true)
 combine(f::Callable, df::AbstractDataFrame;
         renamecols::Bool=true, threads::Bool=true)
@@ -625,7 +625,7 @@
    5 │     3      2      3      5
    6 │     3      2      7      9
    7 │     4      1      4      5
-   8 │     4      1      8      9
source
DataFrames.fillcombinationsFunction
fillcombinations(df::AbstractDataFrame, indexcols;
                      allowduplicates::Bool=false,
                      fill=missing)

Generate all combinations of levels of column(s) indexcols in data frame df. Levels and their order are determined by the levels function (i.e. unique values sorted lexicographically by default, or a custom set of levels for e.g. CategoricalArray columns), in addition to missing if present.

For combinations of indexcols not present in df these columns are filled with the fill value (missing by default).

If allowduplicates=false (the default) df may only contain unique combinations of indexcols values. If allowduplicates=true duplicated combinations are allowed.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(x=1:2, y='a':'b', z=["x", "y"])
 2×3 DataFrame
@@ -653,7 +653,7 @@
    1 │      1  a     x
    2 │      0  b     x
    3 │      0  a     y
-   4 │      2  b     y
source
DataFrames.flattenFunction
flatten(df::AbstractDataFrame, cols; scalar::Type=Union{})

When columns cols of data frame df have iterable elements that define length (for example a Vector of Vectors), return a DataFrame where each element of each col in cols is flattened, meaning the column corresponding to col becomes a longer vector where the original entries are concatenated. Elements of row i of df in columns other than cols will be repeated according to the length of df[i, col]. These lengths must therefore be the same for each col in cols, or else an error is raised. Note that these elements are not copied, and thus if they are mutable changing them in the returned DataFrame will affect df.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If scalar is passed then values that have this type in flattened columns are treated as scalars and broadcasted as many times as is needed to match lengths of values stored in other columns. If all values in a row are scalars, a single row is produced.
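A small sketch of the scalar keyword (values of the given type are broadcast instead of iterated):

```julia
using DataFrames

df = DataFrame(a=[1, 2], b=[[1, 2], 5], c=[["p", "q"], "r"])
flatten(df, [:b, :c]; scalar=Union{Int, String})
# row 1 expands to (1, 1, "p") and (1, 2, "q");
# in row 2 both flattened values are scalars, so a single row (2, 5, "r") is produced
```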

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df1 = DataFrame(a=[1, 2], b=[[1, 2], [3, 4]], c=[[5, 6], [7, 8]])
 2×3 DataFrame
  Row │ a      b       c
      │ Int64  Array…  Array…
@@ -730,7 +730,7 @@
    2 │     1        2        6
    3 │     2  missing  missing
    4 │     3  missing        7
-   5 │     3  missing        8
source
Base.hcatFunction
hcat(df::AbstractDataFrame...;
      makeunique::Bool=false, copycols::Bool=true)

Horizontally concatenate data frames.

If makeunique=false (the default) column names of passed objects must be unique. If makeunique=true then duplicate column names will be suffixed with _i (i starting at 1 for the first duplicate).

If copycols=true (the default) then the DataFrame returned by hcat will contain copied columns from the source data frames. If copycols=false then it will contain columns as they are stored in the source (without copying). This option should be used with caution as mutating either the columns in sources or in the returned DataFrame might lead to the corruption of the other object.

Metadata: hcat propagates table-level :note-style metadata for keys that are present in all passed data frames and have the same value; it propagates column-level :note-style metadata.

Example

julia> df1 = DataFrame(A=1:3, B=1:3)
 3×2 DataFrame
  Row │ A      B
@@ -764,7 +764,7 @@
 julia> df3 = hcat(df1, df2, makeunique=true, copycols=false);
 
 julia> df3.A === df1.A
-true
source
Base.insert!Function
insert!(df::DataFrame, index::Integer, row::Union{Tuple, AbstractArray};
         cols::Symbol=:setequal, promote::Bool=false)
 insert!(df::DataFrame, index::Integer, row::Union{DataFrameRow, NamedTuple,
                                                   AbstractDict, Tables.AbstractRow};
@@ -835,7 +835,7 @@
    5 │ b              2  missing
    6 │ c              3  missing
    7 │ a              1  missing
-   8 │ 1.0      missing        1.0
source
DataFrames.insertcolsFunction
insertcols(df::AbstractDataFrame[, col], (name=>val)::Pair...;
            after::Bool=false, makeunique::Bool=false, copycols::Bool=true)

Insert a column into a copy of df data frame using the insertcols! function and return the newly created data frame.

If col is omitted it is set to ncol(df)+1 (the column is inserted as the last column).

Arguments

  • df : the data frame to which we want to add columns
  • col : a position at which we want to insert a column, passed as an integer or a column name (a string or a Symbol); the column selected with col and columns following it are shifted to the right in df after the operation
  • name : the name of the new column
  • val : an AbstractVector giving the contents of the new column, or a value of any type other than AbstractArray, which will be repeated to fill a new vector; as a particular rule, values stored in a Ref or a 0-dimensional AbstractArray are unwrapped and treated in the same way
  • after : if true columns are inserted after col
  • makeunique : defines what to do if name already exists in df; if it is false an error will be thrown; if it is true a new unique name will be generated by adding a suffix
  • copycols : whether vectors passed as columns should be copied

If val is an AbstractRange then the result of collect(val) is inserted.
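A minimal sketch of these value rules (scalar repetition, Ref unwrapping, and range collection):

```julia
using DataFrames

df = DataFrame(a=1:3)

insertcols(df, :b => 0)            # scalar repeated to fill the column: [0, 0, 0]
insertcols(df, :c => Ref([1, 2]))  # Ref is unwrapped: each row holds [1, 2]
insertcols(df, :d => 4:6)          # range is collected into a Vector
```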

If df is a SubDataFrame then it must have been created with : as column selector (otherwise an error is thrown). In this case the copycols keyword argument is ignored (i.e. the added column is always copied) and the parent data frame's column is filled with missing in rows that are filtered out by df.

If df isa DataFrame that has no columns and only values other than AbstractVector are passed then it is used to create a one-element column. If df isa DataFrame that has no columns and at least one AbstractVector is passed then its length is used to determine the number of elements in all created columns. In all other cases the number of rows in all created columns must match nrow(df).

Metadata: this function preserves table-level and column-level :note-style metadata.

See also insertcols!.

Examples

julia> df = DataFrame(a=1:3)
 3×1 DataFrame
  Row │ a
@@ -870,7 +870,7 @@
 ─────┼──────────────
    1 │     1      7
    2 │     2      8
-   3 │     3      9
source
DataFrames.insertcols!Function
insertcols!(df::AbstractDataFrame[, col], (name=>val)::Pair...;
             after::Bool=false, makeunique::Bool=false, copycols::Bool=true)

Insert a column into a data frame in place. Return the updated data frame.

If col is omitted it is set to ncol(df)+1 (the column is inserted as the last column).

Arguments

  • df : the data frame to which we want to add columns
  • col : a position at which we want to insert a column, passed as an integer or a column name (a string or a Symbol); the column selected with col and columns following it are shifted to the right in df after the operation
  • name : the name of the new column
  • val : an AbstractVector giving the contents of the new column, or a value of any type other than AbstractArray, which will be repeated to fill a new vector; as a particular rule, values stored in a Ref or a 0-dimensional AbstractArray are unwrapped and treated in the same way
  • after : if true columns are inserted after col
  • makeunique : defines what to do if name already exists in df; if it is false an error will be thrown; if it is true a new unique name will be generated by adding a suffix
  • copycols : whether vectors passed as columns should be copied

If val is an AbstractRange then the result of collect(val) is inserted.

If df is a SubDataFrame then it must have been created with : as column selector (otherwise an error is thrown). In this case the copycols keyword argument is ignored (i.e. the added column is always copied) and the parent data frame's column is filled with missing in rows that are filtered out by df.

If df isa DataFrame that has no columns and only values other than AbstractVector are passed then it is used to create a one-element column. If df isa DataFrame that has no columns and at least one AbstractVector is passed then its length is used to determine the number of elements in all created columns. In all other cases the number of rows in all created columns must match nrow(df).

Metadata: this function preserves table-level and column-level :note-style metadata.

Metadata having other styles is dropped (from parent data frame when df is a SubDataFrame).

See also insertcols.

Examples

julia> df = DataFrame(a=1:3)
 3×1 DataFrame
  Row │ a
@@ -905,7 +905,7 @@
 ─────┼──────────────────────────────────
    1 │ a         7      2      3      1
    2 │ b         8      3      4      2
-   3 │ c         9      4      5      3
source
Base.invpermute!Function
invpermute!(df::AbstractDataFrame, p)

Like permute!, but the inverse of the given permutation is applied.

invpermute! will produce a correct result even if some columns of passed data frame or permutation p are identical (checked with ===). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then invpermute! result might be incorrect.

Metadata: this function preserves table-level and column-level :note-style metadata.

Metadata having other styles is dropped (from parent data frame when df is a SubDataFrame).
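As a minimal sketch, invpermute! with the same permutation undoes a prior permute!:

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)
p = [3, 1, 2]

permute!(df, p)     # rows now appear in order 3, 1, 2
invpermute!(df, p)  # applies the inverse of p, restoring the original order
df.a == [1, 2, 3]   # true
```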

Examples

julia> df = DataFrame(a=1:5, b=6:10, c=11:15)
 5×3 DataFrame
  Row │ a      b      c
      │ Int64  Int64  Int64
@@ -936,7 +936,7 @@
    2 │     2      7     12
    3 │     3      8     13
    4 │     4      9     14
-   5 │     5     10     15
source
DataFrames.mapcolsFunction
mapcols(f::Union{Function, Type}, df::AbstractDataFrame; cols=All())

Return a DataFrame where each column of df selected by cols (by default, all columns) is transformed using function f. Columns not selected by cols are copied.

f must return AbstractVector objects all with the same length or scalars (all values other than AbstractVector are considered to be a scalar).

The cols column selector can be any value accepted as column selector by the names function.
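For instance (a minimal sketch), a regular expression restricts which columns are transformed:

```julia
using DataFrames

df = DataFrame(x1=1:2, x2=3:4, y=5:6)

# only columns matching the regex are transformed; :y is copied unchanged
mapcols(v -> v .* 10, df; cols=r"^x")
```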

Note that mapcols guarantees not to reuse the columns from df in the returned DataFrame. If f returns its argument then it gets copied before being stored.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
 4×2 DataFrame
  Row │ x      y
      │ Int64  Int64
@@ -964,7 +964,7 @@
    1 │     1    121
    2 │     2    144
    3 │     3    169
-   4 │     4    196
source
DataFrames.mapcols!Function
mapcols!(f::Union{Function, Type}, df::DataFrame; cols=All())

Update a DataFrame in-place where each column of df selected by cols (by default, all columns) is transformed using function f. Columns not selected by cols are left unchanged.

f must return AbstractVector objects all with the same length or scalars (all values other than AbstractVector are considered to be a scalar).

Note that mapcols! reuses the columns from df if they are returned by f.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
 4×2 DataFrame
  Row │ x      y
      │ Int64  Int64
@@ -996,7 +996,7 @@
    1 │     2    121
    2 │     8    144
    3 │    18    169
-   4 │    32    196
source
Base.permute!Function
permute!(df::AbstractDataFrame, p)

Permute data frame df in-place, according to permutation p. Throws ArgumentError if p is not a permutation.

To return a new data frame instead of permuting df in-place, use df[p, :].

permute! will produce a correct result even if some columns of passed data frame or permutation p are identical (checked with ===). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then permute! result might be incorrect.

Metadata: this function preserves table-level and column-level :note-style metadata.

Metadata having other styles is dropped (from parent data frame when df is a SubDataFrame).

Examples

julia> df = DataFrame(a=1:5, b=6:10, c=11:15)
 5×3 DataFrame
  Row │ a      b      c
      │ Int64  Int64  Int64
@@ -1016,7 +1016,7 @@
    2 │     3      8     13
    3 │     1      6     11
    4 │     2      7     12
-   5 │     4      9     14
source
Base.prepend!Function
prepend!(df::DataFrame, tables...; cols::Symbol=:setequal,
          promote::Bool=(cols in [:union, :subset]))

Add the rows of the tables passed as tables to the beginning of df. If a table (referred to as df2 below) is not an AbstractDataFrame then it is converted using DataFrame(table, copycols=false) before being prepended.

The exact behavior of prepend! depends on the cols argument:

  • If cols == :setequal (this is the default) then df2 must contain exactly the same columns as df (but possibly in a different order).
  • If cols == :orderequal then df2 must contain the same columns in the same order (for AbstractDict this option requires that keys(row) matches propertynames(df) to allow for support of ordered dicts; however, if df2 is a Dict an error is thrown as it is an unordered collection).
  • If cols == :intersect then df2 may contain more columns than df, but all column names that are present in df must be present in df2 and only these are used.
  • If cols == :subset then prepend! behaves like for :intersect but if some column is missing in df2 then a missing value is pushed to df.
  • If cols == :union then prepend! adds columns missing in df that are present in df2, for columns present in df but missing in df2 a missing value is pushed.

If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.

The above rule has the following exceptions:

  • If df has no columns then copies of columns from df2 are added to it.
  • If df2 has no columns then calling prepend! leaves df unchanged.

Please note that prepend! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).

Metadata: table-level :note-style metadata and column-level :note-style metadata for columns present in df are preserved. If new columns are added their :note-style metadata is copied from the prepended table. Other metadata is dropped.

See also: use pushfirst! to add individual rows at the beginning of a data frame, append! to add a table at the end, and vcat to vertically concatenate data frames.

Examples

julia> df1 = DataFrame(A=1:3, B=1:3)
 3×2 DataFrame
  Row │ A      B
@@ -1059,7 +1059,7 @@
    3 │ missing    missing        2
    4 │       4.0        4  missing
    5 │       5.0        5  missing
-   6 │       6.0        6  missing
source
Base.push!Function
push!(df::DataFrame, row::Union{Tuple, AbstractArray}...;
       cols::Symbol=:setequal, promote::Bool=false)
 push!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict,
                                 Tables.AbstractRow}...;
@@ -1139,7 +1139,7 @@
 ─────┼──────────────
    1 │     1      2
    2 │     3      4
   3 │     5      6
source
Base.pushfirst!Function
pushfirst!(df::DataFrame, row::Union{Tuple, AbstractArray}...;
            cols::Symbol=:setequal, promote::Bool=false)
 pushfirst!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict,
                                      Tables.AbstractRow}...;
 ─────┼──────────────
    1 │     3      4
    2 │     5      6
   3 │     1      2
source
Base.reduceFunction
reduce(::typeof(vcat),
        dfs::Union{AbstractVector{<:AbstractDataFrame},
                   Tuple{AbstractDataFrame, Vararg{AbstractDataFrame}}};
        cols::Union{Symbol, AbstractVector{Symbol},
    6 │     6        6  missing       2
    7 │     7  missing        7       3
    8 │     8  missing        8       3
   9 │     9  missing        9       3
source
Base.repeatFunction
repeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)

Construct a data frame by repeating rows in df. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated.

Metadata: this function preserves table-level and column-level :note-style metadata.

Example

julia> df = DataFrame(a=1:2, b=3:4)
 2×2 DataFrame
  Row │ a      b
      │ Int64  Int64
    9 │     1      3
   10 │     1      3
   11 │     2      4
  12 │     2      4
source
repeat(df::AbstractDataFrame, count::Integer)

Construct a data frame by repeating each row in df the number of times specified by count.

Metadata: this function preserves table-level and column-level :note-style metadata.

Example

julia> df = DataFrame(a=1:2, b=3:4)
 2×2 DataFrame
  Row │ a      b
      │ Int64  Int64
    1 │     1      3
    2 │     2      4
    3 │     1      3
   4 │     2      4
source
DataFrames.repeat!Function
repeat!(df::DataFrame; inner::Integer=1, outer::Integer=1)

Update a data frame df in-place by repeating its rows. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated. Columns of df are freshly allocated.

Metadata: this function preserves table-level and column-level :note-style metadata.

Example

julia> df = DataFrame(a=1:2, b=3:4)
 2×2 DataFrame
  Row │ a      b
      │ Int64  Int64
    9 │     1      3
   10 │     1      3
   11 │     2      4
  12 │     2      4
source
repeat!(df::DataFrame, count::Integer)

Update a data frame df in-place by repeating its rows the number of times specified by count. Columns of df are freshly allocated.

Metadata: this function preserves table-level and column-level :note-style metadata.

Example

julia> df = DataFrame(a=1:2, b=3:4)
 2×2 DataFrame
  Row │ a      b
      │ Int64  Int64
    1 │     1      3
    2 │     2      4
    3 │     1      3
   4 │     2      4
source
Base.reverseFunction
reverse(df::AbstractDataFrame, start=1, stop=nrow(df))

Return a data frame containing the rows in df in reversed order. If start and stop are provided, only rows in the start:stop range are affected.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(a=1:5, b=6:10, c=11:15)
 5×3 DataFrame
  Row │ a      b      c
      │ Int64  Int64  Int64
    2 │     3      8     13
    3 │     2      7     12
    4 │     4      9     14
   5 │     5     10     15
source
Base.reverse!Function
reverse!(df::AbstractDataFrame, start=1, stop=nrow(df))

Mutate data frame in-place to reverse its row order. If start and stop are provided, only rows in the start:stop range are affected.

reverse! will produce a correct result even if some columns of passed data frame are identical (checked with ===). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then reverse! result might be incorrect.

Metadata: this function preserves table-level and column-level :note-style metadata.

Metadata having other styles is dropped (from parent data frame when df is a SubDataFrame).

Examples

julia> df = DataFrame(a=1:5, b=6:10, c=11:15)
 5×3 DataFrame
  Row │ a      b      c
      │ Int64  Int64  Int64
    2 │     3      8     13
    3 │     4      9     14
    4 │     2      7     12
   5 │     1      6     11
source
DataFrames.selectFunction
select(df::AbstractDataFrame, args...;
        copycols::Bool=true, renamecols::Bool=true, threads::Bool=true)
 select(args::Callable, df::DataFrame;
        renamecols::Bool=true, threads::Bool=true)
    5 │     2      3    0.375             2          2
    6 │     1      5    0.625             1          4
    7 │     1      5    0.625             1          5
   8 │     2      3    0.375             2          3
source
DataFrames.select!Function
select!(df::AbstractDataFrame, args...;
         renamecols::Bool=true, threads::Bool=true)
 select!(args::Base.Callable, df::DataFrame;
         renamecols::Bool=true, threads::Bool=true)
 select!(gd::GroupedDataFrame, args...; ungroup::Bool=true,
         renamecols::Bool=true, threads::Bool=true)
 select!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true,
        renamecols::Bool=true, threads::Bool=true)

Mutate df or gd in place to retain only columns or transformations specified by args... and return it. The result is guaranteed to have the same number of rows as df or parent of gd, except when no columns are selected (in which case the result has zero rows).

If a SubDataFrame or GroupedDataFrame{SubDataFrame} is passed, the parent data frame is updated using columns generated by args..., following the same rules as indexing:

  • for existing columns filtered-out rows are filled with values present in the old columns
  • for new columns (which is only allowed if SubDataFrame was created with : as column selector) filtered-out rows are filled with missing
  • dropped columns (which are only allowed if SubDataFrame was created with : as column selector) are removed
  • if SubDataFrame was not created with : as column selector then select! is only allowed if the transformations keep exactly the same sequence of column names as is in the passed df

If a GroupedDataFrame is passed then it is updated to reflect the new rows of its updated parent. If there are independent GroupedDataFrame objects constructed using the same parent data frame they might get corrupt.

Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.

All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.

In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.

Operations can then be applied on each group using one of the following functions:

  • combine: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups in GroupedDataFrame; it is typically used to compute summary statistics by group; for GroupedDataFrame if grouping columns are kept they are put as first columns in the result;
  • select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns; select! is an in-place version of select; for GroupedDataFrame if grouping columns are kept they are put as first columns in the result;
  • transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns; transform! is an in-place version of transform; existing columns in the source data frame are put as first columns in the result;

As a special case, if a GroupedDataFrame that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.
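The zero-groups rule can be verified directly (a sketch):

```julia
using DataFrames

df = DataFrame(g=Int[], x=Float64[])
gd = groupby(df, :g)   # a GroupedDataFrame with zero groups

# sum is called once with a 0-row column, only to determine the output
# schema; the result itself has zero rows.
out = combine(gd, :x => sum => :s)
```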

All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:

  1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All, Cols, :, Between, Not and regular expressions)
  2. a cols => function pair indicating that function should be called with positional arguments holding columns cols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed that function returns a single value or a vector; the generated name is created by concatenating source column name and function name by default (see examples below).
  3. a cols => function => target_cols form additionally explicitly specifying the target column or columns, which must be a single name (as a Symbol or a string), a vector of names or AsTable. Additionally it can be a Function which takes a string or a vector of strings as an argument containing names of columns selected by cols, and returns the target columns names (all accepted types except AsTable are allowed).
  4. a col => target_cols pair, which renames the column col to target_cols, which must be single name (as a Symbol or a string), a vector of names or AsTable.
  5. column-independent operations function => target_cols or just function for specific functions where the input columns are omitted; without target_cols the new column has the same name as function, otherwise it must be single name (as a Symbol or a string). Supported functions are:
    • nrow to efficiently compute the number of rows in each group.
    • proprow to efficiently compute the proportion of rows in each group.
    • eachindex to return a vector holding the number of each row within each group.
    • groupindices to return the group number.
  6. vectors or matrices containing transformations specified by the Pair syntax described in points 2 to 5
  7. a function which will be called with a SubDataFrame corresponding to each group if a GroupedDataFrame is processed, or with the data frame itself if an AbstractDataFrame is processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which case SubDataFrame avoids excessive compilation)

Note! If an expression of the form x => y is passed then, except for the special convenience form nrow => target_cols, it is always interpreted as cols => function. In particular, function => target_cols is not a valid transformation specification.

Note! If cols or target_cols are one of All, Cols, Between, or Not, broadcasting using .=> is supported and is equivalent to broadcasting the result of names(df, cols) or names(df, target_cols). This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.

All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).

As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function. The documentation of DataFrames.table_transformation provides more information about this functionality, in particular covering performance considerations.

What is allowed for function to return is determined by the target_cols value:

  1. If both cols and target_cols are omitted (so only a function is passed), then returning a data frame, a matrix, a NamedTuple, a Tables.AbstractRow or a DataFrameRow will produce multiple columns in the result. Returning any other value produces a single column.
  2. If target_cols is a Symbol or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, a NamedTuple, a Tables.AbstractRow, or a DataFrameRow raises an error.
  3. If target_cols is a vector of Symbols or strings or AsTable it is assumed that function returns multiple columns. If function returns one of AbstractDataFrame, NamedTuple, DataFrameRow, Tables.AbstractRow, AbstractMatrix then rules described in point 1 above apply. If function returns an AbstractVector then each element of this vector must support the keys function, which must return a collection of Symbols, strings or integers; the return value of keys must be identical for all elements. Then as many columns are created as there are elements in the return value of the keys function. If target_cols is AsTable then their names are set to be equal to the key names except if keys returns integers, in which case they are prefixed by x (so the column names are e.g. x1, x2, ...). If target_cols is a vector of Symbols or strings then column names produced using the rules above are ignored and replaced by target_cols (the number of columns must be the same as the length of target_cols in this case). If fun returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and the Tables.columntable function is called on it to get the resulting columns and their names. The names are retained when target_cols is AsTable and are replaced if target_cols is a vector of Symbols or strings.
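Point 3 in action: a per-row NamedTuple expanded into columns, first under AsTable (the keys become column names) and then with explicit target names (the keys are ignored). The names :lo and :hi are illustrative:

```julia
using DataFrames

df = DataFrame(x=1:3)

# AsTable target: the NamedTuple keys :lo and :hi become column names.
r1 = transform(df, :x => ByRow(x -> (lo=x - 1, hi=x + 1)) => AsTable)

# Vector-of-names target: keys are ignored and replaced by :a and :b.
r2 = transform(df, :x => ByRow(x -> (lo=x - 1, hi=x + 1)) => [:a, :b])
```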

In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.

select/select! and transform/transform! always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).

For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.

It is allowed to mix single values and vectors if multiple transformations are requested. In this case the single value is repeated to match the length of the columns specified by the returned vectors.

To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols or a NamedTuple of them if specified columns are wrapped in AsTable. If ByRow is used it is allowed for cols to select an empty set of columns, in which case function is called for each row without any arguments and an empty NamedTuple is passed if empty set of columns is wrapped in AsTable.
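A sketch of the two calling conventions described above, positional arguments versus an AsTable-wrapped source:

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

# One positional argument per selected column, applied row by row.
s = transform(df, [:a, :b] => ByRow(+) => :s)

# AsTable source: the function receives a NamedTuple of the columns.
p = transform(df, AsTable([:a, :b]) => ByRow(nt -> nt.a * nt.b) => :p)
```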

If a collection of column names is passed then duplicate column names in the target data frame are accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular, a syntax to move column :col to the first position in the data frame is select!(df, :col, :). On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.

In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1, [:x1, :x2], :x1 => :x2, :x1 => identity => :x2, or :x1 => (x -> @view x[inds]) (note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols keyword argument:

  • if copycols=true then results of such transformations always perform a copy of the source column or its view;
  • if copycols=false then copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using ===, which excludes views of source columns) performs a copy;

Note that performing transform! or select! assumes that copycols=false.
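The copying rules can be observed with === (a sketch):

```julia
using DataFrames

df = DataFrame(x=[1, 2, 3])

out = select(df, :x)                     # copycols=true (the default)
out_nc = select(df, :x; copycols=false)  # first use: the column is reused

out.x === df.x     # false: a fresh copy was made
out_nc.x === df.x  # true: the very same vector, no copy
```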

If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns and in this case transforming or renaming columns is not allowed.

If a GroupedDataFrame is passed and threads=true (the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure), use locks to control parallel accesses, or threads=false must be passed to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame inputs.

To improve the performance of the operations, some transformations invoke an optimized implementation; see DataFrames.table_transformation for details.

Keyword arguments

  • renamecols::Bool=true : whether in the cols => function form automatically generated column names should include the name of transformation functions or not.
  • ungroup::Bool=true : whether the return value of the operation on gd should be a data frame or a GroupedDataFrame.
  • threads::Bool=true : whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set to false if some transformations require serial execution or are not thread-safe.

Metadata: this function propagates table-level :note-style metadata. Column-level :note-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity or copy to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame if the output has the same name as a grouping column and keepkeys=true, metadata is taken from original grouping column.

See select for examples.

source
Random.shuffleFunction
shuffle([rng=GLOBAL_RNG,] df::AbstractDataFrame)

Return a copy of df with randomly permuted rows. The optional rng argument specifies a random number generator.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> using Random
+        renamecols::Bool=true, threads::Bool=true)

Mutate df or gd in place to retain only columns or transformations specified by args... and return it. The result is guaranteed to have the same number of rows as df or parent of gd, except when no columns are selected (in which case the result has zero rows).

If a SubDataFrame or GroupedDataFrame{SubDataFrame} is passed, the parent data frame is updated using columns generated by args..., following the same rules as indexing:

  • for existing columns filtered-out rows are filled with values present in the old columns
  • for new columns (which is only allowed if SubDataFrame was created with : as column selector) filtered-out rows are filled with missing
  • dropped columns (which are only allowed if SubDataFrame was created with : as column selector) are removed
  • if SubDataFrame was not created with : as column selector then select! is only allowed if the transformations keep exactly the same sequence of column names as is in the passed df

If a GroupedDataFrame is passed then it is updated to reflect the new rows of its updated parent. If there are independent GroupedDataFrame objects constructed using the same parent data frame they might get corrupt.

Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.

All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.

In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.

Operations can then be applied on each group using one of the following functions:

  • combine: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups in GroupedDataFrame; it is typically used to compute summary statistics by group; for GroupedDataFrame if grouping columns are kept they are put as first columns in the result;
  • select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns; select! is an in-place version of select; for GroupedDataFrame if grouping columns are kept they are put as first columns in the result;
  • transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns; transform! is an in-place version of transform; existing columns in the source data frame are put as first columns in the result;

As a special case, if a GroupedDataFrame that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.

All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:

  1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All, Cols, :, Between, Not and regular expressions)
  2. a cols => function pair indicating that function should be called with positional arguments holding columns cols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed that function returns a single value or a vector; the generated name is created by concatenating source column name and function name by default (see examples below).
  3. a cols => function => target_cols form additionally explicitly specifying the target column or columns, which must be a single name (as a Symbol or a string), a vector of names or AsTable. Additionally it can be a Function which takes a string or a vector of strings as an argument containing names of columns selected by cols, and returns the target columns names (all accepted types except AsTable are allowed).
  4. a col => target_cols pair, which renames the column col to target_cols, which must be single name (as a Symbol or a string), a vector of names or AsTable.
  5. column-independent operations function => target_cols or just function for specific functions where the input columns are omitted; without target_cols the new column has the same name as function, otherwise it must be single name (as a Symbol or a string). Supported functions are:
    • nrow to efficiently compute the number of rows in each group.
    • proprow to efficiently compute the proportion of rows in each group.
    • eachindex to return a vector holding the number of each row within each group.
    • groupindices to return the group number.
  6. vectors or matrices containing transformations specified by the Pair syntax described in points 2 to 5
  7. a function which will be called with a SubDataFrame corresponding to each group if a GroupedDataFrame is processed, or with the data frame itself if an AbstractDataFrame is processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which case SubDataFrame avoids excessive compilation)

Note! If the expression of the form x => y is passed then except for the special convenience form nrow => target_cols it is always interpreted as cols => function. In particular the following expression function => target_cols is not a valid transformation specification.

Note! If cols or target_cols are one of All, Cols, Between, or Not, broadcasting using .=> is supported and is equivalent to broadcasting the result of names(df, cols) or names(df, target_cols). This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.

All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).

As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function. The documentation of DataFrames.table_transformation provides more information about this functionality, in particular covering performance considerations.

What is allowed for function to return is determined by the target_cols value:

  1. If both cols and target_cols are omitted (so only a function is passed), then returning a data frame, a matrix, a NamedTuple, a Tables.AbstractRow or a DataFrameRow will produce multiple columns in the result. Returning any other value produces a single column.
  2. If target_cols is a Symbol or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, a NamedTuple, a Tables.AbstractRow, or a DataFrameRow raises an error.
  3. If target_cols is a vector of Symbols or strings or AsTable it is assumed that function returns multiple columns. If function returns one of AbstractDataFrame, NamedTuple, DataFrameRow, Tables.AbstractRow, AbstractMatrix then rules described in point 1 above apply. If function returns an AbstractVector then each element of this vector must support the keys function, which must return a collection of Symbols, strings or integers; the return value of keys must be identical for all elements. Then as many columns are created as there are elements in the return value of the keys function. If target_cols is AsTable then their names are set to be equal to the key names except if keys returns integers, in which case they are prefixed by x (so the column names are e.g. x1, x2, ...). If target_cols is a vector of Symbols or strings then column names produced using the rules above are ignored and replaced by target_cols (the number of columns must be the same as the length of target_cols in this case). If fun returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and the Tables.columntable function is called on it to get the resulting columns and their names. The names are retained when target_cols is AsTable and are replaced if target_cols is a vector of Symbols or strings.

In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.
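The single-row versus multiple-rows distinction can be sketched as follows (illustrative column names):

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])

# A returned vector is spread over three rows...
spread = combine(df, :a => identity => :a2)

# ...but wrapped in Ref it is unwrapped and treated as one row
# whose single value is the whole vector.
packed = combine(df, :a => Ref => :a_vec)
```

`spread` has three rows, while `packed` has a single row whose `a_vec` cell holds the vector `[1, 2, 3]`.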

select/select! and transform/transform! always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).

For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.

It is allowed to mix single values and vectors if multiple transformations are requested. In this case a single value is repeated (recycled) to match the length of the columns specified by returned vectors.
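For example, mixing a whole-column selection with a scalar-returning transformation recycles the scalar:

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])

# :a yields a 3-element vector; :a => sum yields the scalar 6,
# which is repeated to match the vector's length.
res = select(df, :a, :a => sum => :total)
```

The `total` column is `[6, 6, 6]`.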

To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols or a NamedTuple of them if specified columns are wrapped in AsTable. If ByRow is used it is allowed for cols to select an empty set of columns, in which case function is called for each row without any arguments and an empty NamedTuple is passed if empty set of columns is wrapped in AsTable.
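A short sketch of both calling conventions for ByRow (column names chosen for illustration):

```julia
using DataFrames

df = DataFrame(x=[1, 2], y=[10, 20])

# Positional arguments, one per selected column:
s = select(df, [:x, :y] => ByRow(+) => :sum)

# One NamedTuple per row when the selection is wrapped in AsTable:
p = select(df, AsTable([:x, :y]) => ByRow(r -> r.x * r.y) => :prod)
```

`s.sum` is `[11, 22]` and `p.prod` is `[10, 40]`.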

If a collection of column names is passed then duplicate column names in the target data frame are accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular, the syntax to move column :col to the first position in the data frame is select!(df, :col, :). On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
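The column-reordering idiom in action:

```julia
using DataFrames

df = DataFrame(a=1, b=2, c=3)

# `:` selects :c again, but duplicates after the first occurrence
# are dropped, so the net effect is moving :c to the front.
select!(df, :c, :)
```

After the call the column order is `c`, `a`, `b`.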

In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1, [:x1, :x2], :x1 => :x2, :x1 => identity => :x2, or :x1 => (x -> @view x[inds]) (note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols keyword argument:

  • if copycols=true then results of such transformations always perform a copy of the source column or its view;
  • if copycols=false then copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using ===, which excludes views of source columns) performs a copy;

Note that performing transform! or select! assumes that copycols=false.
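A minimal sketch of the copying rules for a reused source column:

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])

copied  = select(df, :a)                 # copycols=true is the default
aliased = select(df, :a; copycols=false) # first use: column reused as-is
```

`copied.a` is equal to but distinct from `df.a`, while `aliased.a` is the very same vector.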

If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns and in this case transforming or renaming columns is not allowed.

If a GroupedDataFrame is passed and threads=true (the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure) or must use locks to control parallel accesses; alternatively, pass threads=false to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame inputs.

In order to improve performance, some transformations invoke an optimized implementation; see DataFrames.table_transformation for details.

Keyword arguments

  • renamecols::Bool=true : whether automatically generated column names in the cols => function form should include the name of the transformation function.
  • ungroup::Bool=true : whether the return value of the operation on gd should be a data frame or a GroupedDataFrame.
  • threads::Bool=true : whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set to false if some transformations require serial execution or are not thread-safe.

Metadata: this function propagates table-level :note-style metadata. Column-level :note-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity or copy to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame if the output has the same name as a grouping column and keepkeys=true, metadata is taken from original grouping column.

See select for examples.

source
Random.shuffleFunction
shuffle([rng=GLOBAL_RNG,] df::AbstractDataFrame)

Return a copy of df with randomly permuted rows. The optional rng argument specifies a random number generator.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> using Random
 
 julia> rng = MersenneTwister(1234);
 
julia> shuffle(rng, DataFrame(a=1:5, b=1:5))
5×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     1      1
   3 │     4      4
   4 │     3      3
   5 │     5      5
source
Random.shuffle!Function
shuffle!([rng=GLOBAL_RNG,] df::AbstractDataFrame)

Randomly permute rows of df in-place. The optional rng argument specifies a random number generator.

shuffle! will produce a correct result even if some columns of the passed data frame are identical (checked with ===). Otherwise, if two columns share some part of memory but are not identical (e.g. are different views of the same parent vector) then the shuffle! result might be incorrect.

Metadata: this function preserves table-level and column-level :note-style metadata.

Metadata having other styles is dropped (from parent data frame when df is a SubDataFrame).

Examples

julia> using Random
 
julia> rng = MersenneTwister(1234);

julia> df = DataFrame(a=1:5, b=1:5);

julia> shuffle!(rng, df)
5×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     1      1
   3 │     4      4
   4 │     3      3
   5 │     5      5
source
DataFrames.table_transformationFunction
table_transformation(df_sel::AbstractDataFrame, fun)

This is the function called when AsTable(...) => fun is requested. The df_sel argument is a data frame storing columns selected by the AsTable(...) selector.

By default it calls default_table_transformation. However, it is allowed to add special methods for specific types of fun, as long as the result matches what would be produced by default_table_transformation, except that it is allowed to perform eltype conversion of the resulting vectors or value type promotions that are consistent with promote_type.

It is guaranteed that df_sel has at least one column.

The main use of special table_transformation methods is to provide implementations of the requested fun transformation that are more efficient than the default one.

This function might become part of the public API of DataFrames.jl in the future; currently it should be considered experimental.

Fast paths are implemented within DataFrames.jl for the following functions fun:

  • sum, ByRow(sum), ByRow(sum∘skipmissing)
  • length, ByRow(length), ByRow(length∘skipmissing)
  • mean, ByRow(mean), ByRow(mean∘skipmissing)
  • ByRow(var), ByRow(var∘skipmissing)
  • ByRow(std), ByRow(std∘skipmissing)
  • ByRow(median), ByRow(median∘skipmissing)
  • minimum, ByRow(minimum), ByRow(minimum∘skipmissing)
  • maximum, ByRow(maximum), ByRow(maximum∘skipmissing)
  • fun∘collect and ByRow(fun∘collect) where fun is any function
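For instance, a row-wise sum over an AsTable selection is one of the cases covered by these fast paths (the column names here are illustrative; the optimization is internal and does not change the result):

```julia
using DataFrames

df = DataFrame(a=[1, 2], b=[3, 4])

# ByRow(sum) over AsTable(...) uses the optimized row-aggregation path
res = select(df, AsTable([:a, :b]) => ByRow(sum) => :rowsum)
```

`res.rowsum` is `[4, 6]`.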

Note that, in order to improve performance, ByRow(sum), ByRow(sum∘skipmissing), ByRow(mean), and ByRow(mean∘skipmissing) perform all operations in the target element type. In some very rare cases (like mixing very large Int64 values and Float64 values) this can lead to a result different from the one that would be obtained by calling the function outside of DataFrames.jl. The way to avoid this precision loss is to use an anonymous function, e.g. instead of ByRow(sum) use ByRow(x -> sum(x)). However, in general for such scenarios even standard aggregation functions should not be considered to provide reliable output, and users are recommended to switch to higher-precision calculations. An example of a case when standard sum is affected by the situation discussed is:

julia> sum(Any[typemax(Int), typemax(Int), 1.0])
 -1.0
 
 julia> sum(Any[1.0, typemax(Int), typemax(Int)])
1.8446744073709552e19
source
DataFrames.transformFunction
transform(df::AbstractDataFrame, args...;
           copycols::Bool=true, renamecols::Bool=true, threads::Bool=true)
 transform(f::Callable, df::DataFrame;
           renamecols::Bool=true, threads::Bool=true)
    2 │    10
 
 julia> transform(gdf, x -> (x=10,), keepkeys=true)
ERROR: ArgumentError: column :x in returned data frame is not equal to grouping key :x

See select for more examples.

source
DataFrames.transform!Function
transform!(df::AbstractDataFrame, args...;
            renamecols::Bool=true, threads::Bool=true)
 transform!(args::Callable, df::AbstractDataFrame;
            renamecols::Bool=true, threads::Bool=true)
 transform!(gd::GroupedDataFrame, args...;
            ungroup::Bool=true, renamecols::Bool=true, threads::Bool=true)
 transform!(f::Base.Callable, gd::GroupedDataFrame;
           ungroup::Bool=true, renamecols::Bool=true, threads::Bool=true)

Mutate df or gd in place to add columns specified by args... and return it. The result is guaranteed to have the same number of rows as df. Equivalent to select!(df, :, args...) or select!(gd, :, args...), except that column renaming performs a copy.
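A minimal illustration of the in-place behavior (column names chosen arbitrarily):

```julia
using DataFrames

df = DataFrame(a=[1, 2])

# Keeps the existing :a column and appends the computed :a2 column
transform!(df, :a => ByRow(abs2) => :a2)
```

After the call `df` has columns `a` and `a2` with `a2 == [1, 4]`.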

Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.

All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.

In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.

Operations can then be applied on each group using one of the following functions:

  • combine: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups in GroupedDataFrame; it is typically used to compute summary statistics by group; for GroupedDataFrame if grouping columns are kept they are put as first columns in the result;
  • select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns; select! is an in-place version of select; for GroupedDataFrame if grouping columns are kept they are put as first columns in the result;
  • transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns; transform! is an in-place version of transform; existing columns in the source data frame are put as first columns in the result;
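The three functions above can be contrasted on a small grouped data frame:

```julia
using DataFrames

df = DataFrame(g=[1, 1, 2], x=[1, 2, 10])
gd = groupby(df, :g)

c = combine(gd, :x => sum => :s)     # one row per group
s = select(gd, :x => sum => :s)      # same rows as df, grouping keys first
t = transform(gd, :x => sum => :s)   # all source columns plus :s
```

`c` has two rows with `s == [3, 10]`; `s` and `t` have three rows with the group sum broadcast within each group.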

As a special case, if a GroupedDataFrame that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.
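For example, with zero groups the transformation is still called once on a 0-row argument, only to determine the output schema:

```julia
using DataFrames

df = DataFrame(g=Int[], x=Int[])
gd = groupby(df, :g)   # zero groups

# sum is called on an empty column just to infer the :s column
res = combine(gd, :x => sum => :s)
```

`res` has columns `g` and `s` but zero rows.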

All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:

  1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All, Cols, :, Between, Not and regular expressions)
  2. a cols => function pair indicating that function should be called with positional arguments holding columns cols, which can be any valid column selector; in this case the target column name is automatically generated and it is assumed that function returns a single value or a vector; by default the generated name is created by concatenating the source column name and the function name (see examples below).
  3. a cols => function => target_cols form additionally explicitly specifying the target column or columns, which must be a single name (as a Symbol or a string), a vector of names or AsTable. Additionally it can be a Function which takes a string or a vector of strings as an argument containing names of columns selected by cols, and returns the target columns names (all accepted types except AsTable are allowed).
  4. a col => target_cols pair, which renames the column col to target_cols, which must be single name (as a Symbol or a string), a vector of names or AsTable.
  5. column-independent operations function => target_cols or just function for specific functions where the input columns are omitted; without target_cols the new column has the same name as function, otherwise it must be single name (as a Symbol or a string). Supported functions are:
    • nrow to efficiently compute the number of rows in each group.
    • proprow to efficiently compute the proportion of rows in each group.
    • eachindex to return a vector holding the number of each row within each group.
    • groupindices to return the group number.
  6. vectors or matrices containing transformations specified by the Pair syntax described in points 2 to 5
  7. a function which will be called with a SubDataFrame corresponding to each group if a GroupedDataFrame is processed, or with the data frame itself if an AbstractDataFrame is processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which case SubDataFrame avoids excessive compilation)
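The column-independent operations from form 5 can be sketched as follows:

```julia
using DataFrames

df = DataFrame(g=[1, 1, 2])
gd = groupby(df, :g)

# nrow, proprow and groupindices take no input columns;
# groupindices is given an explicit target name here
res = combine(gd, nrow, proprow, groupindices => :gid)
```

`res` has one row per group, with row counts, row proportions, and group numbers.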

Note! If an expression of the form x => y is passed then, except for the special convenience form nrow => target_cols, it is always interpreted as cols => function. In particular, the expression function => target_cols is not a valid transformation specification.

Note! If cols or target_cols are one of All, Cols, Between, or Not, broadcasting using .=> is supported and is equivalent to broadcasting the result of names(df, cols) or names(df, target_cols). This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
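A short sketch of this broadcasting behavior:

```julia
using DataFrames

df = DataFrame(a=[1, 2], b=[3, 4])

# All() .=> sum expands to ["a", "b"] .=> sum,
# i.e. the pairs :a => sum and :b => sum
res = select(df, All() .=> sum)
```

The result has columns `a_sum` and `b_sum` with the (recycled) column sums.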

All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).

As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function. The documentation of DataFrames.table_transformation provides more information about this functionality, in particular covering performance considerations.

What is allowed for function to return is determined by the target_cols value:

  1. If both cols and target_cols are omitted (so only a function is passed), then returning a data frame, a matrix, a NamedTuple, a Tables.AbstractRow or a DataFrameRow will produce multiple columns in the result. Returning any other value produces a single column.
  2. If target_cols is a Symbol or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, a NamedTuple, a Tables.AbstractRow, or a DataFrameRow raises an error.
  3. If target_cols is a vector of Symbols or strings or AsTable it is assumed that function returns multiple columns. If function returns one of AbstractDataFrame, NamedTuple, DataFrameRow, Tables.AbstractRow, AbstractMatrix then rules described in point 1 above apply. If function returns an AbstractVector then each element of this vector must support the keys function, which must return a collection of Symbols, strings or integers; the return value of keys must be identical for all elements. Then as many columns are created as there are elements in the return value of the keys function. If target_cols is AsTable then their names are set to be equal to the key names except if keys returns integers, in which case they are prefixed by x (so the column names are e.g. x1, x2, ...). If target_cols is a vector of Symbols or strings then column names produced using the rules above are ignored and replaced by target_cols (the number of columns must be the same as the length of target_cols in this case). If fun returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and the Tables.columntable function is called on it to get the resulting columns and their names. The names are retained when target_cols is AsTable and are replaced if target_cols is a vector of Symbols or strings.

In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.

select/select! and transform/transform! always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).

For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.

It is allowed to mix single values and vectors if multiple transformations are requested. In this case a single value is repeated (recycled) to match the length of the columns specified by returned vectors.

To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols or a NamedTuple of them if specified columns are wrapped in AsTable. If ByRow is used it is allowed for cols to select an empty set of columns, in which case function is called for each row without any arguments and an empty NamedTuple is passed if empty set of columns is wrapped in AsTable.

If a collection of column names is passed then duplicate column names in the target data frame are accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular, the syntax to move column :col to the first position in the data frame is select!(df, :col, :). On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.

In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1, [:x1, :x2], :x1 => :x2, :x1 => identity => :x2, or :x1 => (x -> @view x[inds]) (note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols keyword argument:

  • if copycols=true then results of such transformations always perform a copy of the source column or its view;
  • if copycols=false then copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using ===, which excludes views of source columns) performs a copy;

Note that performing transform! or select! assumes that copycols=false.

If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns and in this case transforming or renaming columns is not allowed.

If a GroupedDataFrame is passed and threads=true (the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore not modify global variables (i.e. they must be pure) or must use locks to control parallel accesses; alternatively, pass threads=false to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame inputs.

In order to improve performance, some transformations invoke an optimized implementation; see DataFrames.table_transformation for details.

Keyword arguments

  • renamecols::Bool=true : whether automatically generated column names in the cols => function form should include the name of the transformation function.
  • ungroup::Bool=true : whether the return value of the operation on gd should be a data frame or a GroupedDataFrame.
  • threads::Bool=true : whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set to false if some transformations require serial execution or are not thread-safe.

Metadata: this function propagates table-level :note-style metadata. Column-level :note-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity or copy to a single column even if column name is changed (this includes column renaming). As a special case for GroupedDataFrame if the output has the same name as a grouping column and keepkeys=true, metadata is taken from original grouping column.

See select for examples.

source
Base.vcatFunction
vcat(dfs::AbstractDataFrame...;
+           ungroup::Bool=true, renamecols::Bool=true, threads::Bool=true)

Mutate df or gd in place to add columns specified by args... and return it. The result is guaranteed to have the same number of rows as df. Equivalent to select!(df, :, args...) or select!(gd, :, args...), except that column renaming performs a copy.

Below detailed common rules for all transformation functions supported by DataFrames.jl are explained and compared.

All these operations are supported both for AbstractDataFrame (when split and combine steps are skipped) and GroupedDataFrame. Technically, AbstractDataFrame is just considered as being grouped on no columns (meaning it has a single group, or zero groups if it is empty). The only difference is that in this case the keepkeys and ungroup keyword arguments (described below) are not supported and a data frame is always returned, as there are no split and combine steps in this case.

In order to perform operations by groups you first need to create a GroupedDataFrame object from your data frame using the groupby function that takes two arguments: (1) a data frame to be grouped, and (2) a set of columns to group by.

Operations can then be applied on each group using one of the following functions:

  • combine: does not put restrictions on number of rows returned per group; the returned values are vertically concatenated following order of groups in GroupedDataFrame; it is typically used to compute summary statistics by group; for GroupedDataFrame if grouping columns are kept they are put as first columns in the result;
  • select: return a data frame with the number and order of rows exactly the same as the source data frame, including only new calculated columns; select! is an in-place version of select; for GroupedDataFrame if grouping columns are kept they are put as first columns in the result;
  • transform: return a data frame with the number and order of rows exactly the same as the source data frame, including all columns from the source and new calculated columns; transform! is an in-place version of transform; existing columns in the source data frame are put as first columns in the result;

As a special case, if a GroupedDataFrame that has zero groups is passed then the result of the operation is determined by performing a single call to the transformation function with a 0-row argument passed to it. The output of this operation is only used to identify the number and type of produced columns, but the result has zero rows.

All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be of the following forms:

  1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All, Cols, :, Between, Not and regular expressions)
  2. a cols => function pair indicating that function should be called with positional arguments holding columns cols, which can be any valid column selector; in this case target column name is automatically generated and it is assumed that function returns a single value or a vector; the generated name is created by concatenating source column name and function name by default (see examples below).
  3. a cols => function => target_cols form additionally explicitly specifying the target column or columns, which must be a single name (as a Symbol or a string), a vector of names or AsTable. Additionally it can be a Function which takes a string or a vector of strings as an argument containing names of columns selected by cols, and returns the target columns names (all accepted types except AsTable are allowed).
  4. a col => target_cols pair, which renames the column col to target_cols, which must be single name (as a Symbol or a string), a vector of names or AsTable.
  5. column-independent operations function => target_cols or just function for specific functions where the input columns are omitted; without target_cols the new column has the same name as function, otherwise it must be single name (as a Symbol or a string). Supported functions are:
    • nrow to efficiently compute the number of rows in each group.
    • proprow to efficiently compute the proportion of rows in each group.
    • eachindex to return a vector holding the number of each row within each group.
    • groupindices to return the group number.
  6. vectors or matrices containing transformations specified by the Pair syntax described in points 2 to 5
  7. a function which will be called with a SubDataFrame corresponding to each group if a GroupedDataFrame is processed, or with the data frame itself if an AbstractDataFrame is processed; this form should be avoided due to its poor performance unless the number of groups is small or a very large number of columns are processed (in which case SubDataFrame avoids excessive compilation)

Note! If an expression of the form x => y is passed, then, except for the special convenience form nrow => target_cols, it is always interpreted as cols => function. In particular, function => target_cols is not a valid transformation specification.

Note! If cols or target_cols are one of All, Cols, Between, or Not, broadcasting using .=> is supported and is equivalent to broadcasting the result of names(df, cols) or names(df, target_cols). This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
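For example, broadcasting with .=> expands a selector into one transformation per selected column (a sketch):

```julia
using DataFrames

df = DataFrame(x1=1:2, x2=3:4, y=5:6)

# Not(:y) is first replaced by names(df, Not(:y)) == ["x1", "x2"],
# so this is equivalent to select(df, "x1" => sum, "x2" => sum)
select(df, Not(:y) .=> sum)    # columns x1_sum and x2_sum
```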

All functions have two types of signatures. One of them takes a GroupedDataFrame as the first argument and an arbitrary number of transformations described above as following arguments. The second type of signature is when a Function or a Type is passed as the first argument and a GroupedDataFrame as the second argument (similar to map).

As a special rule, with the cols => function and cols => function => target_cols syntaxes, if cols is wrapped in an AsTable object then a NamedTuple containing columns selected by cols is passed to function. The documentation of DataFrames.table_transformation provides more information about this functionality, in particular covering performance considerations.
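A short sketch of the AsTable wrapping rule:

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

# with ByRow the function receives a NamedTuple of scalars for each row
select(df, AsTable([:a, :b]) => ByRow(nt -> nt.a + nt.b) => :total)

# without ByRow it receives a single NamedTuple of whole columns
combine(df, AsTable([:a, :b]) => (nt -> sum(nt.a) + sum(nt.b)) => :grand_total)
```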

What is allowed for function to return is determined by the target_cols value:

  1. If both cols and target_cols are omitted (so only a function is passed), then returning a data frame, a matrix, a NamedTuple, a Tables.AbstractRow or a DataFrameRow will produce multiple columns in the result. Returning any other value produces a single column.
  2. If target_cols is a Symbol or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, a NamedTuple, a Tables.AbstractRow, or a DataFrameRow raises an error.
  3. If target_cols is a vector of Symbols or strings or AsTable it is assumed that function returns multiple columns. If function returns one of AbstractDataFrame, NamedTuple, DataFrameRow, Tables.AbstractRow, AbstractMatrix then rules described in point 1 above apply. If function returns an AbstractVector then each element of this vector must support the keys function, which must return a collection of Symbols, strings or integers; the return value of keys must be identical for all elements. Then as many columns are created as there are elements in the return value of the keys function. If target_cols is AsTable then their names are set to be equal to the key names except if keys returns integers, in which case they are prefixed by x (so the column names are e.g. x1, x2, ...). If target_cols is a vector of Symbols or strings then column names produced using the rules above are ignored and replaced by target_cols (the number of columns must be the same as the length of target_cols in this case). If function returns a value of any other type then it is assumed that it is a table conforming to the Tables.jl API and the Tables.columntable function is called on it to get the resulting columns and their names. The names are retained when target_cols is AsTable and are replaced if target_cols is a vector of Symbols or strings.

In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then treated as a single row.
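The Ref unwrapping rule can be sketched as follows:

```julia
using DataFrames

df = DataFrame(a=[1, 5, 3])

# extrema returns a tuple; Ref marks it as a single row, so its two
# elements fill the two requested target columns
combine(df, :a => (x -> Ref(extrema(x))) => [:lo, :hi])
```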

select/select! and transform/transform! always return a data frame with the same number and order of rows as the source (even if GroupedDataFrame had its groups reordered), except when selection results in zero columns in the resulting data frame (in which case the result has zero rows).

For combine, rows in the returned object appear in the order of groups in the GroupedDataFrame. The functions can return an arbitrary number of rows for each group, but the kind of returned object and the number and names of columns must be the same for all groups, except when a DataFrame() or NamedTuple() is returned, in which case a given group is skipped.
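For instance, with combine each group may contribute a different number of rows (a sketch):

```julia
using DataFrames

df = DataFrame(g=[1, 1, 2], x=[1, 2, 3])
gd = groupby(df, :g)

# group 1 returns two rows, group 2 returns one row;
# rows appear in the order of groups in gd
combine(gd, :x => (v -> v .+ 10) => :shifted)
```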

It is allowed to mix single values and vectors if multiple transformations are requested. In this case a single value will be repeated to match the length of columns specified by returned vectors.

To apply function to each row instead of whole columns, it can be wrapped in a ByRow struct. cols can be any column indexing syntax, in which case function will be passed one argument for each of the columns specified by cols, or a NamedTuple of them if the specified columns are wrapped in AsTable. If ByRow is used it is allowed for cols to select an empty set of columns, in which case function is called for each row without any arguments and an empty NamedTuple is passed if the empty set of columns is wrapped in AsTable.

If a collection of column names is passed then duplicate column names in the target data frame are accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular, a syntax to move column :col to the first position in the data frame is select!(df, :col, :). By contrast, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.

In general columns returned by transformations are stored in the target data frame without copying. An exception to this rule is when columns from the source data frame are reused in the target data frame. This can happen via expressions like: :x1, [:x1, :x2], :x1 => :x2, :x1 => identity => :x2, or :x1 => (x -> @view x[inds]) (note that in the last case the source column is reused indirectly via a view). In such cases the behavior depends on the value of the copycols keyword argument:

  • if copycols=true then results of such transformations always perform a copy of the source column or its view;
  • if copycols=false then copies are only performed to avoid storing the same column several times in the target data frame; more precisely, no copy is made the first time a column is used, but each subsequent reuse of a source column (when compared using ===, which excludes views of source columns) performs a copy;

Note that performing transform! or select! assumes that copycols=false.
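A sketch of the copycols rule for reused source columns (assuming the aliasing behavior described above):

```julia
using DataFrames

df = DataFrame(a=1:3)

# with copycols=false the first use of :a stores the source vector itself;
# each subsequent reuse is copied to avoid storing the same vector twice
res = select(df, :a => :a1, :a => :a2, copycols=false)
res.a1 === df.a    # true: first reuse is not copied
res.a2 === df.a    # false: subsequent reuse is copied
```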

If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns and in this case transforming or renaming columns is not allowed.

If a GroupedDataFrame is passed and threads=true (the default), a separate task is spawned for each specified transformation; each transformation then spawns as many tasks as Julia threads, and splits processing of groups across them (however, currently transformations with optimized implementations like sum and transformations that return multiple rows use a single task for all groups). This allows for parallel operation when Julia was started with more than one thread. Passed transformation functions must therefore either not modify global variables (i.e. be pure) or use locks to control parallel accesses; alternatively, threads=false must be passed to disable multithreading. In the future, parallelism may be extended to other cases, so this requirement also holds for DataFrame inputs.

In order to improve the performance of the operations some transformations invoke an optimized implementation; see DataFrames.table_transformation for details.

Keyword arguments

  • renamecols::Bool=true : whether, in the cols => function form, automatically generated column names should include the name of the transformation function or not.
  • ungroup::Bool=true : whether the return value of the operation on gd should be a data frame or a GroupedDataFrame.
  • threads::Bool=true : whether transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set to false if some transformations require serial execution or are not thread-safe.

Metadata: this function propagates table-level :note-style metadata. Column-level :note-style metadata is propagated if: a) a single column is transformed to a single column and the name of the column does not change (this includes all column selection operations), or b) a single column is transformed with identity or copy to a single column even if the column name is changed (this includes column renaming). As a special case for GroupedDataFrame, if the output has the same name as a grouping column and keepkeys=true, metadata is taken from the original grouping column.
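A sketch of the column-level metadata rule, using the metadata API from DataAPI.jl:

```julia
using DataFrames

df = DataFrame(a=1:2)
colmetadata!(df, :a, "label", "important column", style=:note)

# case (b): an identity transformation preserves metadata even under a new name
df2 = select(df, :a => identity => :b)
colmetadata(df2, :b, "label")        # "important column"

# a genuine transformation under a new name drops it
df3 = select(df, :a => ByRow(abs) => :b)
isempty(colmetadatakeys(df3, :b))    # true
```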

See select for examples.

source
Base.vcatFunction
vcat(dfs::AbstractDataFrame...;
      cols::Union{Symbol, AbstractVector{Symbol},
                  AbstractVector{<:AbstractString}}=:setequal,
      source::Union{Nothing, Symbol, AbstractString,
    6 │     6        6  missing  b
    7 │     7  missing        7  d
    8 │     8  missing        8  d
    9 │     9  missing        9  d
source

Reshaping data frames between tall and wide formats

Base.stackFunction
stack(df::AbstractDataFrame[, measure_vars[, id_vars] ];
       variable_name=:variable, value_name=:value,
       view::Bool=false, variable_eltype::Type=String)

Stack a data frame df, i.e. convert it from wide to long format.

Return the long-format DataFrame with: columns for each of the id_vars, a column value_name (:value by default) holding the values of the stacked columns (measure_vars), and a column variable_name (:variable by default) holding the names of the corresponding measure_vars variables.

If view=true then return a stacked view of a data frame (long format). The result is a view because the columns are special AbstractVectors that return views into the original data frame.

Arguments

  • df : the AbstractDataFrame to be stacked
  • measure_vars : the columns to be stacked (the measurement variables), as a column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). If neither measure_vars nor id_vars are given, measure_vars defaults to all floating point columns.
  • id_vars : the identifier columns that are repeated during stacking, as a column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). Defaults to all variables that are not measure_vars.
  • variable_name : the name (Symbol or string) of the new stacked column that shall hold the names of each of measure_vars
  • value_name : the name (Symbol or string) of the new stacked column containing the values from each of measure_vars
  • view : whether the stacked data frame should be a view rather than contain freshly allocated vectors.
  • variable_eltype : determines the element type of column variable_name. By default a PooledArray{String} is created. If variable_eltype=Symbol a PooledVector{Symbol} is created, and if variable_eltype=CategoricalValue{String} a CategoricalArray{String} is produced (call using CategoricalArrays first if needed). Passing any other type T will produce a PooledVector{T} column as long as it supports conversion from String. When view=true, a RepeatedVector{T} is produced.

Metadata: table-level :note-style metadata and column-level :note-style metadata for identifier columns are preserved.

Examples

julia> df = DataFrame(a=repeat(1:3, inner=2),
                       b=repeat(1:2, inner=3),
    9 │     2      1  c       d                3
   10 │     2      2  d       d                4
   11 │     3      2  e       d                5
   12 │     3      2  f       d                6
source
DataFrames.unstackFunction
unstack(df::AbstractDataFrame, rowkeys, colkey, value;
         renamecols::Function=identity, allowmissing::Bool=false,
         combine=only, fill=missing, threads::Bool=true)
 unstack(df::AbstractDataFrame, colkey, value;
  Row │ a       b
      │ Int64?  Int64?
 ─────┼────────────────
    1 │      3       4
source
Base.permutedimsFunction
permutedims(df::AbstractDataFrame,
             [src_namescol::Union{Int, Symbol, AbstractString}],
             [dest_namescol::Union{Symbol, AbstractString}];
             makeunique::Bool=false, strict::Bool=true)

Turn df on its side such that rows become columns and values in the column indexed by src_namescol become the names of new columns. In the resulting DataFrame, column names of df will become the first column with name specified by dest_namescol.

Arguments

  • df : the AbstractDataFrame
  • src_namescol : the column that will become the new header. If omitted then column names :x1, :x2, ... are generated automatically.
  • dest_namescol : the name of the first column in the returned DataFrame. Defaults to the same name as src_namescol. Not supported when src_namescol is a vector or is omitted.
  • makeunique : if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate). Not supported when src_namescol is omitted.
  • strict : if true (the default), an error will be raised unless the values contained in src_namescol are all Symbol or all AbstractString, or can all be converted to String using convert. If false then any values are accepted and they will be converted to strings using the string function. Not supported when src_namescol is a vector or is omitted.

Note: The element types of columns in resulting DataFrame (other than the first column if it is created from df column names, which always has element type String) will depend on the element types of all input columns based on the result of promote_type. That is, if the source data frame contains Int and Float64 columns, resulting columns will have element type Float64. If the source has Int and String columns, resulting columns will have element type Any.

Metadata: table-level :note-style metadata is preserved and column-level metadata is dropped.

Examples

julia> df = DataFrame(a=1:2, b=3:4)
 ─────┼─────────────────────────────
    1 │ b               1     two
    2 │ c               3     4
    3 │ d               true  false
source

Sorting

Base.issortedFunction
issorted(df::AbstractDataFrame, cols=All();
          lt::Union{Function, AbstractVector{<:Function}}=isless,
          by::Union{Function, AbstractVector{<:Function}}=identity,
          rev::Union{Bool, AbstractVector{Bool}}=false,
 false
 
 julia> issorted(df, :b, rev=true)
 true
source
DataFrames.orderFunction
order(col::ColumnIndex; kwargs...)

Specify sorting order for a column col in a data frame. kwargs can be lt, by, rev, and order with values following the rules defined in sort!.

See also: sort!, sort

Examples

julia> df = DataFrame(x=[-3, -1, 0, 2, 4], y=1:5)
 5×2 DataFrame
  Row │ x      y
      │ Int64  Int64
    2 │    -1      2
    3 │     2      4
    4 │    -3      1
    5 │     4      5
source
Base.sortFunction
sort(df::AbstractDataFrame, cols=All();
      alg::Union{Algorithm, Nothing}=nothing,
      lt::Union{Function, AbstractVector{<:Function}}=isless,
      by::Union{Function, AbstractVector{<:Function}}=identity,
    1 │     1  c
    2 │     1  b
    3 │     2  a
    4 │     3  b
source
Base.sort!Function
sort!(df::AbstractDataFrame, cols=All();
       alg::Union{Algorithm, Nothing}=nothing,
       lt::Union{Function, AbstractVector{<:Function}}=isless,
       by::Union{Function, AbstractVector{<:Function}}=identity,
    1 │     1  c
    2 │     1  b
    3 │     2  a
    4 │     3  b
source
Base.sortpermFunction
sortperm(df::AbstractDataFrame, cols=All();
          alg::Union{Algorithm, Nothing}=nothing,
          lt::Union{Function, AbstractVector{<:Function}}=isless,
          by::Union{Function, AbstractVector{<:Function}}=identity,
  2
  4
  3
  1
source

Joining

DataAPI.antijoinFunction
antijoin(df1, df2; on, makeunique=false, validate=(false, false), matchmissing=:error)

Perform an anti join of two data frame objects and return a DataFrame containing the result. An anti join returns the subset of rows of df1 that do not match with the keys in df2.

The order of rows in the result is kept from df1.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). A left=>right pair of names can be used instead of a name, for the case where a key has different names in df1 and df2 (it is allowed to mix names and name pairs in a vector). Key values are compared using isequal. on is a required argument.
  • makeunique : ignored as no columns are added to df1 columns (it is provided for consistency with other functions).
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
  • matchmissing : if equal to :error throw an error if missing is present in on columns; if equal to :equal then missing is allowed and missings are matched; if equal to :notequal then missings are dropped in df2 on columns.

It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

Metadata: table-level and column-level :note-style metadata are taken from df1.

See also: innerjoin, leftjoin, rightjoin, outerjoin, semijoin, crossjoin.

Examples

julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
 3×2 DataFrame
  Row │ ID     Name
      │ Int64  String
  Row │ ID     Name
      │ Int64  String
 ─────┼──────────────────
    1 │     3  Joe Blogs
source
DataAPI.crossjoinFunction
crossjoin(df1::AbstractDataFrame, df2::AbstractDataFrame;
           makeunique::Bool=false, renamecols=identity => identity)
 crossjoin(df1, df2, dfs...; makeunique = false)

Perform a cross join of two or more data frame objects and return a DataFrame containing the result. A cross join returns the cartesian product of rows from all passed data frames, where the first passed data frame is assigned to the dimension that changes the slowest and the last data frame is assigned to the dimension that changes the fastest.

Arguments

  • df1, df2, dfs... : the AbstractDataFrames to be joined

Keyword Arguments

  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • renamecols : a Pair specifying how columns of the left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or a Symbol, in which case it is appended to the original column name; alternatively a function can be passed, in which case it is applied to each column name, which is passed to it as a String.

If more than two data frames are passed, the join is performed recursively with left associativity.

Metadata: table-level :note-style metadata is preserved only for keys which are defined in all passed tables and have the same value. Column-level :note-style metadata is preserved from both tables.

See also: innerjoin, leftjoin, rightjoin, outerjoin, semijoin, antijoin.

Examples

julia> df1 = DataFrame(X=1:3)
 3×1 DataFrame
    3 │     2  a
    4 │     2  b
    5 │     3  a
    6 │     3  b
source
DataAPI.innerjoinFunction
innerjoin(df1, df2; on, makeunique=false, validate=(false, false),
           renamecols=(identity => identity), matchmissing=:error,
           order=:undefined)
 innerjoin(df1, df2, dfs...; on, makeunique=false,
      │ Int64  String    String
 ─────┼─────────────────────────
    1 │     1  John Doe  Lawyer
    2 │     2  Jane Doe  Doctor
source
DataAPI.leftjoinFunction
leftjoin(df1, df2; on, makeunique=false, source=nothing, validate=(false, false),
          renamecols=(identity => identity), matchmissing=:error, order=:undefined)

Perform a left join of two data frame objects and return a DataFrame containing the result. A left join includes all rows from df1.

In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df1. This behavior may change in future releases.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). A left=>right pair of names can be used instead of a name, for the case where a key has different names in df1 and df2 (it is allowed to mix names and name pairs in a vector). Key values are compared using isequal. on is a required argument.
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • source : Default: nothing. If a Symbol or string, adds an indicator column with the given name, for whether a row appeared in only df1 ("left_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true.
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.
  • renamecols : a Pair specifying how columns of the left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or a Symbol, in which case it is appended to the original column name; alternatively a function can be passed, in which case it is applied to each column name, which is passed to it as a String. Note that renamecols does not affect on columns, whose names are always taken from the left data frame and left unchanged.
  • matchmissing : if equal to :error throw an error if missing is present in on columns; if equal to :equal then missing is allowed and missings are matched; if equal to :notequal then missings are dropped in df2 on columns.
  • order : if :undefined (the default) the order of rows in the result is undefined and may change in future releases. If :left then the order of rows from the left data frame is retained. If :right then the order of rows from the right data frame is retained (non-matching rows are put at the end).

All columns of the returned data frame will support missing values.

It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

Metadata: table-level and column-level :note-style metadata is taken from df1 (including key columns), except for columns added to it from df2, whose column-level :note-style metadata is taken from df2.

See also: innerjoin, rightjoin, outerjoin, semijoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
 3×2 DataFrame
  Row │ ID     Name
 ─────┼───────────────────────────
    1 │     1  John Doe   Lawyer
    2 │     2  Jane Doe   Doctor
    3 │     3  Joe Blogs  missing
source
DataFrames.leftjoin!Function
leftjoin!(df1, df2; on, makeunique=false, source=nothing,
           matchmissing=:error)

Perform a left join of two data frame objects by updating the df1 with the joined columns from df2.

A left join includes all rows from df1 and leaves all rows and columns from df1 untouched. Note that each row in df1 must have at most one match in df2. Otherwise, this function would not be able to execute the join in-place since new rows would need to be added to df1.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). A left=>right pair of names can be used instead of a name, for the case where a key has different names in df1 and df2 (it is allowed to mix names and name pairs in a vector). Key values are compared using isequal. on is a required argument.
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • source : Default: nothing. If a Symbol or string, adds an indicator column with the given name, for whether a row appeared in only df1 ("left_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true.
  • matchmissing : if equal to :error throw an error if missing is present in on columns; if equal to :equal then missing is allowed and missings are matched; if equal to :notequal then missings are dropped in df2 on columns.

The columns added to df1 from df2 will support missing values.

It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.

Metadata: table-level and column-level :note-style metadata are taken from df1 (including key columns), except for columns added to it from df2, whose column-level :note-style metadata is taken from df2.

See also: leftjoin.

Examples

julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
 3×2 DataFrame
  Row │ ID     Name
 ─────┼───────────────────────────────────────────────
    1 │     1  John Doe   Lawyer   Lawyer   both
    2 │     2  Jane Doe   Doctor   Doctor   both
    3 │     3  Joe Blogs  missing  missing  left_only
source
DataAPI.outerjoinFunction
outerjoin(df1, df2; on, makeunique=false, source=nothing, validate=(false, false),
           renamecols=(identity => identity), matchmissing=:error, order=:undefined)
 outerjoin(df1, df2, dfs...; on, makeunique = false,
           validate = (false, false), matchmissing=:error, order=:undefined)

Perform an outer join of two or more data frame objects and return a DataFrame containing the result. An outer join includes rows with keys that appear in any of the passed data frames.

The order of rows in the result is undefined and may change in future releases.

In the returned data frame the type of the columns on which the data frames are joined is determined by the element type of these columns in both df1 and df2. This behavior may change in future releases.

Arguments

  • df1, df2, dfs... : the AbstractDataFrames to be joined

Keyword Arguments

  • on : The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). When joining only two data frames, a left=>right pair of names can be used instead of a name, for the case where a key has different names in df1 and df2 (it is allowed to mix names and name pairs in a vector). Key values are compared using isequal. on is a required argument.
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • source : Default: nothing. If a Symbol or string, adds indicator column with the given name for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true. This argument is only supported when joining exactly two data frames.
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run the check for df1 and the second element for df2. By default no check is performed.
  • renamecols : a Pair specifying how columns of the left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or a Symbol, in which case it is appended to the original column name; alternatively a function can be passed, in which case it is applied to each column name, which is passed to it as a String. Note that renamecols does not affect on columns, whose names are always taken from the left data frame and left unchanged.
  • matchmissing : if equal to :error throw an error if missing is present in on columns; if equal to :equal then missing is allowed and missings are matched.
  • order : if :undefined (the default) the order of rows in the result is undefined and may change in future releases. If :left then the order of rows from the left data frame is retained (non-matching rows are put at the end). If :right then the order of rows from the right data frame is retained (non-matching rows are put at the end).

All columns of the returned data frame will support missing values.

It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

If more than two data frames are passed, the join is performed recursively with left associativity. In this case the source keyword argument is not supported and the validate keyword argument is applied recursively with left associativity.
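The renamecols and order keywords can be combined; a minimal sketch (the toy frames and the _left/_right suffixes are illustrative, not taken from the examples in this manual):

```julia
using DataFrames

df1 = DataFrame(ID=[1, 2], A=["a", "b"])
df2 = DataFrame(ID=[2, 3], B=["x", "y"])

# Append a per-side suffix to the non-key columns and keep df1's rows first;
# the non-matching key from df2 is placed at the end because order=:left.
res = outerjoin(df1, df2, on=:ID, renamecols="_left" => "_right", order=:left)

names(res)  # ["ID", "A_left", "B_right"] -- the key column keeps its name
res.ID      # [1, 2, 3]
```

Note that the key column ID is not renamed: renamecols only affects non-key columns.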

Metadata: table-level :note-style metadata and column-level :note-style metadata for key columns is preserved only for keys which are defined in all passed tables and have the same value. Column-level :note-style metadata is preserved for all other columns.

See also: innerjoin, leftjoin, rightjoin, semijoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
    1 │     1  John Doe   Lawyer
    2 │     2  Jane Doe   Doctor
    3 │     3  Joe Blogs  missing
   4 │     4  missing    Farmer
source
DataAPI.rightjoinFunction
rightjoin(df1, df2; on, makeunique=false, source=nothing,
           validate=(false, false), renamecols=(identity => identity),
           matchmissing=:error, order=:undefined)

Perform a right join on two data frame objects and return a DataFrame containing the result. A right join includes all rows from df2.

The order of rows in the result is undefined and may change in future releases.

In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in df2. This behavior may change in future releases.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). A left=>right pair of names can be used instead of a name, for the case where a key has different names in df1 and df2 (it is allowed to mix names and name pairs in a vector). Key values are compared using isequal. on is a required argument.
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • source : Default: nothing. If a Symbol or string, adds indicator column with the given name for whether a row appeared in only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true.
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run the check for df1 and the second element for df2. By default no check is performed.
  • renamecols : a Pair specifying how columns of the left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or a Symbol, in which case it is appended to the original column name; alternatively a function can be passed, in which case it is applied to each column name, which is passed to it as a String. Note that renamecols does not affect on columns, whose names are always taken from the left data frame and left unchanged.
  • matchmissing : if equal to :error throw an error if missing is present in on columns; if equal to :equal then missing is allowed and missings are matched; if equal to :notequal then missings are dropped in df1 on columns.
  • order : if :undefined (the default) the order of rows in the result is undefined and may change in future releases. If :left then the order of rows from the left data frame is retained (non-matching rows are put at the end). If :right then the order of rows from the right data frame is retained.

All columns of the returned data frame will support missing values.

It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

Metadata: table-level and column-level :note-style metadata is taken from df2 (including key columns), except for columns added to it from df1, whose column-level :note-style metadata is taken from df1.

See also: innerjoin, leftjoin, outerjoin, semijoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
 3×2 DataFrame
 ─────┼─────────────────────────
    1 │     1  John Doe  Lawyer
    2 │     2  Jane Doe  Doctor
   3 │     4  missing   Farmer
source
DataAPI.semijoinFunction
semijoin(df1, df2; on, makeunique=false, validate=(false, false), matchmissing=:error)

Perform a semi join of two data frame objects and return a DataFrame containing the result. A semi join returns the subset of rows of df1 that match with the keys in df2.

The order of rows in the result is kept from df1.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : The names of the key columns on which to join the data frames. This can be a single name, or a vector of names (for joining on multiple columns). A left=>right pair of names can be used instead of a name, for the case where a key has different names in df1 and df2 (it is allowed to mix names and name pairs in a vector). Key values are compared using isequal. on is a required argument.
  • makeunique : ignored as no columns are added to df1 columns (it is provided for consistency with other functions).
  • indicator : Default: nothing. If a Symbol or string, adds categorical indicator column with the given name for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If the name is already in use, the column name will be modified if makeunique=true.
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run the check for df1 and the second element for df2. By default no check is performed.
  • matchmissing : if equal to :error throw an error if missing is present in on columns; if equal to :equal then missing is allowed and missings are matched; if equal to :notequal then missings are dropped in df2 on columns.
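A short sketch contrasting the matchmissing variants (the frames here are illustrative):

```julia
using DataFrames

df1 = DataFrame(id=[1, 2, missing], v=1:3)
df2 = DataFrame(id=[2, missing])

# matchmissing=:error (the default) would throw here because `id` contains missing.
m_eq = semijoin(df1, df2, on=:id, matchmissing=:equal)     # missing matches missing
m_ne = semijoin(df1, df2, on=:id, matchmissing=:notequal)  # missings dropped in df2

m_eq.v  # [2, 3] -- rows 2 and 3 of df1 are kept
m_ne.v  # [2]    -- only the row with id == 2 is kept
```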

It is not allowed to join on columns that contain NaN or -0.0 in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a CategoricalVector.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

Metadata: table-level and column-level :note-style metadata are taken from df1.

See also: innerjoin, leftjoin, rightjoin, outerjoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
 3×2 DataFrame
  Row │ ID     Name
      │ Int64  String
      │ Int64  String
 ─────┼─────────────────
    1 │     1  John Doe
   2 │     2  Jane Doe
source

Grouping

Base.getFunction
get(gd::GroupedDataFrame, key, default)

Get a group based on the values of the grouping columns.

key may be a GroupKey, NamedTuple or Tuple of grouping column values (in the same order as the cols argument to groupby). It may also be an AbstractDict, in which case the order of the arguments does not matter.

Examples

julia> df = DataFrame(a=repeat([:foo, :bar, :baz], outer=[2]),
                       b=repeat([2, 1], outer=[3]),
                       c=1:6);
 
    1 │ baz         2      3
    2 │ baz         1      6
 
julia> get(gd, (:qux,), nothing)
source
DataAPI.groupbyFunction
groupby(d::AbstractDataFrame, cols;
         sort::Union{Bool, Nothing, NamedTuple}=nothing,
         skipmissing::Bool=false)

Return a GroupedDataFrame representing a view of an AbstractDataFrame split into row groups.

Arguments

  • df : an AbstractDataFrame to split
  • cols : data frame columns to group by. Can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). In particular if the selector picks no columns then a single-group GroupedDataFrame is created. As a special case, if cols is a single column or a vector of columns then it can contain columns wrapped in order that will be used to determine the order of groups if sort is true or a NamedTuple (if sort is false, then passing order is an error; if sort is nothing then it is set to true when order is passed).
  • sort : if sort=true sort groups according to the values of the grouping columns cols; if sort=false groups are created in their order of appearance in df; if sort=nothing (the default) then the fastest available grouping algorithm is picked and in consequence the order of groups in the result is undefined and may change in future releases; below a description of the current implementation is provided. Additionally sort can be a NamedTuple having some or all of alg, lt, by, rev, and order fields. In this case the groups are sorted and their order follows the sortperm order.
  • skipmissing : whether to skip groups with missing values in one of the grouping columns cols

Details

An iterator over a GroupedDataFrame returns a SubDataFrame view for each grouping into df. Within each group, the order of rows in df is preserved.

A GroupedDataFrame also supports indexing by groups, select, transform, and combine (which applies a function to each group and combines the result into a data frame).

GroupedDataFrame also supports the dictionary interface. The keys are GroupKey objects returned by keys(::GroupedDataFrame), which can also be used to get the values of the grouping columns for each group. Tuples and NamedTuples containing the values of the grouping columns (in the same order as the cols argument) are also accepted as indices. Finally, an AbstractDict can be used to index into a grouped data frame where the keys are column names of the data frame. The order of the keys does not matter in this case.

In the current implementation if sort=nothing groups are ordered following the order of appearance of values in the grouping columns, except when all grouping columns provide non-nothing DataAPI.refpool, in which case the order of groups follows the order of values returned by DataAPI.refpool. As a particular application of this rule if all cols are CategoricalVectors then groups are always sorted. Integer columns with a narrow range also use this optimization, so the order of groups when grouping on integer columns is undefined. A column is considered to be an integer column when deciding on the grouping algorithm choice if its eltype is a subtype of Union{Missing, Real}, all its elements are either missing or pass the isinteger test, and none of them is equal to -0.0.
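A brief sketch of the sort keyword variants (the data is illustrative; passing a NamedTuple assumes a DataFrames.jl version that supports it):

```julia
using DataFrames

df = DataFrame(g=[2, 1, 2, 3], x=1:4)

# sort=true orders groups by the grouping column values:
gd_sorted = groupby(df, :g, sort=true)
[k.g for k in keys(gd_sorted)]  # [1, 2, 3]

# A NamedTuple is forwarded to sorting, e.g. to reverse the group order:
gd_rev = groupby(df, :g, sort=(rev=true,))
[k.g for k in keys(gd_rev)]     # [3, 2, 1]
```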

See also

combine, select, select!, transform, transform!

Examples

julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
                       b=repeat([2, 1], outer=[4]),
      │ Int64  Int64  Int64
 ─────┼─────────────────────
    1 │     4      1      4
   2 │     4      1      8
source
DataFrames.groupcolsFunction
groupcols(gd::GroupedDataFrame)

Return a vector of Symbol column names in parent(gd) used for grouping.

source
DataFrames.groupindicesFunction
groupindices(gd::GroupedDataFrame)

Return a vector of group indices for each row of parent(gd).

Rows appearing in group gd[i] are attributed index i. Rows not present in any group are attributed missing (this can happen if skipmissing=true was passed when creating gd, or if gd is a subset from a larger GroupedDataFrame).

The groupindices => target_col_name syntax (or just groupindices without specifying the target column name) is also supported in the transformation mini-language when passing a GroupedDataFrame to transformation functions (combine, select, etc.).

Examples

julia> df = DataFrame(id=["a", "c", "b", "b", "a"])
 5×1 DataFrame
  Row │ id
      │ String
    2 │ c           2
    3 │ b           3
    4 │ b           3
   5 │ a           1
source
Base.keysFunction
keys(gd::GroupedDataFrame)

Get the set of keys for each group of the GroupedDataFrame gd as a GroupKeys object. Each key is a GroupKey, which behaves like a NamedTuple holding the values of the grouping columns for a given group. Unlike the equivalent Tuple, NamedTuple, and AbstractDict, these keys can be used to index into gd efficiently. The ordering of the keys is identical to the ordering of the groups of gd under iteration and integer indexing.

Examples

julia> df = DataFrame(a=repeat([:foo, :bar, :baz], outer=[4]),
                       b=repeat([2, 1], outer=[6]),
                       c=1:12);
 
    2 │ foo         2      7
 
 julia> gd[keys(gd)[1]] == gd[1]
true
source
keys(dfc::DataFrameColumns)

Get a vector of column names of dfc as Symbols.

source
Base.parentFunction
parent(gd::GroupedDataFrame)

Return the parent data frame of gd.

source
DataFrames.proprowFunction
proprow

Compute the proportion of rows which belong to each group, i.e. its number of rows divided by the total number of rows in a GroupedDataFrame.

This function can only be used in the transformation mini-language via the proprow => target_col_name syntax (or just proprow without specifying the target column name), when passing a GroupedDataFrame to transformation functions (combine, select, etc.).

Examples

julia> df = DataFrame(id=["a", "c", "b", "b", "a", "b"])
 6×1 DataFrame
  Row │ id
      │ String
    3 │ b       0.5
    4 │ b       0.5
    5 │ a       0.333333
   6 │ b       0.5
source
DataFrames.valuecolsFunction
valuecols(gd::GroupedDataFrame)

Return a vector of Symbol column names in parent(gd) not used for grouping.

source

Filtering rows

Base.alluniqueFunction
allunique(df::AbstractDataFrame, cols=:)

Return true if none of the rows of df are duplicated. Two rows are duplicates if all their columns contain equal values (according to isequal) for all columns in cols (by default, all columns).

Arguments

  • df : AbstractDataFrame
  • cols : a selector specifying the column(s) or their transformations to compare. Can be any column selector or transformation accepted by select.

See also unique and nonunique.

Examples

julia> df = DataFrame(i=1:4, x=[1, 2, 1, 2])
 4×2 DataFrame
  Row │ i      x
      │ Int64  Int64
 false
 
 julia> allunique(df, :i => ByRow(isodd))
false
source
Base.deleteat!Function
deleteat!(df::DataFrame, inds)

Delete rows specified by inds from a DataFrame df in place and return it.

Internally deleteat! is called for all columns so inds must be: a vector of sorted and unique integers, a boolean vector, an integer, or Not wrapping any valid selector.
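A short sketch of the accepted inds forms (the data is illustrative):

```julia
using DataFrames

df = DataFrame(a=1:5)

# A sorted vector of unique integers, Not wrapping a valid row selector,
# and a single integer are all accepted:
deleteat!(copy(df), [2, 4]).a  # [1, 3, 5]
deleteat!(copy(df), Not(1)).a  # [1] -- deletes every row except row 1
deleteat!(copy(df), 3).a       # [1, 2, 4, 5]
```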

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
 3×2 DataFrame
  Row │ a      b
      │ Int64  Int64
      │ Int64  Int64
 ─────┼──────────────
    1 │     1      4
   2 │     3      6
source
Base.emptyFunction
empty(df::AbstractDataFrame)

Create a new DataFrame with the same column names and column element types as df but with zero rows.

Metadata: this function preserves table-level and column-level :note-style metadata.

source
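A minimal sketch of empty (the data is illustrative): the result has no rows but keeps names and element types.

```julia
using DataFrames

df = DataFrame(a=1:3, b=["x", "y", "z"])

e = empty(df)
size(e)              # (0, 2)
names(e)             # ["a", "b"]
eltype.(eachcol(e))  # element types are kept: Int64 and String
```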
Base.empty!Function
empty!(df::DataFrame)

Remove all rows from df, making each of its columns empty.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
 3×2 DataFrame
  Row │ a      b
      │ Int64  Int64
 ─────┴──────────────
 
 julia> df.a, df.b
(Int64[], Int64[])
source
Base.filterFunction
filter(fun, df::AbstractDataFrame; view::Bool=false)
 filter(cols => fun, df::AbstractDataFrame; view::Bool=false)

Return a data frame containing only rows from df for which fun returns true.

If cols is not specified then the predicate fun is passed DataFrameRows. Elements of a DataFrameRow may be accessed with dot syntax or column indexing inside fun.

If cols is specified then the predicate fun is passed elements of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.

If view=false a freshly allocated DataFrame is returned. If view=true then a SubDataFrame view into df is returned.

Passing cols leads to a more efficient execution of the operation for large data frames.
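A sketch contrasting the positional-argument and AsTable forms of cols (the data is illustrative):

```julia
using DataFrames

df = DataFrame(x=[1, 2, 3], y=[3, 2, 1])

# Selected columns passed as separate positional arguments:
f1 = filter([:x, :y] => (x, y) -> x >= y, df)

# AsTable passes one NamedTuple of the selected columns per row:
f2 = filter(AsTable([:x, :y]) => nt -> nt.x >= nt.y, df)

f1 == f2  # both keep the rows where x >= y
```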

Note

This method is defined so that DataFrames.jl implements the Julia API for collections, but it is generally recommended to use the subset function instead as it is consistent with other DataFrames.jl functions (as opposed to filter).

Note

Due to type stability the filter(cols => fun, df::AbstractDataFrame; view::Bool=false) call is preferred in performance critical applications.

Metadata: this function preserves table-level and column-level :note-style metadata.

See also: filter!

Examples

julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"])
 4×2 DataFrame
  Row │ x      y
 ─────┼───────────────
    1 │     3  b
    2 │     1  c
   3 │     1  b
source
filter(fun, gdf::GroupedDataFrame; ungroup::Bool=false)
 filter(cols => fun, gdf::GroupedDataFrame; ungroup::Bool=false)

Return only groups in gd for which fun returns true as a GroupedDataFrame if ungroup=false (the default), or as a data frame if ungroup=true.

If cols is not specified then the predicate fun is called with a SubDataFrame for each group.

If cols is specified then the predicate fun is called for each group with views of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.

Note

This method is defined so that DataFrames.jl implements the Julia API for collections, but it is generally recommended to use the subset function instead as it is consistent with other DataFrames.jl functions (as opposed to filter).

Examples

julia> df = DataFrame(g=[1, 2], x=['a', 'b']);
 
 julia> gd = groupby(df, :g)
  Row │ g      x
      │ Int64  Char
 ─────┼─────────────
   1 │     1  a
source
Base.filter!Function
filter!(fun, df::AbstractDataFrame)
 filter!(cols => fun, df::AbstractDataFrame)

Remove rows from data frame df for which fun returns false.

If cols is not specified then the predicate fun is passed DataFrameRows. Elements of a DataFrameRow may be accessed with dot syntax or column indexing inside fun.

If cols is specified then the predicate fun is passed elements of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.

Passing cols leads to a more efficient execution of the operation for large data frames.

Note

This method is defined so that DataFrames.jl implements the Julia API for collections, but it is generally recommended to use the subset! function instead as it is consistent with other DataFrames.jl functions (as opposed to filter!).

Note

Due to type stability the filter!(cols => fun, df::AbstractDataFrame) call is preferred in performance critical applications.

Metadata: this function preserves table-level and column-level :note-style metadata.

See also: filter

Examples

julia> df = DataFrame(x=[3, 1, 2, 1], y=["b", "c", "a", "b"])
 4×2 DataFrame
  Row │ x      y
 ─────┼───────────────
    1 │     3  b
    2 │     1  c
   3 │     1  b
source
Base.keepat!Function
keepat!(df::DataFrame, inds)

Delete rows at all indices not specified by inds from a DataFrame df in place and return it.

Internally deleteat! is called for all columns so inds must be: a vector of sorted and unique integers, a boolean vector, an integer, or Not wrapping any valid selector.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
 3×2 DataFrame
  Row │ a      b
      │ Int64  Int64
      │ Int64  Int64
 ─────┼──────────────
    1 │     1      4
   2 │     3      6
source
Base.firstFunction
first(df::AbstractDataFrame)

Get the first row of df as a DataFrameRow.

Metadata: this function preserves table-level and column-level :note-style metadata.

source
first(df::AbstractDataFrame, n::Integer; view::Bool=false)

Get a data frame with the n first rows of df. Get all rows if n is greater than the number of rows in df. Error if n is negative.

If view=false a freshly allocated DataFrame is returned. If view=true then a SubDataFrame view into df is returned.

Metadata: this function preserves table-level and column-level :note-style metadata.

source
Base.lastFunction
last(df::AbstractDataFrame)

Get the last row of df as a DataFrameRow.

Metadata: this function preserves table-level and column-level :note-style metadata.

source
last(df::AbstractDataFrame, n::Integer; view::Bool=false)

Get a data frame with the n last rows of df. Get all rows if n is greater than the number of rows in df. Error if n is negative.

If view=false a freshly allocated DataFrame is returned. If view=true then a SubDataFrame view into df is returned.

Metadata: this function preserves table-level and column-level :note-style metadata.
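
A matching sketch for last (illustrative, assuming DataFrames.jl is loaded):

```julia
using DataFrames

df = DataFrame(a=1:5)

last(df, 2)              # freshly allocated DataFrame with rows 4 and 5
last(df, 2; view=true)   # SubDataFrame view of the last two rows
```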

source
DataFrames.nonuniqueFunction
nonunique(df::AbstractDataFrame; keep::Symbol=:first)
 nonunique(df::AbstractDataFrame, cols; keep::Symbol=:first)

Return a Vector{Bool} in which true entries indicate duplicate rows.

Duplicate rows are those for which at least another row contains equal values (according to isequal) for all columns in cols (by default, all columns). If keep=:first (the default), only the first occurrence of a set of duplicate rows is indicated with a false entry. If keep=:last, only the last occurrence of a set of duplicate rows is indicated with a false entry. If keep=:noduplicates, only rows without any duplicates are indicated with a false entry.

Arguments

  • df : AbstractDataFrame
  • cols : a selector specifying the column(s) or their transformations to compare. Can be any column selector or transformation accepted by select that returns at least one column if df has at least one column.

See also unique and unique!.

Examples

julia> df = DataFrame(i=1:4, x=[1, 2, 1, 2])
 4×2 DataFrame
  Row │ i      x
  1
  1
  1
 1
source
Base.Iterators.onlyFunction
only(df::AbstractDataFrame)

If df has a single row return it as a DataFrameRow; otherwise throw ArgumentError.

Metadata: this function preserves table-level and column-level :note-style metadata.
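
A brief sketch of both outcomes (illustrative, assuming DataFrames.jl is loaded):

```julia
using DataFrames

row = only(DataFrame(a=[42]))  # DataFrameRow; row.a == 42

# only(DataFrame(a=1:2)) would throw ArgumentError:
# the data frame must contain exactly one row
```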

source
Base.pop!Function
pop!(df::DataFrame)

Remove the last row from df and return a NamedTuple created from this row.

Note

Using this method for very wide data frames may lead to expensive compilation.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
 3×2 DataFrame
  Row │ a      b
      │ Int64  Int64
      │ Int64  Int64
 ─────┼──────────────
    1 │     1      4
    2 │     2      5
source
Base.popat!Function
popat!(df::DataFrame, i::Integer)

Remove the i-th row from df and return a NamedTuple created from this row.

Note

Using this method for very wide data frames may lead to expensive compilation.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
 3×2 DataFrame
  Row │ a      b
      │ Int64  Int64
      │ Int64  Int64
 ─────┼──────────────
    1 │     1      4
    2 │     3      6
source
Base.popfirst!Function
popfirst!(df::DataFrame)

Remove the first row from df and return a NamedTuple created from this row.

Note

Using this method for very wide data frames may lead to expensive compilation.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
 3×2 DataFrame
  Row │ a      b
      │ Int64  Int64
      │ Int64  Int64
 ─────┼──────────────
    1 │     2      5
    2 │     3      6
source
Base.resize!Function
resize!(df::DataFrame, n::Integer)

Resize df to have n rows by calling resize! on all columns of df.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
 3×2 DataFrame
  Row │ a      b
      │ Int64  Int64
      │ Int64  Int64
 ─────┼──────────────
    1 │     1      4
    2 │     2      5
source
DataFrames.subsetFunction
subset(df::AbstractDataFrame, args...;
        skipmissing::Bool=false, view::Bool=false, threads::Bool=true)
 subset(gdf::GroupedDataFrame, args...;
        skipmissing::Bool=false, view::Bool=false,
      │ Int64  Bool   Bool   Bool?    Int64
 ─────┼─────────────────────────────────────
    1 │     3   true  false  missing     11
    2 │     4  false  false  missing     12
source
DataFrames.subset!Function
subset!(df::AbstractDataFrame, args...;
         skipmissing::Bool=false, threads::Bool=true)
 subset!(gdf::GroupedDataFrame{DataFrame}, args...;
         skipmissing::Bool=false, ungroup::Bool=true, threads::Bool=true)

Update data frame df or the parent of gdf in place to contain only rows for which all values produced by transformation(s) args for a given row are true. All transformations must produce vectors containing true or false. When the first argument is a GroupedDataFrame, transformations are also allowed to return a single true or false value, which results in including or excluding a whole group.

If skipmissing=false (the default) args are required to produce results containing only Bool values. If skipmissing=true, additionally missing is allowed and it is treated as false (i.e. rows for which one of the conditions returns missing are skipped).

Each argument passed in args can be any specifier following the rules described for select with the restriction that:

  • specifying target column name is not allowed as subset! does not create new columns;
  • every passed transformation must return a scalar or a vector (returning AbstractDataFrame, NamedTuple, DataFrameRow or AbstractMatrix is not supported).

If ungroup=false the passed GroupedDataFrame gdf is updated (preserving the order of its groups) and returned.

If threads=true (the default) transformations may be run in separate tasks which can execute in parallel (possibly being applied to multiple rows or groups at the same time). Whether or not tasks are actually spawned and their number are determined automatically. Set to false if some transformations require serial execution or are not thread-safe.

If GroupedDataFrame is subsetted then it must include all groups present in the parent data frame, like in select!. In this case the passed GroupedDataFrame is updated to have correct groups after its parent is updated.

Note

Note that, because subset! works in exactly the same way as the other transformation functions defined in DataFrames.jl, it is the preferred way to subset rows of a data frame or grouped data frame. In particular, it uses a different set of rules for specifying transformations than filter!, which is implemented in DataFrames.jl to ensure support for the standard Julia API for collections.

Metadata: this function preserves table-level and column-level :note-style metadata.

See also: subset, filter!, select!

Examples

julia> df = DataFrame(id=1:4, x=[true, false, true, false], y=[true, true, false, false])
      │ Int64  Bool   Bool   Bool?    Int64
 ─────┼─────────────────────────────────────
    1 │     3   true  false  missing     11
    2 │     4  false  false  missing     12
source
Base.uniqueFunction
unique(df::AbstractDataFrame; view::Bool=false, keep::Symbol=:first)
 unique(df::AbstractDataFrame, cols; view::Bool=false, keep::Symbol=:first)

Return a data frame containing only unique rows in df.

Non-unique (duplicate) rows are those for which at least another row contains equal values (according to isequal) for all columns in cols (by default, all columns). If keep=:first (the default), only the first occurrence of a set of duplicate rows is kept. If keep=:last, only the last occurrence of a set of duplicate rows is kept. If keep=:noduplicates, only rows without any duplicates are kept.

If view=false a freshly allocated DataFrame is returned, and if view=true then a SubDataFrame view into df is returned.

Arguments

  • df : the AbstractDataFrame
  • cols : a selector specifying the column(s) or their transformations to compare. Can be any column selector or transformation accepted by select that returns at least one column if df has at least one column.

Metadata: this function preserves table-level and column-level :note-style metadata.

See also: unique!, nonunique.

Examples

julia> df = DataFrame(i=1:4, x=[1, 2, 1, 2])
 4×2 DataFrame
  Row │ i      x
 0×2 DataFrame
  Row │ i      x
      │ Int64  Int64
 ─────┴──────────────
source
Base.unique!Function
unique!(df::AbstractDataFrame; keep::Symbol=:first)
 unique!(df::AbstractDataFrame, cols; keep::Symbol=:first)

Update df in-place to contain only unique rows.

Non-unique (duplicate) rows are those for which at least another row contains equal values (according to isequal) for all columns in cols (by default, all columns). If keep=:first (the default), only the first occurrence of a set of duplicate rows is kept. If keep=:last, only the last occurrence of a set of duplicate rows is kept. If keep=:noduplicates, only rows without any duplicates are kept.

Arguments

  • df : the AbstractDataFrame
  • cols : column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.) specifying the column(s) to compare. Can be any column selector or transformation accepted by select that returns at least one column if df has at least one column.

Metadata: this function preserves table-level and column-level :note-style metadata.

See also: unique, nonunique.

Examples

julia> df = DataFrame(i=1:4, x=[1, 2, 1, 2])
 4×2 DataFrame
  Row │ i      x
 0×2 DataFrame
  Row │ i      x
      │ Int64  Int64
 ─────┴──────────────
source

Working with missing values

Missings.allowmissingFunction
allowmissing(df::AbstractDataFrame, cols=:)

Return a copy of data frame df with columns cols converted to element type Union{T, Missing} from T to allow support for missing values.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If cols is omitted all columns in the data frame are converted.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(a=[1, 2])
 2×1 DataFrame
  Row │ a
      │ Int64
      │ Int64?
 ─────┼────────
    1 │      1
    2 │      2
source
DataFrames.allowmissing!Function
allowmissing!(df::DataFrame, cols=:)

Convert columns cols of data frame df from element type T to Union{T, Missing} to support missing values.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If cols is omitted all columns in the data frame are converted.

Metadata: this function preserves table-level and column-level :note-style metadata.
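
Although the upstream docstring has no example here, the in-place behavior can be sketched as follows (illustrative, assuming DataFrames.jl is loaded):

```julia
using DataFrames

df = DataFrame(a=[1, 2], b=["x", "y"])
allowmissing!(df, :a)    # converts only column :a in place

eltype(df.a)             # Union{Missing, Int64}
df.a[1] = missing        # now allowed; would error before the conversion
```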

source
DataFrames.completecasesFunction
completecases(df::AbstractDataFrame, cols=:)

Return a Boolean vector with true entries indicating rows without missing values (complete cases) in data frame df.

If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers) that returns at least one column if df has at least one column.

See also: dropmissing and dropmissing!. Use findall(completecases(df)) to get the indices of the rows.

Examples

julia> df = DataFrame(i=1:5,
                       x=[missing, 4, missing, 2, 1],
                       y=[missing, missing, "c", "d", "e"])
 5×3 DataFrame
  0
  0
  1
 1
source
Missings.disallowmissingFunction
disallowmissing(df::AbstractDataFrame, cols=:; error::Bool=true)

Return a copy of data frame df with columns cols converted from element type Union{T, Missing} to T to drop support for missing values.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If cols is omitted all columns in the data frame are converted.

If error=false then columns containing a missing value will be skipped instead of throwing an error.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(a=Union{Int, Missing}[1, 2])
 2×1 DataFrame
  Row │ a
      │ Int64?
      │ Int64?
 ─────┼─────────
    1 │       1
    2 │ missing
source
DataFrames.disallowmissing!Function
disallowmissing!(df::DataFrame, cols=:; error::Bool=true)

Convert columns cols of data frame df from element type Union{T, Missing} to T to drop support for missing values.

cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If cols is omitted all columns in the data frame are converted.

If error=false then columns containing a missing value will be skipped instead of throwing an error.

Metadata: this function preserves table-level and column-level :note-style metadata.
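
A sketch of the in-place conversion and the error=false behavior (illustrative, assuming DataFrames.jl is loaded):

```julia
using DataFrames

df = DataFrame(a=Union{Int, Missing}[1, 2], b=[missing, 2])

disallowmissing!(df, :a)           # :a becomes a plain Int64 column
disallowmissing!(df, error=false)  # :b contains missing, so it is skipped

eltype(df.a)   # Int64
eltype(df.b)   # still allows Missing
```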

source
DataFrames.dropmissingFunction
dropmissing(df::AbstractDataFrame, cols=:; view::Bool=false, disallowmissing::Bool=!view)

Return a data frame excluding rows with missing values in df.

If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If view=false a freshly allocated DataFrame is returned. If view=true then a SubDataFrame view into df is returned. In this case disallowmissing must be false.

If disallowmissing is true (the default when view is false) then columns specified in cols will be converted so as not to allow for missing values using disallowmissing!.

See also: completecases and dropmissing!.

Metadata: this function preserves table-level and column-level :note-style metadata.

Examples

julia> df = DataFrame(i=1:5,
                       x=[missing, 4, missing, 2, 1],
                       y=[missing, missing, "c", "d", "e"])
 5×3 DataFrame
      │ Int64  Int64  String
 ─────┼──────────────────────
    1 │     4      2  d
    2 │     5      1  e
source
DataFrames.dropmissing!Function
dropmissing!(df::AbstractDataFrame, cols=:; disallowmissing::Bool=true)

Remove rows with missing values from data frame df and return it.

If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If disallowmissing is true (the default) then the cols columns will get converted using disallowmissing!.

Metadata: this function preserves table-level and column-level :note-style metadata.

See also: dropmissing and completecases.

Examples

julia> df = DataFrame(i=1:5,
                       x=[missing, 4, missing, 2, 1],
                       y=[missing, missing, "c", "d", "e"])
 5×3 DataFrame
      │ Int64  Int64  String
 ─────┼──────────────────────
    1 │     4      2  d
    2 │     5      1  e
source

Iteration

Base.eachcolFunction
eachcol(df::AbstractDataFrame)

Return a DataFrameColumns object, a vector-like object that allows iterating over an AbstractDataFrame column by column.

Indexing into DataFrameColumns objects using integer, Symbol or string returns the corresponding column (without copying). Indexing into DataFrameColumns objects using a multiple column selector returns a subsetted DataFrameColumns object with a new parent containing only the selected columns (without copying).

DataFrameColumns supports most of the AbstractVector API. The key differences are that it is read-only and that the keys function returns a vector of Symbols (and not integers as for normal vectors).

In particular findnext, findprev, findfirst, findlast, and findall functions are supported, and in findnext and findprev functions it is allowed to pass an integer, string, or Symbol as a reference index.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
 4×2 DataFrame
  Row │ x      y
      │ Int64  Int64
 julia> sum.(eachcol(df))
 2-element Vector{Int64}:
  10
 50
source
Base.eachrowFunction
eachrow(df::AbstractDataFrame)

Return a DataFrameRows that iterates a data frame row by row, with each row represented as a DataFrameRow.

Because DataFrameRows have an eltype of Any, use copy(dfr::DataFrameRow) to obtain a named tuple, which supports iteration and property access like a DataFrameRow, but also passes information on the eltypes of the columns of df.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
 4×2 DataFrame
  Row │ x      y
      │ Int64  Int64
      │ Int64  Int64
 ─────┼──────────────
    1 │    14      4
    2 │    13      3
source
Base.valuesFunction
values(dfc::DataFrameColumns)

Get a vector of columns from dfc.

source
Base.pairsFunction
pairs(dfc::DataFrameColumns)

Return an iterator of pairs associating the name of each column of dfc with the corresponding column vector, i.e. name => col where name is the column name of the column col.
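
A combined sketch of values and pairs on DataFrameColumns (illustrative, assuming DataFrames.jl is loaded):

```julia
using DataFrames

df = DataFrame(x=1:2, y=3:4)
dfc = eachcol(df)

values(dfc)           # vector of the column vectors [1, 2] and [3, 4]
collect(pairs(dfc))   # pairs such as :x => [1, 2] and :y => [3, 4]
```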

source
Base.Iterators.partitionFunction
Iterators.partition(df::AbstractDataFrame, n::Integer)

Iterate over df data frame n rows at a time, returning each block as a SubDataFrame.

Examples

julia> collect(Iterators.partition(DataFrame(x=1:5), 2))
 3-element Vector{SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}}:
  2×1 SubDataFrame
  Row │ x
  Row │ x
      │ Int64
 ─────┼───────
    1 │     5
source
Iterators.partition(dfr::DataFrameRows, n::Integer)

Iterate over DataFrameRows dfr n rows at a time, returning each block as a DataFrameRows over a view of rows of parent of dfr.

Examples

julia> collect(Iterators.partition(eachrow(DataFrame(x=1:5)), 2))
 3-element Vector{DataFrames.DataFrameRows{SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}}}:
  2×1 DataFrameRows
  Row │ x
  Row │ x
      │ Int64
 ─────┼───────
    1 │     5
source

Equality

Base.isapproxFunction
isapprox(df1::AbstractDataFrame, df2::AbstractDataFrame;
          rtol::Real=atol>0 ? 0 : √eps, atol::Real=0,
          nans::Bool=false, norm::Function=norm)

Inexact equality comparison. df1 and df2 must have the same size and column names. Return true if isapprox with given keyword arguments applied to all pairs of columns stored in df1 and df2 returns true.
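
An illustrative sketch of the tolerance keywords (assuming DataFrames.jl is loaded):

```julia
using DataFrames

df1 = DataFrame(a=[1.0, 2.0])
df2 = DataFrame(a=[1.0 + 1e-10, 2.0])

isapprox(df1, df2)               # true: difference is within default rtol
isapprox(df1, df2, atol=1e-12)   # false: with atol set, rtol defaults to 0
```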

source

Metadata

DataAPI.metadataFunction
metadata(df::AbstractDataFrame, key::AbstractString, [default]; style::Bool=false)
 metadata(dfr::DataFrameRow, key::AbstractString, [default]; style::Bool=false)
 metadata(dfc::DataFrameColumns, key::AbstractString, [default]; style::Bool=false)
 metadata(dfr::DataFrameRows, key::AbstractString, [default]; style::Bool=false)

Return table-level metadata value associated with df for key key. If style=true return a tuple of metadata value and metadata style.

SubDataFrame and DataFrameRow expose only :note-style metadata of their parent.

If default is passed then return it if key does not exist; if style=true return (default, :default).

See also: metadatakeys, metadata!, deletemetadata!, emptymetadata!, colmetadata, colmetadatakeys, colmetadata!, deletecolmetadata!, emptycolmetadata!.

Examples

julia> df = DataFrame(a=1, b=2);
 julia> deletemetadata!(df, "name");
 
 julia> metadatakeys(df)
()

source
DataAPI.metadatakeysFunction
metadatakeys(df::AbstractDataFrame)
 metadatakeys(dfr::DataFrameRow)
 metadatakeys(dfc::DataFrameColumns)
 metadatakeys(dfr::DataFrameRows)

Return an iterator of table-level metadata keys which are set in the object.

Values can be accessed using metadata(df, key).

SubDataFrame and DataFrameRow expose only :note-style metadata keys of their parent.

See also: metadata, metadata!, deletemetadata!, emptymetadata!, colmetadata, colmetadatakeys, colmetadata!, deletecolmetadata!, emptycolmetadata!.

Examples

julia> df = DataFrame(a=1, b=2);
 julia> deletemetadata!(df, "name");
 
 julia> metadatakeys(df)
()
source
DataAPI.metadata!Function
metadata!(df::AbstractDataFrame, key::AbstractString, value; style::Symbol=:default)
 metadata!(dfr::DataFrameRow, key::AbstractString, value; style::Symbol=:default)
 metadata!(dfc::DataFrameColumns, key::AbstractString, value; style::Symbol=:default)
 metadata!(dfr::DataFrameRows, key::AbstractString, value; style::Symbol=:default)

Set table-level metadata for object df for key key to have value value and style style (:default by default) and return df.

For SubDataFrame and DataFrameRow only :note-style is allowed. Trying to set a key-value pair for which the key already exists in the parent data frame with another style throws an error.

See also: metadata, metadatakeys, deletemetadata!, emptymetadata!, colmetadata, colmetadatakeys, colmetadata!, deletecolmetadata!, emptycolmetadata!.

Examples

julia> df = DataFrame(a=1, b=2);
 julia> deletemetadata!(df, "name");
 
 julia> metadatakeys(df)
()

```

source
DataAPI.deletemetadata!Function
deletemetadata!(df::AbstractDataFrame, key::AbstractString)
 deletemetadata!(dfr::DataFrameRow, key::AbstractString)
 deletemetadata!(dfc::DataFrameColumns, key::AbstractString)
 deletemetadata!(dfr::DataFrameRows, key::AbstractString)

Delete table-level metadata from object df for key key and return df. If key does not exist, return df without modification.

For SubDataFrame and DataFrameRow only :note-style metadata from their parent can be deleted (as other styles are not propagated to views).

See also: metadata, metadatakeys, metadata!, emptymetadata!, colmetadata, colmetadatakeys, colmetadata!, deletecolmetadata!, emptycolmetadata!.

Examples

julia> df = DataFrame(a=1, b=2);
 julia> deletemetadata!(df, "name");
 
 julia> metadatakeys(df)
()

```

source
DataAPI.emptymetadata!Function
emptymetadata!(df::AbstractDataFrame)
 emptymetadata!(dfr::DataFrameRow)
 emptymetadata!(dfc::DataFrameColumns)
 emptymetadata!(dfr::DataFrameRows)

Delete all table-level metadata from object df.

For SubDataFrame and DataFrameRow only :note-style metadata from their parent can be deleted (as other styles are not propagated to views).

See also: metadata, metadatakeys, metadata!, deletemetadata!, colmetadata, colmetadatakeys, colmetadata!, deletecolmetadata!, emptycolmetadata!.

Examples

julia> df = DataFrame(a=1, b=2);
 julia> emptymetadata!(df);
 
 julia> metadatakeys(df)
()
source
DataAPI.colmetadataFunction
colmetadata(df::AbstractDataFrame, col::ColumnIndex, key::AbstractString, [default]; style::Bool=false)
 colmetadata(dfr::DataFrameRow, col::ColumnIndex, key::AbstractString, [default]; style::Bool=false)
 colmetadata(dfc::DataFrameColumns, col::ColumnIndex, key::AbstractString, [default]; style::Bool=false)
 colmetadata(dfr::DataFrameRows, col::ColumnIndex, key::AbstractString, [default]; style::Bool=false)

Return column-level metadata value associated with df for column col and key key.

SubDataFrame and DataFrameRow expose only :note-style metadata of their parent.

If default is passed then return it if key does not exist for column col; if style=true return (default, :default). If col does not exist in df always throw an error.

See also: metadata, metadatakeys, metadata!, deletemetadata!, emptymetadata!, colmetadatakeys, colmetadata!, deletecolmetadata!, emptycolmetadata!.

Examples

julia> df = DataFrame(a=1, b=2);
 julia> deletecolmetadata!(df, :a, "name");
 
 julia> colmetadatakeys(df)
()

```

source
DataAPI.colmetadatakeysFunction
colmetadatakeys(df::AbstractDataFrame, [col::ColumnIndex])
 colmetadatakeys(dfr::DataFrameRow, [col::ColumnIndex])
 colmetadatakeys(dfc::DataFrameColumns, [col::ColumnIndex])
 colmetadatakeys(dfr::DataFrameRows, [col::ColumnIndex])

If col is passed return an iterator of column-level metadata keys which are set for column col. If col is not passed return an iterator of col => colmetadatakeys(x, col) pairs for all columns that have metadata, where col are Symbol.

Values can be accessed using colmetadata(df, col, key).

SubDataFrame and DataFrameRow expose only :note-style metadata of their parent.

See also: metadata, metadatakeys, metadata!, deletemetadata!, emptymetadata!, colmetadata, colmetadata!, deletecolmetadata!, emptycolmetadata!.

Examples

julia> df = DataFrame(a=1, b=2);
 julia> deletecolmetadata!(df, :a, "name");
 
 julia> colmetadatakeys(df)
()

```

source
DataAPI.colmetadata!Function
colmetadata!(df::AbstractDataFrame, col::ColumnIndex, key::AbstractString, value; style::Symbol=:default)
 colmetadata!(dfr::DataFrameRow, col::ColumnIndex, key::AbstractString, value; style::Symbol=:default)
 colmetadata!(dfc::DataFrameColumns, col::ColumnIndex, key::AbstractString, value; style::Symbol=:default)
 colmetadata!(dfr::DataFrameRows, col::ColumnIndex, key::AbstractString, value; style::Symbol=:default)

Set column-level metadata in df for column col and key key to have value value and style style (:default by default) and return df.

For SubDataFrame and DataFrameRow only :note-style is allowed. Trying to set a key-value pair for which the key already exists in the parent data frame with another style throws an error.

See also: metadata, metadatakeys, metadata!, deletemetadata!, emptymetadata!, colmetadata, colmetadatakeys, deletecolmetadata!, emptycolmetadata!.

Examples

julia> df = DataFrame(a=1, b=2);
 julia> deletecolmetadata!(df, :a, "name");
 
 julia> colmetadatakeys(df)
()

```

source
DataAPI.deletecolmetadata!Function
deletecolmetadata!(df::AbstractDataFrame, col::ColumnIndex, key::AbstractString)
 deletecolmetadata!(dfr::DataFrameRow, col::ColumnIndex, key::AbstractString)
 deletecolmetadata!(dfc::DataFrameColumns, col::ColumnIndex, key::AbstractString)
 deletecolmetadata!(dfr::DataFrameRows, col::ColumnIndex, key::AbstractString)

Delete column-level metadata set in df for column col and key key and return df.

For SubDataFrame and DataFrameRow only :note-style metadata from their parent can be deleted (as other styles are not propagated to views).

See also: metadata, metadatakeys, metadata!, deletemetadata!, emptymetadata!, colmetadata, colmetadatakeys, colmetadata!, emptycolmetadata!.

Examples

julia> df = DataFrame(a=1, b=2);
 julia> deletecolmetadata!(df, :a, "name");
 
 julia> colmetadatakeys(df)
()
source
DataAPI.emptycolmetadata!Function
emptycolmetadata!(df::AbstractDataFrame, [col::ColumnIndex])
 emptycolmetadata!(dfr::DataFrameRow, [col::ColumnIndex])
 emptycolmetadata!(dfc::DataFrameColumns, [col::ColumnIndex])
 emptycolmetadata!(dfr::DataFrameRows, [col::ColumnIndex])

Delete column-level metadata set in df for column col if col is passed (otherwise delete all column-level metadata in df) and return df.

For SubDataFrame and DataFrameRow only :note-style metadata from their parent can be deleted (as other styles are not propagated to views).

See also: metadata, metadatakeys, metadata!, deletemetadata!, emptymetadata!, colmetadata, colmetadatakeys, colmetadata!, deletecolmetadata!.

Examples

julia> df = DataFrame(a=1, b=2);
 julia> emptycolmetadata!(df, :a);
 
 julia> colmetadatakeys(df)
()
source

Indexing

    General rules

    The following rules explain target functionality of how getindex, setindex!, view, and broadcasting are intended to work with DataFrame, SubDataFrame and DataFrameRow objects.

    The following values are a valid column index:

    • a scalar, later denoted as col:
      • a Symbol;
      • an AbstractString;
      • an Integer that is not Bool;
    • a vector, later denoted as cols:
      • a vector of Symbol (does not have to be a subtype of AbstractVector{Symbol});
      • a vector of AbstractString (does not have to be a subtype of AbstractVector{<:AbstractString});
      • a vector of Integer that are not Bool (does not have to be a subtype of AbstractVector{<:Integer});
      • a vector of Bool (must be a subtype of AbstractVector{Bool});
      • a regular expression (will be expanded to a vector of matching column names);
      • a Not expression (see InvertedIndices.jl); Not(idx) selects all indices not in the passed idx; when passed as column selector Not(idx...) is equivalent to Not(Cols(idx...)).
      • a Cols expression (see DataAPI.jl); Cols(idxs...) selects the union of the selections in idxs; in particular Cols() selects no columns and Cols(:) selects all columns; a special rule is Cols(predicate), where predicate is a predicate function; in this case the columns whose names passed to predicate as strings return true are selected.
      • a Between expression (see DataAPI.jl); Between(first, last) selects the columns between first and last inclusively;
      • an All expression (see DataAPI.jl); All() selects all columns, equivalent to :;
      • a literal colon : (selects all columns).
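
    For illustration, a sketch of a few of the column selectors above (the data frame and column names here are invented for the example):

```julia
julia> using DataFrames

julia> df = DataFrame(x1=1:3, x2=4:6, y=7:9);

julia> names(df[:, Not(:y)])               # everything except :y
2-element Vector{String}:
 "x1"
 "x2"

julia> names(df[:, r"x"])                  # regular expression selector
2-element Vector{String}:
 "x1"
 "x2"

julia> names(df[:, Between(:x2, :y)])      # consecutive range of columns
2-element Vector{String}:
 "x2"
 "y"

julia> names(df[:, Cols(startswith("x"))]) # predicate on column names
2-element Vector{String}:
 "x1"
 "x2"
```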

    The following values are a valid row index:

    • a scalar, later denoted as row:
      • an Integer that is not Bool;
    • a vector, later denoted as rows:
      • a vector of Integer that are not Bool (does not have to be a subtype of AbstractVector{<:Integer});
      • a vector of Bool (must be a subtype of AbstractVector{Bool});
      • a Not expression (see InvertedIndices.jl);
      • a literal colon : (selects all rows with copying);
      • a literal exclamation mark ! (selects all rows without copying).

    Additionally it is allowed to index into an AbstractDataFrame using a two-dimensional CartesianIndex.

    In the descriptions below df represents a DataFrame, sdf is a SubDataFrame and dfr is a DataFrameRow.

    : always expands to axes(df, 1) or axes(sdf, 1).

    df.col works like df[!, col] and sdf.col works like sdf[!, col] in all cases. An exception is that under Julia 1.6 or earlier df.col .= v and sdf.col .= v perform in-place broadcasting if col is present in df/sdf and is a valid identifier (this inconsistency is not present under Julia 1.7 and later).

    getindex and view

    The following list specifies the behavior of getindex and view operations depending on argument types.

    In particular a description explicitly mentions that the data is copied or reused without copying.

    For performance reasons, accessing, via getindex or view, a single row and multiple cols of a DataFrame, a SubDataFrame or a DataFrameRow always returns a DataFrameRow (which is a view type).

    getindex on DataFrame:

    • df[row, col] -> the value contained in row row of column col, the same as df[!, col][row];
    • df[CartesianIndex(row, col)] -> the same as df[row, col];
    • df[row, cols] -> a DataFrameRow with parent df;
    • df[rows, col] -> a copy of the vector df[!, col] with only the entries corresponding to rows selected, the same as df[!, col][rows];
    • df[rows, cols] -> a DataFrame containing copies of columns cols with only the entries corresponding to rows selected;
    • df[!, col] -> the vector contained in column col returned without copying; the same as df.col if col is a valid identifier.
    • df[!, cols] -> create a new DataFrame with columns cols without copying of columns; the same as select(df, cols, copycols=false).
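
    The copying rules above can be checked directly; a small sketch (column names invented for the example):

```julia
julia> using DataFrames

julia> df = DataFrame(a=1:3, b=4:6);

julia> df[2, :a]           # single cell
2

julia> df[!, :a] === df.a  # no copying: the stored column itself
true

julia> df[:, :a] === df.a  # copying: a fresh vector
false

julia> df[1:2, :a]         # copy of only the selected entries
2-element Vector{Int64}:
 1
 2
```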

    view on DataFrame:

    • @view df[row, col] -> a 0-dimensional view into df[!, col] in row row, the same as view(df[!, col], row);
    • @view df[CartesianIndex(row, col)] -> the same as @view df[row, col];
    • @view df[row, cols] -> the same as df[row, cols];
    • @view df[rows, col] -> a view into df[!, col] with rows selected, the same as view(df[!, col], rows);
    • @view df[rows, cols] -> a SubDataFrame with rows selected with parent df;
    • @view df[!, col] -> a view into df[!, col] with all rows.
    • @view df[!, cols] -> the same as @view df[:, cols].

    getindex on SubDataFrame:

    • sdf[row, col] -> a value contained in row row of column col;
    • sdf[CartesianIndex(row, col)] -> the same as sdf[row, col];
    • sdf[row, cols] -> a DataFrameRow with parent parent(sdf);
    • sdf[rows, col] -> a copy of sdf[!, col] with only rows rows selected, the same as sdf[!, col][rows];
    • sdf[rows, cols] -> a DataFrame containing columns cols and sdf[rows, col] as a vector for each col in cols;
    • sdf[!, col] -> a view of entries corresponding to sdf in the vector parent(sdf)[!, col]; the same as sdf.col if col is a valid identifier.
    • sdf[!, cols] -> create a new SubDataFrame with columns cols, the same parent as sdf, and the same rows selected; the same as select(sdf, cols, copycols=false).

    view on SubDataFrame:

    • @view sdf[row, col] -> a 0-dimensional view into sdf[!, col] at row row, the same as view(sdf[!, col], row);
    • @view sdf[CartesianIndex(row, col)] -> the same as @view sdf[row, col];
    • @view sdf[row, cols] -> a DataFrameRow with parent parent(sdf);
    • @view sdf[rows, col] -> a view into sdf[!, col] vector with rows selected, the same as view(sdf[!, col], rows);
    • @view sdf[rows, cols] -> a SubDataFrame with parent parent(sdf);
    • @view sdf[!, col] -> a view into sdf[!, col] vector with all rows.
    • @view sdf[!, cols] -> the same as @view sdf[:, cols].

    getindex on DataFrameRow:

    • dfr[col] -> the value contained in column col of dfr; the same as dfr.col if col is a valid identifier;
    • dfr[cols] -> a DataFrameRow with parent parent(dfr);

    view on DataFrameRow:

    • @view dfr[col] -> a 0-dimensional view into parent(dfr)[DataFrames.row(dfr), col];
    • @view dfr[cols] -> a DataFrameRow with parent parent(dfr);

    Note that views created with columns selector set to : change their columns' count if columns are added/removed/renamed in the parent; if column selector is other than : then view points to selected columns by their number at the moment of creation of the view.

    setindex!

    The following list specifies the behavior of setindex! operations depending on argument types.

    In particular a description explicitly mentions if the assignment is in-place.

    Note that if a setindex! operation throws an error the target data frame may be partially changed so it is unsafe to use it afterwards (the column length correctness will be preserved).

    setindex! on DataFrame:

    • df[row, col] = v -> set value of col in row row to v in-place;
    • df[CartesianIndex(row, col)] = v -> the same as df[row, col] = v;
    • df[row, cols] = v -> set row row of columns cols in-place; the same as dfr = df[row, cols]; dfr[:] = v;
    • df[rows, col] = v -> set rows rows of column col in-place; v must be an AbstractVector; if rows is : and col is a Symbol or AbstractString that is not present in df then a new column in df is created and holds a copy of v; equivalent to df.col = copy(v) if col is a valid identifier;
    • df[rows, cols] = v -> set rows rows of columns cols in-place; v must be an AbstractMatrix or an AbstractDataFrame (in this case column names must match);
    • df[!, col] = v -> replaces col with v without copying (with the exception that if v is an AbstractRange it gets converted to a Vector); also if col is a Symbol or AbstractString that is not present in df then a new column in df is created and holds v; equivalent to df.col = v if col is a valid identifier; this is allowed if ncol(df) == 0 || length(v) == nrow(df);
    • df[!, cols] = v -> replaces existing columns cols in data frame df with copying; v must be an AbstractMatrix or an AbstractDataFrame (in the latter case column names must match);
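
    A short sketch of the setindex! rules above (names invented for the example):

```julia
julia> using DataFrames

julia> df = DataFrame(a=1:3, b=4:6);

julia> df[2, :a] = 10;         # in-place scalar assignment

julia> df[:, :c] = [7, 8, 9];  # rows is : and :c is absent, so a new column holding a copy is created

julia> v = [0, 0, 0];

julia> df[!, :d] = v;          # creates :d holding v itself, without copying

julia> df.d === v
true

julia> df.c == [7, 8, 9]
true
```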

    setindex! on SubDataFrame:

    • sdf[row, col] = v -> set value of col in row row to v in-place;
    • sdf[CartesianIndex(row, col)] = v -> the same as sdf[row, col] = v;
    • sdf[row, cols] = v -> the same as dfr = df[row, cols]; dfr[:] = v in-place;
    • sdf[rows, col] = v -> set rows rows of column col, in-place; v must be an abstract vector;
    • sdf[rows, cols] = v -> set rows rows of columns cols in-place; v can be an AbstractMatrix or v can be AbstractDataFrame in which case column names must match;
    • sdf[!, col] = v -> replaces col with v with copying; if col is present in sdf then filtered-out rows in newly created vector are filled with values already present in that column and promote_type is used to determine the eltype of the new column; if col is not present in sdf then the operation is only allowed if sdf was created with : as column selector, in which case filtered-out rows are filled with missing; equivalent to sdf.col = v if col is a valid identifier; operation is allowed if length(v) == nrow(sdf);
    • sdf[!, cols] = v -> replaces existing columns cols in data frame sdf with copying; v must be an AbstractMatrix or an AbstractDataFrame (in the latter case column names must match); filtered-out rows in newly created vectors are filled with values already present in respective columns and promote_type is used to determine the eltype of the new columns;
    Note

    The rules above mean that sdf[:, col] = v is an in-place operation if col is present in sdf, therefore it will be fast in general. On the other hand using sdf[!, col] = v or sdf.col = v will always allocate a new vector, which is more expensive computationally.

    setindex! on DataFrameRow:

    • dfr[col] = v -> set value of col in row row to v in-place; equivalent to dfr.col = v if col is a valid identifier;
    • dfr[cols] = v -> set values of entries in columns cols in dfr by elements of v in place; v can be: 1) a Tuple or an AbstractArray, in which cases it must have a number of elements equal to length(dfr), 2) an AbstractDict, in which case column names must match, 3) a NamedTuple or DataFrameRow, in which case column names and order must match;

    Broadcasting

    The following broadcasting rules apply to AbstractDataFrame objects:

    • AbstractDataFrame behaves in broadcasting like a two-dimensional collection compatible with matrices.
    • If an AbstractDataFrame takes part in broadcasting then a DataFrame is always produced as a result. In this case the requested broadcasting operation produces an object with exactly two dimensions. An exception is when an AbstractDataFrame is used only as a source of broadcast assignment into an object of dimensionality higher than two.
    • If multiple AbstractDataFrame objects take part in broadcasting then they have to have identical column names.

    Note that if a broadcasting assignment operation throws an error the target data frame may be partially changed so it is unsafe to use it afterwards (the column length correctness will be preserved).

    Broadcasting DataFrameRow is currently not allowed (which is consistent with NamedTuple).

    It is possible to assign a value to AbstractDataFrame and DataFrameRow objects using the .= operator. In such an operation AbstractDataFrame is considered as two-dimensional and DataFrameRow as single-dimensional.

    Note

    The rule above means that, similar to single-dimensional objects in Base (e.g. vectors), DataFrameRow is considered to be column-oriented.

    Additional rules:

    • in the df[CartesianIndex(row, col)] .= v, df[row, col] .= v syntaxes v is broadcasted into the contents of df[row, col] (this is consistent with Julia Base);
    • in the df[row, cols] .= v syntaxes the assignment to df is performed in-place;
    • in the df[rows, col] .= v and df[rows, cols] .= v syntaxes the assignment to df is performed in-place; if rows is : and col is Symbol or AbstractString and it is missing from df then a new column is allocated and added; the length of the column is always the value of nrow(df) before the assignment takes place;
    • in the df[!, col] .= v syntax column col is replaced by a freshly allocated vector; if col is Symbol or AbstractString and it is missing from df then a new column is allocated and added; the length of the column is always the value of nrow(df) before the assignment takes place;
    • the df[!, cols] .= v syntax replaces existing columns cols in data frame df with freshly allocated vectors;
    • df.col .= v syntax currently performs in-place assignment to an existing vector df.col; this behavior is deprecated and a new column will be allocated in the future. Starting from Julia 1.7 if :col is not present in df then a new column will be created in df.
    • in the sdf[CartesianIndex(row, col)] .= v, sdf[row, col] .= v and sdf[row, cols] .= v syntaxes the assignment to sdf is performed in-place;
    • in the sdf[rows, col] .= v and sdf[rows, cols] .= v syntaxes the assignment to sdf is performed in-place; if rows is : and col is a Symbol or AbstractString referring to a column missing from sdf and sdf was created with : as column selector then a new column is allocated and added; the filtered-out rows are filled with missing;
    • in the sdf[!, col] .= v syntax column col is replaced by a freshly allocated vector; the filtered-out rows are filled with values already present in col; if col is a Symbol or AbstractString referring to a column missing from sdf and sdf was created with : as column selector then a new column is allocated and added; in this case the filtered-out rows are filled with missing;
    • the sdf[!, cols] .= v syntax replaces existing columns cols in data frame sdf with freshly allocated vectors; the filtered-out rows are filled with values already present in cols;
    • sdf.col .= v syntax currently performs in-place assignment to an existing vector sdf.col; this behavior is deprecated and a new column will be allocated in the future. Starting from Julia 1.7 if :col is not present in sdf then a new column will be created in sdf if sdf was created with : as a column selector.
    • dfr.col .= v syntax is allowed and performs in-place assignment to a value extracted by dfr.col.

    Note that sdf[!, col] .= v and sdf[!, cols] .= v syntaxes are not allowed as sdf can be only modified in-place.

    If column indexing using Symbol or AbstractString names in cols is performed, the order of columns in the operation is specified by the order of names.
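
    A sketch contrasting the in-place and replacing broadcasting forms above (names invented for the example):

```julia
julia> using DataFrames

julia> df = DataFrame(a=1:3, b=4:6);

julia> col_a = df.a;

julia> df[:, :a] .= 0;    # in-place: mutates the existing vector

julia> df.a === col_a
true

julia> col_b = df.b;

julia> df[!, :b] .= 0;    # replaces :b with a freshly allocated vector

julia> df.b === col_b
false

julia> df[:, :c] .= 1;    # rows is : and :c is missing, so a new column is added
```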

    Indexing GroupedDataFrames

    A GroupedDataFrame can behave as either an AbstractVector or AbstractDict depending on the type of index used. Integers (or arrays of them) trigger vector-like indexing while Tuples and NamedTuples trigger dictionary-like indexing. An intermediate between the two is the GroupKey type returned by keys(::GroupedDataFrame), which behaves similarly to a NamedTuple but has performance on par with integer indexing.

    The elements of a GroupedDataFrame are SubDataFrames of its parent.

    • gd[i::Integer] -> Get the ith group.
    • gd[key::NamedTuple] -> Get the group corresponding to the given values of the grouping columns. The fields of the NamedTuple must match the grouping columns passed to groupby (including order).
    • gd[key::Tuple] -> Same as previous, but omitting the names on key.
    • get(gd, key::Union{Tuple, NamedTuple}, default) -> Get group for key key, returning default if it does not exist.
    • gd[key::GroupKey] -> Get the group corresponding to the GroupKey key (one of the elements of the vector returned by keys(::GroupedDataFrame)). This should be nearly as fast as integer indexing.
    • gd[a::AbstractVector] -> Select multiple groups and return them in a new GroupedDataFrame object. Groups may be selected by integer position using an array of Integers or Bools, similar to a standard array. Alternatively the array may contain keys of any of the types supported for dictionary-like indexing (GroupKey, Tuple, or NamedTuple). Selected groups must be unique, and different types of indices cannot be mixed.
    • gd[n::Not] -> Any of the above types wrapped in Not. The result will be a new GroupedDataFrame containing all groups in gd not selected by the wrapped index.
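
    The indexing modes above can be sketched as follows (grouping data invented for the example):

```julia
julia> using DataFrames

julia> df = DataFrame(g=[1, 1, 2], x=1:3);

julia> gd = groupby(df, :g);

julia> gd[2] == gd[(g=2,)] == gd[(2,)]  # integer, NamedTuple and Tuple indexing agree
true

julia> k = keys(gd)[2];                 # a GroupKey

julia> gd[k] == gd[2]
true

julia> length(gd[Not(1)])               # all groups except the first
1
```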

    Common API for types defined in DataFrames.jl

    This table presents return value types of calling names, propertynames, keys, length and ndims on types exposed to the user by DataFrames.jl:

    Type               names           propertynames    keys            length     ndims
    AbstractDataFrame  Vector{String}  Vector{Symbol}   undefined       undefined  2
    DataFrameRow       Vector{String}  Vector{Symbol}   Vector{Symbol}  Int        1
    DataFrameRows      Vector{String}  Vector{Symbol}   vector of Int   Int        1
    DataFrameColumns   Vector{String}  Vector{Symbol}   Vector{Symbol}  Int        1
    GroupedDataFrame   Vector{String}  tuple of fields  GroupKeys       Int        1
    GroupKeys          undefined       tuple of fields  vector of Int   Int        1
    GroupKey           Vector{String}  Vector{Symbol}   Vector{Symbol}  Int        1

    Additionally, for the above types T (i.e. AbstractDataFrame, DataFrameRow, DataFrameRows, DataFrameColumns, GroupedDataFrame, GroupKeys, GroupKey) the following methods are defined:

    • size(::T) returning a Tuple of Int.
    • size(::T, ::Integer) returning an Int.
    • axes(::T) returning a Tuple of Int vectors.
    • axes(::T, ::Integer) returning an Int vector for a valid dimension (except DataFrameRows and GroupKeys for which Base.OneTo(1) is also returned for a dimension higher than a valid one because they are AbstractVector).
    • firstindex(::T) returning 1 (except AbstractDataFrame for which it is undefined).
    • firstindex(::T, ::Integer) returning 1 for a valid dimension (except DataFrameRows and GroupKeys for which 1 is also returned for a dimension higher than a valid one because they are AbstractVector).
    • lastindex(::T) returning Int (except AbstractDataFrame for which it is undefined).
    • lastindex(::T, ::Integer) returning Int for a valid dimension (except DataFrameRows and GroupKeys for which 1 is also returned for a dimension higher than a valid one because they are AbstractVector).
    +Indexing · DataFrames.jl

    Indexing

      General rules

      The following rules explain target functionality of how getindex, setindex!, view, and broadcasting are intended to work with DataFrame, SubDataFrame and DataFrameRow objects.

      The following values are a valid column index:

      • a scalar, later denoted as col:
        • a Symbol;
        • an AbstractString;
        • an Integer that is not Bool;
      • a vector, later denoted as cols:
        • a vector of Symbol (does not have to be a subtype of AbstractVector{Symbol});
        • a vector of AbstractString (does not have to be a subtype of AbstractVector{<:AbstractString});
        • a vector of Integer that are not Bool (does not have to be a subtype of AbstractVector{<:Integer});
        • a vector of Bool (must be a subtype of AbstractVector{Bool});
        • a regular expression (will be expanded to a vector of matching column names);
        • a Not expression (see InvertedIndices.jl); Not(idx) selects all indices not in the passed idx; when passed as column selector Not(idx...) is equivalent to Not(Cols(idx...)).
        • a Cols expression (see DataAPI.jl); Cols(idxs...) selects the union of the selections in idxs; in particular Cols() selects no columns and Cols(:) selects all columns; a special rule is Cols(predicate), where predicate is a predicate function; in this case the columns whose names passed to predicate as strings return true are selected.
        • a Between expression (see DataAPI.jl); Between(first, last) selects the columns between first and last inclusively;
        • an All expression (see DataAPI.jl); All() selects all columns, equivalent to :;
        • a literal colon : (selects all columns).

      The following values are a valid row index:

      • a scalar, later denoted as row:
        • an Integer that is not Bool;
      • a vector, later denoted as rows:
        • a vector of Integer that are not Bool (does not have to be a subtype of AbstractVector{<:Integer});
        • a vector of Bool (must be a subtype of AbstractVector{Bool});
        • a Not expression (see InvertedIndices.jl);
        • a literal colon : (selects all rows with copying);
        • a literal exclamation mark ! (selects all rows without copying).

      Additionally it is allowed to index into an AbstractDataFrame using a two-dimensional CartesianIndex.

      In the descriptions below df represents a DataFrame, sdf is a SubDataFrame and dfr is a DataFrameRow.

      : always expands to axes(df, 1) or axes(sdf, 1).

      df.col works like df[!, col] and sdf.col works like sdf[!, col] in all cases. An exception is that under Julia 1.6 or earlier df.col .= v and sdf.col .= v performs in-place broadcasting if col is present in df/sdf and is a valid identifier (this inconsistency is not present under Julia 1.7 and later).

      getindex and view

      The following list specifies the behavior of getindex and view operations depending on argument types.

      In particular a description explicitly mentions that the data is copied or reused without copying.

      For performance reasons, accessing, via getindex or view, a single row and multiple cols of a DataFrame, a SubDataFrame or a DataFrameRow always returns a DataFrameRow (which is a view type).

      getindex on DataFrame:

      • df[row, col] -> the value contained in row row of column col, the same as df[!, col][row];
      • df[CartesianIndex(row, col)] -> the same as df[row, col];
      • df[row, cols] -> a DataFrameRow with parent df;
      • df[rows, col] -> a copy of the vector df[!, col] with only the entries corresponding to rows selected, the same as df[!, col][rows];
      • df[rows, cols] -> a DataFrame containing copies of columns cols with only the entries corresponding to rows selected;
      • df[!, col] -> the vector contained in column col returned without copying; the same as df.col if col is a valid identifier.
      • df[!, cols] -> create a new DataFrame with columns cols without copying of columns; the same as select(df, cols, copycols=false).

      view on DataFrame:

      • @view df[row, col] -> a 0-dimensional view into df[!, col] in row row, the same as view(df[!, col], row);
      • @view df[CartesianIndex(row, col)] -> the same as @view df[row, col];
      • @view df[row, cols] -> the same as df[row, cols];
      • @view df[rows, col] -> a view into df[!, col] with rows selected, the same as view(df[!, col], rows);
      • @view df[rows, cols] -> a SubDataFrame with rows selected with parent df;
      • @view df[!, col] -> a view into df[!, col] with all rows.
      • @view df[!, cols] -> the same as @view df[:, cols].

      getindex on SubDataFrame:

      • sdf[row, col] -> a value contained in row row of column col;
      • sdf[CartesianIndex(row, col)] -> the same as sdf[row, col];
      • sdf[row, cols] -> a DataFrameRow with parent parent(sdf);
      • sdf[rows, col] -> a copy of sdf[!, col] with only rows rows selected, the same as sdf[!, col][rows];
      • sdf[rows, cols] -> a DataFrame containing columns cols and sdf[rows, col] as a vector for each col in cols;
      • sdf[!, col] -> a view of entries corresponding to sdf in the vector parent(sdf)[!, col]; the same as sdf.col if col is a valid identifier.
      • sdf[!, cols] -> create a new SubDataFrame with columns cols, the same parent as sdf, and the same rows selected; the same as select(sdf, cols, copycols=false).

      view on SubDataFrame:

      • @view sdf[row, col] -> a 0-dimensional view into df[!, col] at row row, the same as view(sdf[!, col], row);
      • @view sdf[CartesianIndex(row, col)] -> the same as @view sdf[row, col];
      • @view sdf[row, cols] -> a DataFrameRow with parent parent(sdf);
      • @view sdf[rows, col] -> a view into sdf[!, col] vector with rows selected, the same as view(sdf[!, col], rows);
      • @view sdf[rows, cols] -> a SubDataFrame with parent parent(sdf);
      • @view sdf[!, col] -> a view into sdf[!, col] vector with all rows.
      • @view sdf[!, cols] -> the same as @view sdf[:, cols].

      getindex on DataFrameRow:

      • dfr[col] -> the value contained in column col of dfr; the same as dfr.col if col is a valid identifier;
      • dfr[cols] -> a DataFrameRow with parent parent(dfr);

      view on DataFrameRow:

      • @view dfr[col] -> a 0-dimensional view into parent(dfr)[DataFrames.row(dfr), col];
      • @view dfr[cols] -> a DataFrameRow with parent parent(dfr);

      Note that views created with columns selector set to : change their columns' count if columns are added/removed/renamed in the parent; if column selector is other than : then view points to selected columns by their number at the moment of creation of the view.
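      The column-tracking behavior of the : selector can be demonstrated as follows (a sketch with illustrative data):

```julia
using DataFrames

df = DataFrame(a = 1:3)
sdf_all  = @view df[:, :]     # `:` column selector: tracks parent columns
sdf_some = @view df[:, [:a]]  # explicit selector: columns fixed at creation

df.b = 4:6                    # add a column to the parent
names(sdf_all)                # ["a", "b"] -- the new column is visible
names(sdf_some)               # ["a"]      -- frozen at creation time
```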

      setindex!

      The following list specifies the behavior of setindex! operations depending on argument types.

      In particular a description explicitly mentions if the assignment is in-place.

      Note that if a setindex! operation throws an error the target data frame may be partially changed so it is unsafe to use it afterwards (the column length correctness will be preserved).

      setindex! on DataFrame:

      • df[row, col] = v -> set value of col in row row to v in-place;
      • df[CartesianIndex(row, col)] = v -> the same as df[row, col] = v;
      • df[row, cols] = v -> set row row of columns cols in-place; the same as dfr = df[row, cols]; dfr[:] = v;
      • df[rows, col] = v -> set rows rows of column col in-place; v must be an AbstractVector; if rows is : and col is a Symbol or AbstractString that is not present in df then a new column in df is created and holds a copy of v; equivalent to df.col = copy(v) if col is a valid identifier;
      • df[rows, cols] = v -> set rows rows of columns cols in-place; v must be an AbstractMatrix or an AbstractDataFrame (in this case column names must match);
      • df[!, col] = v -> replaces col with v without copying (with the exception that if v is an AbstractRange it gets converted to a Vector); also if col is a Symbol or AbstractString that is not present in df then a new column in df is created and holds v; equivalent to df.col = v if col is a valid identifier; this is allowed if ncol(df) == 0 || length(v) == nrow(df);
      • df[!, cols] = v -> replaces existing columns cols in data frame df with copying; v must be an AbstractMatrix or an AbstractDataFrame (in the latter case column names must match);
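      As a sketch of the in-place and non-copying rules above (illustrative data):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

df[1, :a] = 99            # in-place update of a single cell
df[:, :a] = [10, 20, 30]  # in-place: writes into the existing vector

v = [7, 8, 9]
df[!, :b] = v             # no copy: the vector is stored as-is
df.b === v                # true
```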

      setindex! on SubDataFrame:

      • sdf[row, col] = v -> set value of col in row row to v in-place;
      • sdf[CartesianIndex(row, col)] = v -> the same as sdf[row, col] = v;
      • sdf[row, cols] = v -> the same as dfr = sdf[row, cols]; dfr[:] = v in-place;
      • sdf[rows, col] = v -> set rows rows of column col in-place; v must be an AbstractVector;
      • sdf[rows, cols] = v -> set rows rows of columns cols in-place; v can be an AbstractMatrix or v can be AbstractDataFrame in which case column names must match;
      • sdf[!, col] = v -> replaces col with v with copying; if col is present in sdf then filtered-out rows in newly created vector are filled with values already present in that column and promote_type is used to determine the eltype of the new column; if col is not present in sdf then the operation is only allowed if sdf was created with : as column selector, in which case filtered-out rows are filled with missing; equivalent to sdf.col = v if col is a valid identifier; operation is allowed if length(v) == nrow(sdf);
      • sdf[!, cols] = v -> replaces existing columns cols in data frame sdf with copying; v must be an AbstractMatrix or an AbstractDataFrame (in the latter case column names must match); filtered-out rows in newly created vectors are filled with values already present in respective columns and promote_type is used to determine the eltype of the new columns;
      Note

      The rules above mean that sdf[:, col] = v is an in-place operation if col is present in sdf, therefore it will be fast in general. On the other hand using sdf[!, col] = v or sdf.col = v will always allocate a new vector, which is more expensive computationally.
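      A sketch contrasting the two assignment styles on a SubDataFrame (illustrative data):

```julia
using DataFrames

df  = DataFrame(a = 1:4)
sdf = @view df[2:3, :]

sdf[:, :a] = [20, 30]     # in-place: writes into parent rows 2 and 3
df.a == [1, 20, 30, 4]    # true

old = df.a
sdf[!, :a] = [200, 300]   # allocates a fresh parent column
df.a !== old              # true; filtered-out rows keep their previous values
df.a == [1, 200, 300, 4]  # true
```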

      setindex! on DataFrameRow:

      • dfr[col] = v -> set value of col in row row to v in-place; equivalent to dfr.col = v if col is a valid identifier;
      • dfr[cols] = v -> set values of entries in columns cols in dfr by elements of v in place; v can be: 1) a Tuple or an AbstractArray, in which cases it must have a number of elements equal to length(dfr), 2) an AbstractDict, in which case column names must match, 3) a NamedTuple or DataFrameRow, in which case column names and order must match;
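      For example (a sketch; names and values are illustrative):

```julia
using DataFrames

df  = DataFrame(a = [1, 2], b = [3, 4])
dfr = df[1, :]                    # a DataFrameRow viewing row 1

dfr[:a] = 10                      # writes through to df
dfr[[:a, :b]] = (a = 11, b = 12)  # NamedTuple: names and order must match
df[1, :a] == 11                   # true
```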

      Broadcasting

      The following broadcasting rules apply to AbstractDataFrame objects:

      • AbstractDataFrame behaves in broadcasting like a two-dimensional collection compatible with matrices.
      • If an AbstractDataFrame takes part in broadcasting then a DataFrame is always produced as a result. In this case the requested broadcasting operation produces an object with exactly two dimensions. An exception is when an AbstractDataFrame is used only as a source of broadcast assignment into an object of dimensionality higher than two.
      • If multiple AbstractDataFrame objects take part in broadcasting then they have to have identical column names.

      Note that if a broadcasting assignment operation throws an error the target data frame may be partially changed, so it is unsafe to use it afterwards (the column length correctness will be preserved).

      Broadcasting DataFrameRow is currently not allowed (which is consistent with NamedTuple).

      It is possible to assign a value to AbstractDataFrame and DataFrameRow objects using the .= operator. In such an operation AbstractDataFrame is considered as two-dimensional and DataFrameRow as single-dimensional.

      Note

      The rule above means that, similar to single-dimensional objects in Base (e.g. vectors), DataFrameRow is considered to be column-oriented.

      Additional rules:

      • in the df[CartesianIndex(row, col)] .= v, df[row, col] .= v syntaxes v is broadcasted into the contents of df[row, col] (this is consistent with Julia Base);
      • in the df[row, cols] .= v syntax the assignment to df is performed in-place;
      • in the df[rows, col] .= v and df[rows, cols] .= v syntaxes the assignment to df is performed in-place; if rows is : and col is Symbol or AbstractString and it is missing from df then a new column is allocated and added; the length of the column is always the value of nrow(df) before the assignment takes place;
      • in the df[!, col] .= v syntax column col is replaced by a freshly allocated vector; if col is Symbol or AbstractString and it is missing from df then a new column is allocated and added; the length of the column is always the value of nrow(df) before the assignment takes place;
      • the df[!, cols] .= v syntax replaces existing columns cols in data frame df with freshly allocated vectors;
      • df.col .= v syntax currently performs in-place assignment to an existing vector df.col; this behavior is deprecated and a new column will be allocated in the future. Starting from Julia 1.7 if :col is not present in df then a new column will be created in df.
      • in the sdf[CartesianIndex(row, col)] .= v, sdf[row, col] .= v and sdf[row, cols] .= v syntaxes the assignment to sdf is performed in-place;
      • in the sdf[rows, col] .= v and sdf[rows, cols] .= v syntaxes the assignment to sdf is performed in-place; if rows is : and col is a Symbol or AbstractString referring to a column missing from sdf and sdf was created with : as column selector then a new column is allocated and added; the filtered-out rows are filled with missing;
      • in the sdf[!, col] .= v syntax column col is replaced by a freshly allocated vector; the filtered-out rows are filled with values already present in col; if col is a Symbol or AbstractString referring to a column missing from sdf and sdf was created with : as column selector then a new column is allocated and added; in this case the filtered-out rows are filled with missing;
      • the sdf[!, cols] .= v syntax replaces existing columns cols in data frame sdf with freshly allocated vectors; the filtered-out rows are filled with values already present in cols;
      • sdf.col .= v syntax currently performs in-place assignment to an existing vector sdf.col; this behavior is deprecated and a new column will be allocated in the future. Starting from Julia 1.7 if :col is not present in sdf then a new column will be created in sdf if sdf was created with : as a column selector.
      • dfr.col .= v syntax is allowed and performs in-place assignment to a value extracted by dfr.col.

      Note that sdf[!, col] .= v and sdf[!, cols] .= v syntaxes are not allowed as sdf can be only modified in-place.

      If column indexing using Symbol or AbstractString names in cols is performed, the order of columns in the operation is specified by the order of names.
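      A few of the broadcasting rules above in action (a sketch with illustrative data):

```julia
using DataFrames

df  = DataFrame(a = [1, 2], b = [3, 4])
df2 = df .+ 10      # broadcasting always produces a DataFrame

df[:, :a] .= 0      # in-place broadcast assignment into column :a
df[!, :c] .= 1      # allocates and adds a new column :c of length nrow(df)
df.c == [1, 1]      # true
```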

      Indexing GroupedDataFrames

      A GroupedDataFrame can behave as either an AbstractVector or an AbstractDict depending on the type of index used. Integers (or arrays of them) trigger vector-like indexing while Tuples and NamedTuples trigger dictionary-like indexing. An intermediate between the two is the GroupKey type returned by keys(::GroupedDataFrame), which behaves similarly to a NamedTuple but has performance on par with integer indexing.

      The elements of a GroupedDataFrame are SubDataFrames of its parent.

      • gd[i::Integer] -> Get the ith group.
      • gd[key::NamedTuple] -> Get the group corresponding to the given values of the grouping columns. The fields of the NamedTuple must match the grouping columns passed to groupby (including order).
      • gd[key::Tuple] -> Same as previous, but omitting the names on key.
      • get(gd, key::Union{Tuple, NamedTuple}, default) -> Get group for key key, returning default if it does not exist.
      • gd[key::GroupKey] -> Get the group corresponding to the GroupKey key (one of the elements of the vector returned by keys(::GroupedDataFrame)). This should be nearly as fast as integer indexing.
      • gd[a::AbstractVector] -> Select multiple groups and return them in a new GroupedDataFrame object. Groups may be selected by integer position using an array of Integers or Bools, similar to a standard array. Alternatively the array may contain keys of any of the types supported for dictionary-like indexing (GroupKey, Tuple, or NamedTuple). Selected groups must be unique, and different types of indices cannot be mixed.
      • gd[n::Not] -> Any of the above types wrapped in Not. The result will be a new GroupedDataFrame containing all groups in gd not selected by the wrapped index.
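      These indexing styles can be sketched as follows (illustrative data):

```julia
using DataFrames

df = DataFrame(g = ["a", "a", "b"], x = 1:3)
gd = groupby(df, :g)

gd[1]             # first group, a SubDataFrame
gd[(g = "b",)]    # dictionary-like lookup via a NamedTuple
gd[("b",)]        # the same, with names omitted
k = keys(gd)[2]   # a GroupKey: name-aware, near integer-indexing speed
gd[k].x == [3]    # true
gd[Not(1)]        # a new GroupedDataFrame without the first group
```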

      Common API for types defined in DataFrames.jl

      This table presents return value types of calling names, propertynames, keys, length and ndims on types exposed to the user by DataFrames.jl:

      Type              | names          | propertynames   | keys           | length    | ndims
      ------------------|----------------|-----------------|----------------|-----------|------
      AbstractDataFrame | Vector{String} | Vector{Symbol}  | undefined      | undefined | 2
      DataFrameRow      | Vector{String} | Vector{Symbol}  | Vector{Symbol} | Int       | 1
      DataFrameRows     | Vector{String} | Vector{Symbol}  | vector of Int  | Int       | 1
      DataFrameColumns  | Vector{String} | Vector{Symbol}  | Vector{Symbol} | Int       | 1
      GroupedDataFrame  | Vector{String} | tuple of fields | GroupKeys      | Int       | 1
      GroupKeys         | undefined      | tuple of fields | vector of Int  | Int       | 1
      GroupKey          | Vector{String} | Vector{Symbol}  | Vector{Symbol} | Int       | 1

      Additionally, for the above types T (i.e. AbstractDataFrame, DataFrameRow, DataFrameRows, DataFrameColumns, GroupedDataFrame, GroupKeys, GroupKey) the following methods are defined:

      • size(::T) returning a Tuple of Int.
      • size(::T, ::Integer) returning an Int.
      • axes(::T) returning a Tuple of Int vectors.
      • axes(::T, ::Integer) returning an Int vector for a valid dimension (except DataFrameRows and GroupKeys for which Base.OneTo(1) is also returned for a dimension higher than a valid one because they are AbstractVector).
      • firstindex(::T) returning 1 (except AbstractDataFrame for which it is undefined).
      • firstindex(::T, ::Integer) returning 1 for a valid dimension (except DataFrameRows and GroupKeys for which 1 is also returned for a dimension higher than a valid one because they are AbstractVector).
      • lastindex(::T) returning Int (except AbstractDataFrame for which it is undefined).
      • lastindex(::T, ::Integer) returning Int for a valid dimension (except DataFrameRows and GroupKeys for which 1 is also returned for a dimension higher than a valid one because they are AbstractVector).

      Internals

      Internal API

      The functions, methods and types listed on this page are internal to DataFrames and are not considered to be part of the public API.

      DataFrames.compacttypeFunction
      compacttype(T::Type, maxwidth::Int=8, initial::Bool=true)

      Return compact string representation of type T.

      When displaying a data frame we do not want the string representation of a type to be longer than maxwidth. This function implements the rules by which type names are cropped when they exceed maxwidth.

      source
      DataFrames.gennamesFunction
      gennames(n::Integer)

      Generate standardized names for columns of a DataFrame. The first name will be :x1, the second :x2, etc.

      source
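      For illustration (note that gennames is internal, so it must be qualified with the module name):

```julia
using DataFrames

DataFrames.gennames(3)  # [:x1, :x2, :x3]
```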
      DataFrames.getmaxwidthsFunction
      DataFrames.getmaxwidths(df::AbstractDataFrame,
                               io::IO,
                               rowindices1::AbstractVector{Int},
                               rowindices2::AbstractVector{Int},
                               rowlabel::Symbol,
                               rowid::Union{Integer, Nothing},
                               show_eltype::Bool,
                               buffer::IOBuffer)

      Calculate, for each column of an AbstractDataFrame, the maximum string width used to render the name of that column, its type, and the longest entry in that column among the rows of the data frame that will be rendered to IO. The widths for all columns are returned as a vector.

      Return a Vector{Int} giving the maximum string widths required to render each column, including that column's name and type.

      NOTE: The last entry of the result vector is the string width of the implicit row ID column contained in every AbstractDataFrame.

      Arguments

      • df::AbstractDataFrame: The data frame whose columns will be printed.
      • io::IO: The IO to which df is to be printed
      • rowindices1::AbstractVector{Int}: A set of indices of the first chunk of the AbstractDataFrame that would be rendered to IO.
      • rowindices2::AbstractVector{Int}: A set of indices of the second chunk of the AbstractDataFrame that would be rendered to IO. Can be empty if the AbstractDataFrame would be printed without any ellipses.
      • rowlabel::AbstractString: The label that will be used when rendering the numeric IDs of each row. Typically, this will be set to "Row".
      • rowid: Used to handle showing DataFrameRow.
      • show_eltype: Whether to print the column type under the column name in the heading.
      • buffer: buffer passed around to avoid reallocations in ourstrwidth
      source
      DataFrames.ourshowFunction
      DataFrames.ourshow(io::IO, x::Any, truncstring::Int)

      Render a value to an IO object compactly using print. truncstring indicates the approximate number of characters at which to truncate the output (if it is a non-positive value then no truncation is applied).

      source
      DataFrames.ourstrwidthFunction
      DataFrames.ourstrwidth(io::IO, x::Any, buffer::IOBuffer, truncstring::Int)

      Determine the number of characters that would be used to print a value.

      source
      DataFrames.@spawn_for_chunksMacro
      @spawn_for_chunks basesize for i in range ... end

      Parallelize a for loop by spawning separate tasks, each iterating over a chunk of at least basesize elements in range.

      A number of tasks higher than Threads.nthreads() may be spawned, since that can allow for a more efficient load balancing in case some threads are busy (nested parallelism).

      source
      DataFrames.@spawn_or_run_taskMacro
      @spawn_or_run_task threads expr

      Equivalent to Threads.@spawn if threads === true, otherwise run expr and return a Task that returns its value.

      source
      DataFrames.default_table_transformationFunction
      default_table_transformation(df_sel::AbstractDataFrame, fun)

      This is a default implementation called when AsTable(...) => fun is requested. The df_sel argument is a data frame storing columns selected by AsTable(...) selector.

      source
      DataFrames.isreadonlyFunction
      isreadonly(fun)

      Trait returning a Bool indicating whether the function fun only reads the passed argument. Such a function guarantees neither to modify nor to return the passed argument in any form. By default false is returned.

      This function might become a part of the public API of DataFrames.jl in the future, currently it should be considered experimental. Adding a method to isreadonly for a specific function fun will improve performance of AsTable(...) => ByRow(fun∘collect) operation.

      source

      Propagation of :note-style metadata

      An important design feature of :note-style metadata is how it is handled when data frames are transformed.

      Note

      The provided rules might slightly change in the future. Any change to :note-style metadata propagation rules will not be considered as breaking and can be done in any minor release of DataFrames.jl. Such changes might be made based on users' feedback about what metadata propagation rules are most convenient in practice.

      The general design rules for propagation of :note-style metadata are as follows.

      For operations that take a single data frame as an input:

      For operations that take multiple data frames as their input two cases are distinguished:

      In all these operations when metadata is preserved the values in the key-value pairs are not copied (this is relevant in case of mutable values).

      Note

      The rules for column-level :note-style metadata propagation are designed to make the right decision in common cases. In particular, they assume that if the source and target column names are the same then the metadata for the column is not changed. While this is valid for many operations, it is not always true in general. For example the :x => ByRow(log) => :x transformation might invalidate metadata if it contained the unit of measure of the variable. In such cases the user must either use a different name for the output column, set the metadata style to :default before the operation, or manually drop or update such metadata on the :x column after the transformation.
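      A sketch of :note-style metadata propagation (the keys and values are illustrative):

```julia
using DataFrames

df = DataFrame(x = 1:3)
metadata!(df, "caption", "lengths"; style=:note)  # table-level :note metadata
colmetadata!(df, :x, "unit", "m"; style=:note)    # column-level :note metadata

df2 = select(df, :x)          # a transformation that preserves :note metadata
metadata(df2, "caption")      # "lengths"
colmetadata(df2, :x, "unit")  # "m"
```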

      Operations that preserve :note-style metadata

      Most of the functions in DataFrames.jl only preserve table and column metadata whose style is :note. Some functions use a more complex logic, even if they follow the general rules described above (in particular under any transformation all non-:note-style metadata is always dropped). These are:


      Types

      Type hierarchy design

      AbstractDataFrame is an abstract type that provides an interface for data frame types. It is not intended as a fully generic interface for working with tabular data, which is the role of interfaces defined by Tables.jl instead.

      DataFrame is the most fundamental subtype of AbstractDataFrame, which stores a set of columns as AbstractVector objects. Indexing of all stored columns must be 1-based. Also, all functions exposed by DataFrames.jl API make sure to collect passed AbstractRange source columns before storing them in a DataFrame.

      SubDataFrame is an AbstractDataFrame subtype representing a view into a DataFrame. It stores only a reference to the parent DataFrame and information about which rows and columns from the parent are selected (both as integer indices referring to the parent). Typically it is created using the view function or is returned by indexing into a GroupedDataFrame object.

      GroupedDataFrame is a type that stores the result of a grouping operation performed on an AbstractDataFrame. It is intended to be created as a result of a call to the groupby function.

      DataFrameRow is a view into a single row of an AbstractDataFrame. It stores only a reference to a parent DataFrame and information about which row and columns from the parent are selected (both as integer indices referring to the parent). The DataFrameRow type supports iteration over columns of the row and is similar in functionality to the NamedTuple type, but allows for modification of data stored in the parent DataFrame and reflects changes done to the parent after the creation of the view. Typically objects of the DataFrameRow type are encountered when returned by the eachrow function, or when accessing a single row of a DataFrame or SubDataFrame via getindex or view.

      The eachrow function returns a value of the DataFrameRows type, which serves as an iterator over rows of an AbstractDataFrame, returning DataFrameRow objects. The DataFrameRows is a subtype of AbstractVector and supports its interface with the exception that it is read-only.

      Similarly, the eachcol function returns a value of the DataFrameColumns type, which is not an AbstractVector, but supports most of its API. The key differences are that it is read-only and that the keys function returns a vector of Symbols (and not integers as for normal vectors).

      Note that DataFrameRows and DataFrameColumns are not exported and should not be constructed directly, but using the eachrow and eachcol functions.
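      The two iterators can be sketched as follows (illustrative data):

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

rs = eachrow(df)       # DataFrameRows: an AbstractVector of DataFrameRow
rs[1].a == 1           # true

ec = eachcol(df)       # DataFrameColumns: read-only, vector-like
keys(ec) == [:a, :b]   # true: keys are Symbols, not integers
ec[1] === df.a         # true: columns are not copied
```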

      The RepeatedVector and StackedVector types are subtypes of AbstractVector and support its interface with the exception that they are read only. Note that they are not exported and should not be constructed directly, but they are columns of a DataFrame returned by stack with view=true.

      The ByRow type is a special type used for selection operations to signal that the wrapped function should be applied to each element (row) of the selection.

      The AsTable type is a special type used for selection operations to signal that the columns selected by a wrapped selector should be passed as a NamedTuple to the function or to signal that it is requested to expand the return value of a transformation into multiple columns.

      The design of handling of columns of a DataFrame

      When a DataFrame is constructed columns are copied by default. You can disable this behavior by setting copycols keyword argument to false. The exception is if an AbstractRange is passed as a column, then it is always collected to a Vector.
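      A sketch of these copying rules (illustrative data):

```julia
using DataFrames

v   = [1, 2, 3]
df1 = DataFrame(a = v)                  # default: the column is a copy of v
df2 = DataFrame(a = v, copycols=false)  # no copy: df2.a aliases v

df1.a === v             # false
df2.a === v             # true
DataFrame(a = 1:3).a    # a Vector: ranges are always collected
```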

      Also functions that transform a DataFrame to produce a new DataFrame perform a copy of the columns, unless they are passed copycols=false (available only for functions that could perform a transformation without copying the columns). Examples of such functions are vcat, hcat, filter, dropmissing, getindex, copy or the DataFrame constructor mentioned above.

      The generic single-argument constructor DataFrame(table) has copycols=nothing by default, meaning that columns are copied unless table signals that a copy of columns doesn't need to be made (this is done by wrapping the source table in Tables.CopiedColumns). CSV.jl does this when CSV.read(file, DataFrame) is called, since columns are built only for the purpose of use in a DataFrame constructor. Another example is Arrow.Table, where arrow data is inherently immutable so columns can't be accidentally mutated anyway. To be able to mutate arrow data, columns must be materialized, which can be accomplished via DataFrame(arrow_table, copycols=true).

      On the contrary, functions that create a view of a DataFrame do not by definition make copies of the columns, and therefore require particular caution. This includes view, which returns a SubDataFrame or a DataFrameRow, and groupby, which returns a GroupedDataFrame.

      A partial exception to this rule is the stack function with view=true which creates a DataFrame that contains views of the columns from the source DataFrame.

      In-place functions whose names end with ! (like sort!, dropmissing!, setindex!, push!, append!) may mutate the column vectors of the DataFrame they take as an argument. These functions are safe to call due to the rules described above, except when a view of the DataFrame is in use (via a SubDataFrame, a DataFrameRow or a GroupedDataFrame). In the latter case, calling such a function on the parent might corrupt the view, which may trigger errors, silently return invalid data or even cause Julia to crash. The same caution applies when a DataFrame was created using columns of another DataFrame without copying (for instance when copycols=false in functions such as DataFrame or hcat).

      It is possible to have a direct access to a column col of a DataFrame df (e.g. this can be useful in performance critical code to avoid copying), using one of the following methods:

      • via the getproperty function using the syntax df.col;
      • via the getindex function using the syntax df[!, :col] (note this is in contrast to df[:, :col] which copies);
      • by creating a DataFrameColumns object using the eachcol function;
      • by calling the parent function on a view of a column of the DataFrame, e.g. parent(@view df[:, :col]);
      • by storing the reference to the column before creating a DataFrame with copycols=false;

      A column obtained from a DataFrame using one of the above methods should not be mutated without caution because:

      • resizing a column vector will corrupt its parent DataFrame and any associated views as methods only check the length of the column when it is added to the DataFrame and later assume that all columns have the same length;
      • reordering values in a column vector (e.g. using sort!) will break the consistency of rows with other columns, which will also affect views (if any);
      • changing values contained in a column vector is acceptable as long as it is not used as a grouping column in a GroupedDataFrame created based on the DataFrame.

      Types specification

      DataFrames.AbstractDataFrameType
      AbstractDataFrame

      An abstract type for which all concrete types expose an interface for working with tabular data.

      An AbstractDataFrame is a two-dimensional table with Symbols or strings for column names.

      DataFrames.jl defines two types that are subtypes of AbstractDataFrame: DataFrame and SubDataFrame.

      Indexing and broadcasting

      AbstractDataFrame can be indexed by passing two indices specifying row and column selectors. The allowed indices are a superset of indices that can be used for standard arrays. You can also access a single column of an AbstractDataFrame using getproperty and setproperty! functions. Columns can be selected using integers, Symbols, or strings. In broadcasting AbstractDataFrame behavior is similar to a Matrix.

      A detailed description of getindex, setindex!, getproperty, setproperty!, broadcasting and broadcasting assignment for data frames is given in the "Indexing" section of the manual.

      source
      DataFrames.AsTableType
      AsTable(cols)

      A type having a special meaning in source => transformation => destination selection operations supported by combine, select, select!, transform, transform!, subset, and subset!.

      If AsTable(cols) is used in source position it signals that the columns selected by the wrapped selector cols should be passed as a NamedTuple to the function.

      If AsTable is used in destination position it means that the result of the transformation operation is a vector of containers (or a single container if ByRow(transformation) is used) that should be expanded into multiple columns using keys to get column names.

      Examples

      julia> df1 = DataFrame(a=1:3, b=11:13)
Types · DataFrames.jl

      Types

      Type hierarchy design

      AbstractDataFrame is an abstract type that provides an interface for data frame types. It is not intended as a fully generic interface for working with tabular data, which is the role of interfaces defined by Tables.jl instead.

      DataFrame is the most fundamental subtype of AbstractDataFrame, which stores a set of columns as AbstractVector objects. Indexing of all stored columns must be 1-based. Also, all functions exposed by DataFrames.jl API make sure to collect passed AbstractRange source columns before storing them in a DataFrame.

      SubDataFrame is an AbstractDataFrame subtype representing a view into a DataFrame. It stores only a reference to the parent DataFrame and information about which rows and columns from the parent are selected (both as integer indices referring to the parent). Typically it is created using the view function or is returned by indexing into a GroupedDataFrame object.

      GroupedDataFrame is a type that stores the result of a grouping operation performed on an AbstractDataFrame. It is intended to be created as a result of a call to the groupby function.

      DataFrameRow is a view into a single row of an AbstractDataFrame. It stores only a reference to a parent DataFrame and information about which row and columns from the parent are selected (both as integer indices referring to the parent). The DataFrameRow type supports iteration over columns of the row and is similar in functionality to the NamedTuple type, but allows for modification of data stored in the parent DataFrame and reflects changes done to the parent after the creation of the view. Typically objects of the DataFrameRow type are encountered when returned by the eachrow function, or when accessing a single row of a DataFrame or SubDataFrame via getindex or view.

      The eachrow function returns a value of the DataFrameRows type, which serves as an iterator over rows of an AbstractDataFrame, returning DataFrameRow objects. The DataFrameRows is a subtype of AbstractVector and supports its interface with the exception that it is read-only.

      Similarly, the eachcol function returns a value of the DataFrameColumns type, which is not an AbstractVector, but supports most of its API. The key differences are that it is read-only and that the keys function returns a vector of Symbols (and not integers as for normal vectors).

      Note that DataFrameRows and DataFrameColumns are not exported and should not be constructed directly, but using the eachrow and eachcol functions.

      The RepeatedVector and StackedVector types are subtypes of AbstractVector and support its interface with the exception that they are read only. Note that they are not exported and should not be constructed directly, but they are columns of a DataFrame returned by stack with view=true.

      The ByRow type is a special type used for selection operations to signal that the wrapped function should be applied to each element (row) of the selection.

      The AsTable type is a special type used for selection operations to signal that the columns selected by a wrapped selector should be passed as a NamedTuple to the function or to signal that it is requested to expand the return value of a transformation into multiple columns.

      The design of handling of columns of a DataFrame

When a DataFrame is constructed, columns are copied by default. You can disable this behavior by setting the copycols keyword argument to false. The exception is that if an AbstractRange is passed as a column, it is always collected to a Vector.

      Also functions that transform a DataFrame to produce a new DataFrame perform a copy of the columns, unless they are passed copycols=false (available only for functions that could perform a transformation without copying the columns). Examples of such functions are vcat, hcat, filter, dropmissing, getindex, copy or the DataFrame constructor mentioned above.

      The generic single-argument constructor DataFrame(table) has copycols=nothing by default, meaning that columns are copied unless table signals that a copy of columns doesn't need to be made (this is done by wrapping the source table in Tables.CopiedColumns). CSV.jl does this when CSV.read(file, DataFrame) is called, since columns are built only for the purpose of use in a DataFrame constructor. Another example is Arrow.Table, where arrow data is inherently immutable so columns can't be accidentally mutated anyway. To be able to mutate arrow data, columns must be materialized, which can be accomplished via DataFrame(arrow_table, copycols=true).
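These copying rules can be checked directly with ===; a minimal sketch (the column names are illustrative):

```julia
using DataFrames

v = [1, 2, 3]
df = DataFrame(x=v)                        # columns are copied by default
df_alias = DataFrame(x=v, copycols=false)  # columns are stored as passed

df.x === v                          # false: a defensive copy was made
df_alias.x === v                    # true: the column aliases v
DataFrame(r=1:3).r isa Vector{Int}  # true: ranges are always collected
```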

      On the contrary, functions that create a view of a DataFrame do not by definition make copies of the columns, and therefore require particular caution. This includes view, which returns a SubDataFrame or a DataFrameRow, and groupby, which returns a GroupedDataFrame.

      A partial exception to this rule is the stack function with view=true which creates a DataFrame that contains views of the columns from the source DataFrame.

In-place functions whose names end with ! (like sort!, dropmissing!, setindex!, push!, or append!) may mutate the column vectors of the DataFrame they take as an argument. These functions are safe to call due to the rules described above, except when a view of the DataFrame is in use (via a SubDataFrame, a DataFrameRow or a GroupedDataFrame). In the latter case, calling such a function on the parent might corrupt the view, which may trigger errors, silently return invalid data or even cause Julia to crash. The same caution applies when a DataFrame was created using columns of another DataFrame without copying (for instance when copycols=false in functions such as DataFrame or hcat).

It is possible to have direct access to a column col of a DataFrame df (e.g. this can be useful in performance-critical code to avoid copying), using one of the following methods:

      • via the getproperty function using the syntax df.col;
      • via the getindex function using the syntax df[!, :col] (note this is in contrast to df[:, :col] which copies);
      • by creating a DataFrameColumns object using the eachcol function;
      • by calling the parent function on a view of a column of the DataFrame, e.g. parent(@view df[:, :col]);
• by storing a reference to the column before creating the DataFrame with copycols=false.

      A column obtained from a DataFrame using one of the above methods should not be mutated without caution because:

      • resizing a column vector will corrupt its parent DataFrame and any associated views as methods only check the length of the column when it is added to the DataFrame and later assume that all columns have the same length;
      • reordering values in a column vector (e.g. using sort!) will break the consistency of rows with other columns, which will also affect views (if any);
      • changing values contained in a column vector is acceptable as long as it is not used as a grouping column in a GroupedDataFrame created based on the DataFrame.
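The difference between the copying and non-copying access methods listed above can be verified with ===; a small sketch:

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])
df.a === df[!, :a]                # true: both return the stored vector
df.a === df[:, :a]                # false: df[:, :a] copies
parent(@view df[:, :a]) === df.a  # true: the view wraps the stored vector
eachcol(df)[:a] === df.a          # true: DataFrameColumns does not copy
```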

      Types specification

DataFrames.AbstractDataFrame (Type)
      AbstractDataFrame

      An abstract type for which all concrete types expose an interface for working with tabular data.

      An AbstractDataFrame is a two-dimensional table with Symbols or strings for column names.

      DataFrames.jl defines two types that are subtypes of AbstractDataFrame: DataFrame and SubDataFrame.

      Indexing and broadcasting

      AbstractDataFrame can be indexed by passing two indices specifying row and column selectors. The allowed indices are a superset of indices that can be used for standard arrays. You can also access a single column of an AbstractDataFrame using getproperty and setproperty! functions. Columns can be selected using integers, Symbols, or strings. In broadcasting AbstractDataFrame behavior is similar to a Matrix.

      A detailed description of getindex, setindex!, getproperty, setproperty!, broadcasting and broadcasting assignment for data frames is given in the "Indexing" section of the manual.
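A few representative indexing forms (a sketch; the column names are illustrative):

```julia
using DataFrames

df = DataFrame(a=1:3, b=["x", "y", "z"])
df[1, :a]          # a single cell
df[2:3, [:a, :b]]  # a new data frame with selected rows and columns
df[df.a .> 1, :b]  # boolean row selection, as for arrays
df.b               # a whole column via getproperty
df.a .+ 1          # broadcasting over a column
```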

      source
DataFrames.AsTable (Type)
      AsTable(cols)

      A type having a special meaning in source => transformation => destination selection operations supported by combine, select, select!, transform, transform!, subset, and subset!.

      If AsTable(cols) is used in source position it signals that the columns selected by the wrapped selector cols should be passed as a NamedTuple to the function.

      If AsTable is used in destination position it means that the result of the transformation operation is a vector of containers (or a single container if ByRow(transformation) is used) that should be expanded into multiple columns using keys to get column names.

      Examples

      julia> df1 = DataFrame(a=1:3, b=11:13)
       3×2 DataFrame
        Row │ a      b
            │ Int64  Int64
       ─────┼──────────────
          1 │     1    121
          2 │     4    144
   3 │     9    169
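The example output above is truncated in this extract; the two uses of AsTable can be sketched as follows (column names a and b as in df1 above):

```julia
using DataFrames

df1 = DataFrame(a=1:3, b=11:13)

# AsTable in source position: the selected columns arrive as a NamedTuple
combine(df1, AsTable([:a, :b]) => ByRow(nt -> nt.a + nt.b) => :s)

# AsTable in destination position: returned NamedTuples are expanded
# into one column per key
combine(df1, [:a, :b] => ByRow((a, b) -> (lo=min(a, b), hi=max(a, b))) => AsTable)
```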
      source
DataFrames.DataFrame (Type)
      DataFrame <: AbstractDataFrame

      An AbstractDataFrame that stores a set of named columns.

      The columns are normally AbstractVectors stored in memory, particularly a Vector, PooledVector or CategoricalVector.

      Constructors

      DataFrame(pairs::Pair...; makeunique::Bool=false, copycols::Bool=true)
       DataFrame(pairs::AbstractVector{<:Pair}; makeunique::Bool=false, copycols::Bool=true)
       DataFrame(ds::AbstractDict; copycols::Bool=true)
       DataFrame(; kwargs..., copycols::Bool=true)
            │ Int64  Int64
       ─────┼──────────────
          1 │     1      0
   2 │     2      0
      source
DataFrames.DataFrameRow (Type)
      DataFrameRow{<:AbstractDataFrame, <:AbstractIndex}

      A view of one row of an AbstractDataFrame.

      A DataFrameRow is returned by getindex or view functions when one row and a selection of columns are requested, or when iterating the result of the call to the eachrow function.

      The DataFrameRow constructor can also be called directly:

      DataFrameRow(parent::AbstractDataFrame, row::Integer, cols=:)

      A DataFrameRow supports the iteration interface and can therefore be passed to functions that expect a collection as an argument. Its element type is always Any.

      Indexing is one-dimensional like specifying a column of a DataFrame. You can also access the data in a DataFrameRow using the getproperty and setproperty! functions and convert it to a Tuple, NamedTuple, or Vector using the corresponding functions.

      If the selection of columns in a parent data frame is passed as : (a colon) then DataFrameRow will always have all columns from the parent, even if they are added or removed after its creation.

      Examples

      julia> df = DataFrame(a=repeat([1, 2], outer=[2]),
                             b=repeat(["a", "b"], inner=[2]),
                             c=1:4)
       4×3 DataFrame
       3-element Vector{Any}:
        1
         "a"
 1
      source
DataFrames.GroupedDataFrame (Type)
      GroupedDataFrame

      The result of a groupby operation on an AbstractDataFrame; a view into the AbstractDataFrame grouped by rows.

      Not meant to be constructed directly, see groupby.

      One can get the names of columns used to create GroupedDataFrame using the groupcols function. Similarly the groupindices function returns a vector of group indices for each row of the parent data frame.

      After its creation, a GroupedDataFrame reflects the grouping of rows that was valid at its creation time. Therefore grouping columns of its parent data frame must not be mutated, and rows must not be added nor removed from it. To safeguard the user against such cases, if the number of rows in the parent data frame changes then trying to use GroupedDataFrame will throw an error. However, one can add or remove columns to the parent data frame without invalidating the GroupedDataFrame provided that columns used for grouping are not changed.
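A sketch of the safeguard described above:

```julia
using DataFrames

df = DataFrame(grp=[1, 2, 1], x=1:3)
gd = groupby(df, :grp)
push!(df, (grp=2, x=4))  # the parent now has a different number of rows

# gd is now stale: any use of it, e.g. combine(gd, :x => sum),
# throws an error because the grouping no longer matches the parent
```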

      source
DataFrames.GroupKey (Type)
      GroupKey{T<:GroupedDataFrame}

      Key for one of the groups of a GroupedDataFrame. Contains the values of the corresponding grouping columns and behaves similarly to a NamedTuple, but using it to index its GroupedDataFrame is more efficient than using the equivalent Tuple and NamedTuple, and much more efficient than using the equivalent AbstractDict.

      Instances of this type are returned by keys(::GroupedDataFrame) and are not meant to be constructed directly.

      Indexing fields of GroupKey is allowed using an integer, a Symbol, or a string. It is also possible to access the data in a GroupKey using the getproperty function. A GroupKey can be converted to a Tuple, NamedTuple, a Vector, or a Dict. When converted to a Dict, the keys of the Dict are Symbols.

      See keys(::GroupedDataFrame) for more information.
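A short sketch of working with GroupKey values:

```julia
using DataFrames

gd = groupby(DataFrame(grp=[1, 2, 1], x=1:3), :grp)
k = first(keys(gd))  # GroupKey of the first group
k.grp                # the grouping column's value, here 1
gd[k]                # efficient lookup of the corresponding group
NamedTuple(k)        # conversion, as for a NamedTuple
```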

      source
DataFrames.SubDataFrame (Type)
      SubDataFrame{<:AbstractDataFrame, <:AbstractIndex, <:AbstractVector{Int}} <: AbstractDataFrame

A view of an AbstractDataFrame. It is returned by a call to the view function on an AbstractDataFrame if a collection of rows and columns is specified.

      A SubDataFrame is an AbstractDataFrame, so expect that most DataFrame functions should work. Such methods include describe, summary, nrow, size, by, stack, and join.

      If the selection of columns in a parent data frame is passed as : (a colon) then SubDataFrame will always have all columns from the parent, even if they are added or removed after its creation.

      Examples

      julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
                             b=repeat([2, 1], outer=[4]),
                             c=1:8)
       8×3 DataFrame
            │ Int64  Int64  Int64
       ─────┼─────────────────────
          1 │     1      2      1
   2 │     1      2      5
      source
DataFrames.DataFrameRows (Type)
      DataFrameRows{D<:AbstractDataFrame} <: AbstractVector{DataFrameRow}

      Iterator over rows of an AbstractDataFrame, with each row represented as a DataFrameRow.

      A value of this type is returned by the eachrow function.

      source
DataFrames.DataFrameColumns (Type)
      DataFrameColumns{<:AbstractDataFrame}

      A vector-like object that allows iteration over columns of an AbstractDataFrame.

      Indexing into DataFrameColumns objects using integer, Symbol or string returns the corresponding column (without copying). Indexing into DataFrameColumns objects using a multiple column selector returns a subsetted DataFrameColumns object with a new parent containing only the selected columns (without copying).

      DataFrameColumns supports most of the AbstractVector API. The key differences are that it is read-only and that the keys function returns a vector of Symbols (and not integers as for normal vectors).

      In particular findnext, findprev, findfirst, findlast, and findall functions are supported, and in findnext and findprev functions it is allowed to pass an integer, string, or Symbol as a reference index.
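A sketch of the non-copying behavior and the Symbol keys:

```julia
using DataFrames

df = DataFrame(a=[1, 2], b=["x", "y"])
ec = eachcol(df)
keys(ec)         # [:a, :b]: Symbols, not integers
ec[:a] === df.a  # true: indexing returns the stored column
length(ec)       # 2, like a vector of columns
```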

      source
DataFrames.RepeatedVector (Type)
      RepeatedVector{T} <: AbstractVector{T}

      An AbstractVector that is a view into another AbstractVector with repeated elements

      NOTE: Not exported.

      Constructor

      RepeatedVector(parent::AbstractVector, inner::Int, outer::Int)

      Arguments

      • parent : the AbstractVector that's repeated
      • inner : the number of times each element is repeated
• outer : the number of times the whole vector is repeated after being expanded by inner

      inner and outer have the same meaning as similarly named arguments to repeat.
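The equivalence with Base.repeat can be checked directly:

```julia
# inner and outer behave like the keyword arguments of Base.repeat
repeat([1, 2], inner=3, outer=1) == [1, 1, 1, 2, 2, 2]        # true
repeat([1, 2], inner=1, outer=3) == [1, 2, 1, 2, 1, 2]        # true
repeat([1, 2], inner=2, outer=2) == [1, 1, 2, 2, 1, 1, 2, 2]  # true
```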

      Examples

      RepeatedVector([1, 2], 3, 1)   # [1, 1, 1, 2, 2, 2]
       RepeatedVector([1, 2], 1, 3)   # [1, 2, 1, 2, 1, 2]
RepeatedVector([1, 2], 2, 2)   # [1, 1, 2, 2, 1, 1, 2, 2]
      source
DataFrames.StackedVector (Type)
      StackedVector <: AbstractVector

      An AbstractVector that is a linear, concatenated view into another set of AbstractVectors

      NOTE: Not exported.

      Constructor

      StackedVector(d::AbstractVector)

      Arguments

      • d... : one or more AbstractVectors

      Examples

      StackedVector(Any[[1, 2], [9, 10], [11, 12]])  # [1, 2, 9, 10, 11, 12]
      source
In the examples given in this introductory tutorial we did not cover all options of the transformation mini-language. More advanced examples, in particular showing how to pass or produce multiple columns using the AsTable operation (which you might have seen in some DataFrames.jl demos), are given in the later sections of the manual.

Comparison with the R package dplyr

The following table compares the main functions of DataFrames.jl with the R package dplyr:
Operation                 | dplyr                         | DataFrames.jl
Reduce multiple values    | summarize(df, mean(x))        | combine(df, :x => mean)
Add new columns           | mutate(df, x_mean = mean(x))  | transform(df, :x => mean => :x_mean)
Rename columns            | rename(df, x_new = x)         | rename(df, :x => :x_new)
Pick columns              | select(df, x, y)              | select(df, :x, :y)
Pick & transform columns  | transmute(df, mean(x), y)     | select(df, :x => mean, :y)
Pick rows                 | filter(df, x >= 1)            | subset(df, :x => ByRow(x -> x >= 1))
Sort rows                 | arrange(df, x)                | sort(df, :x)
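The DataFrames.jl column of the table can be tried on a data frame shaped like the R setup in this section (a sketch; the values are illustrative):

```julia
using DataFrames, Statistics

df = DataFrame(grp=repeat(1:2, 3), x=6:-1:1, y=4:9)
combine(df, :x => mean)               # reduce multiple values
transform(df, :x => mean => :x_mean)  # add a new column
select(df, :x, :y)                    # pick columns
subset(df, :x => ByRow(x -> x >= 4))  # pick rows
sort(df, :x)                          # sort rows
```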

      As in dplyr, some of these functions can be applied to grouped data frames, in which case they operate by group:

Operation                 | dplyr                                     | DataFrames.jl
Reduce multiple values    | summarize(group_by(df, grp), mean(x))     | combine(groupby(df, :grp), :x => mean)
Add new columns           | mutate(group_by(df, grp), mean(x))        | transform(groupby(df, :grp), :x => mean)
Pick & transform columns  | transmute(group_by(df, grp), mean(x), y)  | select(groupby(df, :grp), :x => mean, :y)

      The table below compares more advanced commands:

Operation                  | dplyr                                                    | DataFrames.jl
Complex Function           | summarize(df, mean(x, na.rm = T))                        | combine(df, :x => x -> mean(skipmissing(x)))
Transform several columns  | summarize(df, max(x), min(y))                            | combine(df, :x => maximum, :y => minimum)
                           | summarize(df, across(c(x, y), mean))                     | combine(df, [:x, :y] .=> mean)
                           | summarize(df, across(starts_with("x"), mean))            | combine(df, names(df, r"^x") .=> mean)
                           | summarize(df, across(c(x, y), list(max, min)))           | combine(df, ([:x, :y] .=> [maximum minimum])...)
Multivariate function      | mutate(df, cor(x, y))                                    | transform(df, [:x, :y] => cor)
Row-wise                   | mutate(rowwise(df), min(x, y))                           | transform(df, [:x, :y] => ByRow(min))
                           | mutate(rowwise(df), which.max(c_across(matches("^x"))))  | transform(df, AsTable(r"^x") => ByRow(argmax))
DataFrame as input         | summarize(df, head(across(), 2))                         | combine(d -> first(d, 2), df)
DataFrame as output        | summarize(df, tibble(value = c(min(x), max(x))))         | combine(df, :x => (x -> (value = [minimum(x), maximum(x)],)) => AsTable)
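A sketch of the more advanced DataFrames.jl forms from the table, on the same illustrative data frame:

```julia
using DataFrames, Statistics

df = DataFrame(grp=repeat(1:2, 3), x=6:-1:1, y=4:9)
combine(df, [:x, :y] .=> mean)                    # several columns at once
combine(df, [:x, :y] => cor)                      # multivariate function
transform(df, [:x, :y] => ByRow(min) => :min_xy)  # row-wise
combine(df, :x => (x -> (value = [minimum(x), maximum(x)],)) => AsTable)
```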

      Comparison with the R package data.table

      The following table compares the main functions of DataFrames.jl with the R package data.table (version 1.14.1).

      library(data.table)
       df  <- data.table(grp = rep(1:2, 3), x = 6:1, y = 4:9,
                         z = c(3:7, NA), id = letters[1:6])
df2 <- data.table(grp=c(1,3), w = c(10,11))
Operation                           | data.table                                      | DataFrames.jl
Reduce multiple values              | df[, .(mean(x))]                                | combine(df, :x => mean)
Add new columns                     | df[, x_mean:=mean(x) ]                          | transform!(df, :x => mean => :x_mean)
Rename column (in place)            | setnames(df, "x", "x_new")                      | rename!(df, :x => :x_new)
Rename multiple columns (in place)  | setnames(df, c("x", "y"), c("x_new", "y_new"))  | rename!(df, [:x, :y] .=> [:x_new, :y_new])
Pick columns as dataframe           | df[, .(x, y)]                                   | select(df, :x, :y)
Pick column as a vector             | df[, x]                                         | df[!, :x]
Remove columns                      | df[, -"x"]                                      | select(df, Not(:x))
Remove columns (in place)           | df[, x:=NULL]                                   | select!(df, Not(:x))
Remove columns (in place)           | df[, c("x", "y"):=NULL]                         | select!(df, Not([:x, :y]))
Pick & transform columns            | df[, .(mean(x), y)]                             | select(df, :x => mean, :y)
Pick rows                           | df[ x >= 1 ]                                    | filter(:x => >=(1), df)
Sort rows (in place)                | setorder(df, x)                                 | sort!(df, :x)
Sort rows                           | df[ order(x) ]                                  | sort(df, :x)

      Grouping data and aggregation

Operation                   | data.table                           | DataFrames.jl
Reduce multiple values      | df[, mean(x), by=id ]                | combine(groupby(df, :id), :x => mean)
Add new columns (in place)  | df[, x_mean:=mean(x), by=id]         | transform!(groupby(df, :id), :x => mean)
Pick & transform columns    | df[, .(x_mean = mean(x), y), by=id]  | select(groupby(df, :id), :x => mean, :y)

      More advanced commands

      Operation | data.table | DataFrames.jl
      Complex Function | df[, .(mean(x, na.rm=TRUE)) ] | combine(df, :x => x -> mean(skipmissing(x)))
      Transform certain rows (in place) | df[x<=0, x:=0] | df.x[df.x .<= 0] .= 0
      Transform several columns | df[, .(max(x), min(y)) ] | combine(df, :x => maximum, :y => minimum)
       | df[, lapply(.SD, mean), .SDcols = c("x", "y") ] | combine(df, [:x, :y] .=> mean)
       | df[, lapply(.SD, mean), .SDcols = patterns("*x") ] | combine(df, names(df, r"^x") .=> mean)
       | df[, unlist(lapply(.SD, function(x) c(max=max(x), min=min(x)))), .SDcols = c("x", "y") ] | combine(df, ([:x, :y] .=> [maximum minimum])...)
      Multivariate function | df[, .(cor(x,y)) ] | transform(df, [:x, :y] => cor)
      Row-wise | df[, min_xy := min(x, y), by = 1:nrow(df)] | transform!(df, [:x, :y] => ByRow(min))
       | df[, argmax_xy := which.max(.SD) , .SDcols = patterns("*x"), by = 1:nrow(df) ] | transform!(df, AsTable(r"^x") => ByRow(argmax))
      DataFrame as output | df[, .SD[1], by=grp] | combine(groupby(df, :grp), first)
      DataFrame as output | df[, .SD[which.max(x)], by=grp] | combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])
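A few of the advanced patterns above, sketched on a toy data frame containing a missing value (column names and data are assumptions):

```julia
using DataFrames, Statistics

df = DataFrame(x = [1.0, missing, 3.0], y = [4.0, 5.0, 6.0])

# Skip missing values inside a reduction.
m = combine(df, :x => (x -> mean(skipmissing(x))) => :x_mean)

# Apply the same reduction to several columns at once.
m2 = combine(df, [:x, :y] .=> (c -> mean(skipmissing(c))))

# Row-wise minimum across two columns (missing propagates).
transform!(df, [:x, :y] => ByRow(min) => :min_xy)
```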

      Joining data frames

      Operation | data.table | DataFrames.jl
      Inner join | merge(df, df2, on = "grp") | innerjoin(df, df2, on = :grp)
      Outer join | merge(df, df2, all = TRUE, on = "grp") | outerjoin(df, df2, on = :grp)
      Left join | merge(df, df2, all.x = TRUE, on = "grp") | leftjoin(df, df2, on = :grp)
      Right join | merge(df, df2, all.y = TRUE, on = "grp") | rightjoin(df, df2, on = :grp)
      Anti join (filtering) | df[!df2, on = "grp" ] | antijoin(df, df2, on = :grp)
      Semi join (filtering) | merge(df1, df2[, .(grp)]) | semijoin(df, df2, on = :grp)
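All of the join rows above share a single key column. A small sketch (df2 mirrors the data.table setup used on the original page):

```julia
using DataFrames

df  = DataFrame(grp = [1, 2, 3], x = [1.0, 2.0, 3.0])
df2 = DataFrame(grp = [1, 3], w = [10, 11])

ij = innerjoin(df, df2, on = :grp)  # keys present in both tables
lj = leftjoin(df, df2, on = :grp)   # all rows of df; w is missing for grp == 2
aj = antijoin(df, df2, on = :grp)   # rows of df whose key is absent from df2
```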

      Comparison with Stata (version 8 and above)

      The following table compares the main functions of DataFrames.jl with Stata:

      Operation | Stata | DataFrames.jl
      Reduce multiple values | collapse (mean) x | combine(df, :x => mean)
      Add new columns | egen x_mean = mean(x) | transform!(df, :x => mean => :x_mean)
      Rename columns | rename x x_new | rename!(df, :x => :x_new)
      Pick columns | keep x y | select!(df, :x, :y)
      Pick rows | keep if x >= 1 | subset!(df, :x => ByRow(x -> x >= 1))
      Sort rows | sort x | sort!(df, :x)
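A runnable sketch chaining the Stata equivalents above on one toy data frame (the values are illustrative):

```julia
using DataFrames, Statistics

df = DataFrame(x = [-1.0, 1.0, 2.0], y = [1, 2, 3])

transform!(df, :x => mean => :x_mean)       # egen x_mean = mean(x)
rename!(df, :x => :x_new)                   # rename x x_new
select!(df, :x_new, :x_mean, :y)            # keep x_new x_mean y
subset!(df, :x_new => ByRow(x -> x >= 1))   # keep if x_new >= 1
sort!(df, :x_new)                           # sort x_new
```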

      Note that the suffix ! (i.e. transform!, select!, etc.) indicates that the operation transforms the data frame in place, as in Stata.

      Some of these functions can be applied to grouped data frames, in which case they operate by group:

      Operation | Stata | DataFrames.jl
      Add new columns | egen x_mean = mean(x), by(grp) | transform!(groupby(df, :grp), :x => mean)
      Reduce multiple values | collapse (mean) x, by(grp) | combine(groupby(df, :grp), :x => mean)

      The table below compares more advanced commands:

      Operation | Stata | DataFrames.jl
      Transform certain rows | replace x = 0 if x <= 0 | transform(df, :x => (x -> ifelse.(x .<= 0, 0, x)) => :x)
      Transform several columns | collapse (max) x (min) y | combine(df, :x => maximum, :y => minimum)
       | collapse (mean) x y | combine(df, [:x, :y] .=> mean)
       | collapse (mean) x* | combine(df, names(df, r"^x") .=> mean)
       | collapse (max) x y (min) x y | combine(df, ([:x, :y] .=> [maximum minimum])...)
      Multivariate function | egen z = corr(x y) | transform!(df, [:x, :y] => cor => :z)
      Row-wise | egen z = rowmin(x y) | transform!(df, [:x, :y] => ByRow(min) => :z)
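Two of these advanced patterns sketched concretely (the data frame and its values are assumptions):

```julia
using DataFrames

df = DataFrame(x = [-2.0, 0.5, 3.0], y = [1.0, 1.0, 1.0])

# replace x = 0 if x <= 0 (returns a new data frame; use transform! for in place)
t = transform(df, :x => (x -> ifelse.(x .<= 0, 0, x)) => :x)

# egen z = rowmin(x y): row-wise minimum of x and y
transform!(df, [:x, :y] => ByRow(min) => :z)
```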

      Observe that, in our example, all such operations (and many more) are handled automatically by CSV.jl.

      Similarly, you can use the writedlm function from the DelimitedFiles module to save a data frame like this:

      writedlm("test.csv", Iterators.flatten(([names(iris)], eachrow(iris))), ',')

      As you can see, transforming iris into a valid input for the writedlm function, so that the resulting CSV file has the expected format, takes some effort. CSV.jl is therefore the preferred package for writing data frames to CSV files.
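To make the comparison concrete, here is a hedged sketch of both approaches on a tiny stand-in for iris (the two-column data frame is an assumption; the real iris data set has five columns):

```julia
using DataFrames, DelimitedFiles

iris = DataFrame(sepal_length = [5.1, 4.9], species = ["setosa", "setosa"])

# writedlm needs the header row and the data rows flattened together by hand:
writedlm("test.csv", Iterators.flatten(([names(iris)], eachrow(iris))), ',')

# CSV.jl (if installed) handles headers, quoting, and missing values directly:
# using CSV
# CSV.write("test.csv", iris)
```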

      Other formats

      Other data formats are supported for reading and writing in the following packages (non-exhaustive list):


      Note that in this case the order and number of rows in the left table are not changed. Therefore, in particular, it is not allowed to have duplicate keys in the right table:

      julia> leftjoin!(main, DataFrame(id=[2, 2], info_bad=["a", "b"]), on=:id)
      ERROR: ArgumentError: duplicate rows found in right table

      See the Julia manual for more information about missing values.


      A query that ends with a @collect statement without a specific type will materialize the query results into an array. Note also the difference in the @select statement: The previous queries all used the {} syntax in the @select statement to project results into a tabular format. The last query instead just selects a single value from each row in the @select statement.

      These examples only scratch the surface of what one can do with Query.jl, and the interested reader is referred to the Query.jl documentation for more information.
