ENH read multiple sheets in read_excel()
jnmclarty committed Feb 22, 2015
1 parent fa2b684 commit d8a2893
Showing 5 changed files with 300 additions and 112 deletions.
118 changes: 86 additions & 32 deletions doc/source/io.rst
Expand Up @@ -1949,56 +1949,106 @@ module and use the same parsing code as the above to convert tabular data into
a DataFrame. See the :ref:`cookbook<cookbook.excel>` for some
advanced strategies.

Besides ``read_excel`` you can also read Excel files using the ``ExcelFile``
class. The following two commands are equivalent:
Reading Excel Files
~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.16

``read_excel`` can read more than one sheet, by setting ``sheetname`` to either
a list of sheet names, a list of sheet positions, or ``None`` to read all sheets.

.. versionadded:: 0.13

Sheets can be specified by sheet index or sheet name, using an integer or string,
respectively.

.. versionadded:: 0.12

``ExcelFile`` has been moved to the top level namespace.

There are two approaches to reading an Excel file: the ``read_excel`` function
and the ``ExcelFile`` class. ``read_excel`` is for reading one file
with file-specific arguments (i.e. identical data formats across sheets).
``ExcelFile`` is for reading one file with sheet-specific arguments (i.e. various data
formats across sheets). Choosing the approach is largely a question of
code readability and execution speed.

Equivalent class and function approaches to read a single sheet:

.. code-block:: python
# using the ExcelFile class
xls = pd.ExcelFile('path_to_file.xls')
xls.parse('Sheet1', index_col=None, na_values=['NA'])
data = xls.parse('Sheet1', index_col=None, na_values=['NA'])
# using the read_excel function
read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
data = read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
The class-based approach can be used to read multiple sheets or to introspect
the sheet names using the ``sheet_names`` attribute.
Equivalent class and function approaches to read multiple sheets:

.. note::
.. code-block:: python
``ExcelFile`` has been moved from ``pandas.io.parsers`` to the top-level
namespace, starting from pandas 0.12.0.
data = {}
# For when Sheet1's format differs from Sheet2
xls = pd.ExcelFile('path_to_file.xls')
data['Sheet1'] = xls.parse('Sheet1', index_col=None, na_values=['NA'])
data['Sheet2'] = xls.parse('Sheet2', index_col=1)
# For when Sheet1's format is identical to Sheet2
data = read_excel('path_to_file.xls', ['Sheet1','Sheet2'], index_col=None, na_values=['NA'])
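
When several sheets are requested, the result is a dictionary of DataFrames rather than a single DataFrame. The sketch below shows a few ways such a dictionary might be consumed. It does not read a real file: ``data`` is built by hand with invented sample values to stand in for the return value, and it assumes the sheets share the same columns.

```python
import pandas as pd

# Stand-in for the result of reading several sheets: a dict that maps each
# requested sheet to a DataFrame (sample data invented for illustration).
data = {
    'Sheet1': pd.DataFrame({'a': [1, 2], 'b': [3, 4]}),
    'Sheet2': pd.DataFrame({'a': [5], 'b': [6]}),
}

# Look up one sheet by the key that was requested.
first = data['Sheet1']

# Iterate over all sheets, here to collect their shapes.
shapes = {name: frame.shape for name, frame in data.items()}

# Stack sheets that share a layout; the dictionary key becomes the
# outer level of the resulting index.
combined = pd.concat(data)
```

Stacking with ``pd.concat`` only makes sense when the sheets have identical formats, which is the same case in which the function approach with a list of sheets applies.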
Specifying Sheets
+++++++++++++++++
.. _io.specifying_sheets:

.. versionadded:: 0.13
.. note:: The second argument is ``sheetname``, not to be confused with ``ExcelFile.sheet_names``.
There are now two ways to read in sheets from an Excel file. You can provide
either the index of a sheet or its name by passing different values for
``sheetname``.
.. note:: The ``sheet_names`` attribute of an ``ExcelFile`` provides access to a list of sheets.
- The argument ``sheetname`` allows specifying the sheet or sheets to read.
- The default value for ``sheetname`` is 0, indicating to read the first sheet.
- Pass a string to refer to the name of a particular sheet in the workbook.
- Pass an integer to refer to the index of a sheet. Indices follow Python
convention, beginning at 0.
- The default value is ``sheetname=0``. This reads the first sheet.

Using the sheet name:
- Pass a list of strings or integers to return a dictionary of the specified sheets.
- Pass ``None`` to return a dictionary of all available sheets.

.. code-block:: python
# Returns a DataFrame
read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
Using the sheet index:

.. code-block:: python
read_excel('path_to_file.xls', 0, index_col=None, na_values=['NA'])
# Returns a DataFrame
read_excel('path_to_file.xls', 0, index_col=None, na_values=['NA'])
Using all default values:

.. code-block:: python

   # Returns a DataFrame
   read_excel('path_to_file.xls')

Using None to get all sheets:

.. code-block:: python

   # Returns a dictionary of DataFrames
   read_excel('path_to_file.xls', sheetname=None)

Using a list to get multiple sheets:

.. code-block:: python

   # Returns the 1st and 4th sheet, as a dictionary of DataFrames.
   read_excel('path_to_file.xls', sheetname=['Sheet1', 3])

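
The selection rules in this section can be summarised with a toy model. The function below is hypothetical and is not how pandas implements ``sheetname``; it only maps a ``sheetname`` value onto a list of sheet names so that the return shape of each case is easy to see. It assumes that the keys of a returned dictionary mirror the values that were requested.

```python
def select_sheets(sheet_names, sheetname=0):
    """Toy model of the ``sheetname`` rules; not pandas' actual implementation."""
    def resolve(key):
        # Integers are zero-based positions; strings are sheet names.
        if isinstance(key, int):
            return sheet_names[key]
        if key not in sheet_names:
            raise KeyError(key)
        return key

    if sheetname is None:
        # None selects every sheet, keyed by name.
        return {name: name for name in sheet_names}
    if isinstance(sheetname, list):
        # A list returns a dictionary keyed by the values that were passed in.
        return {key: resolve(key) for key in sheetname}
    # A single integer or string selects a single sheet.
    return resolve(sheetname)


# Hypothetical workbook with four sheets.
names = ['Sheet1', 'Sheet2', 'Sheet3', 'Sheet4']

default = select_sheets(names)                # first sheet
mixed = select_sheets(names, ['Sheet1', 3])   # 1st and 4th sheet, as a dict
everything = select_sheets(names, None)       # all sheets, as a dict
```

In this model a single integer or string yields one result, while a list or ``None`` always yields a dictionary, even when the list contains only one entry.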
Parsing Specific Columns
++++++++++++++++++++++++

It is often the case that users will insert columns to do temporary computations
in Excel and you may not want to read in those columns. ``read_excel`` takes
a ``parse_cols`` keyword to allow you to specify a subset of columns to parse.
Expand All @@ -2017,26 +2067,30 @@ indices to be parsed.
read_excel('path_to_file.xls', 'Sheet1', parse_cols=[0, 2, 3])
.. note::
Cell Converters
+++++++++++++++

It is possible to transform the contents of Excel cells via the ``converters``
option. For instance, to convert a column to boolean:
It is possible to transform the contents of Excel cells via the ``converters``
option. For instance, to convert a column to boolean:

.. code-block:: python
.. code-block:: python
read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool})
read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool})
This option handles missing values and treats exceptions in the converters
as missing data. Transformations are applied cell by cell rather than to the
column as a whole, so the array dtype is not guaranteed. For instance, a
column of integers with missing values cannot be transformed to an array
with integer dtype, because NaN is strictly a float. You can manually mask
missing data to recover integer dtype:
This option handles missing values and treats exceptions in the converters
as missing data. Transformations are applied cell by cell rather than to the
column as a whole, so the array dtype is not guaranteed. For instance, a
column of integers with missing values cannot be transformed to an array
with integer dtype, because NaN is strictly a float. You can manually mask
missing data to recover integer dtype:

.. code-block:: python
.. code-block:: python
cfun = lambda x: int(x) if x else -1
read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})
cfun = lambda x: int(x) if x else -1
read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})
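
To make the cell-by-cell behaviour concrete, the sketch below applies the same ``cfun`` to a plain Python list instead of a real workbook. The ``cells`` values are invented for illustration, with ``None`` standing in for an empty cell, and ``to_int_or_sentinel`` is a hypothetical helper, not part of pandas.

```python
import math

# Converter from the text above: falsy cells (None, '', 0) become -1.
cfun = lambda x: int(x) if x else -1

# Hypothetical raw cell values for a 'MyInts' column; None marks an empty cell.
cells = [1.0, 2.0, None, 4.0]

# The converter is applied to one cell at a time, not to the column as a whole.
converted = [cfun(cell) for cell in cells]
print(converted)  # [1, 2, -1, 4]


def to_int_or_sentinel(x, sentinel=-1):
    """Stricter variant (hypothetical): only truly missing cells get the sentinel."""
    if x is None or x == '' or (isinstance(x, float) and math.isnan(x)):
        return sentinel
    return int(x)
```

Note that ``cfun`` also maps a genuine ``0`` to ``-1``, because ``0`` is falsy. If zero is a valid value in the column, a stricter check such as the one in ``to_int_or_sentinel`` may be preferable.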
Writing Excel Files
~~~~~~~~~~~~~~~~~~~

To write a DataFrame object to a sheet of an Excel file, you can use the
``to_excel`` instance method. The arguments are largely the same as ``to_csv``
8 changes: 8 additions & 0 deletions doc/source/whatsnew/v0.16.0.txt
Expand Up @@ -190,6 +190,14 @@ Enhancements
- Added ``StringMethods.find()`` and ``rfind()`` which behave the same as the standard ``str`` methods (:issue:`9386`)

- Added ``StringMethods.isnumeric`` and ``isdecimal`` which behave the same as the standard ``str`` methods (:issue:`9439`)
- The ``read_excel()`` function's :ref:`sheetname <io.specifying_sheets>` argument now accepts a list and ``None``, to get multiple or all sheets respectively. If more than one sheet is specified, a dictionary is returned. (:issue:`9450`)

.. code-block:: python

     # Returns the 1st and 4th sheet, as a dictionary of DataFrames.
     pd.read_excel('path_to_file.xls', sheetname=['Sheet1', 3])

- A ``verbose`` argument has been added to ``io.read_excel()``; it defaults to ``False``. Set it to ``True`` to print sheet names as they are parsed. (:issue:`9450`)
- Added ``StringMethods.ljust()`` and ``rjust()`` which behave the same as the standard ``str`` methods (:issue:`9352`)
- ``StringMethods.pad()`` and ``center()`` now accept ``fillchar`` option to specify filling character (:issue:`9352`)
- Added ``StringMethods.zfill()`` which behaves the same as the standard ``str`` method (:issue:`9387`)