Merge commit 'v0.14.0-345-g8cd3dd6' into debian
* commit 'v0.14.0-345-g8cd3dd6': (73 commits)
  PERF: allow slice indexers to be computed faster
  PERF: allow dst transition computations to be handled much faster if the end-points are ok (GH7633)
  Revert "Merge pull request pandas-dev#7591 from mcwitt/parse-index-cols-c"
  TST: fixes for 2.6 comparisons
  BUG: Error in rolling_var if window is larger than array, fixes pandas-dev#7297
  REGR: Add back #N/A N/A as a default NA value (regression from 0.12) (GH5521)
  BUG: xlim on plots with shared axes (GH2960, GH3490)
  BUG: Bug in Series.get with a boolean accessor (GH7407)
  DOC: add v0.15.0.txt template
  DOC: small doc build fixes
  DOC: v0.14.1 edits
  BUG: doc example in groupby.rst (GH7559 / GH7628)
  PERF: optimize MultiIndex.from_product for large iterables
  ENH: change BlockManager pickle format to work with dup items
  BUG: {expanding,rolling}_{cov,corr} don't handle arguments with different index sets properly
  CLN/DEPR: Fix instances of 'U'/'rU' in open(...)
  CLN: Fix typo
  TST: fix groupby test on windows (related GH7580)
  COMPAT: make numpy NaT comparison use a view to avoid implicit conversions
  BUG: Bug in to_timedelta that accepted invalid units and misinterpreted m/h (GH7611, GH6423)
  ...
yarikoptic committed Jul 3, 2014
2 parents 95aed53 + 8cd3dd6 commit 4300489
Showing 78 changed files with 3,394 additions and 1,986 deletions.
1 change: 0 additions & 1 deletion ci/requirements-2.6.txt
@@ -4,7 +4,6 @@ python-dateutil==1.5
pytz==2013b
http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.0.tar.gz
html5lib==1.0b2
-bigquery==2.0.17
numexpr==1.4.2
sqlalchemy==0.7.1
pymysql==0.6.0
4 changes: 3 additions & 1 deletion ci/requirements-2.7.txt
@@ -19,5 +19,7 @@ lxml==3.2.1
scipy==0.13.3
beautifulsoup4==4.2.1
statsmodels==0.5.0
-bigquery==2.0.17
boto==2.26.1
+httplib2==0.8
+python-gflags==2.0
+google-api-python-client==1.2
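Together with the ``doc/source/install.rst`` hunk below, this swaps the external ``bq`` command line tool for Google's client libraries. A quick smoke test of the new stack; a minimal sketch, assuming the conventional import names these packages install (``gflags`` for `python-gflags`, ``apiclient`` for `google-api-python-client`):

.. code-block:: python

   # Import check for the three libraries that replace the `bq` CLI
   import httplib2     # HTTP transport used by the Google API client
   import gflags       # installed by python-gflags
   import apiclient    # installed by google-api-python-client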
2 changes: 1 addition & 1 deletion ci/requirements-3.4.txt
@@ -5,7 +5,7 @@ xlsxwriter
xlrd
html5lib
numpy==1.8.0
-cython==0.20.0
+cython==0.20.2
scipy==0.13.3
numexpr==2.4
tables==3.1.0
22 changes: 22 additions & 0 deletions doc/source/cookbook.rst
@@ -663,3 +663,25 @@ To globally provide aliases for axis names, one can define these 2 functions:
df2 = DataFrame(randn(3,2),columns=['c1','c2'],index=['i1','i2','i3'])
df2.sum(axis='myaxis2')
clear_axis_alias(DataFrame,'columns', 'myaxis2')
Creating Example Data
---------------------

To create a dataframe from every combination of some given values, like R's ``expand.grid()``
function, we can create a dict where the keys are column names and the values are lists
of the data values:

.. ipython:: python

   import itertools

   def expand_grid(data_dict):
       rows = itertools.product(*data_dict.values())
       return pd.DataFrame.from_records(rows, columns=data_dict.keys())

   df = expand_grid({'height': [60, 70],
                     'weight': [100, 140, 180],
                     'sex': ['Male', 'Female']})
   df
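The commit list above also includes a ``MultiIndex.from_product`` optimization; the same cartesian product can be built that way. A minimal sketch, assuming pandas 0.13.1+ where ``from_product`` is available:

.. code-block:: python

   import pandas as pd

   # Build the product as a MultiIndex, then materialize it as columns
   idx = pd.MultiIndex.from_product(
       [[60, 70], [100, 140, 180], ['Male', 'Female']],
       names=['height', 'weight', 'sex'])
   df = pd.DataFrame(list(idx), columns=idx.names)  # 12 rows, 3 columns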
4 changes: 3 additions & 1 deletion doc/source/install.rst
@@ -112,7 +112,9 @@ Optional Dependencies
:func:`~pandas.io.clipboard.read_clipboard`. Most package managers on Linux
distributions will have xclip and/or xsel immediately available for
installation.
-* `Google bq Command Line Tool <https://developers.google.com/bigquery/bq-command-line-tool/>`__
+* Google's `python-gflags` and `google-api-python-client`
   * Needed for :mod:`~pandas.io.gbq`
+* `httplib2`
+   * Needed for :mod:`~pandas.io.gbq`
* One of the following combinations of libraries is needed to use the
top-level :func:`~pandas.io.html.read_html` function:
132 changes: 78 additions & 54 deletions doc/source/io.rst
@@ -98,8 +98,10 @@ They can take a number of arguments:
data. Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
pass ``header=0`` to be able to replace existing names. The header can be
a list of integers that specify row locations for a multi-index on the columns
-  E.g. [0,1,3]. Intervening rows that are not specified will be skipped.
-  (E.g. 2 in this example are skipped)
+  E.g. [0,1,3]. Intervening rows that are not specified will be
+  skipped (e.g. row 2 in this example is skipped). Note that this parameter
+  ignores commented lines, so ``header=0`` denotes the first line of
+  data rather than the first line of the file.
- ``skiprows``: A collection of numbers for rows in the file to skip. Can
also be an integer to skip the first ``n`` rows
- ``index_col``: column number, column name, or list of column numbers/names,
@@ -145,8 +147,12 @@ They can take a number of arguments:
  Acceptable values are 0, 1, 2, and 3 for QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC, and QUOTE_NONE, respectively.
- ``skipinitialspace`` : boolean, default ``False``, Skip spaces after delimiter
- ``escapechar`` : string, to specify how to escape quoted data
-- ``comment``: denotes the start of a comment and ignores the rest of the line.
-  Currently line commenting is not supported.
+- ``comment``: Indicates that the remainder of the line should not be parsed. If
+  found at the beginning of a line, the line will be ignored altogether. This
+  parameter must be a single character. Also, fully commented lines
+  are ignored by the parameter ``header`` but not by ``skiprows``. For example,
+  if ``comment='#'``, parsing '#empty\n1,2,3\na,b,c' with ``header=0`` will
+  result in '1,2,3' being treated as the header.
- ``nrows``: Number of rows to read out of the file. Useful to only read a
small portion of a large file
- ``iterator``: If True, return a ``TextFileReader`` to enable reading a file
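To make the multi-row ``header`` behaviour from the first hunk concrete, here is a minimal sketch; the data string and the Python 3 ``io.StringIO`` import are illustrative assumptions:

.. code-block:: python

   import pandas as pd
   from io import StringIO

   data = ('l0a,l0b,l0c\n'    # header row 0
           'l1a,l1b,l1c\n'    # header row 1
           'not,a,header\n'   # row 2 is not listed in `header`, so it is skipped
           'l3a,l3b,l3c\n'    # header row 3
           '1,2,3\n')
   df = pd.read_csv(StringIO(data), header=[0, 1, 3])
   df.columns  # three-level MultiIndex built from rows 0, 1 and 3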
@@ -252,6 +258,27 @@ after a delimiter:
data = 'a, b, c\n1, 2, 3\n4, 5, 6'
print(data)
pd.read_csv(StringIO(data), skipinitialspace=True)
Moreover, ``read_csv`` ignores any completely commented lines:

.. ipython:: python

   data = 'a,b,c\n# commented line\n1,2,3\n#another comment\n4,5,6'
   print(data)
   pd.read_csv(StringIO(data), comment='#')

.. note::

   The presence of ignored lines might create ambiguities involving line numbers;
   the parameter ``header`` uses row numbers (ignoring commented
   lines), while ``skiprows`` uses line numbers (including commented lines):

   .. ipython:: python

      data = '#comment\na,b,c\nA,B,C\n1,2,3'
      pd.read_csv(StringIO(data), comment='#', header=1)
      data = 'A,B,C\n#comment\na,b,c\n1,2,3'
      pd.read_csv(StringIO(data), comment='#', skiprows=2)
The parsers make every attempt to "do the right thing" and not be very
fragile. Type inference is a pretty big deal. So if a column can be coerced to
@@ -3373,83 +3400,80 @@ Google BigQuery (Experimental)
The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
analytics web service to simplify retrieving results from BigQuery tables
using SQL-like queries. Result sets are parsed into a pandas
-DataFrame with a shape derived from the source table. Additionally,
-DataFrames can be uploaded into BigQuery datasets as tables
-if the source datatypes are compatible with BigQuery ones.
+DataFrame with a shape and data types derived from the source table.
+Additionally, DataFrames can be appended to existing BigQuery tables if
+the destination table is the same shape as the DataFrame.

For specifics on the service itself, see `here <https://developers.google.com/bigquery/>`__

-As an example, suppose you want to load all data from an existing table
-`test_dataset.test_table` into BigQuery and pull it into a DataFrame.
+As an example, suppose you want to load all data from the existing BigQuery
+table `test_dataset.test_table` into a DataFrame using the
+:func:`~pandas.io.read_gbq` function.

.. code-block:: python

-   from pandas.io import gbq
    # Insert your BigQuery Project ID Here
-   # Can be found in the web console, or
-   # using the command line tool `bq ls`
+   # Can be found in the Google web console
    projectid = "xxxxxxxx"

-   data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
+   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
-The user will then be authenticated by the `bq` command line client -
-this usually involves the default browser opening to a login page,
-though the process can be done entirely from command line if necessary.
-Datasets and additional parameters can be either configured with `bq`,
-passed in as options to `read_gbq`, or set using Google's gflags (this
-is not officially supported by this module, though care was taken
-to ensure that they should be followed regardless of how you call the
-method).
+You will then be authenticated to the specified BigQuery account
+via Google's OAuth2 mechanism. In general, this is as simple as following the
+prompts in a browser window, which will be opened for you. Should the browser not
+be available, or fail to launch, a code will be provided to complete the process
+manually. Additional information on the authentication mechanism can be found
+`here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__.

-Additionally, you can define which column to use as an index as well as a preferred column order as follows:
+You can define which column from BigQuery to use as an index in the
+destination DataFrame as well as a preferred column order as follows:

.. code-block:: python
-   data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table',
+   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                              index_col='index_column_name',
-                             col_order='[col1, col2, col3,...]', project_id = projectid)
+                             col_order=['col1', 'col2', 'col3'], project_id = projectid)

-Finally, if you would like to create a BigQuery table, `my_dataset.my_table`, from the rows of DataFrame, `df`:
+Finally, you can append data to a BigQuery table from a pandas DataFrame
+using the :func:`~pandas.io.to_gbq` function. This function uses the
+Google streaming API, which requires that your destination table exists in
+BigQuery. Given that the BigQuery table already exists, your DataFrame should
+match the destination table in column order, structure, and data types.
+DataFrame indexes are not supported. By default, rows are streamed to
+BigQuery in chunks of 10,000 rows, but you can pass other chunk values
+via the ``chunksize`` argument. You can also see the progress of your
+post via the ``verbose`` flag, which defaults to ``True``. The HTTP
+response code of Google BigQuery can be successful (200) even if the
+append failed. For this reason, if there is a failure to append to the
+table, the complete error response from BigQuery is returned, which
+can be quite long given it provides a status for each row. You may want
+to start with smaller chunks to test that the size and types of your
+DataFrame match your destination table to make debugging simpler.

.. code-block:: python
-   df = pandas.DataFrame({'string_col_name' : ['hello'],
-                          'integer_col_name' : [1],
-                          'boolean_col_name' : [True]})
-   schema = ['STRING', 'INTEGER', 'BOOLEAN']
-   data_frame = gbq.to_gbq(df, 'my_dataset.my_table',
-                           if_exists='fail', schema = schema, project_id = projectid)
+   df.to_gbq('my_dataset.my_table', project_id = projectid)
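The ``chunksize`` and ``verbose`` options described above can be passed explicitly; a minimal sketch, assuming ``projectid`` is set as earlier and `my_dataset.my_table` already exists with a matching schema:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'string_col_name': ['hello'],
                      'integer_col_name': [1],
                      'boolean_col_name': [True]})

   # Stream in small chunks and report progress, to simplify debugging
   df.to_gbq('my_dataset.my_table', project_id=projectid,
             chunksize=1000, verbose=True)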
-To add more rows to this, simply:
+The BigQuery SQL query language has some oddities; see `here <https://developers.google.com/bigquery/query-reference>`__.

.. code-block:: python
While BigQuery uses SQL-like syntax, it has some important differences
from traditional databases both in functionality, API limitations (size and
qunatity of queries or uploads), and how Google charges for use of the service.
You should refer to Google documentation often as the service seems to
be changing and evolving. BiqQuery is best for analyzing large sets of
data quickly, but it is not a direct replacement for a transactional database.

-   df2 = pandas.DataFrame({'string_col_name' : ['hello2'],
-                           'integer_col_name' : [2],
-                           'boolean_col_name' : [False]})
-   data_frame = gbq.to_gbq(df2, 'my_dataset.my_table', if_exists='append', project_id = projectid)
.. note::

-   A default project id can be set using the command line:
-   `bq init`.

-   There is a hard cap on BigQuery result sets, at 128MB compressed. Also, the BigQuery SQL query language has some oddities,
-   see `here <https://developers.google.com/bigquery/query-reference>`__

-   You can access the management console to determine project id's by:
-   <https://code.google.com/apis/console/b/0/?noredirect>
+   You can access the management console to determine project id's by:
+   <https://code.google.com/apis/console/b/0/?noredirect>

.. warning::

-   To use this module, you will need a BigQuery account. See
-   <https://cloud.google.com/products/big-query> for details.

-   As of 1/28/14, a known bug is present that could possibly cause data duplication in the resultant dataframe. A fix is imminent,
-   but any client changes will not make it into 0.13.1. See:
-   http://stackoverflow.com/questions/20984592/bigquery-results-not-including-page-token/21009144?noredirect=1#comment32090677_21009144
+   To use this module, you will need a valid BigQuery account. See
+   <https://cloud.google.com/products/big-query> for details on the
+   service.

.. _io.stata:

2 changes: 1 addition & 1 deletion doc/source/remote_data.rst
@@ -52,7 +52,7 @@ Yahoo! Finance
f=web.DataReader("F", 'yahoo', start, end)
f.ix['2010-01-04']
-.. _remote_data.yahoo_Options:
+.. _remote_data.yahoo_options:

Yahoo! Finance Options
----------------------
7 changes: 4 additions & 3 deletions doc/source/timeseries.rst
@@ -1280,9 +1280,10 @@ To supply the time zone, you can use the ``tz`` keyword to ``date_range`` and
other functions. Dateutil time zone strings are distinguished from ``pytz``
time zones by starting with ``dateutil/``.

- In ``pytz`` you can find a list of common (and less common) time zones using
  ``from pytz import common_timezones, all_timezones``.
- ``dateutil`` uses the OS timezones so there isn't a fixed list available. For
  common zones, the names are the same as ``pytz``.
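For instance, both spellings can be passed to ``date_range``; a minimal sketch, assuming ``pytz`` and ``dateutil`` are both installed:

.. code-block:: python

   import pandas as pd

   # pytz-style and dateutil-style names for the same zone
   rng_pytz = pd.date_range('2014-07-01', periods=3, tz='Europe/London')
   rng_dateutil = pd.date_range('2014-07-01', periods=3,
                                tz='dateutil/Europe/London')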

.. ipython:: python
@@ -1448,7 +1449,7 @@ Elements can be set to ``NaT`` using ``np.nan`` analogously to datetimes
y[1] = np.nan
y
-Operands can also appear in a reversed order (a singluar object operated with a Series)
+Operands can also appear in a reversed order (a singular object operated with a Series)

.. ipython:: python
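The reversed-operand behaviour noted in this last hunk can be illustrated briefly; a minimal sketch, assuming 0.14-era datetime64 Series arithmetic:

.. code-block:: python

   import pandas as pd
   from datetime import datetime

   s = pd.Series(pd.to_datetime(['2014-07-01', '2014-07-02']))
   datetime(2014, 7, 4) - s   # singular object on the left-hand side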