Merge commit 'v0.14.0-345-g8cd3dd6' into debian
* commit 'v0.14.0-345-g8cd3dd6': (73 commits)
  PERF: allow slice indexers to be computed faster
  PERF: allow dst transition computations to be handled much faster if the end-points are ok (GH7633)
  Revert "Merge pull request pandas-dev#7591 from mcwitt/parse-index-cols-c"
  TST: fixes for 2.6 comparisons
  BUG: Error in rolling_var if window is larger than array, fixes pandas-dev#7297
  REGR: Add back #N/A N/A as a default NA value (regression from 0.12) (GH5521)
  BUG: xlim on plots with shared axes (GH2960, GH3490)
  BUG: Bug in Series.get with a boolean accessor (GH7407)
  DOC: add v0.15.0.txt template
  DOC: small doc build fixes
  DOC: v0.14.1 edits
  BUG: doc example in groupby.rst (GH7559 / GH7628)
  PERF: optimize MultiIndex.from_product for large iterables
  ENH: change BlockManager pickle format to work with dup items
  BUG: {expanding,rolling}_{cov,corr} don't handle arguments with different index sets properly
  CLN/DEPR: Fix instances of 'U'/'rU' in open(...)
  CLN: Fix typo
  TST: fix groupby test on windows (related GH7580)
  COMPAT: make numpy NaT comparison use a view to avoid implicit conversions
  BUG: Bug in to_timedelta that accepted invalid units and misinterpreted m/h (GH7611, GH6423)
  ...
yarikoptic committed Jul 3, 2014
2 parents 95aed53 + 8cd3dd6 commit 4300489
Showing 78 changed files with 3,394 additions and 1,986 deletions.
1 change: 0 additions & 1 deletion ci/requirements-2.6.txt
@@ -4,7 +4,6 @@ python-dateutil==1.5
pytz==2013b
http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.0.tar.gz
html5lib==1.0b2
-bigquery==2.0.17
numexpr==1.4.2
sqlalchemy==0.7.1
pymysql==0.6.0
4 changes: 3 additions & 1 deletion ci/requirements-2.7.txt
@@ -19,5 +19,7 @@ lxml==3.2.1
scipy==0.13.3
beautifulsoup4==4.2.1
statsmodels==0.5.0
-bigquery==2.0.17
boto==2.26.1
+httplib2==0.8
+python-gflags==2.0
+google-api-python-client==1.2
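Together with the ``doc/source/install.rst`` hunk below, this swaps the external ``bq`` command line tool for Google's client libraries. A quick smoke test of the new stack; a minimal sketch, assuming the conventional import names these packages install (``gflags`` for `python-gflags`, ``apiclient`` for `google-api-python-client`):

.. code-block:: python

   # Import check for the three libraries that replace the `bq` CLI
   import httplib2     # HTTP transport used by the Google API client
   import gflags       # installed by python-gflags
   import apiclient    # installed by google-api-python-client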
2 changes: 1 addition & 1 deletion ci/requirements-3.4.txt
@@ -5,7 +5,7 @@ xlsxwriter
xlrd
html5lib
numpy==1.8.0
-cython==0.20.0
+cython==0.20.2
scipy==0.13.3
numexpr==2.4
tables==3.1.0
22 changes: 22 additions & 0 deletions doc/source/cookbook.rst
@@ -663,3 +663,25 @@ To globally provide aliases for axis names, one can define these 2 functions:
df2 = DataFrame(randn(3,2),columns=['c1','c2'],index=['i1','i2','i3'])
df2.sum(axis='myaxis2')
clear_axis_alias(DataFrame,'columns', 'myaxis2')
Creating Example Data
---------------------

To create a dataframe from every combination of some given values, like R's ``expand.grid()``
function, we can create a dict where the keys are column names and the values are lists
of the data values:

.. ipython:: python

   import itertools

   def expand_grid(data_dict):
       rows = itertools.product(*data_dict.values())
       return pd.DataFrame.from_records(rows, columns=data_dict.keys())

   df = expand_grid({'height': [60, 70],
                     'weight': [100, 140, 180],
                     'sex': ['Male', 'Female']})
   df
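The commit list above also includes a ``MultiIndex.from_product`` optimization; the same cartesian product can be built that way. A minimal sketch, assuming pandas 0.13.1+ where ``from_product`` is available:

.. code-block:: python

   import pandas as pd

   # Build the product as a MultiIndex, then materialize it as columns
   idx = pd.MultiIndex.from_product(
       [[60, 70], [100, 140, 180], ['Male', 'Female']],
       names=['height', 'weight', 'sex'])
   df = pd.DataFrame(list(idx), columns=idx.names)  # 12 rows, 3 columns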
4 changes: 3 additions & 1 deletion doc/source/install.rst
@@ -112,7 +112,9 @@ Optional Dependencies
:func:`~pandas.io.clipboard.read_clipboard`. Most package managers on Linux
distributions will have xclip and/or xsel immediately available for
installation.
-* `Google bq Command Line Tool <https://developers.google.com/bigquery/bq-command-line-tool/>`__
+* Google's `python-gflags` and `google-api-python-client`
   * Needed for :mod:`~pandas.io.gbq`
+* `httplib2`
+   * Needed for :mod:`~pandas.io.gbq`
* One of the following combinations of libraries is needed to use the
top-level :func:`~pandas.io.html.read_html` function:
132 changes: 78 additions & 54 deletions doc/source/io.rst
@@ -98,8 +98,10 @@ They can take a number of arguments:
data. Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
pass ``header=0`` to be able to replace existing names. The header can be
a list of integers that specify row locations for a multi-index on the columns
-  E.g. [0,1,3]. Intervening rows that are not specified will be skipped.
-  (E.g. 2 in this example are skipped)
+  E.g. [0,1,3]. Intervening rows that are not specified will be
+  skipped (e.g. row 2 in this example is skipped). Note that this parameter
+  ignores commented lines, so ``header=0`` denotes the first line of
+  data rather than the first line of the file.
- ``skiprows``: A collection of numbers for rows in the file to skip. Can
also be an integer to skip the first ``n`` rows
- ``index_col``: column number, column name, or list of column numbers/names,
@@ -145,8 +147,12 @@ They can take a number of arguments:
  Acceptable values are 0, 1, 2, and 3 for QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC, and QUOTE_NONE, respectively.
- ``skipinitialspace`` : boolean, default ``False``, Skip spaces after delimiter
- ``escapechar`` : string, to specify how to escape quoted data
-- ``comment``: denotes the start of a comment and ignores the rest of the line.
-  Currently line commenting is not supported.
+- ``comment``: Indicates that the remainder of the line should not be parsed. If
+  found at the beginning of a line, the line will be ignored altogether. This
+  parameter must be a single character. Also, fully commented lines
+  are ignored by the parameter ``header`` but not by ``skiprows``. For example,
+  if ``comment='#'``, parsing '#empty\n1,2,3\na,b,c' with ``header=0`` will
+  result in '1,2,3' being treated as the header.
- ``nrows``: Number of rows to read out of the file. Useful to only read a
small portion of a large file
- ``iterator``: If True, return a ``TextFileReader`` to enable reading a file
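To make the multi-row ``header`` behaviour from the first hunk concrete, here is a minimal sketch; the data string and the Python 3 ``io.StringIO`` import are illustrative assumptions:

.. code-block:: python

   import pandas as pd
   from io import StringIO

   data = ('l0a,l0b,l0c\n'    # header row 0
           'l1a,l1b,l1c\n'    # header row 1
           'not,a,header\n'   # row 2 is not listed in `header`, so it is skipped
           'l3a,l3b,l3c\n'    # header row 3
           '1,2,3\n')
   df = pd.read_csv(StringIO(data), header=[0, 1, 3])
   df.columns  # three-level MultiIndex built from rows 0, 1 and 3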
@@ -252,6 +258,27 @@ after a delimiter:
data = 'a, b, c\n1, 2, 3\n4, 5, 6'
print(data)
pd.read_csv(StringIO(data), skipinitialspace=True)
Moreover, ``read_csv`` ignores any completely commented lines:

.. ipython:: python

   data = 'a,b,c\n# commented line\n1,2,3\n#another comment\n4,5,6'
   print(data)
   pd.read_csv(StringIO(data), comment='#')

.. note::

   The presence of ignored lines might create ambiguities involving line numbers;
   the parameter ``header`` uses row numbers (ignoring commented
   lines), while ``skiprows`` uses line numbers (including commented lines):

   .. ipython:: python

      data = '#comment\na,b,c\nA,B,C\n1,2,3'
      pd.read_csv(StringIO(data), comment='#', header=1)
      data = 'A,B,C\n#comment\na,b,c\n1,2,3'
      pd.read_csv(StringIO(data), comment='#', skiprows=2)
The parsers make every attempt to "do the right thing" and not be very
fragile. Type inference is a pretty big deal. So if a column can be coerced to
@@ -3373,83 +3400,80 @@ Google BigQuery (Experimental)
The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
analytics web service to simplify retrieving results from BigQuery tables
using SQL-like queries. Result sets are parsed into a pandas
-DataFrame with a shape derived from the source table. Additionally,
-DataFrames can be uploaded into BigQuery datasets as tables
-if the source datatypes are compatible with BigQuery ones.
+DataFrame with a shape and data types derived from the source table.
+Additionally, DataFrames can be appended to existing BigQuery tables if
+the destination table is the same shape as the DataFrame.

For specifics on the service itself, see `here <https://developers.google.com/bigquery/>`__

-As an example, suppose you want to load all data from an existing table
-`test_dataset.test_table` into BigQuery and pull it into a DataFrame.
+As an example, suppose you want to load all data from the existing BigQuery
+table `test_dataset.test_table` into a DataFrame using the
+:func:`~pandas.io.read_gbq` function.

.. code-block:: python

-   from pandas.io import gbq
    # Insert your BigQuery Project ID Here
-   # Can be found in the web console, or
-   # using the command line tool `bq ls`
+   # Can be found in the Google web console
    projectid = "xxxxxxxx"

-   data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
+   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
-The user will then be authenticated by the `bq` command line client -
-this usually involves the default browser opening to a login page,
-though the process can be done entirely from command line if necessary.
-Datasets and additional parameters can be either configured with `bq`,
-passed in as options to `read_gbq`, or set using Google's gflags (this
-is not officially supported by this module, though care was taken
-to ensure that they should be followed regardless of how you call the
-method).
+You will then be authenticated to the specified BigQuery account
+via Google's OAuth2 mechanism. In general, this is as simple as following the
+prompts in a browser window, which will be opened for you. Should the browser not
+be available, or fail to launch, a code will be provided to complete the process
+manually. Additional information on the authentication mechanism can be found
+`here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__.

-Additionally, you can define which column to use as an index as well as a preferred column order as follows:
+You can define which column from BigQuery to use as an index in the
+destination DataFrame as well as a preferred column order as follows:

.. code-block:: python
-   data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table',
+   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                              index_col='index_column_name',
-                             col_order='[col1, col2, col3,...]', project_id = projectid)
+                             col_order=['col1', 'col2', 'col3'], project_id = projectid)

-Finally, if you would like to create a BigQuery table, `my_dataset.my_table`, from the rows of DataFrame, `df`:
+Finally, you can append data to a BigQuery table from a pandas DataFrame
+using the :func:`~pandas.io.to_gbq` function. This function uses the
+Google streaming API, which requires that your destination table exists in
+BigQuery. Given that the BigQuery table already exists, your DataFrame should
+match the destination table in column order, structure, and data types.
+DataFrame indexes are not supported. By default, rows are streamed to
+BigQuery in chunks of 10,000 rows, but you can pass other chunk values
+via the ``chunksize`` argument. You can also see the progress of your
+post via the ``verbose`` flag, which defaults to ``True``. The HTTP
+response code of Google BigQuery can be successful (200) even if the
+append failed. For this reason, if there is a failure to append to the
+table, the complete error response from BigQuery is returned, which
+can be quite long given it provides a status for each row. You may want
+to start with smaller chunks to test that the size and types of your
+DataFrame match your destination table to make debugging simpler.

.. code-block:: python
-   df = pandas.DataFrame({'string_col_name' : ['hello'],
-                          'integer_col_name' : [1],
-                          'boolean_col_name' : [True]})
-   schema = ['STRING', 'INTEGER', 'BOOLEAN']
-   data_frame = gbq.to_gbq(df, 'my_dataset.my_table',
-                           if_exists='fail', schema = schema, project_id = projectid)
+   df.to_gbq('my_dataset.my_table', project_id = projectid)
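The ``chunksize`` and ``verbose`` options described above can be passed explicitly; a minimal sketch, assuming ``projectid`` is set as earlier and `my_dataset.my_table` already exists with a matching schema:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'string_col_name': ['hello'],
                      'integer_col_name': [1],
                      'boolean_col_name': [True]})

   # Stream in small chunks and report progress, to simplify debugging
   df.to_gbq('my_dataset.my_table', project_id=projectid,
             chunksize=1000, verbose=True)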
-To add more rows to this, simply:
+The BigQuery SQL query language has some oddities; see `here <https://developers.google.com/bigquery/query-reference>`__.

.. code-block:: python
While BigQuery uses SQL-like syntax, it has some important differences
from traditional databases both in functionality, API limitations (size and
qunatity of queries or uploads), and how Google charges for use of the service.
You should refer to Google documentation often as the service seems to
be changing and evolving. BiqQuery is best for analyzing large sets of
data quickly, but it is not a direct replacement for a transactional database.

-   df2 = pandas.DataFrame({'string_col_name' : ['hello2'],
-                           'integer_col_name' : [2],
-                           'boolean_col_name' : [False]})
-   data_frame = gbq.to_gbq(df2, 'my_dataset.my_table', if_exists='append', project_id = projectid)
.. note::

-   A default project id can be set using the command line:
-   `bq init`.

-   There is a hard cap on BigQuery result sets, at 128MB compressed. Also, the BigQuery SQL query language has some oddities,
-   see `here <https://developers.google.com/bigquery/query-reference>`__

-   You can access the management console to determine project id's by:
-   <https://code.google.com/apis/console/b/0/?noredirect>
+   You can access the management console to determine project id's by:
+   <https://code.google.com/apis/console/b/0/?noredirect>

.. warning::

-   To use this module, you will need a BigQuery account. See
-   <https://cloud.google.com/products/big-query> for details.

-   As of 1/28/14, a known bug is present that could possibly cause data duplication in the resultant dataframe. A fix is imminent,
-   but any client changes will not make it into 0.13.1. See:
-   http://stackoverflow.com/questions/20984592/bigquery-results-not-including-page-token/21009144?noredirect=1#comment32090677_21009144
+   To use this module, you will need a valid BigQuery account. See
+   <https://cloud.google.com/products/big-query> for details on the
+   service.

.. _io.stata:

2 changes: 1 addition & 1 deletion doc/source/remote_data.rst
@@ -52,7 +52,7 @@ Yahoo! Finance
f=web.DataReader("F", 'yahoo', start, end)
f.ix['2010-01-04']
-.. _remote_data.yahoo_Options:
+.. _remote_data.yahoo_options:

Yahoo! Finance Options
----------------------
7 changes: 4 additions & 3 deletions doc/source/timeseries.rst
@@ -1280,9 +1280,10 @@ To supply the time zone, you can use the ``tz`` keyword to ``date_range`` and
other functions. Dateutil time zone strings are distinguished from ``pytz``
time zones by starting with ``dateutil/``.

- In ``pytz`` you can find a list of common (and less common) time zones using
  ``from pytz import common_timezones, all_timezones``.
- ``dateutil`` uses the OS timezones so there isn't a fixed list available. For
  common zones, the names are the same as ``pytz``.
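For instance, both spellings can be passed to ``date_range``; a minimal sketch, assuming ``pytz`` and ``dateutil`` are both installed:

.. code-block:: python

   import pandas as pd

   # pytz-style and dateutil-style names for the same zone
   rng_pytz = pd.date_range('2014-07-01', periods=3, tz='Europe/London')
   rng_dateutil = pd.date_range('2014-07-01', periods=3,
                                tz='dateutil/Europe/London')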

.. ipython:: python
@@ -1448,7 +1449,7 @@ Elements can be set to ``NaT`` using ``np.nan`` analogously to datetimes
y[1] = np.nan
y
-Operands can also appear in a reversed order (a singluar object operated with a Series)
+Operands can also appear in a reversed order (a singular object operated with a Series)

.. ipython:: python
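The reversed-operand behaviour noted in this last hunk can be illustrated briefly; a minimal sketch, assuming 0.14-era datetime64 Series arithmetic:

.. code-block:: python

   import pandas as pd
   from datetime import datetime

   s = pd.Series(pd.to_datetime(['2014-07-01', '2014-07-02']))
   datetime(2014, 7, 4) - s   # singular object on the left-hand side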