Merge pull request #230 from Roche/dev

version 1.2.2
Roche · Jun 1, 2023 · 03214b8
2 parents 06dbeec + fe3f86d
commit 03214b8

Showing 18 changed files with 1,179 additions and 1,002 deletions.
diff --git a/CITATION.cff b/CITATION.cff
@@ -5,7 +5,7 @@ authors:
   given-names: "Otto"
   orcid: "https://orcid.org/0000-0002-3363-9287"
 title: "Pyreadstat"
-version: 1.2.1
+version: 1.2.2
 doi: 10.5281/zenodo.6612282
 date-released: 2018-09-24
 url: "https://github.com/Roche/pyreadstat"
diff --git a/README.md b/README.md
@@ -330,7 +330,8 @@ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', usecols=["variab
 #### Reading files in parallel processes
 
 A challenge when reading large files is the time consumed in the operation. In order to alleviate this
-pyreadstat provides a function "read_file_multiprocessing" to read a file in parallel processes using the python multiprocessing library. As it reads the whole file in one go you need to have enough RAM for the operation. If
+pyreadstat provides a function "read\_file\_multiprocessing" to read a file in parallel processes using
+ the python multiprocessing library. As it reads the whole file in one go you need to have enough RAM for the operation. If
 that is not the case look at Reading rows in chunks (next section)
 
 Speed ups in the process will depend on a number of factors such as number of processes available, RAM, 
@@ -351,6 +352,11 @@ import multiprocessing
 num_processes = multiprocessing.cpu_count()
 ```
 
+**Notes for Xport, Por and some defective SAV files not having the number of rows in the metadata**
+1. In all Xport and Por files, and in some defective SAV files, the number of rows cannot be determined from the metadata. In such cases,
+   set the parameter num\_rows to a number equal to or larger than the number of rows in the dataset. This number can be obtained
+   by reading the file without multiprocessing, by opening it in another application, etc.
+
 **Notes for windows**
 
 1. For this to work you must include a __name__ == "__main__" section in your script. See [this issue](#85)
@@ -410,6 +416,9 @@ for df, meta in reader:
     # do some cool calculations here for the chunk
 ```
 
+**If using multiprocessing, please read the notes in the previous section regarding Xport, Por and some defective SAV files not
+having the number of rows in the metadata**
+
 **For Windows, please check the notes on the previous section reading files in parallel processes**
 
 #### Reading value labels

diff --git a/change_log.md b/change_log.md
@@ -1,3 +1,7 @@
+# 1.2.2 (github, pypi and conda 2023.06.01)
+* added num_rows to multiprocessing to allow processing of xport, por and 
+  sav files not having the number of rows in the metadata.
+
 # 1.2.1 (github, pypi and conda 2023.02.22)
 * Readstat source updated to version 1.1.9
 * introduced recognition for pandas datatype datetime64[ns, UTC] and other datetime64 types when writing, 

diff --git a/docs/_build/doctrees/environment.pickle b/docs/_build/doctrees/environment.pickle
diff --git a/docs/_build/doctrees/index.doctree b/docs/_build/doctrees/index.doctree
diff --git a/docs/_build/html/.buildinfo b/docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: dc63e4405a0437fb9efe8c4f5ffb3848
+config: 321f78fa88a773e9f9ed9c32944f2233
 tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/docs/_build/html/_static/documentation_options.js b/docs/_build/html/_static/documentation_options.js
@@ -1,6 +1,6 @@
 var DOCUMENTATION_OPTIONS = {
     URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
-    VERSION: '1.2.1',
+    VERSION: '1.2.2',
     LANGUAGE: 'None',
     COLLAPSE_INDEX: false,
     BUILDER: 'html',

diff --git a/docs/_build/html/genindex.html b/docs/_build/html/genindex.html
@@ -3,7 +3,7 @@
 <head>
   <meta charset="utf-8" />
   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-  <title>Index &mdash; pyreadstat 1.2.1 documentation</title>
+  <title>Index &mdash; pyreadstat 1.2.2 documentation</title>
       <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
       <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
   <!--[if lt IE 9]>

diff --git a/docs/_build/html/index.html b/docs/_build/html/index.html
@@ -4,7 +4,7 @@
   <meta charset="utf-8" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" />
 
   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-  <title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.2.1 documentation</title>
+  <title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.2.2 documentation</title>
       <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
       <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
   <!--[if lt IE 9]>
@@ -172,6 +172,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 <dt class="sig sig-object py" id="pyreadstat.pyreadstat.read_file_in_chunks">
 <span class="sig-prename descclassname"><span class="pre">pyreadstat.pyreadstat.</span></span><span class="sig-name descname"><span class="pre">read_file_in_chunks</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyreadstat.pyreadstat.read_file_in_chunks" title="Permalink to this definition">¶</a></dt>
 <dd><p>Returns a generator that allows reading a file in chunks.</p>
+<p>If using multiprocessing, for Xport, Por and some defective sav files where the number of rows in the dataset cannot be obtained from the metadata,
+the parameter num_rows must be set to a number equal to or larger than the number of rows in the dataset. That information must
+be obtained by the user before running this function.</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
 <dd class="field-odd"><ul class="simple">
@@ -182,6 +185,11 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 <li><p><strong>limit</strong> (<em>integer</em><em>, </em><em>optional</em>) – stop reading the file after certain number of rows, will be added to offset</p></li>
 <li><p><strong>multiprocess</strong> (<em>bool</em><em>, </em><em>optional</em>) – use multiprocessing to read each chunk?</p></li>
 <li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – in case multiprocess is true, how many workers/processes to spawn?</p></li>
+<li><p><strong>num_rows</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of rows in the dataset. If using multiprocessing, it is required for files where
+the number of rows cannot be obtained from the metadata, such as xport, por and
+some defective sav files. The user must obtain this value by reading the file without multiprocessing first or by any other means. A number
+larger than the actual number of rows will work as well. Discarded if the number of rows can be obtained from the metadata or if
+multiprocessing is not used.</p></li>
 <li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function. row_limit and row_offset will be discarded if present.</p></li>
 </ul>
 </dd>
@@ -200,14 +208,18 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
 <dt class="sig sig-object py" id="pyreadstat.pyreadstat.read_file_multiprocessing">
 <span class="sig-prename descclassname"><span class="pre">pyreadstat.pyreadstat.</span></span><span class="sig-name descname"><span class="pre">read_file_multiprocessing</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyreadstat.pyreadstat.read_file_multiprocessing" title="Permalink to this definition">¶</a></dt>
 <dd><p>Reads a file in parallel using multiprocessing.
-Xport and Por files are not supported as they do not have the number of rows recorded in the metadata,
-information needed for this function.</p>
+For Xport, Por and some defective sav files where the number of rows in the dataset cannot be obtained from the metadata,
+the parameter num_rows must be set to a number equal to or larger than the number of rows in the dataset. That information must
+be obtained by the user before running this function.</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
 <dd class="field-odd"><ul class="simple">
 <li><p><strong>read_function</strong> (<em>pyreadstat function</em>) – a pyreadstat reading function</p></li>
 <li><p><strong>file_path</strong> (<em>string</em>) – path to the file to be read</p></li>
 <li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of processes to spawn, by default the min 4 and the max cores on the computer</p></li>
+<li><p><strong>num_rows</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of rows in the dataset. Required for files where the number of rows cannot be obtained from the metadata, such as xport, por and
+some defective sav files. The user must obtain this value by reading the file without multiprocessing first or by any other means. A number
+larger than the actual number of rows will work as well. Discarded if the number of rows can be obtained from the metadata.</p></li>
 <li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function.</p></li>
 </ul>
 </dd>

diff --git a/docs/_build/html/py-modindex.html b/docs/_build/html/py-modindex.html
@@ -3,7 +3,7 @@
 <head>
   <meta charset="utf-8" />
   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-  <title>Python Module Index &mdash; pyreadstat 1.2.1 documentation</title>
+  <title>Python Module Index &mdash; pyreadstat 1.2.2 documentation</title>
       <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
       <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
   <!--[if lt IE 9]>

diff --git a/docs/_build/html/search.html b/docs/_build/html/search.html
@@ -3,7 +3,7 @@
 <head>
   <meta charset="utf-8" />
   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-  <title>Search &mdash; pyreadstat 1.2.1 documentation</title>
+  <title>Search &mdash; pyreadstat 1.2.2 documentation</title>
       <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
       <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
 

diff --git a/docs/_build/html/searchindex.js b/docs/_build/html/searchindex.js
diff --git a/docs/conf.py b/docs/conf.py
@@ -26,7 +26,7 @@
 # The short X.Y version
 version = ''
 # The full version, including alpha/beta/rc tags
-release = '1.2.1'
+release = '1.2.2'
 
 
 # -- General configuration ---------------------------------------------------

diff --git a/pyreadstat/__init__.py b/pyreadstat/__init__.py
@@ -20,5 +20,5 @@
 from .pyreadstat import read_file_in_chunks, read_file_multiprocessing
 from ._readstat_parser import ReadstatError, metadata_container
 
-__version__ = "1.2.1"
+__version__ = "1.2.2"
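
Reviewer note: the README and docstring changes above state that `num_rows` may be equal to or larger than the true number of rows. The sketch below is a toy model of why an overestimate can be harmless when rows are split into offset/limit ranges per worker. It is not pyreadstat's implementation: `plan_chunks`, `read_rows` and `read_in_chunks` are hypothetical helpers, a plain Python list stands in for the file, and the chunks are read sequentially rather than in separate processes.

```python
def plan_chunks(num_rows, num_processes):
    """Split the row range [0, num_rows) into contiguous (offset, limit)
    pairs, one per worker. Hypothetical helper, for illustration only."""
    if num_rows < 0 or num_processes < 1:
        raise ValueError("num_rows must be >= 0 and num_processes >= 1")
    base, extra = divmod(num_rows, num_processes)
    chunks = []
    offset = 0
    for i in range(num_processes):
        # The first `extra` workers take one additional row each.
        limit = base + (1 if i < extra else 0)
        chunks.append((offset, limit))
        offset += limit
    return chunks


def read_rows(data, offset, limit):
    """Stand-in for a reader that accepts a row offset and a row limit.
    Asking for rows past the end of the data simply returns fewer rows."""
    return data[offset:offset + limit]


def read_in_chunks(data, num_rows, num_processes):
    """Read every planned chunk and concatenate the results in order."""
    rows = []
    for offset, limit in plan_chunks(num_rows, num_processes):
        rows.extend(read_rows(data, offset, limit))
    return rows


data = list(range(10))                 # a "file" with 10 rows

exact = read_in_chunks(data, 10, 4)    # num_rows equal to the true count
over = read_in_chunks(data, 25, 4)     # num_rows larger than the true count
under = read_in_chunks(data, 7, 4)     # num_rows smaller than the true count
```

In this toy model, `exact` and `over` both contain all 10 rows, because the chunks planned past the end of the data come back empty, while `under` contains only the first 7 rows. Whether pyreadstat truncates in the same way for an underestimate is not stated in this commit. With the real library, the corresponding call for an Xport file would pass the parameter as documented above, for example `pyreadstat.read_file_multiprocessing(pyreadstat.read_xport, file_path, num_rows=...)`, with the value obtained beforehand by the user.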