Merge pull request #230 from Roche/dev
version 1.2.2
ofajardo authored Jun 1, 2023
2 parents 06dbeec + fe3f86d commit 03214b8
Showing 18 changed files with 1,179 additions and 1,002 deletions.
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -5,7 +5,7 @@ authors:
given-names: "Otto"
orcid: "https://orcid.org/0000-0002-3363-9287"
title: "Pyreadstat"
version: 1.2.1
version: 1.2.2
doi: 10.5281/zenodo.6612282
date-released: 2018-09-24
url: "https://github.com/Roche/pyreadstat"
11 changes: 10 additions & 1 deletion README.md
@@ -330,7 +330,8 @@ df, meta = pyreadstat.read_sas7bdat('/path/to/a/file.sas7bdat', usecols=["variab
#### Reading files in parallel processes

A challenge when reading large files is the time consumed in the operation. In order to alleviate this
pyreadstat provides a function "read_file_multiprocessing" to read a file in parallel processes using the python multiprocessing library. As it reads the whole file in one go you need to have enough RAM for the operation. If
pyreadstat provides a function "read\_file\_multiprocessing" to read a file in parallel processes using
the python multiprocessing library. As it reads the whole file in one go you need to have enough RAM for the operation. If
that is not the case look at Reading rows in chunks (next section)

Speed ups in the process will depend on a number of factors such as number of processes available, RAM,
@@ -351,6 +352,11 @@ import multiprocessing
num_processes = multiprocessing.cpu_count()
```

**Notes for Xport, Por and some defective SAV files not having the number of rows in the metadata**
1. For all Xport and Por files, and for some defective SAV files, the number of rows cannot be determined from the metadata. In such cases,
set the parameter num\_rows to a number equal to or larger than the number of rows in the dataset. This number can be obtained
by reading the file without multiprocessing, by opening it in another application, etc.
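
As a minimal sketch of the note above, the helper below passes num\_rows to read\_file\_multiprocessing for an Xport file. The helper name `read_xport_parallel` and the fallback of counting rows with one ordinary read are illustrative assumptions, not part of pyreadstat; only `read_xport`, `read_file_multiprocessing` and the `num_processes` / `num_rows` parameters come from the library.

```python
def read_xport_parallel(file_path, num_rows=None, num_processes=None):
    """Read an Xport file in parallel, supplying the row count explicitly.

    Hypothetical helper: Xport metadata does not record the number of rows,
    so num_rows has to come from the caller or from a first plain read.
    """
    # Imported lazily so this sketch can be loaded without the package installed.
    import pyreadstat

    if num_rows is None:
        # Assumption: one single-process read is affordable, just to count rows.
        df, _ = pyreadstat.read_xport(file_path)
        num_rows = len(df)

    # A num_rows larger than the real row count is also accepted.
    return pyreadstat.read_file_multiprocessing(
        pyreadstat.read_xport,
        file_path,
        num_processes=num_processes,
        num_rows=num_rows,
    )
```

A call such as `df, meta = read_xport_parallel("/path/to/a/file.xpt", num_rows=1_000_000)` would then skip the counting pass, provided the number given is at least the real row count. The path and the number here are placeholders.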

**Notes for windows**

1. For this to work you must include a __name__ == "__main__" section in your script. See [this issue](#85)
@@ -410,6 +416,9 @@ for df, meta in reader:
# do some cool calculations here for the chunk
```

**If using multiprocessing, please read the notes in the previous section regarding Xport, Por and some defective SAV files not
having the number of rows in the metadata**

**For Windows, please check the notes on the previous section reading files in parallel processes**
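
The same parameter applies when chunks are read with multiprocessing. The sketch below wraps read\_file\_in\_chunks for a Por file; the generator name `iter_por_chunks` and its default values are illustrative assumptions, while `chunksize`, `multiprocess`, `num_processes` and `num_rows` are the parameters documented for read\_file\_in\_chunks.

```python
def iter_por_chunks(file_path, num_rows, chunksize=10000, num_processes=4):
    """Yield (df, meta) chunks of a Por file, each chunk read in parallel.

    Hypothetical helper: num_rows must be equal to or larger than the real
    number of rows, because Por metadata does not record it.
    """
    # Imported lazily so this sketch can be loaded without the package installed.
    import pyreadstat

    reader = pyreadstat.read_file_in_chunks(
        pyreadstat.read_por,
        file_path,
        chunksize=chunksize,
        multiprocess=True,
        num_processes=num_processes,
        num_rows=num_rows,
    )
    for df, meta in reader:
        yield df, meta
```

Each yielded pair can be processed exactly like the chunks in the loop shown earlier in this section.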

#### Reading value labels
4 changes: 4 additions & 0 deletions change_log.md
@@ -1,3 +1,7 @@
# 1.2.2 (github, pypi and conda 2023.06.01)
* added num_rows to multiprocessing to allow processing of xport, por and
sav files not having the number of rows in the metadata.

# 1.2.1 (github, pypi and conda 2023.02.22)
* Readstat source updated to version 1.1.9
* introduced recognition for pandas datatype datetime64[ns, UTC] and other datetime64 types when writing,
Binary file modified docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/_build/doctrees/index.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: dc63e4405a0437fb9efe8c4f5ffb3848
config: 321f78fa88a773e9f9ed9c32944f2233
tags: 645f666f9bcd5a90fca523b33c5a78b7
2 changes: 1 addition & 1 deletion docs/_build/html/_static/documentation_options.js
@@ -1,6 +1,6 @@
var DOCUMENTATION_OPTIONS = {
URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
VERSION: '1.2.1',
VERSION: '1.2.2',
LANGUAGE: 'None',
COLLAPSE_INDEX: false,
BUILDER: 'html',
2 changes: 1 addition & 1 deletion docs/_build/html/genindex.html
@@ -3,7 +3,7 @@
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Index &mdash; pyreadstat 1.2.1 documentation</title>
<title>Index &mdash; pyreadstat 1.2.2 documentation</title>
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
<!--[if lt IE 9]>
18 changes: 15 additions & 3 deletions docs/_build/html/index.html
@@ -4,7 +4,7 @@
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" />

<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.2.1 documentation</title>
<title>Welcome to pyreadstat’s documentation! &mdash; pyreadstat 1.2.2 documentation</title>
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
<!--[if lt IE 9]>
@@ -172,6 +172,9 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
<dt class="sig sig-object py" id="pyreadstat.pyreadstat.read_file_in_chunks">
<span class="sig-prename descclassname"><span class="pre">pyreadstat.pyreadstat.</span></span><span class="sig-name descname"><span class="pre">read_file_in_chunks</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyreadstat.pyreadstat.read_file_in_chunks" title="Permalink to this definition"></a></dt>
<dd><p>Returns a generator that allows reading a file in chunks.</p>
<p>If using multiprocessing, for Xport, Por and some defective sav files where the number of rows in the dataset cannot be obtained from the metadata,
the parameter num_rows must be set to a number equal to or larger than the number of rows in the dataset. That information must
be obtained by the user before running this function.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
@@ -182,6 +185,11 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
<li><p><strong>limit</strong> (<em>integer</em><em>, </em><em>optional</em>) – stop reading the file after certain number of rows, will be added to offset</p></li>
<li><p><strong>multiprocess</strong> (<em>bool</em><em>, </em><em>optional</em>) – use multiprocessing to read each chunk?</p></li>
<li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – in case multiprocess is true, how many workers/processes to spawn?</p></li>
<li><p><strong>num_rows</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of rows in the dataset. If using multiprocessing it is required for files where
the number of rows cannot be obtained from the metadata, such as xport, por and
some defective sav files. The user must obtain this value by reading the file without multiprocessing first or by any other means. A number
larger than the actual number of rows will work as well. Discarded if the number of rows can be obtained from the metadata or if not using
multiprocessing.</p></li>
<li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function. row_limit and row_offset will be discarded if present.</p></li>
</ul>
</dd>
@@ -200,14 +208,18 @@ <h1>Metadata Object Description<a class="headerlink" href="#metadata-object-desc
<dt class="sig sig-object py" id="pyreadstat.pyreadstat.read_file_multiprocessing">
<span class="sig-prename descclassname"><span class="pre">pyreadstat.pyreadstat.</span></span><span class="sig-name descname"><span class="pre">read_file_multiprocessing</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyreadstat.pyreadstat.read_file_multiprocessing" title="Permalink to this definition"></a></dt>
<dd><p>Reads a file in parallel using multiprocessing.
Xport and Por files are not supported as they do not have the number of rows recorded in the metadata,
information needed for this function.</p>
For Xport, Por and some defective sav files where the number of rows in the dataset cannot be obtained from the metadata,
the parameter num_rows must be set to a number equal to or larger than the number of rows in the dataset. That information must
be obtained by the user before running this function.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>read_function</strong> (<em>pyreadstat function</em>) – a pyreadstat reading function</p></li>
<li><p><strong>file_path</strong> (<em>string</em>) – path to the file to be read</p></li>
<li><p><strong>num_processes</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of processes to spawn, by default the smaller of 4 and the number of cores on the computer</p></li>
<li><p><strong>num_rows</strong> (<em>integer</em><em>, </em><em>optional</em>) – number of rows in the dataset. Required for files where the number of rows cannot be obtained from the metadata, such as xport, por and
some defective sav files. The user must obtain this value by reading the file without multiprocessing first or by any other means. A number
larger than the actual number of rows will work as well. Discarded if the number of rows can be obtained from the metadata.</p></li>
<li><p><strong>kwargs</strong> (<em>dict</em><em>, </em><em>optional</em>) – any other keyword argument to pass to the read_function.</p></li>
</ul>
</dd>
2 changes: 1 addition & 1 deletion docs/_build/html/py-modindex.html
@@ -3,7 +3,7 @@
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Python Module Index &mdash; pyreadstat 1.2.1 documentation</title>
<title>Python Module Index &mdash; pyreadstat 1.2.2 documentation</title>
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
<!--[if lt IE 9]>
2 changes: 1 addition & 1 deletion docs/_build/html/search.html
@@ -3,7 +3,7 @@
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Search &mdash; pyreadstat 1.2.1 documentation</title>
<title>Search &mdash; pyreadstat 1.2.2 documentation</title>
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/css/theme.css" type="text/css" />

2 changes: 1 addition & 1 deletion docs/_build/html/searchindex.js


2 changes: 1 addition & 1 deletion docs/conf.py
@@ -26,7 +26,7 @@
# The short X.Y version
version = ''
# The full version, including alpha/beta/rc tags
release = '1.2.1'
release = '1.2.2'


# -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion pyreadstat/__init__.py
@@ -20,5 +20,5 @@
from .pyreadstat import read_file_in_chunks, read_file_multiprocessing
from ._readstat_parser import ReadstatError, metadata_container

__version__ = "1.2.1"
__version__ = "1.2.2"
