Merge branch 'release-2.1.0'
mpenkov committed Jul 1, 2020
2 parents 8ed29ae + 084fe9c commit a442186
Showing 22 changed files with 1,486 additions and 59 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,12 @@
# Unreleased

# 2.1.0, 1 July 2020

- Azure storage blob support ([@nclsmitchell](https://github.com/nclsmitchell) and [@petedannemann](https://github.com/petedannemann))
- Correctly pass `newline` parameter to built-in `open` function (PR [#478](https://github.com/RaRe-Technologies/smart_open/pull/478), [@burkovae](https://github.com/burkovae))
- Ensure GCS objects always have a .name attribute (PR [#506](https://github.com/RaRe-Technologies/smart_open/pull/506), [@todor-markov](https://github.com/todor-markov))
- Use exception chaining to convey the original cause of the exception (PR [#508](https://github.com/RaRe-Technologies/smart_open/pull/508), [@cool-RR](https://github.com/cool-RR))

# 2.0.0, 27 April 2020, "Python 3"

- **This version supports Python 3 only** (3.5+).
43 changes: 42 additions & 1 deletion README.rst
@@ -20,7 +20,7 @@ smart_open — utils for streaming large files in Python
What?
=====

``smart_open`` is a Python 3 library for **efficient streaming of very large files** from/to storages such as S3, GCS, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.
``smart_open`` is a Python 3 library for **efficient streaming of very large files** from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.

``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.
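
For instance, a minimal sketch (the bucket and key names below are placeholders):

.. code-block:: python

    from smart_open import open

    # works exactly like the built-in open() for local files
    with open('/tmp/example.txt') as fin:
        print(fin.read())

    # streams from S3, decompressing gzip on the fly
    with open('s3://my_bucket/my_file.txt.gz') as fin:
        for line in fin:
            print(line)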

@@ -92,6 +92,7 @@ Other examples of URLs that ``smart_open`` accepts::
    s3://my_key:my_secret@my_bucket/my_key
    s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
    gs://my_bucket/my_blob
    azure://my_bucket/my_blob
    hdfs:///path/file
    hdfs://path/file
    webhdfs://host:port/path/file
@@ -203,6 +204,22 @@ More examples
    with open('gs://my_bucket/my_file.txt', 'wb') as fout:
        fout.write(b'hello world')

    # stream from Azure Blob Storage
    connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
    transport_params = {
        'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
    }
    for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params):
        print(line)

    # stream content *into* Azure Blob Storage (write mode):
    connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
    transport_params = {
        'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
    }
    with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout:
        fout.write(b'hello world')

Supported Compression Formats
-----------------------------

@@ -242,6 +259,7 @@ Transport-specific Options
- SSH, SCP and SFTP
- WebHDFS
- GCS
- Azure Blob Storage

Each option involves setting up its own set of parameters.
For example, for accessing S3, you often need to set up authentication, like API keys or a profile name.
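
For instance, a sketch of supplying S3 credentials through a custom ``boto3`` session (the profile name is a placeholder and assumes such a profile exists in your AWS config):

.. code-block:: python

    import boto3
    from smart_open import open

    # route all S3 calls through the named profile's credentials
    session = boto3.Session(profile_name='my-profile')
    fin = open('s3://my_bucket/my_key.txt', transport_params=dict(session=session))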
@@ -372,6 +390,29 @@ and pass it to the Client. To create an API token for use in the example below,
    client = Client(credentials=credentials)
    fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))

Azure Credentials
-----------------
``smart_open`` uses the ``azure-storage-blob`` library to talk to Azure Blob Storage.
By default, ``smart_open`` will defer to ``azure-storage-blob`` and let it take care of the credentials.

Azure Blob Storage does not have any way of inferring credentials; therefore, passing an
``azure.storage.blob.BlobServiceClient`` object as a transport parameter to the ``open`` function is required.
You can `customize the credentials <https://docs.microsoft.com/en-us/azure/storage/common/storage-samples-python#authentication>`__
when constructing the client, and ``smart_open`` will then use that client when talking to Azure Blob Storage.
To follow along with the example below, `refer to Azure's guide <https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python#copy-your-credentials-from-the-azure-portal>`__
to setting up authentication.

.. code-block:: python

    import os
    from azure.storage.blob import BlobServiceClient

    azure_storage_connection_string = os.environ['AZURE_STORAGE_CONNECTION_STRING']
    client = BlobServiceClient.from_connection_string(azure_storage_connection_string)
    fin = open('azure://my_container/my_blob.txt', transport_params=dict(client=client))

If you need more credential options, refer to the
`Azure Storage authentication guide <https://docs.microsoft.com/en-us/azure/storage/common/storage-samples-python#authentication>`__.
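
As one alternative, a sketch that builds the client from a credential object instead of a connection string; it assumes the separate ``azure-identity`` package is installed, and the account name is a placeholder:

.. code-block:: python

    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    # picks up credentials from the environment, a managed identity, or az login
    client = BlobServiceClient(
        account_url='https://myaccount.blob.core.windows.net',
        credential=DefaultAzureCredential(),
    )
    fin = open('azure://my_container/my_blob.txt', transport_params=dict(client=client))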

File-like Binary Streams
------------------------

15 changes: 13 additions & 2 deletions help.txt
@@ -18,6 +18,7 @@ DESCRIPTION
* `register_compressor()`, which registers callbacks for transparent compressor handling
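
    For illustration, a sketch of registering a custom compressor for `.xz`
    files (the `.xz` handler below is an assumption, not built in):

        import lzma

        import smart_open

        def _handle_xz(file_obj, mode):
            # wrap the raw byte stream in an LZMAFile for transparent (de)compression
            return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

        smart_open.register_compressor('.xz', _handle_xz)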

PACKAGE CONTENTS
azure
bytebuffer
compression
concurrency
@@ -85,6 +86,15 @@ FUNCTIONS

smart_open supports the following transport mechanisms:

azure (smart_open/azure.py)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Implements file-like objects for reading and writing to/from Azure Blob Storage.

buffer_size: int, optional
    The buffer size to use when performing I/O. For reading only.
min_part_size: int, optional
    The minimum part size for multipart uploads. For writing only.
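
    For example, a sketch of passing these alongside the client (the sizes are
    arbitrary, and `connect_str` is a placeholder connection string):

        import azure.storage.blob

        import smart_open

        client = azure.storage.blob.BlobServiceClient.from_connection_string(connect_str)

        # reading: a larger buffer means fewer round trips to the service
        fin = smart_open.open(
            'azure://my_container/my_blob.txt', 'rb',
            transport_params=dict(client=client, buffer_size=4 * 1024**2),
        )

        # writing: bytes accumulate until min_part_size before each part upload
        fout = smart_open.open(
            'azure://my_container/my_blob.txt', 'wb',
            transport_params=dict(client=client, min_part_size=4 * 1024**2),
        )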

file (smart_open/local_file.py)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Implements the transport for the file:// schema.
@@ -248,6 +258,7 @@ FUNCTIONS
-----
Supported URI schemes are:

* azure
* file
* gs
* hdfs
@@ -356,9 +367,9 @@ DATA
__all__ = ['open', 'parse_uri', 'register_compressor', 's3_iter_bucket...

VERSION
2.0.0
2.1.0

FILE
/home/misha/git/smart_open/smart_open/__init__.py
/Users/misha/git/smart_open/smart_open/__init__.py


137 changes: 137 additions & 0 deletions integration-tests/test_azure.py
@@ -0,0 +1,137 @@
# -*- coding: utf-8 -*-
import io
import os

import azure.storage.blob

from pytest import fixture

import smart_open

_AZURE_CONTAINER = os.environ.get('SO_AZURE_CONTAINER')
_AZURE_STORAGE_CONNECTION_STRING = os.environ.get('AZURE_STORAGE_CONNECTION_STRING')
_FILE_PREFIX = '%s://%s' % (smart_open.azure.SCHEME, _AZURE_CONTAINER)

assert _AZURE_CONTAINER is not None, 'please set the SO_AZURE_CONTAINER environment variable'
assert _AZURE_STORAGE_CONNECTION_STRING is not None, \
    'please set the AZURE_STORAGE_CONNECTION_STRING environment variable'


@fixture
def client():
    # type: () -> azure.storage.blob.BlobServiceClient
    return azure.storage.blob.BlobServiceClient.from_connection_string(_AZURE_STORAGE_CONNECTION_STRING)


def initialize_bucket(client):
    # start each test with an empty container
    container_client = client.get_container_client(_AZURE_CONTAINER)
    blobs = container_client.list_blobs()
    for blob in blobs:
        container_client.delete_blob(blob=blob)


def write_read(key, content, write_mode, read_mode, **kwargs):
    with smart_open.open(key, write_mode, **kwargs) as fout:
        fout.write(content)
    with smart_open.open(key, read_mode, **kwargs) as fin:
        return fin.read()


def read_length_prefixed_messages(key, read_mode, **kwargs):
    result = io.BytesIO()

    with smart_open.open(key, read_mode, **kwargs) as fin:
        # each message is prefixed with a single byte holding its length
        length_byte = fin.read(1)
        while len(length_byte):
            result.write(length_byte)
            msg = fin.read(ord(length_byte))
            result.write(msg)
            length_byte = fin.read(1)
    return result.getvalue()


def test_azure_readwrite_text(benchmark, client):
    initialize_bucket(client)

    key = _FILE_PREFIX + '/sanity.txt'
    text = 'с гранатою в кармане, с чекою в руке'
    actual = benchmark(
        write_read, key, text, 'w', 'r', encoding='utf-8', transport_params=dict(client=client)
    )
    assert actual == text


def test_azure_readwrite_text_gzip(benchmark, client):
    initialize_bucket(client)

    key = _FILE_PREFIX + '/sanity.txt.gz'
    text = 'не чайки здесь запели на знакомом языке'
    actual = benchmark(
        write_read, key, text, 'w', 'r', encoding='utf-8', transport_params=dict(client=client)
    )
    assert actual == text


def test_azure_readwrite_binary(benchmark, client):
    initialize_bucket(client)

    key = _FILE_PREFIX + '/sanity.txt'
    binary = b'this is a test'
    actual = benchmark(write_read, key, binary, 'wb', 'rb', transport_params=dict(client=client))
    assert actual == binary


def test_azure_readwrite_binary_gzip(benchmark, client):
    initialize_bucket(client)

    key = _FILE_PREFIX + '/sanity.txt.gz'
    binary = b'this is a test'
    actual = benchmark(write_read, key, binary, 'wb', 'rb', transport_params=dict(client=client))
    assert actual == binary


def test_azure_performance(benchmark, client):
    initialize_bucket(client)

    # 128K writes of 8 bytes each = 1 MiB
    one_megabyte = io.BytesIO()
    for _ in range(1024 * 128):
        one_megabyte.write(b'01234567')
    one_megabyte = one_megabyte.getvalue()

    key = _FILE_PREFIX + '/performance.txt'
    actual = benchmark(write_read, key, one_megabyte, 'wb', 'rb', transport_params=dict(client=client))
    assert actual == one_megabyte


def test_azure_performance_gz(benchmark, client):
    initialize_bucket(client)

    one_megabyte = io.BytesIO()
    for _ in range(1024 * 128):
        one_megabyte.write(b'01234567')
    one_megabyte = one_megabyte.getvalue()

    key = _FILE_PREFIX + '/performance.txt.gz'
    actual = benchmark(write_read, key, one_megabyte, 'wb', 'rb', transport_params=dict(client=client))
    assert actual == one_megabyte


def test_azure_performance_small_reads(benchmark, client):
    initialize_bucket(client)

    ONE_MIB = 1024**2
    one_megabyte_of_msgs = io.BytesIO()
    msg = b'\x0f' + b'0123456789abcde'  # a length-prefixed "message"
    for _ in range(0, ONE_MIB, len(msg)):
        one_megabyte_of_msgs.write(msg)
    one_megabyte_of_msgs = one_megabyte_of_msgs.getvalue()

    key = _FILE_PREFIX + '/many_reads_performance.bin'

    with smart_open.open(key, 'wb', transport_params=dict(client=client)) as fout:
        fout.write(one_megabyte_of_msgs)

    actual = benchmark(
        read_length_prefixed_messages, key, 'rb', buffering=ONE_MIB, transport_params=dict(client=client)
    )
    assert actual == one_megabyte_of_msgs
5 changes: 3 additions & 2 deletions release/README.md
@@ -25,9 +25,10 @@ First, check that the [latest commit](https://github.com/RaRe-Technologies/smart

For the subsequent steps to work, you will need to be in the top-level directory of the repo (e.g. /home/misha/git/smart_open).

Prepare the release, replacing 1.2.3 with the actual version of the new release:
Prepare the release, replacing 2.3.4 with the actual version of the new release:

    bash release/prepare.sh 1.2.3
    export SMART_OPEN_RELEASE=2.3.4
    bash release/prepare.sh

This will create a local release branch.
Look around the branch and make sure everything is in order.
6 changes: 1 addition & 5 deletions release/merge.sh
@@ -35,11 +35,7 @@ set -euo pipefail

cd "$(dirname "${BASH_SOURCE[0]}")/.."

#
# env -i python seems to return the wrong Python version on MacOS...
#
my_python=$(which python3)
version=$($my_python -c "from smart_open.version import __version__; print(__version__)")
version="$SMART_OPEN_RELEASE"

read -p "Push version $version to github.com? yes or no: " reply
if [ "$reply" != "yes" ]
5 changes: 3 additions & 2 deletions release/prepare.sh
@@ -1,7 +1,8 @@
#
# Prepare a new release of smart_open. Use it like this:
#
# bash release/prepare.sh 1.2.3
# export SMART_OPEN_RELEASE=2.3.4
# bash release/prepare.sh
#
# where 2.3.4 is the new version to release.
#
@@ -17,7 +18,7 @@
#
set -euxo pipefail

version="$1"
version="$SMART_OPEN_RELEASE"
echo "version: $version"

script_dir="$(dirname "${BASH_SOURCE[0]}")"
6 changes: 1 addition & 5 deletions release/push_pypi.sh
@@ -4,11 +4,7 @@
#
set -euo pipefail

#
# env -i python seems to return the wrong Python version on MacOS...
#
my_python=$(which python3)
version=$($my_python -c "from smart_open.version import __version__; print(__version__)")
version="$SMART_OPEN_RELEASE"

script_dir="$(dirname "${BASH_SOURCE[0]}")"
cd "$script_dir"
8 changes: 5 additions & 3 deletions setup.py
@@ -57,13 +57,14 @@ def read(fname):

aws_deps = ['boto', 'boto3']
gcp_deps = ['google-cloud-storage']
azure_deps = ['azure-storage-blob', 'azure-common', 'azure-core']

all_deps = install_requires + aws_deps + gcp_deps
all_deps = install_requires + aws_deps + gcp_deps + azure_deps

setup(
    name='smart_open',
    version=__version__,
    description='Utils for streaming large files (S3, HDFS, GCS, gzip, bz2...)',
    description='Utils for streaming large files (S3, HDFS, GCS, Azure Blob Storage, gzip, bz2...)',
    long_description=read('README.rst'),

    packages=find_packages(),
@@ -79,7 +80,7 @@ def read(fname):
    url='https://github.com/piskvorky/smart_open',
    download_url='http://pypi.python.org/pypi/smart_open',

    keywords='file streaming, s3, hdfs, gcs',
    keywords='file streaming, s3, hdfs, gcs, azure blob storage',

    license='MIT',
    platforms='any',
@@ -93,6 +94,7 @@ def read(fname):
        'test': tests_require,
        'aws': aws_deps,
        'gcp': gcp_deps,
        'azure': azure_deps,
        'all': all_deps,
    },
    python_requires=">=3.5.*",
15 changes: 8 additions & 7 deletions smart_open/__init__.py
@@ -23,10 +23,15 @@
"""

import logging
from smart_open import version

from .smart_open_lib import open, parse_uri, smart_open, register_compressor
from .s3 import iter_bucket as s3_iter_bucket
#
# Prevent regression of #474 and #475
#
logging.getLogger(__name__).addHandler(logging.NullHandler())

from smart_open import version # noqa: E402
from .smart_open_lib import open, parse_uri, smart_open, register_compressor # noqa: E402
from .s3 import iter_bucket as s3_iter_bucket # noqa: E402

__all__ = [
    'open',
@@ -36,8 +41,4 @@
    'smart_open',
]


__version__ = version.__version__

logger = logging.getLogger(__name__)
logger.addHandler(logging.NullHandler())