Error with List Training #1089

Closed
LukeCLindgren opened this issue Nov 28, 2017 · 7 comments
@LukeCLindgren

LukeCLindgren commented Nov 28, 2017

This is the error generated when I try to train with the list trainer:

C:\Python27\python.exe "E:/Jason Chat Function/Jason Chat.py"
List Trainer: [#########           ] 46%Traceback (most recent call last):
  File "E:/Jason Chat Function/Jason Chat.py", line 101, in <module>
    control()
  File "E:/Jason Chat Function/Jason Chat.py", line 95, in control
    train_from_text()
  File "E:/Jason Chat Function/Jason Chat.py", line 57, in train_from_text
    chatbot.train(conversation)
  File "C:\Python27\lib\site-packages\chatterbot\trainers.py", line 86, in train
    statement = self.get_or_create(text)
  File "C:\Python27\lib\site-packages\chatterbot\trainers.py", line 28, in get_or_create
    statement = self.storage.find(statement_text)
  File "C:\Python27\lib\site-packages\chatterbot\storage\sql_storage.py", line 157, in find
    record = query.first()
  File "C:\Python27\lib\site-packages\sqlalchemy\orm\query.py", line 2755, in first
    ret = list(self[0:1])
  File "C:\Python27\lib\site-packages\sqlalchemy\orm\query.py", line 2547, in __getitem__
    return list(res)
  File "C:\Python27\lib\site-packages\sqlalchemy\orm\query.py", line 2855, in __iter__
    return self._execute_and_instances(context)
  File "C:\Python27\lib\site-packages\sqlalchemy\orm\query.py", line 2878, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "C:\Python27\lib\site-packages\sqlalchemy\engine\base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "C:\Python27\lib\site-packages\sqlalchemy\sql\elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "C:\Python27\lib\site-packages\sqlalchemy\engine\base.py", line 1053, in _execute_clauseelement
    compiled_sql, distilled_params
  File "C:\Python27\lib\site-packages\sqlalchemy\engine\base.py", line 1189, in _execute_context
    context)
  File "C:\Python27\lib\site-packages\sqlalchemy\engine\base.py", line 1402, in _handle_dbapi_exception
    exc_info
  File "C:\Python27\lib\site-packages\sqlalchemy\util\compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "C:\Python27\lib\site-packages\sqlalchemy\engine\base.py", line 1182, in _execute_context
    context)
  File "C:\Python27\lib\site-packages\sqlalchemy\engine\default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (sqlite3.ProgrammingError) You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. [SQL: u'SELECT "StatementTable".id AS "StatementTable_id", "StatementTable".text AS "StatementTable_text", "StatementTable".extra_data AS "StatementTable_extra_data" \nFROM "StatementTable" \nWHERE "StatementTable".text = ?\n LIMIT ? OFFSET ?'] [parameters: ('I was wondering if I could ask a major favor of you... (270 characters truncated) ... And could you pick up some potatoes?\r\r\n', 1, 0)]

Process finished with exit code 1

Also, this version of Chatterbot runs much slower than previous ones...

@vkosuri
Collaborator

vkosuri commented Nov 28, 2017

@LukeCLindgren could you please let me know which chatterbot and chatterbot-corpus versions you are using?

It seems to me that the error message is related to the maximum length of the statement. This issue was resolved in the latest release of chatterbot-corpus. Could you try upgrading to the latest version with pip install -U chatterbot-corpus?
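
For reference, the installed versions can also be read programmatically. This is a minimal sketch, assuming Python 3.8 or later for `importlib.metadata` (on the Python 2.7 setup shown in the traceback, `pip show chatterbot chatterbot-corpus` reports the same information). `installed_version` is a hypothetical helper name, not part of ChatterBot.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Print the versions asked about in this thread.
    for name in ("chatterbot", "chatterbot-corpus"):
        print(name, installed_version(name))
```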

@LukeCLindgren
Author

LukeCLindgren commented Nov 28, 2017

@vkosuri when updating chatterbot, the corpus it installs is only 1.0.1, not 1.1.1.

With 1.1.1, I still get the length error.

@vkosuri
Collaborator

vkosuri commented Nov 29, 2017

@LukeCLindgren Thanks. Could you please let me know which corpus you are using? Is it possible to share it here?

@LukeCLindgren
Author

Unfortunately, I can't attach a .py file here, so I've copy-pasted all the code.

`"""
A machine readable multilingual dialog corpus.
"""
from .corpus import Corpus

__version__ = '1.1.1'
__author__ = 'Gunther Cox'
__email__ = 'gunthercx@gmail.com'
__url__ = 'https://github.com/gunthercox/chatterbot-corpus'

__all__ = (
    'Corpus',
)
`

`import os

DIALOG_MAXIMUM_CHARACTER_LENGTH = 400

class CorpusObject(list):
    """
    This is a proxy object that allows additional
    attributes to be added to the collections of
    data that get returned by the corpus reader.
    """

    def __init__(self, *args, **kwargs):
        """
        Imitate a list by allowing a value to be passed in.
        """
        if args:
            super(CorpusObject, self).__init__(args[0])
        else:
            super(CorpusObject, self).__init__()

        self.categories = []

class Corpus(object):

    def __init__(self):
        current_directory = os.path.dirname(os.path.abspath(__file__))
        self.data_directory = os.path.join(current_directory, 'data')

    def get_file_path(self, dotted_path, extension='json'):
        """
        Reads a dotted file path and returns the file path.
        """

        # If the operating system's file path separator character is in the string
        if os.sep in dotted_path or '/' in dotted_path:
            # Assume the path is a valid file path
            return dotted_path

        parts = dotted_path.split('.')
        if parts[0] == 'chatterbot':
            parts.pop(0)
            parts[0] = self.data_directory

        corpus_path = os.path.join(*parts)

        if os.path.exists(corpus_path + '.{}'.format(extension)):
            corpus_path += '.{}'.format(extension)

        return corpus_path

    def read_corpus(self, file_name):
        """
        Read and return the data from a corpus json file.
        """
        import io
        import yaml

        with io.open(file_name, encoding='utf-8') as data_file:
            data = yaml.load(data_file)
        return data

    def list_corpus_files(self, dotted_path):
        """
        Return a list of file paths to each data file in
        the specified corpus.
        """
        CORPUS_EXTENSION = 'yml'

        corpus_path = self.get_file_path(dotted_path, extension=CORPUS_EXTENSION)
        paths = []

        if os.path.isdir(corpus_path):
            for dirname, dirnames, filenames in os.walk(corpus_path):
                for datafile in filenames:
                    if datafile.endswith(CORPUS_EXTENSION):
                        paths.append(os.path.join(dirname, datafile))
        else:
            paths.append(corpus_path)

        paths.sort()
        return paths

    def load_corpus(self, dotted_path):
        """
        Return the data contained within a specified corpus.
        """
        data_file_paths = self.list_corpus_files(dotted_path)

        corpora = []

        for file_path in data_file_paths:
            corpus = CorpusObject()
            corpus_data = self.read_corpus(file_path)

            conversations = corpus_data.get('conversations', [])
            corpus.categories = corpus_data.get('categories', [])
            corpus.extend(conversations)

            corpora.append(corpus)

        return corpora

`

Can I change the maximum character length myself or do you think that would be catastrophic?
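
One way to see whether the limit is involved at all is to check the training data against it before training. This is a minimal sketch, assuming the 400-character value of DIALOG_MAXIMUM_CHARACTER_LENGTH shown above. `find_overlong_statements` is a hypothetical helper, not part of chatterbot-corpus; it only reports long entries and changes nothing.

```python
# Value copied from the corpus module pasted above (an assumption that it
# is the limit being discussed in this thread).
DIALOG_MAXIMUM_CHARACTER_LENGTH = 400


def find_overlong_statements(conversation, limit=DIALOG_MAXIMUM_CHARACTER_LENGTH):
    """Return (index, length) pairs for statements longer than the limit."""
    return [
        (index, len(text))
        for index, text in enumerate(conversation)
        if len(text) > limit
    ]


# Hypothetical training data: the second entry is one character too long.
conversation = ["Hello", "x" * 401, "How are you?"]
print(find_overlong_statements(conversation))
```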

`import logging
import os
import sys
from .conversation import Statement, Response
from .utils import print_progress_bar

class Trainer(object):
    """
    Base class for all other trainer classes.
    """

    def __init__(self, storage, **kwargs):
        self.storage = storage
        self.logger = logging.getLogger(__name__)

    def train(self, *args, **kwargs):
        """
        This method must be overridden by a class that inherits from 'Trainer'.
        """
        raise self.TrainerInitializationException()

    def get_or_create(self, statement_text):
        """
        Return a statement if it exists.
        Create and return the statement if it does not exist.
        """
        statement = self.storage.find(statement_text)

        if not statement:
            statement = Statement(statement_text)

        return statement

    class TrainerInitializationException(Exception):
        """
        Exception raised when a base class has not overridden
        the required methods on the Trainer base class.
        """

        def __init__(self, value=None):
            default = (
                'A training class must be specified before calling train(). ' +
                'See http://chatterbot.readthedocs.io/en/stable/training.html'
            )
            self.value = value or default

        def __str__(self):
            return repr(self.value)

    def _generate_export_data(self):
        result = []
        for statement in self.storage.filter():
            for response in statement.in_response_to:
                result.append([response.text, statement.text])

        return result

    def export_for_training(self, file_path='./export.json'):
        """
        Create a file from the database that can be used to
        train other chat bots.
        """
        import json
        export = {'conversations': self._generate_export_data()}
        with open(file_path, 'w+') as jsonfile:
            json.dump(export, jsonfile, ensure_ascii=False)

class ListTrainer(Trainer):
    """
    Allows a chat bot to be trained using a list of strings
    where the list represents a conversation.
    """

    def train(self, conversation):
        """
        Train the chat bot based on the provided list of
        statements that represents a single conversation.
        """
        previous_statement_text = None

        for conversation_count, text in enumerate(conversation):
            print_progress_bar("List Trainer", conversation_count + 1, len(conversation))

            statement = self.get_or_create(text)

            if previous_statement_text:
                statement.add_response(
                    Response(previous_statement_text)
                )

            previous_statement_text = statement.text
            self.storage.update(statement)

class ChatterBotCorpusTrainer(Trainer):
    """
    Allows the chat bot to be trained using data from the
    ChatterBot dialog corpus.
    """

    def __init__(self, storage, **kwargs):
        super(ChatterBotCorpusTrainer, self).__init__(storage, **kwargs)
        from .corpus import Corpus

        self.corpus = Corpus()

    def train(self, *corpus_paths):

        # Allow a list of corpora to be passed instead of arguments
        if len(corpus_paths) == 1:
            if isinstance(corpus_paths[0], list):
                corpus_paths = corpus_paths[0]

        # Train the chat bot with each statement and response pair
        for corpus_path in corpus_paths:

            corpora = self.corpus.load_corpus(corpus_path)

            corpus_files = self.corpus.list_corpus_files(corpus_path)
            for corpus_count, corpus in enumerate(corpora):
                for conversation_count, conversation in enumerate(corpus):
                    print_progress_bar(
                        str(os.path.basename(corpus_files[corpus_count])) + " Training",
                        conversation_count + 1,
                        len(corpus)
                    )

                    previous_statement_text = None

                    for text in conversation:
                        statement = self.get_or_create(text)

                        if previous_statement_text:
                            statement.add_response(
                                Response(previous_statement_text)
                            )

                        previous_statement_text = statement.text
                        self.storage.update(statement)

class TwitterTrainer(Trainer):
    """
    Allows the chat bot to be trained using data
    gathered from Twitter.

    :param random_seed_word: The seed word to be used to get random tweets from the Twitter API.
                             This parameter is optional. By default it is the word 'random'.
    """

    def __init__(self, storage, **kwargs):
        super(TwitterTrainer, self).__init__(storage, **kwargs)
        from twitter import Api as TwitterApi

        # The word to be used as the first search term when searching for tweets
        self.random_seed_word = kwargs.get('random_seed_word', 'random')

        self.api = TwitterApi(
            consumer_key=kwargs.get('twitter_consumer_key'),
            consumer_secret=kwargs.get('twitter_consumer_secret'),
            access_token_key=kwargs.get('twitter_access_token_key'),
            access_token_secret=kwargs.get('twitter_access_token_secret')
        )

    def random_word(self, base_word):
        """
        Generate a random word using the Twitter API.

        Search twitter for recent tweets containing the term 'random'.
        Then randomly select one word from those tweets and do another
        search with that word. Return a randomly selected word from the
        new set of results.
        """
        import random
        random_tweets = self.api.GetSearch(term=base_word, count=5)
        random_words = self.get_words_from_tweets(random_tweets)
        random_word = random.choice(list(random_words))
        tweets = self.api.GetSearch(term=random_word, count=5)
        words = self.get_words_from_tweets(tweets)
        word = random.choice(list(words))
        return word

    def get_words_from_tweets(self, tweets):
        """
        Given a list of tweets, return the set of
        words from the tweets.
        """
        words = set()

        for tweet in tweets:
            tweet_words = tweet.text.split()

            for word in tweet_words:
                # If the word contains only letters with a length from 4 to 9
                if word.isalpha() and len(word) > 3 and len(word) <= 9:
                    words.add(word)

        return words

    def get_statements(self):
        """
        Returns list of random statements from the API.
        """
        from twitter import TwitterError
        statements = []

        # Generate a random word
        random_word = self.random_word(self.random_seed_word)

        self.logger.info(u'Requesting 50 random tweets containing the word {}'.format(random_word))
        tweets = self.api.GetSearch(term=random_word, count=50)
        for tweet in tweets:
            statement = Statement(tweet.text)

            if tweet.in_reply_to_status_id:
                try:
                    status = self.api.GetStatus(tweet.in_reply_to_status_id)
                    statement.add_response(Response(status.text))
                    statements.append(statement)
                except TwitterError as error:
                    self.logger.warning(str(error))

        self.logger.info('Adding {} tweets with responses'.format(len(statements)))

        return statements

    def train(self):
        for _ in range(0, 10):
            statements = self.get_statements()
            for statement in statements:
                self.storage.update(statement)

class UbuntuCorpusTrainer(Trainer):
    """
    Allow chatbots to be trained with the data from
    the Ubuntu Dialog Corpus.
    """

    def __init__(self, storage, **kwargs):
        super(UbuntuCorpusTrainer, self).__init__(storage, **kwargs)

        self.data_download_url = kwargs.get(
            'ubuntu_corpus_data_download_url',
            'http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz'
        )

        self.data_directory = kwargs.get(
            'ubuntu_corpus_data_directory',
            './data/'
        )

        self.extracted_data_directory = os.path.join(
            self.data_directory, 'ubuntu_dialogs'
        )

        # Create the data directory if it does not already exist
        if not os.path.exists(self.data_directory):
            os.makedirs(self.data_directory)

    def is_downloaded(self, file_path):
        """
        Check if the data file is already downloaded.
        """
        if os.path.exists(file_path):
            self.logger.info('File is already downloaded')
            return True

        return False

    def is_extracted(self, file_path):
        """
        Check if the data file is already extracted.
        """

        if os.path.isdir(file_path):
            self.logger.info('File is already extracted')
            return True
        return False

    def download(self, url, show_status=True):
        """
        Download a file from the given url.
        Show a progress indicator for the download status.
        Based on: http://stackoverflow.com/a/15645088/1547223
        """
        import requests

        file_name = url.split('/')[-1]
        file_path = os.path.join(self.data_directory, file_name)

        # Do not download the data if it already exists
        if self.is_downloaded(file_path):
            return file_path

        with open(file_path, 'wb') as open_file:
            print('Downloading %s' % url)
            response = requests.get(url, stream=True)
            total_length = response.headers.get('content-length')

            if total_length is None:
                # No content length header
                open_file.write(response.content)
            else:
                download = 0
                total_length = int(total_length)
                for data in response.iter_content(chunk_size=4096):
                    download += len(data)
                    open_file.write(data)
                    if show_status:
                        done = int(50 * download / total_length)
                        sys.stdout.write('\r[%s%s]' % ('=' * done, ' ' * (50 - done)))
                        sys.stdout.flush()

            # Add a new line after the download bar
            sys.stdout.write('\n')

        print('Download location: %s' % file_path)
        return file_path

    def extract(self, file_path):
        """
        Extract a tar file at the specified file path.
        """
        import tarfile

        print('Extracting {}'.format(file_path))

        if not os.path.exists(self.extracted_data_directory):
            os.makedirs(self.extracted_data_directory)

        def track_progress(members):
            sys.stdout.write('.')
            for member in members:
                # This will be the current file being extracted
                yield member

        with tarfile.open(file_path) as tar:
            tar.extractall(path=self.extracted_data_directory, members=track_progress(tar))

        self.logger.info('File extracted to {}'.format(self.extracted_data_directory))

        return True

    def train(self):
        import glob
        import csv

        # Download and extract the Ubuntu dialog corpus if needed
        corpus_download_path = self.download(self.data_download_url)

        # Extract if the directory does not already exist
        if not self.is_extracted(self.extracted_data_directory):
            self.extract(corpus_download_path)

        extracted_corpus_path = os.path.join(
            self.extracted_data_directory,
            '**', '**', '*.tsv'
        )

        file_kwargs = {}

        if sys.version_info[0] > 2:
            # Specify the encoding in Python versions 3 and up
            file_kwargs['encoding'] = 'utf-8'
            # WARNING: This might fail to read a unicode corpus file in Python 2.x

        for file in glob.iglob(extracted_corpus_path):
            self.logger.info('Training from: {}'.format(file))

            with open(file, 'r', **file_kwargs) as tsv:
                reader = csv.reader(tsv, delimiter='\t')

                previous_statement_text = None

                for row in reader:
                    if len(row) > 0:
                        text = row[3]
                        statement = self.get_or_create(text)
                        print(text, len(row))

                        statement.add_extra_data('datetime', row[0])
                        statement.add_extra_data('speaker', row[1])

                        if row[2].strip():
                            statement.add_extra_data('addressing_speaker', row[2])

                        if previous_statement_text:
                            statement.add_response(
                                Response(previous_statement_text)
                            )

                        previous_statement_text = statement.text
                        self.storage.update(statement)

`
This last one is the trainer I've been using (the one that has been returning the error above).

This might be another issue, but this version of chatterbot boots much slower and runs much slower than previous versions as well.

@gunthercox
Owner

It looks like the last part of the original error message might be the key here.

sqlalchemy.exc.ProgrammingError: (sqlite3.ProgrammingError) You must not use 8-bit bytestrings
unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str).
It is highly recommended that you instead just switch your application to Unicode strings.
[SQL: u'SELECT "StatementTable".id AS "StatementTable_id", "StatementTable".text AS
"StatementTable_text", "StatementTable".extra_data AS "StatementTable_extra_data" \nFROM
"StatementTable" \nWHERE "StatementTable".text = ?\n LIMIT ? OFFSET ?'] [parameters: ('I was
wondering if I could ask a major favor of you... (270 characters truncated) ... And could you
pick up some potatoes?\r\r\n', 1, 0)]

Following the stack trace backwards you can then see the methods and line numbers in the files where the error was triggered.

  File "C:\Python27\lib\site-packages\chatterbot\trainers.py", line 28, in get_or_create
    statement = self.storage.find(statement_text)
  File "C:\Python27\lib\site-packages\chatterbot\storage\sql_storage.py", line 157, in find
    record = query.first()

So to try to help, I just googled the SQLAlchemy error "sqlalchemy.exc.ProgrammingError: (sqlite3.ProgrammingError) You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str)".

It looks like it might be an issue with Unicode coercion: https://stackoverflow.com/questions/23876342/sqlalchemy-programmingerror-can-interpret-8-bit-bytestrings
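
If that is the cause, decoding each byte string to Unicode text before it reaches the trainer should avoid the error. This is a minimal sketch, assuming the training data is UTF-8 encoded (a guess, since the actual encoding of the data is not known here). `to_text` is a hypothetical helper, not part of ChatterBot.

```python
def to_text(value, encoding="utf-8"):
    """Return value as Unicode text, decoding it if it is a byte string."""
    # On Python 2, `bytes` is an alias for `str`, so this also catches
    # plain 8-bit strings read from a file opened without an encoding.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical input resembling the statement in the error message.
raw_conversation = [b"Hello", b"And could you pick up some potatoes?\r\r\n"]

# Decode every entry and drop the stray line endings.
conversation = [to_text(line).strip() for line in raw_conversation]
```

The resulting list could then be passed to chatbot.train(conversation) as in the original script.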


@LukeCLindgren Could you post the original full text of the string "I was wondering if I could ask a major favor of you..." so that I can test with it? I'll try without it in the meantime, but it might be helpful.

@gunthercox
Owner

A possible solution might be to upgrade to Python 3 to avoid the bytestring issues in Python 2.7.
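
Short of upgrading, reading the training file as Unicode may have the same effect on Python 2.7. This is a minimal sketch, assuming the conversation is loaded from a UTF-8 text file with one statement per line (the original script's train_from_text function is not shown in this thread, so that is a guess). `load_conversation` is a hypothetical helper, not part of ChatterBot.

```python
import io


def load_conversation(path, encoding="utf-8"):
    """Read one statement per line and return a list of Unicode strings."""
    # io.open decodes the file on both Python 2 and Python 3, so every
    # line is already Unicode text by the time it reaches the trainer.
    with io.open(path, encoding=encoding) as text_file:
        # Strip line endings and skip lines that are empty after stripping.
        return [line.strip() for line in text_file if line.strip()]
```

The returned list has the same shape as the conversation argument that ListTrainer.train iterates over in the code pasted above.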

@lock

lock bot commented Mar 10, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 10, 2019