Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ros2 bag reindex fails on SQLite3 E:26 #1029

Closed
Benblob688 opened this issue Jun 22, 2022 · 7 comments
Closed

ros2 bag reindex fails on SQLite3 E:26 #1029

Benblob688 opened this issue Jun 22, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@Benblob688
Copy link

Benblob688 commented Jun 22, 2022

Description

Firstly, I experienced the issue described by #395 where the ros2 bag record -a done on foxy had not saved a metadata.yaml file because the machine's power was cut, no Crtl-C was done on the ros2 bag record

Following #395, I figured out that ros2 bag reindex is only available on galactic not foxy, so I followed these instructions to install galactic https://docs.ros.org/en/galactic/Installation/Ubuntu-Install-Debians.html

Then I encountered the errors described by #240 (error code 26 not 5 though) and did the following to try to rectify it:
sudo apt-get install ros-galactic-rosbag2-storage-default-plugins ros-galactic-ros2bag

This hasn't fixed the issue for me.

Expected Behavior

ros2 bag reindex <dir> creates the metadata.yaml file that was not created due to a power loss during ros2 bag record -a

Actual Behavior

the following error:

$ ros2 bag reindex rosbag2_2022_06_20-12_12_43/ sqlite3
[INFO] [1655898267.789075775] [rosbag2_cpp]: Beginning reindexing bag in directory: rosbag2_2022_06_20-12_12_43/
[ERROR] [1655898267.792975611] [rosbag2_storage]: Could not open 'rosbag2_2022_06_20-12_12_43/rosbag2_2022_06_20-12_12_43_0.db3' with 'sqlite3'. Error: Failed to setup storage. Error: Error when processing SQL statement. SQLite error (26): file is not a database
[ERROR] [1655898267.793062573] [rosbag2_storage]: Could not load/open plugin with storage id 'sqlite3'.
No storage could be initialized. Abort

System (please complete the following information)

  • OS: Ubuntu 20.04.4
  • ROS 2 Distro: foxy to record, galactic to reindex
  • Version: not sure how to find this info
@Benblob688 Benblob688 added the bug Something isn't working label Jun 22, 2022
@MichaelOrlov
Copy link
Contributor

@Benblob688 If you try to reindex normally recorded bag i.e. with normal finishing by removing metadata file, do you see the same error from reindexer?

@mad0x60
Copy link

mad0x60 commented Sep 14, 2022

@Benblob688 If you try to reindex normally recorded bag i.e. with normal finishing by removing metadata file, do you see the same error from reindexer?

I have the same problem after a sudden shutdown to my computer during bag saving.

I tried reindexing a normally recorded bag that doesn't have a metadata file and it works without issues.

@PeterQFR
Copy link

This is a problem with SQLite3 probably because the write to bag terminates halfway through a transaction (disclaimer I don't really know much about sql databases)

However this issue can occur in any SQLite3 database and there are solutions to it.

I've linked to one that was able to reformat and recover the sqlite database

I also add the contents of the link in case it drops out

Problem

Sometimes a Storage Node Operator may encounter the "database disk image is malformed" error in their log. This could happen during unplanned shutdown or reboot. The error indicates that one or more of the sqlite3 databases may have become corrupted.
Solution

    Firstly, we should try to verify the database with an embedded SQLite3 command. So, we need to have sqlite3 installed (v3.25.2 or later). The installation steps depend on the OS.
    We will use Docker instead of direct installation (this option is available only for x86_64 CPUs, for arm-based boards you will need to install sqlite3 via the package manager of your OS).
    [Stop the storagenode](https://documentation.storj.io/resources/faq/system-maintenance)
    Make a backup of all the sqlite3 databases. They are located in the storage folder for your data storage. For example x:\storagenode\storage\bandwidth.db, where x:\storagenode is the data folder you had specified in the --mount type=bind,source=x:\storagenode,destination=/app/config option of the docker run command for your storagenode, or x:\storagenode\storage in case of using the Windows GUI, in the storage.path: option of the config.yaml file.
    Check each database for errors. We will use bandwidth.db as an example in this guide.
        Docker (replace ${PWD} with an absolute path to the databases location, or simple switch the current location to there)

        docker run --rm -it --mount type=bind,source=${PWD},destination=/data sstc/sqlite3 find . -iname "*.db" -maxdepth 1 -print0 -exec sqlite3 '{}' 'PRAGMA integrity_check;' ';'

        Direct installation

            Linux

            sudo apt update && sudo apt install sqlite3 -y

            Windows
            https://www.sqlitetutorial.net/download-install-sqlite/
            Make sure that the version is v3.25.2 or later, otherwise the check will not work correctly. 

            sqlite3 --version

            perform the integrity check (perform for each database), for example for bandwidth.db:

            sqlite3 /path/to/storage/bandwidth.db "PRAGMA integrity_check;"

            Or check all databases with help of shell commands:
                bash (Linux)

                find /path/to/storage/ -iname "*.db" -maxdepth 1 -print0 -exec sqlite3 '{}' 'PRAGMA integrity_check;' ';'

                PowerShell (Windows)

                Get-ChildItem X:\storagenode\storage\*.db -File | %{$_.Name + " " + $(sqlite3.exe $_.FullName "PRAGMA integrity_check;")}

    If you see errors in the output, then the check did not pass. We will unload all uncorrupted data and then load it back. But this could sometimes fail, too. If no errors occur here, you can skip all the following steps and start the storagenode again.

    If you were not lucky and the check failed, then please try to fix the corrupted database(s) as shown below.

    Open a shell

        Docker version of sqlite3. Open a shell Inside the container:

        docker run --rm -it --mount type=bind,source=x:\storagenode\storage,destination=/storage sstc/sqlite3 sh

        Tip. You can use tmpfs to restore your databases. It uses memory instead of disk and should take a lot less time than on HDD (you can read more about usage of tmpfs with Docker in the [Use tmpfs mounts](https://docs.docker.com/storage/tmpfs/) guide or [this forum comment](https://forum.storj.io/t/used-serial-db-malformed/7512/4?u=alexey)). For Windows or MacOS you must increase the allocated RAM for the docker's VM via Docker desktop application to fit a double size of the greatest corrupted database file in case of usage of tmpfs.
         
        You could use your shell directly if you have sqlite3 installed. In that case, use the path to your storage instead of "/storage/" across this guide below.
        For Windows with local sqlite3 installed, we recommend to use a PowerShell to execute the commands below. Don't forget to replace the "/storage/" folder with your local path to the folder where the databases are stored. If the sqlite3.exe executable is not in the system variable PATH, then you should specify the full path to it or run from the location of the executable.
    Now run the following commands in the shell. You need to repeat steps 9 to 14 for each corrupted sqlite3 database:

    cp /storage/bandwidth.db /storage/bandwidth.db.bak
    sqlite3 /storage/bandwidth.db

    You will see a prompt from sqlite3. Run this SQL script:

    .mode insert
    .output /storage/dump_all.sql
    .dump
    .exit

    We will edit the SQL file dump_all.sql

        Linux or docker version:

        cat /storage/dump_all.sql | grep -v TRANSACTION | grep -v ROLLBACK | grep -v COMMIT >/storage/dump_all_notrans.sql

        PowerShell (Windows) with a local sqlite3 version:

        Get-Content dump_all.sql | Select-String -NotMatch TRANSACTION | Select-String -NotMatch ROLLBACK | Select-String -NotMatch COMMIT | Set-Content -Encoding utf8 dump_all_notrans.sql

    Remove the corrupted database (make sure that you have a backup!)

    rm /storage/bandwidth.db

    Now we will load the unloaded data into the new database

    sqlite3 /storage/bandwidth.db ".read /storage/dump_all_notrans.sql"

    Check that the new database (bandwidth.db in our example) has a size larger than 0:

        Linux or docker version

        ls -l /storage/bandwidth.db

        PowerShell (Windows) with a local sqlite3 version:

        ls /storage/bandwidth.db

    Exit from the container (skip this step, if you use a directly installed sqlite3)

    exit

    If you are lucky and all corrupted sqlite3 databases are fixed, then you can [start the storagenode](https://documentation.storj.io/resources/faq/system-maintenance) again.

Warning. If you were not successful with the fix of the database, then your stat is lost.
Prevention

On Windows: disable the write cache. Consider migrating to the [Windows GUI](https://documentation.storj.io/setup/gui-windows) instead of using Docker.

On Unraid: update to the latest version of the platform (the bug is fixed in the 6.8.0-rc5 as seen in [this comment](https://forums.unraid.net/bug-reports/prereleases/sqlite-data-corruption-testing-r664/page/4/?tab=comments#comment-6650)) or rollback to version [6.6.7](https://forums.unraid.net/topic/80439-downgraded-back-to-667-due-to-sqlite-corruption/).

On Docker: use the updated docker run command from the documentation: https://documentation.storj.io/setup/cli/storage-node#running-the-storage-node
Common Problems

Make sure that you are not using NFS or SMB to connect to the storage, they are not compatible with SQLite. The only working network protocol is iSCSI.

Make sure that your external USB drive has enough power and it does not turn off during operations. It's better to avoid using them and use only internal drives.

I think running these commands should be possible though the python sqlite3 interface

@PeterQFR
Copy link

PeterQFR commented Nov 10, 2022

TLDR;

Assuming your bag file is called

Install the application sqlite3 through

apt install sqlite3

Check for issues in db3 file in bag.

sqlite3 /path/to/bag/bag.db3 "PRAGMA integrity_check;"

Now I have seen the above command return errors or ok with no change to the reindexing success for rosbag2.
Make a backup of the database:

 cp bag/bag.db3 bag/bag.db.bak
 sqlite3 bag/bag.db3

At the prompt dump the raw contents of the db.

.mode insert
.output bag/dump_all.sql
.dump
.exit

Use grep to extract just the relevant transactions (remember I don't know sql)

cat bag/dump_all.sql | grep -v TRANSACTION | grep -v ROLLBACK | grep -v COMMIT >bag/dump_all_notrans.sql

now load back into a new database:

 sqlite3 bag/bag.db3 ".read bag/dump_all_notrans.sql"

check that the size of the database is no longer zero

ls -al bag/bag.db3

Now you should be able to reindex successfully:

ros2 bag reindex bag/

@MichaelOrlov
Copy link
Contributor

@PeterQFR Thank you so much for the recipe!
Wouldn't you mind to post your answer on the https://answers.ros.org like a short howto as you posted it here?

As regards to the original issue I see from conversation that reindex command failed for corrupted database which had been abnormally interrupted during writing and it was used default storage preset profile which is optimized for write speed and not fault tolerant.

From our rosbag2 readme file https://github.com/ros2/rosbag2#recording-with-a-storage-configuration

By default, SQLite settings are significantly optimized for performance. This might have consequences of bag data being corrupted after an application or system-level crash. This consideration only applies to current bagfile in case bag splitting is on (through --max-bag-* parameters). If increased crash-caused corruption resistance is necessary, use resilient option for --storage-preset-profile setting.
  • Therefore the described behavior is expected and worked as designed.
    To avoid aforementioned database corruption during abnormal recording interruption (note CTRL+C or SIGINT is a normal recording interruption) need to use command line option --storage-preset-profile resilent which also has impact on the writing performances.
    Or as alternative use recipe for recovering corrupted database from previous post.

  • Closing this issue as this is not a bug and rosbag2 works as designed.

@Ecuashungo
Copy link

The solution by @PeterQFR unfortunately did not work for me, but I have come across this stack overflow answer, with which I managed to repair the corrupted database.

  1. I needed to run sqlite3 broken.db3 ".recover" | sqlite3 new.db3 to repair the rosbag .db3 file
  2. then I used ros2 bag reindex bag/ as explained by @PeterQFR in his last comment (make sure that the .db3 file has the correct naming)

Hope this helps future visitors

@jimenezjose
Copy link

In the case you need to reindex for different bag formats:

db3: ros2 bag reindex <rosbag> sqlite

mcap: ros2 bag reindex <rosbag> mcap

(Other supported storage formats can be found using ros2 bag reindex --help)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants