Jwink3101/PyFiSync

PyFiSync

Python (+ rsync or rclone) based intelligent file sync with automatic backups and file move/delete tracking.

Features

  • Robust tracking of file moves
    • Especially powerful on macOS, but works well enough on Linux.
  • rsync Mode:
    • Works out of the box with Python (tested on 2.7 and 3.5+) for rsync
    • Works over SSH for secure and easy connections with rsync mode
    • Uses rsync for actual file transfers to save bandwidth and make use of existing file data
  • rclone mode: (beta!)
    • Can connect to a wide variety of cloud-services and offers encryption
    • Note that rclone mode is still supported and works, but it is better to use syncrclone instead.
      • rclone support may be deprecated in the future!
  • Extensively tested for a huge variety of edge-cases

Details

PyFiSync uses a small database of files from the last sync to track moves and deletions (based on configurable attributes such as inode numbers, sha1 hashes, and/or creation time). It then compares the mtime of every file on both sides to decide which transfers to make.
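The idea of matching files against the previous listing can be sketched as follows. This is only an illustration and not PyFiSync's actual code: the function names, the dict-based listing format, and the rule of ignoring ambiguous matches are all assumptions made for the example.

```python
def detect_moves(previous, current, move_attributes=("ino", "mtime")):
    """Return {old_path: new_path} for files whose tracked attributes match.

    `previous` and `current` map a relative path to a dict of attributes
    (a hypothetical format chosen for this sketch).
    """
    def key(attrs):
        return tuple(attrs[name] for name in move_attributes)

    # Index the previous listing by attribute key and note keys seen twice.
    by_key = {}
    ambiguous = set()
    for path, attrs in previous.items():
        k = key(attrs)
        if k in by_key:
            ambiguous.add(k)
        by_key[k] = path

    moves = {}
    for path, attrs in current.items():
        if path in previous:
            continue  # the path still exists, so this is not a move
        k = key(attrs)
        old = by_key.get(k)
        # Only call it a move if the match is unique and the old path is gone.
        if old is not None and k not in ambiguous and old not in current:
            moves[old] = path
    return moves


def detect_deletions(previous, current, moves):
    """Paths in the previous listing that are neither present nor moved."""
    return sorted(p for p in previous if p not in current and p not in moves)
```

With a previous listing of `a.txt`, `b.txt`, and `c.txt`, and a current listing where `a.txt` reappears at `dir/a.txt` with the same inode and mtime, the sketch reports one move and treats `b.txt` as deleted.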

Backups

By default, any time a file is to be overwritten or modified, it is backed up on the machine first. No distinction is made in the backup for overwrite vs delete.
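A backup-before-change step might look roughly like the sketch below. The `.backups` folder name, the timestamp layout, and the function name are hypothetical and chosen only for illustration; PyFiSync's real backup location and naming may differ. The sketch assumes Python 3.

```python
import os
import shutil
import time


def backup_before_change(sync_root, rel_path, backup_root=None, stamp=None):
    """Copy sync_root/rel_path into a timestamped backup folder.

    Returns the path of the backup copy. The same routine is used whether
    the file is about to be overwritten or deleted.
    """
    stamp = stamp or time.strftime("%Y-%m-%d_%H%M%S")
    # Hypothetical default location; not necessarily what PyFiSync uses.
    backup_root = backup_root or os.path.join(sync_root, ".backups")
    src = os.path.join(sync_root, rel_path)
    dst = os.path.join(backup_root, stamp, rel_path)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(src, dst)  # copy2 also preserves the modification time
    return dst
```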

Attributes

Moves and deletions are tracked via attributes described below.

Move attributes are used to track whether a file has moved, while the prev_attributes are used to determine whether a file is the same as before.

Note: On HFS+ (and maybe APFS?), macOS's file system, inodes are not reused quickly. On ext3 (Linux), they are recycled rapidly, which leads to issues when files are deleted and new ones are created. Do not use inodes alone on such systems.

Common attributes

  • path -- This essentially means that moves are not tracked. If a file has the same name, it is considered the same file
  • size -- File size. Do not use alone. Also, using this attribute means that the file must not change between moves. See examples below
  • mtime -- When the file was modified. Use with ino to track files

rsync and local attributes

Attributes for the local machine and an rsync remote

  • ino (inode number) -- Track the filesystem inode number. May be safely used alone on HFS+ but not on ext3 since it reuses inodes. In that case, use with another attribute
  • hashes -- Very robust to track file moves but like size, requires the file not change. Also, slow to calculate (though, by default, they are not recalculated on every sync). Options:
    • adler -- Fast but less secure
    • dbhash -- Used for dropbox. Useful if comparing on hash
    • any hashlib.algorithms_guaranteed: sha384,sha3_224,sha3_512,md5,sha512,sha3_256,blake2b,sha3_384,shake_128,blake2s,sha256,shake_256,sha1,sha224
  • birthtime -- Use the file creation time. This is unavailable on some Linux machines and some Python implementations (PyPy), and may be unreliable where it does exist
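For orientation, the attributes listed above can all be read with the Python standard library. The sketch below is an illustration rather than PyFiSync's own collection code; the function name and the returned dict layout are assumptions. It uses `os.stat` for size, mtime, and inode, `zlib.adler32` for the adler checksum, and `hashlib.new` for any other named hash (the shake variants would need an explicit digest length and are not handled here).

```python
import hashlib
import os
import zlib


def file_attributes(path, hash_name="sha1", chunk_size=1 << 20):
    """Collect size, mtime, ino, birthtime (if any), a hash, and adler32."""
    st = os.stat(path)
    attrs = {"size": st.st_size, "mtime": st.st_mtime, "ino": st.st_ino}

    # st_birthtime is only provided on some platforms, so probe for it.
    birthtime = getattr(st, "st_birthtime", None)
    if birthtime is not None:
        attrs["birthtime"] = birthtime

    hasher = hashlib.new(hash_name)
    adler = 1  # adler32's defined starting value
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            adler = zlib.adler32(chunk, adler)
    attrs[hash_name] = hasher.hexdigest()
    attrs["adler"] = adler & 0xFFFFFFFF
    return attrs
```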

rclone attributes

  • hash.HASH -- Use a hash from rclone. Depends on which hashes are available.

Suggested move Attribute Combinations

For rsync

  • On macOS, the following is suggested: [ino,birthtime]
  • On Linux, the following is suggested: [ino,mtime]
    • This means that moved files should not be modified on that side of the sync.

Hashes

As noted, any hashlib.algorithms_guaranteed is supported for rsync mode and the local machine. To save time, hashes are stored in a database from the previous sync and reused. This can be turned off in the config, forcing all of the files to be read and hashed again.
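A hash cache of this kind can be sketched as below. The invalidation rule used here (reuse the stored hash only when size and mtime are unchanged) is an assumption for illustration; the source does not say exactly how PyFiSync decides a stored hash is still valid. The function name and the in-memory dict standing in for the database are likewise hypothetical.

```python
import hashlib
import os


def cached_hash(path, cache, hash_name="sha1"):
    """Return (hexdigest, reused) where `reused` says if the cache was hit.

    `cache` is a plain dict standing in for the stored database.
    """
    st = os.stat(path)
    signature = (st.st_size, st.st_mtime)

    entry = cache.get(path)
    if entry is not None and entry["signature"] == signature:
        return entry["hash"], True  # file looks unchanged: skip reading it

    hasher = hashlib.new(hash_name)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
    cache[path] = {"signature": signature, "hash": hasher.hexdigest()}
    return cache[path]["hash"], False
```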

Empty Directories

PyFiSync syncs files and therefore will not sync empty directories from one machine to the other. However, if, and only if, a directory is made empty by the sync, it will be deleted. That includes nested directories. In rclone mode, empty directories are not handled at all by PyFiSync.
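The "only if emptied by the sync" rule can be illustrated with a short sketch. It starts from the files the sync just removed or moved away and walks upward, so directories that were already empty beforehand are never touched. The function name and argument format are assumptions, not PyFiSync's API.

```python
import os


def prune_emptied_dirs(sync_root, removed_rel_paths):
    """Delete directories left empty after the given files were removed.

    Returns the pruned directories as paths relative to sync_root.
    """
    sync_root = os.path.abspath(sync_root)
    pruned = []
    for rel_path in removed_rel_paths:
        directory = os.path.dirname(os.path.join(sync_root, rel_path))
        # Walk upward but never remove the sync root itself.
        while directory != sync_root and directory.startswith(sync_root + os.sep):
            if not os.path.isdir(directory) or os.listdir(directory):
                break  # already gone, or still holds something
            os.rmdir(directory)
            pruned.append(os.path.relpath(directory, sync_root))
            directory = os.path.dirname(directory)
    return pruned
```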

Install

There are no dependencies! (for rsync). Everything is included in the package (though ldtable, now DictTable, is also developed separately).

To install:

$ python -m pip install git+https://github.com/Jwink3101/PyFiSync

Or download the zip file and run

$ python setup.py install

If using the rclone remote (see setup below), install it on the remote machine too.

Note: On the remote machine, the path to PyFiSync must be found via SSH. For example, if your python is from (Ana/Mini)conda, then it places the paths into the .bash_profile. Move the paths to .bashrc so that PyFiSync can be found.

Alternatively, specify remote_exe.

Setup

See rsync for setup of the default mode. PyFiSync must be installed on both machines (or the Python scripts must be present and configured).

Setting up rclone is a bit more involved since you must set up an appropriate rclone remote. See rclone readme for general details and rclone_b2 for a detailed walk through of setting up with B2 (and S3 with small noted changes).

To initiate an rclone-based repo, do

$ PyFiSync init --remote rclone

Settings

There are many settings, all documented in the config file written after an init. Here are a few:

Exclusions

Exclusion naming is done in such a way that it replicates a subset of rsync exclusions. That is, the following pattern is what this code follows. rsync has its own exclusion engine, which is more advanced but should behave similarly.

  • If an item ends in / it is a folder exclusion
  • If an item starts with / it is a full path relative to the root
  • Wildcards and other patterns are accepted

Pattern   Meaning
*         matches everything
?         matches any single character
[seq]     matches any character in seq
[!seq]    matches any character not in seq

Examples:

  • Exclude all git directories: .git/
  • Exclude a specific folder: /path/to/folder/ (where / is the start of the sync directory)
  • Exclude all files that start with file: file*
  • Exclude all files that start with file in a specific directory: /path/to/file*
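The rules above can be approximated with the standard `fnmatch` module, which uses the same wildcard table. The sketch below is a simplified reading of those rules and not PyFiSync's actual matcher: the function name is hypothetical, paths are assumed to be relative to the sync root with `/` separators, and unanchored patterns are tested against each path component.

```python
import fnmatch


def is_excluded(rel_path, is_dir, patterns):
    """Return True if rel_path (or one of its parent folders) is excluded."""
    parts = rel_path.strip("/").split("/")
    for pattern in patterns:
        dir_only = pattern.endswith("/")    # trailing / : folder exclusion
        anchored = pattern.startswith("/")  # leading / : relative to the root
        core = pattern.strip("/")
        # Test every ancestor directory and finally the item itself.
        for depth in range(1, len(parts) + 1):
            candidate_is_dir = is_dir or depth < len(parts)
            if dir_only and not candidate_is_dir:
                continue
            if anchored:
                target = "/".join(parts[:depth])
            else:
                target = parts[depth - 1]
            if fnmatch.fnmatchcase(target, core):
                return True
    return False
```

For example, `is_excluded("proj/.git/config", False, [".git/"])` is true because a parent folder matches, while a plain file named `.git` is not matched by the folder-only pattern.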

Exclude if Present

PyFiSync allows for exclusion of a directory due to the presence of a specified file name (the contents of the file do not matter, only the presence of it).

Unlike regular exclusions, which halt traversal into an excluded directory tree, exclude_if_present is a filter applied after the fact. This approach is safer, as adding an exclusion file on one side will not cause a delete to be incorrectly propagated. It does come at a small performance penalty, as the excluded directory is initially traversed.
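As a rough illustration of filtering after the fact, the sketch below takes a complete file listing, finds every directory that contains a marker file, and drops everything under those directories. The function name, the `.nosync` marker used in the usage note, and the list-of-paths input are assumptions for the example only.

```python
import posixpath


def filter_exclude_if_present(rel_paths, marker_names):
    """Drop every path that lives under a directory holding a marker file.

    `rel_paths` is the full listing, relative to the sync root, using '/'.
    """
    marker_names = set(marker_names)
    flagged_dirs = {
        posixpath.dirname(p)
        for p in rel_paths
        if posixpath.basename(p) in marker_names
    }

    def under_flagged(path):
        directory = posixpath.dirname(path)
        while True:
            if directory in flagged_dirs:
                return True
            if not directory:  # reached the sync root without a match
                return False
            directory = posixpath.dirname(directory)

    return [p for p in rel_paths if not under_flagged(p)]
```

Given the listing `a/x.txt`, `a/.nosync`, `a/sub/y.txt`, `b/z.txt`, and `top.txt` with `.nosync` as the marker name, only `b/z.txt` and `top.txt` remain.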

Symlinks

First note that all directory links are followed regardless of setting. Use exclusions to avoid syncing a linked directory.

If copy_symlinks_as_links=False, symlinked files sync their referent (and rsync uses -L). If True (default), the link itself is copied (a la how git works).

WARNINGS:

  • If copy_symlinks_as_links = False and there are symlinked files that point to another file INSIDE the sync root, there will be issues with the file tracking. Do not do this!
  • As also noted in Python's documentation, there is no safeguard against recursively symlinked directories.
  • rsync may throw warnings for broken links
  • rclone's support of symlinks is unreliable at the moment.

Pre and Post Bash

There is the option to also add some bash scripts pre and post sync. These may be useful if you wish to do a git push, pull, etc either remote or local.

They are ALWAYS executed from the sync root (a cd /path/to/syncroot is inserted above).
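The effect of the inserted cd can be pictured with the sketch below, which prepends a `cd` line to a user script and runs the result with bash. This is a local-only illustration with a hypothetical function name; it does not show how PyFiSync runs hooks on a remote machine, and it assumes Python 3.5+ and that `bash` is available.

```python
import shlex
import subprocess


def run_hook(script, sync_root):
    """Run a bash snippet from the sync root; return (code, stdout, stderr)."""
    # Prepend the cd so the user's script always starts in the sync root.
    full_script = "cd {}\n{}".format(shlex.quote(sync_root), script)
    proc = subprocess.run(
        ["bash", "-c", full_script],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

A hook such as `git add -A && git commit -m sync` would then operate on the repository at the sync root without needing its own cd.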

Running Tests

To run the tests, in bash, do:

$ source run_test.sh

In addition to testing a whole slew of edge cases, it also will test all actions on a local sync, and remote to both python2 and python3 (via ssh localhost). The run script will try to call py.test for both versions of python locally.

Known Issues and Limitations

The test suite is extensive and covers many different and difficult scenarios. See the tests for further exploration of how the code handles these cases. Please note that unless specified otherwise in the config or by a command-line flag, all deletions and (future) overwrites first perform a backup. Moves are not backed up but can likely be unwound from the logs.

A few notable limitations are as follows:

  • Symlinks are followed (optionally) but if the file they are linking to is also in the sync folder, it may confuse the move tracking
  • File move tracking
    • A file moved with a new name that is excluded will propagate as deleted. This is expected since the code no longer has a way to "see" the file on the one side.
    • A file that is moved on one side and deleted on the other will NOT have the deletion propagated regardless of modification
  • Sync is based on modification time metadata. This is fairly robust but could still have issues. In rsync mode, even if PyFiSync decides to sync the files, it may just update the metadata. In that case, you may just want to disable backups. With rclone, it depends on the remote and care should be taken.
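To make the mtime-based decision concrete, here is a deliberately simplified sketch. The function name, the `"A->B"` labels, and the one-second tolerance are assumptions for illustration; it ignores conflicts (both sides changed since the last sync), which the real tool has to handle.

```python
def decide_transfer(mtime_a, mtime_b, tolerance=1.0):
    """Pick a transfer direction from the two modification times.

    Use None for a side where the file does not exist. Returns "A->B",
    "B->A", or "none".
    """
    if mtime_a is None and mtime_b is None:
        return "none"
    if mtime_b is None:
        return "A->B"  # only exists on A
    if mtime_a is None:
        return "B->A"  # only exists on B
    if abs(mtime_a - mtime_b) <= tolerance:
        return "none"  # treat near-equal times as the same version
    return "A->B" if mtime_a > mtime_b else "B->A"
```

A tolerance is included because file systems and remotes can store modification times at different resolutions.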

There is also a potential issue with the test suite. To ensure that files are noted as changed (since they are all modified so quickly), the times are often adjusted by random amounts. There is a small chance that a test fails because a random adjustment was too small to register as a change. Rerunning the tests should then pass.

See rclone readme for some rclone-related known issues

Other Questions

See the (growing) FAQ for some more details and/or troubleshooting
