Utility to migrate code from SmalltalkHub (or any MCZ-based repo) to Git.
The output is in Tonel format.
Pharo 7
Metacello new
baseline: 'GitMigration';
repository: 'github://peteruhnak/git-migration/repository';
load
Pharo 6 is not supported.
Table Of Contents
- Installation
- Possible Issues
- Usage - Quick Example
- Usage - Detailed Example
- Extras
- Visualizations
- see pretty pictures of your MCZ history before you decide to migrate
- For Developers
- digging in the internals
This tool has been used in countless successful migrations, however it is possible that you will run into a very special edge case™. Feel free to open an issue, contact me directly, on Pharo's mailing list or Discord.
-
Corrupted MCZs: Sometimes the MCZ that is on SmalltalkHub is corrupted. Although the MCZ contains a "backup" in form of a fileout, Pharo cannot actually correctly read this most of the time. The recommended solution is to just add the MCZ name to
#ignoredFileNames:
. -
Private emails on GitHub: If you use private email on GitHub, you will need to provide your GitHub-generated alias, otherwise the push will be rejected. This applies only to the email of the person pushing to GitHub, not all committers.
-
performance
- downloading MCZs -- GitMigration is downloading all of your project MCZs from SmalltalkHub. This can take a while depending on the quality of your connection and how SmalltalkHub feels on any particular day
- converting (in Pharo) -- each MCZ is read from disk, parsed, and written back to disk in a different format; this can take a while for large projects
- e.g. PolyMath with 800 commits across 70 packages took ~3 minutes on a stock HDD
- importing (Git) -- this should be under a minute; in most cases it will probably take couple of seconds
-
MCVersion dependencies are not supported (but I don't think they were used outside of Slices)
- if you don't know what this is, you probably don't need to care
-
preserving proper merge history (see also #4)
- after many hours burned on this I've concluded that there is no way to do a fully automated 1:1 migration; note that your data/commits are not lost, only the merge history will not be as rich.
- git installed in the system and available in
PATH
- Pharo 7
This tool generates a file for git-fast-import.
Add your source repository (SmalltalkHub) to Pharo, e.g. via Monticello Browser
The migration will need to know from which commit it should start. This will be typically the SHA of the current commit of the master branch; you don't need the full 40-char SHA, an unambiguous prefix is enough.
The get the current commit, you can use the following:
$ git log --oneline -n 1
See further down for a detailed line-by-line explanation.
"Pharo"
migration := GitMigration on: 'peteruhnak/breaking-mcz'.
"
migration selectedPackageNames: #('Somewhere').
migration ignoredFileNames: #('Somewhere-PeterUhnak.2').
"
migration onEmptyMessage: [ :info | 'empty commit message' ].
migration downloadAllVersions.
migration populateCaches.
migration allAuthors.
migration authors: {'PeterUhnak' -> #('Peter Uhnak' '<i.uhnak@gmail.com>')}.
migration
fastImportCodeToDirectory: 'repository'
initialCommit: '5793e82'
to: 'D:/tmp/breaking-mcz2/import.txt'
# Terminal
cd D:/tmp/breaking-mcz2
git fast-import < import.txt
git reset --hard master
git gc
A longer description of the example above.
"Specify the name of the source repository; I am sourcing from peteruhnak/breaking-mcz project on SmalltalkHub"
migration := GitMigration on: 'peteruhnak/breaking-mcz'.
"optional -- migrate only some packages; if you don't specify anything, then all packages will be migrated"
migration selectedPackageNames: #('Somewhere').
"optional -- in case you have corrupted MCZs, you can ignore them and rerun the migration"
migration ignoredFileNames: #('Somewhere-PeterUhnak.2').
"if a MCZ was missing a commit message, you can provide an alternative; info is an instance of the problematic MCVersionInfo"
migration onEmptyMessage: [ :info | 'empty commit message' ].
"Download all mcz files, this will take a while"
migration downloadAllVersions.
"Preload version metadata into the image."
migration populateCaches.
"List all authors anywhere in the project's commits"
migration allAuthors. "#('PeterUhnak')"
"You must specify name and email for _every_ author"
"You must also specify the name/email for yourself (Author fullName), even if you haven't authored any code -- git treats separately the author of a commit and the commiter of a commit"
"AuthorName (as shown in #allAuthors) -> #('Nicer Name' '<email@example.com>')"
migration authors: {
'PeterUhnak' -> #('Peter Uhnak' '<i.uhnak@gmail.com>')
}.
"Run the migration, this might take a while
* the code directory is where the code will be stored (common practice is to have the code in `repository` subfolder, just like this project)
* initialCommit is the commit from which the migration should start
* to is where the git-fast-import file should be stored"
migration
fastImportCodeToDirectory: 'repository'
initialCommit: '5793e82'
to: 'D:/tmp/breaking-mcz2/import.txt'
Get a terminal, go to the target git repository, and run the migration.
# import.txt is the file that you've created earlier
$ git fast-import < import.txt
# fast-import doesn't change the working directory, so we need to update it
$ git reset --hard master
# (optional) garbage collection: fast import leaves a lot of mess behind
# happens automatically on commit since Git >=2.17
$ git gc
You should see the changes, and git log
should show you the entire history.
Forgetting all changes in the history and going back to previous state. Useful if the migration is botched and you want to rollback all changes.
$ git reset --hard SHA
If you want to play around with the version data before committing, read the following.
migration := GitMigration on: 'peteruhnak/breaking-mcz'.
Downloading all MCZs from server; this needs to happen only once and can take couple of minutes for large repos.
migration cacheAllVersions.
List all packages in the repository that have multiple roots; although rare, this could be either result of multiple people starting independently on the same package, or a mistake was made during committing. GitMigration should be able to handle this correctly regardless.
migration packagesWithMultipleRoots.
List all authors in the repository.
migration allAuthors.
Dictionary of all packages and their real (see later what's real) commits.
versionsByPackage := migration versionsByPackage.
All versions of a package, whether there is actually an MCZ or not. With Monticello it is very easy to create a commit whose ancestor is not in the repository, so it is not obvious how the commit connects the previous ones. Thankfully MCZ typically contains the hierarchy many steps back, so we can correctly reconstruct the whole tree.
allVersions := migration completeAncestryOfPackageNamed: 'Somewhere'.
The versions in mcz are random, so we need to sort them in an order in which we can commit them to git. This means that all ancestry is honored (no child is commited before its parent), and "sibling" commits are sorted by date. Note that we cannot just sort the commits by date, because the date might not follow the ancestry correctly (which can happen, especially if different timezones are involved, which MC doesn't keep track of)
sorted := migration topologicallySort: allVersions.
Get the total ordering of all commits across all packages
allVersionsOrdered := migration commitOrder.
This requires Roassal to be installed (available in catalog).
In all visualizations hovering over an item will show a popup with more information, and clicking on item will open an inspector. Keep in mind that running the command will not open a new window, so you have to either inspect it, or do-it-and-go in playground.
Looking at raw data is not very insightful, so couple visualization are included.
migration := GitMigration on: 'peteruhnak/breaking-mcz'.
migration cacheAllVersions.
visualization := migration visualization.
Show the complete ancestry of a single package.
visualization showAncestryTopologyOnPackageNamed: 'Somewhere'.
- Yellow - root versions (versions with no parents, typically only a single initial commit)
- Cyan - tail/head versions (versions with no children, typically the latest version(s))
- Magenta - "virtual" versions that do not have a corresponding commit (this happens as mentioned earlier)
The number on the third line indicates in what order the packages will be committed (magenta packages are listed, but are not committed, because there is no code to commit). Keep in mind that the number in the commit (Somewhere-PeterUhnak.15) has no meaning, and can be easily changed (and broken by hand) when committing.
To see all packages and history, you could do.
visualization showProjectAncestry.
This is useful if you want to quickly glance at a project (and is also much faster to generate and use), but if want you can also add label
visualization showProjectAncestryWithLabels.
"or"
visualization showProjectAncestryWithLabels: true.
If you have big project and want to look only at certain packages, you can do so. (In the image you can see that the longest chain has ancestry broken - red box at the end)
migration := GitMigration on: 'PolyMath/PolyMath'.
migration cacheAllVersions.
visualization := migration visualization.
"or just a collection of package names"
visualization showProjectAncestryOn: (allPackages copyWithoutAll: #('Monticello' 'ConfigurationOfSciSmalltalk' 'Math-RealInterval')).
Adding labels works the same way
visualization showProjectAncestryOn: aCollectionOfPackages withLabels: aBoolean
Some hints and random thoughts. SmalltalkHub stores every commit in a separate MCZ file, which contains some metadata about the commit (name, ancestry, etc), as well as all the code. The code itself is not incremental, rather code in each zip is as-is.
This means that when GitFileTree is exporting, it will remove all files on the disk, unpack the MCZ file, and write all the code back to disk, and commit. Git is smart enough to only commit what has actually changed, however for GFT this operation is very IO intense - if you have 5k files in your code base and you changed just a single method (which is common), then 5k files will be removed and then added back... you can imagine what this does to the disk when performed 1000x times (once for each commit).
With fast-import I've made a workaround for this. A pseudo-repository GitMigrationMemoryTreeGitRepository
is created that uses memory file system as the target directory. This way the fileout doesn't write to real disk and everything is kept in RAM, which improves the performance significantly.
Note however that instead of using MemoryStore
I had to subclass it (GitMigrationMemoryStore
) to properly handle path separators; on Windows, MemoryStore by itself will create files and directories with slashes (both forward and backward) in their names instead of creating a hierarchy, so my GitMigrationMemoryStore
fixes this.
I am also subclassing MemoryHandle
(GitMigrationMemoryHandle
) and I've changed the writeStream
of it to return MultiByteBinaryOrTextStream
. This is because MemoryStore
returns only an ordinary WriteStream
which cannot handle unicode content and 那不是很好。 :)