
Optimized BlackWhiteRaster/Added Needed RasterablePart Function #102

Closed
wants to merge 13 commits into from

Conversation

tatarize
Contributor

Graphics should always be iterated y then x. The reason is that the memory is laid out sequentially, so it is always preferable to read along the scanlines; that is what all modern processors are built around, and most graphic formats specifically use scanlines. The elements of the array should avoid any unneeded overhead, such as arrays of arrays. They should be packed effectively to allow shorthand operations like checking whether 32 pixels at once are blank.
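The iteration and packing argument can be sketched like this. This is a minimal illustration, not the actual BlackWhiteRaster layout; the names (PackedRaster, words, wordsPerLine) are assumptions:

```java
// Minimal sketch of scanline-major packing: 32 one-bit pixels per int word.
// Names are illustrative, not the real field names in BlackWhiteRaster.
class PackedRaster {
    final int[] words;
    final int wordsPerLine;
    final int width, height;

    PackedRaster(int width, int height) {
        this.width = width;
        this.height = height;
        this.wordsPerLine = (width + 31) / 32;
        this.words = new int[wordsPerLine * height];
    }

    void setBlack(int x, int y) {
        words[y * wordsPerLine + (x >>> 5)] |= 1 << (31 - (x & 31));
    }

    // y outer, x inner: reads run sequentially along each scanline, and whole
    // 32-pixel words can be tested for blankness in a single comparison.
    int countBlack() {
        int count = 0;
        for (int y = 0; y < height; y++) {
            int base = y * wordsPerLine;
            for (int w = 0; w < wordsPerLine; w++) {
                int word = words[base + w];
                if (word == 0) continue; // skip 32 blank pixels at once
                count += Integer.bitCount(word);
            }
        }
        return count;
    }
}
```

The inner loop walks forward through one contiguous array, which is the access pattern prefetchers are built for.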

I also added firstBlackPixelOnThisOrNextLine and lastBlackPixelOnThisOrNextLine. The RasterablePart needed a helper for this; it was a sticking point of my driver's raster optimization. I was only allowed one transition from right-to-left each scan line (required by the hardware), so I cannot make it before the last black pixel on either the current line or the next line. But I also couldn't know whether I was going left or right (it's private), and thus couldn't perform the correct operation of either min or max, as the correct extreme depends on the direction.
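The direction problem reduces to taking a min or a max over this line and the next. A sketch of those semantics (the per-line extreme arrays and the signatures here are illustrative assumptions, not the real LibLaserCut API):

```java
// Sketch of the two-line extremes the new helpers compute. leftMost/rightMost
// hold the first/last black pixel per line; Integer.MAX_VALUE / -1 mark blank
// lines. These arrays and signatures are illustrative assumptions.
class TwoLineExtremes {
    // Relevant for a right-to-left pass: the leftmost black pixel you must reach.
    static int firstBlackOnThisOrNextLine(int[] leftMost, int y) {
        int here = leftMost[y];
        int next = (y + 1 < leftMost.length) ? leftMost[y + 1] : Integer.MAX_VALUE;
        return Math.min(here, next);
    }

    // Relevant for a left-to-right pass: the rightmost black pixel you must reach.
    static int lastBlackOnThisOrNextLine(int[] rightMost, int y) {
        int here = rightMost[y];
        int next = (y + 1 < rightMost.length) ? rightMost[y + 1] : -1;
        return Math.max(here, next);
    }
}
```

Which extreme is correct depends on the travel direction, which is why the helper has to exist inside the class that knows it.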

Also expanded the test for BlackWhiteRaster to exercise the added shorthand functions there.

No attempt was made to actually use those functions in RasterablePart, which has many of the same functions, but those will be slower due to the abstraction.

@t-oster
Owner

t-oster commented Jun 18, 2019

Looks reasonable to me. However since this is a rather big change, did you run some performance comparisons in order to make sure this does not slow down existing drivers significantly?

@tatarize
Contributor Author

You can also just add in the changes for the RasterPart; those are just a couple of functions that rely on the current functions and the state of the boolean. Those can be clearly identified as safe pretty quickly.


Gist of the test: source and timing of all dithering algorithms applied to a bunch of real images.
https://gist.github.com/tatarize/f1ec9736dd6aff9140344a6a3fbc2e6e#file-gistfile1-txt

tl;dr
Original took 144.524 seconds.
Optimized took 56.537 seconds.

Dithering will take only 39% as much time on average using the new version of the code as it does using the old version. I think that's the most time-intensive process the raster is asked to do. The timing includes the time to perform the dithering (so really the class itself is going to be even faster).

If you raster an image where the dither time is noticeable, you should feel the difference.

Going 2.5x faster strongly implies it shouldn't slow down the existing drivers. It should show pretty strong increases in speed for the core operations of RasterPart too. Namely, the old layout gets scanline data one byte from each of many arrays while iterating the scanline, and that turns the memory access pattern into a crapshow.

There are also some faster shorthand functions in the new raster that will outperform RasterPart's code, because they cheat and know how the data is structured.

--

Though all the actual raster code seems to use getRasterLine(), which will have the memory cache-miss problem. But, admittedly, I don't have ample coverage for either testing or profiling the change there, and I should certainly do both before it's accepted.

@tatarize
Contributor Author

Really though, I just wrote a driver that did rastering, and I wish RasterizableJobPart were usable, but it was not. In fact it's only used by LaserCutter.java to convert RasterableParts into VectorParts. And I couldn't even just call that function either; I'd have to make a LaserCutter instance and override getRasterPadding().

I would rejigger those classes. I'd cut RasterizableJobPart and just make it a utility class. It's only actually used by convertRasterizableToVectorPart in LaserCutter, and that function is literally one "rasterPadding" away from being a static function. Everything it does is access the image functions. And it doesn't even do that in a way that helped me: it gave me a more generic GreyRaster rather than a BlackWhiteRaster, which makes my helpful BlackWhiteRaster speed functions inaccessible.

Merely fixing the indexing on the BlackWhiteRaster should actually speed things up. But if you really wanted it to go fast, you'd be better off doing a number of reasonable optimizations.

Everybody calls RasterizablePart.getRasterLine(), but regardless of the whole "this saves memory allocation" idea, it categorically does not. The 800-pound gorilla in that function is ByteArrayList moving the bytes at all. Really you should return a list that is merely a window into the data, which stays where it is, so that getRasterLine() just gets you a reasonable iterator that you can use the way you need to, one that directly calls the functions within the BlackWhiteRaster class rather than copying everything off into a sort of strange ArrayList-wannabe class. You shouldn't make a copy of the data; that's a huge operation. You should return a List object that simply delegates the get() method to the getByte() method within the raster.
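A minimal sketch of such a delegating view. The class name, backing-array layout, and fields are assumptions for illustration; the point is that get() delegates instead of copying:

```java
import java.util.AbstractList;

// A copy-free scanline view: get() delegates straight into the backing array,
// so obtaining a "line" allocates one small wrapper and moves zero bytes.
// Names and layout are illustrative assumptions.
class ScanlineView extends AbstractList<Byte> {
    private final byte[] raster; // packed pixel data, one scanline after another
    private final int offset;    // start of this line in the backing array
    private final int length;    // bytes per scanline

    ScanlineView(byte[] raster, int offset, int length) {
        this.raster = raster;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public Byte get(int index) {
        return raster[offset + index]; // no copy, O(1) per access
    }

    @Override
    public int size() {
        return length;
    }
}
```

A driver can iterate this exactly like the copied list it gets today, but constructing it is constant time regardless of line width.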

But after a while you're optimizing with diminishing returns. If the speed of the raster part builders is a thing that concerns you, it wouldn't be too hard to fix them. Or expand BlackWhiteRaster to encode greys (after all, the difference is implementing bit_depth). It changes the math, but it's all still pretty easy. (I've done it before: https://github.com/K40Nano/K40Nano/blob/master/k40nano/PngRaster.py.) Really the important thing would be a utility class that actually helps. It would be nice to fix some of these design issues. That class just kinda hurt me. But, more pressingly, the other one made it so my code will run forever to the left even though it knows the laser doesn't need to be there, because there wasn't anything in the helper bits that helped.

Though if you wanna see what I mean by implementing the list itself, I made an example where I implemented getLine() in BlackWhiteRaster (untested).

https://gist.github.com/tatarize/e91c077f8759d9a9b5165b782ec6a6fd

You get the List of bytes, but you have access to the BlackWhiteRaster.PixelWindow class, and you can tell it to give you the next line, or the reversed next line, or whatever, and it's an instant constant-time operation, as the only thing it does is delegate.

@tatarize
Contributor Author

Though really where your problems actually start is BufferedImageAdapter. Rather than making as many calls to getRGB() as you have pixels, just call the bulk function and get all the pixels at once; the per-pixel calls are tanking your speed there. What you actually want to do is get rid of most of that stuff and just use BufferedImages.
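The bulk read looks like this. This is a sketch; the greyscale weighting shown is one common choice (BT.601 luma), not necessarily what BufferedImageAdapter uses:

```java
import java.awt.image.BufferedImage;

// One bulk getRGB() call for the whole image instead of width*height
// per-pixel calls, each of which repeats bounds checks and color-model
// conversion. The luma weights below are BT.601, one of several options.
class BulkPixels {
    static int[] readAllPixels(BufferedImage img) {
        int w = img.getWidth(), h = img.getHeight();
        return img.getRGB(0, 0, w, h, null, 0, w); // single bulk call
    }

    static int toGrey(int argb) {
        int r = (argb >> 16) & 0xFF, g = (argb >> 8) & 0xFF, b = argb & 0xFF;
        return (r * 299 + g * 587 + b * 114) / 1000;
    }
}
```

The returned array is row-major (index = y * width + x), so the downstream code keeps its scanline order for free.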

Rather than getGreyscale() for that stuff, you can just make a greyscaled BufferedImage and use that. Also, you use like one method to convert to greyscale; there are actually about six solid and different algorithms you could use.

When you dither, make a dithered black-and-white BufferedImage. And when you iterate the pixels, write a helper class that gets at the underlying BufferedImage memory and builds an interface to the data that helps you iterate through the pixels in a highly relevant way.

Like how the very lovely VectorCommand API works. That is beautiful. It would be nice if you'd crush the rasterization up into very tiny pieces like that rather than giving all drivers a chance to reinvent the wheel.

Rasterizations can need a couple of different properties: DPI, buffer_padding, clip_white_pixels from the end, bidirectional/unidirectional, can_skip_blank_lines. Then you feed my driver some commands like:

  • CUT_RIGHT 6
  • MOVE_RIGHT 8
  • NEXT SCANLINE
  • MOVE_LEFT 60
  • CUT_RIGHT 6
  • MOVE_RIGHT 8
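A VectorCommand-style stream for the list above might look like this. The enum and classes are a sketch of the proposed API, not anything that exists in LibLaserCut:

```java
import java.util.List;

// Sketch of a raster command stream in the style of VectorCommand: the driver
// consumes pre-chewed motion commands and never touches pixel data itself.
class RasterStream {
    enum Op { CUT_RIGHT, MOVE_RIGHT, CUT_LEFT, MOVE_LEFT, NEXT_SCANLINE }

    static class Cmd {
        final Op op;
        final double distance; // device units; 0 for NEXT_SCANLINE
        Cmd(Op op, double distance) { this.op = op; this.distance = distance; }
    }

    // Example consumer: track the head's final x position after a stream.
    static double finalX(List<Cmd> stream) {
        double x = 0;
        for (Cmd c : stream) {
            switch (c.op) {
                case CUT_RIGHT: case MOVE_RIGHT: x += c.distance; break;
                case CUT_LEFT:  case MOVE_LEFT:  x -= c.distance; break;
                case NEXT_SCANLINE: break;
            }
        }
        return x;
    }
}
```

Feeding it the sequence above (6 right, 8 right, next line, 60 left, 6 right, 8 right) leaves the head 32 units left of where it started, and the driver never saw a pixel.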

I'm expected to get lines of bytes and do this stuff myself. As a result my driver is missing some cool features other ones have, like the ability to clip off those extra white bytes.

In reality, with the data moved around and refigured, this is just a fancy reading operation on the BufferedImage, and it can be done without allocating any memory. You only need to allocate memory for the greyscale conversion and the dithering conversion (and really you could chain those together). You shouldn't ever need to copy it again. But you seem to do no fewer than two extra copies of all the data just to format it in a way that the driver can't actually use very easily. The VectorCommand methodology there is beautiful. The API for the raster should be very similar. And it could be done in what amounts to O(1).

@mgmax
Collaborator

mgmax commented Jun 18, 2019

While your code looks good, it changes the output of the Epilog driver, which should not be the case. Saving to file shows different output in the engrave data. Probably you mixed up some index, so that the pixels are not stored correctly.

Regarding the discussion on the overall interface structure for raster engraving:

Looking at the drivers we already have, they seem to fall into two categories with very few exceptions:

  • Many machines don't have special engrave commands, but emulate that via "move to coordinate" and "cut to coordinate". This can be handled via the raster to vector conversion, so the driver doesn't need to understand raster at all. The current code certainly isn't efficient, but could be improved by converting it to an iterator instead of storing everything. I have no practical experience about how large the memory and CPU time usage is, but most of these machines are very slow with raster engraving anyway, no matter how fast we make VisiCut.

  • The more advanced boards (Epilog, LTT, ...) typically support raw raster data ("lines of bytes"), so it is important to have that as an interface as well. Sometimes, the white beginning or end of the line is cut off, but not strictly always, so the driver needs to do that on its own. Currently, for any computer except the Raspberry Pi, memory isn't really the problem - even old netbooks have 4GB+ of RAM, and computation time is negligible compared to the time the lasercutter will take for engraving. But of course any improvement that doesn't hurt code readability is welcome.

Your K40 seems to be the rare exception with a very large set of low-level instructions that doesn't really fit any of the above, so I wouldn't recommend optimizing the interface too much for your use-case.

@tatarize
Contributor Author

And the timings with getRasterLine() are done.
https://gist.github.com/tatarize/178f1d543716613a3acb135225c2b7af

Original:
Time Dither: 30.298195669
Total Access: 0.053820466
Total Time: 30.352016135 ms

Optimized:
Time Dither: 11.790970770
Total Access: 0.051860206
Total Time: 11.842830976

The access time for the getRasterLine() is pretty unaffected.

@tatarize
Contributor Author

My testing coverage for getRasterLine() seems pretty low; odds are good I messed up something there. In fact, it's almost certain that I messed up the endianness. Yup. Okay, I'll fix that.

@tatarize
Contributor Author

getByte() should read

  public byte getByte(int x, int y)
  {
    // raster packs 4 bytes per int, most significant byte first
    int index = y * scanline + (x / 4);
    int index_in_byte = 3 - (x % 4);
    int ix = 8 * index_in_byte;
    return (byte) ((raster[index] >> ix) & 0xFF);
  }

If you wanna test it before I finish the push.

@tatarize
Contributor Author

The board itself has more in common with some of the other ones that use USB, like if you adapted to controlling Ruida devices via USB; some of the core elements would be a bit more similar. It's more that the library doesn't do such things currently. It seems alien by exclusion. Ruida does a lot of low-level stuff too. It's not really as cryptic, but a USB Ruida driver would include a number of similar things.

t-oster/VisiCut#404

In fact, if you wanted to control that device properly with LibLaserCut, my driver would make for a nicer starting-off point. You supported the easier stuff, and that stuff is generally similar. But a lot of the other stuff will be connecting to this thing, sending this weird-ass alien format to this place, and jiggling the bits like this. It's pretty clear LibLaserCut kinda isn't set up to expect a driver like the K40Nano one. It would be entirely reasonable to be a lot more liberal with what you expect.

If getting a bunch of bytes for the raster is a thing, any good implementation of RasterPart should let you do that. But for my uses it woulda been nice to have the option to have it pre-chewed and spoonfed to my driver. Optimizing the backend stuff where it makes too many copies and accesses BufferedImages in wrong ways, that's all just window dressing. Doing the right thing in 200 ms when you could have done it in 2 ms hardly matters if the whole project will take 2,000,000 milliseconds (33 minutes).

@t-oster
Owner

t-oster commented Jun 19, 2019

@tatarize thank you for all your work. You seem to be really into that stuff. It's a shame that I am only reading through pull requests in my spare time, because my day job is far away from VisiCut, instead of having the time to hack on it.
Your timing improvements look great, and as soon as @mgmax can confirm the output for the Epilog driver stays the same, I am happy to integrate this.

@tatarize
Contributor Author

You seem to be really into that stuff.

@t-oster scars of a misspent youth. I spent more Friday nights ensuring my functions didn't allocate any unnecessary memory than I care to admit.

Still to this day, one of my hobbyhorses is the flawed belief that you need to allocate memory to perform a convolution on image data. --- "You fools! That's only a requirement if the result pixel is in the center, and that's arbitrary; if you place it in the upper-left corner of the kernel, you can perform the algorithm as a scanline operation within its own memory footprint!"
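In 1D the trick looks like this (a sketch of the anchoring argument; the tail samples the kernel can no longer cover are simply left untouched):

```java
// In-place convolution: the output is written at the kernel's left (in 2D,
// upper-left) anchor, so a forward scan only ever reads samples at or ahead
// of the write position, i.e. samples not yet overwritten. No scratch buffer.
class InPlaceConvolve {
    static void convolve(double[] data, double[] kernel) {
        for (int i = 0; i + kernel.length <= data.length; i++) {
            double acc = 0;
            for (int k = 0; k < kernel.length; k++) {
                acc += data[i + k] * kernel[k]; // reads only positions >= i
            }
            data[i] = acc; // never read again by later iterations
        }
    }
}
```

The same argument lifts to 2D: scanning top-to-bottom, left-to-right, the kernel window sits entirely at or below/right of the write position, so the image convolves within its own footprint.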

@tatarize
Contributor Author

With regard to the conversations about the problems with these classes, also see #104. It's likely too drastic, but it shows how I would optimally correct the problems: namely, nuke lots of it, move the stuff that's left to VisiCut, then just have RasterPart and Raster3dPart take BufferedImages. That way that stuff can actually be sped up, rather than moving a couple of loop structures so it fits the memory structure a bit better.

@mgmax
Collaborator

mgmax commented Jun 20, 2019

I tested again and now everything is okay; "save to file" leads to exactly the same results.

@tatarize
Contributor Author

Mg's suggestion in the other pull request pegs this as actually low-hanging fruit. I'd actually rather have a different change that will similarly fix the timing: namely, back this class with a wrapped BufferedImage. So just delegate everything properly to a BufferedImage that is created to hold the data. That way, if you want, you can just get the BlackWhiteRaster.image and draw it to a screen, or copy memory from it, etc. It seems much easier to optimize against.

@t-oster
Owner

t-oster commented Jun 21, 2019

I just remembered why I avoided BufferedImage in the first place: it is (or was?) not available on Android! So before the changes, a native Android app could use liblasercut. https://stackoverflow.com/questions/5392695/bufferedimage-in-android

@tatarize
Contributor Author

I write Android too.
https://play.google.com/store/apps/developer?id=David+Olsen

And have written Java/Android libraries before:
https://github.com/EmbroidePy/EmbroideryIO


My actual reason for wanting this change is generally to get rid of the classes altogether. Interoperability with Android would somewhat support this idea. The Bitmap class has other ways to pack the data (see https://developer.android.com/reference/android/graphics/Bitmap.Config.html); the only real overlap with BufferedImage is that ARGB_8888 is the same as TYPE_INT_ARGB. And you're doing a lot of image manipulation, and I sort of think the correct amount to do is none. One of the more pressing differences within Android, though, is that memory is at a premium. You sort of can't store more than one bitmap in memory.

For a minimum amount of work, I think maybe it should be rewritten so that RasterPart and Raster3dPart simply accept an array of bytes and a scanline value, with the expectation that any manipulation has already been done. They could do the highly efficient data chunking themselves. And the image manipulation could be done better and faster and spun off, likely through just taking a bunch of ARGB ints and turning them into proper bytes. I'll head that direction in my never-to-be-approved break_everything pull request.

--

Also, the Usb4Java utility doesn't work great on Android; the security stuff is really tight and basically impossible to get right, since it natively packs up some libusb elements in jars. Android does have USB, however (https://developer.android.com/guide/topics/connectivity/usb/host), and that would certainly work; I did strongly muse over the idea of controlling the stock K40 laser cutter with a USB-OTG cable. It might be kind of neat to do. But for interoperability, the K40Usb class within the K40Nano driver would need a second form for working with Android, and the two would not exist in the same world. Much as my EmbroideryIO project spins off a different version of EmbMatrix for the relevant system, as AffineTransform and Matrix don't play nice together.

@tatarize
Contributor Author

However, with that stuff in mind, don't approve either version. I'll do a quick rewrite to use byte[], scanline, width, and bitdepth. It seems like the bit packing doesn't make much difference; the biggest point is not messing up the scanline order.

Also, there's a fun method by which even the K40 can do a 3d raster, basically even without power control and certainly without wanting to constantly change the speed: I can actually do passes. So the first pass burns all the pixels between 0-64 of value, then 0-128 of value, then 0-192 of value. I might as well nudge that possibility in there, even if there's no ultimate desire to add it.
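The pass scheme reduces to counting thresholds. A sketch with the 64/128/192 cutoffs from above (grey 0 = black; the class and names are illustrative):

```java
// Multi-pass pseudo-greyscale for a laser with no power control: a pixel is
// burned on every pass whose threshold its grey value falls under, so darker
// pixels get more passes. Thresholds 64/128/192 as described above.
class PassPlanner {
    static final int[] THRESHOLDS = {64, 128, 192};

    static int passesFor(int grey) { // grey 0 = black, 255 = white
        int passes = 0;
        for (int t : THRESHOLDS) {
            if (grey < t) passes++;
        }
        return passes;
    }
}
```

Black pixels get all three passes, near-white pixels get none, so depth of burn tracks darkness without ever touching power or speed.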

@tatarize
Contributor Author

Okay, I basically implemented a proper raster in raw Java. Still stored correctly, but backed by bytes for raw data chunking. And it can store any bit depth or sample count (that comes in at less than 32). So you could store 17-bit greyscale pixels if you wanted, or 10-bit, 3-sample RGB color. It basically just fixes the locations. You could even tag on some interleaving byte, throw it through zlib, and save this thing as a PNG. (I've implemented PNG before.)

The important thing is you can have 1-sample 1-bit greys and 1-sample 8-bit greys in the same structure. The fake GreyRaster stuff at the end there isn't just for show anymore; it actually works correctly. If you tell it you have a bitDepth of 8, nothing changes except all the greys have that bit depth. When you read the raster line, it gives you what you'd expect without having to change the structure anywhere. You get that line of bytes holding 8 one-bit pixels each, and you get that line of bytes holding one 8-bit grey pixel each. It's just handing you the bytes.

Also, you can have more or less bitDepth (even depths that cross byte boundaries). Pure Java, and it will port to anything. You could store RGBA or ARGB_8888 without issue (4 samples, bitDepth of 8). It's like how I packed it in ints and the previous version packed it in bytes, except that it's solved for the general case.
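The general-case packing can be sketched as plain bit arithmetic over a byte[]. This is illustrative, not the actual implementation; it assumes MSB-first sample bits (as in PNG), so samples may span byte boundaries:

```java
// Byte-backed storage with arbitrary per-sample bit depth, including depths
// that cross byte boundaries. Sample bits are MSB-first, as in PNG.
// Names and layout are illustrative assumptions.
class BitPacked {
    static int getSample(byte[] data, int bitDepth, int sampleIndex) {
        int value = 0;
        int bitPos = sampleIndex * bitDepth; // absolute bit offset
        for (int i = 0; i < bitDepth; i++, bitPos++) {
            int bit = (data[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
            value = (value << 1) | bit;
        }
        return value;
    }

    static void setSample(byte[] data, int bitDepth, int sampleIndex, int value) {
        int bitPos = sampleIndex * bitDepth;
        for (int i = bitDepth - 1; i >= 0; i--, bitPos++) {
            int bit = (value >> i) & 1;
            int shift = 7 - (bitPos & 7);
            data[bitPos >> 3] = (byte) ((data[bitPos >> 3] & ~(1 << shift)) | (bit << shift));
        }
    }
}
```

With bitDepth 1 this degenerates to the black/white case, with 8 to one grey byte per pixel, and with something odd like 10 the samples straddle byte edges transparently.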

@tatarize
Contributor Author

Also, since it's a real raster, you could actually just use the same routines for RasterPart and Raster3dPart; the important thing is the number of bits per pixel and using those bits to control the properties. You don't actually need RasterPart to give you the depth; the raster itself would have a depth. And I should likely take several of those useful functions in RasterPart and move them to BlackWhiteRaster in a coherent fashion. In 8-bit mode, getRightmostBlackPixel() not only performs like it's in 1-bit mode, it doesn't even make sense. It really should be the rightmost non-white pixel.

@tatarize
Contributor Author

There's some incongruity cropping up with black = 1 in 1-bit and black = 0 in 8-bit. I'll have to leave it as it is. It would work for a general greyscale raster too, which is great, but doing that also requires having it know what black is in a consistent manner. There's no piecemeal way to fix that, especially with drivers that take the bytes and send them off to places.
