Investigate core text performance with DWrite stack #1374

Closed
rajsesh opened this issue Nov 14, 2016 · 3 comments

Comments

@rajsesh

rajsesh commented Nov 14, 2016

We have a few inefficiencies:

  1. RenderTarget BeginDraw()/EndDraw() is called for each rendering block (glyph run), which causes performance issues because EndDraw() is expensive. We should batch these calls at known points.

  2. Creation of numerous DWrite factory instances.

  3. Static initializations.
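
For item 2, one common way to avoid repeated factory creation is to hand out a single lazily created instance. The sketch below is only an illustration under assumptions: `FakeTextFactory` and `SharedTextFactory()` are hypothetical names, and the fake type stands in for whatever the real factory-creation call returns so that the example compiles without Windows headers.

```cpp
#include <cassert>

// Hypothetical stand-in for an expensive-to-create factory object.
// It only counts how many times it has been constructed.
struct FakeTextFactory {
    static int& CreatedCount() {
        static int count = 0;
        return count;
    }
    FakeTextFactory() { ++CreatedCount(); }
};

// Returns one process-wide instance. A function-local static is
// initialized exactly once (and thread-safely since C++11).
FakeTextFactory& SharedTextFactory() {
    static FakeTextFactory factory;
    return factory;
}

// Simulates many call sites asking for the factory and returns how many
// factory objects were constructed in total.
int RequestFactoryManyTimes(int requests) {
    for (int i = 0; i < requests; ++i) {
        FakeTextFactory& factory = SharedTextFactory();
        (void)factory;
    }
    return FakeTextFactory::CreatedCount();
}
```

However many callers go through `SharedTextFactory()`, only one factory object is constructed. Whether a shared instance is appropriate for the real factory depends on its threading and lifetime requirements, which this sketch does not model.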

@rajsesh

rajsesh commented Nov 17, 2016

One demo scenario is the non-standard edit control in xamlcatalog.

@rajsesh

rajsesh commented Dec 1, 2016

One thing stood out during perf analysis: we spend a lot of time in _DWriteGetFontPropertiesFromName() in DWriteWrapper.mm, specifically in __CFLocaleCopyCurrent. While this is a result of #1490, the font properties don't need to be localized, and there is no need for any use of CFLocale in this file. CFLocaleCopyCurrent() can hence go away, and nullptr can be used wherever a locale is needed.

ms-jihua added a commit to ms-jihua/WinObjC that referenced this issue Dec 15, 2016
 - Remove NSLayoutManager __lineHasGlyphsAfterIndex(), which is called multiple times per line
    - Instead, directly search for the index of the last visible glyph once, compare against this

 - DWriteWrapper_CoreText
    - Skip first range of attributes in __DWriteTextLayoutCreate()
        - Was redundant due to the underlying Format already taking it into account
    - reserve() ahead of time for glyphRunDescriptionInfo._clusterMap, CTRun->_glyphOrigins, ->_glyphAdvances
        - A lot of time was being spent in reallocation, resizing

 - Remove DWriteWrapper _GetUserDefaultLocaleName(), don't wrap in a wstring, directly use a wchar_t buffer
 - Reduce the number of character buffer copies in _CFStringFromLocalizedString() by 1
 - Remove unused/not useful _characters member from CTTypesetter

 - Misc

Related to microsoft#1374
ms-jihua added a commit to ms-jihua/WinObjC that referenced this issue Dec 16, 2016
 - Remove NSLayoutManager __lineHasGlyphsAfterIndex(), which is called multiple times per line
    - Instead, directly search for the index of the last visible glyph once, compare against this
    - Was previously about 10% of the CPU time of [NSLayoutManager __layoutAllText], now negligible

 - DWriteWrapper_CoreText
    - Skip first range of attributes in __DWriteTextLayoutCreate()
        - Was redundant due to the underlying Format already taking it into account
        - Saves about 8% of CPU time in __DWriteTextLayoutCreate()
    - reserve() ahead of time for glyphRunDescriptionInfo._clusterMap, CTRun->_glyphOrigins, ->_glyphAdvances
        - Was previously about 8~10% of _DWriteCreateFrame()'s CPU time, now negligible

 - Remove DWriteWrapper _GetUserDefaultLocaleName(), don't wrap in a wstring, directly use a wchar_t buffer
    - Performance impact not measured, likely to be fairly small
 - Reduce the number of character buffer copies in _CFStringFromLocalizedString() by 1
    - Saves 20%~30% of CPU time in _CFStringFromLocalizedString()

 - Remove unused/not useful _characters member from CTTypesetter

 - Misc

Related to microsoft#1374
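
The `reserve()` item in the commit message above can be illustrated in isolation. This is a minimal sketch, not the code from the commit: `CountReallocations()` is a hypothetical helper and the `float` values are placeholders for glyph advances. It counts how often a `std::vector` changes capacity while elements are appended, with and without reserving up front.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends `count` placeholder advances and returns how many times the
// vector's capacity changed, i.e. how many reallocations occurred.
std::size_t CountReallocations(std::size_t count, bool reserveFirst) {
    std::vector<float> advances;
    if (reserveFirst) {
        // One allocation up front, sized for everything we will append.
        advances.reserve(count);
    }

    std::size_t reallocations = 0;
    std::size_t lastCapacity = advances.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        advances.push_back(1.0f);
        if (advances.capacity() != lastCapacity) {
            ++reallocations;
            lastCapacity = advances.capacity();
        }
    }
    return reallocations;
}
```

With `reserveFirst` set, appending `count` elements causes no reallocation, because `reserve(count)` guarantees capacity for at least `count` elements. Without it, the vector grows repeatedly and copies its contents each time.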
ms-jihua added a commit that referenced this issue Dec 16, 2016
@ms-jihua

Summary of current CoreText performance issues after #1558 and assorted recommendations:

UIKit functions to watch:

  • NSString drawAtPoint:/NSString+UIKitAdditions.mm::drawString()
  • [NSString sizeWithAttributes:]
  • [NSLayoutManager __layoutAllText]

Bottlenecks:

  • CFRelease()

    • About 20% of [NSString drawAtPoint:] is spent in CFRelease(). We discussed this offline before, but our CFRelease() is very slow compared to our Objective-C versions. We have some code cooking to address this, but the engineering hurdle to complete the effort seems high.
  • _DWriteGetFrame()/CTFramesetterCreateFrame()/CTFramesetterSuggestFrameSizeWithConstraints()

    • Latter two functions call _DWriteGetFrame()

    • Very expensive. We spend 20% of [NSString drawAtPoint:], more than 60% of [NSString sizeWithAttributes:], and more than 70% of [NSLayoutManager __layoutAllText] (depending on number of line segments) here.

    • About 50% of _DWriteGetFrame() time is spent in IDWriteTextLayout::Draw(), which creates a TextLayout by analyzing our input and laying out the text. We should follow up with the DWrite team as to whether we can make gains by structuring our input differently.

    • About 7~8% of the time in _DWriteGetFrame() is spent in CustomDWriteTextRenderer::DrawGlyphRun(), copying and translating DWRITE_GLYPH_RUN(_DESCRIPTION) structs into _DWriteGlyphRunDetails/Descriptions. About 5% of the time in _DWriteGetFrame() is spent in [NSObject new], translating these intermediary structs into CTRun and CTLine objects. We can probably skip at least one of these copies/translations.

    • Possibly related to CFRelease() issues previously mentioned, we spend about 10% of CTFramesetterCreateFrame() (though not the other two) in CGPathRetain()

  • CGContextDrawGlyphRun()

    • About 48% of the time in [NSString drawAtPoint:] is spent here

    (The percentages below are fractions of the time spent in CGContextDrawGlyphRun().)

    • When breaking down the stack, very little time is actually spent in ID2D1RenderTarget::DrawGlyphRun(). Heaviest hitters are ID2D1RenderTarget::EndDraw() (67%), ID2D1RenderTarget::GetRenderTarget() (16%), and ID2D1RenderTarget::BeginDraw() (14%).

    • Note that CGImage is substantially different on the CGD2D branch, having a more direct relationship with its underlying WicBitmap.

    • Our CGContextDrawGlyphRun() currently has BeginDraw() and EndDraw() within its body. These functions are meant to be called at the beginning and end of a batch of drawing operations, as drawing is faster when executed in a batch. However, we call CGContextDrawGlyphRun() within a tight loop, in the middle of a batch. Since we control the contract of CGContextDrawGlyphRun() completely, we can move the begin/end calls out to the true bounds of the batch.

      • Note that again, this is also already changed on the CGD2D branch.
    • A bit of a stretch, but currently our render-to bitmaps are fairly small. As recommended in https://msdn.microsoft.com/en-us/library/windows/desktop/dd372260(v=vs.85).aspx#atlasofbitmaps, we may be able to gain by using subsets of large bitmaps as small bitmaps, or by otherwise reusing bitmaps.

  • [NSLayoutManager __layoutAllText]

    • About 20% of the time in this function is also spent in CFRelease()

    • Calls CTFramesetterCreateFrame() per line segment, leading to a relatively high number of calls to a very expensive function. It seems like it ought to be possible to 'batch' line segments together in one CreateFrame() operation, if they align vertically without line segments in between. This is complicated by our current algorithm not being strictly correct (see comments in #1558, "Fix some CoreText perf low-hanging fruit"), and by line segments being influenced by metrics that can only be known after previous line segments were already drawn. Still, it seems possible to make gains here (i.e., in a basic case where there are no exclusion paths, we can definitely use just one CreateFrame(); it is difficult to extend these gains into marginally more nuanced cases).
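
The BeginDraw()/EndDraw() recommendation under CGContextDrawGlyphRun() can be sketched as follows. This is a hedged illustration, not the WinObjC code: `FakeRenderTarget`, `DrawBatch`, `DrawRunsUnbatched()` and `DrawRunsBatched()` are hypothetical names, and the fake target only counts calls so the example builds without Direct2D.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a render target; it only counts calls.
struct FakeRenderTarget {
    int beginCalls = 0;
    int endCalls = 0;
    int glyphRunsDrawn = 0;

    void BeginDraw() { ++beginCalls; }
    void EndDraw() { ++endCalls; }  // the expensive call in the profile above
    void DrawGlyphRun(int /*run*/) { ++glyphRunsDrawn; }
};

// Current shape: every glyph run pays for its own BeginDraw()/EndDraw().
void DrawRunsUnbatched(FakeRenderTarget& target, const std::vector<int>& runs) {
    for (int run : runs) {
        target.BeginDraw();
        target.DrawGlyphRun(run);
        target.EndDraw();
    }
}

// Scope guard that marks the true bounds of a batch of drawing operations.
class DrawBatch {
public:
    explicit DrawBatch(FakeRenderTarget& target) : target_(target) {
        target_.BeginDraw();
    }
    ~DrawBatch() { target_.EndDraw(); }
    DrawBatch(const DrawBatch&) = delete;
    DrawBatch& operator=(const DrawBatch&) = delete;

private:
    FakeRenderTarget& target_;
};

// Proposed shape: one BeginDraw()/EndDraw() pair around the whole loop.
void DrawRunsBatched(FakeRenderTarget& target, const std::vector<int>& runs) {
    DrawBatch batch(target);
    for (int run : runs) {
        target.DrawGlyphRun(run);
    }
}
```

For N glyph runs, the unbatched version makes N EndDraw() calls and the batched version makes one. In real Direct2D code, ID2D1RenderTarget::EndDraw() returns an HRESULT that should be checked, so an explicit end call at the batch boundary may be preferable to a destructor.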
