Proposal for How to Handle Internal Data

Davide P. Cervone edited this page Jul 14, 2016 · 3 revisions

The mechanism for storing internal data is going to need to change in version 3.0. This is partly due to the desire to make MathJax independent of the DOM, and partly due to the desire to give page authors more control over the way mathematics is processed. There are some consequences for exposing the internal storage, however, and some decisions that will need to be made.

The MathJax Script Element

Currently, MathJax stores its internal data in a mixture of DOM elements and JavaScript objects. The DOM elements are <script> elements whose content is the math to be typeset, and whose type determines the input jax associated with the math. So

<script type="math/tex">\frac{x}{x+1}</script>

represents the TeX expression $\frac{x}{x+1}$. The MathJax preprocessors (e.g., tex2jax or mml2jax) locate the math expressions in the HTML page, remove them, and insert the required <script> tags. The <script> tag was chosen for three reasons: its content is automatically CDATA (so things like $x < 1$ can be represented without special escaping), it has a type attribute that can be used to specify the input format, and its contents are not displayed by default.
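As a rough illustration of the preprocessing step, the sketch below replaces `\(...\)` delimiters in a string with math <script> tags. It is a simplification under stated assumptions: the function name `wrapInlineMath` is hypothetical, and it works on a plain string, whereas the actual preprocessors operate on the page's DOM.

```javascript
// Hypothetical string-level sketch of what a preprocessor such as tex2jax
// does conceptually. The real preprocessors work on the DOM, not on strings.
function wrapInlineMath(text) {
  // Match \( ... \) non-greedily, allowing line breaks inside the math.
  const inlineMath = /\\\(([\s\S]+?)\\\)/g;
  // A function replacement avoids special handling of "$" in the TeX source.
  return text.replace(inlineMath, (match, tex) =>
    '<script type="math/tex">' + tex + '</script>');
}
```

Note that the `<` in an expression such as `x < 1` is copied into the <script> content unchanged, which is the no-escaping property described above.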

The original idea was that content management systems that did server-side processing would insert the <script> tags themselves, rather than having the MathJax preprocessing step (and its associated flickering on the page as the math is removed before typesetting), while blogs and other user-generated content would use the math-delimiters-and-preprocessor approach. In practice, however, most sites use the delimiters-and-preprocessors because it is easier, and very few insert the <script> tags themselves, even those that do extensive processing on the server. That, together with the desire to separate MathJax from the DOM, and the fact that producing the final MathJax output on the server should be possible in version 3.0, means that this <script> tag is likely to be abandoned in MathJax 3.0.

The MathJax Data

When MathJax processes a math <script>, it attaches to the DOM node a new property that holds MathJax-specific data, in particular the pointer to the associated Element Jax that is produced by running the appropriate Input Jax on the math expression. There is also Output Jax information (like measured em- and ex-sizes, and container widths). That way, you can easily go from the <script> node to the internal format. The <script> tags have their IDs set by MathJax so that you can also go from the internal Element Jax to the associated <script> (and the MathJax output for that math element).

MathJax does not, in general, maintain pointers from the internal structure to the script element. This is because that would cause loops within the MathJax objects (the DOM nodes point into the JavaScript, and the JavaScript objects point back to the DOM nodes). Such loops can cause memory leaks, particularly in older versions of IE. This could happen even if the objects were not causing a direct loop, because a closure could (in those days) include unexpected values in its scope that did cause the loop. Simply having a pointer to a DOM node together with an event handler on the DOM node could produce such a loop.

So MathJax maintains no pointers to DOM elements in order to avoid any such loops; there are pointers from DOM elements into the JavaScript, but none in the other direction. Instead, any reference from the JavaScript to the DOM is made through element IDs. The internal JavaScript object includes the ID of the associated DOM element, and the DOM element is looked up by that ID when the node is needed. This is not ideal, but it seemed the most reasonable way to avoid the loops that were a source of memory leaks in IE at least through IE8. I believe that modern browsers (even IE) no longer have the memory leak associated with this looping, but I haven't tested that recently. If we intend to maintain compatibility with IE8, we may still have to worry about these loops. In any case, some testing probably should be done to verify that such loops are not going to be a problem.
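The one-directional linking described above can be sketched as follows. This is a minimal illustration, not MathJax's actual code: the function names `attachJax` and `sourceElement`, the ID prefix, and the shape of the jax record are assumptions made for the example. The document object is passed in as a parameter so that the sketch does not depend on a browser.

```javascript
// Hypothetical sketch: the jax record stores only the ID string of its
// source element, never the element itself, so there is no pointer from
// JavaScript back into the DOM and therefore no reference loop.
let nextId = 0;

function attachJax(script, math) {
  // Give the element an ID if it does not already have one.
  if (!script.id) script.id = 'Sketch-Element-' + (++nextId);
  const jax = { math: math, inputID: script.id }; // ID only, no node reference
  script.MathJax = { elementJax: jax };           // DOM -> JavaScript pointer
  return jax;
}

// Going from the jax back to its element is a lookup by ID.
function sourceElement(jax, doc) {
  return doc.getElementById(jax.inputID);
}
```

In a browser, `sourceElement(jax, document)` would return the <script> element, or `null` if that element has since been removed from the page.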

MathJax and a Changing DOM

Another reason that the Element Jax is stored on the <script> tag within the DOM itself, rather than in some internal array or other object, is that it allows page authors to have dynamic pages without having to worry about the MathJax internal data for the math in the parts of the page they are removing. That data is freed automatically when the DOM nodes it is attached to are freed (since nothing else points to it).

Had the Element Jax been stored in an array or other object, then freeing the part of the DOM that they refer to would not free the associated Element Jax, and over time, MathJax would accumulate outdated Element Jax. For applications like StackExchange's answer preview (which updates the complete answer DOM on every keystroke), this could mean that the internal data builds up quickly and without bound.

To have the data stored outside the DOM, the page author would have to ask MathJax to free the associated Element Jax before removing any piece of the DOM, and that is an extra bit of memory management that page authors are not used to doing. Indeed, it is hard enough to get them to call MathJax's Typeset() method when they add math to the page; cleaning up before they remove content from the page is just not something that is on their radar.

Implications for Version 3.0

The proposal for refactoring tex2jax and the other preprocessors does not include the insertion of math <script> tags into the DOM, as is now done by MathJax. This means that there is no longer any place to attach the Element Jax as described above. Instead, the proposal suggests that there be arrays of what are essentially today's Element Jax, and that these contain pointers to DOM elements (when they come from an HTML page rather than some other source). That means that if the DOM is changed by the page author, these arrays may need to be updated.

Question: whose responsibility is it to update these arrays? Should it be the page author's duty to keep the connection between their DOM and the math data for the page? Or should MathJax use something like Mutation Observers to identify when math is removed, and update the arrays itself (like arrays of HTML elements returned by getElementsByName() and other functions do)?

One solution would be to provide two layers of control. The lowest level requires the page author to manage the arrays. Since the page author knows when content is being added or removed, they are in the best position to update the arrays. Of course, we could provide some support routines to make that easier (e.g., a routine that takes a node whose contents are being removed and one of the element jax arrays, and removes the jax for the math within that node).
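A support routine of the kind just described might look like the sketch below. The names `removeJaxWithin` and `isInside`, and the assumption that each array entry keeps its DOM element in a `node` property, are hypothetical; the proposal does not fix any of these details.

```javascript
// Returns true if `node` is `container` itself or one of its descendants.
// It follows parentNode links, so it works for real DOM nodes and for the
// plain objects used when testing outside a browser.
function isInside(container, node) {
  for (let n = node; n; n = n.parentNode) {
    if (n === container) return true;
  }
  return false;
}

// Removes, in place, every entry of `jaxList` whose `node` lies within
// `container`, and returns the removed entries in their original order.
function removeJaxWithin(container, jaxList) {
  const removed = [];
  // Iterate backwards so that splicing does not skip entries.
  for (let i = jaxList.length - 1; i >= 0; i--) {
    if (isInside(container, jaxList[i].node)) {
      removed.unshift(jaxList.splice(i, 1)[0]);
    }
  }
  return removed;
}
```

A page author would call `removeJaxWithin(section, jaxList)` just before removing `section` from the page, so that the array never refers to math that is no longer in the document.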

The second layer would implement an automated process much like the one used today. It would insert markers into the DOM and attach the Element Jax to them, so the array of jax could be discarded. Alternatively, Mutation Observers could be used to track the DOM nodes, and the array could be updated automatically.
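The Mutation Observer alternative could be sketched roughly as below. This is only one possible arrangement: the names `pruneDetached` and `watchForRemovals`, and the `node` property on each array entry, are assumptions for the example. It uses the standard `MutationObserver` API and the `isConnected` property of DOM nodes, and instead of examining each removed subtree it simply drops every entry whose element is no longer in the document.

```javascript
// Drops, in place, every entry whose DOM node is no longer connected to
// the document, and returns the same array.
function pruneDetached(jaxList) {
  let kept = 0;
  for (const jax of jaxList) {
    if (jax.node.isConnected) jaxList[kept++] = jax;
  }
  jaxList.length = kept;
  return jaxList;
}

// Browser-only wiring: re-check the array whenever nodes are removed
// anywhere beneath `root`. Returns the observer so the caller can
// disconnect() it later.
function watchForRemovals(root, jaxList) {
  const observer = new MutationObserver((records) => {
    if (records.some((record) => record.removedNodes.length > 0)) {
      pruneDetached(jaxList);
    }
  });
  observer.observe(root, { childList: true, subtree: true });
  return observer;
}
```

Checking `isConnected` after the fact has the side benefit that a node which is removed and then re-inserted in the same batch of changes keeps its entry.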