
segfault in native code while trying to use CustomOp #11926

Closed
mdespriee opened this issue Jul 29, 2018 · 19 comments

@mdespriee
Contributor

mdespriee commented Jul 29, 2018

I'm trying to use CustomOp to create a constant, as suggested in #8428.
As soon as I define my CustomOp as having no inputs, it fails with a segfault, and I can't find a workaround.

Environment

Code

class ConstantOp(value: NDArray) extends CustomOp {

  def forward(isTrain: Boolean, req: Array[String], inData: Array[NDArray], outData: Array[NDArray], aux: Array[NDArray]): Unit = {
    val data = value.copyTo(outData(0).context)
    this.assign(outData(0), req(0), data)
    data.dispose()
  }

  def backward(req: Array[String], outGrad: Array[NDArray], inData: Array[NDArray], outData: Array[NDArray], inGrad: Array[NDArray], aux: Array[NDArray]): Unit = {
    throw new Exception(s"Backward not supported by Constant")
  }
}

class ConstantOpProp(needTopGrad: Boolean = false) extends CustomOpProp(needTopGrad) {

  override def listArguments(): Array[String] = Array()

  override def listOutputs(): Array[String] = Array("output")

  override def inferShape(inShape: Array[Shape]): (Array[Shape], Array[Shape], Array[Shape]) = {
    val data = NDArray.deserialize(this.kwargs("value").toCharArray.map(_.toByte))
    (Array(), Array(data.shape), null)
  }

  override def inferType(inType: Array[DType]): (Array[DType], Array[DType], Array[DType]) = {
    val data = NDArray.deserialize(this.kwargs("value").toCharArray.map(_.toByte))
    (Array(), Array(data.dtype), null)
  }

  override def createOperator(ctx: String, inShapes: Array[Array[Int]],
                              inDtypes: Array[Int]): CustomOp = {
    // hacky stuff to work around the declaration using String
    val data = NDArray.deserialize(this.kwargs("value").toCharArray.map(_.toByte))
    new ConstantOp(data)
  }
}


object TestConst {
  Operator.register("constant", new ConstantOpProp())

  val value = NDArray.array(Array(1f), Shape(1))
  val const = Symbol.Custom("constant")()(
    kwargs = Map(
      "op_type" -> "constant",
      // hacky thing to work around the fact that CustomOpProp uses Map[String, String] internally for kwargs
      "value" -> String.copyValueOf(value.serialize().map(_.toChar))
    ))

  val a = Symbol.Variable("a")
  val symbol = a + const
  val e = symbol.bind(Context.defaultCtx, Map(
    "a" -> NDArray.array(Array(10f), Shape(1)))
  )

  e.forward()

  println("outputs=" + e.outputs.mkString(", "))
}

Error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f37ec25e7a9, pid=14172, tid=0x00007f379cbf7700
#
# JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6797a9]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /app/hs_err_pid14172.log

In the log:

[...]

Stack: [0x00007f6b28ad3000,0x00007f6b28bd4000],  sp=0x00007f6b28bcecd0,  free space=1007k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x6797a9]
C  [mxnet-scala+0x455dd4]  Java_org_apache_mxnet_LibInfo_mxCustomOpRegister::{lambda(char const*, int, char const**, char const**, MXCallbackList*)#1}::operator()(char const*, int, char const**, char const**, MXCallbackList*) const::{lambda(int, int*, unsigned int**, void*)#5}::_FUN(int, {lambda(char const*, int, char const**, char const**, MXCallbackList*)#1}, unsigned int*, unsigned int**)+0x454
C  [mxnet-scala+0x6acf4c]  mxnet::op::custom::InferShape(nnvm::NodeAttrs const&, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*, std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >*)+0x2dc

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  org.apache.mxnet.LibInfo.mxExecutorBindEX(JIII[Ljava/lang/String;[I[II[J[J[I[JJLorg/apache/mxnet/Base$RefLong;)I+0
j  org.apache.mxnet.Symbol.bindHelper(Lorg/apache/mxnet/Context;Lscala/collection/Seq;Lscala/collection/Iterable;Lscala/collection/Iterable;Lscala/collection/Iterable;Lscala/collection/Iterable;Lscala/collection/immutable/Map;Lorg/apache/mxnet/Executor;)Lorg/apache/mxnet/Executor;+767
j  org.apache.mxnet.Symbol.bind(Lorg/apache/mxnet/Context;Lscala/collection/immutable/Map;)Lorg/apache/mxnet/Executor;+38

[...]

Side-note

As you can see in the code, I'm obliged to hack NDArrays into strings to transmit the data. That's because the CustomOp implementation declares kwargs as Map[String, String], whereas Symbol.Custom accepts Map[String, Any]. This leads to very strange situations where, at runtime, we actually have non-String objects behind Java String references, yet they can't be cast because of the type system. Weird.
A change to the declaration in CustomOp would be welcome.
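
For reference, the String round-trip used above can be factored into a small sketch (the NDArrayKwarg object and its method names are mine, purely illustrative): serialized bytes are widened to chars to build the kwargs String, then narrowed back to bytes before deserializing.

import org.apache.mxnet.NDArray

// Illustrative helper (not part of the MXNet API): shuttle an NDArray
// through a String so it fits CustomOpProp's Map[String, String] kwargs.
object NDArrayKwarg {
  def encode(value: NDArray): String =
    String.copyValueOf(value.serialize().map(_.toChar))

  def decode(encoded: String): NDArray =
    NDArray.deserialize(encoded.toCharArray.map(_.toByte))
}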

@Roshrini
Member

@nswamy Could you please add labels: Bug, Scala, Operator?

@nswamy
Member

nswamy commented Jul 31, 2018

@andrewfayres?

@andrewfayres
Contributor

I'll take a look since I happen to be working on custom operator stuff already.

@andrewfayres
Contributor

Alright, I did some investigating. First, a workaround: changing the return values in your inferShape and inferType to the following seems to get around the problem:
(Array(data.shape), Array(data.shape), null)
(Array(data.dtype), Array(data.dtype), null)

The middle value could likely also be null, but I didn't test that explicitly. Now for a more in-depth explanation of what's happening. The segfault occurs in the inferShape JNI code and would also occur in the inferType JNI code. The issue is that what the JNI code calls numInputs is actually defined as the sum of the inputs, outputs, and aux. So in this particular case numInputs is 1, we go into the for loop, and try to do an env->GetObjectArrayElement(ret, 0), which fails because the first array returned by inferShape is empty.

Hopefully this unblocks you. I'll give some thought to how this should be fixed in the JNI code.
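
Concretely, applied to the ConstantOpProp from the original report, the workaround reads as follows (a sketch based on the explanation above: the first tuple element, treated as the input shapes/types, must not be empty because the JNI loop iterates over inputs + outputs + aux):

override def inferShape(inShape: Array[Shape]): (Array[Shape], Array[Shape], Array[Shape]) = {
  val data = NDArray.deserialize(this.kwargs("value").toCharArray.map(_.toByte))
  // pad the "input" slot with the output shape so GetObjectArrayElement finds an element
  (Array(data.shape), Array(data.shape), null)
}

override def inferType(inType: Array[DType]): (Array[DType], Array[DType], Array[DType]) = {
  val data = NDArray.deserialize(this.kwargs("value").toCharArray.map(_.toByte))
  // same padding for dtypes
  (Array(data.dtype), Array(data.dtype), null)
}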

@mdespriee
Contributor Author

@andrewfayres This workaround is working, thank you.

@andrewfayres
Contributor

andrewfayres commented Aug 1, 2018

Happy to help. I've added a task to fix the JNI code to our backlog (https://issues.apache.org/jira/browse/MXNET-773).

@nswamy You can close this issue. The user is unblocked and we're tracking the bug in JIRA.

@nswamy
Member

nswamy commented Aug 2, 2018

@mdespriee Thanks for raising this issue, and glad to hear the workaround worked. I'm curious to know about your use case.

@mdespriee
Contributor Author

@nswamy Sure.
We have a fairly big predictive pipeline working on CRM data (we're in the marketing intelligence business). It's mainly based on an unsupervised phase (clustering, collaborative filtering...) followed by a logistic regression. This pipeline has been built over years, using quite low-level operations (linear algebra, in Python).
We'd like to evolve our technologies and approach.
Not surprisingly, the supervised part is a DAG of operations with some similarity to a DAG of Symbols. We would also like to experiment in the near future with integrating an LSTM to process some features.
And I'm interested in building bridges with the Scala/JVM world.

MXNet looks like a good candidate for the task, doesn't it?
So that's the reason for my experiments. The use of constants is one of our requirements (we have many of them throughout our current DAG), and we're also heavy users of sparse computations, so I'm very interested in your roadmap on that topic.

@nswamy
Member

nswamy commented Aug 2, 2018

@mdespriee Thanks for explaining your use case; it looks very interesting, and yes, MXNet is perfect for such a use case.
With regards to your needs:
We have a basic version of RNN (LSTM) support in the MXNet-Scala RNN examples.

We are aware of a few issues with this and are working to resolve them (#11753). We will further enhance this to bring parity with the MXNet Python RNN APIs.
Sparse support is also in the MXNet-Scala backlog.
The MXNet-Scala backlog is maintained in JIRA and on the cwiki here; we are happy to discuss your use case and prioritize our backlog to suit user needs.

MXNet Python provides a very good set of APIs and RNN building blocks.

Please reach out to the community on dev@mxnet.apache.org or on https://the-asf.slack.com (#mxnet / #mxnet-scala channels).

@mdespriee
Contributor Author

mdespriee commented Aug 4, 2018

@andrewfayres Unfortunately, I stumbled on another problem when trying to instantiate two constants in a flow. Either there's something I don't get about how CustomOp should be used, or there are some subtle bugs in the code. Or both.

Here we go. I create a symbol adding a variable and two instances of the constant.

  val a = Symbol.Variable("a")
  val symbol = a + const1 + const2      // const1 = 10 and const2 = -10

I executed a forward() with the variable a bound to 1. I tracked which instances, kwargs, and values were used, and got this:

inferShape called with kwargs(id)=2    
inferShape called with kwargs(id)=2   <<-  never called for id 1
CreateOperator called with kwargs(id)=2
Building ConstantOp instance CustomOpTest$ConstantOp@27c14491 with value [-10.0]
CreateOperator called with kwargs(id)=2     <<- idem
Building ConstantOp instance CustomOpTest$ConstantOp@2515240d with value [-10.0]     <<-- one of the 2 should be 10    (probably related to kwargs mixup)
Forward called on instance CustomOpTest$ConstantOp@27c14491, having value [-10.0]
Forward called on instance CustomOpTest$ConstantOp@27c14491, having value [-10.0]      <<-- only one of the 2 instances is used (so instance variable)
Result= -19.0 (should be 1)

So, two problems: the values in kwargs got mixed up at instantiation, and only one ConstantOp instance (and value) is used in forward().

Here is the full code:

import org.apache.mxnet.DType.DType
import org.apache.mxnet._

object CustomOpTest extends App {

  class ConstantOpProp(needTopGrad: Boolean = false) extends CustomOpProp(needTopGrad) {

    override def listArguments(): Array[String] = Array()

    override def listOutputs(): Array[String] = Array("output")

    override def inferShape(inShape: Array[Shape]): (Array[Shape], Array[Shape], Array[Shape]) = {
      println("inferShape called with kwargs(id)=" + this.kwargs("id"))
      val data = NDArray.deserialize(this.kwargs("value").toCharArray.map(_.toByte))
      (Array(data.shape), Array(data.shape), null)
    }

    override def inferType(inType: Array[DType]): (Array[DType], Array[DType], Array[DType]) = {
      val data = NDArray.deserialize(this.kwargs("value").toCharArray.map(_.toByte))
      (Array(data.dtype), Array(data.dtype), null)
    }

    override def createOperator(ctx: String, inShapes: Array[Array[Int]], inDtypes: Array[Int]): CustomOp = {
      println("CreateOperator called with kwargs(id)=" + this.kwargs("id"))
      val data = NDArray.deserialize(this.kwargs("value").toCharArray.map(_.toByte))
      new ConstantOp(data)
    }
  }

  class ConstantOp(value: NDArray) extends CustomOp {
    println(s"Building ConstantOp instance $this with value ${value.toArray.mkString("[", ",", "]")}")

    def forward(isTrain: Boolean, req: Array[String], inData: Array[NDArray], outData: Array[NDArray], aux: Array[NDArray]): Unit = {
      println(s"Forward called on instance $this, having value ${value.toArray.mkString("[", ",", "]")}")
      val data = value.copyTo(outData(0).context)
      this.assign(outData(0), req(0), data)
      data.dispose()
    }

    def backward(req: Array[String], outGrad: Array[NDArray], inData: Array[NDArray], outData: Array[NDArray], inGrad: Array[NDArray], aux: Array[NDArray]): Unit =
      throw new Exception(s"Not implemented")

  }

  Operator.register("constant", new ConstantOpProp())

  val value1 = NDArray.array(Array(10f), Shape(1))
  val value2 = NDArray.array(Array(-10f), Shape(1))

  val const1 = Symbol.Custom("constant-1")()(
    kwargs = Map("id" -> 1, "op_type" -> "constant", "value" -> String.copyValueOf(value1.serialize().map(_.toChar)),
      "aaa" -> "aaa"
    )
  )
  val const2 = Symbol.Custom("constant-2")()(
    kwargs = Map("id" -> 2, "op_type" -> "constant", "value" -> String.copyValueOf(value2.serialize().map(_.toChar)),
      "bbb" -> "bbb"
    )
  )

  val a = Symbol.Variable("a")
  val symbol = a + const1 + const2

  val exec = symbol.bind(Context.defaultCtx, Map(
    "a" -> NDArray.array(Array(1f), Shape(1)))
  )

  exec.forward()
  println(s"Result= ${exec.outputs.head.toScalar} (should be 1)")
}

Thanks for your help

EDIT: added a bit of logging for kwargs as well, as it may help.

@mdespriee
Contributor Author

@nswamy @andrewfayres Should I open another ticket for this second problem? A JIRA?

@andrewfayres
Contributor

I'll take a look, and depending on the complexity, I might open another ticket to track this.

@apeforest
Contributor

@andrewfayres Could you please link your ticket to this issue? What is the current status? Thanks!

@andrewfayres
Contributor

Current status: I looked into this some last week, but I couldn't find any obvious reason why it was happening. I need to make a JIRA ticket for the investigation and resolution. I'll try to get one made this afternoon and link it here.

@apeforest
Contributor

apeforest commented Sep 24, 2018

@nswamy @sandeep-krishnamurthy This issue is not related to the operator implementation. Please remove the [Operator] label. We have created a JIRA ticket, MXNET-933, to implement a native constant operator in the MXNet backend.

@lanking520
Member

@mdespriee Here to help. Are you still facing the problem?

@lanking520
Member

After some tests, your problem can be solved like this:

  Operator.register("constant", new ConstantOpProp())
  Operator.register("constant2", new ConstantOpProp())

  val value1 = NDArray.array(Array(-10f), Shape(1))
  val value2 = NDArray.array(Array(10f), Shape(1))

  val const1 = Symbol.Custom("constant-1")()(
    Map("id" -> 1, "op_type" -> "constant", "value" -> String.copyValueOf(value1.serialize().map(_.toChar)),
      "aaa" -> "aaa"
    )
  )
  val const2 = Symbol.Custom("constant-2")()(
    Map("id" -> 2, "op_type" -> "constant2", "value" -> String.copyValueOf(value2.serialize().map(_.toChar)),
      "bbb" -> "bbb"
    )
  )

This actually gives you the result you need. The catch is that a registered custom operator is a single shared object with shared state, and that state is what the symbolic graph uses. When two ops are defined against the same registered op_type, they share the same kwargs, because the engine treats them as the same operator. And since a symbolic graph is only analyzed at the end, to work out the best way to execute it, the second definition's kwargs overwrite the first's. That causes the overwrite problem you saw. The same problem is not reproducible with sequential (imperative) NDArray execution.
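
In other words, each constant needs its own registered op_type so that it gets its own ConstantOpProp instance, and therefore its own kwargs. A sketch of a helper automating that per-constant registration (the Constants object and its constant method are mine, purely illustrative):

import org.apache.mxnet.{NDArray, Operator, Symbol}

// Illustrative helper (names are mine): register a distinct op_type per
// constant so each ConstantOpProp keeps its own kwargs.
object Constants {
  private var counter = 0

  def constant(value: NDArray): Symbol = {
    counter += 1
    val opType = s"constant$counter"
    Operator.register(opType, new ConstantOpProp())
    Symbol.Custom(s"constant-$counter")()(kwargs = Map(
      "op_type" -> opType,
      "value" -> String.copyValueOf(value.serialize().map(_.toChar))
    ))
  }
}

With such a helper, const1 and const2 each resolve to their own value instead of sharing the last-registered kwargs.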

@andrewfayres
Contributor

I accidentally added the wrong issue ID to the PR above. It doesn't have anything to do with this issue.

@lanking520
Member

@mdespriee Closing this for now and linking a new issue to it, since the problem seems to be solved. Please feel free to reopen if the problem persists or you would like to follow up. Thanks.
