
[feat] Add support for nonlinear operations #27

Merged · 24 commits · Nov 23, 2024

Conversation

HobbitQia (Collaborator)

I'm so excited to share my updates on the mapper with you. The main changes are listed below:

  1. In DFG.cpp, I added a function called nonlinear_combine(), which will fuse the common patterns occurring in nonlinear operations.
  2. For special functions (e.g., LUT, FP2FX), I recognize them in the DFG through the names of function calls, then demangle those names. Take LUT as an example: in the C++ kernel code, we define LUT as a function like:
    __attribute__((noinline)) DATA_TYPE lut(DATA_TYPE x) { ... }
    Our mapper then traverses all function calls and finds the special ones by name, e.g. demangle(newName) == "lut(float)". We determine whether a DFG node contains a special function through DFGNode::getOpcodeName(), comparing the operation name with the names of the predefined special functions. Details can be seen in DFG.cpp:408-432 and DFGNode::isLut(); a sketch follows at the end of this list. For CGRA mapping I also added special functions so that we can choose whether or not to equip a tile with LUT/FP2FX. We can specify the CGRA's nodes through additionalFunc in param.json. Here's an example:
    "additionalFunc"        : {    "lut" : [1, 2, ...], ...    }
  3. For vectorization, I leverage the original mapper to mark vectorized operations through DFGNode::isVectorized. However, we cannot vectorize division, since it is hard to support an efficient vector divider from the hardware perspective. Thus I chose to split a division node into multiple nodes and reconnect the predecessors and successors in DFG::tuneDivPattern(). Notably, different vectorization factors lead to different numbers of nodes (e.g., if VF=4 we should split a division into 4 nodes), so I added a parameter to param.json so that we can specify the VF (default 1).
  4. Support for fine-grained fusion. When combining different patterns into a single node, I added a user-specified parameter to determine the "class" of the patterns (a "class" can have multiple fused patterns). Different tiles can support different "classes" of fused patterns, which are also specified in param.json. Here's an example:
    "additionalFunc"        : {     "complex1" : [4,5,6,7],    "complex2" : [8,9,10,11],    "complex3" : [0,1,2,3], ...   }
    In this configuration there are three classes of fused patterns, and each class is supported by a set of tiles.
    Above are my main changes, along with some bug fixes (sorry that I can't remember everything...).
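
To make point 2 concrete, here is a minimal sketch of the recognition step (the actual code lives in DFG.cpp:408-432; the helper name isLutCall is illustrative, not the mapper's real API):

    #include "llvm/Demangle/Demangle.h"
    #include "llvm/IR/Instructions.h"

    using namespace llvm;

    // Sketch: decide whether an instruction is a call to the user-defined
    // lut() special function (cf. DFGNode::isLut()).
    bool isLutCall(Instruction* inst) {
      CallInst* call = dyn_cast<CallInst>(inst);
      if (!call || !call->getCalledFunction())
        return false;
      // The callee symbol is mangled (e.g. "_Z3lutf" for lut(float)),
      // so demangle it before comparing with the predefined name.
      std::string mangled = call->getCalledFunction()->getName().str();
      return demangle(mangled) == "lut(float)";
    }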

@tancheng (Owner) left a comment

Really appreciate the PR! Can you also provide a param.json that enables the functionality introduced by your PR as an example, and include it in the action (i.e., the GitHub testing automation)?

src/CGRA.cpp (outdated diff):
}
// for (int r=0; r<t_rows; ++r) {
// for (int c=0; c<t_columns; ++c) {
// nodes[r][c]->enableCall();
Owner

This change is a bug fix, right? So from now on, call can only be supported if the user specifies it in param.json, or the user needs to modify this CGRA.cpp file?

And is this call actually how the lut is recognized? I.e., instead of supporting call generically, the user provides the lut function that is actually called in the IR.

Collaborator Author

Yes, call can only be supported if the user specifies it. For lut, it should be written as call-lut in param.json after I refactored the code.

Comment on lines +171 to +176
// for (int r=0; r<t_rows; ++r) {
// for (int c=0; c<t_columns; ++c) {
// // if(c == 0 || (r%2==1 and c%2 == 1))
// nodes[r][c]->enableComplex();
// }
// }
Owner

Does this mean heterogeneity in the param.json won't take effect anymore?

Collaborator Author

No. Here I mean complex operations should be manually specified in param.json rather than configured by default.

Resolved review threads (outdated): src/CGRANode.cpp, src/DFGNode.cpp, src/DFGNode.h.
@tancheng (Owner)

Plz also resolve the conflict. Thanks!

@tancheng (Owner)

Thanks @HobbitQia, plz also put a response for each of my comments and tag them as fixed/solved (if there is such a tag). Thanks a lot!

@tancheng (Owner)

Can we include at least one .cpp that leverages your nonlinear_param.json for testing the new features?

@HobbitQia (Collaborator, Author) commented Nov 13, 2024

I included a nonlinear_test.cpp to test the new features. Later I will explain the structure of param.json and show some examples.

For previous comments that I have resolved (e.g. issues about comments, code that should be deleted), I marked them as resolved. For other comments that I think we should discuss, I responded to them and didn't mark them.

@tancheng (Owner)

Thanks a lot Jiajun~! Let me know when you wanna set up a meeting for discussion~

@HobbitQia (Collaborator, Author)

Glad to share my improvements in detail.

  1. param.json

    The main change to param.json is the parameter additionalFunc. If we want to enable a special function call in the CGRA, we can write call-<function name>: [tile numbers] in additionalFunc. The corresponding tiles will then be able to execute this function. Similarly, for the complex operations (i.e. the fused operations like phi-add-add), we can write complex-<function name>: [tile numbers] in additionalFunc. The corresponding tiles will be able to perform this complex operation.

    For compatibility with previous code, we can also write complex: [tile numbers] (i.e. no specific function name). Then all complex operations like phi-add, mul-add, ... are regarded as the same kind, which I call general fusion, as opposed to fine-grained fusion.

    Take test/nonlinear_param.json as an example. In the code below, there is a special function call fp2fx enabled on tiles 4, 8, 7, 11, and two classes of complex operations: BrT enabled on tiles 4, 5, 6, 7 and CoT enabled on tiles 8, 9, 10, 11.

    "additionalFunc"        : {
                                "call-fp2fx" : [4,8,7,11],
                                "load" : [0,1,2,3],
                                "store": [0,1,2,3],
                                "complex-BrT" : [4,5,6,7],
                                "complex-CoT" : [8,9,10,11],
                                "div" : [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
                              }

    It's worth noting that param.json only configures the tiles; the kernel code is not affected by this file. So whether a tile can support a special function call or complex operation is also determined by the fusion process during DFG manipulation in the mapper, which I will illustrate in the next section.

  2. Fusion in the mapper

    Currently, I still choose to fuse operations manually, which means we need to change the code of DFG.cpp and add the fusion patterns in C++. When we do fusion, we need to pass a name for the new combined pattern, and the name should be consistent with param.json. Take nonlinear_combine() in DFG.cpp:42-53 as an example. In the code below, there are 7 fused patterns and I classify them into 2 categories, BrT and CoT, which are supported by tiles 4,5,6,7 and 8,9,10,11 respectively, as configured in param.json (see the sketch after this list).

    combineMulAdd("CoT");
    combinePhiAdd("BrT");
    combine("fcmp", "select", "BrT");
    combine("icmp", "select", "BrT");
    combine("icmp", "br", "CoT");
    combine("fcmp", "br", "CoT");
    combineAddAdd("BrT");

    Similarly, for compatibility, when calling combine() we can pass the empty string as the type parameter, which means the pattern is combined in the general fusion.

  3. The special function call

    The special function call is a little different from the complex operation: its name is determined by the function name in the kernel code. Take fp2fx as an example. The code below will be recognized as a special function call, and the mapper will obtain its function name through demangling. Then, to support this call, there must be a call-fp2fx entry in param.json.

    __attribute__((noinline)) float fp2fx(float x) {
        return x + 1.0;    
    }
    ...
    float x = fp2fx(1.0);
    ...
  4. Example of tuning the division pattern.

    The left is a snapshot of the original DFG and the right is the new one under a vectorization factor of 4. We can see that the division is split into 4 nodes.

  5. Example of nonlinear_test.cpp

    The DFG is shown below, and we can see the fp2fx and faddmuladd (i.e. CoT) nodes.
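
To make the fine-grained fusion above concrete, here is a self-contained toy sketch of the idea behind combine(opA, opB, type); the ToyNode type and its fields are illustrative stand-ins, and the real implementation additionally rewires the DFG edges:

    #include <string>
    #include <vector>

    // Toy model: each fused pattern carries a class name ("BrT"/"CoT")
    // that param.json then maps to a set of tiles.
    struct ToyNode {
      std::string opcode;            // e.g. "fcmp", "select"
      std::vector<ToyNode*> succs;   // data-flow successors
      bool fused = false;
      std::string fusedType;         // pattern class; "" = general fusion
    };

    // Fuse every opA -> opB producer/consumer pair and tag it with the class.
    void combine(std::vector<ToyNode*>& dfg, const std::string& opA,
                 const std::string& opB, const std::string& type) {
      for (ToyNode* n : dfg) {
        if (n->opcode != opA || n->fused) continue;
        for (ToyNode* s : n->succs) {
          if (s->opcode == opB && !s->fused) {
            n->opcode = opA + "-" + opB;  // merged node keeps a pattern name
            n->fused = s->fused = true;
            n->fusedType = type;
            break;
          }
        }
      }
    }

Under this model, combine(dfg, "fcmp", "select", "BrT") mirrors the third call in the listing above.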

@HobbitQia (Collaborator, Author)

One more point:
I want to discuss whether we should provide an interface for users to specify the fused patterns, so that they don't need to change the code of the mapper. I remember there are similar functions in the mlir-cgra project, and I am not sure whether it's necessary to have a similar interface in the mapper.

@tancheng (Owner)

when calling combine() we can pass the empty string as the type parameter, which means the pattern is combined in the general fusion.

You mean empty string for type, right? like combine("fcmp", "select", ""); And we currently don't have such use-case, right?

  • Can you please help to include the vectorFactorForIdiv into nonlinear_test.cpp as nonlinear_div_test.cpp, so all three features are tested.
  • And include this test into the testing flow:

I really appreciate the contribution!

User interface for custom pattern.

Sure, but this could be our future work in another PR when you have bandwidth.

@cwz920716 Just FYI :-) Jiajun is one of the best students we have worked with :-)

@HobbitQia (Collaborator, Author)

You mean empty string for type, right? like combine("fcmp", "select", ""); And we currently don't have such use-case, right?

Yes, and actually there are many such cases in DFG.cpp written in the past (since the default value of type is the empty string).

I tried to test vectorFactorForIdiv in nonlinear_test.cpp; however, I found it hard to test all three features in a single file. To test function calls, we need to call noinline special functions, which are regarded as non-vectorizable, so LLVM's auto-vectorization pass cannot vectorize the whole loop. I didn't find a method to solve it. Do you have any ideas?
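
For illustration, a minimal kernel shape showing the conflict (hypothetical, not the actual test file): the noinline call in the first loop blocks LLVM's loop auto-vectorizer, while the plain division loop vectorizes fine on its own:

    #define DATA_TYPE float

    // The noinline special function is opaque to the vectorizer...
    __attribute__((noinline)) DATA_TYPE lut(DATA_TYPE x) { return x * 0.5f; }

    void kernel(DATA_TYPE* a, DATA_TYPE* b, int* c, int n) {
      // ...so this loop stays scalar: the call cannot be widened.
      for (int i = 0; i < n; ++i)
        a[i] = lut(b[i]);
      // Without a call, a division loop like this can be auto-vectorized
      // (and the mapper then splits the vector division in tuneDivPattern()).
      for (int i = 0; i < n; ++i)
        c[i] = c[i] / 3;
    }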

@tancheng (Owner)

Then let's have nonlinear_test.cc and div_test.cc?


- name: Test Idiv Feature
  working-directory: ${{github.workspace}}/test
  run: clang-12 -emit-llvm -O3 -fno-unroll-loops -fno-vectorize -o idiv_test.bc -c idiv_test.cpp && opt-12 -load ../build/src/libmapperPass.so -mapperPass idiv_test.bc
working-directory: ${{github.workspace}}/test/nonlinear_test
Owner

idiv_test

Collaborator Author

ok

@HobbitQia (Collaborator, Author)

A little strange...let me check it carefully

@tancheng (Owner)

No worry~ Thanks!

@tancheng (Owner)

Based on the error msg, it seems it failed at isVectorized().

@HobbitQia HobbitQia closed this Nov 17, 2024
@HobbitQia HobbitQia reopened this Nov 17, 2024
@HobbitQia (Collaborator, Author) commented Nov 17, 2024

I found something strange... We used raw_string_ostream() in DFGNode::isVectorized() and raw_fd_ostream() in DFG::generateDot. I deleted them and the workflow runs correctly. However, I don't know why they crash our program in CI, while everything is ok when I use these functions in my local environment...

@tancheng (Owner)

Seems some library missing: jupyter-xeus/xeus-cling#234 (comment)?

Or you can try to use some C++ standard string write/read/stream functions to replace the LLVM raw_xx_ostream()?

@HobbitQia (Collaborator, Author)

It's hard to replace raw_xx_ostream() right now; it would take me more time to study the LLVM source code to learn how to manipulate instructions as strings for printing.

For now, I came up with a temporary method to remove raw_xx_ostream().

  • For DFGNode::isVectorized, I found another efficient method to decide whether an instruction is vectorized or not. The code snippet is shown below:
    Value* psVal = cast<Value>(m_inst);
    return psVal->getType()->isVectorTy();
  • For DFG::generateDot, I used std::ofstream to replace raw_fd_ostream so that we can output the DFG information into a .dot file. However, this method only handles t_isTrimmedDemo=true, since under that setting we only need to print the operation name of an instruction rather than the whole instruction. When t_isTrimmedDemo=false, this approach does not work without support for printing the whole instruction content to std::ofstream. A sketch follows below.
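
A minimal sketch of the std::ofstream approach for the trimmed case; the Node/Edge stand-ins are illustrative, not the mapper's real types:

    #include <fstream>
    #include <string>
    #include <vector>

    struct Node { int id; std::string opcodeName; };
    struct Edge { int src, dst; };

    // Trimmed-demo dot output: only opcode names are printed, so a plain
    // std::ofstream suffices and no LLVM raw_fd_ostream is needed. Printing
    // a full Instruction would still require an LLVM stream.
    void generateDotTrimmed(const std::vector<Node>& nodes,
                            const std::vector<Edge>& edges,
                            const std::string& path) {
      std::ofstream out(path);
      out << "digraph G {\n";
      for (const Node& n : nodes)
        out << "  n" << n.id << " [label=\"" << n.opcodeName << "\"];\n";
      for (const Edge& e : edges)
        out << "  n" << e.src << " -> n" << e.dst << ";\n";
      out << "}\n";
    }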

@tancheng WDYT? If the current solution is acceptable, I will push my changes to the current branch.

@tancheng (Owner)

return psVal->getType()->isVectorTy(); looks great, and it is the formal/correct way in the LLVM infra/world.

@HobbitQia (Collaborator, Author)

There is still something buggy... I printed the generated dfg.json via shell commands, and it seems the DFG is wrong and totally different from the one generated in my local environment. I guess there may be some difference between my environment and the workflow, which may also suggest that our code lacks portability due to a memory leak or other problems?

@tancheng (Owner)

Thanks Jiajun. I saw the log and the .cpp is correctly mapped, but as you mentioned the .json is messed up. You are free to keep pushing commits to test/debug the GitHub Actions (via printing).

Maybe the issue is that the DFG's pointer is somehow freed before being stored?

@HobbitQia (Collaborator, Author) commented Nov 22, 2024

I tried to print the instructions and their opcode names within the function at the beginning of our pass, i.e., the start of runOnFunction() in mapperPass.cpp, and the results are shown here. As you can see, the instructions that I printed through errs() << *inst are right, while the operation names I printed via inst->getOpcodeName() are totally wrong. I guess this may not be the fault of our mapper, since at this phase we have not done anything to the functions. Besides, I found a similar problem on Stack Overflow: https://stackoverflow.com/questions/29885825. However, there is no useful information in that link and, to be frank, I have no idea about the next step...

The code that I changed for debugging in mapperPass.cpp:

bool runOnFunction(Function &t_F) override {
      // Traverse all instructions in the function and print each
      // instruction operand alongside its reported opcode name.
      for (Function::iterator bb = t_F.begin(); bb != t_F.end(); ++bb)
        for (BasicBlock::iterator i = bb->begin(); i != bb->end(); ++i) {
          for (User::op_iterator op = i->op_begin(); op != i->op_end(); ++op) {
            if (Instruction* inst = dyn_cast<Instruction>(*op)) {
              errs() << "Instruction: " << *inst << "\n";
              errs() << "opcodename: " << inst->getOpcodeName() << "\n";
            }
          }
        }

      // Initializes input parameters.
      int rows                      = 4;
      ....

@tancheng (Owner)

Hi @HobbitQia, thanks for the investigation. This seems to be the opcode issue across different platforms: https://stackoverflow.com/questions/48894012/what-is-getopcode-in-llvm

A quick fix to work around this could be:

string getOpcodeNameHelper(Instruction& inst) {
  // Assumes inst has been rendered to text first; how to do that without
  // raw_string_ostream is exactly the open question here.
  string instStr = dumpToString(inst);  // placeholder for the dump step
  if (instStr.find("call") != std::string::npos) {
    return "call";
  } else if (instStr.find("add") != std::string::npos) {
    return "add";
  } else if (instStr.find("sub") != std::string::npos) {
    return "sub";
  }
  // ... remaining opcodes ...
  return "unknown";
}

Then change m_opcodeName = t_inst->getOpcodeName(); to m_opcodeName = getOpcodeNameHelper(*t_inst);. (I didn't pay attention to the Instruction's pointer and its dump methodology though.)
WDYT?

@HobbitQia (Collaborator, Author)

Oh, that may be the reason... I agree with your workaround and I will start updating the code immediately.

@tancheng (Owner)

To avoid unnecessary else, let's do:

string getOpcodeNameHelper(Instruction& inst) {
  string instStr = dumpToString(inst);  // placeholder for the dump step
  if (instStr.find("call") != std::string::npos) {
    return "call";
  }
  if (instStr.find("add") != std::string::npos) {
    return "add";
  }
  if (instStr.find("sub") != std::string::npos) {
    return "sub";
  }
  // ... remaining opcodes ...
  return "unknown";
}

@HobbitQia (Collaborator, Author)

To achieve it, I think we must use raw_string_ostream to convert an Instruction to a string... However, raw_xx_ostream will still cause a Segmentation Fault in our workflow...

@tancheng (Owner) commented Nov 22, 2024

How about sth like this then: if (I.getOpcode() == Instruction::Add) return "add";?

@HobbitQia (Collaborator, Author) commented Nov 22, 2024

Currently I try to get the operation names through code like if (inst->getOpcode() == Instruction::Add) return "add";.

The method of relying on the opcode cannot work either, and the results are still messed up due to the strange opcodes. As you can see in this run, I selected some content to show below (output lines 39-43):

inst:   %4 = phi i64 [ 0, %2 ], [ %11, %3 ] opcode name: select
inst:   %5 = getelementptr inbounds i32, i32* %0, i64 %4 opcode name: unknown
inst:   %6 = bitcast i32* %5 to <4 x i32>* opcode name: unknown
inst:   %7 = load <4 x i32>, <4 x i32>* %6, align 4, !tbaa !2 opcode name: getelementptr
inst:   %8 = sdiv <4 x i32> %7, <i32 3, i32 3, i32 3, i32 3> opcode name: urem

@tancheng (Owner)

Is there a way to perform inst->dump() or store the inst as a StringRef?

@tancheng (Owner)

I am also okay with either:

  • Align opcode: add a constant to the getOpcode() result to compensate for the mismatch due to GitHub's testing infra. The constant can be a param in param.json.
  • Dump instruction: dump instructions into a file, then read them back, to avoid using raw_xx_ostream.
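
For the first option, a hedged sketch of what the alignment could look like (the helper name is illustrative, and the offset's direction/size matches the urem-for-sdiv symptom above but should be verified empirically):

    #include "llvm/IR/Instruction.h"
    #include <string>

    using namespace llvm;

    // Sketch: compensate for a platform-dependent opcode shift observed in CI.
    // t_opcodeOffset would come from param.json (0 locally, 2 in the runner).
    std::string getOpcodeNameAligned(Instruction* t_inst, int t_opcodeOffset) {
      unsigned aligned = t_inst->getOpcode() - t_opcodeOffset;
      if (aligned == Instruction::Add)  return "add";
      if (aligned == Instruction::Sub)  return "sub";
      if (aligned == Instruction::SDiv) return "sdiv";
      // ... remaining opcodes elided ...
      return "unknown";
    }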

@HobbitQia (Collaborator, Author)

Yes!!! I chose the first method and it seems everything works well during testing in my repo. I will recheck it to ensure correctness and later I will push to this branch. Thanks for your patient instructions!

@tancheng (Owner)

LGTM. Can I merge it now~?

src/CGRANode.cpp (outdated diff):
@@ -425,13 +425,11 @@ bool CGRANode::enableFunctionality(string t_func) {
string type;
if (t_func.length() == 4) type = "none";
Owner

Can you plz add a comment about what 4 means here? Why is it special-cased?

Collaborator Author

4 is the length of "call". Here I mean the parameter in param.json is call rather than call-.... In this case, I regard it as a function call without a specific type name.

Owner

Sounds good. Plz just add this comment above. And refactor the code like:

const int kLengthOfCall = 4;
if (t_func.length() == kLengthOfCall) {
  type = "none";
}

src/CGRANode.cpp (outdated diff):
@@ -425,13 +425,11 @@ bool CGRANode::enableFunctionality(string t_func) {
string type;
if (t_func.length() == 4) type = "none";
else type = t_func.substr(t_func.find("call") + 5);
cout << type << endl;
enableCall(type);
} else if (t_func.find("complex") != string::npos) {
string type;
if (t_func.length() == 7) type = "none";
Owner

ditto

Collaborator Author

sry, I remember deleting these printing statements; this line has been removed from the code.

@HobbitQia (Collaborator, Author)

Currently I added a parameter opcodeOffset to param.json to specify the opcode offset. For GitHub workflow testing, as you can see in test/nonlinear_test/param.json and test/idiv_test/param.json, I set it to 2 and we can pass the GitHub test. For other cases, like running the mapper locally, we can skip this parameter and the offset defaults to 0, which has no influence on the execution.

LGTM. Can I merge it now~?

Sure. Thanks for your comments~

@tancheng tancheng merged commit a6de261 into tancheng:master Nov 23, 2024
1 check passed