[Serialization] No transitive identifier change #92085

ChuanqiXu9 · 2024-05-14T08:51:30Z

Following of #92083

The motivation is still cutting of the unnecessary change in the dependency chain. See the above link (recursively) for details.

After this patch, (and the above patch), we can already do something pretty interesting. For example,

Motivation example


//--- m-partA.cppm
export module m:partA;

export inline int getA() {
    return 43;
}

export class A {
public:
    int getMem();
};

export template <typename T>
class ATempl {
public:
    T getT();
};

//--- m-partA.v1.cppm
export module m:partA;

export inline int getA() {
    return 43;
}

// Now we add a new declaration without introducing a new type.
// The consuming module which didn't use m:partA completely is expected to be
// not changed.
export inline int getA2() {
    return 88;
}

export class A {
public:
    int getMem();
    // Now we add a new declaration without introducing a new type.
    // The consuming module which didn't use m:partA completely is expected to be
    // not changed.
    int getMem2();
};

export template <typename T>
class ATempl {
public:
    T getT();
    // Add a new declaration without introducing a new type.
    T getT2();
};

//--- m-partB.cppm
export module m:partB;

export inline int getB() {
    return 430;
}

//--- m.cppm
export module m;
export import :partA;
export import :partB;

//--- useBOnly.cppm
export module useBOnly;
import m;

export inline int get() {
    return getB();
}

In this example, module m exports two partitions :partA and :partB. And a consumer useBOnly only consumes the entities from :partB. So we don't hope the BMI of useBOnly changes if only :partA changes. After this patch, we can make it if the change of :partA doesn't introduce new types. (And we can get rid of this if we make no-transitive-type-change).

As the example shows, when we change the implementation of :partA from m-partA.cppm to m-partA.v1.cppm, we add new function declaration getA2() at the global namespace, add a new member function getMem2() to class A and add a new member function to getT2() to class template ATempl. And since :partA is not used by useBOnly completely, the BMI of useBOnly won't change after we made above changes.

Design details

Method used in this patch is similar with #92083 and #86912. It extends the 32 bit IdentifierID to 64 bits and use the higher 32 bits to store the module file index. So that the encoding of the identifier won't get affected by other modules.

Overhead

Similar with #92083 and #86912. The change is only expected to increase the size of the on-disk .pcm files and not affect the compile-time performances. And from my experiment, the size of the on-disk change only increase 1%+ and observe no compile-time impacts.

Future Plans

I'll try to do the same thing for type ids. IIRC, it won't change the dependency graph if we add a new type in an unused units. I do think this is a significant win. And this will be a pretty good answer to "why modules are better than headers."

llvmbot · 2024-05-14T08:51:53Z

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-modules

Author: Chuanqi Xu (ChuanqiXu9)

Changes

Following of #92083

The motivation is still cutting of the unnecessary change in the dependency chain. See the above link (recursively) for details.

After this patch, (and the above patch), we can already do something pretty interesting. For example,

Motivation example


//--- m-partA.cppm
export module m:partA;

export inline int getA() {
    return 43;
}

export class A {
public:
    int getMem();
};

export template &lt;typename T&gt;
class ATempl {
public:
    T getT();
};

//--- m-partA.v1.cppm
export module m:partA;

export inline int getA() {
    return 43;
}

// Now we add a new declaration without introducing a new type.
// The consuming module which didn't use m:partA completely is expected to be
// not changed.
export inline int getA2() {
    return 88;
}

export class A {
public:
    int getMem();
    // Now we add a new declaration without introducing a new type.
    // The consuming module which didn't use m:partA completely is expected to be
    // not changed.
    int getMem2();
};

export template &lt;typename T&gt;
class ATempl {
public:
    T getT();
    // Add a new declaration without introducing a new type.
    T getT2();
};

//--- m-partB.cppm
export module m:partB;

export inline int getB() {
    return 430;
}

//--- m.cppm
export module m;
export import :partA;
export import :partB;

//--- useBOnly.cppm
export module useBOnly;
import m;

export inline int get() {
    return getB();
}

In this example, module m exports two partitions :partA and :partB. And a consumer useBOnly only consumes the entities from :partB. So we don't hope the BMI of useBOnly changes if only :partA changes. After this patch, we can make it if the change of :partA doesn't introduce new types. (And we can get rid of this if we make no-transitive-type-change).

As the example shows, when we change the implementation of :partA from m-partA.cppm to m-partA.v1.cppm, we add new function declaration getA2() at the global namespace, add a new member function getMem2() to class A and add a new member function to getT2() to class template ATempl. And since :partA is not used by useBOnly completely, the BMI of useBOnly won't change after we made above changes.

Design details

Method used in this patch is similar with #92083 and #86912. It extends the 32 bit IdentifierID to 64 bits and use the higher 32 bits to store the module file index. So that the encoding of the identifier won't get affected by other modules.

Overhead

Similar with #92083 and #86912. The change is only expected to increase the size of the on-disk .pcm files and not affect the compile-time performances. And from my experiment, the size of the on-disk change only increase 1%+ and observe no compile-time impacts.

Future Plans

I'll try to do the same thing for type ids. IIRC, it won't change the dependency graph if we add a new type in an unused units. I do think this is a significant win. And this will be a pretty good answer to "why modules are better than headers."

Patch is 27.17 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/92085.diff

10 Files Affected:

(modified) clang/include/clang/Lex/ExternalPreprocessorSource.h (+1-1)
(modified) clang/include/clang/Lex/HeaderSearch.h (+1-1)
(modified) clang/include/clang/Serialization/ASTBitCodes.h (+1-1)
(modified) clang/include/clang/Serialization/ASTReader.h (+8-11)
(modified) clang/include/clang/Serialization/ModuleFile.h (-3)
(modified) clang/lib/Serialization/ASTReader.cpp (+51-45)
(modified) clang/lib/Serialization/ASTWriter.cpp (+40-21)
(modified) clang/lib/Serialization/GlobalModuleIndex.cpp (+2-1)
(modified) clang/lib/Serialization/ModuleFile.cpp (-1)
(added) clang/test/Modules/no-transitive-identifier-change.cppm (+102)

diff --git a/clang/include/clang/Lex/ExternalPreprocessorSource.h b/clang/include/clang/Lex/ExternalPreprocessorSource.h
index 6775841860373..bd5c11a4577f5 100644
--- a/clang/include/clang/Lex/ExternalPreprocessorSource.h
+++ b/clang/include/clang/Lex/ExternalPreprocessorSource.h
@@ -36,7 +36,7 @@ class ExternalPreprocessorSource {
   /// Return the identifier associated with the given ID number.
   ///
   /// The ID 0 is associated with the NULL identifier.
-  virtual IdentifierInfo *GetIdentifier(unsigned ID) = 0;
+  virtual IdentifierInfo *GetIdentifier(uint64_t ID) = 0;
 
   /// Map a module ID to a module.
   virtual Module *getModule(unsigned ModuleID) = 0;
diff --git a/clang/include/clang/Lex/HeaderSearch.h b/clang/include/clang/Lex/HeaderSearch.h
index 5ac63dddd4d4e..cb75dd429c448 100644
--- a/clang/include/clang/Lex/HeaderSearch.h
+++ b/clang/include/clang/Lex/HeaderSearch.h
@@ -124,7 +124,7 @@ struct HeaderFileInfo {
   /// This ID number will be non-zero when there is a controlling
   /// macro whose IdentifierInfo may not yet have been loaded from
   /// external storage.
-  unsigned ControllingMacroID = 0;
+  uint64_t ControllingMacroID = 0;
 
   /// If this file has a \#ifndef XXX (or equivalent) guard that
   /// protects the entire contents of the file, this is the identifier
diff --git a/clang/include/clang/Serialization/ASTBitCodes.h b/clang/include/clang/Serialization/ASTBitCodes.h
index 772452e3afc55..1fd482b5aff0e 100644
--- a/clang/include/clang/Serialization/ASTBitCodes.h
+++ b/clang/include/clang/Serialization/ASTBitCodes.h
@@ -59,7 +59,7 @@ const unsigned VERSION_MINOR = 1;
 ///
 /// The ID numbers of identifiers are consecutive (in order of discovery)
 /// and start at 1. 0 is reserved for NULL.
-using IdentifierID = uint32_t;
+using IdentifierID = uint64_t;
 
 /// The number of predefined identifier IDs.
 const unsigned int NUM_PREDEF_IDENT_IDS = 1;
diff --git a/clang/include/clang/Serialization/ASTReader.h b/clang/include/clang/Serialization/ASTReader.h
index d028d52fc5ef1..1da9123280f26 100644
--- a/clang/include/clang/Serialization/ASTReader.h
+++ b/clang/include/clang/Serialization/ASTReader.h
@@ -657,14 +657,6 @@ class ASTReader
   /// been loaded.
   std::vector<IdentifierInfo *> IdentifiersLoaded;
 
-  using GlobalIdentifierMapType =
-      ContinuousRangeMap<serialization::IdentifierID, ModuleFile *, 4>;
-
-  /// Mapping from global identifier IDs to the module in which the
-  /// identifier resides along with the offset that should be added to the
-  /// global identifier ID to produce a local ID.
-  GlobalIdentifierMapType GlobalIdentifierMap;
-
   /// A vector containing macros that have already been
   /// loaded.
   ///
@@ -1536,6 +1528,11 @@ class ASTReader
   /// Translate a \param GlobalDeclID to the index of DeclsLoaded array.
   unsigned translateGlobalDeclIDToIndex(GlobalDeclID ID) const;
 
+  /// Translate an \param IdentifierID ID to the index of IdentifiersLoaded
+  /// array and the corresponding module file.
+  std::pair<ModuleFile *, unsigned>
+  translateIdentifierIDToIndex(serialization::IdentifierID ID) const;
+
 public:
   /// Load the AST file and validate its contents against the given
   /// Preprocessor.
@@ -2120,7 +2117,7 @@ class ASTReader
   /// Load a selector from disk, registering its ID if it exists.
   void LoadSelector(Selector Sel);
 
-  void SetIdentifierInfo(unsigned ID, IdentifierInfo *II);
+  void SetIdentifierInfo(serialization::IdentifierID ID, IdentifierInfo *II);
   void SetGloballyVisibleDecls(IdentifierInfo *II,
                                const SmallVectorImpl<GlobalDeclID> &DeclIDs,
                                SmallVectorImpl<Decl *> *Decls = nullptr);
@@ -2145,10 +2142,10 @@ class ASTReader
     return DecodeIdentifierInfo(ID);
   }
 
-  IdentifierInfo *getLocalIdentifier(ModuleFile &M, unsigned LocalID);
+  IdentifierInfo *getLocalIdentifier(ModuleFile &M, uint64_t LocalID);
 
   serialization::IdentifierID getGlobalIdentifierID(ModuleFile &M,
-                                                    unsigned LocalID);
+                                                    uint64_t LocalID);
 
   void resolvePendingMacro(IdentifierInfo *II, const PendingMacroInfo &PMInfo);
 
diff --git a/clang/include/clang/Serialization/ModuleFile.h b/clang/include/clang/Serialization/ModuleFile.h
index 5cada272f59d3..075a07a81d637 100644
--- a/clang/include/clang/Serialization/ModuleFile.h
+++ b/clang/include/clang/Serialization/ModuleFile.h
@@ -310,9 +310,6 @@ class ModuleFile {
   /// Base identifier ID for identifiers local to this module.
   serialization::IdentifierID BaseIdentifierID = 0;
 
-  /// Remapping table for identifier IDs in this module.
-  ContinuousRangeMap<uint32_t, int, 2> IdentifierRemap;
-
   /// Actual data for the on-disk hash table of identifiers.
   ///
   /// This pointer points into a memory buffer, where the on-disk hash
diff --git a/clang/lib/Serialization/ASTReader.cpp b/clang/lib/Serialization/ASTReader.cpp
index b0059e449182b..b4f2bd182e473 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -918,7 +918,7 @@ ASTSelectorLookupTrait::ReadKey(const unsigned char* d, unsigned) {
   SelectorTable &SelTable = Reader.getContext().Selectors;
   unsigned N = endian::readNext<uint16_t, llvm::endianness::little>(d);
   const IdentifierInfo *FirstII = Reader.getLocalIdentifier(
-      F, endian::readNext<uint32_t, llvm::endianness::little>(d));
+      F, endian::readNext<IdentifierID, llvm::endianness::little>(d));
   if (N == 0)
     return SelTable.getNullarySelector(FirstII);
   else if (N == 1)
@@ -928,7 +928,7 @@ ASTSelectorLookupTrait::ReadKey(const unsigned char* d, unsigned) {
   Args.push_back(FirstII);
   for (unsigned I = 1; I != N; ++I)
     Args.push_back(Reader.getLocalIdentifier(
-        F, endian::readNext<uint32_t, llvm::endianness::little>(d)));
+        F, endian::readNext<IdentifierID, llvm::endianness::little>(d)));
 
   return SelTable.getSelector(N, Args.data());
 }
@@ -1009,7 +1009,8 @@ static bool readBit(unsigned &Bits) {
 IdentifierID ASTIdentifierLookupTrait::ReadIdentifierID(const unsigned char *d) {
   using namespace llvm::support;
 
-  unsigned RawID = endian::readNext<uint32_t, llvm::endianness::little>(d);
+  IdentifierID RawID =
+      endian::readNext<IdentifierID, llvm::endianness::little>(d);
   return Reader.getGlobalIdentifierID(F, RawID >> 1);
 }
 
@@ -1027,9 +1028,12 @@ IdentifierInfo *ASTIdentifierLookupTrait::ReadData(const internal_key_type& k,
                                                    unsigned DataLen) {
   using namespace llvm::support;
 
-  unsigned RawID = endian::readNext<uint32_t, llvm::endianness::little>(d);
+  IdentifierID RawID =
+      endian::readNext<IdentifierID, llvm::endianness::little>(d);
   bool IsInteresting = RawID & 0x01;
 
+  DataLen -= sizeof(IdentifierID);
+
   // Wipe out the "is interesting" bit.
   RawID = RawID >> 1;
 
@@ -1060,7 +1064,7 @@ IdentifierInfo *ASTIdentifierLookupTrait::ReadData(const internal_key_type& k,
   bool HadMacroDefinition = readBit(Bits);
 
   assert(Bits == 0 && "Extra bits in the identifier?");
-  DataLen -= 8;
+  DataLen -= sizeof(uint16_t) * 2;
 
   // Set or check the various bits in the IdentifierInfo structure.
   // Token IDs are read-only.
@@ -1186,7 +1190,7 @@ ASTDeclContextNameLookupTrait::ReadKey(const unsigned char *d, unsigned) {
   case DeclarationName::CXXLiteralOperatorName:
   case DeclarationName::CXXDeductionGuideName:
     Data = (uint64_t)Reader.getLocalIdentifier(
-        F, endian::readNext<uint32_t, llvm::endianness::little>(d));
+        F, endian::readNext<IdentifierID, llvm::endianness::little>(d));
     break;
   case DeclarationName::ObjCZeroArgSelector:
   case DeclarationName::ObjCOneArgSelector:
@@ -2055,7 +2059,7 @@ HeaderFileInfoTrait::ReadData(internal_key_ref key, const unsigned char *d,
   HFI.DirInfo = (Flags >> 1) & 0x07;
   HFI.IndexHeaderMapHeader = Flags & 0x01;
   HFI.ControllingMacroID = Reader.getGlobalIdentifierID(
-      M, endian::readNext<uint32_t, llvm::endianness::little>(d));
+      M, endian::readNext<IdentifierID, llvm::endianness::little>(d));
   if (unsigned FrameworkOffset =
           endian::readNext<uint32_t, llvm::endianness::little>(d)) {
     // The framework offset is 1 greater than the actual offset,
@@ -3429,24 +3433,11 @@ llvm::Error ASTReader::ReadASTBlock(ModuleFile &F,
             "duplicate IDENTIFIER_OFFSET record in AST file");
       F.IdentifierOffsets = (const uint32_t *)Blob.data();
       F.LocalNumIdentifiers = Record[0];
-      unsigned LocalBaseIdentifierID = Record[1];
       F.BaseIdentifierID = getTotalNumIdentifiers();
 
-      if (F.LocalNumIdentifiers > 0) {
-        // Introduce the global -> local mapping for identifiers within this
-        // module.
-        GlobalIdentifierMap.insert(std::make_pair(getTotalNumIdentifiers() + 1,
-                                                  &F));
-
-        // Introduce the local -> global mapping for identifiers within this
-        // module.
-        F.IdentifierRemap.insertOrReplace(
-          std::make_pair(LocalBaseIdentifierID,
-                         F.BaseIdentifierID - LocalBaseIdentifierID));
-
+      if (F.LocalNumIdentifiers > 0)
         IdentifiersLoaded.resize(IdentifiersLoaded.size()
                                  + F.LocalNumIdentifiers);
-      }
       break;
     }
 
@@ -4038,7 +4029,6 @@ void ASTReader::ReadModuleOffsetMap(ModuleFile &F) const {
   F.ModuleOffsetMap = StringRef();
 
   using RemapBuilder = ContinuousRangeMap<uint32_t, int, 2>::Builder;
-  RemapBuilder IdentifierRemap(F.IdentifierRemap);
   RemapBuilder MacroRemap(F.MacroRemap);
   RemapBuilder PreprocessedEntityRemap(F.PreprocessedEntityRemap);
   RemapBuilder SubmoduleRemap(F.SubmoduleRemap);
@@ -4071,8 +4061,6 @@ void ASTReader::ReadModuleOffsetMap(ModuleFile &F) const {
 
     ImportedModuleVector.push_back(OM);
 
-    uint32_t IdentifierIDOffset =
-        endian::readNext<uint32_t, llvm::endianness::little>(Data);
     uint32_t MacroIDOffset =
         endian::readNext<uint32_t, llvm::endianness::little>(Data);
     uint32_t PreprocessedEntityIDOffset =
@@ -4092,7 +4080,6 @@ void ASTReader::ReadModuleOffsetMap(ModuleFile &F) const {
                                     static_cast<int>(BaseOffset - Offset)));
     };
 
-    mapOffset(IdentifierIDOffset, OM->BaseIdentifierID, IdentifierRemap);
     mapOffset(MacroIDOffset, OM->BaseMacroID, MacroRemap);
     mapOffset(PreprocessedEntityIDOffset, OM->BasePreprocessedEntityID,
               PreprocessedEntityRemap);
@@ -8181,7 +8168,6 @@ LLVM_DUMP_METHOD void ASTReader::dump() {
   dumpModuleIDMap("Global bit offset map", GlobalBitOffsetsMap);
   dumpModuleIDMap("Global source location entry map", GlobalSLocEntryMap);
   dumpModuleIDMap("Global type map", GlobalTypeMap);
-  dumpModuleIDMap("Global identifier map", GlobalIdentifierMap);
   dumpModuleIDMap("Global macro map", GlobalMacroMap);
   dumpModuleIDMap("Global submodule map", GlobalSubmoduleMap);
   dumpModuleIDMap("Global selector map", GlobalSelectorMap);
@@ -8822,8 +8808,9 @@ void ASTReader::LoadSelector(Selector Sel) {
 
 void ASTReader::SetIdentifierInfo(IdentifierID ID, IdentifierInfo *II) {
   assert(ID && "Non-zero identifier ID required");
-  assert(ID <= IdentifiersLoaded.size() && "identifier ID out of range");
-  IdentifiersLoaded[ID - 1] = II;
+  unsigned Index = translateIdentifierIDToIndex(ID).second;
+  assert(Index < IdentifiersLoaded.size() && "identifier ID out of range");
+  IdentifiersLoaded[Index] = II;
   if (DeserializationListener)
     DeserializationListener->IdentifierRead(ID, II);
 }
@@ -8876,6 +8863,22 @@ void ASTReader::SetGloballyVisibleDecls(
   }
 }
 
+std::pair<ModuleFile *, unsigned>
+ASTReader::translateIdentifierIDToIndex(IdentifierID ID) const {
+  if (ID == 0)
+    return {nullptr, 0};
+
+  unsigned ModuleFileIndex = ID >> 32;
+  unsigned LocalID = ID & llvm::maskTrailingOnes<IdentifierID>(32);
+
+  assert(ModuleFileIndex && "not translating loaded IdentifierID?");
+  assert(getModuleManager().size() > ModuleFileIndex - 1);
+
+  ModuleFile &MF = getModuleManager()[ModuleFileIndex - 1];
+  assert(LocalID < MF.LocalNumIdentifiers);
+  return {&MF, MF.BaseIdentifierID + LocalID};
+}
+
 IdentifierInfo *ASTReader::DecodeIdentifierInfo(IdentifierID ID) {
   if (ID == 0)
     return nullptr;
@@ -8885,45 +8888,48 @@ IdentifierInfo *ASTReader::DecodeIdentifierInfo(IdentifierID ID) {
     return nullptr;
   }
 
-  ID -= 1;
-  if (!IdentifiersLoaded[ID]) {
-    GlobalIdentifierMapType::iterator I = GlobalIdentifierMap.find(ID + 1);
-    assert(I != GlobalIdentifierMap.end() && "Corrupted global identifier map");
-    ModuleFile *M = I->second;
-    unsigned Index = ID - M->BaseIdentifierID;
+  auto [M, Index] = translateIdentifierIDToIndex(ID);
+  if (!IdentifiersLoaded[Index]) {
+    assert(M != nullptr && "Untranslated Identifier ID?");
+    assert(Index >= M->BaseIdentifierID);
+    unsigned LocalIndex = Index - M->BaseIdentifierID;
     const unsigned char *Data =
-        M->IdentifierTableData + M->IdentifierOffsets[Index];
+        M->IdentifierTableData + M->IdentifierOffsets[LocalIndex];
 
     ASTIdentifierLookupTrait Trait(*this, *M);
     auto KeyDataLen = Trait.ReadKeyDataLength(Data);
     auto Key = Trait.ReadKey(Data, KeyDataLen.first);
     auto &II = PP.getIdentifierTable().get(Key);
-    IdentifiersLoaded[ID] = &II;
+    IdentifiersLoaded[Index] = &II;
     markIdentifierFromAST(*this,  II);
     if (DeserializationListener)
-      DeserializationListener->IdentifierRead(ID + 1, &II);
+      DeserializationListener->IdentifierRead(ID, &II);
   }
 
-  return IdentifiersLoaded[ID];
+  return IdentifiersLoaded[Index];
 }
 
-IdentifierInfo *ASTReader::getLocalIdentifier(ModuleFile &M, unsigned LocalID) {
+IdentifierInfo *ASTReader::getLocalIdentifier(ModuleFile &M, uint64_t LocalID) {
   return DecodeIdentifierInfo(getGlobalIdentifierID(M, LocalID));
 }
 
-IdentifierID ASTReader::getGlobalIdentifierID(ModuleFile &M, unsigned LocalID) {
+IdentifierID ASTReader::getGlobalIdentifierID(ModuleFile &M, uint64_t LocalID) {
   if (LocalID < NUM_PREDEF_IDENT_IDS)
     return LocalID;
 
   if (!M.ModuleOffsetMap.empty())
     ReadModuleOffsetMap(M);
 
-  ContinuousRangeMap<uint32_t, int, 2>::iterator I
-    = M.IdentifierRemap.find(LocalID - NUM_PREDEF_IDENT_IDS);
-  assert(I != M.IdentifierRemap.end()
-         && "Invalid index into identifier index remap");
+  unsigned ModuleFileIndex = LocalID >> 32;
+  LocalID &= llvm::maskTrailingOnes<IdentifierID>(32);
+  ModuleFile *MF =
+      ModuleFileIndex ? M.DependentModules[ModuleFileIndex - 1] : &M;
+  assert(MF && "malformed identifier ID encoding?");
 
-  return LocalID + I->second;
+  if (!ModuleFileIndex)
+    LocalID -= NUM_PREDEF_IDENT_IDS;
+
+  return ((IdentifierID)(MF->Index + 1) << 32) | LocalID;
 }
 
 MacroInfo *ASTReader::getMacro(MacroID ID) {
diff --git a/clang/lib/Serialization/ASTWriter.cpp b/clang/lib/Serialization/ASTWriter.cpp
index 21c7e50d72ebb..0008eaab42cdf 100644
--- a/clang/lib/Serialization/ASTWriter.cpp
+++ b/clang/lib/Serialization/ASTWriter.cpp
@@ -1990,7 +1990,7 @@ namespace {
     std::pair<unsigned, unsigned>
     EmitKeyDataLength(raw_ostream& Out, key_type_ref key, data_type_ref Data) {
       unsigned KeyLen = key.Filename.size() + 1 + 8 + 8;
-      unsigned DataLen = 1 + 4 + 4;
+      unsigned DataLen = 1 + sizeof(IdentifierID) + 4;
       for (auto ModInfo : Data.KnownHeaders)
         if (Writer.getLocalOrImportedSubmoduleID(ModInfo.getModule()))
           DataLen += 4;
@@ -2026,9 +2026,10 @@ namespace {
       LE.write<uint8_t>(Flags);
 
       if (!Data.HFI.ControllingMacro)
-        LE.write<uint32_t>(Data.HFI.ControllingMacroID);
+        LE.write<IdentifierID>(Data.HFI.ControllingMacroID);
       else
-        LE.write<uint32_t>(Writer.getIdentifierRef(Data.HFI.ControllingMacro));
+        LE.write<IdentifierID>(
+            Writer.getIdentifierRef(Data.HFI.ControllingMacro));
 
       unsigned Offset = 0;
       if (!Data.HFI.Framework.empty()) {
@@ -3453,7 +3454,9 @@ class ASTMethodPoolTrait {
   std::pair<unsigned, unsigned>
     EmitKeyDataLength(raw_ostream& Out, Selector Sel,
                       data_type_ref Methods) {
-    unsigned KeyLen = 2 + (Sel.getNumArgs()? Sel.getNumArgs() * 4 : 4);
+    unsigned KeyLen =
+        2 + (Sel.getNumArgs() ? Sel.getNumArgs() * sizeof(IdentifierID)
+                              : sizeof(IdentifierID));
     unsigned DataLen = 4 + 2 + 2; // 2 bytes for each of the method counts
     for (const ObjCMethodList *Method = &Methods.Instance; Method;
          Method = Method->getNext())
@@ -3478,7 +3481,7 @@ class ASTMethodPoolTrait {
     if (N == 0)
       N = 1;
     for (unsigned I = 0; I != N; ++I)
-      LE.write<uint32_t>(
+      LE.write<IdentifierID>(
           Writer.getIdentifierRef(Sel.getIdentifierInfoForSlot(I)));
   }
 
@@ -3789,7 +3792,7 @@ class ASTIdentifierTableTrait {
       InterestingIdentifierOffsets->push_back(Out.tell());
 
     unsigned KeyLen = II->getLength() + 1;
-    unsigned DataLen = 4; // 4 bytes for the persistent ID << 1
+    unsigned DataLen = sizeof(IdentifierID); // bytes for the persistent ID << 1
     if (isInterestingIdentifier(II, MacroOffset)) {
       DataLen += 2; // 2 bytes for builtin ID
       DataLen += 2; // 2 bytes for flags
@@ -3815,11 +3818,11 @@ class ASTIdentifierTableTrait {
 
     auto MacroOffset = Writer.getMacroDirectivesOffset(II);
     if (!isInterestingIdentifier(II, MacroOffset)) {
-      LE.write<uint32_t>(ID << 1);
+      LE.write<IdentifierID>(ID << 1);
       return;
     }
 
-    LE.write<uint32_t>((ID << 1) | 0x01);
+    LE.write<IdentifierID>((ID << 1) | 0x01);
     uint32_t Bits = (uint32_t)II->getObjCOrBuiltinID();
     assert((Bits & 0xffff) == Bits && "ObjCOrBuiltinID too big for ASTReader.");
     LE.write<uint16_t>(Bits);
@@ -3852,6 +3855,10 @@ class ASTIdentifierTableTrait {
 
 } // namespace
 
+/// If the \param IdentifierID ID is a local Identifier ID. If the higher
+/// bits of ID is 0, it implies that the ID doesn't come from AST files.
+static bool isLocalIdentifierID(IdentifierID ID) { return !(ID >> 32); }
+
 /// Write the identifier table into the AST file.
 ///
 /// The identifier table consists of a blob containing string data
@@ -3896,7 +3903,7 @@ void ASTWriter::WriteIdentifierTable(Preprocessor &PP,
 
       // Write out identifiers if either the ID is local or the identifier has
       // changed since it was loaded.
-      if (ID >= FirstIdentID || !Chain || !II->isFromAST() ||
+      if (isLocalIdentifierID(ID) || !Chain || !II->isFromAST() ||
           II->hasChangedSinceDeserialization() ||
           (Trait.needDecls() &&
            II->hasFETokenInfoChangedSinceDeserialization()))
@@ -3931,7 +3938,6 @@ void ASTWriter::WriteIdentifierTable(Preprocessor &PP,
   auto Abbrev = std::make_shared<BitCodeAbbrev>();
   Abbrev->Add(BitCodeAbbrevOp(IDENTIFIER_OFFSET));
   Abbrev->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::Fixed, 32)); // # of identifiers
-  Abbrev->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::Fixed, 32)); // first ID
   Abbrev->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::Blob));
   unsigned IdentifierOffsetAbbrev = Stream.EmitAbbrev(std::move(Abbrev));
 
@@ -3941,8 +3947,7 @@ void ASTWriter::WriteIdentifierTable(Preprocessor &PP,
 #endif
 
   RecordData::value_type Record[] = {IDENTIFIER_OFFSET,
-                                     IdentifierOffsets.size(),
-                                     FirstIdentID - NUM_PREDEF_IDENT_IDS};
+                                     IdentifierOffsets.size()};
   Stream.EmitRecordWithBlob(IdentifierOffsetAbbrev, Record,
                             bytes(IdentifierOffsets));
 
@@ -4032,11 +4037,13 @@ class ASTDeclContextNameLookupTrait {
     unsigned KeyLen = 1;
     switch (Name.getKind()) {
     case DeclarationName::Identifier:
+    case DeclarationName::CXXLiteralOperatorName:
+    case DeclarationName::CXXDeductionGuideName:
+      KeyLen += sizeof(IdentifierID);
+      break;
     case DeclarationName::ObjCZeroArgSelector:
     case DeclarationName::ObjCOneArgSelector:
     case DeclarationName::ObjCMultiArgSelector:
-    case DeclarationName::CXXLiteralOperatorName:
-    case DeclarationName::CXXDeductionGuideName:
       KeyLen += 4;
       break;
     case DeclarationName::CXXOperatorName:
@@ -4064,7 +4071,7 @@...
[truncated]

ChuanqiXu9 · 2024-05-20T01:58:06Z

@jansvoboda11 @Bigcheese gentle ping

clang/include/clang/Lex/HeaderSearch.h

clang/lib/Serialization/ASTReader.cpp

clang/lib/Serialization/ASTWriter.cpp

ChuanqiXu9 · 2024-06-03T07:43:11Z

@jansvoboda11 ping. Did my answers resolve your concerns?

Inspired by the review process in #92085. The check `ID >= FirstIdentID` can cover the following check `!II->isFromAST()`.

github-actions · 2024-06-04T07:43:38Z

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format --diff 799ae77993fa5d7b0638f10b3895090f8748de92 e8f756ec7f8ea7e5bf18cc122a965fb2f258fd15 -- clang/test/Modules/no-transitive-identifier-change.cppm clang/include/clang/Lex/ExternalPreprocessorSource.h clang/include/clang/Lex/HeaderSearch.h clang/include/clang/Serialization/ASTBitCodes.h clang/include/clang/Serialization/ASTReader.h clang/include/clang/Serialization/ModuleFile.h clang/lib/Lex/HeaderSearch.cpp clang/lib/Serialization/ASTReader.cpp clang/lib/Serialization/ASTWriter.cpp clang/lib/Serialization/GlobalModuleIndex.cpp clang/lib/Serialization/ModuleFile.cpp

View the diff from clang-format here.

diff --git a/clang/lib/Serialization/ASTReader.cpp b/clang/lib/Serialization/ASTReader.cpp
index 29a7fcb2cb..fb41a65608 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -3436,8 +3436,8 @@ llvm::Error ASTReader::ReadASTBlock(ModuleFile &F,
       F.BaseIdentifierID = getTotalNumIdentifiers();
 
       if (F.LocalNumIdentifiers > 0)
-        IdentifiersLoaded.resize(IdentifiersLoaded.size()
-                                 + F.LocalNumIdentifiers);
+        IdentifiersLoaded.resize(IdentifiersLoaded.size() +
+                                 F.LocalNumIdentifiers);
       break;
     }

ChuanqiXu9 · 2024-06-11T01:34:01Z

@jansvoboda11 ping~ I hope we can land the series of the patches in 2 weeks to give more baking times for them

jansvoboda11

LGTM, but I think it would be better to split out the LazyIdentifierInfoPtr change into a separate commit.

ChuanqiXu9 · 2024-06-18T01:43:58Z

Thanks. Will do when landing.

…in HeaderFileInfo This patch is helpful to reduce 32 bits for HeaderFileInfo by combining a uint32_t and pointer into a tagged pointer. This is reviewed as part of #92085 and required to be split as a separate commit

Following of #92085. #### motivation The motivation is still cutting of the unnecessary change in the dependency chain. See the above link (recursively) for details. And this will be the last patch of the `no-transitive-*-change` series. If there are any following patches, they might be C++20 Named modules specific to handle special grammars like `ADL` (See the reply in https://discourse.llvm.org/t/rfc-c-20-modules-introduce-thin-bmi-and-decls-hash/74755/53 for example). So they won't affect the whole serialization part as the series patch did. #### example After this patch, finally we are able to cut of unnecessary change of types. For example, ``` //--- m-partA.cppm export module m:partA; //--- m-partA.v1.cppm export module m:partA; namespace NS { class A { public: int getValue() { return 43; } }; } //--- m-partB.cppm export module m:partB; export inline int getB() { return 430; } //--- m.cppm export module m; export import :partA; export import :partB; //--- useBOnly.cppm export module useBOnly; import m; export inline int get() { return getB(); } ``` The BMI of `useBOnly.cppm` is expected to not change if we only add a new class in `m:partA`. This will be pretty useful in practice. #### implementation details The key idea of this patch is similar with the previous patches: extend the 32bits type ID to 64bits so that we can store the module file index in the higher bits. Then the encoding of the type ID is independent on the imported modules. But there are two differences from the previous patches: - TypeID is not completely an index of serialized types. We used the lower 3 bits to store the qualifiers. - TypeID won't take part in any lookup process. So the uses of TypeID is much less than the previous patches. The first difference make we have some more slightly complex bit operations. And the second difference makes the patch much simpler than the previous ones.

…er change (llvm#92085)" Temporarily reverts 2f2ea35 while we fix Swift's ClangImporter to handle the 32 bit -> 64 bit ID change.

[rebranch] Temporarily revert "[Serialization] No transitive identifier change (llvm#92085)"

…in HeaderFileInfo This patch is helpful to reduce 32 bits for HeaderFileInfo by combining a uint32_t and pointer into a tagged pointer. This is reviewed as part of llvm#92085 and required to be split as a separate commit

Following of llvm#92083 The motivation is still cutting of the unnecessary change in the dependency chain. See the above link (recursively) for details. After this patch, (and the above patch), we can already do something pretty interesting. For example, #### Motivation example ``` //--- m-partA.cppm export module m:partA; export inline int getA() { return 43; } export class A { public: int getMem(); }; export template <typename T> class ATempl { public: T getT(); }; //--- m-partA.v1.cppm export module m:partA; export inline int getA() { return 43; } // Now we add a new declaration without introducing a new type. // The consuming module which didn't use m:partA completely is expected to be // not changed. export inline int getA2() { return 88; } export class A { public: int getMem(); // Now we add a new declaration without introducing a new type. // The consuming module which didn't use m:partA completely is expected to be // not changed. int getMem2(); }; export template <typename T> class ATempl { public: T getT(); // Add a new declaration without introducing a new type. T getT2(); }; //--- m-partB.cppm export module m:partB; export inline int getB() { return 430; } //--- m.cppm export module m; export import :partA; export import :partB; //--- useBOnly.cppm export module useBOnly; import m; export inline int get() { return getB(); } ``` In this example, module `m` exports two partitions `:partA` and `:partB`. And a consumer `useBOnly` only consumes the entities from `:partB`. So we don't hope the BMI of `useBOnly` changes if only `:partA` changes. After this patch, we can make it if the change of `:partA` doesn't introduce new types. (And we can get rid of this if we make no-transitive-type-change). As the example shows, when we change the implementation of `:partA` from `m-partA.cppm` to `m-partA.v1.cppm`, we add new function declaration `getA2()` at the global namespace, add a new member function `getMem2()` to class `A` and add a new member function to `getT2()` to class template `ATempl`. And since `:partA` is not used by `useBOnly` completely, the BMI of `useBOnly` won't change after we made above changes. #### Design details Method used in this patch is similar with llvm#92083 and llvm#86912. It extends the 32 bit IdentifierID to 64 bits and use the higher 32 bits to store the module file index. So that the encoding of the identifier won't get affected by other modules. #### Overhead Similar with llvm#92083 and llvm#86912. The change is only expected to increase the size of the on-disk .pcm files and not affect the compile-time performances. And from my experiment, the size of the on-disk change only increase 1%+ and observe no compile-time impacts. #### Future Plans I'll try to do the same thing for type ids. IIRC, it won't change the dependency graph if we add a new type in an unused units. I do think this is a significant win. And this will be a pretty good answer to "why modules are better than headers."

Following of llvm#92085. #### motivation The motivation is still cutting of the unnecessary change in the dependency chain. See the above link (recursively) for details. And this will be the last patch of the `no-transitive-*-change` series. If there are any following patches, they might be C++20 Named modules specific to handle special grammars like `ADL` (See the reply in https://discourse.llvm.org/t/rfc-c-20-modules-introduce-thin-bmi-and-decls-hash/74755/53 for example). So they won't affect the whole serialization part as the series patch did. #### example After this patch, finally we are able to cut of unnecessary change of types. For example, ``` //--- m-partA.cppm export module m:partA; //--- m-partA.v1.cppm export module m:partA; namespace NS { class A { public: int getValue() { return 43; } }; } //--- m-partB.cppm export module m:partB; export inline int getB() { return 430; } //--- m.cppm export module m; export import :partA; export import :partB; //--- useBOnly.cppm export module useBOnly; import m; export inline int get() { return getB(); } ``` The BMI of `useBOnly.cppm` is expected to not change if we only add a new class in `m:partA`. This will be pretty useful in practice. #### implementation details The key idea of this patch is similar with the previous patches: extend the 32bits type ID to 64bits so that we can store the module file index in the higher bits. Then the encoding of the type ID is independent on the imported modules. But there are two differences from the previous patches: - TypeID is not completely an index of serialized types. We used the lower 3 bits to store the qualifiers. - TypeID won't take part in any lookup process. So the uses of TypeID is much less than the previous patches. The first difference make we have some more slightly complex bit operations. And the second difference makes the patch much simpler than the previous ones.

(Some backgrounds, not required to read: https://discourse.llvm.org/t/rfc-c-20-modules-introduce-thin-bmi-and-decls-hash/74755) This is the document part for the no-transitive-change (#86912, #92083, #92085, #92511) to provide the ability for build system to skip some unnecessary recompilations. See the patch for examples.

(Some backgrounds, not required to read: https://discourse.llvm.org/t/rfc-c-20-modules-introduce-thin-bmi-and-decls-hash/74755) This is the document part for the no-transitive-change (llvm#86912, llvm#92083, llvm#92085, llvm#92511) to provide the ability for build system to skip some unnecessary recompilations. See the patch for examples.

Summary: (Some backgrounds, not required to read: https://discourse.llvm.org/t/rfc-c-20-modules-introduce-thin-bmi-and-decls-hash/74755) This is the document part for the no-transitive-change (#86912, #92083, #92085, #92511) to provide the ability for build system to skip some unnecessary recompilations. See the patch for examples. Test Plan: Reviewers: Subscribers: Tasks: Tags: Differential Revision: https://phabricator.intern.facebook.com/D60251509

…identifier change (llvm#92085)""

ChuanqiXu9 added the clang:modules C++20 modules and Clang Header Modules label May 14, 2024

ChuanqiXu9 requested review from zygoloid, Bigcheese, jyknight and jansvoboda11 May 14, 2024 08:51

ChuanqiXu9 self-assigned this May 14, 2024

llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels May 14, 2024

ChuanqiXu9 force-pushed the users/ChuanqiXu9/NoTtansitiveIdentifierChange branch from d32a410 to c612b56 Compare May 15, 2024 03:37

ChuanqiXu9 mentioned this pull request May 17, 2024

[serialization] No transitive type change #92511

Merged

jansvoboda11 reviewed May 24, 2024

View reviewed changes

clang/include/clang/Lex/HeaderSearch.h Outdated Show resolved Hide resolved

jansvoboda11 reviewed May 24, 2024

View reviewed changes

clang/lib/Serialization/ASTReader.cpp Show resolved Hide resolved

jansvoboda11 reviewed May 24, 2024

View reviewed changes

clang/lib/Serialization/ASTWriter.cpp Outdated Show resolved Hide resolved

ChuanqiXu9 force-pushed the users/ChuanqiXu9/NoTransitiveDeclChange branch from bf9f481 to 8b7a650 Compare May 27, 2024 06:28

ChuanqiXu9 force-pushed the users/ChuanqiXu9/NoTtansitiveIdentifierChange branch from c612b56 to a2fb7f5 Compare May 27, 2024 07:31

Base automatically changed from users/ChuanqiXu9/NoTransitiveDeclChange to main June 3, 2024 08:13

ChuanqiXu9 added a commit that referenced this pull request Jun 4, 2024

[NFC] [Serialization] Avoid unnecessary check for if Identifier from AST

799ae77

Inspired by the review process in #92085. The check `ID >= FirstIdentID` can cover the following check `!II->isFromAST()`.

ChuanqiXu9 force-pushed the users/ChuanqiXu9/NoTtansitiveIdentifierChange branch from a2fb7f5 to e8f756e Compare June 4, 2024 07:40

jansvoboda11 approved these changes Jun 12, 2024

View reviewed changes

ChuanqiXu9 requested a review from ilya-biryukov June 18, 2024 01:43

ChuanqiXu9 mentioned this pull request Jun 18, 2024

[Serialization] Storing DeclID separately #95897

Open

[Serialization] No transitive identifier change

a2f8832

ChuanqiXu9 force-pushed the users/ChuanqiXu9/NoTtansitiveIdentifierChange branch from e8f756e to a2f8832 Compare June 20, 2024 05:28

ChuanqiXu9 merged commit 2f2ea35 into main Jun 20, 2024
4 of 6 checks passed

ChuanqiXu9 deleted the users/ChuanqiXu9/NoTtansitiveIdentifierChange branch June 20, 2024 05:30

ChuanqiXu9 mentioned this pull request Jun 24, 2024

[Doc] Update documentation for no-transitive-change #96453

Merged

bnbarham added a commit to swiftlang/llvm-project that referenced this pull request Jul 4, 2024

Merge pull request #8948 from bnbarham/temp-revert

5360090

[rebranch] Temporarily revert "[Serialization] No transitive identifier change (llvm#92085)"

beccadax added a commit to swiftlang/llvm-project that referenced this pull request Jul 26, 2024

Revert "[rebranch] Temporarily revert "[Serialization] No transitive …

22f3abe

…identifier change (llvm#92085)""

beccadax mentioned this pull request Jul 26, 2024

Revert "[rebranch] Temporarily revert "[Serialization] No transitive identifier change (llvm#92085)"" swiftlang/llvm-project#9017

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Serialization] No transitive identifier change #92085

[Serialization] No transitive identifier change #92085

ChuanqiXu9 commented May 14, 2024

llvmbot commented May 14, 2024 •

edited

Loading

Motivation example

Design details

Overhead

Future Plans

ChuanqiXu9 commented May 20, 2024

ChuanqiXu9 commented Jun 3, 2024

github-actions bot commented Jun 4, 2024

ChuanqiXu9 commented Jun 11, 2024

jansvoboda11 left a comment

ChuanqiXu9 commented Jun 18, 2024

[Serialization] No transitive identifier change #92085

[Serialization] No transitive identifier change #92085

Conversation

ChuanqiXu9 commented May 14, 2024

Motivation example

Design details

Overhead

Future Plans

llvmbot commented May 14, 2024 • edited Loading

Motivation example

Design details

Overhead

Future Plans

ChuanqiXu9 commented May 20, 2024

ChuanqiXu9 commented Jun 3, 2024

github-actions bot commented Jun 4, 2024

ChuanqiXu9 commented Jun 11, 2024

jansvoboda11 left a comment

Choose a reason for hiding this comment

ChuanqiXu9 commented Jun 18, 2024

llvmbot commented May 14, 2024 •

edited

Loading