Module loading

Modules

If V8 is the engine of Node.js, npm is its soul!

npm is the world's largest module repository. Let's take a look at some data:

  • Approximately 210,000 modules

  • Billions of module downloads per day

  • Billions of module downloads per week

This success led to the founding of npm, Inc., the company that manages npm packages and runs npmjs.com.

Module Loading Preparation Operations

Strictly speaking, there are several types of modules in Node:

  • Builtin modules: Modules provided in C++ format within Node, such as tcp_wrap and contextify.

  • Constants modules: Modules that define constants within Node. They export definitions such as signal numbers, OpenSSL options, and file access flags, for example O_RDONLY and O_CREAT for file access, or SIGHUP and SIGINT for signals.

  • Native modules: Modules provided in JavaScript format within Node, such as http, https and fs. Some native modules rely on builtin modules behind the scenes. For example, the buffer native module still needs node_buffer.cc from the builtin modules to allocate and manage large blocks of memory outside the V8 heap size limit.

  • Third-party modules: All modules that are not shipped with Node, such as express.

Builtin Module and Native Module Generation Process

The generation process for native JS modules is relatively complex. After you download the Node source code and compile it, a file named node_natives.h is generated in the out/Release/obj/gen directory.

This file is generated by js2c.py, which converts every character of each JavaScript file under the lib directory, along with node.js under the src directory, into its ASCII code and stores the codes in one array per file.


namespace node {

const char node_native[] = {47, 47, 32, 67, 112 …};

const char console_native[] = {47, 47, 32, 67, 112 …};

const char buffer_native[] = {47, 47, 32, 67, 112 …};

}

struct _native {
  const char* name;
  const char* source;
  size_t source_len;
};

static const struct _native natives[] = {
  { "node", node_native, sizeof(node_native)-1 },
  { "dgram", dgram_native, sizeof(dgram_native)-1 },
  { "console", console_native, sizeof(console_native)-1 },
  { "buffer", buffer_native, sizeof(buffer_native)-1 },
};
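
As a rough illustration of the conversion js2c.py performs on each file, the following JavaScript sketch turns a source string into an array of character codes and back again. The helper names toCharCodes and fromCharCodes are made up for this example, and it assumes ASCII-only source text; the real tool is a Python script that emits C arrays.

```javascript
// Hypothetical helpers; js2c.py itself is a Python script that emits C arrays.
function toCharCodes(source) {
  // One numeric code per character, like the entries of node_native[].
  // Assumes ASCII-only input, which keeps one code per character.
  return Array.from(source, (ch) => ch.charCodeAt(0));
}

function fromCharCodes(codes) {
  // Reverse step: rebuild the original text from the stored codes.
  return String.fromCharCode(...codes);
}

const codes = toCharCodes('// Hi');
console.log(codes);                // [ 47, 47, 32, 72, 105 ]
console.log(fromCharCodes(codes)); // // Hi
```

The leading 47, 47, 32 in the generated arrays is simply the "// " that opens a comment at the top of each source file.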

The generation process for builtin C++ modules is relatively simple. The entry point of each builtin C++ module is expanded into a function by the NODE_MODULE_CONTEXT_AWARE_BUILTIN macro. For example, the tcp_wrap module expands to static void _register_tcp_wrap(void) __attribute__((constructor)). Those familiar with GCC know that functions marked with __attribute__((constructor)) run before main(), which means the builtin C++ modules are added to the modlist_builtin linked list before Node's main() function executes. modlist_builtin is a pointer of type struct node_module, and get_builtin_module() traverses the list to find the requested module.
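
The registration list can be modelled in JavaScript as a singly linked list. This is only a sketch of the idea: the camelCase function names are illustrative, and it assumes each new module is pushed onto the head of the list, with nm_modname and nm_link mirroring the fields of struct node_module.

```javascript
// A JavaScript model of the C++ registration list; names are illustrative.
let modlistBuiltin = null; // head of the singly linked list

function nodeModuleRegister(mod) {
  // Assumption: new modules are pushed onto the front of the list.
  mod.nm_link = modlistBuiltin;
  modlistBuiltin = mod;
}

function getBuiltinModule(name) {
  // Walk the list until a module with a matching name is found.
  let mod = modlistBuiltin;
  while (mod !== null && mod.nm_modname !== name) {
    mod = mod.nm_link;
  }
  return mod; // null when nothing matches
}

// In C++ these calls come from constructor functions that run before main().
nodeModuleRegister({ nm_modname: 'tcp_wrap', nm_link: null });
nodeModuleRegister({ nm_modname: 'contextify', nm_link: null });

console.log(getBuiltinModule('tcp_wrap').nm_modname); // tcp_wrap
console.log(getBuiltinModule('missing'));             // null
```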

Whether they are native JS modules or builtin C++ modules, all Node-provided modules are ultimately embedded at compile time in the ELF-format executable named node.

However, they are extracted differently. JS modules are retrieved with process.binding("natives"), while C++ modules are looked up directly with get_builtin_module(). This part is discussed in section 1.2.

Module binding

node.cc provides a Binding() function. When an application or one of Node's built-in modules calls require() to reference another module, Binding() is what supports it behind the scenes. How it supports require() is discussed later; here we mainly analyze the function itself.

static void Binding(const FunctionCallbackInfo<Value>& args) {
  Environment* env = Environment::GetCurrent(args);

  Local<String> module = args[0]->ToString(env->isolate());
  node::Utf8Value module_v(env->isolate(), module);

  Local<Object> cache = env->binding_cache_object();
  Local<Object> exports;

  if (cache->Has(module)) {
    exports = cache->Get(module)->ToObject(env->isolate());
    args.GetReturnValue().Set(exports);
    return;
  }

  // Append a string to process.moduleLoadList
  char buf[1024];
  snprintf(buf, sizeof(buf), "Binding %s", *module_v);

  Local<Array> modules = env->module_load_list_array();
  uint32_t l = modules->Length();
  modules->Set(l, OneByteString(env->isolate(), buf));

  node_module* mod = get_builtin_module(*module_v);
  if (mod != nullptr) {
    exports = Object::New(env->isolate());
    // Internal bindings don't have a "module" object, only exports.
    CHECK_EQ(mod->nm_register_func, nullptr);
    CHECK_NE(mod->nm_context_register_func, nullptr);
    Local<Value> unused = Undefined(env->isolate());
    // for builtin module
    mod->nm_context_register_func(exports, unused,
      env->context(), mod->nm_priv);
    cache->Set(module, exports);
  } else if (!strcmp(*module_v, "constants")) {
    exports = Object::New(env->isolate());
    // for constants
    DefineConstants(exports);
    cache->Set(module, exports);
  } else if (!strcmp(*module_v, "natives")) {
    exports = Object::New(env->isolate());
    // for native module
    DefineJavaScript(env, exports);
    cache->Set(module, exports);
  } else {
    char errmsg[1024];
    snprintf(errmsg,
             sizeof(errmsg),
             "No such module: %s",
             *module_v);
    return env->ThrowError(errmsg);
  }

  args.GetReturnValue().Set(exports);
}

Module loading

  1. Builtin modules have the highest priority. For any module to be bound, Binding() first searches the modlist_builtin list, traversing it until it finds a module with the same name. Once the module is found, its registration function runs and fills in the exports object that is returned. For builtin modules, exports holds the interface names exposed by the C++ module and their corresponding code. For the tcp_wrap module, for example, the contents of exports can be pictured as: {"TCP": "/function code of TCPWrap entrance/", "TCPConnectWrap": "/function code of TCPConnectWrap entrance/" }.

  2. The constants module has the second highest priority. Node's constants are exported through it, in the following format: {"SIGHUP":1, "SIGKILL":9, "SSL_OP_ALL": 0x80000BFFL}

  3. For native modules, every array except node_native in Figure 3 is exported to exports, in the following format: {"_debugger": _debugger_native, "module": module_native, "config": config_native}. Here _debugger_native, module_native, and so on are array names, that is, memory addresses.

Comparing the exports structures of these three module types shows that the property values mean completely different things. For builtin modules, the value of the TCP property is a function entry point; for the constants module, the value of SIGHUP is a number; and for native modules, the value of _debugger is a memory address (more precisely, an address in the .rodata segment).
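
The lookup order described above can be sketched in JavaScript as follows. This is a simplified model, not Node's code: the builtinModules and nativeSources tables, their contents, and the binding function are stand-ins invented for the example, and native sources are shown as strings rather than memory addresses.

```javascript
// Simplified model of Binding(); these tables are stand-ins, not Node's real data.
const builtinModules = new Map([
  ['tcp_wrap', (exports) => { exports.TCP = function TCP() {}; }],
]);
const nativeSources = { module: '/* text of lib/module.js */' };
const bindingCache = new Map();

function binding(name) {
  if (bindingCache.has(name)) return bindingCache.get(name); // cache hit

  const exports = {};
  if (builtinModules.has(name)) {
    builtinModules.get(name)(exports);     // 1. builtin: run the register function
  } else if (name === 'constants') {
    Object.assign(exports, { SIGHUP: 1, SIGKILL: 9 }); // 2. constants: numbers
  } else if (name === 'natives') {
    Object.assign(exports, nativeSources); // 3. natives: one entry per JS module
  } else {
    throw new Error('No such module: ' + name);
  }
  bindingCache.set(name, exports);
  return exports;
}

console.log(typeof binding('tcp_wrap').TCP);   // function
console.log(binding('constants').SIGHUP);      // 1
console.log(typeof binding('natives').module); // string
```

The same three property reads return a function, a number, and source data respectively, which is the difference in meaning the comparison above points out.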

Module Loading

Let's start with var http = require('http');.

How does require work, why is it available without ever being declared, and what does it actually do?

The following code is from lib/module.js:

// Loads a module at the given file path. Returns that module's
// `exports` property.
Module.prototype.require = function(path) {
  assert(path, 'missing path');
  assert(typeof path === 'string', 'path must be a string');
  return Module._load(path, this);
};

First, the assert module is used to check that the path variable is present and is a string.

// Check the cache for the requested file.
// 1. If a module already exists in the cache: return its exports object.
// 2. If the module is native: call `NativeModule.require()` with the
//    filename and return the result.
// 3. Otherwise, create a new module for the file and save it to the cache.
//    Then have it load  the file contents before returning its exports
//    object.
Module._load = function(request, parent, isMain) {
  if (parent) {
    debug('Module._load REQUEST %s parent: %s', request, parent.id);
  }

  var filename = Module._resolveFilename(request, parent);

  var cachedModule = Module._cache[filename];
  if (cachedModule) {
    return cachedModule.exports;
  }

  if (NativeModule.nonInternalExists(filename)) {
    debug('load native module %s', request);
    return NativeModule.require(filename);
  }

  var module = new Module(filename, parent);

  if (isMain) {
    process.mainModule = module;
    module.id = '.';
  }

  Module._cache[filename] = module;

  var hadException = true;

  try {
    module.load(filename);
    hadException = false;
  } finally {
    if (hadException) {
      delete Module._cache[filename];
    }
  }

  return module.exports;
};

In other words, Module._load first checks the cache for the requested file:

  1. If a module already exists in the cache: return its exports object.

  2. If the module is native: call NativeModule.require() with the filename and return the result.

  3. Otherwise, create a new module for the file and save it to the cache. Then have it load the file contents before returning its exports object.
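
A minimal in-memory sketch of steps 1 and 3 may help here. It is not the code from lib/module.js: the files table and the load function are hypothetical, module bodies are plain functions instead of files on disk, and the native-module branch (step 2) is left out.

```javascript
// In-memory stand-in for Module._load; `files` and `load` are hypothetical names.
const files = {
  '/app/ok.js': (module) => { module.exports.value = 42; },
  '/app/bad.js': () => { throw new Error('boom'); },
};
const cache = {};

function load(filename) {
  const cached = cache[filename];
  if (cached) return cached.exports;          // step 1: cache hit

  const module = { id: filename, exports: {}, loaded: false };
  cache[filename] = module;                   // cached before the body runs

  let hadException = true;
  try {
    files[filename](module);                  // step 3: run the module body
    module.loaded = true;
    hadException = false;
  } finally {
    if (hadException) delete cache[filename]; // never keep a half-loaded failure
  }
  return module.exports;
}

console.log(load('/app/ok.js').value);                  // 42
console.log(load('/app/ok.js') === load('/app/ok.js')); // true
```

Putting the module into the cache before its body runs is the detail that matters: a second request for the same file during loading gets the existing object instead of starting another load.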

Let's dig deeper into the code and follow the call into NativeModule.require.

  NativeModule.require = function(id) {
    if (id == 'native_module') {
      return NativeModule;
    }

    var cached = NativeModule.getCached(id);
    if (cached) {
      return cached.exports;
    }

    if (!NativeModule.exists(id)) {
      throw new Error('No such native module ' + id);
    }

    process.moduleLoadList.push('NativeModule ' + id);

    var nativeModule = new NativeModule(id);

    nativeModule.cache();
    nativeModule.compile();

    return nativeModule.exports;
  };

As we can see, caching is a strategy that runs throughout the implementation of Node.

  • If the module is already in the cache, its exports object is returned directly.

  • If not, it is added to the moduleLoadList array, and a new NativeModule object is created.

The following line is the most crucial:

nativeModule.compile();

The implementation details are in src/node.js:

NativeModule.getSource = function(id) {
  return NativeModule._source[id];
};

NativeModule.wrap = function(script) {
  return NativeModule.wrapper[0] + script + NativeModule.wrapper[1];
};

NativeModule.wrapper = [
  '(function (exports, require, module, __filename, __dirname) {',
  '\n});'
];

NativeModule.prototype.compile = function() {
  var source = NativeModule.getSource(this.id);
  source = NativeModule.wrap(source);

  var fn = runInThisContext(source, {
    filename: this.filename,
    lineOffset: 0
  });
  fn(this.exports, NativeModule.require, this, this.filename);

  this.loaded = true;
};

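The wrap function wraps the source of a module such as http.js in a function expression. runInThisContext then compiles the wrapped source and returns the resulting function fn, which is called with exports, require, module, and filename as its arguments, in that order.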
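
The same wrap-then-call sequence can be tried with the public vm module. This is a sketch under assumptions: compileSource is a hypothetical helper, the module object is a plain literal rather than a NativeModule instance, and the ordinary require is passed in place of NativeModule.require.

```javascript
const vm = require('vm');

// Same wrapper shape as NativeModule.wrapper.
const wrapper = [
  '(function (exports, require, module, __filename, __dirname) { ',
  '\n});',
];

// `compileSource` is a hypothetical helper, not part of Node.
function compileSource(source, filename) {
  const module = { exports: {}, filename, loaded: false };
  // Evaluating the wrapped text yields a function; the module body has not run yet.
  const fn = vm.runInThisContext(wrapper[0] + source + wrapper[1], { filename });
  // Calling the function runs the body with its own exports/module bindings.
  fn(module.exports, require, module, filename);
  module.loaded = true;
  return module;
}

const mod = compileSource('exports.greet = function () { return "hi"; };', 'demo.js');
console.log(mod.exports.greet()); // hi
console.log(mod.loaded);          // true
```

Because the module body runs inside a function, names such as exports and module are ordinary parameters rather than globals.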

process

Let's look at how the process variable is passed from the underlying C++ to JavaScript in Node.js. When Node.js starts, the program first sets up the process object: Handle<Object> process = SetupProcessObject(argc, argv); It then passes process as an argument to the function returned by the main JavaScript program in src/node.js, which is how process reaches JavaScript.

// node.cc

// Get the converted src/node.js source code through MainSource() and execute it
Local<Value> f_value = ExecuteString(MainSource(), IMMUTABLE_STRING("node.js"));

// The result of executing src/node.js is a function, as can be seen from the node.js source code:
// node.js
// (function(process) {
//     global = this;
//     …
// })
Local<Function> f = Local<Function>::Cast(f_value);

// Create a function execution environment, call the function, and pass in process
Local<Object> global = v8::Context::GetCurrent()->Global();

Local<Value> args[1] = {
  Local<Value>::New(process)
};

f->Call(global, 1, args);
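
The two steps in the C++ snippet, executing a source text that evaluates to a function and then calling that function with process, can be mimicked from JavaScript. This is only an analogy: mainSource and fakeProcess are stand-ins invented here, and vm.runInThisContext plays the role of ExecuteString.

```javascript
const vm = require('vm');

// Stand-in for the text of src/node.js: evaluating it produces a function.
const mainSource = '(function (process) { return "argv0=" + process.argv[0]; })';

// Step 1: execute the source; the result is a function value, not its return value.
const f = vm.runInThisContext(mainSource, { filename: 'node.js' });

// Step 2: call that function with the global object as `this`, passing process in.
const fakeProcess = { argv: ['node'] }; // hypothetical stand-in for the C++-built object
const result = f.call(globalThis, fakeProcess);

console.log(typeof f); // function
console.log(result);   // argv0=node
```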

vm

What is runInThisContext?

runInThisContext is a function provided by the contextify module in Node.js. It compiles a string of JavaScript code, runs it in the current context, and returns the result. It is similar to eval, except that the code runs at global scope and cannot see the caller's local variables. It is not a security mechanism and should not be used to run untrusted code.
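
To make the comparison with eval concrete, here is a small experiment using the public vm module, whose runInThisContext is assumed to behave like the internal function discussed here. The compare function and the variable localOnly exist only for this example.

```javascript
const vm = require('vm');

function compare() {
  const localOnly = 'visible to eval'; // a local variable, not a global
  return {
    // Direct eval runs inside this function and can see localOnly.
    viaEval: eval('typeof localOnly'),
    // runInThisContext runs at global scope and cannot see it.
    viaVm: vm.runInThisContext('typeof localOnly'),
  };
}

console.log(compare()); // { viaEval: 'string', viaVm: 'undefined' }
```

The code passed to runInThisContext still shares the current global object, so it can read and modify globals.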

  var ContextifyScript = process.binding('contextify').ContextifyScript;
  function runInThisContext(code, options) {
    var script = new ContextifyScript(code, options);
    return script.runInThisContext();
  }

  • In the Binding function of node.cc, the module is registered with the following call: mod->nm_context_register_func(exports, unused, env->context(), mod->nm_priv);

Let's take a look at the definition of the mod data structure in node.h:

struct node_module {
  int nm_version;
  unsigned int nm_flags;
  void* nm_dso_handle;
  const char* nm_filename;
  node::addon_register_func nm_register_func;
  node::addon_context_register_func nm_context_register_func;
  const char* nm_modname;
  void* nm_priv;
  struct node_module* nm_link;
};

node.h also contains the following macro definitions. Let's keep reading!

#define NODE_MODULE_CONTEXT_AWARE_X(modname, regfunc, priv, flags)    \
  extern "C" {                                                        \
    static node::node_module _module =                                \
    {                                                                 \
      NODE_MODULE_VERSION,                                            \
      flags,                                                          \
      NULL,                                                           \
      __FILE__,                                                       \
      NULL,                                                           \
      (node::addon_context_register_func) (regfunc),                  \
      NODE_STRINGIFY(modname),                                        \
      priv,                                                           \
      NULL                                                            \
    };                                                                \
    NODE_C_CTOR(_register_ ## modname) {                              \
      node_module_register(&_module);                                 \
    }                                                                 \
  }
  
#define NODE_MODULE_CONTEXT_AWARE_BUILTIN(modname, regfunc)           \
  NODE_MODULE_CONTEXT_AWARE_X(modname, regfunc, NULL, NM_F_BUILTIN)

  • node_contextify.cc contains the following macro call, which finally makes things clear. Combined with the previous points, it binds the nm_context_register_func field of node_module to node::InitContextify.

NODE_MODULE_CONTEXT_AWARE_BUILTIN(contextify, node::InitContextify);

Following the code from node_module_register(&_module); onward, the chain is: process.binding('contextify') --> mod->nm_context_register_func(exports, unused, env->context(), mod->nm_priv); --> node::InitContextify().

During this initialization, the call env->SetProtoMethod(script_tmpl, "runInThisContext", RunInThisContext); binds the JavaScript method runInThisContext to the C++ function RunInThisContext.


With that, the native module is loaded and this.loaded is set to true.

Summary

Node.js solves the problem of infinite circular references through caching, which is also an important means of system optimization. By trading space for time, it makes module loading very efficient.
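
The effect of the cache on circular references can be observed with two throwaway files that require each other. This is a sketch: the file names a.js and b.js, their contents, and the temporary directory are all made up for the experiment.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Write two modules that require each other into a temporary directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'cycle-'));
fs.writeFileSync(path.join(dir, 'a.js'), [
  'exports.done = false;',
  'const b = require("./b.js");',
  'exports.sawBDone = b.done;',
  'exports.done = true;',
].join('\n'));
fs.writeFileSync(path.join(dir, 'b.js'), [
  'exports.done = false;',
  'const a = require("./a.js"); // served from the cache, only partly filled in',
  'exports.sawADone = a.done;',
  'exports.done = true;',
].join('\n'));

const a = require(path.join(dir, 'a.js'));
const b = require(path.join(dir, 'b.js'));

console.log(a.sawBDone); // true: b had finished loading
console.log(b.sawADone); // false: a was still loading when b asked for it
```

Instead of recursing forever, the inner require of a.js returns the partially populated exports object that is already in the cache.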

In real-world development, looking at the heap shows that Node caches a large number of modules after startup, including third-party modules, some of which may be loaded and used only once. I think a module unloading mechanism [1] is needed to reduce V8 heap usage and make subsequent garbage collection more efficient.
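
For user-land modules, the closest thing available today is deleting an entry from require.cache, which makes the next require evaluate the file again. The sketch below only demonstrates that behaviour with a made-up temporary file once.js; it is not a full unloading mechanism, since other references to the old module object may still keep it alive.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// A throwaway module whose exports object is created fresh on every load.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'unload-'));
const file = path.join(dir, 'once.js');
fs.writeFileSync(file, 'module.exports = { token: {} };');

const first = require(file);
const second = require(file);  // served from require.cache

// Drop the cache entry so the next require evaluates the file again.
delete require.cache[require.resolve(file)];
const third = require(file);   // loaded and evaluated a second time

console.log(first === second); // true
console.log(first === third);  // false
```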

Reference

[1] https://github.com/nodejs/node/issues/5895
