Module loading
If V8 is the engine of Node.js, npm is its soul!
npm is the world's largest module repository. Let's take a look at some data:
Approximately 210,000 modules
Billions of module downloads per day
Billions of module downloads per week
This led to the creation of a company dedicated to managing npm packages: npm, Inc., which runs npmjs.com.
Strictly speaking, there are several types of modules in Node:
Builtin modules: Modules provided in C++ format within Node, such as tcp_wrap and contextify.
Constants modules: Modules that define Node's constants, exporting definitions for signals, the OpenSSL library, file access flags and so on. Examples include O_RDONLY and O_CREAT for file access, and SIGHUP and SIGINT for signals.
Native modules: Modules provided in JavaScript format within Node such as http, https and fs. Some native modules require builtin modules to implement their functionality behind-the-scenes. For example, the buffer native module still requires node_buffer.cc from builtin to achieve large memory allocation and management outside V8 memory size usage restrictions.
Third-party modules: Everything that is not provided by Node itself, such as express.
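The categories above can be glimpsed from ordinary JavaScript. The snippet below is a minimal sketch that assumes a reasonably recent Node release, where `require('module').builtinModules`, `fs.constants` and `os.constants.signals` are public APIs. These are today's public surfaces, not the internal bindings this article walks through.

```javascript
const { builtinModules } = require('module');
const fs = require('fs');
const os = require('os');

// Native (JavaScript) modules shipped inside the node binary.
const nativeNames = ['http', 'https', 'fs'].filter((name) => builtinModules.includes(name));

// Constants are plain numbers: file flags and signal numbers.
const fileFlags = { O_RDONLY: fs.constants.O_RDONLY, O_CREAT: fs.constants.O_CREAT };
const signals = { SIGHUP: os.constants.signals.SIGHUP, SIGINT: os.constants.signals.SIGINT };

console.log(nativeNames, fileFlags, signals);
```

The numeric values of some constants differ between platforms, so the sketch prints them rather than assuming them.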
Generating the native JS modules is relatively complex. After you download the Node source code and compile it, a file named node_natives.h is generated in the out/Release/obj/gen directory.
This file is generated by js2c.py, which converts every JavaScript file under the lib directory, along with node.js under the src directory, character by character into the corresponding ASCII codes and stores them in one array per file.
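To make the conversion concrete, here is a small sketch of the idea in JavaScript. It is not js2c.py itself: the helper names `toCharCodes`, `fromCharCodes` and `toCArray` are hypothetical, and the exact C declaration that the real script emits is an assumption. Only the `<name>_native` naming follows the array names mentioned later in this article.

```javascript
// Hypothetical helper: turn ASCII source text into a list of character codes.
function toCharCodes(source) {
  const codes = [];
  for (let i = 0; i < source.length; i++) {
    codes.push(source.charCodeAt(i));
  }
  return codes;
}

// Hypothetical helper: the reverse step, recovering the source text.
function fromCharCodes(codes) {
  return String.fromCharCode(...codes);
}

// Hypothetical helper: render the codes as a C array (declaration shape assumed).
function toCArray(id, codes) {
  return `static const char ${id}_native[] = { ${codes.join(', ')} };`;
}

const source = 'exports.answer = 42;';
const codes = toCharCodes(source);
console.log(toCArray('answer', codes));
console.log(fromCharCodes(codes) === source);
```

Storing the text as numbers means the JavaScript source travels inside the compiled binary and can be turned back into a string when a module is required.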
Generating the builtin C++ modules is relatively simple. The entry point of each builtin C++ module is expanded into a function by the macro NODE_MODULE_CONTEXT_AWARE_BUILTIN. For example, the tcp_wrap module is expanded to static void _register_tcp_wrap(void) __attribute__((constructor)). Those familiar with GCC know that functions marked with __attribute__((constructor)) run before node's main() function, which means the builtin C++ modules are already linked into the modlist_builtin list before main() executes. modlist_builtin is a pointer of type struct node_module, and get_builtin_module() traverses the list to find the required module.
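The register-then-traverse pattern can be modelled in a few lines of JavaScript. This is only a sketch of the data structure, not Node's C++ code: `registerBuiltin` and `getBuiltinModule` are hypothetical stand-ins for the registration constructor and get_builtin_module(), and the field names `nm_modname` and `nm_link` are assumed to mirror struct node_module.

```javascript
// Head of a singly linked list, standing in for modlist_builtin.
let modlistBuiltin = null;

// Stand-in for what each _register_<name>() constructor does before main():
// prepend a node_module-like record to the list.
function registerBuiltin(name, registerFunc) {
  modlistBuiltin = {
    nm_modname: name,
    nm_context_register_func: registerFunc,
    nm_link: modlistBuiltin, // next record in the list
  };
}

// Stand-in for get_builtin_module(): walk the list and compare names.
function getBuiltinModule(name) {
  for (let mod = modlistBuiltin; mod !== null; mod = mod.nm_link) {
    if (mod.nm_modname === name) return mod;
  }
  return null;
}

registerBuiltin('tcp_wrap', (exports) => { exports.TCP = function TCP() {}; });
registerBuiltin('contextify', (exports) => { exports.ContextifyScript = function ContextifyScript() {}; });

console.log(getBuiltinModule('tcp_wrap').nm_modname);
console.log(getBuiltinModule('missing'));
```

Because every record is linked in before any lookup happens, the lookup itself never needs to load anything; it only compares names.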
Whether they are native JS or builtin C++, the modules provided by Node are ultimately embedded at compile time into the executable file named node, an ELF-format binary.
However, their extraction methods differ. For JS modules we use process.binding("natives"), while for C++ modules we directly use get_builtin_module(). This part will be discussed in section 1.2.
node.cc provides a Binding() function. When our application or Node's built-in modules call require() to reference another module, Binding() is what does the work behind the scenes. How it supports require() is discussed later; here we mainly analyze the function itself.
Module loading
Builtin modules have the highest priority. Any module that needs to be bound is first looked up in the modlist_builtin list. The search is very simple: traverse the list and find the module with the same name. Once the module is found, its registration function is executed first, and an important piece of data, exports, is returned. For builtin modules, the exports object contains the interface names exposed by the builtin C++ module and their corresponding code. For example, for the tcp_wrap module, the contents of exports can be represented in the following format: {"TCP": "/function code of TCPWrap entrance/", "TCPConnectWrap": "/function code of TCPConnectWrap entrance/" }.
The constants module has the second highest priority. Node's constants are exported through constants. The exports format is as follows: {"SIGHUP":1, "SIGKILL":9, "SSL_OP_ALL": 0x80000BFFL}
For native modules, every array except node_native in Figure 3 is exported to exports. The format is as follows: {"_debugger": _debugger_native, "module": module_native, "config": config_native}. Here _debugger_native, module_native and so on are array names, that is, memory addresses.
Comparing the exports structures of these three types of modules shows that the property values mean completely different things. For builtin modules, the value of the TCP property is a function entry point; for the constants module, the value of SIGHUP is a number; and for native modules, the value of _debugger is a memory address (more accurately, an address in the .rodata segment).
Let's start with var http = require('http');.
How does require work, why can we use it out of nowhere, and what does it actually do?
The relevant code is in lib/module.js. First, the assert module is used to check that the path variable is present and is a string.
Check the cache for the requested file.
If a module already exists in the cache: return its exports object.
If the module is native: call NativeModule.require() with the filename and return the result.
Otherwise, create a new module for the file and save it to the cache. Then have it load the file contents before returning its exports object.
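The steps above can be sketched as follows. This is a simplified model, not the code from lib/module.js: `requireSketch`, `load`, the `nativeModules` table and the in-memory `files` table are all hypothetical, and filename resolution is left out entirely.

```javascript
const cache = Object.create(null);

// Hypothetical stand-ins for native modules and for files on disk.
const nativeModules = { greeting: { hello: 'world' } };
const files = { '/app/a.js': 'module.exports = { name: "a" };' };

function requireSketch(path) {
  // Step 1: the path must be present and must be a string.
  if (!path) throw new Error('missing path');
  if (typeof path !== 'string') throw new TypeError('path must be a string');
  return load(path);
}

function load(filename) {
  // Step 2: a cached module returns its exports immediately.
  const cached = cache[filename];
  if (cached) return cached.exports;

  // Step 3: native modules are handled by a separate loader.
  if (Object.prototype.hasOwnProperty.call(nativeModules, filename)) {
    return nativeModules[filename];
  }

  if (!Object.prototype.hasOwnProperty.call(files, filename)) {
    throw new Error('Cannot find module ' + filename);
  }

  // Step 4: create the module, cache it, then load the file contents.
  const mod = { id: filename, exports: {}, loaded: false };
  cache[filename] = mod;
  const body = new Function('exports', 'module', files[filename]);
  body(mod.exports, mod);
  mod.loaded = true;
  return mod.exports;
}

console.log(requireSketch('/app/a.js').name);
console.log(requireSketch('/app/a.js') === requireSketch('/app/a.js'));
```

Note that the module is placed in the cache before its body runs, which is what lets a second request for the same file return without loading it again.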
Let's dig deeper into the code and follow the call into NativeModule.require.
As we can see, caching is a strategy that runs throughout the implementation of Node.
If the module is already in the cache, its exports object is returned directly.
If not, it is added to the moduleLoadList array, and a new NativeModule object is created.
The next step is the most crucial one: the new module is compiled. The implementation details are in node.js.
The wrap function wraps http.js and compiles the source code using runInThisContext, returning the fn function, which then receives the arguments in sequence.
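The cache check, the moduleLoadList entry and the wrap-then-compile step can be put together in one small model. It is a sketch built on the public `vm.runInThisContext` API rather than the internal contextify binding, and it is not the code from node.js: the `sources` table is a hypothetical stand-in for the embedded module sources, and the wrapper text is assumed to resemble the one Node uses.

```javascript
const vm = require('vm');

// Hypothetical stand-in for the embedded sources of native modules.
const sources = {
  greet: "exports.hello = function (name) { return 'hello ' + name; };",
};

const moduleLoadList = [];

function NativeModule(id) {
  this.id = id;
  this.filename = id + '.js';
  this.exports = {};
  this.loaded = false;
}

NativeModule._cache = {};

// Assumed wrapper shape: the module source becomes the body of a function.
NativeModule.wrapper = [
  '(function (exports, require, module, __filename, __dirname) { ',
  '\n});',
];

NativeModule.wrap = function (script) {
  return NativeModule.wrapper[0] + script + NativeModule.wrapper[1];
};

NativeModule.require = function (id) {
  const cached = NativeModule._cache[id];
  if (cached) return cached.exports;
  if (!Object.prototype.hasOwnProperty.call(sources, id)) {
    throw new Error('No such native module ' + id);
  }
  moduleLoadList.push('NativeModule ' + id);
  const nativeModule = new NativeModule(id);
  NativeModule._cache[id] = nativeModule; // cache before compiling
  nativeModule.compile();
  return nativeModule.exports;
};

NativeModule.prototype.compile = function () {
  const wrapped = NativeModule.wrap(sources[this.id]);
  // Compiling the wrapped text yields a function: the "fn" of the text above.
  const fn = vm.runInThisContext(wrapped, { filename: this.filename });
  fn(this.exports, NativeModule.require, this, this.filename);
  this.loaded = true;
};

const greet = NativeModule.require('greet');
console.log(greet.hello('node'));
```

Wrapping the source in a function is what gives each module its own `exports`, `require` and `module` names without leaking them into the global scope.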
Let's take a look at how the process variable is passed from the underlying C++ to JavaScript in Node.js. When Node.js first runs, the program sets up the process object: Handle<Object> process = SetupProcessObject(argc, argv);
Then it passes process as an argument to the function returned by the main JavaScript program in src/node.js, which is how process reaches JavaScript.
What is runInThisContext?
runInThisContext is a function provided by the contextify module in Node.js. It compiles a string of JavaScript code and runs it in the current context; in the case above, the result of running the wrapped source is the function fn. This is similar to the eval function, except that the code can see the current global object but not the local scope of the caller. It is not a security mechanism and should not be used to run untrusted code.
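One concrete difference from eval is scope, and it can be observed through the public `vm` module. The snippet below is a minimal sketch; the function name `compareScopes` and the variable `demoLocalLabel` are made up for the illustration.

```javascript
const vm = require('vm');

function compareScopes() {
  // A local variable that only exists inside this function.
  let demoLocalLabel = 'local';

  // Direct eval runs inside the enclosing function scope.
  const seenByEval = eval('typeof demoLocalLabel');

  // runInThisContext compiles against the global scope only.
  const seenByVm = vm.runInThisContext('typeof demoLocalLabel');

  return { seenByEval, seenByVm };
}

console.log(compareScopes());
```

The eval call reports `string` because it can reach the local variable, while the runInThisContext call reports `undefined` because no global of that name exists.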
In the Binding function of node.cc, the module is registered using the following call: mod->nm_context_register_func(exports, unused, env->context(), mod->nm_priv);
Let's take a look at the definition of the mod data structure in node.h:
There are also the following macro definitions in node.h, let's keep reading!
There is a macro call in node_contextify.cc, which finally makes it clear! Combining the previous points, it actually binds nm_context_register_func of node_module with node::InitContextify.
We trace back up the code, from node_module_register(&_module);, to process.binding('contextify') --> mod->nm_context_register_func(exports, unused, env->context(), mod->nm_priv); --> node::InitContextify().
By using env->SetProtoMethod(script_tmpl,"runInThisContext", RunInThisContext);, the runInThisContext function is bound to RunInThisContext.
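From the JavaScript side, the effect of registering a prototype method can be checked through the public `vm.Script` class. This is a small sketch that assumes `vm.Script` is the public face of the script template set up in C++; the file name `answer.js` is made up.

```javascript
const vm = require('vm');

// The method lives on the prototype, so every Script instance shares it.
const hasProtoMethod = typeof vm.Script.prototype.runInThisContext === 'function';

// Compile once, then run the compiled script in the current context.
const script = new vm.Script('6 * 7', { filename: 'answer.js' });
const result = script.runInThisContext();

console.log(hasProtoMethod, result);
```

Separating compilation (`new vm.Script`) from execution (`runInThisContext`) lets the same compiled script be run more than once.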
This successfully loads the native module and marks this.loaded = true.
Summary
Node.js solves the problem of infinite circular references through caching, which is an important means of system optimization. By trading space for time, loading modules becomes very efficient.
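How caching stops a circular reference from looping forever can be shown with a tiny loader. This is a sketch under stated assumptions, not Node's loader: the two module sources, the `load` function and the `executions` counter are all made up for the illustration.

```javascript
// Two hypothetical modules that require each other.
const sources = {
  a: "exports.name = 'a'; const b = require('b'); exports.sawB = b.name;",
  b: "exports.name = 'b'; const a = require('a'); exports.sawA = a.name;",
};

const cache = {};
let executions = 0;

function load(id) {
  // A module that is already cached is never executed again.
  if (cache[id]) return cache[id].exports;

  const mod = { exports: {}, loaded: false };
  cache[id] = mod; // cache first, so a cycle finds this entry
  executions += 1;

  const body = new Function('exports', 'require', 'module', sources[id]);
  body(mod.exports, load, mod);
  mod.loaded = true;
  return mod.exports;
}

const a = load('a');
console.log(a.sawB, cache.b.exports.sawA, executions);
```

When b asks for a, it receives a's partially filled exports from the cache instead of starting a again, so each module body runs exactly once.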
In actual business development, looking at the heap after startup shows that Node caches a large number of modules, including third-party modules, some of which may be loaded and used only once. I think a module unloading mechanism [1] is needed to reduce V8 heap usage and make subsequent garbage collection more efficient.
Reference
[1] https://github.com/nodejs/node/issues/5895