Introduction
This post aims at explaining the basics of the JVM, while also giving more in-depth information for those who are interested in it, but always without getting excessively technical. In fact, even if most newcomers prefer getting their hands dirty by starting to code straight away, I believe that learning about the Java Virtual Machine is particularly important - to know what you are doing, to be better at optimizing (and troubleshooting) - but mostly, to avoid running into walls when trying to raise your level and competence. And also, because I think it's very interesting to learn about the engine that powers our applications.
The principle
The JVM is, by definition, a Virtual Machine. In other words, it is an abstract, virtualized computer with a very specific set of instructions. This definition, however, might raise a question:
So, can it run an OS such as Linux?
No, it can't. Not unless you develop a Linux emulator in Java, but that's not what we are considering. There is a fundamental difference between two kinds of virtual machines: system virtual machines, and process virtual machines.
System virtual machines
Those are your typical OS emulators: QEMU, VirtualBox, VMware. They provide complete system platforms by emulating existing architectures, which allows a complete operating system to run on top of them.
Process virtual machines
This is where the JVM resides, alongside the .NET Framework's CLR and others. Process VMs (PVMs) run normal applications inside a host OS, and each one supports a single process: it is created when that process starts, and destroyed when it ends. The purpose is to provide a platform-independent environment by abstracting the underlying hardware and OS, so that programs run the same way regardless of the platform. In other words, they let the same codebase run on any platform, without emulating the platform itself.
The JVM
The JVM, in particular, has the benefit of being supported by a huge variety of devices and hardware. This is because it is defined by a specification, maintained by Oracle. Everyone is free to make an implementation of that specification, and effectively produce a new JVM that can run bytecode. Thanks to this, there are multiple implementations, such as HotSpot (the JVM shipped with OpenJDK) and Eclipse OpenJ9, and even hardware-assisted approaches such as ARM's Jazelle.
Back to general VMs
As you can already tell, process and system VMs follow the same abstraction principles, but implement them in vastly different ways, to address different needs. We can say that a process VM (or multiple ones) can run inside a system VM, and clearly not the opposite.
While system virtual machines are general-purpose ones and, nowadays, they are even used dynamically to allocate resources and manage different services in medium to big corporations (see: Proxmox), process virtual machines are very specific and limited to a certain scope: executing a program.
The topic could fill a thread of its own, ranging from hardware-assisted virtualization to the roots of VMs in the CTSS. However, we are just trying to understand virtualization as it is applied to Java, and this was a necessary premise to give a wider view and a general idea.
The structure
The process of developing and running a Java program follows a very specific flow, which can be expanded and enriched (e.g. by Maven, deployment, obfuscation...). We are going to skip most of what happens between you writing code and the JVM creating a process to run it, since this post's purpose is to explain the JVM and we don't want to get off-topic. Basically, it consists of:
- a developer writes human-readable Java source code by using an IDE;
- a compiler "translates" this code to bytecode;
- the bytecode is sent to a JVM that creates a process and runs it.
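As a minimal sketch of the three steps above, here is a tiny program together with the standard JDK commands that drive each step. The class name HelloJvm and its greeting() helper are made up for illustration; javac, java and javap are the regular JDK tools.

```java
// HelloJvm.java - hypothetical example class, not taken from a real project.
class HelloJvm {
    static String greeting() {
        return "Hello from the JVM";
    }

    public static void main(String[] args) {
        // Step 1: this file is the human-readable source code.
        // Step 2: "javac HelloJvm.java" compiles it to bytecode (HelloJvm.class).
        // Step 3: "java HelloJvm" starts a JVM process that loads and runs that bytecode.
        // Optional: "javap -c HelloJvm" prints the bytecode instructions of the class.
        System.out.println(greeting());
    }
}
```

Looking at the javap output once is a good way to see that the .class file really contains instructions for the virtual machine, and not for your physical CPU.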
The class loader
As we already stated, Java code (Main.java) is converted into bytecode (Main.class). The bytecode is then sent to the JVM's class loader.
Loading
The class loader subsystem is responsible for translating - yet again - the bytecode into binary data, which is then saved in a space named the method area. The stored information consists of the class' fully qualified name (net.mindoverflow.tutorial.Main), the type of the class (class, interface, enum) and all of its methods, variables and constants. For each .class file, the JVM creates a java.lang.Class object that represents that particular class in memory. This allows usage of the getClass() method, which we can then chain with, e.g., getName() to get the class' fully qualified name, getSimpleName() to get just the class' name, and so on. To avoid ambiguity and redundancy, and a number of related issues (which one should we reference?), only one single Class object is created and stored in memory for each class.
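To make this concrete, here is a small sketch that inspects the Class object of java.lang.String. The LoadingDemo class and its sameClassObject() helper are hypothetical names; getClass(), getName(), getSimpleName() and isInterface() are standard methods. Strictly speaking, the "one Class object per class" rule holds per class loader, which does not matter in a simple program like this one.

```java
// Hypothetical demo class, used only to illustrate java.lang.Class.
class LoadingDemo {
    static boolean sameClassObject() {
        String a = "first";
        String b = "second";
        // Both instances share the single Class object created for java.lang.String.
        return a.getClass() == b.getClass() && a.getClass() == String.class;
    }

    public static void main(String[] args) {
        Class<?> type = "hello".getClass();
        System.out.println(type.getName());       // java.lang.String
        System.out.println(type.getSimpleName()); // String
        System.out.println(type.isInterface());   // false
        System.out.println(sameClassObject());    // true
    }
}
```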
Linking
The linking phase consists of three other steps: verification, preparation, and resolution.
Verification serves the purpose of ensuring that the .class file has been properly generated, before its bytecode is actually used. If the file is malformed or badly formatted, the Bytecode Verifier component throws a java.lang.VerifyError (which is an Error, not a regular exception). Otherwise, the file is ready for the next steps.
Preparation prepares the class for usage: the JVM allocates memory for all of the class' static variables and sets them to their default values (0, false, null...).
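The default values set during preparation are normally overwritten before you can notice them, but a small trick makes them visible. In this sketch (PreparationDemo and its members are hypothetical names), a static variable is read through a method before its own initializer has run, during the initialization step described further below.

```java
// Hypothetical demo class: observes the default value assigned during preparation.
class PreparationDemo {
    // Assigned first, by calling a method that reads 'late' before 'late' gets its value.
    static int early = peek();
    static int late = 42;

    static int peek() {
        return late; // at this point 'late' still holds its default value, 0
    }

    public static void main(String[] args) {
        System.out.println(early); // 0
        System.out.println(late);  // 42
    }
}
```

Reading the variable through a method is needed here because the compiler rejects a direct forward reference such as "static int early = late;".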
During the resolution step, all symbolic references to methods, classes, interfaces... are replaced with direct references. This is done by querying the method area and locating the referenced entities. This step is particularly complex and many errors can be thrown, but for the scope of this article, this is enough. If you want to understand it more in depth, here's the official Oracle specification.
Initialization
This is the final step of class loading. All static variables are assigned the values specified in the code, following hierarchy (from parent to child) and order (from top to bottom). At the end, all classes are ready to be used. Or technically, what we have extracted from those classes, since we won't be running the .class files themselves.
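The ordering rules can be checked with a short experiment. In this sketch, InitOrderDemo, Parent, Child and record() are made-up names: each static variable logs its own name while it is being initialized, so the log shows the order in which the JVM ran the initializers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: records the order in which static initializers run.
class InitOrderDemo {
    static final List<String> LOG = new ArrayList<>();

    static class Parent {
        static int first = record("Parent.first");
        static int second = record("Parent.second");
    }

    static class Child extends Parent {
        static int third = record("Child.third");
    }

    static int record(String name) {
        LOG.add(name);
        return LOG.size();
    }

    static List<String> order() {
        // The first use of Child triggers initialization of Parent, then of Child.
        int ignored = Child.third;
        return new ArrayList<>(LOG);
    }

    public static void main(String[] args) {
        System.out.println(order()); // [Parent.first, Parent.second, Child.third]
    }
}
```

Note that a class is initialized only once, so calling order() again returns the same log instead of extending it.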
Class loaders
There are three kinds of class loaders:
- Bootstrap loader: this is the core class loader, used by the JVM to load trusted classes such as the core Java API ones found in the JRE library. The JAVA_HOME/jre/lib/ directory is known as the bootstrap path. The loader itself is implemented in native languages such as C and C++.
- Extension loader: this loads other classes in the ext/ (JAVA_HOME/jre/lib/ext/) subdirectory of the bootstrap path, and it is a child of the bootstrap loader (from Java 9 onwards, its role is taken by the platform class loader).
- Application/System loader: finally, this loads all classes in the application's classpath, and it is a child of the extension loader.
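You can walk this parent chain yourself. The sketch below (LoaderDemo is a hypothetical name) starts from the loader of its own class and follows getParent() upwards. The exact loader names depend on the Java version, so no output is shown. Many JVMs, HotSpot included, represent the bootstrap loader as null, but that is an implementation detail and not a guarantee.

```java
// Hypothetical demo class: prints the class loader hierarchy, child first.
class LoaderDemo {
    static boolean coreClassUsesBootstrap() {
        // On HotSpot, classes loaded by the bootstrap loader report null here.
        return String.class.getClassLoader() == null;
    }

    public static void main(String[] args) {
        ClassLoader loader = LoaderDemo.class.getClassLoader();
        while (loader != null) {
            System.out.println(loader);
            loader = loader.getParent();
        }
        System.out.println("bootstrap loader (represented as null)");
        System.out.println("String loaded by bootstrap: " + coreClassUsesBootstrap());
    }
}
```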
JVM's memory
Memory is clearly the foundation of any running software, and the JVM has a very specific structure for handling its runtime data.
Method area
We have already referenced the method area earlier. Here, all class-level info is stored: class and parent class names, variables (including static ones), methods. There is only one method area for each JVM instance, and it is shared.
Heap area
Like the method area, this too is unique to each JVM instance and shared. It contains every object created by the running program.
Stack area
This is not shared: for each thread, the JVM initializes a runtime stack and stores it in the stack area. Every stack is made of blocks, and every block of each stack is called an activation record or stack frame, and traces a method call. Every local variable of those specific methods is stored in the corresponding frame. When a thread terminates, its runtime stack is deleted too.
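A rough way to see frames pile up is to count them. In this sketch (StackDemo and depthOf() are hypothetical names), every nested call adds one frame to the current thread's stack, and a second thread starts with a fresh stack of its own. Absolute numbers vary between JVMs and launchers, and the specification allows a JVM to omit frames from a stack trace, so only the difference between two measurements is meaningful.

```java
// Hypothetical demo class: counts stack frames on the current thread.
class StackDemo {
    static int depthOf(int remaining) {
        if (remaining == 0) {
            // One trace element per method call currently on this thread's stack.
            return Thread.currentThread().getStackTrace().length;
        }
        return depthOf(remaining - 1); // each nested call pushes a new frame
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("extra frames for 5 nested calls: " + (depthOf(5) - depthOf(0)));

        // A new thread gets its own runtime stack, independent from the main one.
        Thread worker = new Thread(() -> System.out.println("worker depth: " + depthOf(0)));
        worker.start();
        worker.join();
    }
}
```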
Program counter registers
The JVM supports multi-threading, and every thread has its own pc register. At any point, each thread is executing a single, specific method. If that method is not native, the pc register contains the address of the JVM instruction currently being executed. If, instead, the method is native, the value of the pc register is undefined.
A native method is a method that is developed and runs in a language other than Java, while non-native methods are Java methods. Native methods are used to access system functions that can't otherwise be accessed by the JVM. However, they limit the portability of an application, because they are platform-dependent.
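In Java source code, a native method is just a declaration marked with the native keyword, without a body. The sketch below uses reflection to tell the two kinds of methods apart. NativeDemo, isNative() and hypotheticalNativeCall() are made-up names, and the native method is only declared, never called, since no native library backs it.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical demo class: checks whether a method is declared as native.
class NativeDemo {
    // Declaration only. Calling it would throw UnsatisfiedLinkError,
    // because this example ships no native implementation.
    private static native void hypotheticalNativeCall();

    static boolean isNative(Class<?> owner, String name, Class<?>... params) {
        try {
            Method method = owner.getDeclaredMethod(name, params);
            return Modifier.isNative(method.getModifiers());
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException("no such method: " + name, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isNative(NativeDemo.class, "hypotheticalNativeCall")); // true
        System.out.println(isNative(String.class, "length"));                     // false
    }
}
```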
Native method stacks
In addition to all the previously described memory structures of the JVM, an application could still use other data, created by (or for) native methods. This is because the JVM has very little control over what native methods do, since they are not encapsulated under it, but run separately. When a thread invokes a native method, it runs in a separate "world" from the JVM, without any kind of restriction. Clearly, then, native method stacks can't be inside the JVM's stack area, and thus, a new native stack is created, and the JVM links dynamically to that method. If the native method is, for example, a C function, then its stack will be a C stack, and not a Java one. Through the native method interface, a native method will most likely (although it depends on its purpose and how the developer personally decided to implement it) be able to call back into the JVM, and invoke a Java method. In this case, the thread leaves the native stack and enters another Java stack.
Execution engine
The execution engine is what makes everything move. It reads the bytecode from the .class files instruction by instruction and, after querying the various memory areas, it executes the given instructions, compiling them when needed. It is made of three parts.
The interpreter
The interpreter reads and verifies the bytecode, and then executes it. This is nice because it doesn't require compilation, but one of the main disadvantages of the interpreter is that if a method is called multiple times, it has to be interpreted each time.
The JIT compiler
The just-in-time compiler complements the interpreter: when the JVM notices that a method is called repeatedly, the JIT compiles it into native code, and that compiled code is then run in place of the bytecode, thus avoiding re-interpretation and increasing efficiency.
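The effect of the JIT can often be observed with a simple timing loop. This is only a rough sketch, not a proper benchmark: JitDemo and sumOfSquares() are made-up names, and the timings depend entirely on your machine and JVM. On HotSpot, later rounds are usually faster than the first one. You can run it with the -XX:+PrintCompilation flag to see which methods get compiled, or with -Xint to disable the JIT and compare.

```java
// Hypothetical demo class: calls the same small method many times and times each round.
class JitDemo {
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    public static void main(String[] args) {
        for (int round = 0; round < 5; round++) {
            long start = System.nanoTime();
            long result = 0;
            // Repeated calls make sumOfSquares a candidate for JIT compilation.
            for (int call = 0; call < 20_000; call++) {
                result += sumOfSquares(1_000);
            }
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("round " + round + ": " + micros + " us (result " + result + ")");
        }
    }
}
```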
The garbage collector
The garbage collector has the purpose of deleting unreferenced objects. An unreferenced object is any object that has been created but can no longer be reached from the running program, and is thus "lost" in memory. This happens, for example, every time you create an instance of something in a method, and then return from the method without storing that instance somewhere (in a List, a Map, ...). The GC is a pretty complex matter, which we'll talk about in another post.
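As a small preview, a WeakReference lets us watch an object without keeping it alive. In this sketch (GcDemo and survivesWhileReferenced() are hypothetical names), the object is guaranteed to survive while a normal reference to it exists. Once that reference is dropped, the object becomes eligible for collection, but System.gc() is only a request, so the last line may print either result.

```java
import java.lang.ref.WeakReference;

// Hypothetical demo class: observes an object before and after it becomes unreferenced.
class GcDemo {
    static boolean survivesWhileReferenced() {
        Object data = new Object();
        WeakReference<Object> weak = new WeakReference<>(data);
        System.gc(); // a request only; the JVM decides if and when to collect
        // 'data' is still used below, so the object cannot have been collected.
        return weak.get() == data;
    }

    public static void main(String[] args) {
        System.out.println(survivesWhileReferenced()); // true

        Object data = new Object();
        WeakReference<Object> weak = new WeakReference<>(data);
        data = null; // the object is now unreferenced
        System.gc();
        System.out.println(weak.get() == null ? "collected" : "not collected yet");
    }
}
```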
Conclusion
This is pretty much everything you should know about the Java Virtual Machine, especially if you previously knew nothing about it. I understand that some things may be a little complex - particularly if you're just getting started - but don't worry. If you grasped the main concepts, you are already ahead of a big part of developers out there. You don't need to remember or understand everything I wrote here, especially if you are a developer and you don't work daily with the JVM specification. And if you, instead, do work daily on implementing the Oracle specification, you probably know better than me and this was just a fun quick thing to read.
Either way, I hope this was interesting to read and that I was able to teach you something new.
See you soon!