Tuesday, August 9, 2016

Why is it called a Virtual Machine?

Introduction

We have probably heard the term virtual machine many times when working with languages that compile to an intermediate format instead of machine code. Some examples are Java, C# and VB.NET.

We might also have complained that an application is slow because it runs in a virtual machine. There are 2 types of virtual machines in software engineering. One is the kind that runs an operating system inside another operating system, such as Linux running inside a Windows machine. The machines we get from cloud providers are virtual machines of this kind. There is management software, such as Hyper-V, to manage those machines.

But this is not the virtual machine we mean when we talk about programming languages and platforms. Let's get an understanding of those virtual machines.

Machine

What does machine refer to here? It is nothing but execution hardware that can handle a finite set of instructions. In other words, it is something like a Turing machine: it takes one instruction, processes it, takes the next one, and so on. Instructions range from a simple 'add A,B' for adding 2 CPU registers to complicated GPU instructions.

Unmanaged world / the bare metal machine

In the beginning, software ran without a manager. Programmers talked in terms of 1's and 0's to each other and to the machine, which was very difficult to deal with. If we wanted to add 2 numbers, we had to use something like

11001100 1010,0011 // To add 10+3. Assume 11001100 is the instruction for add.

This instruction is commonly known as an opcode, short for operation code. It tells the executor, which here is the CPU / processor chip, to perform an add operation and place the result in a predefined location. Mostly that will be the accumulator.

Machine language

The above notation can be called machine language: only 1's and 0's, which the machine understands as the presence or absence of voltage. This is similar to the days when developers used punch cards to enter their programs into computers.

Soon developers started writing the numbers in hex, so the above machine language can be written as

CC 0A,03
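The step from the binary form to the hex form can be checked with a short Python sketch. The bit patterns are the ones assumed in this article (11001100 as the add opcode is only an illustration, not a real chip's opcode), and the helper name bits_to_hex is made up for this example.

```python
# Bit patterns from the example above. The opcode value is the article's
# assumed "add" instruction, not one taken from a real processor manual.
opcode_bits = "11001100"
operand_bits = ["1010", "0011"]  # 10 and 3


def bits_to_hex(bits: str, width: int = 2) -> str:
    """Convert a string of 1's and 0's to upper-case hex, zero-padded."""
    return format(int(bits, 2), "0{}X".format(width))


instruction_hex = "{} {},{}".format(
    bits_to_hex(opcode_bits),
    bits_to_hex(operand_bits[0]),
    bits_to_hex(operand_bits[1]),
)
print(instruction_hex)  # CC 0A,03
```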

Some universities still have an 8085 based programming paper that uses this mode of entering programs into the computer. Even today we can open executable files and see them in hex format.
Everything in a computer is actually binary, but for easier reading we can convert it to hex and display it. Don't get confused by the ASCII representation that a hex viewer shows at the right side. Any readable text there is just a coincidence. For example, the opcode for add and the ASCII code for 'D' may be the same. The viewer application simply tries to show each byte as ASCII for more readability.
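To see why the ASCII column of a hex viewer can look meaningful by coincidence, here is a minimal sketch of how such a viewer might format one line. The function name hex_dump_line and the sample bytes are assumptions for illustration. The viewer has no idea whether a byte is an opcode or data; it just prints a letter whenever the value happens to fall in the printable ASCII range.

```python
def hex_dump_line(data: bytes) -> str:
    """Format bytes the way a simple hex viewer might: hex on the left, ASCII on the right."""
    hex_part = " ".join("{:02X}".format(b) for b in data)
    # Printable ASCII is 32..126; everything else is shown as a dot.
    ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in data)
    return "{}  |{}|".format(hex_part, ascii_part)


# 0x44 could be an opcode or plain data; the viewer shows 'D' either way.
line = hex_dump_line(bytes([0xCC, 0x0A, 0x03, 0x44]))
print(line)  # CC 0A 03 44  |...D|
```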

Assembly language

Soon people started using more human friendly instructions instead of hex instructions. This was called assembly language. The above code can be written in assembly language as follows

Add 0A,03

Only the instruction / opcode changed; the data remained in hex.

Smart people created text editors in assembly language to write assembly language. That made life easier for new developers, as they could write programs in an editor that was itself built using assembly language.

There were no loops, or at least most machines didn't support them. But we had an option to change the execution order based on conditions. Instructions such as JMP, JC and JNC (Jump, Jump on Carry, Jump on No Carry) are some examples. Using these instructions, developers were able to implement loops.
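How a loop comes out of nothing but a conditional jump can be sketched with a tiny simulated machine in Python. The instruction names (ADD, DEC, JNZ, JMP), the register names and the run function are all hypothetical and only loosely modelled on real instruction sets. The program adds 4 + 3 + 2 + 1 by jumping back to its first instruction until register B reaches zero.

```python
def run(program, registers):
    """Execute a list of (opcode, operands...) tuples on a toy register machine."""
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "ADD":        # ADD dst, src  ->  dst = dst + src
            registers[args[0]] += registers[args[1]]
        elif op == "DEC":      # DEC reg       ->  reg = reg - 1
            registers[args[0]] -= 1
        elif op == "JNZ":      # JNZ reg, addr ->  jump if reg is not zero
            if registers[args[0]] != 0:
                pc = args[1]
        elif op == "JMP":      # JMP addr      ->  unconditional jump
            pc = args[0]
        else:
            raise ValueError("unknown instruction: {}".format(op))
    return registers


program = [
    ("ADD", "A", "B"),  # 0: A = A + B
    ("DEC", "B"),       # 1: B = B - 1
    ("JNZ", "B", 0),    # 2: if B != 0, go back to instruction 0
]
result = run(program, {"A": 0, "B": 4})
print(result["A"])  # 10
```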

Up to this point, developers were writing directly in the language the machine / hardware understands. After assembly language, some programs started to get involved between the program and the machine.

Programs that prepare programs for execution

Concepts like macros, pre-processors etc. introduced the need for dedicated programs that prepare other programs for execution. In other words, if we write assembly language using English opcodes, those need to be converted to the corresponding machine code. If we consider the above sample of

Add 0A,03

Someone has to convert the text opcode 'Add' to CC, which is the hex equivalent of the binary 11001100 for the add instruction. The program that does this conversion is called an assembler.
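A toy version of that conversion fits in a few lines of Python. This is only a sketch: the OPCODES table holds the single made-up mapping used in this article (Add = CC), and the assemble function handles just this one 'opcode operand,operand' format, which is far simpler than what a real assembler does.

```python
# Mnemonic -> opcode byte. CC for add is the value assumed in this article.
OPCODES = {"ADD": 0xCC}


def assemble(line: str) -> bytes:
    """Translate one line such as 'Add 0A,03' into machine code bytes."""
    mnemonic, operands = line.split(None, 1)
    opcode = OPCODES[mnemonic.upper()]
    # Operands are already written in hex, so only the mnemonic needs a lookup.
    values = [int(token.strip(), 16) for token in operands.split(",")]
    return bytes([opcode] + values)


machine_code = assemble("Add 0A,03")
print(" ".join("{:02X}".format(b) for b in machine_code))  # CC 0A 03
```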

Still, each assembly language instruction maps directly to an instruction that the machine supports, i.e. we have to read the manual of the processor chip to know which instructions it supports.

High level languages

The one to one relation between assembly language instructions and machine instructions was a limitation, and later people thought of more intuitive languages to program in: a more human readable approach where instructions in the program don't need a one to one relation to the machine code. As with assemblers, the program is converted to machine language before execution. When converting, one instruction in this language can become one or more machine instructions.
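The 'one to many' idea can be illustrated with a deliberately tiny compiler sketch in Python. It understands only statements of the form 'target = x + y + ...' and emits instructions for an imaginary accumulator machine. The instruction names LOAD, ADD and STORE and the function compile_sum are assumptions made for this example, not the output of any real compiler.

```python
def compile_sum(statement: str) -> list:
    """Compile 'target = a + b + ...' into accumulator-style instructions."""
    target, expression = [part.strip() for part in statement.split("=")]
    terms = [term.strip() for term in expression.split("+")]
    instructions = ["LOAD " + terms[0]]       # accumulator = first term
    for term in terms[1:]:
        instructions.append("ADD " + term)    # accumulator += term
    instructions.append("STORE " + target)    # target = accumulator
    return instructions


# One high level statement becomes four lower level instructions.
for instruction in compile_sum("total = a + b + c"):
    print(instruction)
```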

Yes, high level languages were born. An excellent example is C. We write in a high level, more human readable format, and it is converted / compiled to machine instructions before execution.

This produced the concepts of loops, functions, variable types etc. Earlier it was all just bits and bytes with a jump instruction.

Internally loops might use jump instructions, but the details of the machine were hidden from the developer. This helped more and more people become developers. They don't need to know how the machine works at the chip level, or what voltage or binary is, and are still able to write programs.

Summary - unmanaged world

We can summarize this as a model where we prepare machine instructions, either by writing them directly or by converting / compiling assembly or high level languages. Once they are ready, we just start execution. There is no one to manage it; the machine just executes our code.

Pros

  • Talks directly to the machine. Able to leverage its full power as there is no additional overhead.

Cons

  • Needs comparatively high skill.
  • Development is time consuming as there are fewer reusable things and abstractions.
  • Needs to worry about many things such as memory allocation, security etc.
    • Security here refers to a concurrently running program accessing our data when there is no operating system level process separation.
  • The machine instructions vary by the manufacturer of the chip. The instructions supported by an Intel chip may not be available in a chip produced by AMD, and vice versa. So developers need to compile their program multiple times to produce executables for different chips. The instructions may also vary between 32 bit and 64 bit.

Managed world

Normally when people say managed, they refer to technologies such as JVM or CLR based execution, where there is a program which runs our program. To be more specific,
  • The developer writes the program in any of the supported programming languages.
  • The compilation produces an intermediate code / byte code which the hardware / machine cannot understand.
  • There is another program (JVM / CLR), itself in machine code, which reads the intermediate code produced by the developer and executes the instructions one by one.
  • This executor program acts like a hardware machine which can understand a specific set of instructions.
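The steps above can be sketched as a very small executor program in Python. The byte code format here is invented for illustration (the opcode values PUSH, ADD, MUL and HALT are assumptions and have nothing to do with real JVM or CLR byte code), and a real JVM or CLR typically also compiles byte code to machine code at run time instead of only interpreting it. The sketch only shows the core idea: a program that reads instructions the hardware does not know and carries them out one by one.

```python
# Hypothetical opcodes of our pretend machine.
PUSH, ADD, MUL, HALT = 0x01, 0x02, 0x03, 0xFF


def execute(bytecode: bytes) -> int:
    """Fetch, decode and execute byte code until HALT; return the top of the stack."""
    stack = []
    pc = 0  # index of the next byte to read
    while True:
        op = bytecode[pc]
        pc += 1
        if op == PUSH:            # PUSH <value>: the operand is the next byte
            stack.append(bytecode[pc])
            pc += 1
        elif op == ADD:
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == MUL:
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError("unknown opcode: {:#04x}".format(op))


# Byte code for 10 + 3, the same sum used earlier in the article.
program = bytes([PUSH, 10, PUSH, 3, ADD, HALT])
print(execute(program))  # 13
```

The physical chip only ever runs the machine code of the Python interpreter and of execute itself; the PUSH / ADD / HALT machine exists purely in software.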

So what is virtual machine?

Under the hood, it is still machine instructions that are executed by the processor chip. But that is hidden from an end programmer such as a .NET / Java developer. Developers see only the instructions supported by the intermediate / byte code as the machine instructions, i.e. a machine which is not there physically. So they target their code, written in a high level language, at the byte code / intermediate code understood by the executor program, i.e. a virtual machine. In the .NET world it is called the CLR, and in Java it is the JVM. The executor program reads the code and carries out the task by giving the required machine instructions to the chip.

The executor program needs to be compiled for different chips because it is in machine language. But that needs to be done by only a few development teams, such as those behind the JVM and CLR. The rest of the development world doesn't need to worry about compiling their code for different types of chips. So more development teams take the route of the managed world.

Since the executor program reads the developer's program in order to execute it, it can easily enforce many rules and regulations to manage the execution. Some limit the developer's capabilities; an example is the lack of pointers. Some reduce the developer's overhead and increase productivity; examples are memory management (garbage collection) and standard libraries, though we had standard libraries in the unmanaged world as well.
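As a small illustration of rule enforcement, here is a sketch of memory that is owned by the executor rather than by the running program. The class ManagedMemory and its methods are hypothetical names for this example. Because every access goes through the executor, it can refuse an out of range address instead of letting the program touch arbitrary memory.

```python
class ManagedMemory:
    """Hypothetical memory block owned by the executor, not by the guest program."""

    def __init__(self, size: int) -> None:
        self._cells = [0] * size

    def _check(self, address: int) -> None:
        # Rule enforced by the executor: addresses must stay inside the block.
        if not 0 <= address < len(self._cells):
            raise IndexError("address {} is outside managed memory".format(address))

    def store(self, address: int, value: int) -> None:
        self._check(address)
        self._cells[address] = value

    def load(self, address: int) -> int:
        self._check(address)
        return self._cells[address]


memory = ManagedMemory(4)
memory.store(2, 42)
print(memory.load(2))  # 42

try:
    memory.store(10, 1)   # outside the 4 cells we were given
    outcome = "allowed"
except IndexError:
    outcome = "rejected"
print(outcome)  # rejected
```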

If chip manufacturers had a worldwide standardization consortium and produced chips supporting the same instructions, the managed world might never have been invented.

Pros

  • Portability of the intermediate code makes deployment easy.
  • The virtual machine can enforce rules.
  • Multiple languages can be compiled to the intermediate code if there are compilers for them.
  • This allows an intermediate language library written in one language to be used from another language.

Cons

  • A little overhead for converting the intermediate code to machine code at runtime.
  • Lack of features such as pointers limits the developer from talking to the bare machine, though this can be worked around.
  • Delays the usage of new hardware features until they are supported by the virtual machine, e.g. parallel instructions in new processors.

Is the Operating System a virtual machine?

If we are writing Windows programs, we may feel that the OS is a virtual machine, as we are targeting the Windows APIs. Also, when we look at process boundaries and limited access to memory, we can see the OS as a virtual machine that kind of hides the hardware. But since our programs are compiled to machine language, the OS is not truly a virtual machine.

Reference

http://c2.com/cgi/wiki?VirtualMachine
