Defining mixed reality

Let's begin by defining and contrasting three related, but frequently misused, terms: virtual reality (VR), augmented reality (AR), and mixed reality (MR):

VR describes technology and experiences where the user is fully immersed in a virtual environment.
AR can be described as technology and techniques used to superimpose digital content onto the real world.
MR can be considered a blend of VR and AR. It uses the physical environment to add realism to holograms, which may or may not have a physical reference point (as content in AR does).

The differences between VR, AR, and MR are not so much in the technology but in the experience you are trying to create. Let's illustrate this through an example--imagine that you are given a brief to help manage children's anxiety about hospital treatment.

With VR, you might create a story or game in space, with likeable characters that represent the staff of the hospital in role and character, but in a more interesting and engaging form. This experience will gently introduce the patient to the concepts, procedures, and routines required. At the other end of the spectrum, you can use AR to deliver fun facts based on contextual triggers; for example, the child might glance (with their phone or glasses) at the medicine packaging to discover which famous people have had the same condition. MR, as the name suggests, mixes both approaches--our solution can involve a friend, such as a teddy, for the child to converse with and express their concerns and fears to. Their friend will accompany them at home and in the hospital, being contextually sensitive and responding appropriately.

As these hypothetical examples show, the three approaches are not mutually exclusive; they differ in the degree to which they preserve reality. This spectrum is termed the reality–virtuality continuum, introduced by Paul Milgram and his colleagues in the paper Augmented Reality: A class of displays on the reality-virtuality continuum, 1994. The paper illustrates the spectrum graphically, between the extremes of real and virtual, showing how MR encompasses both AR and augmented virtuality (AV). The following figure by Paul Milgram and Fumio Kishino illustrates the concept--to the far left, you have reality and at the opposite end, the far right, you have virtuality--with MR straddling the space between these two extremes:

Representation of reality-virtuality continuum by Paul Milgram and Fumio Kishino

Our focus in this book, and the proposition of HoloLens, is MR (also referred to as Holographic). Next, we will look into the principles and building blocks that make up MR experiences.

Designing for mixed reality

The emergence of a new technology always raises the question of whether to do old things in a new way or to do new things in new ways. When TV was first introduced, the early programs were adapted from radio, with the presenter reading in front of a camera, neglecting the visual element of the medium. A similar phenomenon happened with computers, the web, and mobile. I would encourage you to think about the purpose of what you're trying to achieve rather than the process by which it is currently achieved; this frees you to create new and innovative solutions.

In this section, we will go over some basic design principles related to building MR experiences and the accompanying building blocks available on HoloLens. Keep in mind that this medium is still in its infancy, so the following principles are a work in progress.

Identifying the type of experience you want to create

As discussed earlier, the degree of reality you want to preserve is up to you--the application designer. It is important to establish where your experience fits early on, as it will affect both the design and the implementation of the experience you are building. Microsoft outlines three types of experiences:

Enhanced environment apps: These are applications that respect the real world and supplement it with holographic content. An example of this can be pinning a weather poster near the front door, ensuring that you don't forget your umbrella when the forecast is for rain.
Blended environment apps: These applications are aware of the environment, but will replace parts of it with virtual content. An application that lets the user replace fittings and furniture is an example.
Virtual environment apps: These types of applications will disregard the environment and replace it completely with a virtual alternative. An application that converts your room into a jungle, with trees and bushes replacing the walls and the floor can be taken as an example.

As with so many things, there is no right answer, just a good answer for a specific user, in a specific context, at a specific time. For example, a weather app designed for a professional might pin the forecast to the door so that she sees it just before leaving for work, while for a younger audience it might be more useful to present the same information through a holographic rain cloud.

In the next section, we will continue our discussion on the concepts of MR, specifically looking at how HoloLens makes sense of the environment.

Understanding the environment

One of the most compelling features of HoloLens is its ability to place and track virtual/digital content in the real world. It does this using a process known as spatial mapping, whereby the device actively scans the environment, building a digital representation of it in memory. In addition, it marks important points in the world, in reference to the defined world origin, using a concept called spatial anchors; holograms are positioned relative to these spatial anchors, and the anchors are also used to join multiple spaces for handling larger environments.
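To make the idea of positioning a hologram relative to an anchor more concrete, here is a minimal sketch in Python. It is not HoloLens API code: the SpatialAnchor class and the world_position function are hypothetical names, and the anchor is simplified to a position plus a rotation about the vertical axis.

```python
import math
from dataclasses import dataclass


@dataclass
class SpatialAnchor:
    """Hypothetical anchor: a world-space position plus a yaw about the y axis."""
    x: float
    y: float
    z: float
    yaw_degrees: float = 0.0


def world_position(anchor, offset):
    """Convert an offset given in the anchor's local frame into world coordinates."""
    ox, oy, oz = offset
    yaw = math.radians(anchor.yaw_degrees)
    # Rotate the offset about the vertical (y) axis, then translate by the anchor.
    wx = anchor.x + ox * math.cos(yaw) + oz * math.sin(yaw)
    wz = anchor.z - ox * math.sin(yaw) + oz * math.cos(yaw)
    return (wx, anchor.y + oy, wz)


# A hologram stored as "1 m above and 0.5 m in front of the door anchor".
door_anchor = SpatialAnchor(x=2.0, y=0.0, z=3.0)
hologram_offset = (0.0, 1.0, 0.5)
hologram_world = world_position(door_anchor, hologram_offset)
```

Because only the offset is stored with the hologram, any correction to the anchor's estimated position carries the hologram along with it, which is the main reason for positioning content this way in this sketch.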

The effectiveness of the scanning process will determine the quality of the experience; therefore, it is important to understand this process in order to create an experience that effectively captures sufficient data about the environment. One technique commonly used is digital painting; during this phase, the user is asked to paint the environment. As the user glances around, the scanned surfaces are visualized (or painted over), providing feedback to the user that the surface has been scanned.

However, scanning and capturing the environment is just one part of understanding it; the second part is making use of what has been captured. Some uses include the following:

Occlusion: One of the shortfalls of creating immersive MR experiences using single-camera devices (such as smartphones) is that they cannot understand surfaces well enough to hide virtual content when a real object obstructs it. Seeing holograms through objects is a quick way to force the user out of the illusion; with HoloLens, occluding holograms with the real world is easy.
Visualization: Sometimes, visualizing the scanned surfaces is desirable, normally as a temporary effect such as showing the user which parts of the environment have been scanned.
Placement: Like occlusion, placement helps create a compelling illusion; holograms should behave like the real objects that they are impersonating. Once the environment is scanned, further processing can be performed to gain greater knowledge of the environment, such as the types of surfaces available. With this knowledge, we can better infer where objects belong and how they should behave. In addition to creating more compelling illusions, matching the experience with the user's mental model of where things belong makes the experience more familiar, thus easing adoption by making it easier and more intuitive to use.
Physics: HoloLens makes the scanned surfaces accessible as plain geometry data, which means we can leverage the existing physics simulation software to reinforce the presence of holograms in the user's environment. For example, if I throw a virtual ball, I expect it to bounce off the walls and onto the floor before settling down.
Navigation: In game development, we have devised effective methods for path planning. Having a digital representation of the real world allows us to use these same techniques in the real world. Imagine offering a visually impaired person an opportunity to navigate an environment independently, or helping a parent find their lost child in a busy store.
Recognition: Recognition refers to the ability of the computer to classify what objects are in the environment; this can be used to create a more immersive experience, such as having virtual characters sit on seats, or to provide a utility, such as helping teach a new language or assisting visually impaired people so that they can better understand their environment.
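As an illustration of the navigation use, the following Python sketch runs a breadth-first search, one of the simplest path-planning techniques, over a small grid. The grid is an assumption made for the example: it stands in for a floor plan that has already been derived from the scanned surfaces, with '#' marking cells blocked by furniture or walls. The function name find_path and the ROOM layout are hypothetical.

```python
from collections import deque


def find_path(grid, start, goal):
    """Breadth-first search over a grid of strings.

    Cells containing '#' are blocked. Returns the shortest list of
    (row, col) cells from start to goal, or None if no route exists.
    """
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk back through the recorded predecessors to build the path.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in came_from:
                came_from[(nr, nc)] = current
                queue.append((nr, nc))
    return None


# A toy room: a partition ('#') separates the two upper corners.
ROOM = [
    "..#.",
    "..#.",
    "....",
]
route = find_path(ROOM, start=(0, 0), goal=(0, 3))
```

In a real application, the resulting route could be rendered as a trail of holograms on the floor or read out as spoken directions; game engines typically offer more capable navigation tools than this sketch.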

Thinking in terms of the real world

The luxury of designing for screen-based experiences is that the problem is simplified. In most cases, we own the screen and have a good understanding of it; we lose these luxuries with MR experiences, but gain flexibility and, with it, the opportunity for new, innovative experiences. It therefore becomes even more important to understand your users and the context in which they will be using your application, for example:

Will they be sitting or standing?
Will they be moving or stationary?
Is the experience time dependent?

Some common practices when embedding holograms in the real world include the following:

Place holograms in convenient places--places that are intuitive, easily discovered, and in reach, especially if they are interactive.
Design for the constraints of the platform, but keep in mind that we are developing for a platform that will rapidly advance in the next few years. At the time of writing, Microsoft recommends placing holograms between 1.25 meters and 5 meters away from the device, with an optimum viewing distance of 2 meters. Find ways of gracefully fading content in and out when it gets too close or too far, so as not to jar the user into an unexpected experience.
As mentioned earlier, placing holograms on contextually relevant surfaces and using shadows create more immersive experiences, giving a better illusion that the hologram exists in the real world.
Avoid locking content to the camera; this can quickly become an annoyance to the user. Instead, use a gentler alternative; one approach being adopted has the interface dragged along, in an elastic-like manner, by the user's gaze.
Make use of spatial sound to improve immersion and assist in hologram discovery. If you have ever listened to Virtual Barber Shop Hair Cut (https://www.youtube.com/watch?v=8IXm6SuUigI), you will appreciate how effective 3D sound can be in creating an immersive experience. Just as holograms should mimic the behavior of the objects they impersonate, use the real-world sounds that the user would expect from the hologram.
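The advice on fading content near the edges of the comfortable viewing range can be sketched as a simple function of distance. The 1.25 meter and 5 meter limits come from the recommendation quoted above; the width of the fade band (0.5 meters) and the name hologram_opacity are assumptions chosen for illustration.

```python
NEAR_LIMIT = 1.25  # meters; nearest recommended hologram distance
FAR_LIMIT = 5.0    # meters; farthest recommended hologram distance
FADE_BAND = 0.5    # meters over which content fades out (assumed value)


def hologram_opacity(distance):
    """Return an opacity between 0.0 and 1.0 for a hologram at this distance.

    Content is fully visible inside the recommended range and fades
    linearly to invisible over FADE_BAND meters beyond either limit.
    """
    if distance < NEAR_LIMIT:
        t = (distance - (NEAR_LIMIT - FADE_BAND)) / FADE_BAND
    elif distance > FAR_LIMIT:
        t = ((FAR_LIMIT + FADE_BAND) - distance) / FADE_BAND
    else:
        return 1.0
    return max(0.0, min(1.0, t))
```

With these values, a hologram at 2 meters is fully opaque, one at 1 meter is half faded, and one at 0.5 meters is invisible. A linear ramp is the simplest choice; an eased curve may feel smoother in practice.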

Spatial sound, also called 3D sound, adds another dimension to how sound is perceived. Sounds are normally played back in stereo, meaning that the sound has no spatial position, that is, the user won't be able to infer where in space the sound comes from. Spatial sound is a set of techniques that mimic sound in the real world. This has many advantages, from offering more realism in your experience to helping the user locate content.

Of course, this list is not comprehensive, but it offers a few practices to consider when building MR applications. Next, we will look at ways in which the user can interact with holograms.
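To give a feel for what it means for a sound to have a position, the sketch below derives left and right channel gains from where a source sits relative to the listener. It is a deliberately crude stand-in: it combines constant-power panning with inverse-distance attenuation, whereas real spatial sound engines model how the head and ears filter sound. The function name stereo_gains and the min_distance parameter are assumptions for the example.

```python
import math


def stereo_gains(source_x, source_z, min_distance=1.0):
    """Return (left, right) gains for a sound source on the horizontal plane.

    The listener is at the origin facing +z; source_x is meters to the
    listener's right. Sources closer than min_distance play at full volume.
    """
    distance = math.hypot(source_x, source_z)
    # Inverse-distance rolloff: twice as far away means half the gain.
    attenuation = min_distance / max(distance, min_distance)
    # Pan runs from -1 (hard left) through 0 (centered) to +1 (hard right).
    pan = source_x / distance if distance > 0 else 0.0
    # Constant-power panning keeps perceived loudness steady across the arc.
    angle = (pan + 1.0) * math.pi / 4.0
    return attenuation * math.cos(angle), attenuation * math.sin(angle)
```

A source straight ahead produces equal gains in both ears, a source to the right favors the right channel, and moving a source farther away lowers both gains. Note that this simple model cannot distinguish front from back, which is one of the things a full spatial sound implementation addresses.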

Interacting in mixed reality

With the introduction of any new computing paradigm comes new ways of interacting with it and, as highlighted in the opening paragraph, history has shown that we are moving from an interface that is natural to the computer toward an interface that is more natural to people. For the most part, HoloLens removes dedicated input devices and relies on inferred intent, gestures, and voice. I would argue that this constraint is the second most compelling offering that HoloLens gives us; it is an opportunity to invent more natural and seamless experiences that can be accessible to everyone. Microsoft refers to the three main forms of input as gaze, gesture, and voice (GGV); let's examine each of these in turn.

Gaze refers to tracking what the user is looking at; from this, we can infer their interest (and intent). For example, I will normally look at a person before I speak to them, hopefully, signalling that I have something to say to them. Similarly, during the conversation, I may gaze at an object, signalling to the other person that the object that I'm gazing at is the subject I'm speaking about.

This concept is heavily used in HoloLens applications for selecting and interacting with holograms. Gaze is accompanied by a cursor; the cursor provides a visual representation of the user's gaze, giving visual feedback on what the user is looking at. It can additionally be used to show the state of the application or of the object the user is currently gazing at; for example, the cursor can visually change to signal whether the hologram the user is gazing at is interactive or not. On the official developer site, Microsoft has listed the design principles; I have paraphrased and listed them here for convenience:

Always present: The cursor is, in some sense, akin to the mouse pointer of a GUI; it helps the users understand the environment and the current state of the application.
Cursor scale: As the cursor is used for selecting and interacting with holograms, its size should be no bigger than the objects the user can interact with. Scale can also be used to assist the user's understanding of depth; for example, the cursor will be larger when on nearby surfaces than when on surfaces farther away.
Look and feel: Using a directionless shape means that you avoid implying any specific direction with the cursor; the shape commonly used is a donut or torus. Making the cursor hug the surfaces gives the user a sense that the system is aware of their surroundings.
Visual cues: As mentioned earlier, the cursor is a great way of communicating to the user what is important as well as relaying the current state of the application. In addition to signalling to the user what is interactive and what is not, it can also be used to present additional information (possible actions) or the current state, such as showing the user that their hand has been detected.
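Targeting by gaze boils down to casting a ray from the user's head along the direction they are facing and finding the first thing it hits. The following sketch shows the idea under a strong simplification: each hologram is approximated by a bounding sphere, and the function gaze_hit and the hologram tuples are hypothetical names made up for the example rather than part of any HoloLens API.

```python
import math


def gaze_hit(origin, direction, holograms):
    """Return (name, distance) for the nearest hologram hit by the gaze ray.

    holograms is a list of (name, center, radius) tuples and direction
    must be a unit vector. Returns None when the ray hits nothing.
    """
    best = None
    for name, center, radius in holograms:
        # Vector from the user's head to the center of the sphere.
        to_center = [c - o for c, o in zip(center, origin)]
        # How far along the ray the point nearest to the center lies.
        along = sum(t * d for t, d in zip(to_center, direction))
        if along < 0:
            continue  # the hologram is behind the user
        # Squared distance from the center to that nearest point.
        closest_sq = sum(t * t for t in to_center) - along * along
        if closest_sq > radius * radius:
            continue  # the ray passes beside the sphere
        hit_distance = along - math.sqrt(radius * radius - closest_sq)
        if best is None or hit_distance < best[1]:
            best = (name, hit_distance)
    return best


scene = [
    ("menu", (0.0, 0.0, 3.0), 0.5),
    ("globe", (0.0, 0.0, 2.0), 0.25),
    ("lamp", (2.0, 0.0, 2.0), 0.25),
]
target = gaze_hit((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene)
```

The returned distance tells the application where along the ray to draw the cursor, and the returned name identifies which hologram should receive any gesture or voice command that follows. When nothing is hit, a cursor that is always present would typically fall back to the scanned surface or a default distance.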

While gazing provides the mechanism for targeting objects, gestures and voice provide the means to interact with them. Gestures can be either discrete or continuous. Discrete gestures execute a specific action; for example, the air-tap gesture is equivalent to a click with a mouse or a tap on a screen. In contrast, continuous gestures are entered and exited and, while active, provide continuous updates to their state. An example of this is the manipulation gesture, whereby the user enters the gesture by holding their finger down (called the hold gesture); once active, this will continuously provide updates of the position of the tracked hand until the gesture is exited with the finger being lifted. This is equivalent to dragging items on desktop and touch devices with the addition of depth.

HoloLens recognizes and tracks hands in either the ready state (back of hand facing you with the index finger up) or pressed state (back of hand facing you with the index finger down) and makes the current position and state of the currently tracked hands available, allowing you to devise your own gestures in addition to the standard gestures it provides, some of which are reserved for the operating system. The following gestures are available:

Air-tap: This is when the user presses (finger down) and releases (finger up) within a certain threshold. This interaction is commonly associated with selecting holograms (as mentioned earlier).
Bloom: Reserved for the operating system, bloom is performed by holding your hand in front of you with your fingers closed, and then opening your hand up. When detected, HoloLens will redirect the user to the Start menu.
Manipulation: As mentioned earlier, manipulation is a continuous gesture entered when the user presses their finger down and holds it down, and exited when hand tracking is lost or the user releases their finger. When active, the user's hand is tracked with the intention of using the absolute position to manipulate the targeted hologram.
Navigation: This is similar to the manipulation gesture, except for its intended use. Instead of mapping the absolute position changes of the user's hand to the hologram, as with manipulation, navigation provides a standard range of -1 to 1 on each axis (x, y, and z); this is useful (and often used) when interacting with user interfaces, such as scrolling or panning.
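The differences between these gestures can be summarized in a few lines of Python. This is a sketch of the concepts only, not of how HoloLens implements them: the threshold values and the function names classify_press, manipulation_delta, and navigation_value are assumptions chosen for the example.

```python
TAP_THRESHOLD = 0.3  # seconds; assumed cut-off between a tap and a hold
NAV_RANGE = 0.2      # meters of hand travel mapped to full deflection (assumed)


def classify_press(press_time, release_time):
    """A press released quickly is a discrete air-tap; otherwise it is a hold."""
    if release_time - press_time <= TAP_THRESHOLD:
        return "air-tap"
    return "hold"


def manipulation_delta(start_pos, current_pos):
    """Manipulation: the raw hand displacement, in meters, since the hold began."""
    return tuple(c - s for s, c in zip(start_pos, current_pos))


def navigation_value(start_pos, current_pos):
    """Navigation: the same displacement, scaled and clamped to -1..1 per axis."""
    return tuple(
        max(-1.0, min(1.0, (c - s) / NAV_RANGE))
        for s, c in zip(start_pos, current_pos)
    )
```

For a hand that has moved 0.1 meters to the right and 0.5 meters down since the hold began, manipulation_delta reports the movement as is, which suits dragging a hologram through space, while navigation_value reports half deflection on x and full negative deflection on y, which suits driving a scroll speed.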

The last dominant form of interacting with HoloLens, and one I'm particularly excited about, is voice. In recent times, we have seen the rise of the Conversational User Interface (CUI); so, it's timely to introduce a platform where one of its dominant inputs is voice. In addition to being a vision we have had since before the advent of computers, it also provides the following benefits:

Hands free (obviously important for a device like HoloLens)
More efficient and requires less effort to achieve a task; this is true for data entry and navigating deeply nested menus
Reduces cognitive load; when done well, it should be intuitive and natural, with minimal learning required

However, how voice is used really depends on your application. It can simply supplement gestures, for example, by allowing the user to say the Select keyword (a reserved keyword) to select the object they are currently gazing at, or it can support complex requests, such as answering free-form questions from the user. Voice also has some weaknesses, including these:

Difficulty with handling ambiguity in language; for example, how do you handle a request such as "louder"?
Manipulating things in physical space is also cumbersome
Social acceptance and privacy are also considerations that need to be taken into account

With the success of machine learning (ML) and the adoption of services such as Amazon's Echo, it is likely that these weaknesses will be short-lived.
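The simpler end of this range, keyword commands that supplement gaze, can be sketched as follows. The example assumes that speech has already been converted to text and that the application keeps its state in a plain dictionary; the function handle_voice_command, the state keys, and the fixed volume step are all hypothetical. The step is one possible answer to the ambiguity of a request such as "louder": pick a sensible default and let the user repeat the command.

```python
VOLUME_STEP = 0.1  # assumed default change applied for a relative request


def handle_voice_command(phrase, state):
    """Apply a recognized phrase to the application state.

    Returns True if the phrase was understood and False otherwise, so the
    caller can give the user feedback when a command is not recognized.
    """
    phrase = phrase.strip().lower()
    if phrase == "select":
        # Voice supplements gaze: act on whatever the user is looking at.
        state["selected"] = state.get("gazed_at")
        return True
    if phrase in ("louder", "quieter"):
        step = VOLUME_STEP if phrase == "louder" else -VOLUME_STEP
        # Clamp to the valid range and round away floating-point noise.
        state["volume"] = round(max(0.0, min(1.0, state["volume"] + step)), 2)
        return True
    return False


app_state = {"gazed_at": "globe", "selected": None, "volume": 0.5}
handled = handle_voice_command("Select", app_state)
```

Free-form requests need far more than a lookup like this, which is where conversational services come in.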