Jingyi Yu: Processing and Display of Light Field VR Acquisition

Release Time: 2017-10-18     Author: Advanced Innovation Center for Future Visual Entertainment

Yu Jingyi, Director of VRVC, School of Information Science and Technology, ShanghaiTech University


      Today, I am going to talk about "light field virtual reality." To me, VR has always been more about the reality than the virtual. When I was pursuing a bachelor's degree at Caltech, the star of VR at that time was QuickTime VR, developed by Eric Chen. How does it work? First, you take a series of photos all around you with a camera, and then the program performs feature correspondence. Remember, the aim is to stitch these photos into a panorama. Technically, it is just identifying a set of feature points and matching them across the photos.
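      Conceptually, that correspondence step can be sketched in a few lines. The snippet below is only an illustration of the idea using OpenCV's ORB features; the file names, feature counts, and thresholds are made-up assumptions, not anything QuickTime VR actually used:

    import cv2
    import numpy as np

    # Two overlapping photos taken while rotating in place (placeholder paths).
    left = cv2.imread("shot_01.jpg", cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("shot_02.jpg", cv2.IMREAD_GRAYSCALE)

    # Identify a set of feature points (and descriptors) in each photo.
    orb = cv2.ORB_create(nfeatures=2000)
    kp_l, des_l = orb.detectAndCompute(left, None)
    kp_r, des_r = orb.detectAndCompute(right, None)

    # Correspond the feature points between the two photos.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_l, des_r), key=lambda m: m.distance)

    # A homography fitted to these correspondences is what lets the photos
    # be warped onto a common surface and blended into a panorama.
    src = np.float32([kp_l[m.queryIdx].pt for m in matches[:200]]).reshape(-1, 1, 2)
    dst = np.float32([kp_r[m.trainIdx].pt for m in matches[:200]]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)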

QuickTime VR

      When the correspondence is done, the photos can be mapped onto a sphere. Note that the mapping may not be perfect because of hand shake, which introduces slight displacements in three-dimensional space. It would be perfect if the photos were all taken from a single fixed and stable point.


      There are algorithms to deal with such imperfections. The idea is simple: when putting two photos together, the algorithm finds an optimal seam that designates which part belongs to the left photo and which to the right, so that the two photos look like one.
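      As a rough sketch of that seam idea (my own simplified version, not the exact algorithm of any particular product), you can score every pixel in the overlap by how much the two photos disagree there, then trace the cheapest vertical path with dynamic programming; pixels left of the path come from the left photo, pixels right of it from the right photo:

    import numpy as np

    def optimal_seam(overlap_left, overlap_right):
        """Find a vertical seam through the overlap region (two grayscale
        arrays of equal size) that minimizes the visible difference."""
        diff = np.abs(overlap_left.astype(float) - overlap_right.astype(float))
        h, w = diff.shape

        # cost[y, x] = cheapest total difference of any seam reaching (y, x).
        cost = diff.copy()
        for y in range(1, h):
            for x in range(w):
                lo, hi = max(0, x - 1), min(w, x + 2)
                cost[y, x] += cost[y - 1, lo:hi].min()

        # Backtrack from the cheapest endpoint on the bottom row.
        seam = np.zeros(h, dtype=int)
        seam[-1] = int(cost[-1].argmin())
        for y in range(h - 2, -1, -1):
            x = seam[y + 1]
            lo, hi = max(0, x - 1), min(w, x + 2)
            seam[y] = lo + int(cost[y, lo:hi].argmin())
        return seam  # seam[y] = column where the left photo ends on row y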


      The technology was all the rage in 1998. A year later, I started the first business of my life with three buddies from Caltech. We were commissioned to build panoramas of Las Vegas casinos, inside and out. The economy was in a slump then, so the casino owners counted on the project to attract visitors.


Casino Panorama

      We built the panoramas with our own devices, and the casino deployed them, but the project was not as successful as expected. So we sold the business in 2000 for a mere 4,000 dollars, with which we had a great time in Hawaii.


      The photo I just showed you was taken at an airport in Hawaii by the four of us, in 1999, almost twenty years ago. The important question is this: if VR failed back then because of technological issues, what exactly were they? Looking back from today's vantage point, we realize that we put too much effort into the V, the virtual, that is, how to generate virtual images, and forgot that the R, the reality, was the king. And it still is.


      In other words, seeing is believing. What we really want is to fool the eyes, so we need to know our eyes. Now let's watch a clip derived from the dissertation of one of my PhD students.

Our Eyes

      Our eyes are probably the most precise and delicate cameras this world has ever seen. The eye has three parts: the cornea, the lens, and the retina. The cornea is like the lens of a camera, while the eye's lens handles focusing. The third part, the retina, is the sensor. The point is this: eyes are essentially light catchers. If we can catch all the light in a given section of this three-dimensional world, and reproduce its interaction with our eyes, then we are able to reconstruct that mini-world. After all, perceiving a 3D world is, in effect, acquiring a combination of light rays. Based on this idea, I went on with my research on VR.


      In 2000, I joined MIT as a PhD candidate, where I worked on a series of problems about how to create VR that can fool the eyes. The first problem is binocular vision. We have two eyes, like most carnivores, and they face front. The good thing is the large overlap between their fields of view, which creates parallax when something is near our eyes: the images of the same object in the two eyes differ slightly, and the difference diminishes as the object moves away from us.
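      To put a number on that intuition: for two eyes separated by a baseline B, each imaging with focal length f, a point at depth Z appears with a horizontal offset (disparity) of roughly d = f * B / Z, so the offset shrinks as the object moves away. A tiny illustration with made-up values:

    # Disparity d = f * B / Z: image offset between the two eyes' views.
    # The numbers below are made-up, purely for illustration.
    focal_length_px = 800.0     # focal length expressed in pixels
    baseline_m = 0.065          # typical human interpupillary distance, ~6.5 cm

    for depth_m in [0.5, 2.0, 10.0, 50.0]:
        disparity_px = focal_length_px * baseline_m / depth_m
        print(f"object at {depth_m:5.1f} m -> disparity {disparity_px:6.1f} px")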


Parallax

      Based on this fact, we can do binocular photography. An obvious approach is to use two fisheye lenses: the setup is inherently binocular and panoramic. Is there anything wrong with it? Well, it works when your head faces straight ahead and the two lens planes are parallel; then you get the correct 3D picture you want.


      However, it runs into trouble if your head turns, even slightly. In this case, the same point in three-dimensional space ends up at different depths in the two lenses, which causes parallax along both the x-axis and the y-axis. Here is the problem: our eyes are separated horizontally, which means we can tolerate horizontal disparity but hardly any vertical disparity.


      Actually, a ready solution already exists, and the idea is simple. We take a number of photos covering 360° around us, then stitch the left halves of all the photos into one panorama and the right halves into another. The technique was invented in 1999 and is probably the one most frequently associated with Google Jump.
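      A crude way to picture the idea (my own simplified sketch; real systems such as Google Jump do far more careful resampling and blending) is to take, from every frame shot while spinning in place, a narrow strip from the left half for one panorama and a matching strip from the right half for the other:

    import numpy as np

    def two_strip_panoramas(frames, offset=40, strip=8):
        """Build two panoramas from frames captured while spinning in place:
        one from narrow strips taken left of the image center, one from strips
        taken right of center. Which panorama serves which eye depends on the
        rotation direction and camera geometry; `offset` and `strip` are
        illustrative values, not tuned."""
        h, w, _ = frames[0].shape
        c = w // 2
        left_half_strips = [f[:, c - offset : c - offset + strip] for f in frames]
        right_half_strips = [f[:, c + offset : c + offset + strip] for f in frames]
        return np.hstack(left_half_strips), np.hstack(right_half_strips)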


The Google Jump Solution

      The solution is flawed, though: the two panoramas are stitched independently, so they may not be consistent with each other when put together.


      So even though each image looks fine on its own, the pair may combine incorrectly. The most popular solution is a dual camera array, like the one you see here.


Dual Camera Array

      As the name suggests, the system uses two cameras instead of one, so that each eye gets the correct 3D view. The difficulty is the seam. Over the last decade, we have made a number of efforts, including "Real-Time 360° 3D Splice". Real-time is the key. Splicing in real time and broadcasting the scene live are ridiculously hard, requiring a whole set of core computer vision algorithms.


      Together with SMG, we did a 360° 3D panoramic live broadcast of War Horse at the end of last year with such a system. The system could shoot me on stage, the performers, and the audience at the same time, and it supports 360° 3D live broadcasting. It was an interesting experience.

Panorama Shooting in War Horse

      After receiving my PhD from MIT, I went straight to the University of Delaware as a professor and continued studying how to fool the eyes. As you remember, binocular vision was the first issue we tackled. We have two eyes, but a sense of depth remains even if you close one of them, enabling you to grab things. This is thanks to the so-called "dynamic focus". Right now, as you watch me speak, your eyes are not locked in focus on my face; they keep refocusing.


      For you at the Beijing Film Academy, this must be familiar: it is just rack focus, continuously changing the focus. However, this is not possible once the focus is fixed in an ordinary camera. This is where the light field saves the day.

Light Field

      I will briefly talk about the idea of light field refocusing. It was proposed by Ted Adelson of MIT, a fellow of both the American Academy of Arts and Sciences and the National Academy of Sciences. He stated that acquiring all the light is enough if you want dynamic refocusing. It is actually how our eyes work: they do not so much refocus as simply capture light. If you are able to collect all the light in a scene, you can, in principle, integrate it into a dynamic system. The double integral involved may look horribly complicated, but it is not qualitatively different from the simple lens formula we learned in middle school: one divided by the object distance plus one divided by the image distance equals one divided by the focal length (1/d_o + 1/d_i = 1/f).
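      For images from a camera array, that integral boils down to the familiar shift-and-add trick: to focus at a chosen depth, shift every camera's image in proportion to that camera's position in the array and average. A minimal sketch, with the array layout and the shift parameter as illustrative assumptions (border wrap-around from np.roll is ignored for brevity):

    import numpy as np

    def refocus(views, positions, shift):
        """Synthetic refocusing by shift-and-add.

        views:     list of images (H x W x 3 arrays), one per camera.
        positions: list of (u, v) camera positions on the array plane.
        shift:     pixels of displacement per unit of camera position;
                   changing it moves the plane that appears in focus.
        """
        acc = np.zeros_like(views[0], dtype=float)
        for img, (u, v) in zip(views, positions):
            # Shift each view in proportion to its camera offset (u, v),
            # scaled by `shift`, then accumulate. Points at the chosen depth
            # line up and stay sharp; everything else blurs out.
            dx, dy = int(round(shift * u)), int(round(shift * v))
            acc += np.roll(np.roll(img.astype(float), dy, axis=0), dx, axis=1)
        return (acc / len(views)).astype(np.uint8)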


      Given that you can recover the properties of all the light reaching your eyes, you can put those rays together and re-render the whole scene. How do we realize this?


      My first attempt came in 2000, soon after I joined MIT. The result was a camera array, which we called The Virtual Eye; it contained 64 (8x8) webcams and could collect and manipulate a huge number of light rays. The system was really expensive, so cutting costs was important.


      As you know, cameras are becoming cheaper and cheaper. For example, you can simply build an array from a set of GoPros; the choice makes sense because GoPros have a built-in synchronizer, which is already quite good. If you want something fancier, and of course more costly, an array of RED cameras is even better, like the ones here at BFA.


      Here is a sample of dynamic refocusing. You can refocus while shooting because it is all real-time. Some flaws are visible where the cameras are spaced too far apart, and they can be fixed in post-processing. The sample is like a director's view: the refocusing effect is already visible, and further post-processing can make the refocusing both real-time and accurate.

Dynamic Focus

      The interesting thing is that, instead of an array of GoPros, you can also build an array of light field cameras and get a light field panorama, which supports 360° refocusing. It is based on a number of fixed viewpoints; you can switch between them and change the focus as you go. Now we have two of the essential features of human eyes: binocular vision and dynamic focus.

Light Field Panorama

      In 2015, after teaching in the US for ten years, I joined a new university in China, ShanghaiTech University; many such newcomers are emerging in China, and it is really exciting. Here I studied yet another essential feature of human eyes: motion parallax.


      Consider that you are watching me talk. It is a bit boring because you are stuck at a fixed viewpoint, either to the left or the right. Can you walk around, or zoom in and out? The answer is yes, and it was accomplished in the movie The Matrix: they placed 120 cameras around Keanu Reeves and shot, which was in effect capturing and reconstructing the light field around him.

A classic “bullet time” scene in The Matrix (1999)

      Last year, I built a "dome" in Shanghai just like the one used for The Matrix. The difference is that we use 140 cameras: a hybrid system of 60 moving cameras with relatively low resolution and 80 fixed cameras with higher resolution but lower frame rates. We call it the ShanghaiTech-Plex VR Dome.


ShanghaiTech-Plex VR Dome

      With this camera system, we can reconstruct VR scenes at such high resolution that even the finest folds of my clothes are visible, just as with Keanu Reeves in The Matrix. Here is one of our samples: Lan Tian, a Peking Opera performer. The two photos show Lan Tian as himself, a contemporary person, and as Huang Zhong, the character he plays.


Makeup of Peking Opera

      As you can see, our system has captured the folds and even their reflections at extremely high resolution. It also works dynamically. Here is Ji Yu, one of my students, playing a harp.

Harp Playing

      In these images, both figures are real while the backgrounds are completely virtual, including the chairs and the sea. You can watch them from any angle you like. Everyone who witnesses it cannot help reaching out to touch them. It is just that real.


      Over the next two days, you can experience the same thing in our exhibition area with our VR and AR headsets. Now let me recap a little. We have talked about three primary functions of human eyes: binocular vision, dynamic focusing, and motion parallax.


      After joining ShanghaiTech, I started a business, Plex VR, in response to President Xi's call. You may find something interesting on our website, Plex-vr.com, under the Flat Lux tab.


      Now I will talk about a few of our new projects, which are very interesting. The first is the World VR Concert, filmed with the Junior School of the Juilliard School. Let's hear a piece of music.


      We photographed and recorded performers of different countries, races, and sexes around the world with our Dome system and put them into this VR concert. Note that they were not together when they played; the concert was composed only after we combined them dynamically. With this Dome, you can join the concert as well.


      You can walk up to them, play along with them, watch them from any angle, and move around as you wish. It is also available in our exhibition.


      The second project is Light Field E-Commerce, in cooperation with Alibaba, which showcased the technique at the last Double 11 Shopping Festival. Here is a sample: an LV purse.

Light Field E-Commerce – LV Purse

      Armed with the light field, we can go beyond geometry and reconstruct fine surface textures, including the lighting. Here is our best recent result: a Tang Sancai figurine, a national treasure.


      Tang Sancai is notorious for being extremely hard to reconstruct because of its complicated gloss. To our great relief, we managed it with the light field; the frame rate exceeds 90 FPS for both eyes.

Tang Sancai Experience

      You can see all the delicate variations of light and texture on it, and even across the whole light space. This is very important: it is necessary for photo-realistic rendering, and it is now possible with the light field.


      The third is a short VR documentary we filmed at ShanghaiTech. How could I skip this at a film academy? It is called Dream in Midsummer, and it is about a girl from Xuanwei, Yunnan, pursuing her dream of becoming a teacher in Shanghai.


      I want to end this lecture with a quote from Michel Foucault, my favorite philosopher. He said that he was not a prophet, but that what he did was to open windows where there used to be nothing but walls, so that light could shine in. Thank you for listening.