Check out our paper: Spatial Visibility and Temporal Dynamics: Revolutionizing Field of View Prediction in Adaptive Point Cloud Video Streaming
Field-of-View (FoV) adaptive streaming significantly reduces the bandwidth requirement of immersive point cloud video (PCV) by transmitting only the points visible in a viewer's FoV. Traditional approaches focus on trajectory-based 6 degree-of-freedom (6DoF) FoV prediction; the predicted FoV is then used to calculate point visibility. Such approaches do not explicitly consider the video content's impact on viewer attention, and the conversion from FoV to point visibility is often error-prone and time-consuming. We reformulate the PCV FoV prediction problem from the cell visibility perspective, allowing precise decision-making about transmitting 3D data at the cell level based on the predicted visibility distribution. We develop a novel spatial visibility and object-aware graph model that leverages historical 3D visibility data and incorporates spatial perception, neighboring-cell correlation, and occlusion information to predict future cell visibility. Our model significantly improves long-term cell visibility prediction, reducing the prediction MSE loss by up to 50% compared to state-of-the-art models while maintaining real-time performance (more than 30 fps) for point cloud videos with over 1 million points.
AR/VR applications are growing rapidly, and immersive video streaming, such as 360-degree and point cloud video, is essential for their widespread adoption. These videos require significantly higher bandwidth than traditional 2D videos; for example, a point cloud video with 300k to 1M points per frame demands 1.08 Gbps to 3.6 Gbps. One solution is Field-of-View (FoV) adaptive streaming, which transmits only the video content within the viewer's viewport, reducing bandwidth by as much as six times for 360-degree videos.
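As a rough sanity check (our assumption, not a figure from the paper), the quoted range follows from assuming about 120 bits per point, e.g., three 32-bit coordinates plus 24-bit RGB color, at 30 frames per second:

$$
3\times10^{5}\ \tfrac{\text{points}}{\text{frame}} \times 120\ \tfrac{\text{bits}}{\text{point}} \times 30\ \tfrac{\text{frames}}{\text{s}} \approx 1.08\ \text{Gbps},
\qquad
10^{6} \times 120 \times 30 = 3.6\ \text{Gbps}.
$$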
However, accurately predicting the viewer's future FoV is challenging, since smooth streaming requires a buffer of 2 to 5 seconds. Current methods typically predict the viewport from past trajectories, which can lead to errors that accumulate over long horizons. We propose to directly predict cell visibility using the viewer's trajectory, historical visibility, and spatial features of the objects. This approach reduces error amplification and leverages the continuity of visibility changes. Our framework improves long-term visibility prediction accuracy by up to 50% in real time on real point cloud video and viewport trajectory datasets.
6DoF FoV Demonstration. As Fig. 1(a) shows, the content within the viewing pyramid, bounded by the near and far planes, is inside the viewport, and any content outside the pyramid cannot be seen by the viewer. Furthermore, even points inside the viewport are not visible if they are occluded by other points. So if a viewer watches the point cloud content from the side, as in Fig. 1(a) (the actual view is shown in Fig. 1(b)), only the highlighted part in Fig. 1(a) is visible. If the point cloud is divided into 3D cells as in Fig. 1(c), only the cells covering the visible points need to be transmitted.
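To make the viewport test concrete, below is a minimal sketch (not the paper's implementation) of culling 3D cells against the viewing frustum, assuming a standard 4x4 view-projection matrix derived from the 6DoF pose and using cell centers as a cheap proxy for the full cell volume. Cells that pass this test may still contain occluded points, which is exactly why visibility itself must be estimated.

```python
import numpy as np

def cells_in_viewport(cell_centers, view_proj):
    """Rough frustum test: keep a cell if its center, projected into clip
    space, lies inside the canonical view volume (|x|, |y|, |z| <= w).

    cell_centers: (N, 3) array of cell-center coordinates.
    view_proj:    (4, 4) combined view-projection matrix for the 6DoF pose.
    Returns a boolean mask of shape (N,).
    """
    homo = np.hstack([cell_centers, np.ones((len(cell_centers), 1))])  # (N, 4)
    clip = homo @ view_proj.T                                          # (N, 4)
    w = clip[:, 3:4]
    inside = np.all(np.abs(clip[:, :3]) <= np.abs(w), axis=1) & (w[:, 0] > 0)
    return inside
```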
The curve on the top is the viewer's 6DoF viewport trajectory, illustrating one of the 6DoF coordinates $(x, y, z, \psi, \theta, \phi)$ at each frame time $t$ for a point cloud sequence set $\mathcal{P}^{h+f}$. We partition the whole space into 3D cells and model the cells as a 3D grid-like graph. For each frame $t$, based on the 6DoF coordinates and $P^i$, we can calculate, for each cell $i$, the total number of points $O_i$, the number of visible points $V_i$, and the percentage of the cell volume within the viewport (the cell-based viewport feature) $F_i$. We use $G_i=[O_i, F_i, V_i, E_i]$ to represent the node feature for each frame, where $E_i$ contains other features such as the coordinates of each cell. We use a temporal bidirectional GRU model and a spatial transformer-based graph model to capture patterns in viewer attention and cell visibility. The GRU model captures each node's temporal pattern over time, which is encoded into the hidden state $S_h$. In the graph model, each node aggregates its neighbors' information and the hidden state from the GRU, as the dashed lines show. To keep the figure simple, we only show the graph attention update on $G_1$. After we obtain $S_{h+1}$ and $S'_{0}$ from the bidirectional GRU, an MLP model predicts the cell visibility at the target timestamp $t^{h+f}$. Before the final output, $O^{h+f}$ is applied as a mask to obtain the predicted visibility $\hat{V}^{h+f}$, since in the streaming system the server already has the point cloud at frame $h+f$. We can optionally also predict the overlap ratios between the viewport and the cells, $\hat{F}^{h+f}$ (which essentially predicts the viewport). $\hat{Y}^{h+f}$ denotes the prediction output, indicating either $\hat{V}^{h+f}$ or $\hat{F}^{h+f}$.
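To summarize the pipeline, here is a hypothetical PyTorch sketch; the module names, dimensions, and the use of `nn.MultiheadAttention` as a stand-in for the transformer-based graph attention layer are our assumptions, not the released code.

```python
import torch
import torch.nn as nn

class CellVisibilityPredictor(nn.Module):
    """Sketch of the described pipeline: a bidirectional GRU encodes each
    cell's feature history, neighboring cells exchange information through
    attention, and an MLP head predicts visibility at the target frame."""

    def __init__(self, feat_dim=4, hidden=64, heads=4):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, g, occupancy_mask, adj_mask=None):
        # g: (num_cells, h, feat_dim) history of [O_i, F_i, V_i, E_i] per cell
        # occupancy_mask: (num_cells,) 1.0 if the cell is occupied at frame h+f
        # adj_mask: optional (num_cells, num_cells) bool, True = not neighbors
        _, s = self.gru(g)                      # s: (2, num_cells, hidden)
        s = torch.cat([s[0], s[1]], dim=-1)     # forward/backward hidden states
        s = s.unsqueeze(0)                      # (1, num_cells, 2 * hidden)
        s, _ = self.attn(s, s, s, attn_mask=adj_mask)
        v_hat = torch.sigmoid(self.head(s.squeeze(0))).squeeze(-1)
        return v_hat * occupancy_mask           # empty cells are masked to zero

# Example with hypothetical shapes: 512 cells, 30 history frames, 4 features.
model = CellVisibilityPredictor()
g = torch.rand(512, 30, 4)
occ = (torch.rand(512) > 0.5).float()
v_hat = model(g, occ)                           # per-cell visibility in [0, 1]
```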
In this illustration, given the viewer's 6DoF pose and the point cloud frame partitioned into 3D cells, we obtain the occupancy, viewport, and visibility features $o_i, f_i, v_i$ for each node. There are 8 nodes in total; the 5 colored nodes are occupied by points, each with a different number of points. Node $n_6$ has 10 points in total, but some are occluded by other points, so its visibility is 0.4. For nodes without points, the visibility feature is set to 0.
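As a small illustration of the bookkeeping (our assumption, not the paper's code), the per-cell occupancy $o_i$ and visibility $v_i$ can be accumulated from a per-point visibility mask as sketched below; the viewport-overlap feature $f_i$ would instead come from intersecting each cell's volume with the viewing frustum.

```python
import numpy as np

def build_node_features(points, visible, cell_size, grid_dims, origin):
    """Per-frame node features for a regular 3D cell grid.

    points:  (P, 3) point coordinates.
    visible: (P,) boolean, True if the point is in the viewport and unoccluded.
    Returns o (point count per cell) and v (visible fraction, 0 for empty cells).
    """
    idx3d = np.floor((points - origin) / cell_size).astype(int)
    idx3d = np.clip(idx3d, 0, np.array(grid_dims) - 1)
    flat = np.ravel_multi_index(idx3d.T, grid_dims)   # cell index of each point

    n_cells = int(np.prod(grid_dims))
    o = np.bincount(flat, minlength=n_cells).astype(float)
    vis = np.bincount(flat, weights=visible.astype(float), minlength=n_cells)
    v = np.divide(vis, o, out=np.zeros_like(o), where=o > 0)
    return o, v
```

Under this bookkeeping, $n_6$ in the example would get $o_6 = 10$ and $v_6 = 0.4$ when 4 of its 10 points survive the occlusion test.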
Method | 333 ms | 1000 ms | 2000 ms | 5000 ms |
---|---|---|---|---|
LR | 0.0043 | 0.0102 | 0.0173 | 0.0229 |
TLR | 0.0028 | 0.0085 | 0.0158 | 0.0223 |
M-MLP | 0.0036 | 0.0093 | 0.0137 | 0.0232 |
M-LSTM | 0.0026 | 0.0083 | 0.0126 | 0.0146 |
Ours | 0.0040 | 0.0100 | 0.0110 | 0.0120 |
Table 1: MSE of Visibility Prediction by Different Methods at Different Prediction Horizons
In this table, we report the MSE losses of different methods across various prediction horizons. For short-term predictions (less than 1000 ms), our model maintains a relatively consistent cell visibility prediction loss. More importantly, for long-term cell visibility prediction, our model reduces the MSE loss by up to 20% compared to all state-of-the-art methods. This improvement is significant for on-demand point cloud video streaming with a target buffer length of around 5 seconds. Our model effectively addresses the error amplification issue of trajectory-based methods and captures the temporal and spatial patterns in viewer attention and cell visibility.
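For concreteness, at the 5000 ms horizon in Table 1 the reduction is

$$
\frac{0.0146 - 0.0120}{0.0146} \approx 18\% \ \text{(vs. M-LSTM, the strongest baseline)},
\qquad
\frac{0.0229 - 0.0120}{0.0229} \approx 48\% \ \text{(vs. LR)},
$$

which is consistent with the up-to-20% figure above; the up-to-50% figure in the abstract presumably refers to the comparison against trajectory-based baselines.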
*Viewport prediction loss at different prediction horizons, from 10 frames (333 ms) to 150 frames (5000 ms).*
Our model consistently outperforms all state-of-the-art baselines across all prediction horizons, demonstrating the superiority of our spatial perception method over traditional trajectory-based approaches. The performance gap widens at the longer prediction horizon of 5 seconds (150 frames).
In this paper, we introduce a novel spatial-based FoV prediction approach designed to predict long-term cell visibility for PCV. Our method leverages both the spatial and temporal dynamics of PCV objects and viewers, outperforming existing state-of-the-art methods in prediction accuracy and robustness. By integrating transformer-based GNNs and graph attention networks, our model efficiently captures complex relationships between neighboring cells with a single graph layer. This approach overcomes the limitations of trajectory-based FoV prediction by incorporating the full spatial context, resulting in more accurate and stable predictions. Our spatial-based FoV prediction model presents a promising solution for long-term 6DoF FoV prediction, immersive video streaming, and 3D rendering. We will make the code available to support further research and development in this area.