
Test with known camera pose #10

Closed
JingleiSHI opened this issue Jul 23, 2019 · 13 comments

@JingleiSHI

Hi,
Thank you very much for your excellent work and for open-sourcing your code!
I have followed your tutorial and got really amazing synthesis results. However, when I test on some other light field data, COLMAP doesn't seem to work correctly and errors occur. To avoid the problems caused by COLMAP, I would like to skip the img2poses step and feed camera poses directly to the following step. Is there any way to do that? (I found that your code applies some processing to the estimated poses, such as transposes, but there aren't many comments explaining it. Could you please explain the camera pose processing that happens after img2poses?)
As for the other test data in your paper, I am very interested in their outputs, but I didn't find a download link. Are these data available to the public?
Thank you very much for your attention.
Yours sincerely,
Jinglei

@bmild
Collaborator

bmild commented Jul 23, 2019

I just added the four test scenes from Figure 9 (airplants, pond, fern, t-rex) to the Google Drive supplement; you can find them here now:
https://drive.google.com/open?id=1Xzn-bRYhNE5P9N7wnwLDXmo37x7m3RsO

Here's an explanation of the poses_bounds.npy file format. This file stores a numpy array of size Nx17 (where N is the number of input images). You can see how that is loaded in the three lines here. Each row of length 17 gets reshaped into a 3x5 pose matrix and 2 depth values that bound the closest and farthest scene content from that point of view.

The pose matrix is a 3x4 camera-to-world affine transform concatenated with a 3x1 column [image height, image width, focal length] along axis=1.
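
For reference, loading the file back looks roughly like this (a sketch of the format described above, not the repo's exact code; variable names are just illustrative):

```python
import numpy as np

arr = np.load('poses_bounds.npy')       # shape (N, 17)

# First 15 entries of each row: a 3x5 matrix = [3x4 camera-to-world | h, w, focal].
poses = arr[:, :15].reshape(-1, 3, 5)   # shape (N, 3, 5)
c2w = poses[:, :, :4]                   # 3x4 camera-to-world transforms
hwf = poses[:, :, 4]                    # [height, width, focal] per image

# Last 2 entries of each row: near and far depth bounds for that view.
bounds = arr[:, 15:]                    # shape (N, 2)
```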

The rotation (first 3x3 block in the camera-to-world transform) is stored in a somewhat unusual order, which is why there are the transposes. From the point of view of the camera, the three axes are
[ down, right, backwards ]
which some people might consider to be [-y,x,z].

So the steps to reproduce this should be (if you have a set of 3x4 poses for your images, plus focal lengths and close/far depth bounds):

  1. Make sure your poses are in camera-to-world format, not world-to-camera.
  2. Make sure your rotation matrices have the columns in the same order I use (downward, right, backwards).
  3. Concatenate each pose with the [height, width, focal] vector to get a 3x5 matrix.
  4. Flatten each of those into 15 elements and concatenate the close/far depths.
  5. Concatenate each 17d vector to get a Nx17 matrix and use np.save to store it as poses_bounds.npy.
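
Putting those steps into code, here is a minimal sketch (assuming you already have 3x4 camera-to-world poses with the rotation columns in the [down, right, backwards] order; the function and variable names are just illustrative, not from the repo):

```python
import numpy as np

def build_poses_bounds(c2w_list, hwf_list, bounds_list, out_path='poses_bounds.npy'):
    """c2w_list: 3x4 camera-to-world matrices (columns already [down, right, backwards]),
    hwf_list: [height, width, focal] per image, bounds_list: [close, far] depths per image."""
    rows = []
    for c2w, hwf, bds in zip(c2w_list, hwf_list, bounds_list):
        c2w = np.asarray(c2w, dtype=np.float64)                 # (3, 4)
        hwf = np.asarray(hwf, dtype=np.float64).reshape(3, 1)   # (3, 1)
        pose = np.concatenate([c2w, hwf], axis=1)               # (3, 5)
        row = np.concatenate([pose.ravel(), np.asarray(bds, dtype=np.float64)])  # (17,)
        rows.append(row)
    arr = np.stack(rows, axis=0)                                # (N, 17)
    np.save(out_path, arr)
    return arr
```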

Hopefully that helps explain my pose processing after colmap. Let me know if you have any more questions.

@JingleiSHI
Author

Hi,
Thank you very much for your quick response and explanation. I have constructed a camera pose matrix with known parameters, but there is still one point I don't understand very well: you wrote 'From the point of view of the camera, the three axes are
[ down, right, backwards ]'.
Do you mean that, from the camera's viewpoint, its XYZ axes correspond respectively to the directions [down, right, backwards]?
If I hold a camera horizontally, with its XYZ axes in the directions [right, up, backward (from camera lens to camera sensor)], do I then need to rotate the XY plane 90 degrees around the Z axis to get the correct rotation matrix? Thank you for your attention.
Yours sincerely,
Jinglei

@bmild
Collaborator

bmild commented Jul 26, 2019

If I hold a camera horizontally, with its XYZ axes in the directions [right, up, backward (from camera lens to camera sensor)], do I then need to rotate the XY plane 90 degrees around the Z axis to get the correct rotation matrix?

That's right. 90 degrees clockwise.
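
In code, that column swap amounts to something like the following sketch (assuming your 3x4 camera-to-world matrices use the [right, up, backwards] convention you described; the helper name is made up):

```python
import numpy as np

def to_down_right_back(c2w):
    """Reorder the rotation columns of a 3x4 camera-to-world matrix from
    [right, up, backwards] to [down, right, backwards]; translation is unchanged."""
    c2w = np.asarray(c2w, dtype=np.float64)
    R, t = c2w[:, :3], c2w[:, 3:4]
    R_new = np.stack([-R[:, 1], R[:, 0], R[:, 2]], axis=1)  # [-up, right, back]
    return np.concatenate([R_new, t], axis=1)
```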

@JingleiSHI
Author

Hi,
Thank you very much. Now I have another question about your code DeepIBR, in which 1./inf_depth and 1./close_depth are treated as min_disp and max_disp in the function render_mpi_homogs. As far as I know, the relationship between depth and disparity should be disparity/baseline = focal_length/depth. Could you please explain in more detail why you did it this way? Thank you very much for your attention.

Yours sincerely,
Jinglei

@bmild
Collaborator

bmild commented Jul 29, 2019

"Disparity" is a bit of an overloaded term that can also mean inverse depth. The reprojection math correctly accounts for focal length as well (such as here).

@JingleiSHI
Author

Hi,
Thanks a lot for your answer, and sorry for the late response. In fact, what you do is map the depth values to "pseudo-disparity" (inverse depth) values, divide that disparity range evenly, and then map it back to depth. Thanks to this depth-disparity-depth conversion, we don't have to compute the true disparity values. Is that right?
Another question about this project concerns the units of the parameters. I found that COLMAP estimates a focal length of about 4*10^3 for your demo scene, but I don't think millimeters is the unit, since the value would be too large, and 10^-5 m is not a usual unit. Could you clarify the units of each parameter in poses_bounds.npy? Thank you very much for your attention.

Regards,
Jinglei

@bmild
Collaborator

bmild commented Aug 5, 2019

Focal length is in pixels, to fit with the equation you mentioned earlier:
(disparity in pixels)/(baseline in meters) = (focal_length in pixels)/(depth in meters).

This is what is output by COLMAP and other camera calibration code when estimating intrinsics for a pinhole camera model.
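
For example, taking the ~4000 pixel focal length from the demo scene and made-up baseline/depth values:

```python
focal_px = 4000.0   # focal length from COLMAP, in pixels
baseline = 0.05     # spacing between two cameras, in meters (hypothetical)
depth = 2.0         # depth of a scene point, in meters (hypothetical)

disparity_px = focal_px * baseline / depth   # = 100 pixels of displacement
```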

@JingleiSHI
Author

Hi @bmild ,
Thanks a lot! I have changed the unit of the focal length to pixels (camera positions and depths in meters), and it works well.
I have noticed that, according to the equation disparity/baseline = focal_length/depth, the disparity value will always be positive, which means the pixel displacements are all in the same direction (in PSV.mp4 of the demo scenes, we can observe that the foreground objects from all input views converge together). However, for the datasets in the 4D Light Field Benchmark, the data has both negative and positive disparity values due to an offset term in the equation. In that case, the background objects (with negative disparity values) in the PSV will never converge, and therefore the final synthesized view has a bad background. Have you ever encountered this kind of problem, and what do you think about it?
Thank you for your attention and I am looking forward to your response.

Yours sincerely,
Jinglei

@bmild
Collaborator

bmild commented Aug 5, 2019

I've only worked with extrinsic/intrinsic camera poses and corresponding depth ranges. From that webpage, I can't tell what units the disparity ranges are in -- but if you manage to convert those back to real-world depths and know the spacing between the cameras, you should be able to reconstruct the right pose matrices, I think.

@JingleiSHI
Author

JingleiSHI commented Aug 12, 2019

Hi,
Thank you very much for your answer. I have a question about the paper: I noticed that the MPI prediction network has 5 output channels, and in your paper you explain that the output of the MPI prediction network contains 'an opacity α for each MPI coordinate, as well as a set of 5 color selection weights that sum to 1 at each MPI coordinate'. Shouldn't 1 opacity map and 5 color selection weights give 6 output channels? Could you please explain in more detail how to get RGBα from the output of the network (I think the 5 channels should be disparity, alpha, and RGB, but I am not really sure...)?
Regards,
Jinglei

@bmild
Collaborator

bmild commented Aug 22, 2019

This is a subtle point. We output 1 channel for opacity (put through a sigmoid to get a value in [0, 1]).

To get the blending weights, we take the other 4 channels, append an all-zero channel, then pass through a softmax to get 5 numbers that sum to one. (You could just output 5 channels and softmax, but this makes the function bijective. It probably does not make much difference in practice.)
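
As a numpy sketch of that output head (assuming the opacity channel comes first, which is just a convention for this example and not necessarily the network's actual channel order):

```python
import numpy as np

def split_mpi_outputs(raw):
    """raw: network output with 5 channels in the last axis."""
    # Channel 0 -> opacity via sigmoid, in [0, 1].
    alpha = 1.0 / (1.0 + np.exp(-raw[..., :1]))
    # Channels 1-4 plus an appended all-zero channel -> softmax -> 5 weights summing to 1.
    logits = np.concatenate([raw[..., 1:5], np.zeros_like(raw[..., :1])], axis=-1)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    weights = exp / exp.sum(axis=-1, keepdims=True)
    return alpha, weights
```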

@JingleiSHI
Author

Hi @bmild
Thank you very much for your explanations; they are very clear and helpful. Since all my questions about LLFF have been answered, I'll close this issue.

@Zakaria1405

Can someone explain how to extract only the bounds (bds.npy) for each image that I have? How can I do this, and what exactly do I need?
