
Choosing the Right Motion Capture Solution: Sensor-Based Suits and Gloves vs. Camera-Based AI Vision
In the rapidly evolving world of motion capture, choosing the right technology can be challenging. When most people hear of motion capture, they think of the optical ping pong ball markers on giant sound stages they have seen in behind-the-scenes footage from big productions. However, the 95% of creators who can’t afford or don’t have space for a system like that struggle with a completely different challenge: the choice between sensor-based motion capture suits, i.e. IMU suits that use inertial measurement units, and camera-based systems that use AI Vision technology to do the motion capture. The space is fast-moving and technically complex, so it’s no wonder that many are unsure which system is right for them and their project. Let’s dive into the key differences to help you make an informed decision.
IMU-Based Motion Capture Suit & Gloves: Precision and Freedom
Some current companies in this space: Rokoko, Xsens, Perception Neuron.
IMU (inertial measurement unit) motion capture, which relies on suits and sensors, has been around for a long time. Improvements in AI systems get most of the attention today, but the underlying sensor-based technology also keeps receiving significant updates (WiFi, sensor calibration, durability, etc.).
Here are some key pros and cons of IMU motion capture:
Pros:
- Untethered / No Occlusion: The main advantage of IMU systems is that you are not restricted to a recording volume dictated by the camera(s), so there is no risk of “walking out of shot”. This advantage over any camera-based solution will never go away. Even when you are within frame, the camera’s need for line of sight can be problematic when limbs are blocked by the body, or when multiple people are dancing, fighting, etc. Occlusion/line-of-sight issues are non-existent with IMU systems.
- Unlimited Usage: Record as much as you need without worrying about additional costs. This freedom is invaluable when you are pursuing a very specific result or expression and don’t want to hesitate in the number of recordings you do.
- Real-Time Feedback: See your results instantly, making adjustments on the fly and ensuring accuracy.
- Comprehensive Capture: Easily integrate body, finger, and facial capture in one streamlined system. The lightweight nature of the sensor data and uninterrupted capture because of no occlusion also means that the data has no sudden drops and is completely fluid throughout long takes.
Cons:
- Drift over time: With an IMU system alone, everything after the initial calibration pose is “pure math” - everything is relative to the starting point and is not corrected by an “absolute position” like the one you would get from a camera. The data can therefore become increasingly inaccurate over time in the “global space”, meaning the position in the virtual space. If you get up from a chair, run 20 meters forward and return, you will not sit down in exactly the same spot in the 3D scene. Rokoko built the Coil Pro to mitigate this exact phenomenon, but it will always be an issue with IMU only.
- Magnetic interference: IMUs use a gyroscope, an accelerometer, and a magnetometer to estimate their orientation and position. The magnetometer is sensitive to magnetic interference, meaning that a large piece of metal near the sensor can make it drift and become inaccurate (like a compass). There are many ways to mitigate this (for example, relying entirely on the gyroscope and accelerometer when interference is detected), but it is an inherent challenge for IMU systems.
- Wearables needed: This isn’t always an issue, but you can have scenarios where it’s not feasible to wear a suit - for example if the actor is being filmed in live action while also being tracked with motion capture. There’s no way to do IMU capture without some kind of wearable.
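The drift described in the cons above can be illustrated with a small simulation. This is a minimal sketch, not how any particular suit computes position: it assumes a single accelerometer axis with a small constant error (the `bias` value and the sample rate are made-up numbers) and integrates twice to get position, the way pure dead reckoning would.

```python
def integrate_position(accelerations, dt):
    """Dead reckoning: integrate acceleration twice to get position."""
    velocity = 0.0
    position = 0.0
    for a in accelerations:
        velocity += a * dt
        position += velocity * dt
    return position


def simulate_drift(seconds, sample_rate_hz=100, bias=0.001):
    """Position error (meters) for a sensor that is actually standing still.

    `bias` is a hypothetical constant accelerometer error in m/s^2.
    """
    dt = 1.0 / sample_rate_hz
    # True acceleration is zero, so every sample contains only the bias.
    samples = [bias] * int(seconds * sample_rate_hz)
    return integrate_position(samples, dt)


for t in (10, 60, 300):
    print(f"after {t:>3} s: {simulate_drift(t):.3f} m of drift")
```

Because the error is integrated twice, it grows roughly with the square of the recording time, which is why an absolute position reference helps most on long takes. Real systems apply additional corrections that this sketch leaves out.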
AI Vision - Camera-based Motion Capture: Convenience and Innovation
Single-Cam / Monocular Systems:
Some current companies in this space: Meshcapade, QuickMagic, Cascadeur, Radical, Marionette, Wonder Dynamics (Flow Studio), Rokoko Vision.
This is the fastest-moving category in terms of quality improvements, but also the one where the fundamental problems are hardest to solve - some of them downright unsolvable. With a single angle you will always have issues with line of sight. You can do your best to estimate the position of the limbs/details the camera can’t see, but then you lose details and are not actually “capturing”, but rather guessing.
That said, there are many use-cases where it makes sense to consider a single-cam solution for its ease of use and low cost.
Pros:
- Easy to set up and affordable: If you have the required space to have your full body in frame throughout the take, all you need is a smartphone or other camera and you are good to go. Most solutions are very affordable in this category too, although they all require a subscription and added consumption cost, so you always have to be mindful.
- No suit/wearable required: Depending on the situation it is not always feasible to have to put on a wearable device or full-body suit, which can be a key advantage to a vision AI system.
- Riding a wave of larger AI advancements: Some of the biggest drawbacks of these solutions should only be temporary because of the expected advancements in AI. For example, the fact that these systems currently offer no real-time visualization, so you have to wait minutes to see your take, is a dealbreaker for many creators. Big upgrades to processing and compute power should help (at least partially) solve that and other issues.
Cons:
- Limited depth perception: One of the key challenges to a single-angle/monocular approach is the lack of sense of depth. Small and big movements to and away from the camera will always be flawed to some extent because the data simply isn’t available from just one angle.
- Occlusion issues: Another unsolvable issue is line of sight. You can’t capture what you can’t see, and there will always be part of the body not in view of the camera where estimations are needed. This also extends to high impact situations (punches, falls, etc) where the camera just won’t be able to capture the movements accurately.
- Lack of good finger and face capture: The nuances of finger movements or facial expressions are incredibly detailed, but also hold a crucial part of a performance. If you lose those details, you only get the generic movements and lose the magic and individuality of performances - basically the whole reason for doing motion capture and not just using stock libraries. Capturing these minute details with a single camera that is far enough away to see your full body is simply not feasible - at least not with today’s technology.
- Usage Costs & Lack of Realtime Visualization: You would expect both of these issues to be addressed to some extent over time as the tech advances. However, not seeing your result until minutes after the take is a huge problem today. On top of that, it takes a lot of freedom away if you have to be mindful of how many takes you do of a movement because of the added usage costs. As the solutions are today, these two things create a downward spiral: because you can’t see the result right after your recording, you need to do more takes to be sure you have what you need. That in turn means you are charged more usage fees for the extra “safety takes”.
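The depth-perception limitation listed above can be made concrete with the textbook pinhole camera model. The sketch below assumes an ideal camera with no lens distortion and a hypothetical focal length of 1000 pixels; it is an illustration of the geometry, not of how any of the products mentioned work internally.

```python
def project(point, focal_length_px=1000.0):
    """Project a 3D point (x, y, z) in camera coordinates to 2D pixel offsets."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal_length_px * x / z, focal_length_px * y / z)


# Two hypothetical hand positions on the same viewing ray.
hand_near = (0.25, 0.5, 2.0)  # 2 m from the camera
hand_far = (0.5, 1.0, 4.0)    # 4 m from the camera

print(project(hand_near))
print(project(hand_far))
```

Both points land on the same pixel, so a single image cannot tell them apart. A monocular system has to fall back on estimates (for example, assumptions about body proportions) to decide how far away the hand is.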

Multi-Camera Systems:
Some current companies in this space: Vicon markerless, Move.ai, Loom, Rokoko Vision Dual-cam.
Having multiple cameras/angles to your vision AI capture makes a huge difference in quality. You suddenly have a good sense of depth and a much better way to mitigate occlusion issues.
It seems certain that these multi-cam markerless vision AI systems will eventually replace optical marker-based systems entirely. However, multiple cameras also add a lot more complexity and cost, and the big advantages of the single-camera approach in ease of use, speed and cost are no longer as clear.
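To see why a second angle gives you depth, here is a simplified top-down (2D) triangulation sketch. It assumes two ideal, noise-free cameras at known positions that each report only the direction in which they see a point; the camera and actor coordinates are made up for illustration, and real multi-camera systems work in 3D with calibration and noise handling that are not shown here.

```python
import math


def bearing(camera, point):
    """Angle (radians) at which a camera at `camera` sees `point`, top-down view."""
    return math.atan2(point[1] - camera[1], point[0] - camera[0])


def triangulate(cam_a, angle_a, cam_b, angle_b):
    """Intersect two viewing rays to recover a 2D position."""
    dax, day = math.cos(angle_a), math.sin(angle_a)
    dbx, dby = math.cos(angle_b), math.sin(angle_b)
    denom = dax * dby - day * dbx
    if abs(denom) < 1e-9:
        raise ValueError("rays are parallel; depth cannot be recovered")
    # Solve cam_a + t * dir_a == cam_b + s * dir_b for t.
    t = ((cam_b[0] - cam_a[0]) * dby - (cam_b[1] - cam_a[1]) * dbx) / denom
    return (cam_a[0] + t * dax, cam_a[1] + t * day)


# Hypothetical layout: two cameras 4 m apart, actor standing in front of them.
cam_a, cam_b = (0.0, 0.0), (4.0, 0.0)
actor = (1.5, 3.0)

estimate = triangulate(cam_a, bearing(cam_a, actor), cam_b, bearing(cam_b, actor))
print(estimate)
```

Each camera alone only knows the direction to the actor; the position comes from where the two directions cross. If one camera loses sight of the point, the intersection is no longer available, which is why more cameras also help with occlusion.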
Pros:
- Improved occlusion handling and depth perception: This is the main game-changer from single-cam to multi-cam vision AI. The two inherent and unsolvable issues with only having one angle are far less problematic with multiple cameras.
- No need for suits/wearables: As with all vision AI solutions, the fact that you don’t need wearables can be very important and a big advantage in some cases.
- Riding a wave of AI advancements: The usage costs (especially to get results in real time) are quite massive today if you have 4, 8 or 12 high-res cameras running simultaneously and have to process all that data. As AI and tech more broadly improve, this should also be addressed to some extent.
Cons:
- Time-consuming and complex setup: While not needing to put on a suit can be nice, don’t be fooled - there is a LOT more hardware to deal with in a multi-cam vision AI setup than with an IMU suit. You have many cameras that need to be mounted on tripods and run simultaneously without running out of battery (or with a bunch of cables). In practice it just isn’t that different from today’s optical marker-based stages in terms of setup.
- Needs a large and dedicated studio space: For multiple cameras to see your full body at all times, you need a lot of space. Ideally you would use the same space again and again and leave the tripods and equipment standing so you don’t have to do long setups every time. But then suddenly you actually need a dedicated studio space, which many creators don’t have.
- Lack of good finger and facial capture: Even with many high-res cameras, finger movements and facial expressions are just not possible to capture accurately from several meters away. This means that you need additional solutions for those parts, which have to be synced with your body capture, and then suddenly you have even more tech and complexity in play.
- Usage costs: These solutions are expensive. After 1 month you can easily have spent the same amount as it will cost you to buy a full performance capture bundle from Rokoko (body+fingers+face) - and the usage costs of course don’t stop after that. They keep going, while the IMU system doesn’t cost more with increased usage. You have to be really mindful of that if you plan for the long term.
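The usage-cost argument is simple arithmetic, and it can help to run it with your own numbers. The sketch below compares a one-time purchase with a recurring usage fee; the prices in the example are placeholders in arbitrary currency units, not actual prices of any product mentioned here.

```python
import math


def cumulative_costs(months, upfront_cost, monthly_usage_cost):
    """Return (one_time_purchase_total, usage_based_total) after `months`."""
    return upfront_cost, monthly_usage_cost * months


def break_even_month(upfront_cost, monthly_usage_cost):
    """First whole month in which usage-based spending reaches the upfront price."""
    if monthly_usage_cost <= 0:
        raise ValueError("monthly usage cost must be positive")
    return math.ceil(upfront_cost / monthly_usage_cost)


# Placeholder numbers only; substitute real quotes for your own comparison.
UPFRONT = 3000
MONTHLY_USAGE = 1000

print("break-even month:", break_even_month(UPFRONT, MONTHLY_USAGE))
for month in (1, 6, 12):
    print(month, cumulative_costs(month, UPFRONT, MONTHLY_USAGE))
```

After the break-even month, every additional month of usage-based capture widens the gap, while the one-time purchase stays flat.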
Markerless multi-cam capture has been attempted many times in the past, but never with great results. It feels like the moment is coming where these types of solutions enter the mainstream - but there are still some fundamental questions that are unanswered.
The biggest advantage of markerless capture is... well, getting rid of the markers, obviously. But are the markers really the main issue when it comes to the current, non-AI-based, multi-cam optical solutions? Aren’t things like hardware cost and post-processing much larger issues that would be more urgent to solve? Markerless solutions may not suffer from “marker flipping”, but many of the other typical issues will be just as prominent, like (self-)occlusion and jitter - so post-processing (clean-up) will be just as necessary.
And before we even get to those questions, there are some limitations here and now that we expect might get fixed over time, but that are still far from solved - like frame rate limitations. Solutions like Move.ai max out at 60fps, while a Vicon system runs at 240fps or higher. On top of that, Vicon still offers millimeter precision that “regular cameras” would need a massive upgrade to match. Even when/if these things become technically possible, the usage costs of running these kinds of cameras will probably be massive.
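To get a feel for what the frame rate difference means in practice, you can compute how far a fast-moving body part travels between two consecutive frames. The 60fps and 240fps figures are the ones quoted above; the hand speed of 10 m/s is an assumed value for a fast punch, chosen only for illustration.

```python
def travel_between_frames(speed_m_per_s, fps):
    """Distance (meters) a point moves between two consecutive frames."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return speed_m_per_s / fps


HAND_SPEED = 10.0  # m/s, hypothetical speed of a fast punch

for fps in (60, 240):
    gap_cm = travel_between_frames(HAND_SPEED, fps) * 100
    print(f"{fps} fps: {gap_cm:.1f} cm between frames")
```

At the lower frame rate the hand moves four times as far between samples, so more of the motion has to be interpolated rather than captured.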
Making the Choice
When choosing between these systems, consider your priorities carefully:
For High-Quality, Comprehensive, Reliable Capture: IMU systems like Rokoko’s offer unmatched flexibility, real-time feedback, and freedom from occlusion issues. They are ideal for professionals who need reliable, full-body motion capture.
For Quick and Easy Setup: Single-camera AI vision systems are convenient but come with limitations in accuracy and real-time feedback.
For Advanced Projects: Multi-camera setups provide better occlusion handling but require significant investment and space. With a full performance capture solution from Rokoko, including the Coil Pro for absolute position, you get very far for very low cost compared to the high-end optical systems. Your specific use will decide what system is most suitable.
Conclusion:
While AI vision systems are rapidly improving and offer unique benefits, IMU systems like Rokoko’s stand out for their reliability, comprehensiveness, and cost-effectiveness over time. Some of the current issues with vision AI systems will undoubtedly get solved over time as the tech improves. However, some (occlusion, depth perception, usage costs) are either unsolvable or unlikely to be solved anytime soon.
Book a personal demonstration
Schedule a free personal Zoom demo with our team. We'll show you how our mocap tools work and answer all your questions.