US20250117597
2025-04-10
Physics
G06F40/40
The disclosed systems and methods convert digital video data into natural language text descriptions. Video data is processed over various timeframes to create a text-based narrative that describes the video's subject matter. The system also supports user queries about the video content, providing natural language responses grounded in the video's context.
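As a rough orientation, the two capabilities summarized above can be pictured as the interface sketched below. This is not code from the publication; the class and method names (VideoDescriber, describe, answer) and the VideoNarrative container are assumptions made purely for illustration.

    from dataclasses import dataclass
    from typing import List


    @dataclass
    class VideoNarrative:
        """Hypothetical container for one time-stamped text description."""
        start_s: float       # segment start time in seconds
        end_s: float         # segment end time in seconds
        description: str     # natural language description of that span


    class VideoDescriber:
        """Hypothetical interface for the two capabilities in the abstract."""

        def describe(self, video_path: str) -> List[VideoNarrative]:
            """Convert a video file into time-stamped text narratives."""
            raise NotImplementedError

        def answer(self, video_path: str, question: str) -> str:
            """Answer a natural language question about the video content."""
            raise NotImplementedError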
Traditional cloud computing models face limitations when processing data from remote environments due to issues like latency and bandwidth constraints. Remote locations often lack infrastructure for high-speed data transmission, relying instead on slower wireless communications. This poses significant challenges for handling large data volumes generated by cameras and sensors in such areas.
The process involves segmenting a video into overlapping segments and generating several embedding vectors (local, global, temporal, and activity) for each segment. These vectors are aggregated into feature embeddings, which are then transformed into descriptive text narratives. The method can operate in real time or after the video has been created, and can optimize storage by eliminating redundant segments.
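The segmentation and aggregation steps might look roughly like the sketch below. This is not the filing's implementation: the segment length, overlap, embedding width, and mean-pooling aggregation are all assumptions, and the four encoders are replaced with seeded random draws purely so the data flow is concrete and runnable.

    import numpy as np

    EMBED_DIM = 256  # assumed embedding width; the publication does not specify one


    def split_into_segments(duration_s, seg_len_s=10.0, overlap_s=2.0):
        """Return (start, end) times for overlapping segments covering the video.

        Segment length and overlap are illustrative values, not taken from the filing.
        """
        step = seg_len_s - overlap_s
        bounds, start = [], 0.0
        while start < duration_s:
            bounds.append((start, min(start + seg_len_s, duration_s)))
            start += step
        return bounds


    def encode_segment(frames):
        """Stand-in for the four per-segment encoders (local, global, temporal, activity).

        Real encoders would be learned models; each one is faked here with a seeded
        random vector so the aggregation into a single feature embedding is concrete.
        """
        seed = abs(hash(frames.tobytes())) % 2**32
        rng = np.random.default_rng(seed)
        local_e, global_e, temporal_e, activity_e = (rng.standard_normal(EMBED_DIM) for _ in range(4))
        feature = np.mean([local_e, global_e, temporal_e, activity_e], axis=0)  # aggregate the four vectors
        return feature / (np.linalg.norm(feature) + 1e-8)  # unit-normalize for later similarity search


    if __name__ == "__main__":
        # Fake a 60-second clip: 8 frames of 32x32 RGB per segment.
        for start, end in split_into_segments(60.0):
            frames = np.random.rand(8, 32, 32, 3)
            feature = encode_segment(frames)
            print(f"[{start:5.1f}s-{end:5.1f}s] feature embedding shape: {feature.shape}")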
The system also supports question-and-answer sessions in which users ask questions about the video content. It processes the feature embeddings and text narratives to generate responses, returning them in natural language. This functionality is supported by machine learning models that can operate in challenging environments with limited connectivity.
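One plausible reading of that question-and-answer step is a retrieval-style flow: embed the question, match it against the stored feature embeddings, and phrase the best-matching segment's narrative as the answer. The publication does not pin down the model, so the embed_text and answer_question helpers below are hypothetical names, and the toy embedder uses seeded random unit vectors only so the example runs end to end.

    import numpy as np


    def embed_text(text, dim=256):
        """Toy text embedder: a seeded random unit vector per string. A real system
        would use a learned language model; this exists only to make the flow runnable."""
        rng = np.random.default_rng(abs(hash(text.lower())) % 2**32)
        v = rng.standard_normal(dim)
        return v / np.linalg.norm(v)


    def answer_question(question, segment_features, segment_narratives):
        """Pick the segment whose feature embedding is most similar to the question
        embedding and phrase its stored narrative as the natural language answer."""
        q = embed_text(question, dim=segment_features[0].shape[0])
        scores = [float(q @ f) for f in segment_features]  # dot product == cosine similarity for unit vectors
        best = int(np.argmax(scores))
        return f"Based on segment {best}: {segment_narratives[best]}"


    if __name__ == "__main__":
        narratives = [
            "A forklift moves pallets across the loading dock.",
            "Two workers inspect a conveyor belt near the exit.",
        ]
        # Stand-ins for the aggregated per-segment feature embeddings.
        features = [embed_text(n) for n in narratives]
        print(answer_question("What are the workers doing?", features, narratives))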
Edge computing units, designed to function in harsh conditions, perform these processes locally with minimal latency. The units are equipped with various sensors and communication interfaces, allowing data to be processed without relying on centralized data centers. This edge computing setup ensures efficient handling of data even in remote or infrastructure-limited locations.