6th January 2025

Computer vision is one of many tools you can use in content moderation. With computer vision, you can automatically find categories of content in a video instead of relying on manual human review.

For example, consider a scenario where you are a video producer and want to know if a video clip contains alcohol. You could use computer vision to classify whether the video contains alcohol. Using this information, you can trigger custom business logic, such as prohibiting the content from being shown before a certain time of day.

In this guide, we are going to show how to moderate video content with the Roboflow Video Inference API. We will use the CLIP model to identify specific scenes (e.g. violence, alcohol). By the end of this guide, you will be able to take a video and identify if it contains specific categories of content.

We’ll run our analysis on this video:

[embedded content]

Without further ado, let’s get started.

What is CLIP?

Contrastive Language-Image Pre-training (CLIP) is an open source computer vision model developed by OpenAI. You can use CLIP to calculate the similarity between two images and the similarity between images and text. With this capability, you can identify frames in a video that are similar to a text prompt, create media search engines that let you find images using text queries, cluster images, and more.
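To make "similarity" concrete: CLIP maps images and text to vectors in a shared space, and two vectors are commonly compared with cosine similarity. The sketch below uses NumPy and made-up three-dimensional vectors rather than real CLIP embeddings (which have hundreds of dimensions), so the names and values are illustrative only.

```python
import numpy as np


def cosine_similarity(a, b):
    """Return the cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy stand-ins for embeddings; real CLIP vectors are much longer.
frame_vector = [0.9, 0.1, 0.0]
prompt_a = [1.0, 0.0, 0.0]  # points in nearly the same direction as the frame
prompt_b = [0.0, 0.0, 1.0]  # orthogonal to the frame

print(cosine_similarity(frame_vector, prompt_a))  # close to 1
print(cosine_similarity(frame_vector, prompt_b))  # 0.0
```

A score near 1 means the two vectors point in almost the same direction; a score near 0 means they are unrelated.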

You can run CLIP on frames in a video using the Roboflow Video Inference API, which provides a flexible, hosted solution for using CLIP with videos. Our inference API will scale up with you, whether you are processing a single video or multiple terabytes per day.

Moderate Content with CLIP and Roboflow

We can compare frames in a video to a list of descriptions that match what we want to identify in a video. For example, consider a scenario where we want to identify scenes that contain violence or smoking in a video. We could accomplish this with CLIP using the following prompts:

  • Violence
  • Something else

You can set any arbitrary text prompt(s).

The second prompt, “something else”, is the category we want CLIP to return if no moderation prompt is identified. We could add different prompts, too. For example, if you wanted to identify explicit scenes in a video, you could set a prompt for explicit imagery. You can search for multiple categories at one time.
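As a sketch of how a catch-all prompt behaves when several categories are checked at once, the helper below picks the prompt with the highest similarity score for a frame and treats the catch-all as "no moderation category". The function name `label_frame`, the prompt list, and the scores are all hypothetical; in practice the scores would come from comparing CLIP vectors.

```python
FALLBACK = "something else"


def label_frame(scores, fallback=FALLBACK):
    """Return the best-scoring moderation prompt, or None if the fallback wins.

    `scores` maps each prompt (including the fallback) to a similarity score.
    """
    best = max(scores, key=scores.get)
    return None if best == fallback else best


# Made-up similarity scores for two frames.
calm_frame = {"violence": 0.18, "explicit imagery": 0.15, "something else": 0.24}
fight_frame = {"violence": 0.27, "explicit imagery": 0.16, "something else": 0.21}

print(label_frame(calm_frame))   # None
print(label_frame(fight_frame))  # violence
```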

We don’t need these prompts for the video inference API, but we will need them later when we process CLIP results from the video inference API.

In this guide, we will work with a video that contains one violent scene.

Step #1: Install the Roboflow pip Package

The Roboflow Python SDK lets you run inference on videos in a few lines of code. To install the SDK, run the following command:

pip install roboflow

Step #2: Calculate CLIP Vectors

We are going to use the hosted Roboflow Video Inference API to calculate CLIP vectors for frames in a video. Create a new Python file and add the following code:

import json

from roboflow import CLIPModel

model = CLIPModel(api_key="API_KEY")

# Start a hosted video inference job that samples 3 frames per second
job_id, signed_url, expire_time = model.predict_video(
    "trailer.mp4",
    fps=3,
    prediction_type="batch-video",
)

# Block until the job finishes, then save the results to disk
results = model.poll_until_video_results(job_id)

with open("outcomes.json", "w") as f:
    json.dump(results, f)

Above, replace:

  1. API_KEY with your Roboflow API key. Learn how to retrieve your Roboflow API key.
  2. trailer.mp4 with the name of the video on which you want to run inference. You can also provide a URL that points to a video.
  3. fps=3 with the frames per second to use in inference. fps=3 means that inference will be run three times for every second of video. The higher the FPS, the more frames on which inference will be run. To learn more about pricing for video inference, refer to the Roboflow pricing page.
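Because the amount of work grows with the number of frames sampled, it can help to estimate that number before starting a job. This is a rough sketch that assumes frames are sampled uniformly at the requested `fps`. The helper name and the 150-second duration are illustrative, and the function does not compute pricing.

```python
import math


def estimated_frames(duration_seconds, fps):
    """Estimate how many frames are sampled from a video at a given rate."""
    if duration_seconds < 0 or fps <= 0:
        raise ValueError("duration must be >= 0 and fps must be > 0")
    return math.ceil(duration_seconds * fps)


# A hypothetical 150-second trailer at different sampling rates.
for fps in (1, 3, 5):
    print(fps, estimated_frames(150, fps))
```

For the hypothetical trailer, moving from 1 to 3 frames per second triples the number of frames on which inference is run (150 versus 450).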

The script above will start a video inference job on the Roboflow cloud. The poll_until_video_results function will poll the Roboflow API every 60 seconds to check for results. When results are available, they are saved to a file.

The file contains:

  1. The frames on which inference was run, in `frame_offset`.
  2. The timestamps that correspond with the frames on which inference was run.
  3. The CLIP vectors from inference.
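Here is a short sketch of how these fields might be read back. It assumes `frame_offset` and `clip` are parallel lists in the saved JSON, with one entry per sampled frame. That layout is inferred from the description above rather than taken from API documentation, and the sample dictionary is made up.

```python
def frames_with_vectors(results):
    """Pair each sampled frame offset with its CLIP vector."""
    offsets = results["frame_offset"]
    vectors = results["clip"]
    if len(offsets) != len(vectors):
        raise ValueError("expected one CLIP vector per frame offset")
    return list(zip(offsets, vectors))


# Made-up stand-in for the saved file; real vectors are much longer.
sample = {
    "frame_offset": [0, 8, 16],
    "clip": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
}

for offset, vector in frames_with_vectors(sample):
    print(offset, len(vector))
```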

Step #3: Compare Moderation Labels with CLIP Vectors

The Roboflow Video Inference API returns raw CLIP vectors. This is because there are many different tasks you can accomplish with CLIP vectors. For this guide, we will focus on using CLIP to identify if a video contains a violent scene.

Create a new Python file and add the following code:

import json

import clip
import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity

with open("outcomes.json", "r") as f:
    results = json.load(f)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Compute one CLIP text vector per prompt
prompts = ["violence", "something else"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text)

prompts_to_features = list(zip(prompts, text_features))

# For each frame, record the prompt that is most similar to the frame vector
buffer = []

for result in results["clip"]:
    results_for_frame = {}

    for prompt, embedding in prompts_to_features:
        results_for_frame[prompt] = cosine_similarity(
            embedding.cpu().numpy().reshape(1, -1), np.array(result).reshape(1, -1)
        )[0][0]

    buffer.append(max(results_for_frame, key=results_for_frame.get))

# If 5 frames in a row are labeled "violence", we have a match
match = False

# len(buffer) - 4 so that the final window of 5 frames is also checked
for i in range(len(buffer) - 4):
    if buffer[i : i + 5] == ["violence"] * 5:
        match = True
        break

print("Match for 'violence':", match)

In this code, we compute CLIP vectors for our two prompts: “violence” and “something else”. We then calculate how similar each frame’s CLIP vector is to the prompts.

Each frame is labeled with whichever prompt its CLIP vector is most similar to. If five consecutive frames are labeled “violence” rather than “something else”, we stop iterating over the video and report that the video contains violence. Our code is implemented this way so that one or two isolated false positives do not cause a video to be classified as violent.
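The frame-level check can be factored into a helper. The sketch below shows two hypothetical variants (neither name comes from the Roboflow SDK): a strict check that requires a run of consecutive matching frames, and a more tolerant rolling check that requires `k` matches among the last `n` frames. The thresholds are illustrative and would need tuning on your own footage.

```python
from collections import deque


def has_consecutive_run(labels, target, run_length=5):
    """True if `target` appears `run_length` times in a row."""
    streak = 0
    for label in labels:
        streak = streak + 1 if label == target else 0
        if streak >= run_length:
            return True
    return False


def has_k_of_last_n(labels, target, k=5, n=20):
    """True if any window of the last `n` labels holds at least `k` targets."""
    window = deque(maxlen=n)
    for label in labels:
        window.append(label == target)
        if sum(window) >= k:
            return True
    return False


# Made-up per-frame labels: "violence" on and off, never 5 in a row.
labels = ["violence", "something else"] * 5

print(has_consecutive_run(labels, "violence"))  # False
print(has_k_of_last_n(labels, "violence"))      # True
```

The strict variant is less likely to flag a video because of scattered misclassifications, while the rolling variant still catches scenes where detections flicker between labels.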

Here is the result from our script:

Match for 'violence': True

Our video, a movie trailer, contains violent scenes. Our script has successfully identified that the video contains violent scenes.

With the code above, you can make determinations based on your business logic. For example, you may decide that videos that contain violence (e.g. a trailer for a violent movie) need to go to a human reviewer for further review. Or, if you are running a community, you may reject content that contains violent scenes.
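One way to express such rules is a small policy function. This is a minimal sketch with hypothetical names and categories: `matches` stands in for per-category results like the `match` value printed above, and the "reject", "review", and "approve" actions are placeholders for whatever your own workflow does.

```python
def moderation_action(matches, reject_on=frozenset(), review_on=frozenset()):
    """Map per-category match results to 'reject', 'review', or 'approve'."""
    flagged = {category for category, matched in matches.items() if matched}
    if flagged & set(reject_on):
        return "reject"
    if flagged & set(review_on):
        return "review"
    return "approve"


# Made-up results for one video.
matches = {"violence": True, "alcohol": False}

print(moderation_action(matches, review_on={"violence"}))  # review
print(moderation_action(matches, reject_on={"violence"}))  # reject
print(moderation_action(matches, review_on={"alcohol"}))   # approve
```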

Conclusion

You can use CLIP with the Roboflow Video Inference API to identify if a video contains any scenes that are not appropriate. For example, you can identify scenes that contain violence, explicit scenes, or other types of content.

In this guide, we walked through how to run CLIP on frames in a video with the video inference API. We then wrote code that lets you compare moderation labels to the CLIP vectors associated with video frames. This code allows you to classify if a video contains content that you want to moderate.
