I'm looking to process a number of videos of different video games to detect various "states" in them.
The first game that I am tackling is any edition of Super Street Fighter 4.
In it, I'd like to detect when the "vs" character screen comes up. Here's an example of one frame of the video:
(taken from ~10s mark of this video)
If I could detect the "vs", then I'd know that the frame is indeed the "vs" screen, which would allow me to look for other information (for now, let's say I'll use it to detect the timestamp in the video where the match is about to start).
That said, here is what can be assumed about the frames of the videos I'll be processing (this is not the only video; there are thousands, if not tens or hundreds of thousands, but the issue of scale in processing that many videos is a completely different domain):
- I'd prefer (but it's not necessary) to process the lowest-resolution image that gives reliable results (lower resolution = faster processing). The image above is 480 x 270 pixels (taken from a YouTube video with fmt=18), but frames may come in different sizes (I've gotten fmt=18 YouTube videos with dimensions 640 x 360 pixels).
- Most videos will be direct-feed
- Most videos will be 16:9 aspect ratio
- The reddish background will be animated, but generally be within that orange-red color (it's flames)
- Sometimes a badge will fade in and out over the lower part of the "vs" to indicate a version (that will be important, but not right now), which might obscure the "vs", like so:
(taken from ~3s mark from this video; also note that the above is a resolution of 640 x 360 pixels)
- The size and position of the "vs" will be roughly the same in proportion to other direct-feed videos (I haven't verified this yet, but I know it doesn't move)
- The characters will be chosen from a pool of more than 30 on each side (in other words, those areas of the frame will vary)
- The videos will generally be anywhere from two to four minutes long, with somewhere between 4,000 and 6,000 frames. However, there might be longer videos (maybe two hours) that have various other games and live action cut in. These videos are not as important, but if a solution can tell me where a certain game pops up in the larger overall video, great
- The native resolution of the captures is 720p, so a baseline image of the "vs" can be taken at what would be considered a "native" size.
Ultimately, I'm looking to code this pipeline in .NET, but that's not critical; the proof of concept is more important here, along with an understanding of the techniques involved, so that I can translate and optimize it for .NET and adapt it to videos of other games in the same genre, if I can pick out the significant discriminators (say, Ultimate Marvel vs. Capcom 3, Street Fighter x Tekken, BlazBlue: Continuum Shift, etc.).
I'm also dipping my toes into Mathematica and have home version 8.0, so a proof of concept in that environment is more than welcome as well.
If the "VS" is pretty much the same (save for some badge overlays as in the second example), you can use straightforward cross-correlation to detect the presence of the template in your video frame. I answered a similar question on doing this in MATLAB on Stack Overflow. You can use something like the "magic wand" tool in Photoshop to select the "VS" from the frame to create a template. I've done so and binarized the image to obtain this template.
Looking at the different color channels (RGB) in your two images, the red channel appears to be the best for detecting your template.
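A quick way to eyeball this in Mathematica, using the frame imported in the sketch above:

(* Display the R, G and B channels side by side; the red channel
   separates the "VS" from its surroundings most cleanly *)
GraphicsRow[ColorSeparate[frame]]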
You can now cross-correlate the red channel with your binarized template and you should get a peak at the location of the template. I chose to threshold and binarize the red channel too, although you can detect it without doing so. I prefer to use a distance function rather than raw cross-correlation values, as it tends to be a bit more robust against false positives. I don't know C#/.NET, but here's an outline of the approach in Mathematica:
image = Import["http://i.stack.imgur.com/7RwAh.png"];

(* Correlate the binarized red channel with the "VS" template
   (vsTemplate, created above) using a normalized distance measure;
   thresholding and negating turns the minimum into a white dot *)
ImageCorrelate[Binarize[ColorSeparate[image][[1]], 0.1], vsTemplate,
   NormalizedSquaredEuclideanDistance] // Binarize[#, 0.2] & // ColorNegate
which gives you the following. The white dot marks the region with the minimum distance in each of the two images:
You can then use the above in your next step as desired. Note that typically, cross-correlation will result in an overhang. In other words (using a 1D example), if you cross-correlate an $N$ point signal with an $M$ point one, you'll get a result that's $N+M-1$ points long. Mathematica's implementation takes care of the overhang for you. However, I don't know what C# does, and you might want to keep this in mind (MATLAB doesn't do it, and I had to account for this in my linked answer above).
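Here's a quick 1-D illustration of that overhang (the signal lengths are arbitrary):

(* Full correlation of an N-point signal with an M-point kernel
   is N + M - 1 points long *)
signal = RandomReal[1, 10]; (* N = 10 *)
kernel = RandomReal[1, 4];  (* M = 4 *)
Length@ListCorrelate[kernel, signal, {1, -1}, 0] (* 13 = 10 + 4 - 1 *)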
You can also build upon this and implement a more robust thresholding criterion of your own. For now, I shall just highlight the detection for the benefit of others:
You can generate the above with a combined function:
detectVS[i_Image] :=
 Module[{(* white dot marking the minimum-distance location *)
   mask = ImageCorrelate[Binarize[ColorSeparate[i][[1]], 0.1], vsTemplate,
       NormalizedSquaredEuclideanDistance] ~Binarize~ 0.2 // ColorNegate},
  (* dilate the dot into a disk and add the colored region
     back onto a grayscale copy of the frame *)
  ColorConvert[i, "Grayscale"] ~ImageAdd~
   ImageMultiply[i, Image[mask] ~Dilation~ DiskMatrix@100]
  ]
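which you can then call on any imported frame, e.g. detectVS[Import["http://i.stack.imgur.com/7RwAh.png"]].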
There is a lot of potential for improvement here. I'm an armchair hobbyist at image processing, so I don't know what the fastest algorithms are. However, there are a few things you could look into:
- If the VS is roughly in the same location in every video, you needn't cross-correlate using the entire image – you can just select a box in the middle and work with that.
- This might be an expensive operation to run on each and every frame. However, looking at your video, you have a little over 4 seconds' worth of frames where the VS and the character names are displayed. So I would suggest analyzing a frame every second, or at most every 2 seconds, thereby guaranteeing that you land on one with a VS on it. Once you detect the VS, you can start processing every successive frame for the next part of your pipeline (see the sketch after this list, which combines this with the cropped search from the first point).
- This process should be, to a reasonable extent, robust to changes in size, i.e., you can do cross-correlations on small images, but you'll need a suitably sized template to match. If you know that your images will come in a certain set of standard sizes, you could create a template for each and select the appropriate one based on the image size.
- The thresholds I chose were found by trial and error, but they work for the two images above and, judging from other related YouTube videos, will probably work for most of them. A more specialized approach would involve dividing the image into blocks and looking at the histograms to infer whether a block belongs to the VS or not – perhaps a Bayesian classifier. However, be absolutely sure that you need to do this before embarking on it. It seems to me that the problem is simple enough that you won't need it.
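To make the first two points concrete, here's a rough sketch of such a sparse scan. It assumes the frames have already been extracted to image files (frameFiles is a placeholder for however you obtain them), and the ImageTake coordinates for the central box are again placeholders:

(* Check one frame per second, cropped to a central box, until the
   VS shows up; frameFiles and the crop coordinates are placeholders *)
vsPresent[frame_Image] :=
 Min[ImageData[
    ImageCorrelate[
     Binarize[ColorSeparate[ImageTake[frame, {60, 200}, {150, 330}]][[1]], 0.1],
     vsTemplate, NormalizedSquaredEuclideanDistance]]] < 0.2
(* 0.2 is the same distance threshold used in the Binarize step above *)

framesPerSecond = 30; (* adjust to the video's actual frame rate *)
firstHit = Catch@Do[
   If[vsPresent@Import[frameFiles[[k]]], Throw[k]],
   {k, 1, Length[frameFiles], framesPerSecond}]

firstHit then holds the index of the first sampled frame containing the VS (or Null if none was found), and you can switch to per-frame processing from there.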