Sunday, January 21, 2018

image processing - "River" detection in text


Over on the TeX stackexchange, we have been discussing how to detect "rivers" in paragraphs in this question.


In this context, rivers are bands of white space that result from accidental alignment of interword spaces in the text. Since this can be quite distracting to a reader, bad rivers are considered a symptom of poor typography. An example of text with rivers is this one, where two rivers flow diagonally through the paragraph.


[image: example paragraph with two diagonal rivers]


There is interest in detecting these rivers automatically, so that they can be avoided (probably by manual editing of the text). Raphink is making some progress at the TeX level (which only knows of glyph positions and bounding boxes), but I feel confident that the best way to detect rivers is with some image processing (since glyph shapes are very important and not available to TeX). I have tried various ways to extract the rivers from the above image, but my simple idea of applying a small amount of ellipsoidal blurring doesn't seem to be good enough. I also tried some Radon- and Hough-transform-based filtering, but I didn't get anywhere with those either. The rivers are very visible to the feature-detection circuits of the human eye/retina/brain, so I would think this could be translated into some kind of filtering operation, but I haven't been able to make it work. Any ideas?


To be specific, I'm looking for an operation that detects the two rivers in the above image without producing too many false positives elsewhere.



EDIT: endolith asked why I am pursuing an image-processing-based approach given that in TeX we have access to the glyph positions, spacings, etc., and it might be much faster and more reliable to use an algorithm that examines the actual text. My reason for doing things the other way is that the shape of the glyphs can affect how noticeable a river is, and at the text level it is very difficult to account for this shape (which depends on the font, on ligaturing, etc.). For an example of how the shape of the glyphs can matter, consider the following two examples, where the only difference is that I have replaced a few glyphs with others of almost the same width, so that a text-based analysis would consider them equally good/bad. Note, however, that the rivers in the first example are much worse than in the second.


[image: first example, with prominent rivers]


[image: second example, nearly identical glyph widths, much less noticeable rivers]



Answer



I have thought about this some more, and think that the following should be fairly stable. Note that I have limited myself to morphological operations, because these should be available in any standard image processing library.


(1) Open the image with an nPix-by-1 mask, where nPix is about the vertical distance between letters.


%# read image and convert to grayscale
img = rgb2gray(imread('http://i.stack.imgur.com/4ShOW.png'));

%# threshold and open with a rectangle
%# that is roughly letter sized
bwImg = img > 200; %# threshold of 200 works better than 128

opImg = imopen(bwImg,ones(13,1));

[image: result after thresholding and vertical opening]
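For readers without MATLAB, step (1) can be sketched in Python with SciPy, where `ndimage.binary_opening` plays the role of `imopen`. This is my translation, not the author's code; the input is a tiny synthetic binary image standing in for the thresholded page, and nPix = 13 is taken from the answer's code.

```python
import numpy as np
from scipy import ndimage

# Synthetic stand-in for the thresholded page: True = white pixel.
bw = np.zeros((40, 40), dtype=bool)
bw[:, 20] = True        # a white column running the full page height
bw[5, 5] = True         # isolated white speck (inter-letter gap)
bw[10, 8:10] = True     # short horizontal sliver

# Step (1): open with an nPix-by-1 vertical mask (nPix ~ line spacing);
# only white runs at least nPix pixels tall survive the opening.
nPix = 13
opened = ndimage.binary_opening(bw, structure=np.ones((nPix, 1), bool))
```

After the opening, the tall column survives while the speck and the sliver are gone, mirroring the first panel of the MATLAB pipeline.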


(2) Open image with a 1-by-mPix mask to eliminate whatever is too narrow to be a river.


opImg = imopen(opImg,ones(1,5));

[image: result after horizontal opening]
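Step (2) translates the same way; again a hedged SciPy sketch on synthetic data, with mPix = 5 taken from the answer's `imopen(opImg,ones(1,5))`:

```python
import numpy as np
from scipy import ndimage

# Candidates after step (1): everything here is already >= nPix tall.
mPix = 5
cand = np.zeros((30, 30), dtype=bool)
cand[:, 10] = True       # 1 px wide seam: too narrow to be a river
cand[:, 20:26] = True    # 6 px wide band: wide enough to keep

# Step (2): open with a 1-by-mPix horizontal mask; anything narrower
# than mPix pixels is eliminated.
opened = ndimage.binary_opening(cand, structure=np.ones((1, mPix), bool))
```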



(3) Remove horizontal "rivers and lakes" that are due to space between paragraphs, or indentation. For this, we remove all rows that are all true, and open with the nPix-by-1 mask that we know will not affect the rivers we have found previously.


To remove lakes, we can use an opening mask that is slightly larger than nPix-by-nPix.


At this step, we can also throw out everything that is too small to be a real river, i.e. everything that covers less area than (nPix+2)*(mPix+2)*4 (that will give us ~3 lines). The +2 is there because we know that all objects are at least nPix in height, and mPix in width, and we want to go a little above that.


%# horizontal river: just look for rows that are all true
opImg(all(opImg,2),:) = false;
%# open with line spacing (nPix)
opImg = imopen(opImg,ones(13,1));

%# remove lakes with nPix+2
opImg = opImg & ~imopen(opImg,ones(15,15));


%# remove small fry
opImg = bwareaopen(opImg,7*15*4);

[image: remaining candidates after removing horizontal rivers, lakes, and small components]
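Step (3) combines three cleanups: clearing all-white rows, removing lakes via a square opening, and an area filter (the `bwareaopen` call). Here is a hedged SciPy version of all three on a synthetic candidate image; `ndimage.label` plus an area threshold substitutes for `bwareaopen`, and the mask sizes come from the answer.

```python
import numpy as np
from scipy import ndimage

nPix, mPix = 13, 5  # mask sizes from the answer

# Synthetic set of candidates that survived steps (1) and (2).
cand = np.zeros((200, 60), dtype=bool)
cand[:, 30:36] = True        # a genuine vertical river, 6 px wide
cand[100:102, :] = True      # blank line between paragraphs: all-white rows
cand[5:25, 45:60] = True     # a "lake" (e.g. an indented margin block)
cand[150:165, 5:11] = True   # river-shaped blob that is too short

# (a) horizontal rivers: clear rows that are entirely white, then re-open
cand[cand.all(axis=1), :] = False
cand = ndimage.binary_opening(cand, structure=np.ones((nPix, 1), bool))

# (b) lakes: remove whatever survives opening with an (nPix+2)-square mask
lakes = ndimage.binary_opening(cand, np.ones((nPix + 2, nPix + 2), bool))
cand &= ~lakes

# (c) small fry: drop components with area below (nPix+2)*(mPix+2)*4
labels, n = ndimage.label(cand)
areas = np.asarray(ndimage.sum(cand, labels, index=range(1, n + 1)))
keep = np.flatnonzero(areas >= (nPix + 2) * (mPix + 2) * 4) + 1
rivers = np.isin(labels, keep)
```

The blank line splits the vertical river into two tall pieces, both of which survive the re-opening and the area filter, while the lake and the short blob are discarded.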


(4) If we're interested in not only the length, but also the width of the river, we can combine distance transform with skeleton.


dt = bwdist(~opImg);
sk = bwmorph(opImg,'skel',inf);
%# prune the skeleton a bit to remove branches
sk = bwmorph(sk,'spur',7);

riversWithWidth = dt.*sk;

[image: skeleton weighted by the distance transform] (colors correspond to the width of the river, though the color bar is off by a factor of 2)


Now you can get the approximate length of the rivers by counting the number of pixels in each connected component, and the average width by averaging their pixel values.
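Step (4) also has a rough Python analogue. SciPy's `distance_transform_edt` matches `bwdist(~opImg)`, but SciPy has no `bwmorph`; for a near-vertical river, the per-row maximum of the distance transform approximates the value on the skeleton, so the sketch below uses that shortcut (scikit-image's `skeletonize` would be the closer analogue). The river shape is synthetic, and the factor of 2 converts the distance-to-boundary into a width.

```python
import numpy as np
from scipy import ndimage

# One detected river: a vertical band whose width varies along its length.
river = np.zeros((30, 20), dtype=bool)
river[:, 8:12] = True        # 4 px wide overall
river[10:20, 7:13] = True    # widens to 6 px in the middle

# MATLAB's bwdist(~opImg) == distance to the nearest non-river pixel.
dt = ndimage.distance_transform_edt(river)

# Approximate the skeleton value per row by the row-wise maximum distance;
# twice that distance is roughly the local width of the river.
length = int(river.any(axis=1).sum())              # rows the river spans
widths = 2.0 * dt.max(axis=1)[river.any(axis=1)]   # ~ local width per row
```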




Here's the exact same analysis applied to the second, "no-river" image:


[image: result of the same analysis applied to the second image]

