Back in the day, I built tools for printed and handwritten text recognition for Indian languages. That was my introduction to the growing field of Computer Vision. How did we decompose the problem? Well, some of the tasks could be characterised as segmentation tasks, like document layout understanding, sentence extraction, and word extraction. Others could be characterised as classification tasks, such as script recognition (e.g. is it Bangla, Tamil, or Hindi?) and character recognition.

Segmentation tasks were largely tackled using heuristics, sometimes with a bit of classification thrown into the mix, for example: what does the layout look like, Manhattan (grid-like) or non-Manhattan (complex shapes: curves, diagonals, etc.)? Once the category was known, the heuristics kicked in. The document was size-normalised. Then the relative width and orientation of spaces were used to decide what was a paragraph, what was a line, and what was a word. Each word was then split into graphemes, the irreducible units sent for recognition via n-way classification.
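
To make the gap-based heuristics concrete, here is a minimal sketch of projection-profile segmentation in Python (NumPy only): rows with no ink separate lines, and within a line the width of the column gaps separates words from character groups. The function names and the fixed `word_gap` threshold are illustrative (the real tooling used relative widths on a size-normalised page), not the original code.

```python
import numpy as np

def runs_of_nonzero(profile):
    """Return (start, end) index pairs of contiguous non-zero stretches."""
    nz = profile > 0
    edges = np.diff(nz.astype(int))
    starts = list(np.where(edges == 1)[0] + 1)
    ends = list(np.where(edges == -1)[0] + 1)
    if nz[0]:
        starts.insert(0, 0)
    if nz[-1]:
        ends.append(len(profile))
    return list(zip(starts, ends))

def segment_lines_and_words(binary, word_gap=5):
    """binary: 2-D array, text pixels as 1, background as 0 (size-normalised)."""
    lines = []
    # Rows with no ink separate text lines.
    for top, bottom in runs_of_nonzero(binary.sum(axis=1)):
        line = binary[top:bottom]
        # Within a line, contiguous inked column runs are character groups.
        runs = runs_of_nonzero(line.sum(axis=0))
        # Merge runs separated by a narrow gap: narrow gaps are inter-character,
        # wide gaps (>= word_gap pixels) are inter-word.
        words = []
        for left, right in runs:
            if words and left - words[-1][1] < word_gap:
                words[-1] = (words[-1][0], right)
            else:
                words.append((left, right))
        lines.append(((top, bottom), words))
    return lines
```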

Classification before large neural networks was itself tackled in two separate steps: feature extraction, followed by linear (simple threshold-based) or non-linear classification (combinations of thresholds, as in decision trees, or kernel-based, as in SVMs). When the input data was a time series, as in the case of handwritten characters, one could make use of time-series models such as hidden Markov models.
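
As a rough sketch of that two-step recipe (not the original system), here is hand-crafted feature extraction feeding a linear classifier, with scikit-learn's bundled digits dataset standing in for character images and untuned HOG parameters:

```python
import numpy as np
from skimage.feature import hog
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

digits = load_digits()  # 8x8 grayscale digit images as a stand-in

# Step 1: fixed, hand-designed feature extraction (no learning happens here).
def extract_features(images):
    return np.array([hog(img, orientations=9, pixels_per_cell=(4, 4),
                         cells_per_block=(1, 1)) for img in images])

X = extract_features(digits.images)
X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, random_state=0)

# Step 2: a separately trained linear classifier on top of the frozen features.
clf = LinearSVC().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```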

A non-exhaustive list of feature extractors I used includes: LBP (local binary patterns: a histogram of quantised 3x3 neighbourhood codes taking values 0 to 255); HoG (histograms of oriented gradients, which became famous for real-time pedestrian detection); Haar-like features (rectangular light/dark patterns); and variants of these that preserve spatial information. A non-exhaustive list of classifiers I used includes: linear classifiers, cascades of linear classifiers (cf. the Viola-Jones face detector), random forests, Bayesian classifiers, support vector machines, HMMs, etc.
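
For a concrete feel of one of these descriptors, here is a minimal sketch of the basic 3x3 LBP histogram described above (the function name and the simple border handling are mine, not any particular library's):

```python
import numpy as np

def lbp_histogram(gray):
    """Basic 3x3 local binary pattern histogram (256 bins); border pixels skipped."""
    h, w = gray.shape
    c = gray[1:-1, 1:-1]                      # centre pixels
    # The 8 neighbours of each centre contribute one bit each.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(c.shape, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        nb = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (nb >= c).astype(np.int32) << bit   # set the bit if neighbour >= centre
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()                  # normalised 256-bin descriptor
```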

It was a lot of mixing and matching: one feature paired with one model, another with another, with the choices driven by intuition and an understanding of regularities in the data.

Around 2011, deep learning began gathering momentum. As deep learning started to scale - with data volumes (ImageNet) and compute efficiencies (GPUs) - smart segmentation and triaging were no longer necessary. Detectors brute-forced candidate object boxes (sliding windows or region proposals), resized them, and passed them through CNNs for co-optimised feature extraction and classification; single-shot models like YOLO went further, predicting boxes and classes in one pass over the image. Most of the job was 'good classification'. Post-processing involved non-maximum suppression to remove repeated and overlapping detections.
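
Non-maximum suppression itself is only a few lines; here is a minimal, class-agnostic NumPy sketch of the greedy version, with an illustrative IoU threshold:

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the kept box with the remaining candidates.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop candidates that overlap the kept box too much.
        order = order[1:][iou <= iou_threshold]
    return keep
```
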
(Note: YOLO is still popular and in wide use. Check out the work by Ultralytics, which provides an easy-to-use and effective platform for training YOLO models on real-world industrial datasets and deploying them on portable and embedded hardware.)
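
As a rough sketch of what that workflow looks like with the Ultralytics package (the model name and dataset config follow their documented examples; treat the exact arguments as illustrative and check the current docs):

```python
from ultralytics import YOLO

# Load a small pretrained detection model shipped by Ultralytics.
model = YOLO("yolov8n.pt")

# Fine-tune on a dataset described by a YAML config (here their bundled toy dataset).
model.train(data="coco8.yaml", epochs=10, imgsz=640)

# Run inference on an image (placeholder path) and export for edge deployment.
results = model("path/to/image.jpg")
model.export(format="onnx")
```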

TL;DR: where am I going with this? Well, I guess I want to say two things:

  1. Tools shape the nature of the problem.
  2. And also not, because the way I see it, ML at its core has always been about finding the right balance between bias and variance. In earlier models, the bias came in as inductive bias, i.e. through human understanding of the data and the problem, with less variance or adaptability. In DNNs the models grew complex, allowing for higher variance, so the bias had to come from sophisticated architectures and learning algorithms. In LLMs and beyond, with even larger models, the frontier lies in better and smarter learning algorithms.