Our Optical Character Recognition (OCR), while the best commercially available OCR technology, is not very good at identifying text from older documents.
Take, for example, this newspaper from 1847. The images are not great, but a person can read them:
The problem is that our computers' optical character recognition tech gets the text wrong and confuses the columns.
What we need is “Culture Tech” (a riff on fintech, or biotech) and Culture Techies to work on important and useful projects: the things we need, but that are probably not going to get gushers of private equity interest to fund. There are thousands of professionals taking on similar challenges in the field of digital humanities, and we want to complement their work with industrial-scale tech that we can apply to cultural heritage materials.
One such project would be to work on technologies to bring 19th-century documents fully digital. We need to improve OCR to enable full-text search, but we also need help segmenting documents into columns and articles. We have plenty of test materials, and thousands of people are uploading more documents all the time.
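To make the column problem concrete, here is a minimal sketch of one textbook approach, a vertical projection profile: count the ink in each pixel column of a black-and-white page and treat wide empty runs as gutters between text columns. This is only an illustration, not how our processing works today. The function name `find_columns`, the `min_gap` parameter, and the list-of-lists page format are all assumptions made for the example, and real 1847 newsprint (skewed, stained, with ruled lines between columns) would need a good deal more than this.

```python
def find_columns(page, min_gap=2):
    """Split a binarized page into text columns with a vertical projection profile.

    page: list of rows, each a list of 0 (background) or 1 (ink) pixels.
          (Hypothetical input format chosen for this sketch.)
    min_gap: how many consecutive empty pixel columns count as a gutter.
    Returns a list of (start, end) x-ranges, with end exclusive.
    """
    if not page:
        return []
    width = len(page[0])
    # Count the ink pixels in every pixel column of the page.
    profile = [sum(row[x] for row in page) for x in range(width)]

    columns = []
    start = None  # x position where the current text column began
    gap = 0       # length of the current run of empty pixel columns
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The gutter is wide enough: close the column where the ink ended.
                columns.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # The page ended while a column was still open.
        columns.append((start, width - gap))
    return columns


# A tiny 2-row "page": ink on the left, a 3-pixel gutter, ink on the right.
toy_page = [
    [1, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
]
print(find_columns(toy_page))
```

Raising `min_gap` makes the sketch less eager to split, which matters when the space between words is nearly as wide as the space between columns. That kind of trade-off is exactly where old newspapers get hard.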
What we do not have is a good way to integrate work on these projects with the Internet Archive’s processing flow. So we need help and ideas there as well.