In our previous post, we looked at the various ways Torah sefarim have been made available to us digitally. This resource is one of the greatest gifts our generation has been endowed with. It can be used in a vast number of ways.
The sheer scope of material made available, and the endless ways it can be collated, categorized, collected and compared, are nothing short of wondrous. The more a person works with these resources, the more he realizes how hugely beneficial they are.
The scanned library is remarkably similar to a bricks-and-mortar library, since the pages look just the way they do in a real sefer – they are photographed and stored as is. The difference between scanned and typed material is that the files of the former are heavier and take up more room.
Ask the executors of the “Responsa Project” or DBS Master Library how much it costs to type a sefer – they’ll roll their eyes and sigh. It’s a mammoth expense; each book takes hundreds of hours of typing and proofreading.
Some sefarim publishers agreed to transmit their print-ready digital text of sefarim to the Responsa Project. That was a great solution, since the manuscript was practically error-free, having already been edited and proofread. Sometimes the publisher even sent along valuable insights that were gleaned from having worked on the text. (One of the sefarim thus obtained for the Responsa Project is the Rokeach’s commentary on the seder hatefillah. As of Responsa Project’s Version 14, it is even possible for the user to type and save his own insights alongside the text. This idea was first introduced by Otzar Hachochma.)
But as you can imagine, not many publishers are willing to release the manuscripts in which they invested a lot of time and money. After all, once it’s available in this format, anyone can print it for himself, or, the publishers argue, potential clients have no need to purchase the print version at all, since they can access it digitally. (Although, as mentioned in a previous post, this reservation is unfounded, since the digital form can in no way replace the print form as a means of study. The digital form is only useful as a form of reference or for quick perusal.)
Users of the DBS Master Library sometimes use OCR-based programs such as Ligature or FineReader to convert the scanned images of sefarim to editable text. But these programs have many disadvantages. One of them is that the Hebrew Alef Beis contains many similar-looking letters that software struggles to tell apart, such as beis, chof, pei / chof, nun / reish, dalet, zayin, vav / hei, ches / alef, ayin, tes / yud, vav, nun, etc. This is another example of where the computer, with all its remarkable prowess, cannot compete with the human brain.
When you input a certain letter into the computer’s memory bank and then ask it to recognize the same shape in every image it sees, it won’t recognize the letter once there is the most minuscule change in the font size or boldness of that letter, not to mention major differences such as those found in handwritten text. If the leg of an Aleph is a little longer or shorter than the original Aleph the computer was shown, the dumb genius won’t identify it as an Aleph. It likewise doesn’t penetrate its iron brain that there’s a difference between a dalet, which has a short stump on its right and a longer one on its left, and a zayin, which has a short stump on both sides. Or, if the scanned printed page has part of a letter broken off, again, the computer will no longer recognize it. And if the text is vowelized, the computer loses its mind completely.
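The rigid shape comparison described above can be illustrated with a small Python sketch. It is only a toy: the 5x5 bitmaps and the function names are invented for illustration and are not taken from Ligature, FineReader or any other real OCR program.

```python
# Toy 5x5 bitmaps: '#' is ink, '.' is blank paper.
# These shapes are made up for illustration; they are not real font data.
TEMPLATE = [
    "#####",
    "....#",
    "....#",
    "....#",
    "....#",
]

# The same shape with one extra speck of ink, such as dirt on the page.
SCANNED = [
    "#####",
    "....#",
    ".#..#",
    "....#",
    "....#",
]


def exact_match(glyph, template):
    """Return True only if every pixel is identical to the stored shape."""
    return glyph == template


def differing_pixels(glyph, template):
    """Count the pixels at which the two bitmaps disagree."""
    return sum(
        1
        for row_g, row_t in zip(glyph, template)
        for px_g, px_t in zip(row_g, row_t)
        if px_g != px_t
    )


print(exact_match(TEMPLATE, TEMPLATE))       # the stored letter matches itself
print(exact_match(SCANNED, TEMPLATE))        # one speck of dirt breaks the match
print(differing_pixels(SCANNED, TEMPLATE))   # how far apart the shapes really are
```

A single stray pixel is enough to make the exact comparison fail, even though a human reader would not notice the difference.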
Humans Vs. Computers
The human ability to identify basic details and ignore secondary ones, such as distinguishing between a printed dot and a little dirt sticking to a page, is something that IT engineers have yet to mimic. The developers of the program Ligature claim to have invented a method called ‘neural networks’ that mimics human thought. However, in an article they wrote on the topic, they concede that computer software is hopelessly inept when compared to the human brain. It is a clear example of how Hashem’s master designs can never be duplicated.
However, we digress. Let us return to the subject at hand. The DBS Master Library team probably also used programs such as these to digitize sefarim, and therefore their work has many errors. There is no way around it: producing a digitized sefer that is error-free is a very time-consuming, costly task.
The Solution? Scan the Sefarim
It was at this point that the idea of scanning the sefarim came up. An optical scanner works faster than any typist, and with it, thousands of books can be stored in a Torah database in a relatively short amount of time. The image is, of course, 100% faithful to the original. The result is a library whose size and scope are unmatched even by huge libraries, yet so compact that it can be stored in a box the size of a siddur. The Otzar Hachochmah contains all the basic texts, as well as many other sefarim, including ancient, rare sefarim published for the first time in centuries, facsimiles of handwritten manuscripts, first editions of sefarim, Torah pamphlets, thousands of contemporary sefarim, etc. A typed library of this magnitude would cost literally millions of dollars. The fact is that the Responsa Project has been around for forty years and contains approximately 1,200 sefarim, while the Otzar Hachochmah has been around for only several years and contains about 100,000 sefarim!
Word Search
With all the advantages of a database containing scanned sefarim, it has one major drawback: it is impossible to search within images. In a typed sefer, the computer scans the characters in its memory for a configuration of letters. It cannot do so in a scanned sefer, which holds only images. A computer sees no difference between an image of a pretty picture and an image containing lettering; both are a collection of pixels, except that one is more saturated with color than the other.
We have already discussed OCR-based software, which tries to mimic the reading ability of the human eye and brain and converts scanned text into typed text. This technology enables images that were previously unsearchable to be searched. But the converted text contains many errors. In its raw, unedited, non-proofread form, it is virtually useless. On the other hand, editing and proofreading the endless reams of text, even on the most elementary level, would cost millions.
The software developers of Otzar Hachochmah came up with a brilliant solution. They converted the images to text with OCR technology, but instead of showing the user only the typed text, the software also shows the original image. It is as if two pages are ‘stuck’ to each other: the top one is the image, and the one beneath it is the editable, digital text converted from the image. When the user conducts a search, the typed text shows up, along with a marker showing those words in the original image. The searcher can thus check for any errors.
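One way to picture the two ‘stuck’ pages is as a list of recognized words, each remembering where it sits on the scanned image. The Python sketch below is only a guess at the general idea, not Otzar Hachochmah's actual design; the class names, the file name and the pixel coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str    # the word as the OCR program read it (possibly with errors)
    box: tuple   # (left, top, right, bottom) position on the image, in pixels


@dataclass
class Page:
    image_file: str  # the scanned picture shown to the user
    words: list      # the hidden text layer: OcrWord items in reading order


def find_phrase(page, phrase):
    """Return, for each place the phrase occurs, the boxes to mark on the image."""
    target = phrase.split()
    if not target:
        return []
    texts = [w.text for w in page.words]
    hits = []
    for i in range(len(texts) - len(target) + 1):
        if texts[i:i + len(target)] == target:
            hits.append([w.box for w in page.words[i:i + len(target)]])
    return hits


# A made-up page with three recognized words (transliterated for readability).
page = Page(
    image_file="page_0001.png",
    words=[
        OcrWord("gadol", (400, 50, 470, 80)),
        OcrWord("talmud", (300, 50, 390, 80)),
        OcrWord("torah", (220, 50, 290, 80)),
    ],
)

print(find_phrase(page, "talmud torah"))
```

The search runs on the typed text, but what comes back is a set of positions on the picture, so the user reads the original page and can see at a glance whether the OCR misread anything.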
This proved to be a marvelous idea. A search may not always bring up every single occurrence of a word or configuration of words in a sefer, since the digitized text does have its errors. But there are so many sefarim to search through that almost any search will yield some result. It’s like a man who casts his fishing net into a pond packed with fish. Wherever he casts his net, he’ll come up with fish. For comparison’s sake, there are 1,200 sefarim to search through in the Responsa Project, as opposed to 100,000 sefarim in the Otzar Hachochma. Any search in the latter is bound to yield results.
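The fishing-net argument can be put into rough numbers. The Python sketch below assumes, purely for illustration, that the OCR reads any given occurrence of a phrase correctly with probability p, and that errors in different places are independent of one another. Neither the figures nor the independence assumption comes from either library.

```python
def chance_of_at_least_one_hit(p, n):
    """Chance that a search finds the phrase at least once.

    p: assumed probability that one occurrence was converted without error
    n: number of places the phrase really appears across the library
    """
    # The search fails only if every single occurrence was misread.
    return 1 - (1 - p) ** n


# Hypothetical accuracy of 60% per occurrence.
for n in (1, 3, 10):
    print(n, round(chance_of_at_least_one_hit(0.6, n), 4))
```

Under these assumptions, a phrase that appears in only one place is missed fairly often, while a phrase that appears in ten places is almost never missed. The more sefarim there are, the more places a phrase is likely to appear.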
In our last post, we compared digital searches to the genius of a librarian who remembers the contents of all the books in his library. The Otzar Hachochma librarian’s genius far surpasses that of the Responsa Project’s librarian. The latter may be punctilious and exact, while the former is a little spaced out and may bring up irrelevant results, but the former knows more sefarim by heart. From personal experience, the Otzar Hachochma works amazingly. Try typing your full name in the search engine, and you may be surprised to discover that the software has located it somewhere in its database (though it may not be in reference to you personally…).