Uncategorized Archive - Machon Totzaot - Professional OCR Scanning https://www.tozhaot.co.il/category/uncategorized/ Book scanning, OCR, smart scanning, publishing, book typesetting, multi-text typesetting, Otzar Hachochma Thu, 15 Dec 2022 15:10:47 +0000

Look at the Jug… and Who Won the Free Scan in the End? https://www.tozhaot.co.il/%d7%aa%d7%a1%d7%aa%d7%9b%d7%9c-%d7%91%d7%a7%d7%a0%d7%a7%d7%9f-%d7%95%d7%92%d7%9d-%d7%9e%d7%99-%d7%96%d7%9b%d7%94-%d7%91%d7%a1%d7%95%d7%a3-%d7%91%d7%a1%d7%a8%d7%99%d7%a7%d7%94-%d7%97%d7%99%d7%a0/ https://www.tozhaot.co.il/%d7%aa%d7%a1%d7%aa%d7%9b%d7%9c-%d7%91%d7%a7%d7%a0%d7%a7%d7%9f-%d7%95%d7%92%d7%9d-%d7%9e%d7%99-%d7%96%d7%9b%d7%94-%d7%91%d7%a1%d7%95%d7%a3-%d7%91%d7%a1%d7%a8%d7%99%d7%a7%d7%94-%d7%97%d7%99%d7%a0/#respond Thu, 15 Dec 2022 15:10:45 +0000 https://www.tozhaot.co.il/?p=2354 The post <strong>Look at the Jug… and Who Won the Free Scan in the End?</strong> first appeared on Machon Totzaot - Professional OCR Scanning.

The World of Digital Torah Resources – Part III https://www.tozhaot.co.il/the-world-of-digital-torah-resources-part-iii/ https://www.tozhaot.co.il/the-world-of-digital-torah-resources-part-iii/#respond Mon, 04 Apr 2022 13:42:41 +0000 https://tozhaot.co.il/?p=1521 The sheer scope of material made available, the endless ways it can be collated, categorized, collected and compared is nothing short of wondrous. The more a person works with these resources, the more he realizes how hugely beneficial they are.

The post The World of Digital Torah Resources – Part III first appeared on Machon Totzaot - Professional OCR Scanning.
In our previous post, we looked at the various ways Torah sefarim have been made available to us digitally. This resource is one of the greatest gifts our generation has been endowed with. It can be used in a vast number of ways.

The sheer scope of material made available, the endless ways it can be collated, categorized, collected and compared is nothing short of wondrous. The more a person works with these resources, the more he realizes how hugely beneficial they are.

The scanned library is remarkably similar to a bricks-and-mortar library, since the pages look just the way they do in a real sefer: they are photographed and stored as is. One practical difference between scanned and typed material is that scanned files are larger and take up more storage space.

Ask the executors of the “Responsa Project” or DBS Master Library how much it costs to type a sefer – they’ll roll their eyes and sigh. It’s a mammoth expense; each book takes hundreds of hours of typing and proofreading.

Some publishers of sefarim agreed to give the Responsa Project the print-ready digital text of their sefarim. That was a great solution, since the text was practically error-free, having already been edited and proofread. Sometimes the publisher even sent along valuable insights gleaned from working on the text. (One of the sefarim thus obtained for the Responsa Project is the Rokeach’s commentary on the seder hatefillah. As of Version 14 of the Responsa Project, the user can even type and save his own insights alongside the text. This idea was first introduced by Otzar Hachochma.)

But as you can imagine, not many publishers are willing to release the manuscripts in which they invested a lot of time and money. After all, once it’s available in this format, anyone can print it for himself, or, the publishers argue, potential clients have no need to purchase the print version at all, since they can access it digitally. (Although, as mentioned in a previous post, this reservation is unfounded, since the digital form can in no way replace the print form as a means of study. The digital form is only useful as a form of reference or for quick perusal.)

Users of the DBS Master Library sometimes use OCR-based programs such as Ligature or FineReader to convert scanned images of sefarim into editable text. But these programs have many disadvantages. One is that the Hebrew alef-beis contains many letters that look alike and that software struggles to tell apart, such as beis, chof, pei / chof, nun / reish, dalet, zayin, vav / hei, ches / alef, ayin, tes / yud, vav, nun, etc. This is another example of where the computer, with all its remarkable prowess, cannot compete with the human brain.

When you store a certain letter in the computer’s memory and then ask it to recognize the same shape in every image it sees, it won’t recognize the letter once there is the most minuscule change in its font size or boldness, not to mention major differences such as those found in handwritten text. If the leg of an alef is a little longer or shorter than in the original alef the computer was shown, the dumb genius won’t identify it as an alef. Nor does it penetrate its iron brain that there’s a difference between a dalet, which has a short stump on its right and a longer one on its left, and a zayin, which has a short stump on both sides. And if part of a letter on the scanned page is broken off, again, the computer will no longer recognize it. If the text is vowelized, the computer loses its mind completely.
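The brittleness described here can be illustrated with a minimal sketch of exact template matching. The 5x5 bitmaps and the function names are invented for illustration, and this is not a description of how Ligature, FineReader, or any particular OCR engine works. The sketch only shows that a stored shape stops matching after a one-pixel change, and that a tolerance threshold is the simplest way to soften that.

```python
# Toy glyph bitmaps: '#' is ink, '.' is blank. The shapes are illustrative only.
TEMPLATE = [
    "####.",
    "...#.",
    "...#.",
    "...#.",
    "...#.",
]

# The same shape with one extra pixel of ink, as a heavier font might produce.
VARIANT = [
    "####.",
    "...#.",
    "...#.",
    "...#.",
    "...##",
]


def differing_pixels(a, b):
    """Count positions where two equally sized bitmaps disagree."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def exact_match(glyph, template):
    """Naive recognition: every pixel must be identical."""
    return differing_pixels(glyph, template) == 0


def tolerant_match(glyph, template, max_diff=2):
    """Slightly more forgiving: allow a few stray pixels (threshold is arbitrary)."""
    return differing_pixels(glyph, template) <= max_diff
```

Here `exact_match(VARIANT, TEMPLATE)` fails even though the two shapes differ by a single pixel, while `tolerant_match` accepts it. A tolerance that is too generous would, of course, start confusing genuinely different letters.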

Humans Vs. Computers

The human ability to pick out essential details and ignore secondary ones, such as telling a printed dot from a speck of dirt stuck to the page, is something that IT engineers have yet to mimic. The developers of the program Ligature say their software relies on a method called ‘neural networks’ that mimics human thought. However, in an article they wrote on the topic, they concede that computer software is hopelessly inept when compared to the human brain. It is a clear example of how Hashem’s masterful designs can never be duplicated.

However, we digress. Let us return to the subject at hand. The DBS Master Library team probably also used programs such as these to digitize sefarim, and therefore their work contains many errors. There is no way around it: producing an error-free digitized sefer is a very time-consuming, costly task.

The Solution? Scan the Sefarim

It was at this point that the idea of scanning the sefarim came up. An optical scanner works faster than any typist, and with it, thousands of books can be stored in a Torah database in a relatively short amount of time. The image is, of course, 100% faithful to the original. The result is a library whose size and scope are unmatched even by huge physical libraries, yet so compact that it can be stored in a box the size of a siddur. The Otzar Hachochmah contains all the basic texts, as well as many other sefarim, including ancient, rare sefarim published for the first time in centuries, facsimiles of handwritten manuscripts, first editions of sefarim, Torah pamphlets, thousands of contemporary sefarim, etc. A typed library of this magnitude would cost literally millions of dollars. The fact is that the Responsa Project has been around for forty years and contains approximately 1,200 sefarim, while the Otzar Hachochmah has been around for only several years and contains about 100,000 sefarim!

Word Search

For all the advantages of a database of scanned sefarim, it has one major drawback: it is impossible to search within images. In a typed sefer, the computer scans the characters in its memory for a given sequence of letters. It cannot do so in a scanned sefer, which holds only images. A computer sees no difference between an image of a pretty picture and an image containing lettering; both are collections of pixels, except that one is more saturated with color than the other.

We have already discussed OCR-based software, which tries to mimic the reading ability of the human eye and brain and converts scanned text into typed text. This technology allows what were until now mere images to be searched. But the converted text contains many errors, and in its raw, unedited, non-proofread form it is virtually useless. On the other hand, editing and proofreading the endless reams of text, even on the most elementary level, would cost millions.

The software developers of Otzar Hachochmah came up with a brilliant solution. They converted the images to text with OCR technology, but instead of showing the user only the typed text, the software also shows the original image. It is as if two pages are ‘stuck’ to each other: the top one is the image, and the one beneath it is the editable, digital text converted from that image. When the user conducts a search, the typed text shows up along with a marker highlighting those words in the original image. The searcher can thus check for any errors.
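One way to picture the two ‘stuck’ pages is as a page record that keeps the scan alongside a hidden list of recognized words, each with its position on the image. The sketch below is an assumption about how such a structure could look, not a description of Otzar Hachochmah’s actual implementation; the class names, field names, file name, and coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str    # what the OCR engine believes the word is
    x: int       # left edge of the word on the page image, in pixels
    y: int       # top edge
    width: int
    height: int


@dataclass
class ScannedPage:
    image_file: str   # the untouched scan that the reader sees
    words: list       # hidden OCR layer: a list of OcrWord


def find_on_page(page, query):
    """Return the image rectangles of every OCR word equal to the query."""
    return [(w.x, w.y, w.width, w.height) for w in page.words if w.text == query]


page = ScannedPage(
    image_file="page_0001.png",  # hypothetical file name
    words=[
        OcrWord("אמר", 400, 50, 60, 30),
        OcrWord("רבא", 330, 50, 55, 30),
        OcrWord("אמר", 400, 90, 60, 30),
    ],
)
```

Calling `find_on_page(page, "אמר")` returns two rectangles, which a viewer could draw over the scan so the reader sees the original print at exactly the places the search matched, and can judge for himself whether the OCR read them correctly.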

This proved to be a marvelous idea. Because the digitized text has its errors, a search may not bring up every single place a word or phrase appears in a sefer. But there are so many sefarim to search that almost any search will yield some result. It’s like a man who casts his fishing net into a pond packed with fish: wherever he casts his net, he’ll come up with fish. For comparison’s sake, there are 1,200 sefarim to search in the Responsa Project, as opposed to 100,000 in the Otzar Hachochma. Any search in the latter is bound to yield results.
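The fishing-net argument can be put in rough numbers. The sketch below assumes that each occurrence of a phrase is recognized correctly with some fixed probability, independently of the others. Both the independence and the rate are assumptions made purely for illustration; the article gives no accuracy figures.

```python
def chance_of_at_least_one_hit(per_occurrence_rate, occurrences):
    """Probability that at least one of `occurrences` independent copies
    of a phrase was recognized correctly by the OCR."""
    return 1 - (1 - per_occurrence_rate) ** occurrences
```

With an assumed rate of 0.8, a phrase that appears three times in the library would be found at least once about 99% of the time. The more sefarim there are, the more occurrences there tend to be, and the less any single misread matters.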

In an earlier post, we compared digital searches to the genius of a librarian who remembers the contents of all the books in his library. The Otzar Hachochma librarian’s genius far surpasses that of the Responsa Project’s. The Responsa Project librarian may be punctilious and exact, while the Otzar Hachochma librarian is a little spaced out and may bring up irrelevant results, but he knows far more sefarim by heart. From personal experience, the Otzar Hachochma works amazingly well. Try typing your full name into the search engine, and you may be surprised to discover that the software has located it somewhere in its database (though it may not be in reference to you personally…).

The World of Digital Torah Resources – Part II https://www.tozhaot.co.il/the-world-of-digital-torah-resources-part-ii%ef%bf%bc/ https://www.tozhaot.co.il/the-world-of-digital-torah-resources-part-ii%ef%bf%bc/#respond Sun, 27 Mar 2022 13:48:43 +0000 https://tozhaot.co.il/?p=1406 These two programs are the largest and most comprehensive electronic collection of Jewish texts. The 'Responsa Project' or 'Response' has been around for forty years (!) and was initially developed at a time when most people didn't have computers at home.

The post The World of Digital Torah Resources – Part II first appeared on Machon Totzaot - Professional OCR Scanning.
The World of Digital Torah Resources – Part II

In the previous post we discussed the revolution that technological development has brought to the world of sifrei kodesh, and the differences between a human brain and computer chip.

In this post we’ll tell you about the vast virtual libraries of sefarim that can be accessed from any computer. These include: the Global Jewish Database (Responsa Project) sponsored by Bar-Ilan University, the DBS Master Library, Tochnat Haketer – a digitized Mikraos Gedolos Chumash, also sponsored by Bar-Ilan University, Otzar Hachochmah, and Otzar HaTorah, a collection of scanned sefarim. There are more, but these are the principal ones.

The Responsa Project and the DBS Master Library

These two programs are the largest and most comprehensive electronic collections of Jewish texts. The ‘Responsa Project’, or simply ‘Responsa’, has been around for forty years (!) and was initially developed at a time when most people didn’t have computers at home. The most recent version includes a Talmudic encyclopedia.

The DBS Master Library also goes back many years, although it is newer than the Responsa Project. It, too, has been upgraded many times over the years.

One has to appreciate the incredible compactness of digitized data. A standard bookcase in a Torah home usually occupies at least one of the living room walls, and it usually contains the most basic Torah sefarim: Chumash, Mishnayos, Talmud Bavli, Rambam, Tur and Shulchan Aruch, sefarim of Rishonim on Shas, some mussar sefarim, and sefarim on Torah thought.

Talmud Bavli alone takes up at least one shelf. So do the Tur and Shulchan Aruch. And if it is a special edition, such as the one published by Machon Yerushalayim or the first edition of Shas produced by Harav Kook’s Institute, it takes up even more space.

And yet, all these sefarim, plus more, can be contained in one thin DVD, which takes up no space at all! And there are no lost or damaged sefarim to contend with, no dusting of sefarim on erev Pesach, no wondering, ‘On what shelf did I put that sefer?’

The digitized collections of sefarim have two main features: First, the possibility to browse through any book in the database. This requires a good catalog that will help the researcher easily find the sefer he’s looking for. Second, the ability to search for any word, phrase, or combination of words in the database.
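The two features can be sketched in a few lines: a sorted catalog for browsing, and a word index for searching. This is a generic illustration of the idea (an inverted index), not the design of the Responsa Project or the DBS Master Library; the titles and texts are placeholders.

```python
from collections import defaultdict


def build_index(library):
    """Map each word to the set of (title, position) pairs where it occurs.

    `library` is a dict of {title: full text}; words are split on whitespace.
    """
    index = defaultdict(set)
    for title, text in library.items():
        for position, word in enumerate(text.split()):
            index[word].add((title, position))
    return index


def catalog(library):
    """Feature one: an alphabetical list of titles to browse."""
    return sorted(library)


def search(index, word):
    """Feature two: every place a word occurs, sorted for stable output."""
    return sorted(index.get(word, set()))


# Hypothetical two-book library with placeholder text.
library = {
    "Sefer B": "torah chesed torah",
    "Sefer A": "chesed emes",
}
index = build_index(library)
```

Building the index once up front is what makes later searches fast: `search(index, "torah")` looks up one dictionary entry instead of rereading every sefer.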

The catalog of the Responsa Project has 26 main categories, including the five Chumashim, Talmud Bavli, Talmud Yerushalmi, Mishnayos, and Tosfos. These are divided into further sub-categories. We won’t tire you with the breakdown of all the divisions and sub-divisions; suffice it to say that the hundreds of sefarim contained on the DVD would fill bookcases covering all four walls of a decently sized living room, if not more.

When I first accessed the software and tried to peruse a page of Gemara, I was sorely disappointed. Instead of viewing a familiar, beloved page of Gemara, one with Rashi on one side, Tosfos on the other, and various other commentaries surrounding the main text, all I saw was cold, sterile text, with annoying + signs before and after each commentary. It bore no resemblance to a traditional page of Gemara.

Today, after many years of consistent use of the software, I must admit that nothing comes close to learning the daf yomi from a real sefer; it is incomparable to studying from a sterile computer screen. But when it comes to searching for a word or phrase, or looking something up in a sefer, the merits of digitized sefarim are indisputable.

Take, for example, a talmid who is studying a sugya in depth and wants to look up the opinion of the Rambam or Shulchan Aruch. He leafs through the sefer Ein Mishpat and goes to the bookcase to look up the relevant sources. The Rambam directs him to further sources, and so on and so on, until a huge pile of sefarim mounts up on his desk. And all that assumes he actually finds the sefarim he’s looking for. If he’s in yeshivah or kollel, chances are that somebody else is using the sefer.

On a Gemara page in the Responsa Project software, one doesn’t even have to look up a source in Ein Mishpat. A hypertext link leads to related sources such as the Rambam or Tur and Shulchan Aruch, as well as to other places in the Gemara where the same topic is mentioned. One click, and you get a list of all the commentaries of the Rishonim and Acharonim on this topic. The same applies when perusing pages of Tanach, Midrash, and other Torah texts.

Background Information

Another advantage of the Responsa Project is the biographical pages it provides. When a sefer is typed up for the purpose of reprint, the typist doesn’t always bother typing the index page, which usually offers some background information on the sefer and its author. It can be frustrating to run a search in an electronic Torah database and have relevant passages come up with no information as to who wrote the sefer in which they appear.

And that is without mentioning the Responsa Project’s other features: a bibliographical index listing references to sefarim not included in the database, articles in Torah journals, and the like; a recently added option to write personal notes in the margins of sefarim; a calculator for the gematria (numerical value) of any expression of your choice; a search for biblical verses or expressions with any specified gematria; a dictionary of abbreviations; a Hebrew calendar; and more.
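The gematria feature is easy to sketch. The version below uses the standard letter values, with final forms counted the same as their regular forms; other gematria systems exist, and which ones the Responsa Project supports is not stated here. The function names are made up for illustration.

```python
# Standard letter values; final forms count the same as their regular forms.
GEMATRIA = {
    "א": 1, "ב": 2, "ג": 3, "ד": 4, "ה": 5, "ו": 6, "ז": 7, "ח": 8, "ט": 9,
    "י": 10, "כ": 20, "ל": 30, "מ": 40, "נ": 50, "ס": 60, "ע": 70, "פ": 80,
    "צ": 90, "ק": 100, "ר": 200, "ש": 300, "ת": 400,
    "ך": 20, "ם": 40, "ן": 50, "ף": 80, "ץ": 90,
}


def gematria(expression):
    """Sum the letter values; spaces, vowel points and punctuation count as 0."""
    return sum(GEMATRIA.get(ch, 0) for ch in expression)


def with_gematria(expressions, target):
    """Return the expressions whose gematria equals `target`."""
    return [e for e in expressions if gematria(e) == target]
```

For example, `gematria("חי")` gives 18, and `with_gematria` runs the search in the other direction, picking out from a list the expressions that add up to a given value.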

In our next post we will explore the revolution caused by the scanning of a vast number of ancient Jewish texts.

The World of Digital Torah Resources – Part I https://www.tozhaot.co.il/the-world-of-digital-torah-resources-part-i%ef%bf%bc/ https://www.tozhaot.co.il/the-world-of-digital-torah-resources-part-i%ef%bf%bc/#respond Sun, 27 Mar 2022 13:35:02 +0000 https://tozhaot.co.il/?p=1397 Imagine yourself in a library that has an uber-efficient librarian. No need to tell him where your required book is located; just tell him the name of the book and he immediately locates it.

The post The World of Digital Torah Resources – Part I first appeared on Machon Totzaot - Professional OCR Scanning.
The World of Digital Torah Resources – Part I

Imagine yourself in a library that has an uber-efficient librarian. No need to tell him where your required book is located; just tell him the name of the book and he immediately locates it. He even knows the contents of each book by heart! You can ask him how many times a certain word is mentioned in all the books the library carries, and he will promptly show you on which pages that word is located in every single book on the library’s shelves.  

And not only that. The librarian is present and available in the library 24 hours a day. Whenever you feel like visiting, he’s there, ready and willing to help you out.

Well, digital software containing electronic versions of hundreds of sefarim does just this, plus much more.

It is said that one of the talmidim of the Vilna Gaon boasted to his Rebbe that he had studied a certain masechta so many times that he knew it by heart. The Gaon asked him, “Do you know how many times Abaya and Rava are mentioned in that masechta? Can you recite the masechta in reverse order?” The student, of course, did not know, and the Gaon proceeded to tell him exactly how many times Abaya and Rava are mentioned, and recited one of the passages back to front.

“Knowing a masechta by heart,” said the Gaon, “means knowing it back to front, like every Yid knows the words of Ashrei.”

Now all of us can recite “Ashrei Yoshvei…” by heart, but try saying it in reverse – starting with the word Hallelukah. We won’t get very far. Or if somebody shoots out, “What’s the third word in the tenth posuk?” it’ll take a while for us to come up with the answer.

That’s because a person’s memory is associative – one word leads to the next, until he strings together an entire paragraph, or chapter. That’s how we remember songs; it’s not only musical prodigies who can remember thousands of melodies by heart, each of which consists of scores of different notes. How is it that we can remember the exact combination of notes when we have a hard time remembering a single phone number containing a mere ten digits?

Because the consecutive notes relate to each other; each sound follows the other to form a meaningful melody. The same goes for memorizing facial details; each face has a combination of thousands of small details that form one memorable composite. And that’s why it’s easier to remember the concept of a difficult sugya in Gemara than to memorize a page of Gemara by heart.  The name of the game is – logical connection.

Memory such as the Vilna Gaon possessed is rarely found in humans – it’s dubbed ‘a photographic memory’ for it’s usually only seen in computers and scanners that scan photographs. A human being is no computer.

We need computer software to provide us with the memory we are missing. Scanners can photograph huge amounts of text in one second, and computerized software can tell us exactly how many times the names of Abaya and Rava are written in a certain sefer, how many times Ravina and Rav Ashi appear together in the entire Shas, how many times they appear individually, or anything else that has you pondering.  
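Counting of this kind takes only a few lines once the text is typed. The sketch below counts whole-word matches of a name and the number of passages in which two names appear together. The passages are placeholders in transliteration, not real quotations, and a real tool would also have to cope with Hebrew prefixes and spelling variants, which this sketch ignores.

```python
def count_name(passages, name):
    """Total number of times `name` appears as a whole word."""
    return sum(p.split().count(name) for p in passages)


def count_together(passages, first, second):
    """Number of passages that mention both names."""
    return sum(1 for p in passages if first in p.split() and second in p.split())


# Placeholder passages in transliteration; not real quotations.
passages = [
    "amar Abaya amar Rava",
    "amar Rava",
    "amar Abaya",
]
```

With these placeholder passages, each name is counted twice overall, and the two names appear together in exactly one passage.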

For example: in Gemara Bavli, the term ‘Shema Mina Tlas’ is mentioned a number of times, meaning that three new halachos can be derived from the passage in question. To the best of my recollection, I’ve never seen it mentioned in Bavli that we can learn two or four or five halachos from a passage. Only three! Isn’t that strange?

Of course my memory can’t be relied on, so what do I do? If I don’t have the right computer software, my only option is to go through all of Shas, which will take me A LOT of time.

But if I do have the software, it’s no big deal: I can just type ‘Shema Mina Tarti’ in Talmud Bavli and click on the Search icon. Nothing will come up. My interesting observation has been confirmed. I can now try to theorize why this is so, but at least I know it’s so.
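The check described here amounts to searching for each variant of the phrase and seeing which ones return nothing. A minimal sketch, using an invented placeholder corpus rather than the actual text of Shas:

```python
def phrase_hits(corpus, phrase):
    """Return the labels of all passages that contain `phrase` verbatim."""
    return [label for label, text in corpus.items() if phrase in text]


# Placeholder corpus: labels and text are invented for illustration.
corpus = {
    "passage 1": "... shema mina tlas ...",
    "passage 2": "... shema mina tlas ...",
    "passage 3": "... a passage without the phrase ...",
}

variants = ["shema mina tarti", "shema mina tlas", "shema mina chamesh"]
report = {v: len(phrase_hits(corpus, v)) for v in variants}
```

A count of zero for a variant is the interesting result: it says the phrase never occurs anywhere in the searched text, which is exactly what would take so long to establish by hand.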

The computer software can also search all other parts of sifrei Chazal, and it will find that in another masechta (Kallah Rabsi 91, Halacha 3) the phrase “Shema Mina Tarti” appears once, as well as in Talmud Yerushalmi (Pesachim 5:3). And the phrase “Shema Mina Chamesh” appears once in Yerushalmi (Kiddushin 2:5).

We’ll end with the statement that is so relevant today: “The fear of Shabbos befalls the ignoramus.” Asks the Tosfos, “Why is that?” In our generation, this takes on a new meaning: On Shabbos, we can’t use the computer, and our true ignorance is exposed…

In the next article we’ll talk a little more about the vast Torah resources that have been made available to us.

Speedy Scanning https://www.tozhaot.co.il/speedy-scanning/ https://www.tozhaot.co.il/speedy-scanning/#respond Sun, 27 Mar 2022 13:05:40 +0000 https://tozhaot.co.il/?p=1377 Allow us to tell you more about the remarkable breakthrough in the Jewish publishing world; the scanning of ancient sefarim for the purpose of reprint.

The post Speedy Scanning first appeared on Machon Totzaot - Professional OCR Scanning.
From Dream to Print

Allow us to tell you more about a remarkable breakthrough in the Jewish publishing world: the scanning of ancient sefarim for the purpose of reprint.

Not so long ago, if somebody wanted to reprint a sefer, the sefer had to be painstakingly typed up, letter by letter, line by line, page by page.

The typed-up work was, with rare exception, riddled with errors. The text had to be reviewed by several proofreaders in order for the errors to be caught and fixed, yet despite that, sefarim reprinted in this way contain many glaring errors. Sometimes entire lines were skipped, resulting in a distorted interpretation of the original text.

Many sefarim were reprinted with lines missing from the original. Although proofreaders are tasked with comparing the typed-up version to the original sefer, the work is tedious, and they would often rely on their own judgement; if the idea presented made sense, they would leave it at that. Serious mistake.

And then, OCR software was born, marking the beginning of a new era.

What is OCR?

OCR is a technology that recognizes letters and numbers that were scanned by a scanner and converts them from an image into a digital, editable file.

This is how it works: the scanner photographs the page, and the software reads it, word for word, line by line. The software is smart; it knows when to move on to the next column and when to turn a page, what continues from a previous section, and where a new section begins. It recognizes and reads everything, including page numbering, titles, and any comments written in the margins, without missing a single letter. The software is an exceptional Talmid Chacham!
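Knowing when to move on to the next column is a reading-order problem. The sketch below shows one simple way to order recognized words on a two-column, right-to-left page. It assumes the columns split exactly at the middle of the page and that words on the same line share the same `y` value; real layout analysis is far more involved, and nothing here describes any specific OCR product.

```python
def reading_order(words, page_width):
    """Sort (text, x, y) word boxes for a two-column, right-to-left page.

    The right-hand column is read first; within a column, lines run top to
    bottom and words within a line run right to left.
    """
    middle = page_width / 2

    def key(word):
        text, x, y = word
        column = 0 if x >= middle else 1   # right column comes first
        return (column, y, -x)

    return [text for text, x, y in sorted(words, key=key)]


# Hypothetical word boxes: (text, x of left edge, y of line).
words = [
    ("left-1", 300, 10),
    ("right-2", 700, 10),
    ("right-1", 900, 10),
    ("right-3", 900, 40),
    ("left-2", 100, 10),
]
```

For these boxes on a page 1000 pixels wide, the function reads the whole right column (`right-1`, `right-2`, `right-3`) before moving on to the left one (`left-1`, `left-2`).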

As the software “reads” the scanned pages, it converts the faded, sometimes partially erased letters into digital text, with remarkable speed and accuracy.  The material is now completely editable and can have words inserted, deleted, moved around, and replaced.

What Makes Machon Totzaot Unique

As with many good things, OCR software is not entirely bug-free. Sometimes, it doesn’t pick up on the unique terminology of Torah expositions and it converts the ancient Hebrew used in Gemara and other Torah texts to modern Ivrit.

Rabbi Bransdorfer has developed a unique formula: he has upgraded the standard OCR software into one that is especially sensitive to the nuances of the Hebrew used in sifrei kodesh. It can distinguish between standard Hebrew and the Hebrew used in Torah works. It ‘read’ and converted the original edition of the Rambam into a digitized, reprintable version with virtually zero errors.
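The article does not say how the upgraded software works, so the following is not Rabbi Bransdorfer’s method. It is only a generic sketch of one common idea for domain-aware OCR: checking each recognized word against a lexicon of the expected vocabulary and replacing near-misses with the closest entry. The four-word lexicon and the cutoff value are placeholders; `difflib.get_close_matches` is part of Python’s standard library.

```python
import difflib

# A tiny stand-in lexicon; a real one would hold many thousands of entries.
LEXICON = ["אמר", "רבא", "אביי", "תניא"]


def correct_word(ocr_word, lexicon=LEXICON, cutoff=0.7):
    """Replace an OCR word with its closest lexicon entry, if one is close enough."""
    if ocr_word in lexicon:
        return ocr_word
    matches = difflib.get_close_matches(ocr_word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_word
```

If the OCR misreads a beis as a chof and produces `"אכיי"`, `correct_word` maps it back to `"אביי"`, while a word with no close match in the lexicon is left untouched for a human proofreader.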

Projects that used to take months of tedious, laborious work now take several days, with near-perfect results. Machon Totzaot has successfully scanned and ‘read’ many ancient sefarim, some of which were too illegible even to be typed up. It has also been involved in several complex projects that had to go through many reviews in several different formats until the required result was achieved.
