doc: add missing parts for default training #11

Open
wants to merge 2 commits into base: main

22 changes: 11 additions & 11 deletions TrainingTesseract-4.00.md
@@ -295,7 +295,7 @@ The following table describes its command-line options:
| `sequential_training` | `bool` | `false` | Set to true for sequential training. Default is to process all training data in round-robin fashion. |
| `net_mode` | `int` | `192` | Flags from `NetworkFlags` in `network.h`. Possible values: `128` for Adam optimization instead of momentum; `64` to allow different layers to have their own learning rates, discovered automatically. |
| `perfect_sample_delay` | `int` | `0` | When the network gets good, only backprop a perfect sample after this many imperfect samples have been seen since the last perfect sample was allowed through. |
| `debug_interval` | `int` | `0` | If non-zero, show visual debugging every this many iterations. |
| `debug_interval` | `int` | `0` | If non-zero, show visual debugging every this many iterations (requires Java and ScrollView.jar). |
| `weight_range` | `double` | `0.1` | Range of random values to initialize weights. |
| `momentum` | `double` | `0.5` | Momentum for alpha-smoothing gradients. |
| `adam_beta` | `double` | `0.999` | Smoothing factor for squared gradients in the ADAM algorithm. |
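The `perfect_sample_delay` gating described in the table can be sketched in plain shell. This is an illustrative model of the documented behavior, not code from `lstmtraining` itself:

```shell
# Illustrative sketch only: a "perfect" sample is backpropagated only
# after at least perfect_sample_delay imperfect samples have been seen
# since the last perfect sample that was allowed through.
perfect_sample_delay=2
imperfect_since_last=0
perfect_allowed=0
for sample in perfect imperfect perfect imperfect imperfect perfect; do
  if [ "$sample" = perfect ]; then
    if [ "$imperfect_since_last" -ge "$perfect_sample_delay" ]; then
      perfect_allowed=$((perfect_allowed + 1))   # backprop this one
      imperfect_since_last=0
    fi                                           # otherwise: skipped
  else
    imperfect_since_last=$((imperfect_since_last + 1))
  fi
done
echo "perfect samples backpropagated: $perfect_allowed"
```

With a delay of 2, only the last perfect sample in this stream is backpropagated; the first two arrive before enough imperfect samples have been seen.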
@@ -635,15 +635,13 @@ Training data is created using [tesstrain.sh](https://github.com/tesseract-ocr/t
as follows:

```
cd tessdata
wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--noextract_font_properties --langdata_dir ~/tesstutorial/langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
```
And the following is printed out after a successful run:
```
Created starter traineddata for LSTM training of language 'eng'
Run 'lstmtraining' command to continue LSTM training for language 'eng'
```
After a successful run, the following is printed: `Created starter traineddata for LSTM training of language 'eng'`. The files `~/tesstutorial/engtrain/*.lstmf` have been created, and you can then jump to the section "Training From Scratch" below.
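Later `lstmtraining` steps read the generated `.lstmf` files through a plain-text listing, one path per line. A minimal sketch of how such a listing is assembled (the directory and file names below are placeholders, not actual tesstrain.sh output):

```shell
# Illustrative sketch: lstmtraining consumes a plain-text list of
# .lstmf files, one path per line. The placeholder files here stand in
# for the real ones produced by tesstrain.sh.
dir=$(mktemp -d)
touch "$dir/eng.Arial.exp0.lstmf" "$dir/eng.Times_New_Roman.exp0.lstmf"
find "$dir" -name '*.lstmf' | sort > "$dir/eng.training_files.txt"
cat "$dir/eng.training_files.txt"
```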

The above command makes LSTM training data equivalent to the data used to train
base Tesseract for English. For making a general-purpose LSTM-based OCR engine,
@@ -653,7 +651,7 @@ Now try this to make eval data for the 'Impact' font:

```
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--noextract_font_properties --langdata_dir ~/tesstutorial/langdata \
--tessdata_dir ./tessdata \
--fontlist "Impact Condensed" --output_dir ~/tesstutorial/engeval
```
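Keeping the training and eval runs in separate output directories gives each its own `.lstmf` files and listing, so `--train_listfile` and `--eval_listfile` can later point at disjoint data. A small sketch with placeholder paths and file names:

```shell
# Sketch with placeholder paths: train and eval line data live in
# sibling directories, each with its own listing file.
root=$(mktemp -d)
mkdir -p "$root/engtrain" "$root/engeval"
touch "$root/engtrain/eng.Arial.exp0.lstmf" \
      "$root/engeval/eng.Impact_Condensed.exp0.lstmf"
for d in engtrain engeval; do
  find "$root/$d" -name '*.lstmf' > "$root/$d/eng.training_files.txt"
done
# Only the eval listing mentions the Impact font:
grep -c Impact "$root/engeval/eng.training_files.txt"
```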
@@ -688,9 +686,11 @@ The following example shows the command line for training from scratch. Try it
with the default training data created with the command-lines above.

```
mkdir -p ~/tesstutorial/engtrain
find ~/tesstutorial/engtrain -name "*.lstmf" > ~/tesstutorial/engtrain/eng.training_files.txt
mkdir -p ~/tesstutorial/engoutput
training/lstmtraining --debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
src/training/lstmtraining \
--traineddata tessdata/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
@@ -1208,4 +1208,4 @@ If you notice that your model is misbehaving, for example by:
* Adding `Space` where it should not do that.
* etc...

[Then read the hallucination topic.](The-Hallucination-Effect.md)