Releases: allenai/olmocr

v0.1.60

17 Mar 17:00

What's new

Commits

dd72563 Bump version to v0.1.60 for release
baa0082 Don't go down too low in temp
f2951f3 Lints
1e42e5e Faster and nicer equation cache
1f8cc59 Pipeline scales temperature automatically, increases performance ~2%
4768ac4 Merge branch 'main' of https://github.com/allenai/olmocr
0968bd1 Mine headers footers
1270ca3 lints
d7361c4 Basic convert script
142a9cb Convert script to support broader folder structures
98c4283 Cap max workers to hopefully improve stability
5f3ef51 Faster equation cache and checking, cleanup data script
79e2677 Hmm, these should be passing!
f5d92bd Trying to get new CI to work
1db1b34 Merge pull request #122 from allenai/gpu-ci
9f38a8a Lints
5009bb3 Lints
acb0df3 Fixes
3eec2a8 Mining math
95f03e1 More small tests
d30a070 Tests
2696502 Much faster and responsive math bench
980121f Loading tests much faster in parallel
7729e5a Graphical pdf test from github
154a07c Math miner looks decent
d0b9b5b Fixes for math mining
09fd299 Mining
3f92265 Math miner working decently
5387a79 More tests for olmocrbench
189104b Fixing escaped html bug in mathml parsing
770bc36 Fixes for multipage
0553443 Convert scripts and other fun
8b3a9e4 Fixes for multipage runners
743e48e More fixes
b2fe82d Working on math compares
bc3a945 Adding some tests
35cc6f1 A few fixes for text comparisons and normalized chars
4709156 Leaving with some more data, but still cases to investigate
07be9ea More math testing
e39c3e4 New method for comparing equations
fff4050 More test documents
0ba56c0 Adjusting repeat test to be the "baseline" test which also looks for disallowed characters
a2b5ca8 Better markdown table parsing
3fef3f9 Gemini support, some debugging stuff
fc857f9 Starting on math dataset
d006e8f Working on equation matching
7003e9c Working on a better compare function
e144200 Fix markdown parsing for mistral
bdc0d75 Adding mistral ocr to eval
4053ea5 Work on image matching
b03d840 Better error handling on eqn rendering
438e68e Some more math stuff
7f36ac8 First math tests
b62ccc2 Equation rendering code, first pass
9be696f Adding a trailing repetition test
07466e1 Stats tests
eeb2733 Marker rerun, stats changes
50e55f4 Conversion fixes
fb0a729 Better convert script
fa68c6b Better conversion script, run on more things
c9ecd8e Need those chat templates
5611d79 Model runners
5cb32c3 Convert script work with server backends
87875b3 Merge branch 'main' of https://github.com/allenai/olmocr into main
2982526 Convert scripts for benchmark
1545a6d Adding more work on diffs
004486f Nice tables support
3a0bcb6 Better table tests
748fd62 Adding basic table relative tests
76476f9 Synth rendering ideas
c4f6b11 Fixing the mine diffs script, but it still doesn't work great
fcb1eab Consistent ordering on convert, with data dir script
ecac384 Making a nicer warning message when waiting for sglang server
03ef353 One last lint fix
7d7e81e Internal version bump
7a7c878 double parentheses for proper escaping
dc7cb5c Ruff fixes to CI
1348a29 Merge branch 'main' of https://github.com/allenai/olmocr into main
ca0f911 Probably need at least 20GB GPU ram to have a good time with olmocr
2241853 Merge branch 'main' of https://github.com/allenai/olmocr into main
a701a37 Fix for calling --pdfs with an invalid pdf
622540e Fix so that the pipeline.py attempts to download the model weights first, before starting the loading timeout
010fdf8 Small fix
7dd44ed convert script
701abdb Some new entries
1148b47 Minor fixes
361ed2a Merge branch 'main' of https://github.com/allenai/olmocr into main
9f12917 Organizing things for data entry
af02c63 Working viewer
8061aac Working on viewer/editor for rules
ab13ac6 Mining diff script outputs candidate rules
99ab046 Autominer work
143769b Merge pull request #61 from allenai/kylel/elo
1b78ec9 More work on automining
3670219 commits
2d4c1a1 Merge branch 'main' of https://github.com/allenai/olmocr into main
a03673e Working on some progress for the autominer, fixing more options in convert script
11e89dc Script fixups
505e08c automine draft
ae7efd3 Refactoring
9e019f1 More factoring
bd08fdb fixes missing OSS code for Issue #36
d4b902c Olmocr runner implemented
aac0c15 chatgpt converter
8a6e8b9 Basic rule viewer
9081f7f Update README.md
0130a97 fixed style
c2b54d8 updated readme
d841216 Merge branch 'main' of https://github.com/allenai/olmocr into main
813a355 Fixing mineru runner, added a few sample docs
cc1f476 Bugfixes
9da1f92 Cleaner implementations of benchmark stuff
53494d9 Refactoring
ff465f7 Starting refactor
a348cd6 olmocr bench runner
c20e3c0 Pdf for dataset
16a3244 olmocr running
422d08f Adding more rules and seeing how they should work
f2f7619 Adding mineru script
e5a80c5 Fixing up benchmark a bit
c3d0ce9 Some readmes and instructions
4e0339f Runner for olmocr bench
a8f6921 Benchmark runners for other systems
318abf2 Adding runbench
1230aef Making progress
072bc1d Making some progress
823629d Sample code for olmocrbench
9e62003 Adding readme for olmocr bench
e4f9b19 Infinigram counting script for paper
6020122 Match script
b871e4b Small helper to measure overlap

v0.1.58

15 Feb 00:29

What's new

Commits

a2c0887 Bump version to v0.1.58 for release
0e7b397 Update README.md

v0.1.53

14 Feb 21:15

What's new

  • Fixed git checks

Commits

08f7612 Bump version to v0.1.53 for release
58bdfa5 CI
25ec87b CI
c05e015 Hopefully CI runs now
15f9b8b Install poppler in CI
229da8c unused imports
32aa359 Formatting fix
0dcdbcc Update README.md
6583fb6 hfupload scripts
8297955 Making my parquets
51cfdbd Better converter
e369569 Update README.md
91eef27 Adding some gnarly 1 pager pdfs from kyle
87cb957 First pass at dataset builder script
6ed6f85 Generating parquets for hugging face
84c0c71 Merge branch 'main' of https://github.com/allenai/olmocr
7d67a59 Remove unused
6471f28 Random git ignores, remove unused code
c74d47a Pipeline fixes
04844b3 More beaker and docker fixes
9df86da Beaker fixes
cf6673c Pipeline fixes
7fbbb57 Remove mypy for now
d36e556 Hopefully fixes build
c69e0d6 More cleanup, removing dead adv anchor code
d4d711d Nicer glob handing for pipeline.py
84477b5 More formatting
e3d04ee Merge branch 'main' of https://github.com/allenai/olmocr into main
c37e545 running isort again
2c29533 Fixing most ruff errors
5690377 Ruff
fb40229 Isort and black update
cdb10a9 Python 3.11
dcaca8a Black formatting
4a1762d isort
0628d31 Some unit test cleanup
7d2403d More infos
8dd006d Merge branch 'main' of https://github.com/allenai/olmocr into main
04615d7 More logging on sglang server
0ccb99c readme
2e4ef95 Readme
2192505 Update README.md
9a1be7e Readme
496e162 Update README.md
b574766 Viewer and gitignore
86267d8 Viewer cleanup
a243c89 Update README.md
dbf6477 viewer fix
4c35105 More readme imporvements
f16acec Readme improvements
dee494a Local file stuff
7882944 Local pdf support
dbe5487 Support stats feature later
48447b6 Can use remote s3 files, and local workspace now
50f9a6a Name refactor
e0afb93 Better check for separate sglang installation step
00e3aac Inference test for qwen2 and 2.5, work queue fixes, build current still broken
4d0d924 Merge branch 'main' of https://github.com/allenai/olmocr
b28aad6 More test docs
96ae2dd Refactoring
c606267 Cleaning up some unused code
d8c13d0 Readmes and version updates
b2894d0 Massive refactor from pdelfin to olmocr
7261bfc Update README.md
cbfc803 Merge pull request #27 from allenai/molmo
aa59d38 Merge branch 'main' of https://github.com/allenai/pdelfin
eacd044 csv output
201fec3 Config update
72d2fa2 Reviewing molmo training
0311b44 Some small updates
6586744 Building some data summary tools
c74e3d1 ELO stuff
18f72b4 New ELO building stuff finished up I think
50464c1 build elo v1
3a28955 Added ELO scores
a8d9a55 Fixes for elo
00f2a67 More elo scoring stuff
834e91c runelo start
ef4167d Test set script
683be68 Better error handling on expand_s3_glob
5e633e0 Merge branch 'main' of https://github.com/allenai/pdelfin
0d1fc08 Small fixes
2190f61 Merge branch 'main' of https://github.com/allenai/pdelfin
e2bbd0e Adding some long context stats
0b72eda Move form check into exception handler, don't mark the work item as done if it had an exception on it
fa318da New version with s3 fix in it
84c53c2 Merge branch 'main' of https://github.com/allenai/pdelfin
e9c3c21 Skipping files which are not found
3e33ce1 Ignores
37cdb9e Merge branch 'main' of https://github.com/allenai/pdelfin
1eda300 Dolma viewer niceties
fe04db8 Better error handling
35502bc Limit the number of retries on the server process
b3ca86a More robust to errors when reading logs which had caused freezes
d4f3cff More reliable weka
6872105 Merge branch 'main' of https://github.com/allenai/pdelfin
c93fc36 Missing import
dd17185 More things to try
46fe4ac Trying fixes for live lock
41accfe Error out if you see a broken process pool, might need a better check for this
a95487e Adding check for possible sglang livelock
cff9799 Moving to official sglang release
f8dcdf6 Better catching of httpx errors and retrying them
d6a0013 Faster init by caching pdf filter
a91befc Fix for fallback stuff
8c858a9 New version
66fff4f Merge branch 'main' of https://github.com/allenai/pdelfin
212d391 More convservative filtering
cb800d6 Merge branch 'main' of https://github.com/allenai/pdelfin into main
7dd2046 New version
af8ce51 Merge branch 'main' of https://github.com/allenai/pdelfin into main
9112d81 No keep alive connection to try to resolve sglang livelock
53a5104 Merge branch 'main' of https://github.com/allenai/pdelfin into main
67d11ec TODOs and client fix
3153aea Merge branch 'main' of https://github.com/allenai/pdelfin into main
9b8d58b Better stats and metadata
273a8b0 Logging fallback pages
b0acfa8 Adding support for fallback pages
204a4a8 Better stats
3ef4609 Fixing args
27d2352 Claude recommends httpx instead of aiohttp, seeing if that will help with straggler timeouts
4469f4b Version patch
9e2e09b More fixes
8793fc7 Adding more retries, and it was able to process more complicated books
2f55a3d fix
d4d4736 more gcs
e48d4be Fix
8c3b575 Gcs support better
9381bf8 docs
f287f24 Fixing a few stats things
e499413 Better work queue
04429b2 Basic work queue from claude
995b1d1 Fixes, mocking out queue into separate file
fcabb8e Handling more error cases
96984fc Fix a reliability issue
0af29f1 Adding page rotation
e2303f2 Running on l40s, fixing queue
68543d4 Adding stats
b4ca563 Decent set of todos for monday
2f1664f Stop everything on a Nan
eac3b10 allow weka from augusta through vpn
370dbba new build
9ce243e no weka on augusta
eefb045 Single cluster fix
2e1d0b6 Fix
748b095 Fix
80ba562 Fixing timeout situation
65763de Don't retry accessdenied errors
2c52664 Cleaner exit
77c82fd New version with aiohttp fixes
ae1e4bc More realistic results
770da2b Docker
bfe4211 Debugging timeout errors and other things
fd17652 Trying to make it faster
278422b Fixing one max context issue
62de9fe weka fix
9a1e82f Logging
fe0574c Cleanup code, s3 retries
2c7686f I think I have error handling better now
8217e49 Page calc
4eab90f Fixing bugs
b67d8e7 Fixing work queue population
827b77e Working on task groups
a58efea better logging
a9cf2e0 Allow setting beaker priority
41c8d55 exponential backoff
4dcf9ed more fixes
06331d7 Fix timeout
8e16780 Beaker stuff
4c3bf70 Beaker fixes
3172a1c Shuffling
fe3c9a2 Creds and other things
a3b6962 fix
83bb1dc Dockerfile fixes
6c9c785 Using version strings
9610eac Secrets management
39256c1 Beaker running
867e2c9 Docker builds
a091412 Starting to play with docker too
bce85e6 pipeline
a085e8c Beaker test
910c2eb Downloads from s3 based on hash
6598e2d Control http session at the worker level
fbacdd0 Stuff
ae9b1c4 Better stats
9ce28c0 Measuring metrics better now
193e521 Semaphore timeout
102c0e4 new version of sglang, server restarts, semaphore timeouts
918e2f3 Pipeline stuff
691cc5a A few items
4f2f4fd Quicker results by limited workers via semaphore while still utilizing gpu
6154095 Logging and perf stuff
ade3580 FIxes
732300a Some errors dealt with
24a9d23 Trying to get reliablity up
fedda40 Small fixes
a9a94f2 Code to get stats
6b625b2 Bugfixes
9fb464c Refactoring to assemble docs
da1b23f Minor fixes
9ff107b Merge branch 'main' of https://github.com/allenai/pdelfin into main
299819e Reqs
9d51935 some cleanups
6590164 Starting to work
82ec249 Progress
37dc412 Working on script
e5fb7c0 Organization
ee72b36 Starting up server and workers async now
a39350e Reworking to be async
a103ce7 Some small things
b15bff6 Work queue coallescing
57186c7 Doing some more stuff
923231e exit handlers
051a7b4 Prepping work script
a65e12b Model download stuff
12a91ff Starting on a new approach
faf8659 Putting aside redis
3d6be3c Work queue sharing thing
75d4a0e Experimental beaker pipeline self organizing redis idea
a14febc sglang support for runeval
592cc50 More docs
03f5b25 Docs good now
d89ea6b docs
0362ce6 docs
b2b3f06 docs
46ccab3 More docs
93d7068 More docs
73bd961 Logger fix
3778228 More docs
ef2e4d6 Adding more docs
5ebc8cd Checkfix
9f010e6 Add check for poppler installation
be8fb28 Update README.md
426fda1 Removing some logs
500bd2d flash attn
d45b34f Trust remote code
cda0ad7 Config typo
cf3b377 train script
8f001bf Config updates
6a4a55f Hopefully working molmo HF trainer config
bede854 Startng to write molmo formatters
e65747e Some better logging
a0e0917 Merge branch 'main' of https://github.com/allenai/pdelfin into main
43aa4f2 Proper selection of LORA weights
bcb4794 Starting on molmo changes
232c445 Pipeline stability fixes hopefully and logging
ce2e4ba Applying rotation corrections
08d51b7 Adding some rotation retry contrl
7678f31 Fixing some reliability issues with the pipeline script
45269fa Switching to logging vs prints
a3e7654 Update all docs at once
062abff Adding some skip logic
8e6d0c6 swtichin to orjson, some better json error handling
48a3aff Reindexing
f13d0a5 List configs to list
ffe470b Fix
180dde0 dataprep sampling tests
64041bd Allow sampling different anchor text lens
6a22900 Allow for sampling anchor and other params
999f64d Adding empty anchor support
f8c5aac Some cleanup
a1a4798 Some crazy idea I had to simplify futures and memory limits
f6ac591 vllm benchmarker
4047258 Fixing one old bug to make update_static atomic
38dc5a2 Refactored to have a more efficient batchwriter, and also not allow too many running futures
d99096e Adding vllm profile script for reference
0a5c506 index
7c78676 Fix pipeline bug with indexing
31becaf S2orc dataset extractor
302eee3 Yay matches between birr and hf
f44dbd1 Small fixes
a482271 train more steps
c9ac48b Try to save at the last second only
9d35d3c Birr tokenization test
77f0b9f help text
7dbcbc1 Birr tests that don't do anything but help me understand the universe
492a3f6 Adding parameters for taget image and anchor text sizes
1c8602c Removing rotation invalid ones to see what happens
dd4f967 Filter refactor
3ecbeae Trying save to s3 but with threaded saver
5ba78ed Fix
89fcff2 Fixing saving bug again
7d4cff5 Nice test for picking proper page in birrpipelie
a4d7620 Choosing proper page
529d51d Put LR back, need to save larger checkpoints to weka to prevent timeouts
e141c91 Try lora run higher LR
2826bca Yay all unit tests pass cleanly now too
124aaf5 Hmm, cant repro ...
