Ingesting PDFs and why Gemini 2.0 changes everything by serjester

54 Comments

  • cedws
    Posted February 5, 2025 at 6:40 pm

    90% accuracy, +/- 10%? What could that be useful for? That’s awfully low.

  • Havoc
    Posted February 5, 2025 at 7:03 pm

    Been toying with the Flash model. It's not the top model, but I think it'll see plenty of use because of the details. It wins on things other than topping the benchmarks:

    * Generous free tier

    * Huge context window

    * Lite version feels basically instant

    However

    * Lite model seems more prone to repeating itself / looping

    * Very confusing naming, e.g. {model}-latest worked for 1.5, but now it's {model}-001? The Lite version has a date appended, the non-Lite does not. Then there is exp and thinking exp… which has a date. Wut?

  • daemonologist
    Posted February 5, 2025 at 7:03 pm

    I wonder how this compares to open source models (which might be less accurate but even cheaper if self-hosted?), e.g. Llama 3.2. I'll see if I can run the benchmark.

    Also regarding the failure case in the footnote, I think Gemini actually got that right (or at least outperformed Reducto) – the original document seems to have what I call a "3D" table where the third axis is rows within each cell, and having multiple headers is probably the best approximation in Markdown.

  • fecal_henge
    Posted February 5, 2025 at 7:04 pm

    Is there an AI platform where I can paste a snip of a graph and it will generate an nth-order polynomial regression of the trace for me?

  • bt3
    Posted February 5, 2025 at 7:05 pm

    One major takeaway that matches my own investigation is that Gemini 2.0 still materially struggles with bounding boxes on digital content. Google has published[1] some great material on spatial understanding and bounding boxes on photography, but identifying sections of text or digital graphics like icons in a presentation is still very hit and miss.

    [1]: https://github.com/google-gemini/cookbook/blob/a916686f95f43…

  • scottydelta
    Posted February 5, 2025 at 7:07 pm

    This is what I am trying to figure out how to solve.

    My problem statement is:

    – Ingest PDFs, summarize, and extract important information.

    – Have some way to overlay the extracted information on the PDF in the UI.

    – Users can provide feedback on the overlaid info by accepting or rejecting the highlights as useful or not.

    – This info goes back into the model for reinforcement learning.

    Hoping to find something that can make this more manageable.

  • exabrial
    Posted February 5, 2025 at 7:10 pm

    You know what'd be fucking nice? The ability to turn Gemini off.

  • jfkrrorj
    Posted February 5, 2025 at 7:11 pm

    [flagged]

  • fngjdflmdflg
    Posted February 5, 2025 at 7:12 pm

    >Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes

    This is what I have found as well. From what I've read, LLMs do not handle fine details in images well because their image encoders are too lossy. (No idea if this is actually correct.) For now, I guess you can use regular OCR to get bounding boxes.
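    For illustration, a minimal sketch of that fallback with pytesseract (one common OCR wrapper; not necessarily what anyone here used): word-level boxes with per-word confidence come back directly.

        # Word-level bounding boxes from a traditional OCR engine (Tesseract
        # via pytesseract), each with its own confidence score.
        from PIL import Image
        import pytesseract
        from pytesseract import Output

        data = pytesseract.image_to_data(Image.open("page.png"), output_type=Output.DICT)
        boxes = [
            {
                "text": data["text"][i],
                "bbox": (data["left"][i], data["top"][i],
                         data["width"][i], data["height"][i]),
                "conf": data["conf"][i],
            }
            for i in range(len(data["text"]))
            if data["text"][i].strip()  # skip empty tokens
        ]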

  • coderstartup
    Posted February 5, 2025 at 7:17 pm

    Following this post

  • cubefox
    Posted February 5, 2025 at 7:19 pm

    Why is Gemini Flash so much cheaper than other models here?

  • lazypenguin
    Posted February 5, 2025 at 7:19 pm

    I work in fintech and we replaced an OCR vendor with Gemini for ingesting some PDFs. After trial and error with different models, Gemini won because it was so darn easy to use and it worked with minimal effort. One shouldn't underestimate how much ease of use a multi-modal, large-context-window model buys you.

    Ironically, this vendor is the best-known and most successful vendor for OCR'ing this specific type of PDF, but many of our requests failed over to their human-in-the-loop process. Despite this not being Gemini's specialization, switching was a no-brainer after our testing. Processing time went from something like 12 minutes on average to 6 seconds on average, accuracy was about 96% of the vendor's, and the price was significantly cheaper.

    A lot of the 4% inaccuracies are things like the text "LLC", handwritten, getting OCR'd as "IIC", which I would say is somewhat "fair". We could probably improve our prompt to clean up this data even further. Our prompt is currently very simple, "OCR this PDF into this format as specified by this json schema", and it didn't require some fancy "prompt engineering" to contort out a result.

    The Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. The weirdly large context window makes it easy to focus on the main problem. It's multi-modal, so it handles a lot of issues for you (PDF image vs. PDF with data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding-boxes part)!
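    For a sense of how little code this takes, a minimal sketch with the google-genai Python SDK; the schema, file name, and prompt here are placeholders, not the poster's actual setup:

        # Rough sketch: PDF in, schema-constrained JSON out.
        from google import genai
        from google.genai import types
        from pydantic import BaseModel

        class Invoice(BaseModel):  # toy schema standing in for the real one
            vendor_name: str
            total: float

        client = genai.Client(api_key="YOUR_API_KEY")
        with open("statement.pdf", "rb") as f:
            pdf_bytes = f.read()

        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=[
                types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
                "OCR this PDF into this format as specified by this json schema",
            ],
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                response_schema=Invoice,
            ),
        )
        print(response.text)  # JSON conforming to the schema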

  • ein0p
    Posted February 5, 2025 at 7:23 pm

    > Why Gemini 2.0 Changes Everything

    Clickbait. It doesn't change "everything". It makes ingestion for RAG much less expensive (and therefore feasible in a lot more scenarios), at the expense of ~7% reduction in accuracy. Accuracy is already rather poor even before this, however, with the top alternative clocking in at 0.9. Gemini 2.0 is 0.84, although the author seems to suggest that the failure modes are mostly around formatting rather than e.g. mis-recognition or hallucinations.

    TL;DR: Is this exciting? If you do RAG, yes. Does it "change everything"? Nope. There's still a very long way to go. Protip for model designers: accuracy is always in greater demand than performance. A slow model that solves the problem is invariably better than a fast one that fucks everything up.

  • ChrisArchitect
    Posted February 5, 2025 at 7:23 pm

    Related:

    Gemini 2.0 is now available to everyone

    https://news.ycombinator.com/item?id=42950454

  • cccybernetic
    Posted February 5, 2025 at 7:27 pm

    Shameless plug: I'm working on a startup in this space.

    But the bounding box problem hits close to home. We've found Unstructured's API gives pretty accurate box coordinates, and with some tweaks you can make them even better. The tricky part is implementing those tweaks without burning a hole in your wallet.

  • sho_hn
    Posted February 5, 2025 at 7:31 pm

    Remember all the hyperbole a year ago about how Google was failing and over?

  • pockmarked19
    Posted February 5, 2025 at 7:31 pm

    Now, I could look at this relatively popular post about Google and revise my opinion of HN as an echo chamber, but I’m afraid it’s just that the downvote loving HNers weren’t able to make the cognitive leap from Gemini to Google.

  • resource_waste
    Posted February 5, 2025 at 7:31 pm

    Google's models have historically been total disappointments compared to ChatGPT-4: worse quality, and they won't answer medical questions either.

    I suppose I'll try it again, for the 4th or 5th time.

    This time I'm not excited. I'm expecting it to be a letdown.

  • beklein
    Posted February 5, 2025 at 7:34 pm

    Great article, though I couldn't find any details about the prompt… only the snippets of the `CHUNKING_PROMPT` and the `GET_NODE_BOUNDING_BOXES_PROMPT`.

    Is there any code example with a full prompt available from OP, or are there any references (such as similar GitHub repos) for those looking to get started with this topic?

    Your insights would be highly appreciated.

  • nickandbro
    Posted February 5, 2025 at 7:41 pm

    I think very soon a new model will destroy whatever startups and services are built around document ingestion. As in, a model that can take in a PDF page as an image and transcribe it to text with near-perfect accuracy.

  • matthest
    Posted February 5, 2025 at 7:42 pm

    This is completely tangential, but does anyone know if AI is creating any new jobs?

    Thinking of the OCR vendors who get replaced. Where might they go?

    One thing I can think of is that AI could help the space industry take off. But wondering if there are any concrete examples of new jobs being created.

  • rjurney
    Posted February 5, 2025 at 7:44 pm

    I've been using NotebookLM powered by Gemini 2.0 for three projects and it is _really powerful_ for comprehending large corpuses you can't possibly read and thinking informed by all your sources. It has solid Q&A. When you ask a question or get a summary you like [which often happens] you can save it as a new note, putting it into the corpus for analysis. In this way your conclusions snowball. Yes, this experience actually happens and it is beautiful.

    I've tried Adobe Acrobat AI for this and it doesn't work yet. NotebookLM is it. The grounding is the reason it works – you can easily click on anything and it will take you to the source to verify it. My only gripe is that the visual display of the source material is _dogshit ugly_, like exceptionally so. Big blog pink background letters in lines of 24 characters! :) It has trouble displaying PDF columns, but at least it parses them. The ugly will change I'm sure :)

    My projects are setup to let me bridge the gaps between the various sources and synthesize something more. It helps to have a goal and organize your sources around that. If you aren't focused, it gets confused. You lay the groundwork in sources and it helps you reason. It works so well I feel _tender_ towards it :) Survey papers provide background then you add specific sources in your area of focus. You can write a profile for how you would like NotebookLM to think – which REALLY helps out.

    They are:

    * The Stratigrapher – a Lovecraftian short story about the world's first city:
      – All of Seton Lloyd/Fuad Safar's work on Eridu
      – Various sources on Sumerian culture and religion
      – All of Lovecraft's work and letters
      – Various sources about opium
      – Some articles about nonlinear geometries

    * FPGA Accelerated Graph Analytics:
      – An introduction to Verilog
      – Papers on FPGAs and graph analytics
      – Papers on Apache Spark architecture
      – Papers on GraphFrames and a related rant I created about it and graph DBs
      – A source on Spark-RAPIDS
      – Papers on subgraph matching, graphlets, network motifs
      – Papers on random graph models

    * A graph machine learning notebook without a specific goal, which has been less successful. It helps to have a goal for the project; it got confused by how broad my sources were.

    I would LOVE to share my projects with you all, but you can only share within a Google Workspaces domain. It will be AWESOME when they open this thing up :)

  • ratedgene
    Posted February 5, 2025 at 7:51 pm

    Is this something we can run locally? If so, what's the license?

  • __jl__
    Posted February 5, 2025 at 7:53 pm

    The numbers in the blog post seem VERY inaccurate.

    Quick calculation:
    Input pricing: Image input in 2.0 Flash is $0.0001935. Let's ignore the prompt.
    Output pricing: Let's assume 500 tokens per page, which is $0.0003

    Cost per page: $0.0004935

    That means 2,026 pages per dollar. Not 6,000!

    Might still be cheaper than many solutions, but I don't see where these numbers are coming from.

    By the way, image input is much more expensive in Gemini 2.0 even for 2.0 Flash Lite.

    Edit: The post says batch pricing, which would be ~4k pages per dollar based on my calculation. Using batch pricing is pretty different, though: great if feasible, but not practical in many contexts.
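    Spelling out the arithmetic (using the comment's assumed per-page prices, not official rates):

        input_cost_per_page = 0.0001935   # image input, 2.0 Flash (prompt ignored)
        output_cost_per_page = 0.0003     # ~500 output tokens per page

        cost_per_page = input_cost_per_page + output_cost_per_page  # $0.0004935
        print(1 / cost_per_page)          # ~2026 pages per dollar
        print(1 / (cost_per_page / 2))    # ~4052 pages per dollar at the 50% batch discount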

  • nothrowaways
    Posted February 5, 2025 at 7:55 pm

    Cool

  • anirudhb99
    Posted February 5, 2025 at 7:55 pm

    thanks a ton for all the amazing feedback on this thread! if

    (a) you have document understanding use cases that you'd like to use gemini for (the more aspirational the better) and/or

    (b) there are loss cases for which gemini doesn't work well today,

    please feel free to email anirudhbaddepu@google.com and we'd love to help get your use case working & improve quality for our next series of model updates!

  • raunakchowdhuri
    Posted February 5, 2025 at 8:02 pm

    CTO of Reducto here. Love this writeup!

    We’ve generally found that Gemini 2.0 is a great model and have tested this (and nearly every VLM) very extensively.

    A big part of our research focus is incorporating the best of what new VLMs offer without losing the benefits and reliability of traditional CV models. A simple example of this is we’ve found bounding box based attribution to be a non-negotiable for many of our current customers. Citing the specific region in a document where an answer came from becomes (in our opinion) even MORE important when using large vision models in the loop, as there is a continued risk of hallucination.

    Whether that matters in your product is ultimately use case dependent, but the more important challenge for us has been reliability in outputs. RD-TableBench currently uses a single table image on a page, but when testing with real world dense pages we find that VLMs deviate more. Sometimes that involves minor edits (summarizing a sentence but preserving meaning), but sometimes it’s a more serious case such as hallucinating large sets of content.

    The more extreme case is that internally we fine tuned a version of Gemini 1.5 along with base Gemini 2.0, specifically for checkbox extraction. We found that even with a broad distribution of checkbox data we couldn’t prevent frequent checkbox hallucination on both the flash (+17% error rate) and pro model (+8% error rate). Our customers in industries like healthcare expect us to get it right, out of the box, deterministically, and our team’s directive is to get as close as we can to that ideal state.

    We think that the ideal state involves a combination of the two. The flexibility that VLMs provide, for example with cases like handwriting, is what I think will make it possible to go from 80 or 90 percent accuracy to some number very close to 99%. I should note that the Reducto performance for table extraction is with our pre-VLM table-parsing pipeline, and we'll have more to share in terms of updates there soon.

    For now, our focus is entirely on the performance frontier (though we do scale costs down with volume). In the longer term, as inference becomes more efficient, we want to move the needle on cost as well.

    Overall though, I’m very excited about the progress here.


    One small comment on your footnote: the evaluation script with the Needleman-Wunsch algorithm doesn't actually consider the headers outputted by the models and looks only at the table structure itself.

  • jbarrow
    Posted February 5, 2025 at 8:02 pm

    > Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes

    Qwen2.5 VL was trained on a special HTML format for doing OCR with bounding boxes. [1] The resulting boxes aren't quite as accurate as something like Textract/Surya, but I've found they're much more accurate than Gemini or any other LLM.

    [1] https://qwenlm.github.io/blog/qwen2.5-vl/
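    A minimal sketch of running Qwen2.5 VL through Hugging Face transformers; the document-parsing prompt wording here is a guess, see the linked blog for the exact format:

        from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
        from qwen_vl_utils import process_vision_info

        model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_id, torch_dtype="auto", device_map="auto"
        )
        processor = AutoProcessor.from_pretrained(model_id)

        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "image": "page.png"},
                # Prompt wording is an assumption, not Qwen's exact format:
                {"type": "text", "text": "Parse this document into HTML with bounding boxes."},
            ],
        }]
        text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        images, videos = process_vision_info(messages)
        inputs = processor(text=[text], images=images, videos=videos,
                           padding=True, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=2048)
        print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                     skip_special_tokens=True)[0])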

  • xena
    Posted February 5, 2025 at 8:05 pm

    I really wish that Google made an endpoint that's compatible with the OpenAI API. That'd make trying Gemini in existing flows so much easier.

  • devmor
    Posted February 5, 2025 at 8:12 pm

    I think this is one of the few functional applications of LLMs that is really undeniably useful.

    OCR has always been “untrustworthy” (as in, you cannot expect it to be 100% correct and you know you must account for that), and we have long used ML algorithms for the process.

  • bambax
    Posted February 5, 2025 at 8:20 pm

    I'm building a system that does regular OCR and outputs layout-following ASCII; in my admittedly limited tests it works better than most existing offerings.

    It will be ready for beta testing this week or the next, and I will be looking for beta testers; if interested please contact me!

  • siquick
    Posted February 5, 2025 at 8:23 pm

    Strange that LlamaParse is mentioned in the pricing table but not the results. We’ve used them to process a lot of pages and it’s been excellent each time.

  • zoogeny
    Posted February 5, 2025 at 8:30 pm

    Orthogonal to this post, but this just highlights the need for a more machine readable PDF alternative.

    I get the inertia of the whole world being on PDF. And perhaps we can just eat the cost and let LLMs suffer the burden going forwards. But why not use that LLM coding brain power to create a better overall format?

    I mean, do we really see printing things out onto paper as something we need to worry about for the next 100 years? It reminds me of the TTY interface at the heart of Linux. There was a time it all made sense, but can we just deprecate it all now?

  • jibuai
    Posted February 5, 2025 at 8:40 pm

    I've been working on something similar for the past couple of months. A few thoughts:

    – A lot of natural chunk boundaries span multiple pages, so you need some 'sliding window' mechanism for the best accuracy (see the sketch after this list).

    – Passing the entire document hurts throughput too much due to the quadratic complexity of attention. Outputs are also much worse when you use too much context.

    – Bounding boxes can be solved by first generating boxes using traditional OCR / layout recognition, then passing that data to the LLM. The LLM can then link its outputs to the boxes. Unfortunately, getting this reliable required a custom sampler, so proprietary models like Gemini are out of the question.
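    A toy sketch of the sliding-window idea from the first point (window size and overlap are arbitrary):

        # Overlapping page windows, so a chunk that spans a page boundary
        # lands fully inside at least one window.
        def page_windows(pages, window=4, overlap=1):
            step = window - overlap
            for start in range(0, len(pages), step):
                yield pages[start:start + window]
                if start + window >= len(pages):
                    break

        # e.g. 10 pages with window=4, overlap=1 -> pages 0-3, 3-6, 6-9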

  • dasl
    Posted February 5, 2025 at 8:49 pm

    How does the Gemini OCR perform against non-English language text?

  • an_aparallel
    Posted February 5, 2025 at 8:49 pm

    Has anyone in the AEC industry who's reading this worked out a good way to get Bluebeam MEP and electrical layouts into Revit (LOD 200-300)?

    Have seen MarkupX as a paid option, but it seems some AI in the loop could greatly speed up exception handling and encode family placement at certain elevations based on building code docs…

  • sensecall
    Posted February 5, 2025 at 9:02 pm

    This is super interesting.

    Would this be suitable for ingesting and parsing wildly variable unstructured data into a structured schema?

  • kbyatnal
    Posted February 5, 2025 at 9:14 pm

    It's clear that OCR & document parsing are going to be swallowed up by these multimodal models. The best representation of a document at the end of the day is an image.

    I founded a doc processing company [1] and in our experience, a lot of the difficulty w/ deploying document processing into production is when accuracy requirements are high (> 97%). This is because OCR and parsing is only one part of the problem, and real world use cases need to bridge the gap between raw outputs and production-ready data.

    This requires things like:

    – state-of-the-art parsing powered by VLMs and OCR

    – multi-step extraction powered by semantic chunking, bounding boxes, and citations

    – processing modes for document parsing, classification, extraction, and splitting (e.g. long documents, or multi-document packages)

    – tooling that lets nontechnical members quickly iterate, review results, and improve accuracy

    – evaluation and benchmarking tools

    – fine-tuning pipelines that turn reviewed corrections into custom models

    Very excited to test and benchmark Gemini 2.0 in our product, and very excited about the progress here.

    [1] https://extend.app/

  • ThinkBeat
    Posted February 5, 2025 at 9:17 pm

    Hmm, I have been doing a bit of this manually lately for a personal project. I am working on some old books that are far past any copyright, but they are not available anywhere on the net (being in Norwegian makes a book a lot more obscure), so I have been working on creating ebooks out of them.

    I have a scanner and some OCR processes I run things through. I am close to 85% from my automatic process. The pain of going from 85% to 99%, though, is considerable (and in my case manual; well, Perl helps).

    I went to try this AI on one of the short poem manuscripts I have. I told the prompt I wanted PDF to Markdown; it said sure, go ahead, give me the PDF. I uploaded it. It spent a long time spinning, then a quick message came up, something like "Failed to count tokens", but it just flashed and went away.

    I guess the PDF is too big? Weird though, it's not a lot of pages.

  • rudolph9
    Posted February 5, 2025 at 9:25 pm

    We parse millions of PDFs using Apache Tika and process about 30,000 per dollar of compute cost. However, the structured output leaves something to be desired, and there are a significant number of pages that Tika is unable to parse.

    https://tika.apache.org/

  • nottorp
    Posted February 5, 2025 at 9:27 pm

    Will 2.0.1 also change everything?

    How about 2.0.2?

    How about Llama 13.4.0.1?

    This is tiring. It's always the end of the world when they release a new version of some LLM.

  • llm_trw
    Posted February 5, 2025 at 9:27 pm

    This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

    You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.

    You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.

    You then feed each box of text to a regular OCR model, which also gives you a confidence score along with each prediction it makes.

    You feed each image box into a multimodal model to describe what the image is about.

    For tables, use a specialist model that does nothing but extract tables – models like GridFormer that aren't hyped to hell and back.

    You then stitch everything together in an XML file because Markdown is for human consumption.

    You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.

    You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.

    You then get chunking with location data and confidence scores for every part of the document to put as metadata into the RAG store.

    I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.
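    In outline, a sketch of that pipeline; layout_model, ocr_model, table_model, and vlm are placeholder interfaces standing in for real models, not actual libraries:

        import xml.etree.ElementTree as ET

        def process_page(page_image, layout_model, ocr_model, table_model, vlm):
            root = ET.Element("page")
            for box in layout_model.detect(page_image):  # boxes + confidence for free
                el = ET.SubElement(root, box.label,
                                   bbox=",".join(map(str, box.xyxy)),
                                   conf=f"{box.score:.3f}")
                crop = page_image.crop(box.xyxy)
                if box.label == "figure":
                    el.text = vlm.describe(crop)          # multimodal description only
                elif box.label == "table":
                    el.append(table_model.extract(crop))  # e.g. a GridFormer-style model
                else:
                    text, confs = ocr_model.read(crop)    # per-character confidences
                    el.text = text
                    el.set("min_char_conf", f"{min(confs):.3f}")
            # Flat XML with probability metadata, ready to feed selectively
            # into an LLM for _text_ processing.
            return ET.tostring(root, encoding="unicode")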

  • oedemis
    Posted February 5, 2025 at 9:29 pm

    There is also https://ds4sd.github.io/docling/ from IBM Research, which is MIT-licensed and tracks bounding boxes in a rich JSON format.

  • sergiotapia
    Posted February 5, 2025 at 9:39 pm

    The article mentions OCR, but you're sending a PDF, so how is that OCR? Or is this a mistake? What if you send photos of the pages? That would be true OCR. Does the performance and price remain the same?

    If so, this unlocks a massive workflow for us.

  • xnx
    Posted February 5, 2025 at 9:42 pm

    Glad Gemini is getting some attention. Using it is like a superpower. There are so many discussions about ChatGPT, Claude, DeepSeek, Llama, etc. that don't even mention Gemini.

  • roywashere
    Posted February 5, 2025 at 9:53 pm

    I think it is very ironic that in many fields we chose PDF to archive data because it is a standard and because we would be able to open our PDF documents in 50 or 100 years' time. So here we are, just a couple of years later, already facing the challenge of getting the data out of our stupid PDF documents!

  • gapeslape
    Posted February 5, 2025 at 10:31 pm

    In my mind, Gemini 2.0 changes everything because of the incredibly long context (2M tokens on some models), while having strong reasoning capabilities.

    We are working on a compliance solution (https://fx-lex.com), and RAG just doesn't cut it for our use case. Legislation cannot be chunked if you want the model to reason well about it.

    It’s magical to be able to just throw everything into the model. And the best thing is that we automatically benefit from future model improvements along all performance axes.

  • mateuszbuda
    Posted February 5, 2025 at 10:39 pm

    There’s AWS Bedrock Knowledge Base (Amazon's proprietary RAG solution), which can digest PDFs and, as far as I've tested it on real-world documents, works pretty well and is cost-effective.

  • erulabs
    Posted February 5, 2025 at 10:40 pm

    Hrm, I've been using a combo of Textract (for bounding boxes) and AI for understanding the contents of the document. Textract is excellent at bounding boxes and exact-text capture, but LLMs are excellent at understanding when a messy/ugly bit of a form is actually one question, or whether there are duplicate questions, etc.

    Correlating the two outputs (Textract <-> AI) is difficult, but another round of AI is usually good at that. Combined with some text-difference scoring and logic, I can get pretty good full-document understanding of questions and answer locations. I've spent a pretty absurd amount of time on this and as of yet have not launched a product with it, but if anyone is interested I'd love to chat about the pipeline!
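    One cheap way to do that correlation is fuzzy string matching; a sketch with Python's stdlib difflib (the 0.8 threshold is arbitrary):

        # Link an LLM-extracted string back to the Textract LINE block (and
        # its geometry) it most plausibly came from.
        from difflib import SequenceMatcher

        def best_textract_match(llm_text, lines, min_ratio=0.8):
            # lines: [(text, geometry), ...] from Textract LINE blocks
            if not lines:
                return None
            scored = [
                (SequenceMatcher(None, llm_text.lower(), text.lower()).ratio(), geom)
                for text, geom in lines
            ]
            ratio, geom = max(scored, key=lambda s: s[0])
            return geom if ratio >= min_ratio else None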

  • pbronez
    Posted February 5, 2025 at 10:46 pm

    Wonder how this compares to Docling. So far that's been the only tool that really unlocked PDFs for me. It's solid but really annoying to install.

    https://ds4sd.github.io/docling/

  • ritvikpandey21
    Posted February 5, 2025 at 11:09 pm

    ritvik here from pulse. everyone's pretty much made the right points here, but I wanted to emphasize that due to the LLM architecture, these models predict "the most probable text string" that corresponds to the embedding, not necessarily the exact text. this non-determinism is awful for customers deploying in production, and a lot of our customers complained about it to us initially. the best approach is to build a sort-of "agent"-based VLM x traditional layout segmentation/reading-order pipeline, which is what we've done and are continuing to do.

    we have a technical blog on this exact phenomenon coming out in the next couple of days, will attach it here when it's out!

    check us out at https://www.runpulse.com

  • twelve40
    Posted February 5, 2025 at 11:11 pm

    Isn't it amazing how one company invented a universally spread format that takes structured data from an editor (except images obviously) and converts it into a completely fucked-up unstructured form that then requires expensive voodoo magic to convert back into structured data.

  • throw7381
    Posted February 5, 2025 at 11:22 pm

    For data extraction from long documents (100k+ tokens), how do structured outputs via a provided JSON schema compare to asking one question per field (in natural language)?

    Also, I've been hearing good things about document retrieval with Gemini 1.5 Pro, 2.0 Flash, and gemini-exp-1206 (the new 2.0 Pro?). Which is the best Gemini model for data extraction from 100k tokens?

    How do they compare against Claude Sonnet 3.5 or the OpenAI models? Has anyone done any real-world tests?

  • lifeisstillgood
    Posted February 5, 2025 at 11:37 pm

    Imagine there's no PostScript

    It's easy if you try

    No pdfs below us

    Above us only SQL

    Imagine all the people livin' for CSV
