We had a somewhat similar setup here. I used Google Cloud & GlideAI to transcribe the image’s text into a JSON structure I had pre-defined, then queried that JSON to get the fields I needed.
Now that OpenAI’s vision model is available, I think this should be easier to do.
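
For what it's worth, here is a rough sketch of the "query the JSON" step in Python. It only covers the part after transcription: the JSON string stands in for whatever the transcription service returns, and the field names (`invoice_number`, `date`, `total`) and the `extract_fields` helper are made up for illustration, not taken from my actual setup.

```python
import json

# Hypothetical pre-defined schema: these field names are assumptions for illustration.
REQUIRED_FIELDS = ("invoice_number", "date", "total")


def extract_fields(transcription_json: str, fields=REQUIRED_FIELDS) -> dict:
    """Parse the transcription output and return only the requested fields.

    Fields missing from the transcription come back as None, so the caller
    can decide whether to retry, flag the image, or fall back to a default.
    """
    data = json.loads(transcription_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {name: data.get(name) for name in fields}


# Stand-in for what the transcription step might return for one image.
sample = '{"invoice_number": "INV-001", "date": "2024-01-15", "vendor": "Acme"}'
result = extract_fields(sample)
print(result)
```

Returning None for missing fields, rather than raising, makes it easy to spot images where the transcription missed something without stopping the whole batch.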