When the Epstein files were published a few weeks back, they included a couple of directories of raw text files and 12 directories containing a total of 23,000 one-page JPEGs of evidence. I wrote a bit of Python to OCR all of these using the Gemini API: genai.GenerativeModel(model_name='gemini-2.5-flash') with a simple prompt to read the text from each JPEG, writing the output to a .txt file of the same base name.
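For illustration only, here is a minimal sketch of that kind of batch loop; it is not the original script. The names ocr_directory and transcribe are hypothetical: transcribe stands in for whatever function sends one JPEG to the model and returns its text, and it is assumed to raise ValueError when the model returns nothing usable.

```python
from pathlib import Path
from typing import Callable, List


def ocr_directory(src_dir: str, transcribe: Callable[[Path], str]) -> List[Path]:
    """Write one .txt per .jpg in src_dir and return the pages that failed.

    `transcribe` is a hypothetical callable (not part of any library) that
    takes the path of one JPEG and returns the text read from it.
    """
    failed: List[Path] = []
    for jpg in sorted(Path(src_dir).glob("*.jpg")):
        out = jpg.with_suffix(".txt")  # same base name as the JPEG
        if out.exists():
            continue  # already transcribed on an earlier run
        try:
            text = transcribe(jpg)
        except ValueError:
            # Assumption: the model call raises ValueError on an empty response.
            failed.append(jpg)
            continue
        out.write_text(text, encoding="utf-8")
    return failed
```

Collecting the failed paths rather than stopping at the first error lets the run finish and leaves a list of pages to look at by hand.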

In about 2% of cases (444 files), this failed with:

ValueError: Invalid operation: The response.text quick accessor requires the response to contain a valid Part, but none were returned. The candidate's finish_reason is 4. Meaning that the model was reciting from copyrighted material.
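The exception comes from the response.text shortcut, so one way to keep a batch running is to inspect the candidate directly instead. The sketch below is a hedged illustration, not a fix for the refusal itself: the helper name extract_text is hypothetical, and the attribute names (candidates, finish_reason, content.parts, text) are my understanding of the response object in the google-generativeai package and should be checked against the installed version.

```python
from typing import Optional, Tuple

# finish_reason value quoted in the error message above
RECITATION = 4


def extract_text(response) -> Tuple[Optional[str], Optional[int]]:
    """Return (text, finish_reason) without using the response.text accessor.

    Assumes a response shaped like response.candidates[0].content.parts[i].text,
    with finish_reason convertible to int. Returns (None, reason) when the
    candidate carries no text, and (None, None) when there is no candidate.
    """
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return None, None
    candidate = candidates[0]
    reason = int(candidate.finish_reason)
    content = getattr(candidate, "content", None)
    parts = getattr(content, "parts", None) or []
    text = "".join(getattr(part, "text", "") or "" for part in parts)
    return (text or None), reason
```

A caller can then compare the returned reason with RECITATION and log the file name for later review. This only records which pages were blocked and why; it does not make the model return their text.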

One was a front page from The New York Times, but the others I sampled don’t contain anything that looks copyrighted; they are often just emails with some redacted (and missing) content.

Given this is publicly published information, is there any legitimate way of getting the text transcribed?
