Using other API to translate PDF documents

hwaon · 11-01-2022 11:17 PM

Hello all,

I am trying to find a way to translate English PDF documents into a target language (Korean) without messing up the original PDF page format (pictures, headers, tables, etc.).

The only problem with the default Google translation is that many of the words in the document are very industry-specific and need to be translated accordingly through AutoML Translation.

However, we'd like to use our own language model (i.e. a fine-tuned GPT-3) to translate just the text, then feed the translated text to the output stream to get the final PDF output.

I have yet to see any other company that maintains PDF formatting as well as Google does while translating, so I'd really like to use the Cloud Translation API with our own translation module for optimal accuracy.

Is there a way to do this? I've tried reaching out to the local Google branch to no avail. Please help!

ErnestoC

It is possible to use a glossary in Cloud Translation to provide the API with custom translations for terms that appear in texts. This would help when industry-specific terminology needs to be translated in a specific way.
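As a rough illustration of the glossary approach, here is a minimal sketch in Python. It assumes a unidirectional English-to-Korean glossary stored as a two-column CSV file (source term, target term), and it only builds the glossary text and a request dictionary. The helper names, the example terms, and the project, location, and glossary IDs are all hypothetical, and the request field names should be checked against the current Cloud Translation v3 documentation before use.

```python
import csv
import io


def build_glossary_csv(term_pairs):
    """Return CSV text with one 'source,target' row per term pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for source_term, target_term in term_pairs:
        writer.writerow([source_term, target_term])
    return buffer.getvalue()


def build_translate_document_request(project_id, location, glossary_id, pdf_bytes):
    """Build a request dict for a document translation that uses a glossary.

    Assumption: the field names mirror the v3 TranslateDocumentRequest
    message; verify them against the official reference.
    """
    parent = f"projects/{project_id}/locations/{location}"
    return {
        "parent": parent,
        "source_language_code": "en",
        "target_language_code": "ko",
        "document_input_config": {
            "content": pdf_bytes,
            "mime_type": "application/pdf",
        },
        "glossary_config": {
            "glossary": f"{parent}/glossaries/{glossary_id}",
        },
    }


# Hypothetical industry terms, used only to show the file shape.
example_terms = [("flange", "플랜지"), ("gasket", "개스킷")]
glossary_csv = build_glossary_csv(example_terms)
```

The CSV text would be uploaded to a Cloud Storage bucket and registered as a glossary first. The request dictionary could then be passed to the client library's document translation call (in the Python client this is assumed to be `TranslationServiceClient().translate_document(request=...)`), which needs credentials and a real project to run.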

As for using a custom language model, you can file a Feature Request for the Cloud Translation API in Google’s public issue tracker.

Subhashini

Is there any solution for doing this by fine-tuning LLMs?

How would translation with fine-tuning techniques work, how should the training datasets be created, and which LLM suits this use case best?

How can we make sure the input styles and format stay the same, so that only the text is translated and placed in exactly the same position in the output file?
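To make the idea in this thread concrete (translate only the text and put it back where it was), here is a minimal sketch. It is not a working PDF pipeline: it assumes some PDF library has already extracted text blocks with their page numbers and bounding boxes, and the `TextBlock` type, `translate_blocks` function, and stand-in translator are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class TextBlock:
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) on the page
    text: str


def translate_blocks(
    blocks: List[TextBlock], translate: Callable[[str], str]
) -> List[TextBlock]:
    """Translate each block's text while keeping its page and position.

    `translate` is any callable, e.g. a wrapper around a fine-tuned model.
    Blocks that contain only whitespace are returned unchanged.
    """
    return [
        replace(block, text=translate(block.text)) if block.text.strip() else block
        for block in blocks
    ]


# Stand-in translator for the example; a real one would call the model.
fake_dictionary = {"Safety valve": "안전 밸브"}
blocks = [
    TextBlock(page=1, bbox=(72.0, 90.0, 300.0, 110.0), text="Safety valve"),
    TextBlock(page=1, bbox=(72.0, 120.0, 300.0, 140.0), text="   "),
]
translated = translate_blocks(blocks, lambda s: fake_dictionary.get(s, s))
```

The translated blocks keep the original `page` and `bbox`, so a separate rendering step could draw each one back at the same location. Text that grows or shrinks after translation, fonts, and tables are not handled here.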