File Processor Module
Users can request the processing of whole files rather than plain text chunks. The File Processor Module implements the logic needed to handle the most common file types.
Users access the File Processor implicitly whenever they use one of the supplied interfaces that deal with files:
- The generic API, through the processfile endpoint
- The Java application PGB
- The web application PGWeb
Main Functions
Users of any of the supplied interfaces must enter their assigned credentials (APIKey), which determine the list of processing engines they can use with their documents.
An engine has to be selected before a user can request a document to be translated. Once an engine is selected, users submit a file for translation by dragging and dropping it into the PGB client.
The general workflow to process the files is as follows:
1. The file is received at the Production Access Server, assigned a unique id and queued for processing in the File Processor.
2. The file is analysed to verify its format, and the text segments are extracted into a temporary file (XLIFF format).
3. The segments are sent to the Production Access Server to be routed to the right engine.
4. Dockerized engines are started on demand to process the segments.
5. The processed texts are sent back to the File Processor, which in turn updates the temporary file with the bilingual information.
6. When all text has been processed, the output is merged into the original document, keeping its layout and format.
7. Once the new, processed file has been created, the client interface can be used again to download it to the user's machine.
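The steps above can be pictured with a small, purely illustrative sketch. It is not the File Processor's actual code: the function names, the line-based extraction, the in-process `fake_engine` and the batch size of 32 are all assumptions made for the example, and the real system works on XLIFF files and remote Dockerized engines.

```python
import uuid


def fake_engine(segment):
    # Hypothetical stand-in for a processing engine (steps 3-4).
    return segment.upper()


def process_file(text, engine, batch_size=32):
    """Toy version of the workflow: id, extract, process in batches, merge."""
    job_id = uuid.uuid4().hex                      # step 1: unique id for the job
    # Step 2 (simplified): one segment per non-empty line.
    segments = [s for s in text.splitlines() if s.strip()]
    pairs = []
    for start in range(0, len(segments), batch_size):
        batch = segments[start:start + batch_size]
        results = [engine(s) for s in batch]       # steps 3-4: send batch to the engine
        pairs.extend(zip(batch, results))          # step 5: keep source/target pairs
    # Step 6 (simplified): rebuild the output from the processed segments.
    output = "\n".join(target for _, target in pairs)
    return job_id, pairs, output
```

The list of source/target pairs plays the role of the temporary bilingual file. The merge here simply joins the processed lines and does not preserve blank lines or layout, which the real merge step does.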
Compatible Formats
The service currently accepts the following file formats:
- txt: plain text files. The text to process is segmented at full stops and at new-line characters.
- rtf: the format is preserved.
- MS Office family: only the newer file formats are accepted (docx, pptx and xlsx).
- Open Office family: all formats are accepted.
- Microsoft Mail: msg files are accepted, but the format is not preserved after processing and plain text is returned.
- PDF files: two options are available in the current version. With the first option, only “searchable” PDFs are accepted; the PDF is converted to text and processed, and plain text is returned with no layout preserved. With the second option, the PDF is OCR’d and, whatever its contents, the detected text is processed and returned preserving format and layout.
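For the txt format, the segmentation rule described in the list can be approximated as follows. This is a minimal sketch under the assumption that a segment ends at a full stop followed by whitespace or at a line break; the function name is hypothetical, and the actual segmenter may handle abbreviations and other punctuation differently.

```python
import re


def segment_plain_text(text):
    """Split text into segments at new lines and after full stops."""
    segments = []
    for line in text.splitlines():
        # Split on whitespace that directly follows a full stop,
        # keeping the full stop attached to its sentence.
        for piece in re.split(r"(?<=\.)\s+", line):
            piece = piece.strip()
            if piece:
                segments.append(piece)
    return segments
```

Note that such a naive rule also splits after abbreviations such as "e.g.", so it is only a starting point.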
When translating a document, it is often useful to define in advance which terms and expressions must be kept invariant, preserving the original wording, and which require a specific (custom) translation. This is usually the case for brand names and brand terminology, proper names, jargon, etc. The PGB client (the Java client application running on the user’s machine) and the generic API allow users to upload one or more ‘glossary’ files, which define the terms and expressions that need custom translation. When requesting the translation of a document, the user can choose which glossary file to use, if any. Glossary files are plain text files that use the TAB character as separator.
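A TAB-separated glossary can be read with a few lines of Python. The sketch below assumes a two-column layout (source term, then the required translation), with an invariant term simply mapped to itself; the column layout and the function name are assumptions, since only the TAB separator is specified here.

```python
import csv


def load_glossary(lines):
    """Build a {source term: target term} mapping from TAB-separated lines."""
    glossary = {}
    # QUOTE_NONE treats quote characters as ordinary text in the terms.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed lines
        glossary[row[0].strip()] = row[1].strip()
    return glossary
```

Any iterable of lines works as input, for example a file opened with `open(path, encoding="utf-8", newline="")`.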
System Architecture
The File Processor is a multi-process Python application that requires Python 3.5+ and the packages listed in the standard requirements.txt file.
The extraction and merge subprocesses, as well as several format conversions, require a number of Java JAR libraries. The full set is distributed with the process server package.
The server is configured through a config.ini file. Only the relevant fields are shown:
directorMaxConcurrentProcess = 5
mysqlHost = localhost
mysqlUser = root
mysqlPassword = password
mysqlDb = relay
relayMaxWorker = 1
relayMaxSnippetsPerRequest = 32
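As an illustration of how such settings might be consumed, the sketch below parses the excerpt with the standard `configparser` module. Because the excerpt shows no section header, the example prepends a placeholder `[server]` section, which is an assumption, as are the function names and the reading of relayMaxSnippetsPerRequest as the maximum number of segments sent in one request.

```python
import configparser

# The fields shown in the excerpt above.
EXCERPT = """\
directorMaxConcurrentProcess = 5
mysqlHost = localhost
mysqlUser = root
mysqlPassword = password
mysqlDb = relay
relayMaxWorker = 1
relayMaxSnippetsPerRequest = 32
"""


def load_settings(text, section="server"):
    """Parse the excerpt; the section name is a placeholder, not the real one."""
    parser = configparser.ConfigParser()
    parser.read_string("[{}]\n{}".format(section, text))
    cfg = parser[section]  # option lookups are case-insensitive by default
    return {
        "max_concurrent": cfg.getint("directorMaxConcurrentProcess"),
        "mysql_host": cfg.get("mysqlHost"),
        "mysql_db": cfg.get("mysqlDb"),
        "relay_workers": cfg.getint("relayMaxWorker"),
        "snippets_per_request": cfg.getint("relayMaxSnippetsPerRequest"),
    }


def batches(segments, size):
    """Yield consecutive groups of at most `size` segments."""
    for start in range(0, len(segments), size):
        yield segments[start:start + size]
```

Under that reading, a document with 70 segments would be sent in three requests of 32, 32 and 6 segments.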