Maintain formatting while reading PDF document

Question

b on 15 Jan 2024

https://it.mathworks.com/matlabcentral/answers/2069916-maintain-formatting-while-reading-pdf-document

Commented: Steven on 7 Jun 2024

Hello, while reading a PDF document, I want to keep the formatting as it is: bold should stay bold and italic should stay italic. I tried this with extractFileText, but without success. How can this be done? Thanks.

3 Comments

b on 19 Jan 2024

Assume that all text inside double stars is bold. I converted the PDF with pdf2word and am reading the resulting Word document using option 2 of the five options Shah provided, which is the easiest and the one I prefer. Suppose the Word document contains:

first word - this line contains first word which appears in between a sentence.

Now I want to find each bold instance that matches whatever occurs before the hyphen (the string 'first word' in this case) and replace it with three consecutive underscores (___), so that the output is:

first word - this line contains ___ which appears in between a sentence.

This would be easy to do manually, but there are thousands of such replacements to make in the document.

Thanks.

b on 19 Jan 2024

Forgot to mention that I necessarily have to use MATLAB for this. Otherwise it would be ridiculously easy with Word's Find and Replace.
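
The replacement described above could be sketched in MATLAB roughly as follows. This is only a sketch under stated assumptions: the bold markers are assumed to survive as literal double stars in the text that reaches MATLAB, each entry is assumed to separate the headword from its sentence with " - ", and the sample string and variable names are hypothetical.

```matlab
% Hypothetical input; in practice this would be the text read from the
% converted document, with bold runs marked as **...** (an assumption).
txt = "first word - this line contains **first word** which appears in between a sentence.";

docLines = splitlines(txt);
for k = 1:numel(docLines)
    if ~contains(docLines(k), " - ")
        continue                      % no headword separator on this line
    end
    rawHead = extractBefore(docLines(k), " - ");   % text before the first " - "
    rest    = extractAfter(docLines(k), " - ");    % text after the first " - "
    head    = strip(erase(rawHead, "**"));         % headword without bold markers
    if strlength(head) == 0
        continue                      % nothing to match against
    end
    % Replace only bold instances of the headword, i.e. **headword**.
    rest = replace(rest, "**" + head + "**", "___");
    docLines(k) = rawHead + " - " + rest;
end

result = join(docLines, newline);
disp(result)
```

For the sample line, this should print the desired output shown in the comment above. The match is case-sensitive and exact, so variants of the headword (different capitalization, extra spaces inside the stars) would need additional handling.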

Answer 1

akshatsood on 15 Jan 2024

https://it.mathworks.com/matlabcentral/answers/2069916-maintain-formatting-while-reading-pdf-document#answer_1389576

Hi @b,

I understand that you want to maintain formatting while reading a PDF document. To extract text from a PDF document while preserving formatting such as bold and italic, you would typically need a more advanced PDF processing tool or library that supports rich text extraction. MATLAB's built-in "extractFileText" function does not preserve text formatting, as it is designed to extract plain text.
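
As a minimal illustration of that point (assuming Text Analytics Toolbox is installed; "document.pdf" is a hypothetical file name), the function returns a single string with no font or style information attached:

```matlab
% Requires Text Analytics Toolbox; "document.pdf" is a hypothetical file name.
str = extractFileText("document.pdf");

class(str)        % 'string': only the text content is returned
strlength(str)    % number of characters extracted
```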

I hope this helps.

Answer 2

Hassaan on 15 Jan 2024

https://it.mathworks.com/matlabcentral/answers/2069916-maintain-formatting-while-reading-pdf-document#answer_1389636

If you want to preserve the formatting, MATLAB itself does not provide built-in functions to directly extract formatted text from PDFs, as this requires interpretation of the PDF content stream which can be quite complex due to the nature of PDF formatting.

1. External Tools: Use an external tool designed for PDF text extraction that preserves formatting. There are several tools available that can extract text with formatting from PDFs, such as Adobe Acrobat's SDK or other third-party libraries. You can call these tools from MATLAB using the system function or other interfacing methods depending on the tool.
2. PDF to Word: Convert the PDF to a Word document (which preserves formatting) using an external tool or online service, and then use MATLAB to read the Word document using functions from the Text Analytics Toolbox.
3. Manual Inspection: If you only have a few documents and you're looking for specific formatted text, you might manually inspect the PDF file for the markup of bold and italic text. However, this is not practical for large-scale or automated extraction.
4. Custom Scripting with Other Programming Languages: Use a scripting language that has libraries for PDF manipulation (like Python with PyPDF2 or PDFMiner) to extract the text while preserving formatting, and then pass the extracted content to MATLAB if needed.
5. Optical Character Recognition (OCR): Use OCR tools that can recognize and preserve text formatting. MATLAB has an OCR function that can recognize text in images, but it won't retain text formatting. You would need to use a more advanced OCR tool for formatted text extraction.

[status, cmdout] = system('command-to-extract-formatted-text-from-pdf');

Remember to replace 'command-to-extract-formatted-text-from-pdf' with the actual command that invokes your PDF text extraction tool.
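
As one concrete possibility (an assumption, not something named in this thread), Poppler's pdftohtml command-line tool could fill that placeholder. It typically wraps bold and italic runs in <b> and <i> tags, although this depends on how fonts are named inside the PDF. The sketch below assumes pdftohtml is installed and on the system path, and "document.pdf" is a hypothetical file name.

```matlab
pdfFile = "document.pdf";   % hypothetical file name

% -stdout writes the HTML to standard output, -i ignores images,
% -noframes produces a single HTML document instead of a frameset.
cmd = sprintf('pdftohtml -stdout -i -noframes "%s"', pdfFile);
[status, html] = system(cmd);
if status ~= 0
    error('pdftohtml failed: %s', html);
end

% Collect the text of every <b>...</b> run found in the HTML.
tokens   = regexp(html, '<b>(.*?)</b>', 'tokens');
boldText = string([tokens{:}]);
disp(boldText')
```

The same pattern with <i> in place of <b> would collect the italic runs. HTML entities such as &amp; are not decoded here.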

For advanced document processing needs that go beyond what MATLAB directly supports, it's usually more effective to use a combination of tools, possibly involving other programming environments that have more specialized libraries for handling PDFs.
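
For the PDF to Word route, note that extractFileText returns plain text for Word files as well. One way to keep the bold information inside MATLAB is to read the XML stored in the .docx file directly, since a .docx file is a zip archive containing word/document.xml. The sketch below works on a hypothetical XML fragment and assumes bold runs carry a plain <w:b/> element; real documents may express bold through styles or attributes that this does not handle.

```matlab
% Hypothetical fragment of word/document.xml. A real file could be read with:
%   unzip("converted.docx", tmpDir);
%   xml = fileread(fullfile(tmpDir, "word", "document.xml"));
xml = ['<w:p><w:r><w:t xml:space="preserve">first word - this line contains </w:t></w:r>' ...
       '<w:r><w:rPr><w:b/></w:rPr><w:t>first word</w:t></w:r>' ...
       '<w:r><w:t xml:space="preserve"> which appears in between a sentence.</w:t></w:r></w:p>'];

% Split the XML into runs; [ >] keeps <w:rPr> from being mistaken for a run.
runs = regexp(xml, '<w:r[ >].*?</w:r>', 'match');

out = "";
for k = 1:numel(runs)
    tok = regexp(runs{k}, '<w:t(?:\s[^>]*)?>(.*?)</w:t>', 'tokens', 'once');
    if isempty(tok)
        continue                                  % run without text
    end
    runText = string(tok{1});
    isBold  = ~isempty(regexp(runs{k}, '<w:b\s*/>', 'once'));
    if isBold
        out = out + "**" + runText + "**";        % mark bold with double stars
    else
        out = out + runText;
    end
end
disp(out)
```

For the fragment above, this should print the sample line with its bold run wrapped in double stars, which is the form assumed in the question's comments.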

3 Comments

Christopher Creutzig on 19 Jan 2024

Just for clarification, extractFileText already handles more than 90% of the complexity of parsing the PDF stream you mentioned. It does not return information about font names, bold/italic/roman, position on the page, etc. because it is designed to read text for use in text analytics workflows.

Most of that information is, after all, already used internally to arrange the text found correctly before returning a string.

Steven on 7 Jun 2024

The conditions are very clearly stated. Thank you.
