dennisgorelik | Converting doc/docx/pdf/text to each other

Lots of applications need to load and convert document files of different formats into other formats or into text.
You would think that there would be a good solution for this.
Unfortunately, that's not the case.
Existing solutions are either desktop-only, buggy, or extremely expensive (~$10K/year).

I thought I had found a solution: the DevExpress Document Server library for $599.99.

Unfortunately, after running for a couple of weeks, it crashed my service with a StackOverflowException:
----
https://www.devexpress.com/Support/Center/Question/Details/T257097
To my regret, there is no simple workaround to avoid this exception with your document. Regarding the time frame for fixing this issue, it is difficult to provide any estimate in such cases.
----

So now I need to find a way to prevent my service from dying when some random document is fed into it.

Sigh.

Converting DOC in a nice way is basically impossible without a VM running Windows 2000 - it's old COM storage.

What does a "not nice" way of converting DOC mean?

Are memory leaks and occasional fatal crashes pretty much guaranteed?

> What does a "not nice" way of converting DOC mean?

It means skewed tables and garbage in some places instead of text.

> Are memory leaks and occasional fatal crashes pretty much guaranteed?

For a .NET library on a web server? Of course.

> It means skewed tables and garbage in some places instead of text.

That's not a serious problem.
The crashes that kill the whole process - that's what concerns me.

> For a .NET library on a web server? Of course.

Do you mean that any .NET library on a web server crashes occasionally?
Or are such crashes specific to the conversion of DOC files (due to old COM storage calls)?

I'm not sure about the web server, because IIS automatically recovers from problems (so we might not have noticed), but our Windows service has not exited due to a crash for several years (and when it did, it was our own silly coding mistake).

A cron check every minute, with a restart if the service has crashed? It's a fast'n'dirty fix when you have a queue to process (i.e. mark the "supposedly bad" document as "check-manually" and skip it).

We made a similar workaround:
1) Automatically restarting our Windows service if it crashes.
2) Remembering the hash of the resume that crashed it, and not converting that resume the next time.

However, that is only OK as an ugly patch, because our Windows service runs many other processes in parallel with the document converter queue.
All of these processes are terminated midway.

Another problem is that the web site will also crash.
Fortunately, IIS automatically restarts the crashed worker process, but it still makes me think that hard crashes like these can add instability to our system.
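
Step 2 of the workaround above can be sketched roughly as follows. This is only an illustration in Python, not the code we actually run: the class name `CrashBlocklist`, the JSON state file, and the begin/finish protocol are all assumptions. The idea is to write the hash to disk before the conversion starts, so that if the process dies mid-conversion, the next start finds the leftover hash and blocks that document.

```python
import hashlib
import json
from pathlib import Path


class CrashBlocklist:
    """Remembers hashes of documents that killed the process (hypothetical helper)."""

    def __init__(self, path):
        self.path = Path(path)
        self.state = {"blocked": [], "in_progress": None}
        if self.path.exists():
            self.state = json.loads(self.path.read_text())
        # A leftover in-progress hash means the previous run died mid-conversion.
        leftover = self.state["in_progress"]
        if leftover is not None:
            if leftover not in self.state["blocked"]:
                self.state["blocked"].append(leftover)
            self.state["in_progress"] = None
            self._save()

    @staticmethod
    def digest(data):
        return hashlib.sha256(data).hexdigest()

    def _save(self):
        self.path.write_text(json.dumps(self.state))

    def is_blocked(self, data):
        return self.digest(data) in self.state["blocked"]

    def begin(self, data):
        # Persist the hash BEFORE converting, so a hard crash leaves a trace.
        self.state["in_progress"] = self.digest(data)
        self._save()

    def finish(self):
        # The conversion returned normally; clear the marker.
        self.state["in_progress"] = None
        self._save()
```

The service would call `is_blocked()` first, then `begin()`, run the converter, and call `finish()`. This sketch assumes only one conversion runs at a time; a parallel queue would need one marker per worker.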

It's a kind of "reliable processing system made of unreliable elements", where the task state is kept separate from the processing done by (unstable) agents/services.
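
One way to read this idea, as a minimal sketch: the parent keeps the task state and runs each conversion in a separate child process, so a hard crash or a hang only kills the child. The function names, the status strings other than "check-manually", and the converter command line are all hypothetical; a real converter would need to be wrapped as a small command-line program for this to apply.

```python
import subprocess


def convert_isolated(command, timeout_seconds=60):
    """Run one conversion in a child process and report how it ended.

    Returns "done" if the child exits with code 0, otherwise "check-manually".
    """
    try:
        result = subprocess.run(
            command,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires.
        return "check-manually"
    except OSError:
        # The converter executable could not be started at all.
        return "check-manually"
    return "done" if result.returncode == 0 else "check-manually"


def process_queue(tasks, build_command, timeout_seconds=60):
    """Process every "pending" task; `tasks` maps document path -> status.

    `build_command` is a caller-supplied function that turns a document path
    into the argument list of the (hypothetical) converter program.
    """
    for path in list(tasks):
        if tasks[path] != "pending":
            continue
        tasks[path] = convert_isolated(build_command(path), timeout_seconds)
    return tasks
```

Because the status lives in the parent (or, in a real system, in a database), a document that crashes the converter just ends up marked "check-manually" and the queue moves on to the next one.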

There are two reasons:

1. PDF is a fairly closed format owned by Adobe.
2. PDF is quite complex (it is a language).

It's about the same with Microsoft Word: it's complex and the owners are jerks. Neither of them wants to share; they want to become format monopolists.

So don't expect full-fledged converters.

Look for small converter .dll files written by Indian programmers. They work at 80%, and that's the maximum.

You mean, use separate DLLs for converting pdf->text and doc->docx?

What does "work at 80%" mean?

It means that some sections with tricky code don't get converted, so "blank spots" can remain. There can also be distortions (but that is more of a vector graphics issue; with ordinary text it is rare).
