Dennis Gorelik ([personal profile] dennisgorelik) wrote 2015-06-18 06:34 pm

Converting doc/docx/pdf/text to each other

Lots of applications need to load and convert document files of different formats into other formats or into text.
You would think that there would be a good solution for it.
Unfortunately, that's not the case.
Existing solutions are either desktop-only, buggy, or extremely expensive (~$10K/year).

I thought I found a solution - DevExpress Document Server library for $599.99

Unfortunately, after running for a couple of weeks it crashed my service with a StackOverflowException:
----
https://www.devexpress.com/Support/Center/Question/Details/T257097
To my regret, there is no simple workaround to avoid this exception with your document. Regarding the time frame for fixing this issue, it is difficult to provide any estimate in such cases.
----

So now I need to find a way to keep my service from dying when some random document is fed into it.

Sigh.
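
One way to keep the service alive is to run each conversion in a separate child process, so that a crash in the converter only kills the child. (In .NET a StackOverflowException cannot be caught and takes down the whole process, which is why try/catch inside the service does not help.) Below is a minimal sketch of the idea in Python. The function name `convert_isolated`, the command list, and the 60-second timeout are all assumptions for illustration; a real setup would launch whatever small executable wraps the conversion library.

```python
import subprocess
import sys


def convert_isolated(cmd, timeout_s=60):
    """Run a converter command in a child process.

    Returns (ok, detail). A crash or hang in the child never
    propagates to the caller; it only shows up as ok == False.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child was killed after exceeding the time limit.
        return False, "timeout"
    except OSError as exc:
        # The converter executable could not be started at all.
        return False, f"failed to start: {exc}"
    if proc.returncode != 0:
        # Non-zero exit covers both clean failures and hard crashes.
        return False, f"exit code {proc.returncode}"
    return True, proc.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical stand-ins for a real converter: one that crashes hard,
    # one that prints the converted text to stdout.
    crashing = [sys.executable, "-c", "import os; os.abort()"]
    healthy = [sys.executable, "-c", "print('converted text')"]
    print(convert_isolated(crashing))
    print(convert_isolated(healthy))
```

The cost is one process start per document, which is usually small next to the conversion itself.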

[identity profile] cranequinier.livejournal.com 2015-06-19 05:22 am (UTC)(link)
Converting DOC in a nice way is basically impossible without a VM running Windows 2000 - it's an old COM storage format.

[identity profile] cranequinier.livejournal.com 2015-06-19 03:31 pm (UTC)(link)
> What does "not nice way" for converting DOC mean?

It means skewed tables and garbage in some places instead of text.

> Are memory leaks and occasional fatal crashes pretty much guaranteed?

For a .NET library on a web server? Of course.

[identity profile] sagarasousuke.livejournal.com 2015-06-19 08:53 am (UTC)(link)
A cron job that checks every minute and restarts the service if it crashed? A quick-and-dirty fix when you have a queue to process (i.e. mark the suspected bad document as "check-manually" and skip it).

[identity profile] sagarasousuke.livejournal.com 2015-06-19 11:31 am (UTC)(link)
A kind of "reliable processing system made of unreliable elements", where the task state is kept separate from the processing done by the (unstable) agents/services.
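
A minimal sketch of that idea in Python, assuming a single worker and a SQLite table as the task store. The table layout, the status names other than "check-manually", and the `MAX_ATTEMPTS` limit are all hypothetical. The key point is that the attempt is recorded before the conversion starts, so if the worker dies mid-document, the restart logic can tell which document was in flight and eventually set it aside.

```python
import sqlite3

MAX_ATTEMPTS = 2  # assumption: after this many tries a human takes a look


def open_queue(path=":memory:"):
    """Open (or create) the task store; it lives outside the worker's memory."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        " id INTEGER PRIMARY KEY,"
        " path TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " attempts INTEGER NOT NULL DEFAULT 0)"
    )
    db.commit()
    return db


def enqueue(db, path):
    with db:  # commits on success, rolls back on error
        db.execute("INSERT INTO docs (path) VALUES (?)", (path,))


def _settle(db, where, params):
    # Give up on documents that used all their attempts, retry the rest.
    db.execute(
        "UPDATE docs SET status = CASE WHEN attempts >= ?"
        " THEN 'check-manually' ELSE 'pending' END WHERE " + where,
        (MAX_ATTEMPTS, *params),
    )


def recover(db):
    """Run at worker startup: 'processing' rows were in flight during a crash."""
    with db:
        _settle(db, "status = 'processing'", ())


def claim(db):
    """Pick the next pending document and record the attempt before converting."""
    with db:
        row = db.execute(
            "SELECT id, path FROM docs WHERE status = 'pending'"
            " ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            db.execute(
                "UPDATE docs SET status = 'processing',"
                " attempts = attempts + 1 WHERE id = ?",
                (row[0],),
            )
        return row


def finish(db, doc_id, ok):
    """Record the outcome of a conversion that returned normally."""
    with db:
        if ok:
            db.execute("UPDATE docs SET status = 'done' WHERE id = ?", (doc_id,))
        else:
            _settle(db, "id = ?", (doc_id,))
```

The worker loop is then: call `recover` once at startup, and repeatedly `claim`, convert, `finish`. Whatever restarts the worker (cron or a service manager) needs no knowledge of the documents, because the state survives in the table.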

[identity profile] serjiojitser.livejournal.com 2015-06-28 02:14 pm (UTC)(link)
There are two reasons:

1. PDF is a fairly closed format owned by Adobe.
2. PDF is quite complex (it is a language).

It's roughly the same with Microsoft Word: it's complex and its owners are jerks. Neither company wants to share; they want to become monopolists of their formats.

So don't expect any full-fledged converters.

Look for small converter .dll files from Indian programmers. They work about 80% of the time, and that's the maximum.

[identity profile] serjiojitser.livejournal.com 2015-06-30 03:55 am (UTC)(link)
It means that some sections with tricky code don't get converted; "blank spots" can remain. There are also distortions (but that mostly concerns vector graphics; with plain text it's rare).