Re: problems with pdftotext

From: jimw <jpw_at_no.spam.please>
Date: Wed Jun 28 2006 - 12:45:02 CST

Conrad Knauer wrote:

> On 6/27/06, jimw <jpw@sasktel.net> wrote:
>
> [pdftotext]
>
>> a text done in
>> the Cyrillic alphabet produces a file containing nothing but the
>> punctuation. it's necessary to use the -enc option to turn it to
>> cyrillic, but the result I get is this:
>>
>> /Desktop/Zash$ pdftotext -enc cyrillic ./zash1.pdf
>> Error: Couldn't find unicodeMap file for the 'cyrillic' encoding
>> Error: Couldn't get text encoding
>>
>> Anyone know what I'm doing wrong?
>
>
> Hmm... First, are you still using Breezy? (this is not necessarily
> the cause of any problems, I just am curious :)
>
> Second, if you download
> http://www.health.state.ny.us/nysdoh/hospital/healthcareproxy/pdf/1402.pdf
>
> and just do "pdftotext 1402.pdf"
> do you get a garbage document? (In Dapper it makes a nice text file,
> e.g. it starts "Доверенность на принятие решений о медицинской
> помощи")
>
> Also, if you open the PDF file with a viewer can you copy and then
> paste sections into a gedit window?
>
> If that file works, can you send me a copy of the one that's giving
> you grief?
>
That file, run through pdftotext, produces only the punctuation and such
words as are written in the Latin alphabet.

I can, however, copy it and paste it into gedit, thanks.

I tried this on one of the simpler files, and I can cut and paste that, too.

Unfortunately, one of the ones I'd really like to convert is a newspaper,
and cutting and pasting it scrambles the columns, as well as semi-randomly
replacing some of the characters with other characters.
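
For the column problem, one thing that might be worth trying is having
pdftotext extract a single column at a time. The sketch below only builds
and runs the command line. It assumes a poppler build of pdftotext that
accepts the crop options -x, -y, -W and -H, and it asks for UTF-8 output,
which as far as I know is a built-in encoding name (unlike 'cyrillic').
The helper names, the crop numbers and the output file name are made up
for illustration, and whether this recovers the Cyrillic text at all
depends on how the fonts are embedded in the PDF.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, out_path, encoding="UTF-8",
                        crop=None, page=None):
    """Build a pdftotext command line as a list of arguments.

    crop is a hypothetical (x, y, width, height) tuple describing one
    column; the numbers must be measured for the actual page.
    """
    cmd = ["pdftotext", "-enc", encoding]
    if page is not None:
        # Restrict extraction to a single page.
        cmd += ["-f", str(page), "-l", str(page)]
    if crop is not None:
        x, y, width, height = crop
        # Assumption: this pdftotext build supports the crop options.
        cmd += ["-x", str(x), "-y", str(y),
                "-W", str(width), "-H", str(height)]
    cmd += [pdf_path, out_path]
    return cmd


def extract_column(pdf_path, out_path, crop, page=1):
    """Run pdftotext on one cropped column; raises if the tool fails."""
    cmd = build_pdftotext_cmd(pdf_path, out_path, crop=crop, page=page)
    subprocess.run(cmd, check=True)
```

Calling extract_column("zash1.pdf", "col1.txt", crop=(0, 0, 200, 800))
would then write the left-hand strip of page 1 to col1.txt, assuming the
first column really is about 200 units wide.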

I've worked at cutting and pasting a column at a time, which leaves me
only the problem of weird letters being substituted here and there.
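
To hunt down the substituted letters in the pasted text, a small script
can at least point at the suspicious words. This is a rough sketch that
assumes the wrong characters come from a different script than the rest
of the word (for example a Latin "p" inside a Cyrillic word). It will not
catch a wrong Cyrillic letter replacing a right one. The function names
are my own.

```python
import unicodedata


def script_of(ch):
    """Return a coarse script label for a letter, or None for non-letters."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    if name.startswith("LATIN"):
        return "Latin"
    return "Other"


def suspicious_words(text):
    """List whitespace-separated words whose letters mix several scripts."""
    flagged = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word}
        scripts.discard(None)  # ignore punctuation and digits
        if len(scripts) > 1:
            flagged.append(word)
    return flagged
```

Feeding it the pasted text of one column returns the words to check by
eye against the PDF.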

I'll send it to you privately, so everyone in the group doesn't have to
suffer.

JimW
Received on Wed Jun 28 12:45:37 2006
