For example, when parsing a PDF such as test/pdf/misc/i28_line_break_210.pdf in this repo,
const pdfParser = new PDFParser();
pdfParser.on('pdfParser_dataError', (errData: PDFParserError) => {
  // handle error
});
pdfParser.on('pdfParser_dataReady', (pdfData: PDFData) => {
  try {
    // write pdfData.Pages[0].Texts to file
  } catch (error) {
    // handle error
  }
});
pdfParser.loadPDF(`test/pdf/misc/i28_line_break_210.pdf`);
The resulting output is something like this:
[ { x: -0.25,
    y: 48.75,
    w: 3,
    clr: 0,
    sw: 0.32553125,
    A: 'left',
    R: [ { T: '%20', S: -1, TS: [ 0, 15, 0, 0 ] } ] },
  { x: -0.25,
    y: 48.75,
    w: 110.016,
    clr: 0,
    sw: 0.32553125,
    A: 'left',
    R: [ { T: 'BY%20ORDER%20OF%20THE%20', S: -1, TS: [ 0, 16, 1, 1 ] } ] },
  { x: -0.25,
    y: 48.75,
    w: 3,
    clr: 0,
    sw: 0.32553125,
    A: 'left',
    R: [ { T: '%20', S: -1, TS: [ 0, 16, 1, 1 ] } ] },
  { x: -0.25,
    y: 48.75,
    w: 140.376,
    clr: 0,
    sw: 0.32553125,
    A: 'left',
    R: [ { T: 'SECRETARY%20OF%20THE%20AIR', S: -1, TS: [ 0, 16, 1, 1 ] } ] },
  ...
Sometimes the first item has a unique x/y coordinate, but thereafter all elements have the same x/y coordinate.
This makes spatially aware parsing and grouping algorithms that depend on these coordinates useless.
Output was fine in version 3.2.0 and is broken in versions 3.2.1 and 3.2.2.
The issue happens for both parseBuffer() and loadPDF().
The major refactor of version 3.2.1 added "Type3 glyph font support", but after reviewing the diff I was unable to identify where the above undesirable behavior was introduced.
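To make the symptom easier to check across versions, here is a minimal sketch that measures how many text items share the single most common x/y position. The `PositionedText` interface and the `dominantPositionShare` helper are hypothetical names for illustration, not part of the library; the sketch only assumes that each entry in `Texts` exposes numeric `x` and `y` fields, as in the output above.

```typescript
// Hypothetical minimal shape: only the fields this check needs.
interface PositionedText {
  x: number;
  y: number;
}

// Returns the fraction of items that sit at the most common (x, y) position.
// A value close to 1 on a page with many text runs suggests collapsed coordinates.
function dominantPositionShare(texts: PositionedText[]): number {
  if (texts.length === 0) return 0;
  const counts = new Map<string, number>();
  let max = 0;
  for (const t of texts) {
    const key = `${t.x},${t.y}`;
    const next = (counts.get(key) ?? 0) + 1;
    counts.set(key, next);
    if (next > max) max = next;
  }
  return max / texts.length;
}

// Sample modeled on the output above: every item reports x = -0.25, y = 48.75.
const sample: PositionedText[] = [
  { x: -0.25, y: 48.75 },
  { x: -0.25, y: 48.75 },
  { x: -0.25, y: 48.75 },
  { x: -0.25, y: 48.75 },
];

console.log(dominantPositionShare(sample)); // 1
```

Running the same check on `pdfData.Pages[0].Texts` under 3.2.0 and under 3.2.1 should show whether the share jumps toward 1 after the upgrade, which could also serve as a regression test.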