PDFLoader
Only available on Node.js.
This notebook provides a quick overview for getting started with the PDFLoader document loader. For detailed documentation of all PDFLoader features and configurations, head to the API reference.
Overview
Integration details
Class | Package | Compatibility | Local | PY support |
---|---|---|---|---|
PDFLoader | @langchain/community | Node-only | ✅ | 🟠 (See note below) |
The Python package has many PDF loaders to choose from. See the LangChain Python documentation for a full list of Python document loaders.
Setup
To access the PDFLoader document loader, you'll need to install the @langchain/community integration, along with the pdf-parse package.
Credentials
Installation
The LangChain PDFLoader integration lives in the @langchain/community package:
npm i @langchain/community pdf-parse
yarn add @langchain/community pdf-parse
pnpm add @langchain/community pdf-parse
Instantiation
Now we can instantiate our loader object and load documents:
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
const nike10kPdfPath = "../../../../data/nke-10k-2023.pdf";
const loader = new PDFLoader(nike10kPdfPath);
Load
const docs = await loader.load();
docs[0];
Document {
pageContent: 'Table of Contents\n' +
'UNITED STATES\n' +
'SECURITIES AND EXCHANGE COMMISSION\n' +
'Washington, D.C. 20549\n' +
'FORM 10-K\n' +
'(Mark One)\n' +
'☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\n' +
'FOR THE FISCAL YEAR ENDED MAY 31, 2023\n' +
'OR\n' +
'☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\n' +
'FOR THE TRANSITION PERIOD FROM TO .\n' +
'Commission File No. 1-10635\n' +
'NIKE, Inc.\n' +
'(Exact name of Registrant as specified in its charter)\n' +
'Oregon93-0584541\n' +
'(State or other jurisdiction of incorporation)(IRS Employer Identification No.)\n' +
'One Bowerman Drive, Beaverton, Oregon 97005-6453\n' +
'(Address of principal executive offices and zip code)\n' +
'(503) 671-6453\n' +
"(Registrant's telephone number, including area code)\n" +
'SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:\n' +
'Class B Common StockNKENew York Stock Exchange\n' +
'(Title of each class)(Trading symbol)(Name of each exchange on which registered)\n' +
'SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:\n' +
'NONE\n' +
'Indicate by check mark:YESNO\n' +
'•if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.þ ¨\n' +
'•if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. ¨þ\n' +
'•whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding\n' +
'12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the\n' +
'past 90 days.\n' +
'þ ¨\n' +
'•whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T\n' +
'(§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files).\n' +
'þ ¨\n' +
'•whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company or an emerging growth company. See the definitions of “large accelerated filer,”\n' +
'“accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.\n' +
'Large accelerated filerþAccelerated filer☐Non-accelerated filer☐Smaller reporting company☐Emerging growth company☐\n' +
'•if an emerging growth company, if the registrant has elected not to use the extended transition period for complying with any new or revised financial\n' +
'accounting standards provided pursuant to Section 13(a) of the Exchange Act.\n' +
' ¨\n' +
"•whether the registrant has filed a report on and attestation to its management's assessment of the effectiveness of its internal control over financial\n" +
'reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit\n' +
'report.\n' +
'þ\n' +
'•if securities are registered pursuant to Section 12(b) of the Act, whether the financial statements of the registrant included in the filing reflect the\n' +
'correction of an error to previously issued financial statements.\n' +
' ¨\n' +
'•whether any of those error corrections are restatements that required a recovery analysis of incentive-based compensation received by any of the\n' +
"registrant's executive officers during the relevant recovery period pursuant to § 240.10D-1(b).\n" +
' ¨\n' +
'•\n' +
'whether the registrant is a shell company (as defined in Rule 12b-2 of the Act).☐þ\n' +
"As of November 30, 2022, the aggregate market values of the Registrant's Common Stock held by non-affiliates were:\n" +
'Class A$7,831,564,572 \n' +
'Class B136,467,702,472 \n' +
'$144,299,267,044 ',
metadata: {
source: '../../../../data/nke-10k-2023.pdf',
pdf: {
version: '1.10.100',
info: [Object],
metadata: null,
totalPages: 107
},
loc: { pageNumber: 1 }
},
id: undefined
}
console.log(docs[0].metadata);
{
source: '../../../../data/nke-10k-2023.pdf',
pdf: {
version: '1.10.100',
info: {
PDFFormatVersion: '1.4',
IsAcroFormPresent: false,
IsXFAPresent: false,
Title: '0000320187-23-000039',
Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
Keywords: '0000320187-23-000039; ; 10-K',
Creator: 'EDGAR Filing HTML Converter',
Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
CreationDate: "D:20230720162200-04'00'",
ModDate: "D:20230720162208-04'00'"
},
metadata: null,
totalPages: 107
},
loc: { pageNumber: 1 }
}
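Because each page becomes its own document with a `loc.pageNumber` field, a page range can be selected after loading. The following is a minimal sketch of that idea, not part of the loader's API: `PageDoc` is a local stand-in that models only the metadata fields visible in the output above, `selectPages` is a hypothetical helper, and the sample data is hand-built rather than loaded from a PDF.

```typescript
// Local stand-in for the Document shape shown in the output above.
// It is NOT imported from LangChain; only the fields used here are modeled.
interface PageDoc {
  pageContent: string;
  metadata: {
    source: string;
    pdf: { totalPages: number };
    loc: { pageNumber: number };
  };
}

// Hypothetical helper: keep only pages in the inclusive range [first, last].
function selectPages(docs: PageDoc[], first: number, last: number): PageDoc[] {
  return docs.filter((doc) => {
    const page = doc.metadata.loc.pageNumber;
    return page >= first && page <= last;
  });
}

// Hand-built sample data standing in for the result of `loader.load()`.
const sampleDocs: PageDoc[] = [1, 2, 3, 4].map((pageNumber) => ({
  pageContent: `text of page ${pageNumber}`,
  metadata: {
    source: "example.pdf",
    pdf: { totalPages: 4 },
    loc: { pageNumber },
  },
}));

const middlePages = selectPages(sampleDocs, 2, 3);
console.log(middlePages.map((doc) => doc.metadata.loc.pageNumber));
```

Page numbers in the output above start at 1, so the range here is treated as 1-based and inclusive.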
Usage, one document per file
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
const singleDocPerFileLoader = new PDFLoader(nike10kPdfPath, {
splitPages: false,
});
const singleDoc = await singleDocPerFileLoader.load();
console.log(singleDoc[0].pageContent.slice(0, 100));
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
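If the pages have already been loaded separately, a similar single-document result can be approximated without re-parsing the file. This is a rough sketch only: `PageDoc` and `mergePages` are hypothetical names, and the blank-line separator between pages is an assumption that may not match what the loader produces with `splitPages: false`.

```typescript
// Local stand-in for a loaded page; not the LangChain Document class.
interface PageDoc {
  pageContent: string;
  metadata: { source: string; loc?: { pageNumber: number } };
}

// Hypothetical helper: concatenate per-page documents into one document.
// The default separator is an assumption chosen for illustration.
function mergePages(pages: PageDoc[], separator: string = "\n\n"): PageDoc {
  if (pages.length === 0) {
    throw new Error("mergePages needs at least one page");
  }
  return {
    pageContent: pages.map((page) => page.pageContent).join(separator),
    // Page-level location no longer applies, so only the source is kept.
    metadata: { source: pages[0].metadata.source },
  };
}

// Hand-built sample pages standing in for the result of `loader.load()`.
const samplePages: PageDoc[] = [
  { pageContent: "first page", metadata: { source: "example.pdf", loc: { pageNumber: 1 } } },
  { pageContent: "second page", metadata: { source: "example.pdf", loc: { pageNumber: 2 } } },
];

const merged = mergePages(samplePages);
console.log(merged.pageContent);
```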
Usage, custom pdfjs build
By default we use the pdfjs build bundled with pdf-parse, which is compatible with most environments, including Node.js and modern browsers. If you want to use a more recent version of pdfjs-dist or a custom build of pdfjs-dist, you can do so by providing a custom pdfjs function that returns a promise that resolves to the PDFJS object.
In the following example we use the "legacy" build of pdfjs-dist (see the pdfjs docs), which includes several polyfills not included in the default build.
npm i pdfjs-dist
yarn add pdfjs-dist
pnpm add pdfjs-dist
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
const customBuildLoader = new PDFLoader(nike10kPdfPath, {
// you may need to add `.then(m => m.default)` to the end of the import
pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js"),
});
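The comment in the example above notes that `.then(m => m.default)` may be needed, depending on how the module is exposed by a dynamic import. One way to handle both cases is a small normalizing helper. The sketch below is illustrative only: `unwrapDefault` is a hypothetical name, and the two "modules" are plain objects standing in for what `import(...)` might resolve to.

```typescript
// Hypothetical helper: return `mod.default` when the resolved module has a
// `default` property, otherwise return the module itself.
function unwrapDefault<T>(mod: unknown): T {
  if (typeof mod === "object" && mod !== null && "default" in mod) {
    return (mod as { default: T }).default;
  }
  return mod as T;
}

// Made-up shape used only for this illustration.
interface FakePdfjs {
  version: string;
}

// Stand-ins for the two shapes a dynamic import might resolve to.
const wrappedModule = { default: { version: "fake-wrapped" } };
const plainModule = { version: "fake-plain" };

const fromWrapped = unwrapDefault<FakePdfjs>(wrappedModule);
const fromPlain = unwrapDefault<FakePdfjs>(plainModule);
console.log(fromWrapped.version, fromPlain.version);
```

A helper like this could be chained onto the dynamic import with `.then(...)` in place of the hand-written `m => m.default`.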
Eliminating extra spaces
PDFs come in many varieties, which makes reading them a challenge. The loader parses individual text elements and joins them together with a space by default, but if you are seeing excessive spaces, this may not be the desired behavior. In that case, you can override the separator with an empty string like this:
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
const noExtraSpacesLoader = new PDFLoader(nike10kPdfPath, {
parsedItemSeparator: "",
});
const noExtraSpacesDocs = await noExtraSpacesLoader.load();
console.log(noExtraSpacesDocs[0].pageContent.slice(100, 250));
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE FISCAL YEAR ENDED MAY 31, 2023
OR
☐ TRANSITI
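To see why the separator matters, it helps to picture the page text as a list of small fragments that are joined together. The sketch below is a simplified illustration and not the loader's implementation: the fragment list is made up, `joinItems` is a hypothetical helper, and the real loader also has to handle line breaks.

```typescript
// Made-up fragments for one line of text. In this sample the PDF already
// encodes the spaces between words as their own fragments.
const fragments: string[] = ["SECUR", "ITIES", " ", "AND", " ", "EXCHANGE"];

// Hypothetical helper mirroring the idea of joining parsed items
// with a configurable separator.
function joinItems(items: string[], separator: string): string {
  return items.join(separator);
}

// Joining with a space splits "SECURITIES" and pads the real spaces.
const withSpaces = joinItems(fragments, " ");

// Joining with an empty string keeps only the spaces present in the PDF.
const withoutSeparator = joinItems(fragments, "");

console.log(JSON.stringify(withSpaces));
console.log(JSON.stringify(withoutSeparator));
```

The trade-off runs the other way for PDFs that do not encode spaces as separate fragments: there, an empty separator would glue neighboring words together, so it is worth checking a sample of the output before settling on a value.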
Loading directories
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const exampleDataPath =
"../../../../../../examples/src/document_loaders/example_data/";
/* Load all PDFs within the specified directory */
const directoryLoader = new DirectoryLoader(exampleDataPath, {
".pdf": (path: string) => new PDFLoader(path),
});
const directoryDocs = await directoryLoader.load();
console.log(directoryDocs[0]);
/* Additional steps : Split text into chunks with any TextSplitter. You can then use it as context or save it to memory afterwards. */
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000,
chunkOverlap: 200,
});
const splitDocs = await textSplitter.splitDocuments(directoryDocs);
console.log(splitDocs[0]);
Unknown file type: Star_Wars_The_Clone_Wars_S06E07_Crisis_at_the_Heart.srt
Unknown file type: example.txt
Unknown file type: notion.md
Unknown file type: bad_frontmatter.md
Unknown file type: frontmatter.md
Unknown file type: no_frontmatter.md
Unknown file type: no_metadata.md
Unknown file type: tags_and_frontmatter.md
Unknown file type: test.mp3
Document {
pageContent: 'Bitcoin: A Peer-to-Peer Electronic Cash System\n' +
'Satoshi Nakamoto\n' +
'satoshin@gmx.com\n' +
'www.bitcoin.org\n' +
'Abstract. A purely peer-to-peer version of electronic cash would allow online \n' +
'payments to be sent directly from one party to another without going through a \n' +
'financial institution. Digital signatures provide part of the solution, but the main \n' +
'benefits are lost if a trusted third party is still required to prevent double-spending. \n' +
'We propose a solution to the double-spending problem using a peer-to-peer network. \n' +
'The network timestamps transactions by hashing them into an ongoing chain of \n' +
'hash-based proof-of-work, forming a record that cannot be changed without redoing \n' +
'the proof-of-work. The longest chain not only serves as proof of the sequence of \n' +
'events witnessed, but proof that it came from the largest pool of CPU power. As \n' +
'long as a majority of CPU power is controlled by nodes that are not cooperating to \n' +
"attack the network, they'll generate the longest chain and outpace attackers. The \n" +
'network itself requires minimal structure. Messages are broadcast on a best effort \n' +
'basis, and nodes can leave and rejoin the network at will, accepting the longest \n' +
'proof-of-work chain as proof of what happened while they were gone.\n' +
'1.Introduction\n' +
'Commerce on the Internet has come to rely almost exclusively on financial institutions serving as \n' +
'trusted third parties to process electronic payments. While the system works well enough for \n' +
'most transactions, it still suffers from the inherent weaknesses of the trust based model. \n' +
'Completely non-reversible transactions are not really possible, since financial institutions cannot \n' +
'avoid mediating disputes. The cost of mediation increases transaction costs, limiting the \n' +
'minimum practical transaction size and cutting off the possibility for small casual transactions, \n' +
'and there is a broader cost in the loss of ability to make non-reversible payments for non-\n' +
'reversible services. With the possibility of reversal, the need for trust spreads. Merchants must \n' +
'be wary of their customers, hassling them for more information than they would otherwise need. \n' +
'A certain percentage of fraud is accepted as unavoidable. These costs and payment uncertainties \n' +
'can be avoided in person by using physical currency, but no mechanism exists to make payments \n' +
'over a communications channel without a trusted party.\n' +
'What is needed is an electronic payment system based on cryptographic proof instead of trust, \n' +
'allowing any two willing parties to transact directly with each other without the need for a trusted \n' +
'third party. Transactions that are computationally impractical to reverse would protect sellers \n' +
'from fraud, and routine escrow mechanisms could easily be implemented to protect buyers. In \n' +
'this paper, we propose a solution to the double-spending problem using a peer-to-peer distributed \n' +
'timestamp server to generate computational proof of the chronological order of transactions. The \n' +
'system is secure as long as honest nodes collectively control more CPU power than any \n' +
'cooperating group of attacker nodes.\n' +
'1',
metadata: {
source: '/Users/bracesproul/code/lang-chain-ai/langchainjs/examples/src/document_loaders/example_data/bitcoin.pdf',
pdf: {
version: '1.10.100',
info: [Object],
metadata: null,
totalPages: 9
},
loc: { pageNumber: 1 }
},
id: undefined
}
Document {
pageContent: 'Bitcoin: A Peer-to-Peer Electronic Cash System\n' +
'Satoshi Nakamoto\n' +
'satoshin@gmx.com\n' +
'www.bitcoin.org\n' +
'Abstract. A purely peer-to-peer version of electronic cash would allow online \n' +
'payments to be sent directly from one party to another without going through a \n' +
'financial institution. Digital signatures provide part of the solution, but the main \n' +
'benefits are lost if a trusted third party is still required to prevent double-spending. \n' +
'We propose a solution to the double-spending problem using a peer-to-peer network. \n' +
'The network timestamps transactions by hashing them into an ongoing chain of \n' +
'hash-based proof-of-work, forming a record that cannot be changed without redoing \n' +
'the proof-of-work. The longest chain not only serves as proof of the sequence of \n' +
'events witnessed, but proof that it came from the largest pool of CPU power. As \n' +
'long as a majority of CPU power is controlled by nodes that are not cooperating to',
metadata: {
source: '/Users/bracesproul/code/lang-chain-ai/langchainjs/examples/src/document_loaders/example_data/bitcoin.pdf',
pdf: {
version: '1.10.100',
info: [Object],
metadata: null,
totalPages: 9
},
loc: { pageNumber: 1, lines: [Object] }
},
id: undefined
}
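The second document above is a chunk of the first, produced with `chunkSize: 1000` and `chunkOverlap: 200`. The sketch below illustrates what those two numbers mean using a plain fixed-size sliding window. It is deliberately simplified and is not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to break on separators such as newlines; `chunkWithOverlap` is a hypothetical helper.

```typescript
// Hypothetical helper: cut `text` into windows of at most `chunkSize`
// characters, where each window starts `chunkSize - chunkOverlap`
// characters after the previous one.
function chunkWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be at least 0 and smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) {
      break;
    }
  }
  return chunks;
}

// Tiny example: windows of 4 characters sharing 2 characters with the previous one.
const exampleChunks = chunkWithOverlap("abcdefghij", 4, 2);
console.log(exampleChunks);
```

The overlap means the end of one chunk is repeated at the start of the next, which helps keep sentences that straddle a boundary retrievable from either chunk.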
API reference
For detailed documentation of all PDFLoader features and configurations head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_pdf.PDFLoader.html