Loader
Before you can start indexing your documents, you need to load them into memory.
SimpleDirectoryReader
LlamaIndex.TS supports easy loading of files from folders using the SimpleDirectoryReader class.
It is a simple reader that reads all files from a directory and its subdirectories.
import { SimpleDirectoryReader } from "llamaindex/readers/SimpleDirectoryReader";
// or
// import { SimpleDirectoryReader } from 'llamaindex'
const reader = new SimpleDirectoryReader();
const documents = await reader.loadData("../data");
documents.forEach((doc) => {
  console.log(`document (${doc.id_}):`, doc.getText());
});
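To make "a directory and its subdirectories" concrete, here is a minimal sketch of a recursive file listing using only Node.js built-ins. The helper name listFilesRecursively is hypothetical and is not part of LlamaIndex.TS; it only illustrates the kind of traversal described above, not how SimpleDirectoryReader is actually implemented.

```typescript
import { readdir } from "node:fs/promises";
import { join } from "node:path";

// Hypothetical helper (not a LlamaIndex.TS API): collect the paths of all
// regular files under `dir`, descending into every subdirectory.
async function listFilesRecursively(dir: string): Promise<string[]> {
  const entries = await readdir(dir, { withFileTypes: true });
  const files: string[] = [];
  for (const entry of entries) {
    const fullPath = join(dir, entry.name);
    if (entry.isDirectory()) {
      // Recurse into the subdirectory and append whatever it contains.
      files.push(...(await listFilesRecursively(fullPath)));
    } else if (entry.isFile()) {
      files.push(fullPath);
    }
  }
  // Sort so the result does not depend on the file system's listing order.
  return files.sort();
}
```

Each returned path could then be handed to whichever reader matches its extension.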
Currently, it supports reading .txt, .pdf, .csv, .md, .docx, .htm, .html, .jpg, .jpeg, .png and .gif files, but support for other file types is planned.
You can override the default reader for all file types, including unsupported ones, with the overrideReader option.
Additionally, you can override the default reader for specific file types or add support for additional file types with the fileExtToReader option.
Also, you can provide a defaultReader as a fallback for files with unsupported extensions. By default it is TextFileReader.
SimpleDirectoryReader supports up to 9 concurrent requests. Use the numWorkers option to set the number of concurrent requests. By default it runs in sequential mode, i.e. numWorkers is set to 1.
import type { BaseReader, Document, Metadata } from "llamaindex";
import {
  FILE_EXT_TO_READER,
  SimpleDirectoryReader,
} from "llamaindex/readers/SimpleDirectoryReader";
import { TextFileReader } from "llamaindex/readers/TextFileReader";
class ZipReader implements BaseReader {
  loadData(...args: any[]): Promise<Document<Metadata>[]> {
    throw new Error("Implement me");
  }
}
const reader = new SimpleDirectoryReader();
const documents = await reader.loadData({
  directoryPath: "../data",
  defaultReader: new TextFileReader(),
  fileExtToReader: {
    ...FILE_EXT_TO_READER,
    zip: new ZipReader(),
  },
});
documents.forEach((doc) => {
  console.log(`document (${doc.id_}):`, doc.getText());
});
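The way the three reader options interact can be sketched as a small lookup function. This is an illustration of the precedence described above (overrideReader for everything, then fileExtToReader by extension, then defaultReader as the fallback), not the library's actual code. The names SketchReader, ReaderOptions and pickReader are hypothetical, and the case-insensitive extension match is an assumption.

```typescript
// Hypothetical stand-in for a reader; the real BaseReader exposes loadData().
interface SketchReader {
  name: string;
}

// Hypothetical subset of the loadData options discussed above.
interface ReaderOptions {
  overrideReader?: SketchReader;
  fileExtToReader?: Record<string, SketchReader>;
  defaultReader?: SketchReader;
}

// Hypothetical helper: decide which reader handles a given file path.
function pickReader(
  filePath: string,
  options: ReaderOptions,
): SketchReader | undefined {
  // 1. An override reader, when present, is used for every file type.
  if (options.overrideReader) return options.overrideReader;

  // 2. Otherwise look the file extension up in the extension map.
  //    Assumption: extensions are matched case-insensitively.
  const baseName = filePath.slice(filePath.lastIndexOf("/") + 1);
  const dot = baseName.lastIndexOf(".");
  const ext = dot >= 0 ? baseName.slice(dot + 1).toLowerCase() : "";
  const map = options.fileExtToReader;
  if (map && Object.prototype.hasOwnProperty.call(map, ext)) {
    return map[ext];
  }

  // 3. Unknown extensions fall back to the default reader, if one is set.
  return options.defaultReader;
}
```

With this ordering, adding a zip entry to the extension map only affects .zip files, while setting an override reader replaces the per-extension lookup entirely.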
LlamaParse
LlamaParse is an API created by LlamaIndex to efficiently parse files, e.g. it's great at converting PDF tables into markdown.
To use it, first log in and get an API key from https://cloud.llamaindex.ai. Make sure to pass the key as the apiKey parameter or store it in the environment variable LLAMA_CLOUD_API_KEY.
Then, you can use the LlamaParseReader class to load local files and convert them into parsed documents that can be used by LlamaIndex.
See LlamaParseReader.ts for a list of supported file types:
import { LlamaParseReader, VectorStoreIndex } from "llamaindex";
async function main() {
  // Load PDF using LlamaParse
  const reader = new LlamaParseReader({ resultType: "markdown" });
  const documents = await reader.loadData("../data/TOS.pdf");
  // Split text and create embeddings. Store them in a VectorStoreIndex
  const index = await VectorStoreIndex.fromDocuments(documents);
  // Query the index
  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({
    query: "What is the license grant in the TOS?",
  });
  // Output response
  console.log(response.toString());
}
main().catch(console.error);
Additional options can be set with the LlamaParseReader constructor:
resultType: can be set to markdown, text or json. Defaults to text.
language: primarily helps with OCR recognition. Defaults to en. See ../readers/type.ts for a list of supported languages.
parsingInstruction: can help with complicated document structures. See this LlamaIndex Blog Post for an example.
skipDiagonalText: set to true to ignore diagonal text.
invalidateCache: set to true to ignore the LlamaCloud cache. All documents are kept in the cache for 48 hours after the job was completed to avoid processing the same document twice. Can be useful for testing when trying to re-parse the same document with, e.g., a different parsingInstruction.
gpt4oMode: set to true to use GPT-4o to extract content.
gpt4oApiKey: set the GPT-4o API key. Optional. Lowers the cost of parsing by using your own API key. Your OpenAI account will be charged. Can also be set in the environment variable LLAMA_CLOUD_GPT4O_API_KEY.
numWorkers: as in the Python version, is set in SimpleDirectoryReader. Default is 1.
LlamaParse with SimpleDirectoryReader
Below is a full example of LlamaParse integrated in SimpleDirectoryReader with additional options.
import {
  LlamaParseReader,
  SimpleDirectoryReader,
  VectorStoreIndex,
} from "llamaindex";
async function main() {
  const reader = new SimpleDirectoryReader();
  const docs = await reader.loadData({
    directoryPath: "../data/parallel", // brk-2022.pdf split into 6 parts
    numWorkers: 2,
    // set LlamaParse as the default reader for all file types. Set apiKey here or in environment variable LLAMA_CLOUD_API_KEY
    overrideReader: new LlamaParseReader({
      language: "en",
      resultType: "markdown",
      parsingInstruction:
        "The provided files is Berkshire Hathaway's 2022 Annual Report. They contain figures, tables and raw data. Capture the data in a structured format. Mathematical equation should be put out as LATEX markdown (between $$).",
    }),
  });
  const index = await VectorStoreIndex.fromDocuments(docs);
  // Query the index
  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({
    query:
      "What is the general strategy for shareholder safety outlined in the report? Use a concrete example with numbers",
  });
  // Output response
  console.log(response.toString());
}
main().catch(console.error);
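The effect of numWorkers in the example above (six files, at most two processed at a time) can be pictured with a small bounded-concurrency helper. This is a sketch of the general technique only; mapWithWorkers is a hypothetical name, and nothing here claims to match how SimpleDirectoryReader schedules its requests internally.

```typescript
// Hypothetical helper (not a LlamaIndex.TS API): apply `task` to every item
// while keeping at most `numWorkers` tasks in flight at any moment.
async function mapWithWorkers<T, R>(
  items: readonly T[],
  numWorkers: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(numWorkers) || numWorkers < 1) {
    throw new RangeError("numWorkers must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item, shared by all workers

  // Each worker repeatedly claims the next item until none are left.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index] as T);
    }
  }

  // Start no more workers than there are items, then wait for all of them.
  const workerCount = Math.min(numWorkers, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // results keep the order of `items`
}
```

With numWorkers set to 1 this degenerates to the sequential mode described earlier. Larger values trade more simultaneous requests for shorter total wall-clock time.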