Eino: Document Parser Interface Guide
Introduction
Document Parser is a toolkit for parsing raw content into standard documents. It is not a standalone component; it’s used inside Document Loader. Parsers support:
- Parsing various formats (text, PDF, Markdown, etc.)
- Automatically selecting a parser by file extension (ExtParser)
- Adding metadata to parsed documents
Interfaces
Parser
Code:
eino/components/document/parser/interface.go
import (
	"context"
	"io"

	"github.com/cloudwego/eino/schema"
)

type Parser interface {
	Parse(ctx context.Context, reader io.Reader, opts ...Option) ([]*schema.Document, error)
}
Parse
- Purpose: parse raw content from a Reader into documents
- Params:
  - ctx: context
  - reader: raw content
  - opts: parsing options
- Returns:
  - []*schema.Document: parsed documents
  - error: any error encountered while parsing
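To make the contract concrete, the following is a minimal sketch of a type with a Parse-shaped method, written against the standard library only. `Document`, `Option`, `paragraphParser`, and `parseString` are hypothetical stand-ins invented for this illustration; they are not the Eino types, and the paragraph-splitting rule is just an example of returning several documents from one reader.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// Document is a hypothetical stand-in for schema.Document.
type Document struct {
	Content  string
	MetaData map[string]any
}

// Option is a hypothetical stand-in for parser.Option; this sketch ignores options.
type Option func()

// paragraphParser emits one Document per blank-line-separated paragraph.
type paragraphParser struct{}

// Parse reads everything from reader and splits it into paragraphs.
func (paragraphParser) Parse(ctx context.Context, reader io.Reader, opts ...Option) ([]*Document, error) {
	raw, err := io.ReadAll(reader)
	if err != nil {
		return nil, fmt.Errorf("read content: %w", err)
	}
	var docs []*Document
	for _, para := range strings.Split(string(raw), "\n\n") {
		if text := strings.TrimSpace(para); text != "" {
			docs = append(docs, &Document{Content: text})
		}
	}
	return docs, nil
}

// parseString is a convenience wrapper for in-memory input.
func parseString(s string) ([]*Document, error) {
	return paragraphParser{}.Parse(context.Background(), strings.NewReader(s))
}

func main() {
	docs, err := parseString("first paragraph\n\nsecond paragraph")
	if err != nil {
		panic(err)
	}
	for i, d := range docs {
		fmt.Printf("doc_%d: %s\n", i, d.Content)
	}
}
```

Because the input is an io.Reader, the same method works for files, HTTP bodies, and in-memory strings without the parser knowing where the bytes came from.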
Common Options
type Options struct {
	// URI of the document source
	URI string
	// ExtraMeta is merged into each parsed document’s metadata
	ExtraMeta map[string]any
}
Provided helpers:
- WithURI: sets the document URI (used by ExtParser to select a parser)
- WithExtraMeta: sets additional metadata
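As a rough illustration of how option helpers like these typically fold into an Options value, here is a simplified functional-options sketch. Only the `Options` fields come from this guide; the `Option` function type and the `resolve` helper are assumptions made for the example and are simpler than whatever the library uses internally.

```go
package main

import "fmt"

// Options mirrors the two fields documented above.
type Options struct {
	URI       string
	ExtraMeta map[string]any
}

// Option is a simplified, hypothetical option type: a function that mutates Options.
type Option func(*Options)

// WithURI records the document source URI.
func WithURI(uri string) Option {
	return func(o *Options) { o.URI = uri }
}

// WithExtraMeta records metadata to attach to parsed documents.
func WithExtraMeta(meta map[string]any) Option {
	return func(o *Options) { o.ExtraMeta = meta }
}

// resolve (hypothetical) applies opts in order on top of base, so later options win.
func resolve(base *Options, opts ...Option) *Options {
	if base == nil {
		base = &Options{}
	}
	for _, opt := range opts {
		if opt != nil {
			opt(base)
		}
	}
	return base
}

func main() {
	o := resolve(nil,
		WithURI("./testdata/test.html"),
		WithExtraMeta(map[string]any{"source": "local"}),
	)
	fmt.Println(o.URI, o.ExtraMeta["source"])
}
```

Starting from a base value lets a parser supply its own defaults and have callers override only the fields they care about.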
Built-in Parsers
TextParser
Basic text parser; uses input as content directly.
Code:
eino-examples/components/document/parser/textparser
import "github.com/cloudwego/eino/components/document/parser"
textParser := parser.TextParser{}
docs, _ := textParser.Parse(ctx, strings.NewReader("hello world"))
logs.Infof("text content: %v", docs[0].Content)
ExtParser
Selects parsers by file extension; falls back to a default parser.
Code:
eino-examples/components/document/parser/extparser
package main

import (
	"context"
	"os"

	"github.com/cloudwego/eino-ext/components/document/parser/html"
	"github.com/cloudwego/eino-ext/components/document/parser/pdf"
	"github.com/cloudwego/eino/components/document/parser"

	"github.com/cloudwego/eino-examples/internal/gptr"
	"github.com/cloudwego/eino-examples/internal/logs"
)

func main() {
	ctx := context.Background()

	// Errors are discarded for brevity; check them in real code.
	textParser := parser.TextParser{}
	htmlParser, _ := html.NewParser(ctx, &html.Config{Selector: gptr.Of("body")})
	pdfParser, _ := pdf.NewPDFParser(ctx, &pdf.Config{})

	// Map file extensions to parsers; anything else goes to the fallback.
	extParser, _ := parser.NewExtParser(ctx, &parser.ExtParserConfig{
		Parsers:        map[string]parser.Parser{".html": htmlParser, ".pdf": pdfParser},
		FallbackParser: textParser,
	})

	filePath := "./testdata/test.html"
	file, _ := os.Open(filePath)
	defer file.Close()

	// WithURI supplies the path whose extension drives parser selection.
	docs, _ := extParser.Parse(ctx, file,
		parser.WithURI(filePath),
		parser.WithExtraMeta(map[string]any{"source": "local"}),
	)
	for idx, doc := range docs {
		logs.Infof("doc_%v content: %v", idx, doc.Content)
	}
}
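The selection step can be pictured with a small standalone sketch. It assumes selection is an exact map lookup on the extension returned by `filepath.Ext`, with the fallback used when nothing matches; whether ExtParser normalizes case or handles compound extensions is not covered in this guide. `selectByExt` and the string parser names are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// selectByExt (hypothetical) returns the parser name registered for the
// extension of uri, or fallback when the extension is unknown or missing.
func selectByExt(uri string, registry map[string]string, fallback string) string {
	if name, ok := registry[filepath.Ext(uri)]; ok {
		return name
	}
	return fallback
}

func main() {
	registry := map[string]string{".html": "htmlParser", ".pdf": "pdfParser"}
	for _, uri := range []string{"./testdata/test.html", "report.pdf", "notes.md", ""} {
		fmt.Printf("%q -> %s\n", uri, selectByExt(uri, registry, "textParser"))
	}
}
```

Under this model an empty URI has no extension, so the fallback handles the content. That is why passing the URI option matters when a reader carries no file name of its own.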
Other Implementations
- PDF parser: Parser — PDF
- HTML parser: Parser — HTML
Using Parsers in Document Loader
Parsers are primarily used by Document Loader to parse loaded content.
File Loader Example
Code:
eino-ext/components/document/loader/file/examples/fileloader
import (
	"context"
	"log"

	"github.com/cloudwego/eino/components/document"
	"github.com/cloudwego/eino/components/document/parser"

	"github.com/cloudwego/eino-ext/components/document/loader/file"
)

ctx := context.Background()
loader, err := file.NewFileLoader(ctx, &file.FileLoaderConfig{
	UseNameAsID: true,
	Parser:      &parser.TextParser{}, // Or an ExtParser built with parser.NewExtParser
})
if err != nil {
	log.Fatalf("file.NewFileLoader failed: %v", err)
}

filePath := "../../testdata/test.md"
docs, err := loader.Load(ctx, document.Source{URI: filePath})
if err != nil {
	log.Fatalf("loader.Load failed: %v", err)
}

log.Printf("doc content: %v", docs[0].Content)
log.Printf("Extension: %s\n", docs[0].MetaData[file.MetaKeyExtension])
log.Printf("Source: %s\n", docs[0].MetaData[file.MetaKeySource])
Custom Parser Implementation
Options
type options struct {
	Encoding string
	MaxSize  int64
}

func WithEncoding(encoding string) parser.Option {
	return parser.WrapImplSpecificOptFn(func(o *options) { o.Encoding = encoding })
}

func WithMaxSize(size int64) parser.Option {
	return parser.WrapImplSpecificOptFn(func(o *options) { o.MaxSize = size })
}
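One way to picture how implementation-specific options can travel through a shared option type is the sketch below. It is not taken from the library source: `option`, `wrapImplSpecific`, `getImplSpecific`, and `otherOptions` are hypothetical names, and the mechanism shown (a typed setter stored behind `any` and recovered by type assertion) is an assumption about how such helpers could work.

```go
package main

import "fmt"

// option is a hypothetical carrier that hides a typed setter behind `any`.
type option struct{ implFn any }

// wrapImplSpecific stores a setter for some implementation's options struct T.
func wrapImplSpecific[T any](fn func(*T)) option {
	return option{implFn: fn}
}

// getImplSpecific applies only the setters written for T and skips the rest.
func getImplSpecific[T any](base *T, opts ...option) *T {
	if base == nil {
		base = new(T)
	}
	for _, opt := range opts {
		if fn, ok := opt.implFn.(func(*T)); ok {
			fn(base)
		}
	}
	return base
}

// options belongs to one parser implementation.
type options struct {
	Encoding string
	MaxSize  int64
}

// otherOptions belongs to a different, unrelated implementation.
type otherOptions struct{ Verbose bool }

func withEncoding(enc string) option {
	return wrapImplSpecific(func(o *options) { o.Encoding = enc })
}

func withVerbose() option {
	return wrapImplSpecific(func(o *otherOptions) { o.Verbose = true })
}

func main() {
	defaults := &options{Encoding: "utf-8", MaxSize: 1 << 20}
	o := getImplSpecific(defaults, withEncoding("gbk"), withVerbose())
	fmt.Println(o.Encoding, o.MaxSize)
}
```

The useful property of this design is that options meant for another implementation are ignored rather than rejected, so one option list can be passed through layers that each pick out what they understand.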
Example
Code:
eino-examples/components/document/parser/customparser/custom_parser.go
import (
	"context"
	"io"

	"github.com/cloudwego/eino/components/document/parser"
	"github.com/cloudwego/eino/schema"
)

type Config struct {
	DefaultEncoding string
	DefaultMaxSize  int64
}

type CustomParser struct {
	defaultEncoding string
	defaultMaxSize  int64
}

func NewCustomParser(config *Config) (*CustomParser, error) {
	return &CustomParser{
		defaultEncoding: config.DefaultEncoding,
		defaultMaxSize:  config.DefaultMaxSize,
	}, nil
}

func (p *CustomParser) Parse(ctx context.Context, reader io.Reader, opts ...parser.Option) ([]*schema.Document, error) {
	// 1. Resolve the common options (URI, ExtraMeta).
	commonOpts := parser.GetCommonOptions(&parser.Options{}, opts...)
	_ = commonOpts

	// 2. Resolve implementation-specific options, starting from the configured defaults.
	myOpts := &options{
		Encoding: p.defaultEncoding,
		MaxSize:  p.defaultMaxSize,
	}
	myOpts = parser.GetImplSpecificOptions(myOpts, opts...)
	_ = myOpts

	// 3. Parsing logic goes here; this placeholder returns a fixed document.
	return []*schema.Document{{
		Content: "Hello World",
	}}, nil
}
Notes
- Handle common options consistently via the shared abstraction
- Set and propagate metadata appropriately
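The custom parser example discards the resolved common options. As a sketch of what propagating metadata could look like, the helper below copies an ExtraMeta-style map into every parsed document. `Document` is a stand-in for schema.Document and `mergeExtraMeta` is a hypothetical helper, not a library function.

```go
package main

import "fmt"

// Document is a hypothetical stand-in for schema.Document.
type Document struct {
	Content  string
	MetaData map[string]any
}

// mergeExtraMeta (hypothetical) copies extra into each document's metadata.
// Every document keeps its own map, so changing one document later
// does not affect the others. Keys in extra overwrite existing keys.
func mergeExtraMeta(docs []*Document, extra map[string]any) {
	for _, doc := range docs {
		if doc.MetaData == nil {
			doc.MetaData = make(map[string]any, len(extra))
		}
		for k, v := range extra {
			doc.MetaData[k] = v
		}
	}
}

func main() {
	docs := []*Document{
		{Content: "Hello"},
		{Content: "World", MetaData: map[string]any{"page": 2}},
	}
	mergeExtraMeta(docs, map[string]any{"source": "local"})
	for i, d := range docs {
		fmt.Printf("doc_%d meta: %v\n", i, d.MetaData)
	}
}
```

Copying key by key, rather than assigning the caller's map directly, avoids several documents sharing one mutable map.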
Last modified
December 12, 2025