Working with PDFs with PowerShell in “Run .Net Script” Activities

This month I helped our own administration team automate their tasks with PDF documents. The tasks were splitting PDF documents based on their content and renaming the resulting PDF files based on their content.

With the Orchestrator “Run .Net Script” Activity you can execute Windows PowerShell scripts, use the Published Data from previously completed Activities, and put the results on the Databus for the Activities that follow.

To work with the PDF files, the PowerShell scripts use itextsharp.dll. I found it by searching the Internet for “itextsharp-all-5.5.10.zip”.

With the following code you can extract the text from a PDF file:

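Add-Type -Path 'C:\orchestrator\itextsharp\itextsharp.dll'
$PDFfile = Get-Item -Path "C:\temp\Stefan\file.pdf"
# Open the file explicitly so the reader can be closed again afterwards
$reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $PDFfile.FullName
# Extract the text of page 1
$Extract = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, 1)
$reader.Close()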

$Extract now contains the text of the first page of the PDF document.
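GetTextFromPage reads one page at a time, so a document with several pages needs a loop. Here is a minimal sketch, assuming the same DLL location as above. The function name Get-PdfPageText and its parameters are made up for illustration and are not part of iTextSharp.

```powershell
function Get-PdfPageText {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        # Assumed DLL location, taken from the example above.
        [string]$DllPath = 'C:\orchestrator\itextsharp\itextsharp.dll'
    )
    Add-Type -Path $DllPath
    $Reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $Path
    try {
        # Emit the text of each page, one string per page.
        foreach ($Page in 1..$Reader.NumberOfPages) {
            [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($Reader, $Page)
        }
    }
    finally {
        # Close the reader when done, even if extraction fails.
        $Reader.Close()
    }
}
```

A call such as @(Get-PdfPageText -Path 'C:\temp\Stefan\file.pdf') should then return one string per page. The @( ) wrapper keeps the result an array even for a single-page file.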

In the output of $Extract I noticed that the customer name sits on a line of its own, between the name and the address of the company I work for.

I defined a regular expression with a named capturing group, (?<Customer>.*), to get the customer. For example:

$SearchPattern = 'VAS Value Added Solutions GmbH.*[\r\n](?<Customer>.*)[\r\n]Hammfelddamm 7, 41460 Neuss'
if ($Extract -match $SearchPattern) {
    $Customer = $Matches.Customer
}
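To see how the named capturing group behaves without a PDF at hand, here is a minimal sketch that runs the same pattern against a hand-written string. The sample text and the customer name "Contoso AG" are made up for illustration. Real output from GetTextFromPage may contain different line breaks.

```powershell
# Hypothetical sample text standing in for the output of GetTextFromPage.
$SampleText = "VAS Value Added Solutions GmbH`nContoso AG`nHammfelddamm 7, 41460 Neuss"

$SearchPattern = 'VAS Value Added Solutions GmbH.*[\r\n](?<Customer>.*)[\r\n]Hammfelddamm 7, 41460 Neuss'

# Reset the variable so a failed match cannot leave a stale value behind.
$Customer = $null
if ($SampleText -match $SearchPattern) {
    # -match fills $Matches; named groups are available as keys.
    $Customer = $Matches['Customer']
}
$Customer
```

With this sample, $Customer ends up holding the middle line, Contoso AG. If the pattern does not match, $Customer stays $null, which is easy to test for before building a file name from it.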

Here’s the complete script I use in the Orchestrator “Run .Net Script” Activity to parse information from a PDF document:

$OutputFolder = 'C:\temp\output' # Folder with the PDF documents to scan

Add-Type -Path 'C:\orchestrator\itextsharp\itextsharp.dll'

# Result arrays, one entry per PDF file
$MainSubject = @()
$MainCustomerName = @()
$MainCustomerID = @()
$MainNewFileName = @()
$MainFilePath = @()

[Array]$Files = Get-ChildItem -Path $OutputFolder -Filter '*.pdf'

foreach ($File in $Files)
{
    # Reset per-file values so a failed match cannot reuse data from the previous file
    $CustomerID = ''
    $Subject = ''
    $Customer = ''

    # Extract text from page 1 of the PDF
    $PdfReader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $File.FullName
    $Extract = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($PdfReader, 1)
    $PdfReader.Close()

    # Get ID of customer
    $SearchPattern = 'Seite: 1[\r\n]Kunden Nr\.: (?<CustomerID>\d{5})'
    if ($Extract -match $SearchPattern) { $CustomerID = $Matches.CustomerID }
    $MainCustomerID += $CustomerID

    # Get subject
    $SearchPattern = 'Datum:.*[\r\n](?<Subject>.*)[\r\n]Sehr geehrte Damen und Herren'
    if ($Extract -match $SearchPattern) { $Subject = $Matches.Subject }
    # Replace characters that are invalid in file names
    foreach ($Char in [System.IO.Path]::GetInvalidFileNameChars()) { $Subject = $Subject.Replace([string]$Char, '.') }
    $MainSubject += $Subject

    # Get name of customer
    $SearchPattern = 'VAS Value Added Solutions GmbH.*[\r\n](?<Customer>.*)[\r\n]Hammfelddamm 7, 41460 Neuss'
    if ($Extract -match $SearchPattern) { $Customer = $Matches.Customer }
    # Replace characters that are invalid in file names
    foreach ($Char in [System.IO.Path]::GetInvalidFileNameChars()) { $Customer = $Customer.Replace([string]$Char, '.') }
    $MainCustomerName += $Customer

    $MainNewFileName += $Subject + '_' + $Customer + '.pdf'
    $MainFilePath += $File.FullName
}
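The invalid-character replacement appears twice in the script, once for the subject and once for the customer, so it could be pulled into a small helper. This is only a sketch, not part of the script I run. The function name ConvertTo-SafeFileName, its parameters and the sample values are made up for illustration.

```powershell
function ConvertTo-SafeFileName {
    param(
        [Parameter(Mandatory = $true)][string]$Name,
        [string]$Replacement = '.'
    )
    # Replace every character that the current platform forbids in file names.
    foreach ($Char in [System.IO.Path]::GetInvalidFileNameChars()) {
        $Name = $Name.Replace([string]$Char, $Replacement)
    }
    return $Name
}

# Hypothetical subject and customer values.
$Subject = ConvertTo-SafeFileName -Name 'Reminder 10/2017'
$Customer = ConvertTo-SafeFileName -Name 'Contoso AG'
$NewFileName = '{0}_{1}.pdf' -f $Subject, $Customer
$NewFileName
```

The slash in the sample subject is replaced by a dot, so the resulting name is Reminder 10.2017_Contoso AG.pdf. Which other characters get replaced depends on what GetInvalidFileNameChars returns on the machine running the script.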

Here is the complete script to split a PDF document based on its content. The regular expression ‘Seite: 1[\r\n]Kunden Nr\.: (?<CustomerID>\d{5})’ (“Seite” is German for page, “Kunden Nr.” for customer number) identifies the first page of each new document:

$InputPath = 'C:\temp\Stefan'
$FileName = 'FileToSplit.pdf'
$OutputFolder = 'C:\temp\output' # Assumed: the folder that the parsing script above scans
Add-Type -Path 'C:\orchestrator\itextsharp\itextsharp.dll'

$PdfFile = Join-Path -Path $InputPath -ChildPath $FileName

$PatternNewPage = 'Seite: 1[\r\n]Kunden Nr\.: (?<CustomerID>\d{5})'

$PdfReader = [iTextSharp.text.pdf.PdfReader]::new($PdfFile)

$CustomerStack = [System.Collections.Stack]::new()

# Map out the PDF file: remember the start page of every customer document.
foreach ($Page in 1..($PdfReader.NumberOfPages)) {
    [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($PdfReader, $Page) |
        Where-Object { $_ -match $PatternNewPage } |
        ForEach-Object {
            $CustomerStack.Push([PSCustomObject]@{
                Customer_Id = $Matches.CustomerID
                StartPage   = $Page
            })
        }
}

# Extract the pages and save the files, working from the last document to the first
$LastPage = $PdfReader.NumberOfPages
while ($CustomerStack.Count -gt 0) {
    $Current = $CustomerStack.Pop()

    $StartPage = $Current.StartPage
    $EndPage = $LastPage

    $Document = [iTextSharp.text.Document]::new($PdfReader.GetPageSizeWithRotation($StartPage))
    $TargetMemoryStream = [System.IO.MemoryStream]::new()
    $PdfCopy = [iTextSharp.text.pdf.PdfSmartCopy]::new($Document, $TargetMemoryStream)

    $Document.Open()
    foreach ($Page in $StartPage..$EndPage) {
        $PdfCopy.AddPage($PdfCopy.GetImportedPage($PdfReader, $Page))
    }
    $Document.Close()

    $NewFileName = 'Reminder - {0}.pdf' -f $Current.Customer_Id
    $NewFileFullName = [System.IO.Path]::Combine($OutputFolder, $NewFileName)
    [System.IO.File]::WriteAllBytes($NewFileFullName, $TargetMemoryStream.ToArray())

    # The document before this one ends on the page before this one starts
    $LastPage = $Current.StartPage - 1
}
$PdfReader.Close()
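The splitting logic works backwards. Because the stack returns the most recently found start page first, each document ends one page before the start of the document processed just before it. The sketch below traces only that page arithmetic, with made-up start pages and no PDF library involved.

```powershell
# Hypothetical input: an 8-page PDF where new documents start on pages 1, 4 and 6.
$NumberOfPages = 8
$CustomerStack = [System.Collections.Stack]::new()
foreach ($StartPage in 1, 4, 6) {
    $CustomerStack.Push($StartPage)
}

$Ranges = @()
$LastPage = $NumberOfPages
while ($CustomerStack.Count -gt 0) {
    $StartPage = $CustomerStack.Pop()      # the last pushed start page comes off first
    $Ranges += '{0}-{1}' -f $StartPage, $LastPage
    $LastPage = $StartPage - 1             # the previous document ends just before this one
}
$Ranges
```

For this input the ranges come out as 6-8, 4-5 and 1-3, in that order. Pages that precede the first matching start page would not be written to any file, which is worth keeping in mind if a PDF begins with a cover sheet.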
