IRW Scraping Guide — Full Process for Adding a New Year

Overview

The Inlander Restaurant Week website (inlanderrestaurantweek.com) is a WordPress site built with Divi. Menu pages are partially JS-rendered, but WP-Super-Cache creates static HTML snapshots, and the Wayback Machine archives those. We scrape the archived static snapshots.


Step 1: Find Restaurant Slugs

Fetch the price listing page to get all slugs for that year:

curl -s "https://web.archive.org/web/TIMESTAMP/https://inlanderrestaurantweek.com/price/" \
  -o /tmp/irw-price-YEAR.html

Pick a timestamp close to the event (Wayback Machine format: YYYYMMDDHHmmss). The price listing page has portfolio items like:

<article class="et_pb_portfolio_item ... project_category_45">
  <a href="https://inlanderrestaurantweek.com/project/SLUG/">

Extract the slug from the href. The class project_category_(25|35|45) gives the authoritative price.

Important: Scrape the price listing page FIRST and save the slug→price map. Some restaurant pages have drink prices ($22, $33) that confuse the price parser.
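
As a minimal sketch of building that slug→price map, the snippet below runs one regex over listing markup in the shape shown above. The sample HTML, the slugs, and the variable names ($listingHtml, $pattern, $priceMap) are assumptions for illustration; in practice $listingHtml would be the contents of the saved /tmp/irw-price-YEAR.html file. The href pattern tolerates a prefix before /project/ because archived pages may have their links rewritten to point at web.archive.org.

```powershell
# Made-up listing markup in the shape described above.
$listingHtml = @'
<article class="et_pb_portfolio_item et_pb_grid_item project_category_45">
  <a href="https://inlanderrestaurantweek.com/project/example-bistro/">
</article>
<article class="et_pb_portfolio_item project_category_25">
  <a href="https://inlanderrestaurantweek.com/project/sample-cafe/">
</article>
'@

$opts = [System.Text.RegularExpressions.RegexOptions]::Singleline

# Group 1: price from the category class. Group 2: slug from the next /project/ href.
$pattern = '<article[^>]*project_category_(25|35|45)[^>]*>.*?href="[^"]*/project/([^/"]+)/?"'

$priceMap = @{}
foreach ($m in [regex]::Matches($listingHtml, $pattern, $opts)) {
    $priceMap[$m.Groups[2].Value] = [int]$m.Groups[1].Value
}

$priceMap   # slug -> authoritative price
```

Pairing each class with the first /project/ href that follows it is a simplification; if the real markup places other project links inside an article, the pattern would need tightening.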


Step 2: Scrape Each Restaurant Page

Use a PowerShell script. Write it to the project directory, then copy it to a local temp directory to run it (see PowerShell Script Execution Pattern).

Wayback Machine URL format:

https://web.archive.org/web/TIMESTAMP/https://inlanderrestaurantweek.com/project/SLUG/

Key fields to extract:

# Name
$nameM = [regex]::Match($html, '<title>(.+?) \| Inlander')

# Price (from page, but USE PRICE LISTING MAP - this can be wrong)
$priceM = [regex]::Match($html, '<strong>\$(\d+)</strong>')

# Cuisine
$cuisineM = [regex]::Match($html, 'CUISINE:\s*([A-Z][A-Za-z/ ]+?)(?:\s*</|\s*<)')
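# Title-case the captured group (the page prints the cuisine in upper case)
$cuisine = if ($cuisineM.Success) { (Get-Culture).TextInfo.ToTitleCase($cuisineM.Groups[1].Value.Trim().ToLower()) } else { $null }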

# Phone
$phoneM = [regex]::Match($html, '\((?:208|509)\) \d{3}-\d{4}')

# Hours
$hoursM = [regex]::Match($html, 'Menu served [^<]+')

# Area (match against known area keys, case-insensitive)
# $areaMap keys:
#   "AIRWAY HEIGHTS", "ATHOL", "COEUR D'ALENE", "POST FALLS", "HAYDEN",
#   "LIBERTY LAKE", "NORTH SPOKANE", "SOUTH SPOKANE", "SPOKANE VALLEY",
#   "WEST SPOKANE", "WORLEY", "DOWNTOWN"

Rate limiting: Add Start-Sleep -Milliseconds 2000 between requests. After an HTTP 429 (Too Many Requests) response, stop and wait 30+ minutes before trying again.
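
The fetch loop with this rate limiting might look like the sketch below. Get-WaybackUrl and Get-PageSet are hypothetical names, not functions from the original script, and the error handling assumes the failed request exposes its status through $_.Exception.Response.StatusCode, which is how Invoke-WebRequest errors are usually inspected.

```powershell
# Hypothetical helper: builds the Wayback Machine URL in the format shown above.
function Get-WaybackUrl([string]$timestamp, [string]$slug) {
    "https://web.archive.org/web/$timestamp/https://inlanderrestaurantweek.com/project/$slug/"
}

# Hypothetical helper: fetches each slug, pausing between requests,
# and stops the whole run on the first 429.
function Get-PageSet([string]$timestamp, [string[]]$slugs) {
    $pages = @{}
    foreach ($slug in $slugs) {
        try {
            $resp = Invoke-WebRequest -Uri (Get-WaybackUrl $timestamp $slug) -UseBasicParsing
            $pages[$slug] = $resp.Content
        } catch {
            $status = $null
            if ($_.Exception.Response) { $status = [int]$_.Exception.Response.StatusCode }
            if ($status -eq 429) {
                Write-Warning "429 at ${slug}: stopping. Wait 30+ minutes before resuming."
                break
            }
            Write-Warning "Failed ${slug}: $($_.Exception.Message)"
        }
        Start-Sleep -Milliseconds 2000   # pause before the next request
    }
    return $pages
}
```

Returning the pages collected so far means a run interrupted by a 429 can resume later with only the remaining slugs.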


Step 3: Parse Menu Courses

Course Block Extraction (Get-CourseBlock)

Two HTML layouts exist:

Layout A (most common): heading and items in SEPARATE et_pb_text_inner blocks

# Strategy 1: find content between this label and next label
$m = [regex]::Match($html, [regex]::Escape($label) + '(.+?)(?=' + [regex]::Escape($nextLabel) + ')', $opts)

# Strategy 3 (fallback): items in next et_pb_text_inner block
$im = [regex]::Match($sub, '(?s)et_pb_text_inner">(?!<h[123])(.+?)(?=et_pb_text_inner"><h|</div>\s*</div>\s*</div>\s*</div>\s*<div)', $opts)

Layout B (some restaurants — tavolata, durkins, table13, etc.): heading + items in SAME block

# Strategy 2: extract <p> tags after </h3> within same div
$sameDivM = [regex]::Match($sub, '(?s)</h[123]>\s*(<p.+?)(?=</div>)', $opts)
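
To make Layout B concrete, the sketch below runs the Strategy 2 pattern against a made-up same-block snippet and then splits the captured markup into one string per <p>. The sample HTML and the per-paragraph split are assumptions for illustration; the full Get-CourseBlock is not reproduced in this guide and may differ.

```powershell
$opts = [System.Text.RegularExpressions.RegexOptions]::Singleline

# Made-up Layout B block: heading and dishes share one et_pb_text_inner div.
$sub = '<div class="et_pb_text_inner"><h3>First Course</h3>' +
       '<p><strong>Soup</strong><br/>Tomato bisque</p>' +
       '<p><strong>Salad</strong><br/>House greens</p></div>'

# Strategy 2: everything from the first <p> after the heading up to the closing </div>.
$sameDivM = [regex]::Match($sub, '(?s)</h[123]>\s*(<p.+?)(?=</div>)', $opts)

# Split the captured block into the inner content of each <p>.
$items = @()
if ($sameDivM.Success) {
    $items = @([regex]::Matches($sameDivM.Groups[1].Value, '<p[^>]*>(.*?)</p>', $opts) |
        ForEach-Object { $_.Groups[1].Value })
}

$items   # one string per dish paragraph
```

Each element of $items is in the form that a per-paragraph dish parser such as Parse-Dish takes as input.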

Dish Parsing (Parse-Dish)

Three tag styles exist:

Style 1 (most restaurants): <strong> for name

<p><strong>Dish Name</strong><br/>Description text</p>

Style 2 (India House, Lebanon, Karma, others): <b> with <br/> before </b>

<p><b>Dish Name <br/></b><span>Description text</span></p>

Style 3 (1898): <b> + <strong> combination

<p><span><b>Part1</b></span><strong>Part2</strong> Description</p>

Multi-strategy parser (handles all three):

function Parse-Dish($pContent) {
    $opts = [System.Text.RegularExpressions.RegexOptions]::Singleline

    # Style 2: <b>Name <br/></b>
    $bWithBrM = [regex]::Match($pContent, '(?s)<b>(.*?)<br\s*/?>', $opts)
    if ($bWithBrM.Success) {
        $name = Get-CleanText $bWithBrM.Groups[1].Value
        if (Test-ValidDishName $name) {
            $desc = Get-CleanText ($pContent.Substring($bWithBrM.Index + $bWithBrM.Length))
            return [PSCustomObject]@{ name = $name; desc = $desc }
        }
    }

    # Style 3: <b>Part1</b>...<strong>Part2</strong>
    $bM = [regex]::Match($pContent, '(?s)<b>(.*?)</b>', $opts)
    if ($bM.Success) {
        $namePart = Get-CleanText $bM.Groups[1].Value
        if (Test-ValidDishName $namePart) {
            $afterB = $pContent.Substring($bM.Index + $bM.Length)
            $sM2 = [regex]::Match($afterB, '(?s)^[^<]*<strong>(.*?)</strong>(.*)', $opts)
            if ($sM2.Success) {
                $p2 = Get-CleanText $sM2.Groups[1].Value
                if (-not (Test-DietaryTag $p2) -and $p2.Length -ge 2) {
                    return [PSCustomObject]@{ name = "$namePart $p2".Trim(); desc = Get-CleanText $sM2.Groups[2].Value }
                }
            }
            return [PSCustomObject]@{ name = $namePart; desc = Get-CleanText $afterB }
        }
    }

    # Style 1: <strong>Name</strong>
    $sM = [regex]::Match($pContent, '(?s)<strong>(.*?)</strong>', $opts)
    if ($sM.Success) {
        $name = Get-CleanText $sM.Groups[1].Value
        if (-not (Test-ValidDishName $name)) { return $null }
        $afterBr = ''
        if ($pContent -match '(?s)<br\s*/?>(.*?)$') { $afterBr = $matches[1] }
        else { $am = [regex]::Match($pContent, '(?s)</strong>(.*?)$', $opts); if ($am.Success) { $afterBr = $am.Groups[1].Value } }
        return [PSCustomObject]@{ name = $name; desc = Get-CleanText $afterBr }
    }
    return $null
}

function Test-ValidDishName($name) {
    $name.Length -ge 3 -and $name.Length -le 80 -and
    $name -notmatch '^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$' -and
    $name -notmatch '^[A-Z]{1,3}:'
}

function Test-DietaryTag($str) {
    $str -match '^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$'
}

HTML Cleanup

function Get-CleanText($rawHtml) {
    $t = $rawHtml -replace '<[^>]+>', ' '
    # Decode &amp; last so that text such as "&amp;lt;" is not decoded twice
    $t = $t -replace '&#039;', "'" -replace '&quot;', '"'
    $t = $t -replace '&lt;', '<' -replace '&gt;', '>' -replace '&nbsp;', ' '
    $t = $t -replace '&#8211;', '-' -replace '&#8212;', '-'
    $t = $t -replace '&amp;', '&'
    ($t -replace '\s+', ' ').Trim()
}
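
The hand-written entity list only covers the entities seen so far. One possible alternative, sketched below, delegates decoding to .NET's System.Net.WebUtility.HtmlDecode, which handles named and numeric entities in a single pass. The function name ConvertTo-PlainText is hypothetical and not part of the original script; the dash normalization is kept so the output matches the hyphen mapping above.

```powershell
# Hypothetical alternative to Get-CleanText.
function ConvertTo-PlainText([string]$rawHtml) {
    $t = $rawHtml -replace '<[^>]+>', ' '          # strip tags first, as Get-CleanText does
    $t = [System.Net.WebUtility]::HtmlDecode($t)   # decode named and numeric entities once
    $t = $t -replace '[\u2013\u2014]', '-'         # en dash and em dash to a plain hyphen
    ($t -replace '\s+', ' ').Trim()                # \s also matches the no-break space from &nbsp;
}

ConvertTo-PlainText '<b>Mac &amp; Cheese</b>&nbsp;&#8211; hot'
```

Entities that the manual list does not mention, such as &#8217; (a typographic apostrophe), are decoded as well, so any downstream comparison against plain ASCII names may need its own normalization.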

Step 4: Fix Prices

After scraping, apply authoritative prices from the price listing page:

  • Parse project_category_(25|35|45) CSS class from portfolio items
  • Match slug from adjacent href attribute
  • Build a hashtable and apply to all entries

Common gotcha: Restaurant pages may show $22 (wine), $33 (lunch) — these are NOT the event price.
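
Applying the map can be sketched as follows. The sample entries and the price property name are assumptions (only slug appears elsewhere in this guide), and $priceMap stands in for the hashtable built from the price listing page.

```powershell
# Made-up sample data; in practice these come from the listing page and the scrape.
$priceMap = @{ 'example-bistro' = 45; 'sample-cafe' = 25 }
$data = @(
    [PSCustomObject]@{ slug = 'example-bistro'; price = 22 }   # page regex picked up a drink price
    [PSCustomObject]@{ slug = 'sample-cafe';    price = 25 }
    [PSCustomObject]@{ slug = 'not-listed';     price = 35 }
)

# Overwrite every page-derived price with the listing price; collect slugs with no listing entry.
$unmatched = @()
foreach ($entry in $data) {
    if ($priceMap.ContainsKey($entry.slug)) {
        $entry.price = $priceMap[$entry.slug]
    } else {
        $unmatched += $entry.slug
    }
}

$unmatched   # slugs whose price could not be confirmed against the listing
```

Reporting unmatched slugs instead of silently keeping the page price makes it easier to spot restaurants missing from the chosen listing snapshot.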


Step 5: Recover Missing Restaurants

If a restaurant has 0/0/0 courses:

  1. Try alternate Wayback timestamps: 20250401000000, 20250415000000, 20250501000000, 20250601000000
  2. Check if page uses Layout B (same-block) — add Strategy 2 to course block extractor
  3. Check if page uses <b> tags instead of <strong> for dish names

Known JS-only restaurants (no static cache recoverable for 2025): heritage, kismet, littlenoodle, macdaddys, purgatory, redtail, republickitchen, republicpi, vicinopizza
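
The alternate-timestamp retry from item 1 can be sketched as a loop that stops at the first snapshot yielding any courses. Find-WorkingSnapshot, its parameters, and the courseCount property are all hypothetical; the -Scrape script block stands in for whatever function fetches and parses one restaurant page.

```powershell
# Hypothetical helper: tries each timestamp in order until the scrape returns courses.
function Find-WorkingSnapshot {
    param(
        [string]$Slug,
        [string[]]$Timestamps,
        [scriptblock]$Scrape,     # assumed to take (timestamp, slug) and return an object with courseCount
        [int]$DelayMs = 2000      # same pause as the main scrape loop
    )
    foreach ($ts in $Timestamps) {
        $result = & $Scrape $ts $Slug
        if ($result -and $result.courseCount -gt 0) {
            return [PSCustomObject]@{ timestamp = $ts; result = $result }
        }
        Start-Sleep -Milliseconds $DelayMs
    }
    return $null   # no snapshot had a usable static page
}

# Usage with a stand-in scraper that only "succeeds" for one timestamp.
$fakeScrape = {
    param($ts, $slug)
    if ($ts -eq '20250501000000') { [PSCustomObject]@{ courseCount = 9 } }
}
$found = Find-WorkingSnapshot -Slug 'example-bistro' -Scrape $fakeScrape -DelayMs 0 `
    -Timestamps '20250401000000', '20250415000000', '20250501000000', '20250601000000'
```

A $null result after all timestamps suggests the page belongs on the JS-only list rather than needing a parser fix.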


Step 6: Output and Validation

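# Save as UTF-8 without a BOM (important: special characters in restaurant names).
# [System.Text.Encoding]::UTF8 would prepend a BOM, so build the encoding explicitly.
# Use an absolute $outPath: .NET resolves relative paths against the process
# working directory, not the current PowerShell location.
$json = $data | ConvertTo-Json -Depth 10
[System.IO.File]::WriteAllText($outPath, $json, (New-Object System.Text.UTF8Encoding $false))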

# Validate: list any restaurant not at 3/3/3
$data | Where-Object {
    $_.menu.courses.'First Course'.Count -ne 3 -or
    $_.menu.courses.'Second Course'.Count -ne 3 -or
    $_.menu.courses.'Third Course'.Count -ne 3
} | ForEach-Object {
    "$($_.slug): $($_.menu.courses.'First Course'.Count)/$($_.menu.courses.'Second Course'.Count)/$($_.menu.courses.'Third Course'.Count)"
}

PowerShell Script Execution Pattern (REQUIRED)

# Write script to project dir (via Write tool or Edit)
# Then in bash:
cp "//WinServ-20-3.chns.local/Profiles/derekc/Documents/Coding Projects/.../script.ps1" \
   "/c/Users/derekc.CHNSLocal/AppData/Local/Temp/script.ps1"
powershell.exe -ExecutionPolicy Bypass -File "C:\Users\derekc.CHNSLocal\AppData\Local\Temp\script.ps1"

Never use powershell -Command "..." for multi-line scripts; the escaping is unreliable. Never run a .ps1 directly from the \\WinServ-20-3... UNC path; execution policy blocks it.


PowerShell Gotchas

  • "$slug: text" is a parse error: a colon directly after a variable name is read as a scope or drive qualifier. Use "${slug}: text"
  • Function names like Is-X, Decode-X, Parse-X get PSScriptAnalyzer warnings (unapproved verbs) but work fine
  • return ,$array (comma prefix) forces PowerShell to return an array, not unroll it
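  • [System.IO.File]::WriteAllText(path, json, encoding): use this, not Out-File, to control the encoding explicitly. [System.Text.Encoding]::UTF8 writes a BOM; pass (New-Object System.Text.UTF8Encoding $false) for BOM-free output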
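
The interpolation and array-return gotchas are easy to check in a scratch session. The sketch below uses made-up names (Get-Unrolled, Get-Wrapped) purely for illustration.

```powershell
$slug = 'example-bistro'

# "$slug: text" would be a parse error here, so braces mark where the name ends.
$line = "${slug}: 3/3/3"

# A single-element array is unrolled by the pipeline unless wrapped with a leading comma.
function Get-Unrolled { $a = @('only'); return $a }
function Get-Wrapped  { $a = @('only'); return ,$a }

$unrolledIsArray = (Get-Unrolled) -is [array]   # caller receives a plain string
$wrappedIsArray  = (Get-Wrapped)  -is [array]   # caller receives a one-element array
```

The difference matters most for the .Count checks in the validation step, where a course with exactly one dish would otherwise come back as a bare object.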