# IRW Scraping Guide: Full Process for Adding a New Year

## Overview

The Inlander Restaurant Week website (inlanderrestaurantweek.com) runs on WordPress with the Divi theme. Menu pages are partially JS-rendered, but WP-Super-Cache creates static HTML snapshots that the Wayback Machine archives. We scrape those static snapshots.

---

## Step 1: Find Restaurant Slugs

Fetch the price listing page to get all slugs for that year:

```bash
curl -s "https://web.archive.org/web/TIMESTAMP/https://inlanderrestaurantweek.com/price/" \
  -o /tmp/irw-price-YEAR.html
```

Pick a timestamp close to the event (Wayback Machine format: YYYYMMDDHHmmss).

The price listing page is made up of portfolio items. Extract the slug from each item's `href`. The item's `project_category_(25|35|45)` class gives the authoritative price.

**Important**: Scrape the price listing page FIRST and save the slug→price map. Some restaurant pages list drink prices ($22, $33) that confuse the price parser.

---

## Step 2: Scrape Each Restaurant Page

Use a PowerShell script (written to the project dir, then copied to local temp to run).

**Wayback Machine URL format**:

```
https://web.archive.org/web/TIMESTAMP/https://inlanderrestaurantweek.com/project/SLUG/
```

**Key fields to extract**:

```powershell
# Name
$nameM = [regex]::Match($html, '(.+?) \| Inlander')

# Price (from page, but USE PRICE LISTING MAP - this can be wrong)
$priceM = [regex]::Match($html, '<strong>\$(\d+)</strong>')

# Cuisine (captured in upper case on the page, converted to title case)
$cuisineM = [regex]::Match($html, 'CUISINE:\s*([A-Z][A-Za-z/ ]+?)(?:\s*</|\s*<)')
$cuisine = (Get-Culture).TextInfo.ToTitleCase($cuisineM.Groups[1].Value.ToLower())

# Phone
$phoneM = [regex]::Match($html, '\((?:208|509)\) \d{3}-\d{4}')

# Hours
$hoursM = [regex]::Match($html, 'Menu served [^<]+')

# Area (match against known area keys, case-insensitive)
# $areaMap keys:
#   "AIRWAY HEIGHTS","ATHOL","COEUR D'ALENE","POST FALLS","HAYDEN",
#   "LIBERTY LAKE","NORTH SPOKANE","SOUTH SPOKANE","SPOKANE VALLEY",
#   "WEST SPOKANE","WORLEY","DOWNTOWN"
```

**Rate limiting**: Add `Start-Sleep -Milliseconds 2000` between requests. After a 429, stop and wait 30+ minutes before trying again.
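
The request loop implied by the URL format and the rate-limiting rule above might look like the following minimal sketch. `Get-ArchivedPages` and its `-Fetch` parameter are hypothetical names introduced here for illustration, not part of the original script. The injectable fetch exists only so the loop can be tried without hitting the Wayback Machine, and the sketch assumes `Set-StrictMode` is off.

```powershell
# Hypothetical helper (not from the original script): fetch each slug's
# archived page with a fixed delay, and stop at the first HTTP 429.
function Get-ArchivedPages {
    param(
        [string[]]$Slugs,
        [string]$Timestamp,
        [int]$DelayMs = 2000,
        # Default fetcher downloads the page; replace it to dry-run the loop.
        [scriptblock]$Fetch = { param($url) (Invoke-WebRequest -Uri $url -UseBasicParsing).Content }
    )
    $pages = [ordered]@{}
    foreach ($slug in $Slugs) {
        $url = "https://web.archive.org/web/$Timestamp/https://inlanderrestaurantweek.com/project/$slug/"
        try {
            $pages[$slug] = & $Fetch $url
        } catch {
            # Response is absent for non-HTTP failures, so this falls back to 0.
            $status = [int]$_.Exception.Response.StatusCode
            if ($status -eq 429 -or $_.Exception.Message -match '\b429\b') {
                Write-Warning "429 received at ${slug}; stop and wait 30+ minutes before retrying."
                break
            }
            Write-Warning "Failed to fetch ${slug}: $($_.Exception.Message)"
        }
        Start-Sleep -Milliseconds $DelayMs
    }
    return $pages
}
```

Calling `Get-ArchivedPages -Slugs $slugs -Timestamp 'TIMESTAMP'` returns an ordered dictionary of slug to HTML. Slugs that failed or came after a 429 are simply missing from it, which makes it easy to see where to resume.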

---

## Step 3: Parse Menu Courses

### Course Block Extraction (`Get-CourseBlock`)

Two HTML layouts exist:

**Layout A** (most common): heading and items in SEPARATE `et_pb_text_inner` blocks

```powershell
# Strategy 1: find content between this label and next label
$m = [regex]::Match($html, [regex]::Escape($label) + '(.+?)(?=' + [regex]::Escape($nextLabel) + ')', $opts)

# Strategy 3 (fallback): items in next et_pb_text_inner block
$im = [regex]::Match($sub, '(?s)et_pb_text_inner">(?!<h[123])(.+?)(?=et_pb_text_inner"><h|</div>\s*</div>\s*</div>\s*</div>\s*<div)', $opts)
```

**Layout B** (some restaurants: tavolata, durkins, table13, etc.): heading + items in the SAME block

```powershell
# Strategy 2: extract <p> tags after </h3> within same div
$sameDivM = [regex]::Match($sub, '(?s)</h[123]>\s*(<p.+?)(?=</div>)', $opts)
```

### Dish Parsing (`Parse-Dish`)

Three tag styles exist:

**Style 1** (most restaurants): `<strong>` for name

```html
<p><strong>Dish Name</strong><br/>Description text</p>
```

**Style 2** (India House, Lebanon, Karma, others): `<b>` with `<br/>` before `</b>`

```html
<p><b>Dish Name <br/></b><span>Description text</span></p>
```

**Style 3** (1898): `<b>` + `<strong>` combination

```html
<p><span><b>Part1</b></span><strong>Part2</strong> Description</p>
```

**Multi-strategy parser** (handles all three):

```powershell
function Parse-Dish($pContent) {
    $opts = [System.Text.RegularExpressions.RegexOptions]::Singleline

    # Style 2: <b>Name <br/></b>
    $bWithBrM = [regex]::Match($pContent, '(?s)<b>(.*?)<br\s*/?>', $opts)
    if ($bWithBrM.Success) {
        $name = Get-CleanText $bWithBrM.Groups[1].Value
        if (Test-ValidDishName $name) {
            # Everything after the <br/> is the description
            $desc = Get-CleanText ($pContent.Substring($bWithBrM.Index + $bWithBrM.Length))
            return [PSCustomObject]@{ name = $name; desc = $desc }
        }
    }

    # Style 3: <b>Part1</b>...<strong>Part2</strong>
    $bM = [regex]::Match($pContent, '(?s)<b>(.*?)</b>', $opts)
    if ($bM.Success) {
        $namePart = Get-CleanText $bM.Groups[1].Value
        if (Test-ValidDishName $namePart) {
            $afterB = $pContent.Substring($bM.Index + $bM.Length)
            $sM2 = [regex]::Match($afterB, '(?s)^[^<]*<strong>(.*?)</strong>(.*)', $opts)
            if ($sM2.Success) {
                $p2 = Get-CleanText $sM2.Groups[1].Value
                # Join the two name parts unless the <strong> is just a dietary tag
                if (-not (Test-DietaryTag $p2) -and $p2.Length -ge 2) {
                    return [PSCustomObject]@{ name = "$namePart $p2".Trim(); desc = Get-CleanText $sM2.Groups[2].Value }
                }
            }
            return [PSCustomObject]@{ name = $namePart; desc = Get-CleanText $afterB }
        }
    }

    # Style 1: <strong>Name</strong>
    $sM = [regex]::Match($pContent, '(?s)<strong>(.*?)</strong>', $opts)
    if ($sM.Success) {
        $name = Get-CleanText $sM.Groups[1].Value
        if (-not (Test-ValidDishName $name)) { return $null }
        $afterBr = ''
        if ($pContent -match '(?s)<br\s*/?>(.*?)$') {
            $afterBr = $matches[1]
        } else {
            # No <br/>: fall back to whatever follows </strong>
            $am = [regex]::Match($pContent, '(?s)</strong>(.*?)$', $opts)
            if ($am.Success) { $afterBr = $am.Groups[1].Value }
        }
        return [PSCustomObject]@{ name = $name; desc = Get-CleanText $afterBr }
    }

    return $null
}

function Test-ValidDishName($name) {
    $name.Length -ge 3 -and
    $name.Length -le 80 -and
    $name -notmatch '^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$' -and
    $name -notmatch '^[A-Z]{1,3}:'
}

function Test-DietaryTag($str) {
    $str -match '^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$'
}
```

### HTML Cleanup

```powershell
function Get-CleanText($rawHtml) {
    # Strip tags, then decode HTML entities (&amp;, &#39;, &quot;, &lt;, &gt;, &nbsp;, ...)
    $t = $rawHtml -replace '<[^>]+>', ' '
    $t = [System.Net.WebUtility]::HtmlDecode($t)
    # Normalize non-breaking spaces and en/em dashes
    $t = $t -replace '\u00A0', ' ' -replace '[\u2013\u2014]', '-'
    ($t -replace '\s+', ' ').Trim()
}
```

---

## Step 4: Fix Prices

After scraping, apply authoritative prices from the price listing page:

- Parse the `project_category_(25|35|45)` CSS class from portfolio items
- Match the slug from the adjacent `href` attribute
- Build a hashtable and apply it to all entries

Common gotcha: restaurant pages may show $22 (wine) or $33 (lunch). These are NOT the event price.

---

## Step 5: Recover Missing Restaurants

If a restaurant has 0/0/0 courses:

1. Try alternate Wayback timestamps: `20250401000000`, `20250415000000`, `20250501000000`, `20250601000000`
2. Check if the page uses Layout B (same-block); if so, add Strategy 2 to the course block extractor
3. Check if the page uses `<b>` tags instead of `<strong>` for dish names

**Known JS-only restaurants** (no static cache recoverable for 2025): heritage, kismet, littlenoodle, macdaddys, purgatory, redtail, republickitchen, republicpi, vicinopizza

---

## Step 6: Output and Validation

```powershell
# Save as UTF-8 (important: special characters in restaurant names)
$json = $data | ConvertTo-Json -Depth 10
[System.IO.File]::WriteAllText($outPath, $json, [System.Text.Encoding]::UTF8)

# Validate: list any restaurant not at 3/3/3
$data |
    Where-Object {
        $_.menu.courses.'First Course'.Count -ne 3 -or
        $_.menu.courses.'Second Course'.Count -ne 3 -or
        $_.menu.courses.'Third Course'.Count -ne 3
    } |
    ForEach-Object {
        "$($_.slug): $($_.menu.courses.'First Course'.Count)/$($_.menu.courses.'Second Course'.Count)/$($_.menu.courses.'Third Course'.Count)"
    }
```

---

## PowerShell Script Execution Pattern (REQUIRED)

```bash
# Write script to project dir (via Write tool or Edit)
# Then in bash:
cp "//WinServ-20-3.chns.local/Profiles/derekc/Documents/Coding Projects/.../script.ps1" \
   "/c/Users/derekc.CHNSLocal/AppData/Local/Temp/script.ps1"
powershell.exe -ExecutionPolicy Bypass -File "C:\Users\derekc.CHNSLocal\AppData\Local\Temp\script.ps1"
```

**Never** use `powershell -Command "..."` for multi-line scripts: escaping is unreliable.

**Never** try to run a `.ps1` directly from the `\\WinServ-20-3...` UNC path: execution policy blocks it.
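
---

## PowerShell Gotchas

- `"$slug: text"` fails because the `:` right after the variable name is parsed as part of the variable reference. Use `"${slug}: text"`.
- Function names like `Is-X`, `Decode-X`, `Parse-X` get PSScriptAnalyzer warnings (unapproved verbs) but work fine.
- `return ,$array` (comma prefix) forces PowerShell to return the array as a single object instead of unrolling it.
- `[System.IO.File]::WriteAllText(path, json, UTF8)`: use this, not `Out-File`, to avoid encoding issues (in Windows PowerShell, `Out-File` defaults to UTF-16). Note that `[System.Text.Encoding]::UTF8` still writes a BOM. If the consumer needs BOM-less output, pass `New-Object System.Text.UTF8Encoding($false)` instead.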
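
The gotchas above can be checked with a short, self-contained sketch. The function names, the sample slug string, and the temp file name are made up for the demonstration and do not come from the scraper itself.

```powershell
# 1. Delimit the variable name with ${} when a colon follows it.
$slug = 'tavolata'
$line = "${slug}: 3/3/3"

# 2. Without the comma prefix, a one-element array unrolls to a bare string.
function Get-Unrolled { return @('only') }
function Get-Wrapped  { return ,@('only') }
$a = Get-Unrolled   # arrives as a [string]
$b = Get-Wrapped    # arrives as an array with one element

# 3. Write UTF-8 without a BOM by passing an explicit UTF8Encoding($false).
$outPath = Join-Path ([System.IO.Path]::GetTempPath()) 'irw-gotcha-demo.json'
$text    = '{"name":"Caf' + [char]0xE9 + '"}'   # e-acute built from its code point
$noBom   = New-Object System.Text.UTF8Encoding($false)
[System.IO.File]::WriteAllText($outPath, $text, $noBom)

# The first byte is '{' (0x7B), not the 0xEF that starts a UTF-8 BOM.
$bytes = [System.IO.File]::ReadAllBytes($outPath)
Remove-Item $outPath
```

Swapping `$noBom` for `[System.Text.Encoding]::UTF8` in step 3 makes the file start with the three BOM bytes `EF BB BF`, which some JSON consumers reject.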