Set up 2025 files and started parsing the archive site, but was rate limited. Will need to finish it later.

2026-02-24 15:25:39 -08:00
parent 17c8270742
commit ab9abdb53e
11 changed files with 1158 additions and 1006 deletions
@@ -0,0 +1,49 @@
+# Inlander Restaurant Week Picker - Project Memory
+
+## Quick Reference
+- See `scraping-guide.md` for full year-scraping instructions and script templates
+- See `html-structures.md` for HTML parsing patterns per restaurant type
+- Project dir: `\\WinServ-20-3.chns.local\Profiles\derekc\Documents\Coding Projects\Gitea-CooperandGoodman-Inlander-Restaurant-Week-Picker\Inlander-Restaurant-Week-Picker`
+
+## Key Constraints (CRITICAL)
+- **WebFetch cannot access web.archive.org** — use `curl` via Bash tool instead
+- **PowerShell cannot run scripts from UNC paths** (\\server\...) — always `cp` scripts to local temp first
+- **bash `/tmp`** = `C:\Users\DEREKC~1.CHN\AppData\Local\Temp` (8.3 short name)
+- **PowerShell temp** = `C:\Users\derekc.CHNSLocal\AppData\Local\Temp` (long name) — same dir, different string
+- **Wayback Machine rate limit**: throttles with HTTP 429 after ~20 requests; use 3-5 sec delays between requests and wait 30+ min after getting blocked
+
+## JSON Schema
+Each entry in `YEAR-restaurants.json`:
+```json
+{
+  "name": "Restaurant Name",
+  "slug": "restaurantslug",
+  "price": 45,
+  "areas": ["Downtown"],
+  "cuisine": "American",
+  "url": "https://inlanderrestaurantweek.com/project/SLUG/",
+  "menu": {
+    "hours": "Menu served 5pm-close",
+    "phone": "(509) 555-1234",
+    "courses": {
+      "First Course": [{"name": "Dish Name", "desc": "Description"}],
+      "Second Course": [...],
+      "Third Course": [...]
+    }
+  }
+}
+```
+Price is always 25, 35, or 45. gardenparty genuinely has 4 Third Course options.
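+
+A quick sanity check against this schema might look like the sketch below. `Test-RestaurantEntry` is a hypothetical helper, not one of the project scripts. It assumes the entry came from `ConvertFrom-Json` and only flags courses with fewer than three dishes, so gardenparty's 3/3/4 passes.
+
+```powershell
+# Hypothetical helper (not a project script): lists schema problems for one entry.
+function Test-RestaurantEntry {
+    param($Entry)
+    $problems = @()
+    foreach ($field in 'name', 'slug', 'price', 'areas', 'cuisine', 'url', 'menu') {
+        if ($null -eq $Entry.$field) { $problems += "missing $field" }
+    }
+    if ($Entry.price -notin 25, 35, 45) {
+        $problems += "unexpected price '$($Entry.price)'"
+    }
+    foreach ($course in 'First Course', 'Second Course', 'Third Course') {
+        $dishes = $Entry.menu.courses.$course
+        # @($null).Count is 1, so a missing course has to be handled explicitly.
+        $count = if ($null -eq $dishes) { 0 } else { @($dishes).Count }
+        if ($count -lt 3) { $problems += "${course}: only $count dish(es)" }
+    }
+    return ,$problems
+}
+```
+
+Running it over every entry of `2025-restaurants.json` should flag tavolata and the JS-only restaurants and nothing else, if the status below is accurate.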
+
+## 2025 Data Status
+- **File**: `2025-restaurants.json` (121 restaurants)
+- **Wayback snapshot used**: `20250306132630` (primary), `20250401000000` (backup for some)
+- **Complete (3/3/3+)**: 111 restaurants
+- **gardenparty**: 3/3/4 — correct, it genuinely offers 4 dessert choices
+- **tavolata**: 3/3/0 — needs fix-tavolata.ps1 run when rate limit resets
+- **0/0/0 (JS-only, unrecoverable)**: heritage, kismet, littlenoodle, macdaddys, purgatory, redtail, republickitchen, republicpi, vicinopizza
+
+## Scripts in Project Directory
+- `fix-tavolata.ps1` — run after rate limit resets to recover tavolata Third Course
+  - Copy to local temp and run: `cp ...\fix-tavolata.ps1 C:\Users\derekc.CHNSLocal\AppData\Local\Temp\`
+  - Then: `powershell.exe -ExecutionPolicy Bypass -File C:\Users\derekc.CHNSLocal\AppData\Local\Temp\fix-tavolata.ps1`
@@ -0,0 +1,152 @@
+# IRW Website HTML Structure Reference
+
+## Restaurant Page URL
+Live: `https://inlanderrestaurantweek.com/project/SLUG/`
+Archived: `https://web.archive.org/web/TIMESTAMP/https://inlanderrestaurantweek.com/project/SLUG/`
+
+## Page Framework
+The site uses WordPress + Divi theme. Relevant container class: `et_pb_text_inner`.
+Each course section typically occupies one or two `et_pb_text_inner` divs.
+
+---
+
+## Course Layout Types
+
+### Layout A — Heading and items in SEPARATE divs (most restaurants)
+```html
+<div class="et_pb_text_inner"><h3>First Course</h3></div>
+<div class="et_pb_text_inner">
+  <p><strong>Dish Name</strong><br/>Description</p>
+  <p><strong>Dish Name 2</strong><br/>Description 2</p>
+</div>
+<div class="et_pb_text_inner"><h3>Second Course</h3></div>
+...
+```
+
+### Layout B — Heading and items in SAME div (tavolata, durkins, table13, others)
+```html
+<div class="et_pb_text_inner">
+  <h3>First Course</h3>
+  <p><strong>Dish Name</strong><br/>Description</p>
+  <p><strong>Dish Name 2</strong><br/>Description 2</p>
+</div>
+<div class="et_pb_text_inner">
+  <h3>Second Course</h3>
+  ...
+</div>
+```
+
+---
+
+## Dish Name Tag Styles
+
+### Style 1 — `<strong>` tag (most restaurants)
+Examples: 315cuisine, anthonys, bardenay, barkrescuepub, etc.
+```html
+<p><strong>Dish Name</strong><br/>Description text here</p>
+<p><strong>Dish Name</strong> <br/>With space before br</p>
+```
+
+### Style 2 — `<b>` tag with `<br/>` inside (India House, Lebanon, Karma, ponderosa)
+```html
+<p><b>Dish Name <br/></b><span>Description text</span></p>
+<p><b>Dish Name<br/></b> Description without span</p>
+```
+Key: name is inside `<b>`, the `<br/>` is INSIDE the `<b>` tag.
+
+### Style 3 — `<b>` + `<strong>` combo (1898 restaurant)
+```html
+<p><span><b>First Part</b></span><strong>Second Part</strong> Description</p>
+```
+Full dish name = "First Part" + " " + "Second Part"
+
+---
+
+## Field Extraction Patterns
+
+### Name (from page title)
+```
+<title>Restaurant Name | Inlander Restaurant Week</title>
+```
+Regex: `<title>(.+?) \| Inlander`
+
+### Price (WARNING: unreliable — use price listing page instead)
+```html
+<h1 style="text-align: left;"><strong>$45</strong></h1>
+```
+Regex: `<strong>\$(\d+)</strong>`
+PROBLEM: Some pages show drink prices like $22 that match before the real price.
+SOLUTION: Always build an authoritative slug→price map from the price listing page.
+
+### Price Listing Page — Authoritative Prices
+URL: `https://inlanderrestaurantweek.com/price/` (or Wayback archived version)
+```html
+<article class="et_pb_portfolio_item ... project_category_45 ...">
+  ...
+  <a href="https://inlanderrestaurantweek.com/project/SLUG/">
+```
+Extract price tier from `project_category_(25|35|45)` CSS class.
+Extract slug from `href=".../project/SLUG/"`.
+
+### Cuisine
+```html
+CUISINE: AMERICAN COMFORT FOOD
+```
+Often inside `<strong>` or `<em>` tags. Extract uppercase text after "CUISINE:".
+Lowercase the text first, then apply `(Get-Culture).TextInfo.ToTitleCase()` for proper formatting.
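+
+Lowercasing first matters because `TextInfo.ToTitleCase()` treats words that are entirely uppercase as acronyms and leaves them alone. A small illustration, using the invariant culture so the result does not depend on the machine's locale (that choice is an assumption for this example; the scraping script uses `Get-Culture`):
+
+```powershell
+$raw = 'AMERICAN COMFORT FOOD'
+$textInfo = [System.Globalization.CultureInfo]::InvariantCulture.TextInfo
+
+$unchanged = $textInfo.ToTitleCase($raw)                     # all-caps input comes back as-is
+$cuisine   = $textInfo.ToTitleCase($raw.ToLowerInvariant())  # American Comfort Food
+```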
+
+### Phone
+Area codes: 509 (Spokane area) or 208 (Idaho/CDA area)
+Pattern: `(509) 555-1234` or `(208) 555-1234`
+Regex: `\((?:208|509)\) \d{3}-\d{4}`
+
+### Hours
+```
+Menu served 5pm-9pm nightly
+Menu served Thursday-Sunday, 5-9pm
+```
+Regex: `Menu served [^<]+`
+
+### Area
+Look for area keywords (ALL CAPS in source) anywhere in the HTML:
+- DOWNTOWN, NORTH SPOKANE, SOUTH SPOKANE, WEST SPOKANE, SPOKANE VALLEY
+- AIRWAY HEIGHTS, LIBERTY LAKE, COEUR D'ALENE, POST FALLS, HAYDEN, ATHOL, WORLEY
+Default to ["Downtown"] if nothing matched.
+Some restaurants appear in multiple areas — collect all matches.
+
+---
+
+## Dietary Tag Filtering
+Skip these as dish names — they appear in `<strong>` but are dietary labels, not dish names:
+- GF (gluten free)
+- GFA (gluten free available)
+- V, V+ (vegetarian, vegan)
+- DF, DFA (dairy free, dairy free available)
+- V:, V+A (legend lines)
+- 2025 (year marker some restaurants include)
+- Drink (some restaurants label beverage course)
+
+Full regex: `^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$`
+Also skip names matching `^[A-Z]{1,3}:` (legend lines like "GF: Gluten Free")
+Also skip names shorter than 3 chars or longer than 80 chars.
+
+---
+
+## Restaurants by Known HTML Style (2025)
+
+**Layout B (same-block)**: tavolata, durkins, table13, terraza, and others
+**Style 2 (`<b>` tags)**: indiahouse, lebanon, karma, ponderosa, collectivekitchen, dryfly, masselowslounge, vieuxcarre, wileys, osprey, shawnodonnells, ganderryegrass
+**Style 3 (`<b>`+`<strong>` combo)**: 1898
+
+Note: These styles may change year to year as restaurants update their pages.
+Always check a few representative pages before assuming the same structure applies.
+
+---
+
+## JS-Only Pages (no static HTML menu content)
+These restaurants had no recoverable menu data from any Wayback snapshot in 2025:
+heritage, kismet, littlenoodle, macdaddys, purgatory, redtail, republickitchen, republicpi, vicinopizza
+
+Their pages are fully JS-rendered — the static HTML captured by Wayback Machine
+shows the page shell but not the menu content. For future years, these may or may not
+have static caches depending on server-side rendering changes.
@@ -0,0 +1,237 @@
+# IRW Scraping Guide — Full Process for Adding a New Year
+
+## Overview
+The Inlander Restaurant Week website (inlanderrestaurantweek.com) is WordPress/Divi.
+Menu pages are partially JS-rendered but WP-Super-Cache creates static HTML snapshots
+that the Wayback Machine archives. We scrape those static snapshots.
+
+---
+
+## Step 1: Find Restaurant Slugs
+
+Fetch the price listing page to get all slugs for that year:
+```bash
+curl -s "https://web.archive.org/web/TIMESTAMP/https://inlanderrestaurantweek.com/price/" \
+  -o /tmp/irw-price-YEAR.html
+```
+
+Pick a timestamp close to the event (Wayback Machine format: YYYYMMDDHHmmss).
+The price listing page has portfolio items like:
+```html
+<article class="et_pb_portfolio_item ... project_category_45">
+  <a href="https://inlanderrestaurantweek.com/project/SLUG/">
+```
+Extract slug from the href. The class `project_category_(25|35|45)` gives authoritative price.
+
+**Important**: Scrape the price listing page FIRST and save the slug→price map.
+Some restaurant pages have drink prices ($22, $33) that confuse the price parser.
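+
+One way to build that slug→price map is sketched below. `Get-PriceMap` is a hypothetical name, and the pattern assumes the price class sits on the `<article>` tag with the project link following it, as in the snippet above. Because it only looks for `/project/SLUG/`, it should also cope with links that the Wayback Machine has rewritten to point at its own copy.
+
+```powershell
+# Hypothetical helper: builds a slug -> price hashtable from the price listing HTML.
+function Get-PriceMap {
+    param([string]$Html)
+    $opts = [System.Text.RegularExpressions.RegexOptions]::Singleline
+    # Price class on the <article>, then the first /project/SLUG/ link that follows it.
+    $pattern = '<article[^>]*\bproject_category_(25|35|45)\b[^>]*>.*?/project/([^/"]+)/'
+    $map = @{}
+    foreach ($m in [regex]::Matches($Html, $pattern, $opts)) {
+        $slug = $m.Groups[2].Value
+        if (-not $map.ContainsKey($slug)) { $map[$slug] = [int]$m.Groups[1].Value }
+    }
+    return $map
+}
+
+# Made-up sample input; "exampleslug" is not a real restaurant.
+$sample = '<article class="et_pb_portfolio_item project_category_45">' +
+          '<a href="https://inlanderrestaurantweek.com/project/exampleslug/">Example</a></article>'
+$priceMap = Get-PriceMap $sample
+$priceMap['exampleslug']   # 45
+```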
+
+---
+
+## Step 2: Scrape Each Restaurant Page
+
+Use a PowerShell script (written to project dir, copied to local temp to run):
+
+**Wayback Machine URL format**:
+```
+https://web.archive.org/web/TIMESTAMP/https://inlanderrestaurantweek.com/project/SLUG/
+```
+
+**Key fields to extract**:
+```powershell
+# Name
+$nameM = [regex]::Match($html, '<title>(.+?) \| Inlander')
+
+# Price (from page, but USE PRICE LISTING MAP - this can be wrong)
+$priceM = [regex]::Match($html, '<strong>\$(\d+)</strong>')
+
+# Cuisine
+$cuisineM = [regex]::Match($html, 'CUISINE:\s*([A-Z][A-Za-z/ ]+?)\s*<')
+$cuisine = (Get-Culture).TextInfo.ToTitleCase($cuisineM.Groups[1].Value.ToLower())
+
+# Phone
+$phoneM = [regex]::Match($html, '\((?:208|509)\) \d{3}-\d{4}')
+
+# Hours
+$hoursM = [regex]::Match($html, 'Menu served [^<]+')
+
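+# Area (match against known area keys, case-insensitive; collect every match)
+$areaKeys = "AIRWAY HEIGHTS","ATHOL","COEUR D'ALENE","POST FALLS","HAYDEN",
+            "LIBERTY LAKE","NORTH SPOKANE","SOUTH SPOKANE","SPOKANE VALLEY",
+            "WEST SPOKANE","WORLEY","DOWNTOWN"
+$areas = @($areaKeys | Where-Object { $html -match [regex]::Escape($_) })
+# Map each matched key to its display name (e.g. "Downtown") via $areaMap before saving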
+```
+
+**Rate limiting**: Add a delay between requests, at least `Start-Sleep -Milliseconds 2000`; the project memory notes recommend 3-5 seconds.
+After a 429, stop and wait 30+ minutes before trying again.
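+
+The delay-and-stop behaviour could be wrapped up roughly as follows. This is a sketch, not the project's actual loop: `Get-PagesPolitely` and its `-Fetch` parameter are hypothetical, the 4-second default is an assumption, and the status check relies on the exception exposing `Response.StatusCode`, which `Invoke-WebRequest` failures normally do.
+
+```powershell
+# Hypothetical helper: fetch URLs one by one, pause between requests, stop at the first 429.
+function Get-PagesPolitely {
+    param(
+        [string[]]$Urls,
+        [int]$DelaySeconds = 4,
+        # Injection point so the loop can be exercised without network access.
+        [scriptblock]$Fetch = { param($u) (Invoke-WebRequest -Uri $u -UseBasicParsing).Content }
+    )
+    $pages = [ordered]@{}
+    foreach ($url in $Urls) {
+        try {
+            $pages[$url] = & $Fetch $url
+        } catch {
+            $resp = $_.Exception.Response
+            if ($resp -and [int]$resp.StatusCode -eq 429) {
+                Write-Warning "429 at ${url}: stopping. Wait 30+ minutes, then resume with the remaining URLs."
+                break
+            }
+            Write-Warning "Failed ${url}: $($_.Exception.Message)"
+        }
+        Start-Sleep -Seconds $DelaySeconds
+    }
+    return $pages
+}
+```
+
+Anything missing from the returned table was either skipped after the 429 or failed for another reason, so the caller can retry just those URLs later.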
+
+---
+
+## Step 3: Parse Menu Courses
+
+### Course Block Extraction (`Get-CourseBlock`)
+Two HTML layouts exist:
+
+**Layout A** (most common): heading and items in SEPARATE `et_pb_text_inner` blocks
+```powershell
+# Strategy 1: find content between this label and next label
+$m = [regex]::Match($html, [regex]::Escape($label) + '(.+?)(?=' + [regex]::Escape($nextLabel) + ')', $opts)
+
+# Strategy 3 (fallback): items in next et_pb_text_inner block
+$im = [regex]::Match($sub, '(?s)et_pb_text_inner">(?!<h[123])(.+?)(?=et_pb_text_inner"><h|</div>\s*</div>\s*</div>\s*</div>\s*<div)', $opts)
+```
+
+**Layout B** (some restaurants — tavolata, durkins, table13, etc.): heading + items in SAME block
+```powershell
+# Strategy 2: extract <p> tags after </h3> within same div
+$sameDivM = [regex]::Match($sub, '(?s)</h[123]>\s*(<p.+?)(?=</div>)', $opts)
+```
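+
+For quick experiments, both layouts can be handled by one simplified approach: take everything between a course heading and the next heading, then collect the `<p>` elements. The sketch below is a stand-in for illustration, not the `Get-CourseBlock` implementation. `Get-CourseParagraphs` is a hypothetical name, and it assumes the heading text sits directly inside an `<h1>`-`<h3>` tag. For the last course there is no following heading, so on a full page it would also pick up later paragraphs and needs an explicit end marker.
+
+```powershell
+# Hypothetical simplification: returns the inner HTML of each <p> under one course heading.
+function Get-CourseParagraphs {
+    param([string]$Html, [string]$Label)
+    $opts = [System.Text.RegularExpressions.RegexOptions]'Singleline, IgnoreCase'
+    $heading = [regex]::Match($Html, '<h[123][^>]*>\s*' + [regex]::Escape($Label) + '\s*</h[123]>', $opts)
+    if (-not $heading.Success) { return ,@() }
+
+    # Keep only the text up to the next heading so the following course is never read.
+    $sub = $Html.Substring($heading.Index + $heading.Length)
+    $next = [regex]::Match($sub, '<h[123][^>]*>', $opts)
+    if ($next.Success) { $sub = $sub.Substring(0, $next.Index) }
+
+    $paragraphs = [regex]::Matches($sub, '<p(?:\s[^>]*)?>(.*?)</p>', $opts)
+    return ,@($paragraphs | ForEach-Object { $_.Groups[1].Value })
+}
+```
+
+Each returned string can then be passed to a dish parser such as `Parse-Dish`.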
+
+### Dish Parsing (`Parse-Dish`)
+Three tag styles exist:
+
+**Style 1** (most restaurants): `<strong>` for name
+```html
+<p><strong>Dish Name</strong><br/>Description text</p>
+```
+
+**Style 2** (India House, Lebanon, Karma, others): `<b>` with `<br/>` before `</b>`
+```html
+<p><b>Dish Name <br/></b><span>Description text</span></p>
+```
+
+**Style 3** (1898): `<b>` + `<strong>` combination
+```html
+<p><span><b>Part1</b></span><strong>Part2</strong> Description</p>
+```
+
+**Multi-strategy parser** (handles all three):
+```powershell
+function Parse-Dish($pContent) {
+    $opts = [System.Text.RegularExpressions.RegexOptions]::Singleline
+
+    # Style 2: <b>Name <br/></b>
+    $bWithBrM = [regex]::Match($pContent, '(?s)<b>(.*?)<br\s*/?>', $opts)
+    if ($bWithBrM.Success) {
+        $name = Get-CleanText $bWithBrM.Groups[1].Value
+        if (Test-ValidDishName $name) {
+            $desc = Get-CleanText ($pContent.Substring($bWithBrM.Index + $bWithBrM.Length))
+            return [PSCustomObject]@{ name = $name; desc = $desc }
+        }
+    }
+
+    # Style 3: <b>Part1</b>...<strong>Part2</strong>
+    $bM = [regex]::Match($pContent, '(?s)<b>(.*?)</b>', $opts)
+    if ($bM.Success) {
+        $namePart = Get-CleanText $bM.Groups[1].Value
+        if (Test-ValidDishName $namePart) {
+            $afterB = $pContent.Substring($bM.Index + $bM.Length)
+            $sM2 = [regex]::Match($afterB, '(?s)^[^<]*<strong>(.*?)</strong>(.*)', $opts)
+            if ($sM2.Success) {
+                $p2 = Get-CleanText $sM2.Groups[1].Value
+                if (-not (Test-DietaryTag $p2) -and $p2.Length -ge 2) {
+                    return [PSCustomObject]@{ name = "$namePart $p2".Trim(); desc = Get-CleanText $sM2.Groups[2].Value }
+                }
+            }
+            return [PSCustomObject]@{ name = $namePart; desc = Get-CleanText $afterB }
+        }
+    }
+
+    # Style 1: <strong>Name</strong>
+    $sM = [regex]::Match($pContent, '(?s)<strong>(.*?)</strong>', $opts)
+    if ($sM.Success) {
+        $name = Get-CleanText $sM.Groups[1].Value
+        if (-not (Test-ValidDishName $name)) { return $null }
+        $afterBr = ''
+        if ($pContent -match '(?s)<br\s*/?>(.*?)$') { $afterBr = $matches[1] }
+        else { $am = [regex]::Match($pContent, '(?s)</strong>(.*?)$', $opts); if ($am.Success) { $afterBr = $am.Groups[1].Value } }
+        return [PSCustomObject]@{ name = $name; desc = Get-CleanText $afterBr }
+    }
+    return $null
+}
+
+function Test-ValidDishName($name) {
+    $name.Length -ge 3 -and $name.Length -le 80 -and
+    $name -notmatch '^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$' -and
+    $name -cnotmatch '^[A-Z]{1,3}:'   # case-sensitive, so a name like "Ahi: ..." is not mistaken for a legend line
+}
+
+function Test-DietaryTag($str) {
+    $str -match '^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$'
+}
+```
+
+### HTML Cleanup
+```powershell
+function Get-CleanText($rawHtml) {
+    $t = $rawHtml -replace '<[^>]+>', ' '
+    $t = $t -replace '&amp;', '&' -replace '&#039;', "'" -replace '&quot;', '"'
+    $t = $t -replace '&lt;', '<' -replace '&gt;', '>' -replace '&nbsp;', ' '
+    $t = $t -replace '&#8211;', '-' -replace '&#8212;', '-'
+    ($t -replace '\s+', ' ').Trim()
+}
+```
+
+---
+
+## Step 4: Fix Prices
+
+After scraping, apply authoritative prices from the price listing page:
+- Parse `project_category_(25|35|45)` CSS class from portfolio items
+- Match slug from adjacent `href` attribute
+- Build a hashtable and apply to all entries
+
+Common gotcha: Restaurant pages may show $22 (wine), $33 (lunch) — these are NOT the event price.
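+
+Applying the map can be as small as the sketch below. `Set-AuthoritativePrices` is a hypothetical name; it assumes each entry already has `slug` and `price` properties and that `$PriceMap` is the slug→price hashtable built from the price listing page. Slugs missing from the map are returned rather than silently kept, since their scraped price may be a drink or lunch price.
+
+```powershell
+# Hypothetical helper: overwrite scraped prices with the authoritative ones.
+function Set-AuthoritativePrices {
+    param([object[]]$Entries, [hashtable]$PriceMap)
+    $unmatched = @()
+    foreach ($entry in $Entries) {
+        if ($PriceMap.ContainsKey($entry.slug)) {
+            $entry.price = $PriceMap[$entry.slug]
+        } else {
+            $unmatched += $entry.slug
+        }
+    }
+    return ,$unmatched   # slugs that still need a manual price check
+}
+```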
+
+---
+
+## Step 5: Recover Missing Restaurants
+
+If a restaurant has 0/0/0 courses:
+1. Try alternate Wayback timestamps: `20250401000000`, `20250415000000`, `20250501000000`, `20250601000000`
+2. Check if page uses Layout B (same-block) — add Strategy 2 to course block extractor
+3. Check if page uses `<b>` tags instead of `<strong>` for dish names
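+
+For item 1, the candidate URLs can be generated instead of typed by hand. `Get-WaybackCandidates` is a hypothetical helper; the default timestamps are the four listed above.
+
+```powershell
+# Hypothetical helper: one archived URL per alternate timestamp for a given slug.
+function Get-WaybackCandidates {
+    param(
+        [string]$Slug,
+        [string[]]$Timestamps = @('20250401000000', '20250415000000', '20250501000000', '20250601000000')
+    )
+    foreach ($ts in $Timestamps) {
+        "https://web.archive.org/web/$ts/https://inlanderrestaurantweek.com/project/$Slug/"
+    }
+}
+
+$candidates = @(Get-WaybackCandidates -Slug 'tavolata')
+```
+
+Each candidate still counts against the rate limit, so try them with the same delay as the main scrape.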
+
+**Known JS-only restaurants** (no static cache recoverable for 2025):
+heritage, kismet, littlenoodle, macdaddys, purgatory, redtail, republickitchen, republicpi, vicinopizza
+
+---
+
+## Step 6: Output and Validation
+
+```powershell
+# Save as UTF-8 without a BOM (important: special characters in restaurant names)
+$json = $data | ConvertTo-Json -Depth 10
+[System.IO.File]::WriteAllText($outPath, $json, (New-Object System.Text.UTF8Encoding $false))
+
+# Validate: list any restaurant not at 3/3/3
+$data | Where-Object {
+    $_.menu.courses.'First Course'.Count -ne 3 -or
+    $_.menu.courses.'Second Course'.Count -ne 3 -or
+    $_.menu.courses.'Third Course'.Count -ne 3
+} | ForEach-Object {
+    "$($_.slug): $($_.menu.courses.'First Course'.Count)/$($_.menu.courses.'Second Course'.Count)/$($_.menu.courses.'Third Course'.Count)"
+}
+```
+
+---
+
+## PowerShell Script Execution Pattern (REQUIRED)
+
+```bash
+# Write script to project dir (via Write tool or Edit)
+# Then in bash:
+cp "//WinServ-20-3.chns.local/Profiles/derekc/Documents/Coding Projects/.../script.ps1" \
+   "/c/Users/derekc.CHNSLocal/AppData/Local/Temp/script.ps1"
+powershell.exe -ExecutionPolicy Bypass -File "C:\Users\derekc.CHNSLocal\AppData\Local\Temp\script.ps1"
+```
+
+**Never** use `powershell -Command "..."` for multi-line scripts — escaping is unreliable.
+**Never** try to run `.ps1` directly from `\\WinServ-20-3...` UNC path — execution policy blocks it.
+
+---
+
+## PowerShell Gotchas
+- `"$slug: text"` fails because `$slug:` is parsed as a scope- or drive-qualified variable name; use `"${slug}: text"`
+- Function names like `Is-X`, `Decode-X`, `Parse-X` get PSScriptAnalyzer warnings (unapproved verbs) but work fine
+- `return ,$array` (comma prefix) forces PowerShell to return an array, not unroll it
+- `[System.IO.File]::WriteAllText($path, $json, (New-Object System.Text.UTF8Encoding $false))`: use this, not `Out-File`, to avoid BOM/encoding issues (passing `[System.Text.Encoding]::UTF8` writes a BOM)
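+
+The variable-delimiter and array-return gotchas are easy to reproduce in isolation. The function names below are made up for the demonstration:
+
+```powershell
+$slug = 'tavolata'
+$line = "${slug}: 3/3/0"    # braces keep the colon out of the variable name
+
+function Get-One      { return @('only') }    # unrolled: the caller receives a string
+function Get-OneArray { return ,@('only') }   # comma prefix: the caller receives the array
+
+$unrolledType = (Get-One).GetType().Name        # String
+$arrayType    = (Get-OneArray).GetType().Name   # Object[]
+```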