Set up 2025 files and started parsing the archive site, but was rate limited. Will need to finish it later.

2026-02-24 15:25:39 -08:00
parent 17c8270742
commit ab9abdb53e
11 changed files with 1158 additions and 1006 deletions

memory/html-structures.md Normal file

@@ -0,0 +1,152 @@
# IRW Website HTML Structure Reference
## Restaurant Page URL
Live: `https://inlanderrestaurantweek.com/project/SLUG/`
Archived: `https://web.archive.org/web/TIMESTAMP/https://inlanderrestaurantweek.com/project/SLUG/`
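A minimal Python sketch of building both URL forms from a slug. The function names and the sample timestamp are hypothetical; only the two URL templates come from the notes above.

```python
LIVE_BASE = "https://inlanderrestaurantweek.com/project/"
WAYBACK_PREFIX = "https://web.archive.org/web/"


def live_url(slug: str) -> str:
    """Return the live restaurant page URL for a slug."""
    return f"{LIVE_BASE}{slug}/"


def archived_url(slug: str, timestamp: str) -> str:
    """Return the Wayback Machine URL for a slug at a given snapshot timestamp."""
    return f"{WAYBACK_PREFIX}{timestamp}/{live_url(slug)}"


# Hypothetical timestamp, used only for illustration.
print(archived_url("tavolata", "20250224000000"))
```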
## Page Framework
The site uses WordPress + Divi theme. Relevant container class: `et_pb_text_inner`.
Each course section typically occupies one or two `et_pb_text_inner` divs.
---
## Course Layout Types
### Layout A — Heading and items in SEPARATE divs (most restaurants)
```html
<div class="et_pb_text_inner"><h3>First Course</h3></div>
<div class="et_pb_text_inner">
<p><strong>Dish Name</strong><br/>Description</p>
<p><strong>Dish Name 2</strong><br/>Description 2</p>
</div>
<div class="et_pb_text_inner"><h3>Second Course</h3></div>
...
```
### Layout B — Heading and items in SAME div (tavolata, durkins, table13, others)
```html
<div class="et_pb_text_inner">
<h3>First Course</h3>
<p><strong>Dish Name</strong><br/>Description</p>
<p><strong>Dish Name 2</strong><br/>Description 2</p>
</div>
<div class="et_pb_text_inner">
<h3>Second Course</h3>
...
</div>
```
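One way to handle both layouts with the same code is to ignore the div boundaries and group each `<p>` under the most recent `<h3>` in document order. The sketch below uses Python's standard `html.parser`; the class and function names are hypothetical, and it assumes every `<h3>` on the page is a course heading, which should be checked against real pages.

```python
from html.parser import HTMLParser


class CourseParser(HTMLParser):
    """Group <p> blocks under the most recent <h3>, in document order."""

    def __init__(self):
        super().__init__()
        self.courses = {}     # heading text -> list of paragraph texts
        self._current = None  # heading that the next <p> belongs to
        self._target = None   # "h3" or "p" while text is being captured
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3" or (tag == "p" and self._current is not None):
            self._target, self._buf = tag, []
        elif tag == "br" and self._target == "p":
            self._buf.append("\n")  # keep the name/description boundary

    def handle_data(self, data):
        if self._target is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._target:
            return
        text = "".join(self._buf).strip()
        if tag == "h3":
            self._current = text
            self.courses.setdefault(text, [])
        elif text:
            self.courses[self._current].append(text)
        self._target = None


def parse_courses(html):
    """Return {course heading: [paragraph text, ...]} for Layout A or B."""
    parser = CourseParser()
    parser.feed(html)
    parser.close()
    return parser.courses
```

Because the grouping depends only on tag order, Layout A and Layout B produce the same dictionary for the same menu.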
---
## Dish Name Tag Styles
### Style 1 — `<strong>` tag (most restaurants)
Examples: 315cuisine, anthonys, bardenay, barkrescuepub, etc.
```html
<p><strong>Dish Name</strong><br/>Description text here</p>
<p><strong>Dish Name</strong> <br/>With space before br</p>
```
### Style 2 — `<b>` tag with `<br/>` inside (India House, Lebanon, Karma, ponderosa)
```html
<p><b>Dish Name <br/></b><span>Description text</span></p>
<p><b>Dish Name<br/></b> Description without span</p>
```
Key: name is inside `<b>`, the `<br/>` is INSIDE the `<b>` tag.
### Style 3 — `<b>` + `<strong>` combo (1898 restaurant)
```html
<p><span><b>First Part</b></span><strong>Second Part</strong> Description</p>
```
Full dish name = "First Part" + " " + "Second Part"
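The three styles can be covered by one rule: the dish name is the text of every `<b>` or `<strong>` run in the paragraph, joined with spaces, and the description is whatever text remains. The Python sketch below is regex-based and assumes well-formed, non-overlapping bold tags; `split_dish` is a hypothetical helper name. It will also fold a bold dietary label in the description into the name, so the dietary filter still has to run afterwards.

```python
import re
from html import unescape

# Matches <b>...</b> or <strong>...</strong>; \b keeps <br/> from matching as <b>.
BOLD_RE = re.compile(r"<(b|strong)\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def _text(fragment):
    """Strip tags, decode entities, and collapse whitespace."""
    return " ".join(unescape(TAG_RE.sub(" ", fragment)).split())


def split_dish(p_html):
    """Return (name, description) for one <p> block in Style 1, 2, or 3."""
    parts = [_text(m.group(2)) for m in BOLD_RE.finditer(p_html)]
    name = " ".join(part for part in parts if part)
    description = _text(BOLD_RE.sub(" ", p_html))
    return name, description
```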
---
## Field Extraction Patterns
### Name (from page title)
```
<title>Restaurant Name | Inlander Restaurant Week</title>
```
Regex: `<title>(.+?) \| Inlander`
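As a sketch, the title regex can be applied in Python as follows. `restaurant_name` is a hypothetical helper, and decoding HTML entities is an added assumption, since titles may contain escaped characters such as `&amp;`.

```python
import re
from html import unescape

TITLE_RE = re.compile(r"<title>(.+?) \| Inlander")


def restaurant_name(html):
    """Return the restaurant name from the <title> tag, or None if absent."""
    match = TITLE_RE.search(html)
    return unescape(match.group(1)).strip() if match else None
```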
### Price (WARNING: unreliable — use price listing page instead)
```html
<h1 style="text-align: left;"><strong>$45</strong></h1>
```
Regex: `<strong>\$(\d+)</strong>`
PROBLEM: Some pages also show drink prices (e.g. $22) in `<strong>`, and these can match before the real menu price.
SOLUTION: Always build an authoritative slug→price map from the price listing page.
### Price Listing Page — Authoritative Prices
URL: `https://inlanderrestaurantweek.com/price/` (or Wayback archived version)
```html
<article class="et_pb_portfolio_item ... project_category_45 ...">
...
<a href="https://inlanderrestaurantweek.com/project/SLUG/">
```
Extract price tier from `project_category_(25|35|45)` CSS class.
Extract slug from `href=".../project/SLUG/"`.
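A minimal Python sketch of building the slug→price map from the listing page HTML. It assumes each `<article>` has a double-quoted `class` attribute and a closing `</article>` tag, and that the first `/project/SLUG/` link inside it belongs to that restaurant; `build_price_map` is a hypothetical name.

```python
import re

ARTICLE_RE = re.compile(
    r'<article\b[^>]*\bclass="([^"]*)"[^>]*>(.*?)</article>', re.DOTALL
)
TIER_RE = re.compile(r"\bproject_category_(25|35|45)\b")
# Also matches Wayback-rewritten links, since only the trailing /project/SLUG/ is used.
SLUG_RE = re.compile(r'href="[^"]*/project/([^/"]+)/"')


def build_price_map(listing_html):
    """Return {slug: price tier} for every portfolio item on the listing page."""
    prices = {}
    for class_attr, body in ARTICLE_RE.findall(listing_html):
        tier = TIER_RE.search(class_attr)
        slug = SLUG_RE.search(body)
        if tier and slug:
            prices[slug.group(1)] = int(tier.group(1))
    return prices
```

Items without a recognized tier class or project link are skipped rather than guessed, so a missing slug in the result signals a page that needs manual review.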
### Cuisine
```html
CUISINE: AMERICAN COMFORT FOOD
```
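Often inside `<strong>` or `<em>` tags. Extract the uppercase text after "CUISINE:".
Apply `.ToTitleCase()` for proper formatting. If this is .NET's `TextInfo.ToTitleCase`, lowercase the text first, because that method leaves all-uppercase words unchanged.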
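A Python sketch of the cuisine extraction, in which `string.capwords` stands in for the title-casing step. The regex is an assumption: it skips any tags that directly follow the label and captures text up to the next tag. `extract_cuisine` is a hypothetical name.

```python
import re
from html import unescape
from string import capwords

# Skip tags right after the label (e.g. "</strong> <em>"), then capture up to the next tag.
CUISINE_RE = re.compile(r"CUISINE:\s*(?:<[^>]+>\s*)*([^<]+)")


def extract_cuisine(html):
    """Return the cuisine in title case, or None if no CUISINE: label is found."""
    match = CUISINE_RE.search(html)
    if not match:
        return None
    # capwords lowercases the rest of each word and collapses whitespace.
    return capwords(unescape(match.group(1)))
```

`capwords` only capitalizes after whitespace, so a value like `ASIAN/FUSION` becomes `Asian/fusion`; values with slashes or hyphens may need a manual fix.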
### Phone
Area codes: 509 (Spokane area) or 208 (Idaho/CDA area)
Pattern: `(509) 555-1234` or `(208) 555-1234`
Regex: `\((?:208|509)\) \d{3}-\d{4}`
### Hours
```
Menu served 5pm-9pm nightly
Menu served Thursday-Sunday, 5-9pm
```
Regex: `Menu served [^<]+`
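The phone and hours regexes above can be applied as in this Python sketch. The helper names are hypothetical. The phone pattern only matches the exact `(NNN) NNN-NNNN` form, so numbers written with dots or without parentheses would be missed.

```python
import re

PHONE_RE = re.compile(r"\((?:208|509)\) \d{3}-\d{4}")
HOURS_RE = re.compile(r"Menu served [^<]+")


def extract_phone(html):
    """Return the first (208)/(509) phone number, or None."""
    match = PHONE_RE.search(html)
    return match.group(0) if match else None


def extract_hours(html):
    """Return the 'Menu served ...' text up to the next tag, or None."""
    match = HOURS_RE.search(html)
    return match.group(0).strip() if match else None
```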
### Area
Look for area keywords (ALL CAPS in source) anywhere in the HTML:
- DOWNTOWN, NORTH SPOKANE, SOUTH SPOKANE, WEST SPOKANE, SPOKANE VALLEY
- AIRWAY HEIGHTS, LIBERTY LAKE, COEUR D'ALENE, POST FALLS, HAYDEN, ATHOL, WORLEY
Default to ["Downtown"] if nothing matches.
Some restaurants appear in multiple areas — collect all matches.
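The area rules above can be sketched in Python as follows. The display names in the mapping are assumptions (the notes only give the ALL CAPS keywords and the `"Downtown"` default), and normalizing a curly apostrophe is an added guess about how `COEUR D'ALENE` may appear in the HTML.

```python
from html import unescape

# Keyword as it appears in the page -> display name (display names are assumed).
AREAS = {
    "DOWNTOWN": "Downtown",
    "NORTH SPOKANE": "North Spokane",
    "SOUTH SPOKANE": "South Spokane",
    "WEST SPOKANE": "West Spokane",
    "SPOKANE VALLEY": "Spokane Valley",
    "AIRWAY HEIGHTS": "Airway Heights",
    "LIBERTY LAKE": "Liberty Lake",
    "COEUR D'ALENE": "Coeur d'Alene",
    "POST FALLS": "Post Falls",
    "HAYDEN": "Hayden",
    "ATHOL": "Athol",
    "WORLEY": "Worley",
}


def extract_areas(html):
    """Return every matching area, or ["Downtown"] when none is found."""
    text = unescape(html).replace("\u2019", "'")  # curly -> straight apostrophe
    found = [name for keyword, name in AREAS.items() if keyword in text]
    return found or ["Downtown"]
```

Matching is case-sensitive on purpose, so ordinary prose that mentions "downtown" in lowercase does not count as an area label.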
---
## Dietary Tag Filtering
Skip these as dish names — they appear in `<strong>` but are dietary labels, not dish names:
- GF (gluten free)
- GFA (gluten free available)
- V, V+ (vegetarian, vegan)
- DF, DFA (dairy free, dairy free available)
- V:, V+A (legend lines)
- 2025 (year marker some restaurants include)
- Drink (some restaurants label beverage course)
Full regex: `^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$`
Also skip names matching `^[A-Z]{1,3}:` (legend lines like "GF: Gluten Free")
Also skip names shorter than 3 chars or longer than 80 chars.
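The three filtering rules can be combined into a single predicate, sketched here in Python. `is_dish_name` is a hypothetical name; the two regexes and the length bounds are taken directly from the rules above.

```python
import re

DIETARY_RE = re.compile(r"^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$")
LEGEND_RE = re.compile(r"^[A-Z]{1,3}:")


def is_dish_name(candidate):
    """Return True if the bold text looks like a dish name, not a label."""
    name = candidate.strip()
    if DIETARY_RE.match(name) or LEGEND_RE.match(name):
        return False
    return 3 <= len(name) <= 80


candidates = ["Caesar Salad", "GF", "V+", "GF: Gluten Free", "Drink"]
print([c for c in candidates if is_dish_name(c)])
```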
---
## Restaurants by Known HTML Style (2025)
**Layout B (same-block)**: tavolata, durkins, table13, terraza, and others
**Style 2 (`<b>` tags)**: indiahouse, lebanon, karma, ponderosa, collectivekitchen, dryfly, masselowslounge, vieuxcarre, wileys, osprey, shawnodonnells, ganderryegrass
**Style 3 (`<b>`+`<strong>` combo)**: 1898
Note: These styles may change year to year as restaurants update their pages.
Always check a few representative pages before assuming the same structure applies.
---
## JS-Only Pages (no static HTML menu content)
These restaurants had no recoverable menu data from any Wayback snapshot in 2025:
heritage, kismet, littlenoodle, macdaddys, purgatory, redtail, republickitchen, republicpi, vicinopizza
Their pages are fully JS-rendered — the static HTML captured by Wayback Machine
shows the page shell but not the menu content. In future years, static snapshots may or may not
contain the menus, depending on whether the site starts rendering them server-side.