Reading Baseball Box Scores

tommythumb · (This post was last modified: May-15-2024, 05:32 PM by tommythumb.)

Here is a sample of 100 baseball box scores. These are automatically generated
by another app. There may be hundreds of these box scores chained together in 1 file.
This file contains 100 which are separated by a date/time line.
Concentrate on two lines of each box score.

They are the 2 lines immediately following the line that contains "R H E".
These 2 lines are called the linescore lines. The linescore lines are arranged in groups of 3 innings
played, followed by a space, and then 3 more innings, etc. A typical game has 3 groupings of 3 innings
per game (for a total of 9 innings). If the HOME team (the 2nd line) has a dash ("-") where the 9th inning
appears, that signifies that the HOME team did not require an at bat in their half of the 9th inning,
because they were already leading and won the game.

Extra-inning games will have additional groups of 3. On the same line, following the inning-by-inning
linescore (regardless of the number of innings played), are the game totals of "Runs", "Hits", and
"Errors" for the given team: the total number of runs, hits, and errors accumulated by the team.
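To illustrate the grouping described above, here is a minimal sketch that flattens the 3-inning groups into one entry per inning. The function name expand_innings is made up for illustration, and it assumes every inning takes exactly one character (a digit or "-"), which may not hold if the generating app prints a 10+ run inning differently.

```python
def expand_innings(groups):
    """Turn groups such as ['001', '000', '01-'] into one value per inning.

    A '-' (home team did not bat) becomes None.
    Assumption: one character per inning.
    """
    per_inning = []
    for group in groups:
        for ch in group:
            per_inning.append(None if ch == "-" else int(ch))
    return per_inning


# Inning groups taken from the "Buffalo" example line quoted later in this thread
innings = expand_innings(["001", "000", "01-"])
print(len(innings), innings)
```

Summing the entries that are not None should give the same number as the "R" total at the end of the line, which is a cheap sanity check on the parsing.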

The two lines of the linescore begin with a 2-digit year of the team playing, followed by visiting team
name on the first line, and then an inning-by-inning tally of runs scored in each inning for that team,
followed by the last 3 entries on the line representing the total number of runs scored, hits made, and
errors made by the team. The HOME team stats are on the next line (the second line) of the linescore.
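A rough sketch of how one linescore line could be split according to the layout above. The name parse_linescore and the dictionary keys are hypothetical, and the sketch assumes team names never contain a word made only of digits and dashes. The two sample lines are the ones quoted further down in this thread.

```python
import re


def parse_linescore(line):
    """Split one linescore line into year, team, inning groups and R/H/E."""
    tokens = line.split()
    year = tokens[0]
    # The last 3 entries on the line are always Runs, Hits, Errors
    runs, hits, errors = (int(t) for t in tokens[-3:])
    middle = tokens[1:-3]
    # Team-name words come first; inning groups contain only digits and '-'
    name_words = []
    while middle and not re.fullmatch(r"[\d-]+", middle[0]):
        name_words.append(middle.pop(0))
    return {
        "year": year,
        "team": " ".join(name_words),
        "innings": middle,
        "R": runs,
        "H": hits,
        "E": errors,
    }


print(parse_linescore("61 Jersey Ci 002 000 110 00  4 10  2"))
print(parse_linescore("61 Buffalo   001 000 01-   2  4  1"))
```

Taking the totals from the end of the line means the number of inning groups does not matter, so 9-inning and extra-inning games go through the same code.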

In this example, the lines of interest for the first 3 games listed are:
29 & 30 (for the 1st game)
58 & 59 (for the 2nd game)
87 & 88 (for the third game)

NOTE: These line numbers will vary due to number of players/pitchers who played, etc.
You will have to "find" the lines of interest with code.
The code should find a line in each box score that has the string "R  H  E"
(2 spaces between the "R" and "H", and 2 spaces between the "H" and "E").
Once you find that string, the next 2 lines are what you're looking for.
AWAY (or visiting) team is the first line, and the HOME team is on the 2nd line.
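The search described above could be sketched like this. The name find_linescores is made up for illustration, and the sample lines are invented, not taken from the attachment. The pattern uses \s+ so it matches the header whether the letters are separated by one space or two.

```python
import re

# Header line: the letters R, H and E separated by whitespace
HEADER = re.compile(r"\bR\s+H\s+E\b")


def find_linescores(lines):
    """Return a list of (away_line, home_line) pairs, one per box score."""
    pairs = []
    for i, line in enumerate(lines):
        # Only take the pair if both following lines actually exist
        if HEADER.search(line) and i + 2 < len(lines):
            away = lines[i + 1].rstrip("\n")
            home = lines[i + 2].rstrip("\n")
            pairs.append((away, home))
    return pairs


# Invented sample, only to show the shape of the input
sample = [
    "05-10-2024 07:24:29  # 1 ----------\n",
    "some other box score text\n",
    "                             R  H  E\n",
    "61 Jersey Ci 002 000 110   4 10  2\n",
    "61 Buffalo   001 000 010   2  4  1\n",
    "more box score text\n",
]
pairs = find_linescores(sample)
print(pairs)
```

With a real file you would pass in the result of f.readlines(), so the varying line numbers never matter.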

Here are 3 sample box scores. A game # appears at the end of the first line of each
new box score, identified by the "#" (pound sign):

[See File Attachment]

Write a Python program that:
Prompts the user for filename/location of the data file containing the box scores
Opens that file and finds all linescore lines (the 2 lines you're looking for in each box score).

Identify/Count the teams participating in the entire list of box scores
How many games Won and Lost by each team
Total Runs, Hits, and Errors accumulated for each team.
Save all readable output to "SBSMatrix.dat"
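One possible way to accumulate the requested totals, as a sketch only. The names tally and format_report, the dictionary keys, and the teams "Alpha" and "Beta" are all made up for illustration. It assumes each game has already been reduced to an (away, home) pair holding the team name and its R, H, E totals.

```python
from collections import defaultdict


def tally(games):
    """games: iterable of (away, home) dicts with keys team, R, H, E."""
    stats = defaultdict(lambda: {"W": 0, "L": 0, "R": 0, "H": 0, "E": 0})
    for away, home in games:
        for side in (away, home):
            row = stats[side["team"]]
            for key in ("R", "H", "E"):
                row[key] += side[key]
        # A tied score (should not occur in a finished game) counts for nobody
        if away["R"] != home["R"]:
            winner, loser = (away, home) if away["R"] > home["R"] else (home, away)
            stats[winner["team"]]["W"] += 1
            stats[loser["team"]]["L"] += 1
    return stats


def format_report(stats):
    """Build a tab-separated report, one line per team."""
    lines = ["Team\tW\tL\tR\tH\tE"]
    for team in sorted(stats):
        s = stats[team]
        lines.append(f"{team}\t{s['W']}\t{s['L']}\t{s['R']}\t{s['H']}\t{s['E']}")
    return "\n".join(lines) + "\n"


# Invented games, only to show the call
games = [
    ({"team": "Alpha", "R": 4, "H": 10, "E": 2}, {"team": "Beta", "R": 2, "H": 4, "E": 1}),
    ({"team": "Beta", "R": 3, "H": 7, "E": 0}, {"team": "Alpha", "R": 1, "H": 5, "E": 1}),
]
stats = tally(games)
print(len(stats), "teams")
print(format_report(stats))
```

The number of teams is simply len(stats), and writing the report is then a matter of opening "SBSMatrix.dat" in "w" mode and writing the string returned by format_report.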

x1.txt (Size: 174.89 KB / Downloads: 28)

Pedroski55 · (This post was last modified: May-16-2024, 09:13 AM by Pedroski55.)

Must be homework!

I think you will find, people will not answer because they are waiting for you to show some effort, to show what you have tried so far.

Here is a little something to get you started. You can find the other stuff you need in a very similar fashion.

#! /usr/bin/python3
import re

def myApp():

    # where is the data?
    path2data = '/home/pedro/myPython/re/text_files/baseball_data1.txt'
    # savepath will be csv format, can open in Excel, whatever you call it
    savepath = '/home/pedro/myPython/re/text_files/SBSMatrix.dat'
    # your "box data", maybe you have more data files?
    baseball_data1 = 'baseball_data1.txt'

    # find the lines with date and game number
    # example: date = 05-10-2024 07:24:29
    # example: line = '05-10-2024 07:24:29  # 1 ---------------------------------------------------- '
    # find the whole thing then split it is easiest
    pattern1 = re.compile(r'\d\d-\d\d-\d\d\d\d \d\d:\d\d:\d\d  # \d+')

    with open(path2data) as infile, open(savepath, 'w') as output:
        # groups is a generator, so it will not overtax computer memory, no matter how big the input data is.
        # search finds the first instance of pattern
        output.write('date,gamenum\n')
        groups = (pattern1.search(line) for line in infile)
        for g in groups:
            # g can be None when pattern is not found, don't want None lines
            if g:
                #print(g[0])
                # split on space get a list
                # date is g[0], game number is g[-1]
                data = g[0].split()
                date = data[0]
                gamenum = data[-1]
                output.write(f'{date},{gamenum}\n')

# the function has to be called, otherwise nothing happens
myApp()

I would prefer if you used cricket data, I know nothing about baseball!

tommythumb · (This post was last modified: May-19-2024, 06:28 AM by Gribouillis.)

Sorry about not including my code... Here is my attempt, but as you can see by what it outputs, it's pretty sparse.

import os  # Import the os module

from collections import defaultdict


def find_inning_lines(lines):
  """
  Finds the lines containing inning-by-inning runs based on context clues.

  Args:
      lines: A list of strings representing the game data.

  Returns:
      A list of line numbers containing inning runs, or None if not found.
  """
  for i in range(len(lines)):
    line = lines[i].strip()
    if line.startswith("R H E") or line == "R H E":  # Check for potential inning headers
      return [i + 2, i + 3]  # Assuming inning data 2 lines after the header
  return None


def parse_game(lines):
  """
  Parses the box score lines for a single game.

  Args:
      lines: A list of strings representing the lines for a game.

  Returns:
      A dictionary containing game data, or None if required data is missing.
  """
  data = {}
  # Extract team names
  try:
    data["away_team"] = lines[1].split()[1]
    data["home_team"] = lines[2].split()[1]
  except IndexError:
    return None  # Handle missing team names

  # Extract team scores (assuming scores are present even if inning data is missing)
  try:
    away_score = int(lines[-3].split()[-3])
    home_score = int(lines[-2].split()[-3])
    data["away_runs"] = away_score
    data["home_runs"] = home_score
  except IndexError:
    return None  # Handle missing scores

  # Extract win/loss
  data["away_win"] = away_score > home_score
  data["home_win"] = not data["away_win"]

  # Extract inning-by-inning runs (handle potential missing lines)
  inning_runs = {"away": [0] * 10, "home": [0] * 10}
  inning_lines = find_inning_lines(lines)
  if inning_lines:
    for i, team_line in enumerate(lines[inning_lines[0]:inning_lines[1]]):
      inning_runs[team_line.split()[0]] = [int(r) for r in team_line.split()[2:-4]]
  data["inning_runs"] = inning_runs
  return data


def process_data(filename):
  """
  Processes the data file and generates requested outputs.

  Args:
      filename: The path to the data file.
  """
  # Team counters
  teams = set()
  team_wins = defaultdict(int)
  team_losses = defaultdict(int)

  # Open and parse games
  with open(filename, "r") as f:
    lines = f.readlines()
    games = []
    start = 0  # Track starting line of each game block
    for i in range(len(lines)):
      if lines[i].startswith("05"):  # Identify game start marker
        games.append(parse_game(lines[start:i + 1]))  # Parse game block
        start = i + 1  # Update start for next game block
      elif lines[i].strip() == "--- end of data ---":  # Check for end of data
        break

  # Process games data (only for games with valid data)
  for game in games:
    if game:  # Check if game data is valid before processing
      teams.add(game["away_team"])
      teams.add(game["home_team"])
      if game["away_win"]:
        team_wins[game["away_team"]] += 1
        team_losses[game["home_team"]] += 1
      else:
        team_wins[game["home_team"]] += 1
        team_losses[game["away_team"]] += 1

  # Generate win-loss matrix
  win_loss_matrix = {}
  for team in teams:
    win_loss_matrix[team] = {"wins": team_wins[team], "losses": team_losses[team]}

  # Save win-loss matrix to file
  with open("MyOutput.dat", "w") as f:
    f.write(f"Teams\tWins\tLosses\n")  # Write header with proper closing quotation mark
    for team, stats in win_loss_matrix.items():
      f.write(f"{team}\t{stats['wins']}\t{stats['losses']}\n")

if __name__ == "__main__":
  # Get filename from user
  while True:
    filename = input("Enter the path to your data file: ")
    if os.path.isfile(filename):
      process_data(filename)
      break  # Exit loop after processing a valid file
    else:
      print("File not found. Please try again.")

HERE IS THIS PROGRAM'S OUTPUT (MyOutput.dat):
Teams Wins Losses

Gribouillis wrote May-19-2024, 06:28 AM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.

Pedroski55 · May-21-2024, 09:13 AM

This works for me and delivers a csv.

You can then open the csv in pandas, or Excel, and calculate anything you want! (Don't know what you want.) There are 12 columns.

Not all the lines we want have the same format. Before the columns R, H, E there are sometimes 5 groups of numbers, sometimes only 3.

# Example 61 Jersey Ci 002 000 110 00 4 10 2 # team name has 2 parts!
# Example 61 Buffalo 001 000 01- 2 4 1 # 01- caused problems because of the -, I was looking for 3 numbers!
# Example 61 Jersey Ci 000 000 012 000 000 3 13 2 # 2 extra columns of numbers. Some have 1 extra column

The datetime lines all seem to be the same!

I don't know what the 61 signifies. The year? "61 Team name" appears on some other lines too.

#! /usr/bin/python3
import re

# long version of the data
# not all of the lines we want are the same
path2data = '/home/pedro/myPython/re/text_files/baseball_data1.txt'
# savepath will be csv format, can open in Excel, whatever you call it
savepath = '/home/pedro/myPython/re/text_files/SBSMatrix.dat'

# find the lines with date and game number
# Example: 05-10-2024 07:24:29  # 27
# p finds 3 groups: (date) (time) (game number)
p = re.compile(r'([0-9-]+)\s+([0-9:]+)\s+#\s+(\d+)')

# not sure what 61 is
# Example 61 Jersey Ci 002 000 110 00  4 10  2
# Example 61 Buffalo   001 000 01-   2  4  1 # 01- caused problems
# Example 61 Jersey Ci 000 000 012 000 000   3 13  2 # 2 extra columns of numbers
q = re.compile(r'(61\s[A-Za-z ]+)\s+(\d+\s[0-9- ]+)')

def date_time_gamenum(line):
    result = p.search(line) 
    if result:
        return [result.group(1), result.group(2), result.group(3)]
    else:
        return None

def results(line):
    res = q.search(line)
    if res:
        res_list = res.group(2).split()
        # some lines have more inning groups than others; pad to 5 groups + R H E
        # (a line with more than 5 groups is left unpadded instead of looping forever)
        innings, totals = res_list[:-3], res_list[-3:]
        innings += ['-'] * (5 - len(innings))
        res_list = innings + totals
        return [res.group(1)] + res_list
    else:
        return None
                        
# this works
with open(path2data) as infile, open(savepath, 'w') as output:
    # put some column headers for each data element of the csv
    output.write('Date,Time,Game_num,Team,num1,num2,num3,num4,num5,Runs,Hits,Errors\n')
    groups = (line for line in infile)
    for g in groups:
        dt = date_time_gamenum(g)
        if dt:
            #print(f'************** game number is: {dt[2]}')
            #print(f'dt = {dt}')
            date, time, gamenum = dt[0], dt[1], dt[2]
        scores = results(g)
        if scores:
            print(scores)
            csvdata = [date, time, gamenum] + scores
            csvstring = ','.join(csvdata) + '\n'
            #print(csvstring)
            output.write(csvstring)

Like I said, cricket scores would be easier!
