Soccer Dashboard

Data from Brazil, Mexico, USA and Argentina Soccer Leagues

Soccer Dashboard (BRA and US)


For this project I scraped data from a webpage, loaded it into a database, and created this HTML page to display the information.


Applied Technologies
  • Python
  • BeautifulSoup
  • SQL (PostgreSQL)
  • JSON
  • HTML/CSS
  • JavaScript
  • D3
Steps
  1. Scrape
    • Website scraped: Transfermarkt
    • Using BeautifulSoup, scrape the website and save the data in a .csv file.

  2. SQL and JSON
    • With SQLAlchemy, store all the scraped data in a single database.
    • Create a JSON file with all the data and use it to build the dashboard.

  3. HTML and JavaScript
    • Build a webpage to present the information.
    • Using JavaScript and D3, parse the JSON file and display the data.
  • Detailed Step Descriptions (see below)

  • Further Improvements
    • Presentation
      • Improve Technical Info with code snippets and better descriptions
      • Improve About
      • Add Leagues Data page
      • Add charts
      • Improve Tableau Dashboards

    • Structure
      • Improve Database
      • Develop a Flask App and API

    • Prediction
      • Develop a Machine Learning Model to predict results based on past games
      • Use sentiment analysis to gauge fan feedback on a game.


    • Detailed Descriptions

      • Step 1 - Scraping
      • To kick off this project, I first needed to select a data source for the players’ market values. I wanted a source that was reliable and up to date on transfers and values, so I chose the Transfermarkt website, since they have just completed 20 years of working with soccer data. The leagues I decided to work with were the first divisions of Brazil, the USA, Mexico and Argentina.

        Once I had chosen the source, the next step was to obtain the data to work with. Using the direct link to each league’s page, I used Beautiful Soup to scrape information about the clubs in that league and then look up the players’ info on each club’s page. I then used pandas to create a DataFrame and write it to a .csv file.
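
        A minimal sketch of that club-scraping flow might look like the following (the league URL and CSS selector are assumptions for illustration; the real script walks every league and club page):

            import requests
            import pandas as pd
            from bs4 import BeautifulSoup

            headers = {"User-Agent": "Mozilla/5.0"}  # Transfermarkt tends to block the default requests User-Agent

            league_url = "https://www.transfermarkt.us/campeonato-brasileiro-serie-a/startseite/wettbewerb/BRA1"
            soup = BeautifulSoup(requests.get(league_url, headers=headers).content, "html.parser")

            # collect club names and profile links from the league table (selector is an assumption)
            clubs = [{"club": a.get_text(strip=True), "url": a["href"]}
                     for a in soup.select("td.hauptlink a")]

            pd.DataFrame(clubs).to_csv("clubs.csv", index=False)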

        While scraping, I ran into some issues:

        • I wanted the values displayed in US dollars instead of euros. To overcome this, I built a simple function that gets the exchange rate for the day and applies it to all the values:
            import requests
            from bs4 import BeautifulSoup

            # browser-like header (an assumption) so the search result page renders fully
            headers = {"User-Agent": "Mozilla/5.0"}

            # scrape the day's EUR-to-USD rate from a Google search result page
            convert_url = "https://www.google.com/search?q=euro+to+dollar....."
            html = requests.get(convert_url, headers=headers)
            soup = BeautifulSoup(html.content, 'html.parser')
            curr_value = soup.find('div', class_="b1hJbf")
            curr_value = round(float(curr_value["data-exchange-rate"]), 2)

            def get_value_us(x):
                # x is a market value string, e.g. "€25.00m" (millions) or "€500Th." (thousands)
                value = list(x)
                if value[-1] == "m":
                    # strip the leading "€" and trailing "m"; values stay in millions
                    float_value = "".join(value[1:-1])
                    return round(float(float_value) * curr_value, 2)
                else:
                    # strip the leading "€" and trailing "Th.", then convert thousands to millions
                    float_value = "".join(value[1:-3])
                    return round((float(float_value) * curr_value) / 1000, 2)
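
          Applying the conversion across the scraped values is then a one-liner (column names assumed for illustration):

            df["market_value_usd"] = df["market_value"].apply(get_value_us)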
                                          
                                    
        • The second problem was accent marks and special characters. I created another function, run while collecting the names, that normalizes the Unicode and strips any leftover spaces.
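
          A minimal sketch of such a helper, assuming the goal is plain ASCII names (the function name is illustrative):

            import unicodedata

            def normalize_name(name):
                # decompose accented characters, drop the combining marks, and trim spaces
                decomposed = unicodedata.normalize("NFKD", name)
                return decomposed.encode("ascii", "ignore").decode("ascii").strip()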
        • Finally, I had to deal with some missing heights. My approach was to fill a player’s missing height with the average height of players from the same country and with the same position:
            def fix_heights(df):
                heights = []

                for index, row in df.iterrows():
                    if pd.notnull(row["Height"]):
                        heights.append(row["Height"])
                    else:
                        nat = row["Nat"]
                        position = row["position"]
                        field_position = row["field_position"]
                        # first try: players with the same nationality and exact position
                        comp_players = df.loc[(df["Nat"] == nat) & (df["position"] == position)]
                        row["Height"] = round(comp_players["Height"].mean(skipna=True), 2)
                        if pd.isnull(row["Height"]):
                            # second try: same nationality, broader field position
                            comp_players2 = df.loc[(df["Nat"] == nat) & (df["field_position"] == field_position)]
                            row["Height"] = round(comp_players2["Height"].mean(skipna=True), 2)
                            if pd.isnull(row["Height"]):
                                # last resort: all players with the same position
                                comp_players3 = df.loc[df["position"] == position]
                                row["Height"] = round(comp_players3["Height"].mean(skipna=True), 2)
                        heights.append(row["Height"])

                # assign back via the column ("df.height" would only create an attribute, not update the column)
                df["Height"] = heights
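
          In use, the function fills the Height column in place (the DataFrame name is assumed):

            fix_heights(players_df)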
            
          

      • Step 2 - SQL and JSON
      • Once I had the data, I decided to store it in a SQL database. Using the SQLAlchemy library, I built three tables (Leagues, Teams and Players) and loaded the database into PostgreSQL. Then I exported the results of a SQL JOIN query to a JSON file.
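
        A minimal sketch of the load step, assuming one DataFrame per table and a local Postgres instance (the connection string and DataFrame names are illustrative):

            from sqlalchemy import create_engine

            # connection string is illustrative; real credentials and database name will differ
            engine = create_engine("postgresql://user:password@localhost:5432/soccer_db")

            leagues_df.to_sql("leagues", con=engine, if_exists="replace", index=False)
            teams_df.to_sql("teams", con=engine, if_exists="replace", index=False)
            players_df.to_sql("players", con=engine, if_exists="replace", index=False)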

                                
            # read the joined tables back and export them as a single JSON file
            pd.read_sql_query('SELECT * FROM leagues JOIN (teams JOIN players USING("team_ID")) USING ("league_ID")', con=engine)\
            .to_json(r'../dashboard/static/data/data.json', orient='records')
                                
                              
      • Step 3 - HTML and JavaScript