Soccer Dashboard

Data from Brazil, Mexico, USA and Argentina Soccer Leagues

Soccer Dashboard (BRA and US)


For this project I scraped data from a webpage, loaded it into a database, and created this HTML page to display the information.


Applied Technologies
  • Python
  • BeautifulSoup
  • SQL (PostgreSQL)
  • JSON
  • HTML/CSS
  • JavaScript
  • D3
Steps
  1. Scrape
    • Website scraped: Transfermarkt
    • Using BeautifulSoup, scrape the website and save the data in a .csv file.

  2. SQL and JSON
    • With SQLAlchemy, store all the scraped data in a single database.
    • Create a JSON file with all the data and use it to build the dashboard.

  3. HTML and JavaScript
    • Build a webpage to present the information.
    • Using JavaScript and D3, parse the JSON file and display the data.
  • Detailed Step Descriptions (see below)

  • Further Improvements
    • Presentation
      • Improve Technical Info with code snippets and better descriptions
      • Improve About
      • Add Leagues Data page
      • Add charts
      • Improve Tableau Dashboards

    • Structure
      • Improve Database
      • Develop a Flask App and API

    • Prediction
      • Develop a Machine Learning Model to predict results based on past games
      • Use sentiment analysis to gauge fan feedback on a game.


    • Detailed Descriptions

      • Step 1 - Scraping
      • To kick off this project, I first needed to select a data source for the players’ market values. I wanted a source that was reliable and up to date on transfers and values, so I chose the Transfermarkt website, since they have just completed 20 years of working with soccer data. The leagues I decided to work with were the first divisions of Brazil, the USA, Mexico and Argentina.

        Once I had chosen the source, the next step was to obtain the data to work with. Using the direct link to each league’s page, I used Beautiful Soup to scrape information about the clubs in that league and then look up the players’ info on each club’s page. I then used pandas to create a DataFrame and write it to a .csv file.
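
        A minimal sketch of that club-scraping flow might look like the following (the league URL and CSS selector are assumptions for illustration; the real script walks every league and club page):

            import requests
            import pandas as pd
            from bs4 import BeautifulSoup

            headers = {"User-Agent": "Mozilla/5.0"}  # Transfermarkt tends to block the default requests User-Agent

            league_url = "https://www.transfermarkt.us/campeonato-brasileiro-serie-a/startseite/wettbewerb/BRA1"
            soup = BeautifulSoup(requests.get(league_url, headers=headers).content, "html.parser")

            # collect club names and profile links from the league table (selector is an assumption)
            clubs = [{"club": a.get_text(strip=True), "url": a["href"]}
                     for a in soup.select("td.hauptlink a")]

            pd.DataFrame(clubs).to_csv("clubs.csv", index=False)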

        While scraping, I ran into some issues:

        • I wanted the values displayed in US dollars instead of euros. To overcome this, I built a simple function that gets the exchange rate for the day and applies it to all the values:
            import requests
            from bs4 import BeautifulSoup

            # browser-like header (an assumption) so the search result page renders fully
            headers = {"User-Agent": "Mozilla/5.0"}

            # scrape the day's EUR-to-USD rate from a Google search result page
            convert_url = "https://www.google.com/search?q=euro+to+dollar....."
            html = requests.get(convert_url, headers=headers)
            soup = BeautifulSoup(html.content, 'html.parser')
            curr_value = soup.find('div', class_="b1hJbf")
            curr_value = round(float(curr_value["data-exchange-rate"]), 2)

            def get_value_us(x):
                # x is a market value string, e.g. "€25.00m" (millions) or "€500Th." (thousands)
                value = list(x)
                if value[-1] == "m":
                    # strip the leading "€" and trailing "m"; values stay in millions
                    float_value = "".join(value[1:-1])
                    return round(float(float_value) * curr_value, 2)
                else:
                    # strip the leading "€" and trailing "Th.", then convert thousands to millions
                    float_value = "".join(value[1:-3])
                    return round((float(float_value) * curr_value) / 1000, 2)
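
          Applying the conversion across the scraped values is then a one-liner (column names assumed for illustration):

            df["market_value_usd"] = df["market_value"].apply(get_value_us)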
                                          
                                    
        • The second problem was accent marks and special characters. I created another function, run while collecting the names, that normalizes the Unicode and strips any leftover spaces.
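
          A minimal sketch of such a helper, assuming the goal is plain ASCII names (the function name is illustrative):

            import unicodedata

            def normalize_name(name):
                # decompose accented characters, drop the combining marks, and trim spaces
                decomposed = unicodedata.normalize("NFKD", name)
                return decomposed.encode("ascii", "ignore").decode("ascii").strip()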
        • Finally, I had to deal with some missing heights. My approach was to fill a player’s missing height with the average height of players from the same country and with the same position:
            def fix_heights(df):
                heights = []

                for index, row in df.iterrows():
                    if pd.notnull(row["Height"]):
                        heights.append(row["Height"])
                    else:
                        nat = row["Nat"]
                        position = row["position"]
                        field_position = row["field_position"]
                        # first try: players with the same nationality and exact position
                        comp_players = df.loc[(df["Nat"] == nat) & (df["position"] == position)]
                        row["Height"] = round(comp_players["Height"].mean(skipna=True), 2)
                        if pd.isnull(row["Height"]):
                            # second try: same nationality, broader field position
                            comp_players2 = df.loc[(df["Nat"] == nat) & (df["field_position"] == field_position)]
                            row["Height"] = round(comp_players2["Height"].mean(skipna=True), 2)
                            if pd.isnull(row["Height"]):
                                # last resort: all players with the same position
                                comp_players3 = df.loc[df["position"] == position]
                                row["Height"] = round(comp_players3["Height"].mean(skipna=True), 2)
                        heights.append(row["Height"])

                # assign back via the column ("df.height" would only create an attribute, not update the column)
                df["Height"] = heights
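
          In use, the function fills the Height column in place (the DataFrame name is assumed):

            fix_heights(players_df)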
            
          

      • Step 2 - SQL and JSON
      • Once I had the data, I decided to store it in a SQL database. Using the SQLAlchemy library, I built three tables (Leagues, Teams and Players) and loaded the database into PostgreSQL. Then I exported the results of a SQL JOIN query to a JSON file.
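
        A minimal sketch of the load step, assuming one DataFrame per table and a local Postgres instance (the connection string and DataFrame names are illustrative):

            from sqlalchemy import create_engine

            # connection string is illustrative; real credentials and database name will differ
            engine = create_engine("postgresql://user:password@localhost:5432/soccer_db")

            leagues_df.to_sql("leagues", con=engine, if_exists="replace", index=False)
            teams_df.to_sql("teams", con=engine, if_exists="replace", index=False)
            players_df.to_sql("players", con=engine, if_exists="replace", index=False)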

                                
            # read the joined tables back and export them as a single JSON file
            pd.read_sql_query('SELECT * FROM leagues JOIN (teams JOIN players USING("team_ID")) USING ("league_ID")', con=engine)\
            .to_json(r'../dashboard/static/data/data.json', orient='records')
                                
                              
      • Step 3 - HTML and JavaScript