2. Waste Data Analysis using Colab Python Notebook
A Complete Guide to Waste Data Analysis for Each Sub-district of Tangerang City and Geospatial Visualization with Python
In today’s data-driven world, extracting valuable insights from raw data is crucial. By leveraging the power of Python, we can perform data analysis, geospatial visualization, and even machine learning tasks efficiently. This article walks you through the step-by-step process we used to analyze and visualize data spanning multiple years: 2020, 2023, and 2024. Let’s dive into the details.
Libraries Used
To effectively manage, analyze, and visualize the data, we utilized the following Python libraries:
- Pandas: A powerful library for data manipulation and analysis.
- NumPy: Essential for numerical operations.
- Seaborn: Useful for creating attractive and informative statistical graphics.
- Scikit-learn: A popular machine learning library.
- GeoPandas: Provides tools for working with geospatial data.
These libraries form the core of the data science toolkit, enabling us to handle data, create insightful visualizations, and apply machine learning models.
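For reference, a single import cell covers this toolkit. In Colab, pandas, NumPy, seaborn, and scikit-learn come preinstalled; geopandas may require a `pip install geopandas` first.

```python
# Core data-science stack used throughout this analysis.
import pandas as pd
import numpy as np
import seaborn as sns
import geopandas as gpd
from sklearn.preprocessing import StandardScaler  # from scikit-learn
```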
Data Loading
We loaded data from CSV files for the years 2020, 2023, and 2024. This approach allows for comparative analysis across different years, which is essential for identifying trends and patterns. Loading data from multiple years is particularly useful when studying long-term trends in fields like environmental science, economics, or public health.
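Loading each year takes a single `pd.read_csv()` call per file. The file names below are placeholders, since the article does not list the exact paths:

```python
import pandas as pd

# Placeholder file names; point these at the actual CSVs in your Colab session.
df_2020 = pd.read_csv("waste_data_2020.csv")
df_2023 = pd.read_csv("waste_data_2023.csv")
df_2024 = pd.read_csv("waste_data_2024.csv")
```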
Data Cleaning and Preparation
This phase is crucial for ensuring that the dataset is reliable and ready for analysis. The following steps were taken; each is illustrated in the code sketches after the list.
- Checking Data:
  - The structure of the 2020 and 2023 data was inspected using the `.info()` method in Pandas, which provides an overview of data types, non-null counts, and memory usage, helping us identify irregularities and missing values.
  - For the 2024 dataset, unnecessary columns such as “tps3r” and “bank_sampah” were dropped, and the remaining columns were reordered to match the structure of the previous years, ensuring consistency across datasets.
- Data Cleaning:
  - Handling Missing Values: Missing values can distort analysis results, so identifying and handling them is critical. In numerical columns, missing values were replaced with the median of each column; the median was chosen because it is less sensitive to outliers than the mean.
  - Removing Duplicates: Duplicate rows were identified and removed using the `drop_duplicates()` method in Pandas, ensuring that redundant data did not skew the analysis.
- Resolving Data Inconsistencies:
  - After cleaning and consolidating the data, the dataset was reviewed to confirm that all necessary adjustments had been made.
- Data Normalization:
  - Numerical data, such as waste reduction, waste management, the number of trucks, population, and household counts, were normalized using `StandardScaler` from Scikit-learn. Normalization puts all numerical features on the same scale, which matters for machine learning algorithms, especially distance-based ones, because it prevents features with larger numerical ranges from dominating the analysis.
- Encoding Categorical Variables:
  - Categorical variables like “kecamatan” (district) were converted into numerical form using one-hot encoding (`pd.get_dummies()`). Machine learning models cannot work directly with categorical data, so encoding transforms these columns into a format suitable for machine learning.
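To make these steps concrete, here is a minimal sketch of the inspection and cleaning stage. It assumes the yearly frames from the loading step (`df_2020`, `df_2023`, `df_2024`), that the column sets align after dropping the extra 2024 columns, and that the three years are consolidated into a single frame; the article does not spell out these details.

```python
import pandas as pd

# Inspect structure: column dtypes, non-null counts, and memory usage.
df_2020.info()
df_2023.info()

# 2024: drop the columns absent from earlier years, then reorder the
# remainder to match the 2020 layout (assumes the column sets now align).
df_2024 = df_2024.drop(columns=["tps3r", "bank_sampah"])
df_2024 = df_2024[df_2020.columns]

# Consolidate the three years for cleaning (illustrative assumption).
df = pd.concat([df_2020, df_2023, df_2024], ignore_index=True)

# Replace missing values in numeric columns with each column's median,
# which is less sensitive to outliers than the mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Remove exact duplicate rows.
df = df.drop_duplicates()
```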
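And a sketch of the normalization and encoding stage, continuing from the frame above. Scaling every numeric column at once is a simplification: the article names specific features (waste reduction, waste management, trucks, population, households) but not their exact column names.

```python
from sklearn.preprocessing import StandardScaler

# Standardize numeric features to zero mean and unit variance so that
# no single feature dominates distance-based algorithms.
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# One-hot encode the district column into binary indicator columns.
df = pd.get_dummies(df, columns=["kecamatan"])
```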
Data Export
After all the cleaning, normalization, and encoding, the processed dataset was saved as a CSV file. Exporting the cleaned data ensures that it can be reused for future analysis, whether for further data exploration or for feeding into machine learning models.
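The export itself is a single line with Pandas; the output file name here is a placeholder:

```python
# Persist the cleaned, normalized, and encoded dataset for reuse.
df.to_csv("waste_data_cleaned.csv", index=False)
```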
To explore the full analysis in Python code, you can open the following Colab notebook: Analysis in Colab